Friday, June 24, 2016

What is the best way to remove duplicate objects from the Django database


I am mining the Twitter Search API for tweets with a certain hashtag and storing them in a PostgreSQL database using the Django ORM.

Here is the code from my tasks.py file that handles this routine.

"""Get some tweets and store them to the database using Djano's ORM."""

import tweepy
from celery import shared_task

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

api = tweepy.API(auth, wait_on_rate_limit=True)


@shared_task(name='get_tweets')
"""Get some tweets from the twiter api and store them to the db."""
def get_tweets():
    tweets = api.search(
        q='#python',
        since='2016-06-14',
        until='2016-06-21',
        count=5
    )
    tweets_date = [tweet.created_at for tweet in tweets]
    tweets_id = [tweet.id for tweet in tweets]
    tweets_text = [tweet.text for tweet in tweets]

    for i, j, k in zip(tweets_date, tweets_id, tweets_text):
        update = Tweet(
            tweet_date=i,
            tweet_id=j,
            tweet_text=k
        )
        update.save()

Here is my models.py

from django.db import models


class Tweet(models.Model):
    tweet_date = models.DateTimeField()
    tweet_id = models.CharField(max_length=50, unique=True)
    tweet_text = models.TextField()

    def __str__(self):
        return str(self.tweet_date) + '  |  ' + str(self.tweet_id)
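
Note that tweet_id is declared with unique=True, so Postgres enforces uniqueness at the database level: a second save() with the same tweet_id raises an IntegrityError instead of silently creating a second row. One rough option (just a sketch, not what my task currently does) would be to catch that error in the loop:

from django.db import IntegrityError, transaction

for i, j, k in zip(tweets_date, tweets_id, tweets_text):
    try:
        # Savepoint so a failed insert does not break a surrounding transaction.
        with transaction.atomic():
            Tweet(tweet_date=i, tweet_id=j, tweet_text=k).save()
    except IntegrityError:
        # The unique constraint on tweet_id rejected a duplicate; skip it.
        pass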

I am getting duplicates due to the Twitter API.

Is there a way to check for duplicates before the object gets saved to the database? Here:

for i, j, k in zip(tweets_date, tweets_id, tweets_text):
    update = Tweet(
        tweet_date=i,
        tweet_id=j,
        tweet_text=k
    )
    update.save()

Is this something I can take care of in the extraction process here, or is it something I need to clean up afterward, in the transformation phase?
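
If it can be handled in the extraction step, the kind of check I have in mind would be something along the lines of get_or_create keyed on tweet_id (again just a sketch, not my current code):

for i, j, k in zip(tweets_date, tweets_id, tweets_text):
    # Look up an existing row by tweet_id first and only insert when none exists;
    # "defaults" supplies the remaining fields for newly created rows.
    tweet, created = Tweet.objects.get_or_create(
        tweet_id=j,
        defaults={'tweet_date': i, 'tweet_text': k},
    )

This does a SELECT before each INSERT, so it is a bit more work per tweet than a plain save(), but it avoids inserting the same tweet twice.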

