Tuesday, June 14, 2016

Looping through a paginated API asynchronously


I'm ingesting data through an API that returns close to 100,000 documents in a paginated fashion (100 per page). My code currently works roughly as follows:

import urllib2
import simplejson

c = 0
while c <= limit:
    if not api_url:
        break

    # Fetch and parse the current page.
    req = urllib2.Request(api_url)
    opener = urllib2.build_opener()
    f = opener.open(req)
    response = simplejson.load(f)
    f.close()

    for item in response['documents']:
        pass  # DO SOMETHING HERE

    # Follow the link to the next page, if any.
    if 'more_url' in response:
        api_url = response['more_url']
    else:
        api_url = None
        break
    c += 1

Downloading the data this way is really slow, and I was wondering whether there is any way to fetch the pages asynchronously. I have been recommended to take a look at twisted, but I am not entirely sure how to proceed.
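One thing worth noting: because each page's `more_url` comes from the previous response, the loop above is inherently sequential. Concurrent fetching only helps if the page URLs can be computed up front (for example, if the API accepts a page-number parameter). Under that assumption, here is a minimal sketch using `concurrent.futures.ThreadPoolExecutor`; `fetch_page` is a hypothetical stand-in that simulates a JSON response, and would be replaced by a real HTTP request against the actual endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page):
    # Hypothetical stand-in for the real HTTP call; replace the body with
    # an actual request (e.g. urllib + json.load) against the API.
    # Here it simulates a page of 100 documents.
    return {'documents': ['doc-%d-%d' % (page, i) for i in range(100)]}

def fetch_all(num_pages, workers=10):
    # Fetch every page concurrently. Executor.map preserves input order,
    # so documents come back in the same order as a sequential loop.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        responses = list(pool.map(fetch_page, range(num_pages)))
    docs = []
    for response in responses:
        docs.extend(response['documents'])
    return docs
```

Since the work is I/O-bound, threads are enough despite the GIL: with 10 workers, wall-clock time is roughly the sequential time divided by 10. If the API only exposes `more_url` links and no page numbers, concurrency has to come from elsewhere, such as pipelining the download of one page with the processing of the previous one.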

