I'm ingesting data through an API that returns close to 100,000 documents, paginated 100 per page. My code roughly functions as follows:
while c <= limit:
    if not api_url:
        break
    req = urllib2.Request(api_url)
    opener = urllib2.build_opener()
    f = opener.open(req)
    response = simplejson.load(f)
    for item in response['documents']:
        # DO SOMETHING HERE
        pass
    if 'more_url' in response:
        api_url = response['more_url']
    else:
        api_url = None
        break
    c += 1
Downloading the data this way is really slow, and I was wondering if there is any way to loop through the pages asynchronously. I have been recommended to take a look at Twisted, but I am not entirely sure how to proceed.
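Twisted would work, but a lighter option is worth trying first: if the API also lets you request a page by number (an assumption, since the snippet above only follows more_url), the page fetches are independent and can be fanned out over a thread pool with concurrent.futures. The network wait dominates the runtime, and the GIL is released during I/O, so threads overlap those waits. A minimal sketch, with a stubbed fetch_page standing in for the real HTTP call:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(page):
    # Stub for illustration: real code would open the paginated URL and
    # parse the JSON body, e.g.
    #   simplejson.load(urllib2.urlopen(api_url + '?page=%d' % page))
    return {'documents': ['doc-%d-%d' % (page, i) for i in range(100)]}

def fetch_all(num_pages, workers=10):
    # map() dispatches pages to worker threads but yields results in
    # page order, so downstream processing sees a deterministic stream.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for response in pool.map(fetch_page, range(num_pages)):
            for item in response['documents']:
                yield item

docs = list(fetch_all(5))
```

Tune workers to whatever request rate the API tolerates. If the API truly only exposes a more_url chain, each request depends on the previous response and the fetches cannot be parallelized this way; in that case the win comes from overlapping the download of page N+1 with the processing of page N instead.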