Monday, 11 July 2016

Change user agent used with robotparser in Python


I am using the robotparser from the urllib module in Python to determine whether I can download webpages. One site I am accessing, however, returns a 403 error when the robots.txt file is requested with the default user-agent, but a correct response if it is downloaded, e.g., via requests with my own user-agent string. (The site also gives a 403 when accessed with the requests package's default user-agent, suggesting they are simply blocking common/generic user-agent strings rather than adding them to the robots.txt file.)

Anyway, is it possible to change the user-agent in the robotparser module? Or alternatively, to load in a robots.txt file downloaded separately?
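
For the second option, a minimal sketch, assuming Python 3 (where the module is `urllib.robotparser`): fetch robots.txt yourself with a custom User-Agent header, then hand the lines to `RobotFileParser.parse()` instead of calling `read()`. The `USER_AGENT` value and the helper names `fetch_robots_txt` and `parser_from_text` are made up for illustration and are not part of the library.

```python
import urllib.request
import urllib.robotparser

# Hypothetical user-agent string; substitute your own.
USER_AGENT = "MyCrawler/1.0 (+https://example.com/bot)"


def fetch_robots_txt(url, user_agent=USER_AGENT, timeout=10):
    """Download robots.txt with a custom User-Agent header and return its text."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parser_from_text(robots_txt):
    """Build a RobotFileParser from robots.txt content already held in memory."""
    parser = urllib.robotparser.RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser
```

Typical use would be `parser = parser_from_text(fetch_robots_txt("https://example.com/robots.txt"))` followed by `parser.can_fetch(USER_AGENT, page_url)`. Since `parse()` only needs the lines, the text could just as well come from requests or from a file on disk.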

