I am using the robotparser
from the urllib module in Python to determine whether I can download webpages. One site I am accessing, however, returns a 403 error when the robots.txt file is requested with the default user-agent, but the correct response if the file is downloaded via requests with my own user-agent string. (The site also gives a 403 when accessed with the requests package's default user-agent, suggesting they are simply blocking common/generic user-agent strings rather than disallowing them in the robots.txt file.)
Anyway, is it possible to change the user-agent in the robotparser module? Or alternatively, to load in a robots.txt file downloaded separately?
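For the second option, something along these lines is what I have in mind: a minimal sketch that downloads robots.txt myself with requests (the URL and user-agent string here are placeholders) and then hands the text to RobotFileParser via its parse() method instead of letting read() fetch it:

    import urllib.robotparser
    import requests

    # Placeholder site and user-agent string for illustration.
    ROBOTS_URL = "https://example.com/robots.txt"
    USER_AGENT = "MyCrawler/1.0"

    # Fetch robots.txt with a custom User-Agent header, avoiding
    # the default user-agent that the site rejects with a 403.
    response = requests.get(ROBOTS_URL, headers={"User-Agent": USER_AGENT})
    response.raise_for_status()

    # Feed the downloaded text to RobotFileParser via parse(),
    # rather than calling read(), which would fetch the file itself.
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(response.text.splitlines())

    print(rp.can_fetch(USER_AGENT, "https://example.com/some/page.html"))

Is this a reasonable approach, or is there a supported way to set the user-agent on the parser directly?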