Sunday, June 12, 2016

Python: Downloading and caching XML files - how to handle the encoding declaration?


from urllib.request import urlopen
from lxml import objectify

I am trying to write a program that will download XML files into a cache and then open them using objectify. If I download the files using urlopen() then I can read them in using objectify.fromstring() just fine:

r = urlopen(my_url)
o = objectify.fromstring(r.read())

However, if I download them and write them to a file, I end up with an encoding declaration at the top of the file that objectify doesn't like. To wit:

# download the file
my_file = 'foo.xml'
r = urlopen(my_url)

# save locally
with open(my_file, 'wb') as fp:
    fp.write(r.read())

# open saved copy
with open(my_file, 'r') as fp:
    o1 = objectify.fromstring(fp.read())

results in ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

If I use objectify.parse(fp) instead, that works fine. So I could go through and change all the client code to use parse(), but I feel like that is not the right approach. I have other XML files stored locally for which .fromstring() works just fine; based on a cursory review, they appear to have utf-8 encoding.
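
To narrow down where the difference comes from, here is a minimal sketch without any download or file involved. The sample document and the names BODY and DECLARATION are made up for illustration. It assumes the check depends only on whether the input is a str or bytes and whether it carries an encoding declaration, which is what the error message suggests. Opening a file with mode 'r' yields a str, which would put the saved copy in the third case.

```python
from lxml import objectify

# Made-up sample document, not one of the files from the question.
BODY = "<root><price>42</price></root>"
DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>\n'

# 1. str without a declaration: accepted.
from_fragment = objectify.fromstring(BODY)

# 2. bytes with a declaration: accepted; the parser decodes the bytes itself.
from_bytes = objectify.fromstring((DECLARATION + BODY).encode("utf-8"))

# 3. str with a declaration: rejected with the ValueError quoted above.
try:
    objectify.fromstring(DECLARATION + BODY)
    rejected = False
except ValueError:
    rejected = True
```

If that assumption holds, it may also explain why some local files already work with .fromstring() after a text-mode read: they may simply lack an encoding declaration.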

I just don't know what the right resolution is here. Should I change the encoding when I save the file? Should I strip the encoding declaration? Should I fill my code with try/except ValueError clauses? Please advise.
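
One option that seems to follow from the "Please use bytes input" hint in the error message is to read the cached copy back in binary mode, so that .fromstring() receives the same bytes urlopen() delivered. The sketch below is only that, a sketch: load_xml_cached and its parameters are hypothetical names, not part of lxml or of my existing code.

```python
from pathlib import Path
from urllib.request import urlopen

from lxml import objectify


def load_xml_cached(url, cache_path):
    """Return an objectified tree for url, downloading only if not cached.

    Hypothetical helper: the name and signature are illustrative.
    """
    cache_path = Path(cache_path)
    if not cache_path.exists():
        # Save the response body exactly as received (raw bytes).
        with urlopen(url) as response:
            cache_path.write_bytes(response.read())
    # Read the cached copy back as bytes, not text, so the parser can
    # apply the encoding named in the XML declaration itself.
    return objectify.fromstring(cache_path.read_bytes())
```

With this shape the client code keeps calling .fromstring(), the saved file is left untouched (declaration included), and no try/except ValueError is needed around the parse.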

