I am working on a document parser for an internship. The code is fairly simple and I was able to get everything to work... or so I thought.
The company has hundreds of documents that reference each other. They all have titles of the form 200XXX. The program takes in a document number and compares it against the content of every document to see whether other documents reference that number.
Here is the section of code that isn't working (please excuse the mess, I'm teaching myself to code for this project):
    for infile in listing:
        if infile[0] == '2' and (infile[-3:] == 'doc' or infile[-4:] == 'docx'):
            while True:
                file = open("Z:/" + infile, 'r', encoding='latin1')
                count += 1
                for line in file:
                    if user_input in line:
                        first_strip = line.replace('', ' ')
                        second_strip = first_strip.replace('', ' ')
                        if second_strip[0:2] == 'Do':
                            ref_count += 1
                            print(user_input + ' was referenced in: ' + infile + '\n')
                            textbox.append(infile)
                            print(textbox)
                            print('that was from the list')
                            continue
                        else:
                            bad_count += 1
                            continue
                print(str(count) + ' Current document is: ' + infile)
                file.close()
                break
This is not the entire program, just the part that searches the documents. Surprisingly, it works, but only for Word 97-2003 documents (.doc). For some reason it does not work for the newer Word files (.docx).
I have tried other encodings such as UTF-8 and ASCII, but the documents can only be read with Latin-1. I'm currently trying to work out whether the program can't actually read the newer document types, or whether it reads them but fails to match the input against what it reads.
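One way to run that check, as a rough sketch rather than a tested fix: a .docx file is a ZIP archive whose main text lives in the member word/document.xml, so the text is compressed and a raw line-by-line read would not contain the document number. The helper names docx_text and docx_contains below are hypothetical, and the tag stripping is deliberately crude (it joins paragraphs together and does not decode XML entities).

```python
import re
import zipfile


def docx_text(path):
    """Return the text of a .docx file, or None if the file is not a ZIP archive."""
    if not zipfile.is_zipfile(path):
        # Probably an old binary .doc file, which this sketch does not handle.
        return None
    with zipfile.ZipFile(path) as archive:
        # Assumption: the main body is stored in word/document.xml.
        xml = archive.read('word/document.xml').decode('utf-8')
    # Remove every XML tag so that text split across several runs is rejoined.
    return re.sub(r'<[^>]+>', '', xml)


def docx_contains(path, needle):
    """Report whether needle occurs in the text extracted from a .docx file."""
    text = docx_text(path)
    return text is not None and needle in text
```

Calling docx_contains("Z:/" + infile, user_input) on one of the newer files would show whether the number is found once the archive is unpacked; if docx_text returns None, the file is not a ZIP archive at all and the problem lies elsewhere.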
I'm really hoping that it is something simple that I overlooked due to my programming ignorance. If anyone knows anything or has had a similar situation I would really appreciate the help.
I am coding with Python 3.5 under Windows.