Tuesday, July 5, 2016

Fast .gz Log File Parsing in Python


I have multiple log files, each gzipped and containing 10,000+ lines of information. I need a way to quickly parse each log file for relevant information and then display statistics based on the information contained in all the log files. I currently use gzip.open() to recursively open each .gz file and then run the contents through a primitive parser:

    import gzip
    import os


    def parse(logfile):
        # Scan each decompressed line for the record types we care about.
        for line in logfile:
            if "REPORT" in line:
                info = line.split()
                username = info[2]
                area = info[4]
                # Put info into dicts/lists etc.
            elif "ERROR" in line:
                info = line.split()
                ...


    def main(args):
        argdir = args[1]
        # Walk the directory tree and feed every .gz file to the parser.
        for currdir, subdirs, files in os.walk(argdir):
            for filename in files:
                with gzip.open(os.path.join(currdir, filename), "rt") as log:
                    parse(log)
        # Create a report at the end:
        createreport()

Is there any way to optimize this process for each file? It currently takes ~28 seconds per file on my computer to go through each .gz, and every little optimization counts. I've also tried PyPy, and for some reason it takes twice as long to process a file.
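One direction that is often explored for this kind of problem (not something from the original post) is to move decompression out of Python and let an external tool such as zcat stream plain text into the script, so the interpreter only does the string matching. The sketch below is a minimal illustration of that idea under those assumptions: parse_stream is a hypothetical helper, and it requires a Unix-like system with zcat available on the PATH.

    import subprocess


    def parse_stream(path):
        # Hypothetical variant of parse(): an external zcat process handles the
        # gunzipping, and Python only reads the resulting text stream line by line.
        proc = subprocess.Popen(["zcat", path], stdout=subprocess.PIPE,
                                universal_newlines=True)
        try:
            for line in proc.stdout:
                if "REPORT" in line:
                    info = line.split()
                    # handle the REPORT fields as in parse() above
                elif "ERROR" in line:
                    info = line.split()
        finally:
            proc.stdout.close()
            proc.wait()

Whether this actually beats gzip.open() depends on the platform and Python version, so it is only worth keeping if timing one representative .gz file shows an improvement.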
