Friday, 24 June 2016

How to parallelize many (fuzzy) string comparisons in Pandas?


I have the following problem.

I have a dataframe master that contains sentences, such as

    master
    Out[8]:
                      original
    0  this is a nice sentence
    1      this is another one
    2    stackoverflow is nice

For every row in master, I look up the best match in another dataframe, slave, using fuzzywuzzy. I use fuzzywuzzy because the matched sentences in the two dataframes could differ a bit (extra characters, etc.).

For instance, slave could be

    slave
    Out[10]:
       my_value                      name
    0         2               hello world
    1         1           congratulations
    2         2  this is a nice sentence 
    3         3       this is another one
    4         1     stackoverflow is nice

Here is a fully-functional, wonderful, compact working example :)

    from fuzzywuzzy import fuzz
    import pandas as pd

    master = pd.DataFrame({'original': ['this is a nice sentence',
                                        'this is another one',
                                        'stackoverflow is nice']})

    # note the trailing space in 'this is a nice sentence ': the match is not exact
    slave = pd.DataFrame({'name': ['hello world',
                                   'congratulations',
                                   'this is a nice sentence ',
                                   'this is another one',
                                   'stackoverflow is nice'],
                          'my_value': [2, 1, 2, 3, 1]})

    def fuzzy_score(str1, str2):
        return fuzz.token_set_ratio(str1, str2)

    def helper(orig_string, slave_df):
        # use fuzzywuzzy to see how close original and name are
        slave_df['score'] = slave_df.name.apply(lambda x: fuzzy_score(x, orig_string))
        # return my_value corresponding to the highest score
        # (.loc replaces the deprecated .ix indexer)
        return slave_df.loc[slave_df.score.idxmax(), 'my_value']

    master['my_value'] = master.original.apply(lambda x: helper(x, slave))

The 1 million dollar question is: can I parallelize my apply code above?

After all, every row in master is compared to all the rows in slave (slave is a small dataset and I can hold many copies of it in RAM). I don't see why I could not run multiple comparisons at once (i.e. process multiple rows at the same time).

Problem: I don't know how to do that, or if that's even possible.

Any help greatly appreciated!
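Since each row of master is scored independently of the others, one possible approach is to hand the rows to a pool of worker processes. The block below is only a minimal sketch under some assumptions: it uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer so that it runs without extra dependencies (swapping in `fuzz.token_set_ratio` should work the same way), and the names `best_value` and `lookup_all`, along with the `workers` and `chunksize` parameters, are hypothetical and introduced purely for illustration. It also works on plain lists rather than on the dataframes directly.

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def fuzzy_score(str1, str2):
    # Stand-in scorer (assumption): difflib similarity scaled to 0-100.
    # fuzz.token_set_ratio(str1, str2) could be used here instead.
    return 100 * SequenceMatcher(None, str1, str2).ratio()


def best_value(orig_string, names, values):
    # Score orig_string against every candidate name and return the value
    # paired with the highest-scoring name (the first one on ties).
    scores = [fuzzy_score(name, orig_string) for name in names]
    return values[scores.index(max(scores))]


def lookup_all(originals, names, values, workers=2, chunksize=1000):
    # Hypothetical driver: rows are independent, so they can be spread
    # across worker processes. workers=1 runs serially, which is handy
    # for debugging and for checking the parallel result.
    func = partial(best_value, names=names, values=values)
    if workers <= 1:
        return [func(s) for s in originals]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # chunksize batches rows so the per-task overhead of sending
        # work between processes is amortized over many comparisons.
        return list(pool.map(func, originals, chunksize=chunksize))
```

With the dataframes from the question, the call would presumably look like `master['my_value'] = lookup_all(master.original.tolist(), slave.name.tolist(), slave.my_value.tolist(), workers=4)`. On platforms that start workers by spawning a fresh interpreter (Windows, for example), that call needs to sit under an `if __name__ == "__main__":` guard. For a master this small the process start-up cost will likely outweigh any gain; the approach is only expected to pay off when master has many rows.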
