jquery: how to parallelize many (fuzzy) string comparisons in Pandas?

Friday, 24 June 2016

how to parallelize many (fuzzy) string comparisons in Pandas?

I have the following problem. I have a dataframe master that contains sentences, such as

    master
    Out[8]:
                      original
    0  this is a nice sentence
    1      this is another one
    2    stackoverflow is nice

For every row in master, I look up the best match in another dataframe, slave, using fuzzywuzzy. I use fuzzywuzzy because the matched sentences in the two dataframes could differ a bit (extra characters, etc.).

For instance, slave could be

    slave
    Out[10]:
       my_value                     name
    0         2              hello world
    1         1          congratulations
    2         2  this is a nice sentence
    3         3      this is another one
    4         1    stackoverflow is nice

Here is a fully-functional, wonderful, compact working example :)

    from fuzzywuzzy import fuzz
    import pandas as pd
    import numpy as np
    import difflib

    master = pd.DataFrame({'original': ['this is a nice sentence',
                                        'this is another one',
                                        'stackoverflow is nice']})

    slave = pd.DataFrame({'name': ['hello world',
                                   'congratulations',
                                   'this is a nice sentence ',
                                   'this is another one',
                                   'stackoverflow is nice'],
                          'my_value': [2, 1, 2, 3, 1]})

    def fuzzy_score(str1, str2):
        return fuzz.token_set_ratio(str1, str2)

    def helper(orig_string, slave_df):
        # use fuzzywuzzy to see how close original and name are
        slave_df['score'] = slave_df.name.apply(lambda x: fuzzy_score(x, orig_string))
        # return my_value corresponding to the highest score
        # (.loc replaces .ix, which was removed in pandas 1.0)
        return slave_df.loc[slave_df.score.idxmax(), 'my_value']

    master['my_value'] = master.original.apply(lambda x: helper(x, slave))

The 1 million dollar question is: can I parallelize my apply code above?

After all, every row in master is compared to all the rows in slave (slave is a small dataset and I can hold many copies of it in RAM).

I don't see why I could not run multiple comparisons in parallel (i.e. process multiple rows at the same time).

Problem: I don't know how to do that, or whether it is even possible.

Any help greatly appreciated!
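Since each row of master is scored independently of the others, one possible approach is to hand the rows out to a pool of worker processes. The following is only a minimal sketch under stated assumptions, not a tuned solution: it works on plain lists rather than dataframes, the names best_value and parallel_best_values are hypothetical, and difflib.SequenceMatcher stands in for fuzz.token_set_ratio so that the sketch runs without third-party packages (the two scorers will not always agree on the best match).

```python
from concurrent.futures import ProcessPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def fuzzy_score(str1, str2):
    # Stand-in scorer (assumption): replace the body with
    # fuzz.token_set_ratio(str1, str2) to use fuzzywuzzy as in the question.
    return SequenceMatcher(None, str1, str2).ratio()


def best_value(orig_string, names, values):
    """Return the value paired with the name that scores highest."""
    scores = [fuzzy_score(name, orig_string) for name in names]
    # On ties the first maximum wins, like idxmax in the original helper.
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return values[best_index]


def parallel_best_values(originals, names, values, workers=2, chunksize=1000):
    """Score every original string against all names using several processes."""
    worker = partial(best_value, names=names, values=values)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with `originals`.
        return list(pool.map(worker, originals, chunksize=chunksize))


originals = ['this is a nice sentence',
             'this is another one',
             'stackoverflow is nice']
names = ['hello world',
         'congratulations',
         'this is a nice sentence ',
         'this is another one',
         'stackoverflow is nice']
values = [2, 1, 2, 3, 1]

if __name__ == "__main__":
    # The guard is needed on platforms that start workers by re-importing
    # this module (for example Windows).
    print(parallel_best_values(originals, names, values))  # [2, 3, 1]
```

With the dataframes from the question, the call would look something like master['my_value'] = parallel_best_values(master.original.tolist(), slave.name.tolist(), slave.my_value.tolist()). Starting processes and pickling the arguments has a cost, so a gain is only to be expected when master is large; for a three-row example the serial version will be faster.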

