Thursday, June 23, 2016

How to parallelize many string comparisons in Pandas?


I have the following problem.

I have a dataframe master that contains sentences, such as:

master
Out[8]: 
                  original
0  this is a nice sentence
1      this is another one
2    stackoverflow is nice

For every row in master, I look up the best match in another dataframe, slave, using fuzzywuzzy. I use fuzzywuzzy because the matching sentences in the two dataframes can differ slightly (extra characters, etc.).

For instance, slave could be

slave
Out[10]: 
   my_value                      name
0         2               hello world
1         1           congratulations
2         2  this is a nice sentence 
3         3       this is another one
4         1     stackoverflow is nice

Here is a fully-functional, wonderful, compact working example :)

from fuzzywuzzy import fuzz
import pandas as pd


master= pd.DataFrame({'original':['this is a nice sentence',
'this is another one',
'stackoverflow is nice']})


slave= pd.DataFrame({'name':['hello world',
'congratulations',
'this is a nice sentence ',
'this is another one',
'stackoverflow is nice'],'my_value': [2,1,2,3,1]})

def fuzzy_score(str1, str2):
    return fuzz.token_set_ratio(str1, str2)

def helper(orig_string, slave_df):
    # use fuzzywuzzy to score how close each name in slave is to the original string;
    # the scores stay in a local Series so slave_df itself is not modified
    scores = slave_df['name'].apply(lambda x: fuzzy_score(x, orig_string))
    # return the my_value of the row with the highest score
    return slave_df.loc[scores.idxmax(), 'my_value']

master['my_value'] = master.original.apply(lambda x: helper(x,slave))

The million-dollar question is: can I parallelize the apply call above?

After all, every row in master is compared to all the rows in slave (slave is a small dataset and I can hold many copies of it in RAM).

I don't see why I could not run multiple comparisons at once (i.e. process multiple rows at the same time).

Problem: I don't know how to do that, or whether it's even possible.
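
To make the idea concrete, here is a rough sketch of the kind of thing that might work, using only the standard library. It is an illustration under assumptions, not the real setup: plain lists stand in for the two dataframes, difflib.SequenceMatcher stands in for fuzz.token_set_ratio (the two scorers are not equivalent), and best_value and parallel_lookup are hypothetical names. Since each sentence is scored independently of the others, the sentences are simply handed to a pool of worker processes.

```python
import difflib
from concurrent.futures import ProcessPoolExecutor

# Toy data from the question, as plain lists instead of dataframes (assumption).
MASTER = ['this is a nice sentence',
          'this is another one',
          'stackoverflow is nice']

SLAVE = [('hello world', 2),
         ('congratulations', 1),
         ('this is a nice sentence ', 2),
         ('this is another one', 3),
         ('stackoverflow is nice', 1)]


def similarity(a, b):
    # Stand-in scorer (assumption): difflib ratio in [0, 1],
    # used here instead of fuzz.token_set_ratio.
    return difflib.SequenceMatcher(None, a, b).ratio()


def best_value(sentence):
    # Compare one master sentence against every slave name and
    # return the my_value of the best-scoring name.
    _, value = max(SLAVE, key=lambda pair: similarity(pair[0], sentence))
    return value


def parallel_lookup(sentences, workers=2):
    # Each sentence is independent, so the work can be split across processes.
    # pool.map returns results in the same order as the input.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(best_value, sentences))


if __name__ == "__main__":
    print(parallel_lookup(MASTER))  # expected: [2, 3, 1]
```

Processes are used rather than threads because the scoring is CPU-bound Python code. Whether the cost of starting workers and shipping data between processes pays off will depend on how many rows master has.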

Any help greatly appreciated!

