Thursday, 7 July 2016

Group rows if at least one word overlaps with another row in a DataFrame column


I have a DataFrame like this:

               words  group_id
0  set([a, c, b, d])         1
1        set([a, b])         2
2  set([h, e, g, f])         3

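I need to put two rows into the same group whenever at least one word in one row's set also appears in the other row's set, and update group_id accordingly. For the data above, rows 0 and 1 share the words a and b, so the expected result is: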

               words  group_id
0  set([a, c, b, d])         1
1        set([a, b])         1
2  set([h, e, g, f])         3

This is what I tried, but it does not give the result above:

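from collections import Counter

import numpy as np

# Count how often each word occurs across all rows
word_frequency = Counter()

for val in df['words'].values:
    word_frequency.update(val)

# (word, count) pairs, most frequent first
to_return = np.array(word_frequency.most_common())
count = 1

# Start with an empty group_id column, then try to label rows word by word
df['group_id'] = np.nan
for val in to_return:
    df['group_id'] = df[['group_id','words']].apply(lambda x: count if (val in x) else np.nan)
    count += 1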

How can I do that?
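
One possible approach, offered as a sketch rather than a definitive answer, is to treat the rows as nodes of a graph, connect two rows whenever they share a word, and give every connected group the group_id of its first row. This assumes the overlap should be transitive (if row A overlaps B and B overlaps C, all three end up together), which the question does not state explicitly. The function name merge_group_ids and its helpers are hypothetical, and the example uses plain Python lists in place of the DataFrame columns.

```python
def merge_group_ids(word_sets, group_ids):
    """Return new group ids so that rows sharing a word get the same id.

    Hypothetical helper: uses a small union-find over row positions.
    Assumes overlap is transitive; each merged group takes the id of
    its first (lowest-index) row.
    """
    parent = list(range(len(word_sets)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as root so the root is the first row
            parent[max(ri, rj)] = min(ri, rj)

    first_row_with_word = {}
    for row, words in enumerate(word_sets):
        for word in words:
            if word in first_row_with_word:
                # This word was seen before: link both rows
                union(row, first_row_with_word[word])
            else:
                first_row_with_word[word] = row

    return [group_ids[find(i)] for i in range(len(word_sets))]


# Data from the question, as plain lists
words = [{"a", "b", "c", "d"}, {"a", "b"}, {"e", "f", "g", "h"}]
ids = [1, 2, 3]
new_ids = merge_group_ids(words, ids)
print(new_ids)
```

With the DataFrame from the question, the same helper could presumably be applied as df['group_id'] = merge_group_ids(df['words'].tolist(), df['group_id'].tolist()). Each word is looked up once in a dictionary, so this avoids comparing every pair of rows.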

