jeudi 30 juin 2016

Using pandas, calculate Cramér's coefficient matrix


I have a dataframe in pandas which contains metrics calculated on Wikipedia articles. Two categorical variables nation which nation the article is about, and lang which language Wikipedia this was taken from. For a single metric, I would like to see how closely the nation and language variable correlate, I believe this is done using Cramer's statistic.

index   qid     subj    nation  lang    metric          value
5   Q3488399    economy     cdi     fr  informativeness 0.787117
6   Q3488399    economy     cdi     fr  referencerate   0.000945
7   Q3488399    economy     cdi     fr  completeness    43.200000
8   Q3488399    economy     cdi     fr  numheadings     11.000000
9   Q3488399    economy     cdi     fr  articlelength   3176.000000
10  Q7195441    economy     cdi     en  informativeness 0.626570
11  Q7195441    economy     cdi     en  referencerate   0.008610
12  Q7195441    economy     cdi     en  completeness    6.400000
13  Q7195441    economy     cdi     en  numheadings     7.000000
14  Q7195441    economy     cdi     en  articlelength   2323.000000

I would like to generate a matrix that displays Cramer's coefficient between all combinations of nation (france, usa, cote d'ivorie, and uganda) ['fra','usa','uga'] and three languages ['fr','en','sw']. So there would be a resulting 4 by 3 matrix like:

       en         fr          sw
usa    Cramer11   Cramer12    ... 
fra    Cramer21   Cramer22    ... 
cdi    ...
uga    ...

Eventually then I will do this over all the different metrics I am tracking.

for subject in list_of_subjects:
    for metric in list_of_metrics:
        cramer_matrix(metric, df)

Then I can test my hypothesis that metrics will be higher for articles whose language is the language of the Wikipedia. Thanks


Aucun commentaire:

Enregistrer un commentaire