Saturday, June 18, 2016

How to convert Spark LDA Modeling numerical results to original words


I am trying to learn to use Spark LDA modeling in Python, hoping to extract topics from hundreds of Reddit posts.

However, after training the model, when I applied the describeTopics() method, the result looked like this:

[([441832, 8563, 731824, 381507, 933925], [0.0062627265685400516, 0.005369477351664474, 0.005309586577412947, 0.00503830331115649, 0.004271026596928107]),...

I was wondering whether, in this output, [441832, 8563, 731824, 381507, 933925] holds word indices into the Spark vocabulary. If so, maybe I could find a way to know which index points to which word.
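For context, the output above comes from a call roughly like this (a minimal sketch, not my exact code; corpus_rdd and k=5 are placeholders):

from pyspark.mllib.clustering import LDA

# corpus_rdd: RDD of (document_id, term-frequency Vector) pairs (placeholder)
lda_model = LDA.train(corpus_rdd, k=5)

# Each entry is (term indices, term weights), like the output shown above
topics = lda_model.describeTopics(maxTermsPerTopic=5)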

So I did a very simple test to see whether these numbers are word indices. The process of this test is just like what I am doing on the hundreds of posts, except that I'm only using one post here.

First of all, I split the post into words and removed stop words:

test_str = "Emmanuel is the most lovely cat in the whole universe. No one is more lovely than Emmanuel! Sweetest Emmanuel, aha?! Sweetest Emmanuel, haha!"

import string

# stopwords is assumed to be a predefined collection of stop words,
# e.g. set(nltk.corpus.stopwords.words('english'))
replace_punctuation = string.maketrans(string.punctuation, ' ' * len(string.punctuation))
review_text = test_str.translate(replace_punctuation).split()
review_words = [w.lower() for w in review_text if w.lower() not in stopwords]

print review_words

The output:

['emmanuel', 'lovely', 'cat', 'whole', 'universe', 'one', 'lovely', 'emmanuel', 'sweetest', 'emmanuel', 'aha', 'sweetest', 'emmanuel', 'haha']

Then I converted the words into TF-IDF scores and normalized them:

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
from pyspark.mllib.feature import Normalizer

def get_tfidf_features(txt_rdd):
    # Hash each document into a fixed-size term-frequency vector
    # (default 2^20 = 1048576 features)
    hashingTF = HashingTF()
    tf = hashingTF.transform(txt_rdd)
    tf.cache()
    # Fit IDF weights on the corpus and rescale the term frequencies
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)

    return tfidf

nor = Normalizer(1)  # p=1, i.e. L1 normalization

review_words_rdd = sc.parallelize(review_words)
test_words_bag = get_tfidf_features(review_words_rdd)
nor_test_words_bag = nor.transform(test_words_bag)

Where I got stuck: now, whether I run print test_words_bag.collect() or print nor_test_words_bag.collect(), to my surprise I get a list of SparseVectors instead of just one:

[SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}), SparseVector(1048576, {208501: 1.3218, 897504: 0.6286, 1045730: 2.0149}), SparseVector(1048576, {145380: 0.2231, 367721: 1.3218, 430838: 1.3218, 664173: 0.6286, 886510: 1.0986}), SparseVector(1048576, {60275: 2.0149, 134386: 1.3218, 145380: 0.4463, 282612: 0.9163, 356727: 1.3218, 441832: 2.0149, 812399: 0.7621}), SparseVector(1048576, {145380: 0.2231, 812399: 0.7621, 886510: 1.0986}), SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {367721: 1.3218, 897504: 1.2572}), SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {367721: 2.6435, 897504: 1.2572})]

I don't really understand why I got a list of SparseVectors when the input is just one string. As a result, I am not sure how to change these index-like numbers back into the original words, so that I can see what the extracted topics actually look like.
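My best guess is that this comes from how I built the RDD. If I read the MLlib docs correctly, HashingTF.transform treats each RDD element as one document, so an RDD of 14 words becomes 14 one-word "documents" (and a bare string may even be hashed character by character). A sketch of the difference (review_doc_rdd is just a name I made up):

# 14 elements -> 14 "documents" -> 14 SparseVectors
review_words_rdd = sc.parallelize(review_words)

# Wrapping the token list makes the whole post a single document,
# which should yield exactly one SparseVector
review_doc_rdd = sc.parallelize([review_words])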

Do you know any way to get human-readable topics out of the Spark LDA modeling output?
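One direction I have considered, but not yet verified: HashingTF exposes indexOf(term), so it might be possible to build a reverse lookup from hashed indices back to the words that produced them. A rough sketch, where vocab (the distinct words of the corpus) and topics (the describeTopics() result) are assumed to exist:

from pyspark.mllib.feature import HashingTF

hashingTF = HashingTF()
index_to_word = {hashingTF.indexOf(w): w for w in vocab}

# Translate the first topic into (word, weight) pairs; hash collisions
# could map two words to the same index, so this lookup is lossy
topic_indices, topic_weights = topics[0]
readable = [(index_to_word.get(i, '<unknown>'), wt)
            for i, wt in zip(topic_indices, topic_weights)]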

