Saturday, June 18, 2016

How to convert Spark LDA Modeling numerical results to original words


I am trying to learn to use Spark LDA modeling in Python, hoping to extract topics from hundreds of Reddit posts.

However, after training the model, when I applied the describeTopics() method, the result looked like this:

[([441832, 8563, 731824, 381507, 933925], [0.0062627265685400516, 0.005369477351664474, 0.005309586577412947, 0.00503830331115649, 0.004271026596928107]),...

I was wondering whether, in this output, [441832, 8563, 731824, 381507, 933925] holds word indices into the Spark vocabulary. If so, maybe I could find a way to know which index points to which word.
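For context, the output above comes from a call roughly like this (a minimal sketch, not my exact code; corpus_rdd and k=5 are placeholders):

from pyspark.mllib.clustering import LDA

# corpus_rdd: RDD of (document_id, term-frequency Vector) pairs (placeholder)
lda_model = LDA.train(corpus_rdd, k=5)

# Each entry is (term indices, term weights), like the output shown above
topics = lda_model.describeTopics(maxTermsPerTopic=5)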

So I did a very simple test to see whether these numbers are word indices. The process of this test is just like what I am doing on the hundreds of posts, except that I'm only using one post here.

First of all, I split the post into words and removed stop words:

test_str = "Emmanuel is the most lovely cat in the whole universe. No one is more lovely than Emmanuel! Sweetest Emmanuel, aha?! Sweetest Emmanuel, haha!"

import string

# stopwords is assumed to be a predefined collection of stop words,
# e.g. set(nltk.corpus.stopwords.words('english'))
replace_punctuation = string.maketrans(string.punctuation, ' ' * len(string.punctuation))
review_text = test_str.translate(replace_punctuation).split()
review_words = [w.lower() for w in review_text if w.lower() not in stopwords]

print review_words

The output:

['emmanuel', 'lovely', 'cat', 'whole', 'universe', 'one', 'lovely', 'emmanuel', 'sweetest', 'emmanuel', 'aha', 'sweetest', 'emmanuel', 'haha']

Then I converted the words into TF-IDF scores and normalized them:

from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
from pyspark.mllib.feature import Normalizer

def get_tfidf_features(txt_rdd):
    # Hash each document into a fixed-size term-frequency vector
    # (default 2^20 = 1048576 features)
    hashingTF = HashingTF()
    tf = hashingTF.transform(txt_rdd)
    tf.cache()
    # Fit IDF weights on the corpus and rescale the term frequencies
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)

    return tfidf

nor = Normalizer(1)  # p=1, i.e. L1 normalization

review_words_rdd = sc.parallelize(review_words)
test_words_bag = get_tfidf_features(review_words_rdd)
nor_test_words_bag = nor.transform(test_words_bag)

Where I got stuck: now, whether I run print test_words_bag.collect() or print nor_test_words_bag.collect(), to my surprise I get a list of SparseVectors instead of just one:

[SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}), SparseVector(1048576, {208501: 1.3218, 897504: 0.6286, 1045730: 2.0149}), SparseVector(1048576, {145380: 0.2231, 367721: 1.3218, 430838: 1.3218, 664173: 0.6286, 886510: 1.0986}), SparseVector(1048576, {60275: 2.0149, 134386: 1.3218, 145380: 0.4463, 282612: 0.9163, 356727: 1.3218, 441832: 2.0149, 812399: 0.7621}), SparseVector(1048576, {145380: 0.2231, 812399: 0.7621, 886510: 1.0986}), SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {367721: 1.3218, 897504: 1.2572}), SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {367721: 2.6435, 897504: 1.2572})]

I don't really understand why I got a list of SparseVectors when the input is just one string. As a result, I am not sure how to change these index-like numbers back into the original words, so that I can see what the extracted topics actually look like.
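My best guess is that this comes from how I built the RDD. If I read the MLlib docs correctly, HashingTF.transform treats each RDD element as one document, so an RDD of 14 words becomes 14 one-word "documents" (and a bare string may even be hashed character by character). A sketch of the difference (review_doc_rdd is just a name I made up):

# 14 elements -> 14 "documents" -> 14 SparseVectors
review_words_rdd = sc.parallelize(review_words)

# Wrapping the token list makes the whole post a single document,
# which should yield exactly one SparseVector
review_doc_rdd = sc.parallelize([review_words])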

Do you know any way to get human-readable topics out of the Spark LDA modeling output?
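One direction I have considered, but not yet verified: HashingTF exposes indexOf(term), so it might be possible to build a reverse lookup from hashed indices back to the words that produced them. A rough sketch, where vocab (the distinct words of the corpus) and topics (the describeTopics() result) are assumed to exist:

from pyspark.mllib.feature import HashingTF

hashingTF = HashingTF()
index_to_word = {hashingTF.indexOf(w): w for w in vocab}

# Translate the first topic into (word, weight) pairs; hash collisions
# could map two words to the same index, so this lookup is lossy
topic_indices, topic_weights = topics[0]
readable = [(index_to_word.get(i, '<unknown>'), wt)
            for i, wt in zip(topic_indices, topic_weights)]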

