I am trying to learn Spark LDA topic modeling in Python, hoping to extract the topics of hundreds of Reddit posts. However, after training the model and applying the method describeTopics(), the result looks like this:
[([441832, 8563, 731824, 381507, 933925], [0.0062627265685400516, 0.005369477351664474, 0.005309586577412947, 0.00503830331115649, 0.004271026596928107]),...
I was wondering whether, in this output, [441832, 8563, 731824, 381507, 933925] indicates word indices into the Spark vocabulary. If so, maybe I could find a way to know which index points to which word.
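To illustrate what I mean, here is a pure-Python sketch of the reverse lookup I have in mind. It assumes the features came from HashingTF, whose Python indexOf is (as far as I can tell) hash(term) % numFeatures; the names and word list below are illustrative, not from my real pipeline:

```python
# Assumption: topic term indices from HashingTF features are hash buckets,
# not vocabulary positions, so one way to recover words is to hash every
# word we know and build a reverse map. `index_of` mimics pyspark.mllib's
# HashingTF.indexOf (hash(term) % numFeatures) -- this is my reading of it.

NUM_FEATURES = 1 << 20  # HashingTF default: 1048576, matching the vectors above

def index_of(term, num_features=NUM_FEATURES):
    return hash(term) % num_features

# hypothetical known vocabulary
corpus_words = ['emmanuel', 'lovely', 'cat', 'whole', 'universe',
                'one', 'sweetest', 'aha', 'haha']
index_to_word = {index_of(w): w for w in corpus_words}

# given a topic's term indices, look up readable words
# (buckets not in the map stay opaque)
topic_indices = [index_of('cat'), index_of('lovely')]
readable = [index_to_word.get(i, '<unknown>') for i in topic_indices]
print(readable)
```

(In a real pipeline one would call HashingTF.indexOf itself rather than this stand-in, since Python 3 randomizes hash() per process.)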
So I did a very simple test to see whether these numbers are word indices. The process of this test is exactly what I do on the hundreds of posts, except that I only use 1 post here.
First of all, I split the post into words and removed stop words:
import string

# `stopwords` is a set of common English stop words (e.g. NLTK's list)
test_str = "Emmanuel is the most lovely cat in the whole universe. No one is more lovely than Emmanuel! Sweetest Emmanuel, aha?! Sweetest Emmanuel, haha!"
replace_punctuation = string.maketrans(string.punctuation, ' ' * len(string.punctuation))
review_text = test_str.translate(replace_punctuation).split()
review_words = [w.lower() for w in review_text if w.lower() not in stopwords]
print review_words
The output:
['emmanuel', 'lovely', 'cat', 'whole', 'universe', 'one', 'lovely', 'emmanuel', 'sweetest', 'emmanuel', 'aha', 'sweetest', 'emmanuel', 'haha']
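(Note: I am on Python 2; on Python 3, where string.maketrans no longer exists, I believe the same tokenization would use str.maketrans, e.g.:)

```python
import string

# assumed stop word list (my real code uses an external `stopwords` set,
# e.g. NLTK's English list)
stopwords = {'is', 'the', 'most', 'in', 'no', 'more', 'than'}

test_str = ("Emmanuel is the most lovely cat in the whole universe. "
            "No one is more lovely than Emmanuel! Sweetest Emmanuel, aha?! "
            "Sweetest Emmanuel, haha!")

# str.maketrans replaces string.maketrans in Python 3
replace_punctuation = str.maketrans(string.punctuation,
                                    ' ' * len(string.punctuation))
review_words = [w.lower()
                for w in test_str.translate(replace_punctuation).split()
                if w.lower() not in stopwords]
print(review_words)
```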
Then, I converted the words into TF-IDF scores and normalized the scores:
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF
from pyspark.mllib.feature import Normalizer

def get_tfidf_features(txt_rdd):
    hashingTF = HashingTF()
    tf = hashingTF.transform(txt_rdd)
    tf.cache()
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)
    return tfidf

nor = Normalizer(1)  # L1 normalization
review_words_rdd = sc.parallelize(review_words)
test_words_bag = get_tfidf_features(review_words_rdd)
nor_test_words_bag = nor.transform(test_words_bag)
Where I got stuck:
Now, no matter whether I apply print test_words_bag.collect() or print nor_test_words_bag.collect(), to my surprise I get a list of SparseVectors instead of just one:
[SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}), SparseVector(1048576, {208501: 1.3218, 897504: 0.6286, 1045730: 2.0149}), SparseVector(1048576, {145380: 0.2231, 367721: 1.3218, 430838: 1.3218, 664173: 0.6286, 886510: 1.0986}), SparseVector(1048576, {60275: 2.0149, 134386: 1.3218, 145380: 0.4463, 282612: 0.9163, 356727: 1.3218, 441832: 2.0149, 812399: 0.7621}), SparseVector(1048576, {145380: 0.2231, 812399: 0.7621, 886510: 1.0986}), SparseVector(1048576, {145380: 0.2231, 356727: 1.3218, 579064: 1.6094, 664173: 1.2572, 886510: 1.0986}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {367721: 1.3218, 897504: 1.2572}), SparseVector(1048576, {134386: 2.6435, 145380: 0.6694, 208501: 2.6435, 430838: 1.3218}), SparseVector(1048576, {145380: 0.4463, 282612: 0.9163, 664173: 0.6286, 738284: 2.1972, 812399: 0.7621, 897504: 0.6286}), SparseVector(1048576, {367721: 2.6435, 897504: 1.2572})]
I don't really understand why I get a list of SparseVectors when the input is just 1 string. As a result, I am not sure how to map these index-like numbers back to the original words, so that I can see what the extracted topics actually look like.
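One thing I noticed while poking at this (it may or may not be the cause): sc.parallelize(review_words) distributes one record per word, and HashingTF.transform() hashes each record as a separate document, which would explain 14 SparseVectors for 14 tokens. Plain Python lists show the shape difference:

```python
# Assumption: HashingTF treats each RDD record as one document, so the
# nesting of the parallelized value decides how many vectors come back.
review_words = ['emmanuel', 'lovely', 'cat', 'whole', 'universe', 'one',
                'lovely', 'emmanuel', 'sweetest', 'emmanuel', 'aha',
                'sweetest', 'emmanuel', 'haha']

per_word = review_words   # what sc.parallelize(review_words) distributes
one_doc = [review_words]  # what sc.parallelize([review_words]) distributes

print(len(per_word))  # 14 records -> presumably 14 vectors
print(len(one_doc))   # 1 record (the whole token list) -> presumably 1 vector
```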
Do you know any way to get human-readable topics from the Spark LDA modeling output?