Monday, 4 July 2016

Stemming of Text using NLTK in Python


I am trying to implement LDA on a set of tweets treated as a document. During preprocessing, the stemming step fails with this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
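For context, here is a minimal sketch of where a message like this can come from. It assumes the tweets are UTF-8 encoded and contain typographic punctuation; the sample text and the helper name `decode_error_message` are hypothetical and not part of the original code. Byte 0xe2 is the lead byte of many three-byte UTF-8 sequences (for example the curly apostrophe U+2019), and the ASCII codec cannot decode it.

```python
# Hypothetical sample: a tweet fragment that starts with a typographic
# apostrophe (U+2019), encoded as UTF-8 bytes.
sample_bytes = u"\u2019tis raining".encode("utf-8")

# The first byte is 0xe2, the lead byte of a three-byte UTF-8 sequence.
first_byte = bytearray(sample_bytes)[0]


def decode_error_message(data, codec):
    """Return the decode error message for data under codec, or None on success."""
    try:
        data.decode(codec)
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Decoding as ASCII fails on the very first byte; decoding as UTF-8 succeeds.
ascii_error = decode_error_message(sample_bytes, "ascii")
utf8_error = decode_error_message(sample_bytes, "utf-8")
print(hex(first_byte), ascii_error, utf8_error)
```

If this assumption holds, the stemmer is probably receiving raw byte strings and an implicit ASCII decode is failing on the first non-ASCII character.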

My code is shown below:

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import csv
import itertools

# Split text into runs of word characters
tokenizer = RegexpTokenizer(r'\w+')

# English stop word list
en_stop = get_stop_words('en')

# Porter stemmer instance
p_stemmer = PorterStemmer()

# Read the tab-delimited file and flatten all rows into one list of cells
reader = csv.reader(open('/home/balki/Documents/Bangalore-13062016.csv', 'rU'), dialect=csv.excel_tab)
your_list = list(reader)
chain = itertools.chain(*your_list)
your_list2 = list(chain)

texts = []
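for i in your_list2:
    # Lowercase and tokenize each cell
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)

    # Remove stop words
    stopped_tokens = [t for t in tokens if t not in en_stop]

    # Stem the remaining tokens (the UnicodeDecodeError is raised here)
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]

    print(stemmed_tokens)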

Please suggest what should be done.
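One possible direction, sketched under stated assumptions rather than as a confirmed fix: decode the file as UTF-8 while reading it, so that every later step works on Unicode strings instead of raw bytes. The sketch assumes Python 3 and a UTF-8 encoded, tab-delimited file. The helper names `read_cells` and `tokenize` and the demo file are hypothetical, and the standard `re` module stands in for the tokenizer so the example runs without NLTK.

```python
import csv
import io
import itertools
import os
import re
import tempfile


def read_cells(path, encoding="utf-8"):
    """Read a tab-delimited file and return every cell as a Unicode string."""
    # io.open decodes bytes to text while reading (the encoding is an assumption);
    # newline="" is the mode the csv module expects for text files.
    with io.open(path, "r", encoding=encoding, newline="") as handle:
        rows = list(csv.reader(handle, dialect=csv.excel_tab))
    # Flatten the list of rows into a single list of cells.
    return list(itertools.chain.from_iterable(rows))


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


# Hypothetical demo file standing in for the real CSV of tweets.
demo_path = os.path.join(tempfile.mkdtemp(), "tweets.tsv")
with io.open(demo_path, "w", encoding="utf-8", newline="") as handle:
    handle.write(u"It\u2019s raining in Bangalore\tTraffic is slow\n")

cells = read_cells(demo_path)
tokens = [tokenize(cell) for cell in cells]
print(tokens)
```

The resulting tokens are ordinary `str` objects, so they could then go through the stop word filter and `p_stemmer.stem` as in the loop above. On Python 2, a comparable approach would be to call `.decode('utf-8')` on each cell before lowercasing it, again assuming the file really is UTF-8.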

