I am trying to run LDA on a set of tweets, each tweet treated as a document. During preprocessing, the stemming step fails with this error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
My code is as shown below:

from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import csv
import itertools
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
reader = csv.reader(open('/home/balki/Documents/Bangalore-13062016.csv', 'rU'), dialect=csv.excel_tab)
your_list = list(reader)
chain=itertools.chain(*your_list)
your_list2=list(chain)
texts = []
for doc in your_list2:
    raw = doc.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [t for t in tokens if t not in en_stop]
    stemmed_tokens = [p_stemmer.stem(t) for t in stopped_tokens]
    print(stemmed_tokens)
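For context, the byte 0xe2 is the first byte of several multi-byte UTF-8 punctuation marks that are common in tweets (curly apostrophes, dashes), which suggests the stemmer is receiving raw UTF-8 bytes rather than Unicode text. A minimal sketch of what I mean, decoding the bytes explicitly before any lowercasing or stemming (the sample byte string here is hypothetical, not from my actual CSV):

```python
# -*- coding: utf-8 -*-
# Hypothetical tweet fragment as raw bytes: \xe2\x80\x99 is the UTF-8
# encoding of the curly apostrophe U+2019, whose first byte is 0xe2.
raw = b'don\xe2\x80\x99t'

# Decoding to Unicode before stemming avoids the ASCII codec error;
# errors='replace' keeps the pipeline alive on malformed bytes.
text = raw.decode('utf-8', errors='replace')

print(text.lower())
```

On Python 3 the same idea can presumably be applied at the source by opening the file with an explicit encoding, e.g. open(path, encoding='utf-8', newline=''), so every cell the csv reader yields is already a Unicode str.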
Please suggest what should be done.