NLP with NLTK: English sentence splitting, word tokenization, and word-frequency counting:
Required imports (the gensim imports are for downstream topic-modeling steps and are not used in the snippets below):
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize
from gensim import corpora, models
import gensim
1. NLTK English sentence splitting: sentences = nltk.sent_tokenize(paragraph) (equivalently, load a Punkt tokenizer with nltk.data.load('tokenizers/punkt/english.pickle') and call its .tokenize(paragraph) method)
2. NLTK English word tokenization: word_list = nltk.word_tokenize(paragraph)
3. Counting word frequencies: freq_dist = nltk.FreqDist(words) # nltk.FreqDist returns a dict-like object whose keys are the distinct words and whose values are their occurrence counts
