Working on Kaggle's Quora competition requires processing English text in Python.
First, tokenization:
import nltk
sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good."
tokens = nltk.word_tokenize(sentence)
print(tokens)
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
On the first run this raised an error:
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please use the NLTK Downloader to obtain the resource:
>>> nltk.download()
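After downloading the missing punkt tokenizer data as the message suggests (the english.pickle file, for example with nltk.download('punkt')), the error no longer occurs.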
Next, part-of-speech tagging:
word_tag = nltk.pos_tag(tokens)
print(word_tag)
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), ('Arthur', 'NNP'), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]
What the tags mean: http://blog.csdn.net/john159151/article/details/50255101
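As a quick reference while reading the output, here is a minimal sketch that maps the tags seen above to their usual Penn Treebank meanings. The names TAG_MEANINGS and describe_tags are made up for illustration, and the table is a hand-written subset rather than the full tag set. NLTK can also print the descriptions itself via nltk.help.upenn_tagset(), which needs the tagsets resource downloaded.

```python
# Meanings of the Penn Treebank tags that appear in the pos_tag output above.
# Hand-written subset for illustration only, not the complete tag set.
TAG_MEANINGS = {
    'IN': 'preposition or subordinating conjunction',
    'CD': 'cardinal number',
    'NN': 'noun, singular or mass',
    'NNP': 'proper noun, singular',
    'VBD': 'verb, past tense',
    'VB': 'verb, base form',
    'RB': 'adverb',
    'JJ': 'adjective',
    '.': 'sentence-final punctuation',
}


def describe_tags(tagged_words):
    """Return (word, tag, meaning) triples; tags not in the table map to 'unknown'."""
    return [(word, tag, TAG_MEANINGS.get(tag, 'unknown'))
            for word, tag in tagged_words]


# A few (word, tag) pairs copied from the output above.
sample = [('At', 'IN'), ('eight', 'CD'), ('Arthur', 'NNP'), ('did', 'VBD'), ("n't", 'RB')]
for word, tag, meaning in describe_tags(sample):
    print('{0:8} {1:4} {2}'.format(word, tag, meaning))
```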
Synonyms:
WordNet
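The note above only names WordNet, so here is one possible sketch of looking up synonyms with it, assuming NLTK 3, where lemma_names() is a method, and assuming the wordnet corpus has been fetched with nltk.download('wordnet'). The function name synonyms and its synsets_for parameter are hypothetical. The parameter is there only so the lookup can be swapped out or tried without the corpus installed.

```python
def synonyms(word, synsets_for=None):
    """Collect the lemma names of every synset of `word`, excluding the word itself.

    `synsets_for` is a hypothetical hook: any callable that maps a word to
    synset-like objects exposing lemma_names(). By default it is
    nltk.corpus.wordnet.synsets.
    """
    if synsets_for is None:
        # Imported lazily so the function can be defined without the corpus.
        from nltk.corpus import wordnet
        synsets_for = wordnet.synsets

    found = set()
    for synset in synsets_for(word):
        for name in synset.lemma_names():
            if name.lower() != word.lower():
                # WordNet joins multi-word lemmas with underscores.
                found.add(name.replace('_', ' '))
    return sorted(found)
```

With the corpus installed, print(synonyms('good')) lists the lemmas that share a synset with 'good'. The result mixes all senses and parts of speech, so for a task like comparing question pairs it may be worth filtering by the tag obtained from pos_tag first.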
Reference: http://www.cnblogs.com/rcfeng/p/3918544.html
