Tokenizing English text in Python


For the Quora competition on Kaggle I needed to process English text in Python.

The first step is tokenization:

import nltk

sentence = "At eight o'clock on Thursday morning Arthur didn't feel very good."
tokens = nltk.word_tokenize(sentence)
print(tokens)

['At', 'eight', "o'clock", 'on', 'Thursday', 'morning', 'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
 

On the first run this raised an error:

LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  

>>> nltk.download()

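After downloading the missing punkt tokenizer data (the english.pickle file named in the message) with the NLTK Downloader as prompted, the error no longer occurs.

Next, tag each token with its part of speech: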
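Calling nltk.download() with no arguments opens the interactive downloader. For a script it can be handier to fetch only the missing resource. The helper below is a minimal sketch of that idea; the name ensure_nltk_resource is hypothetical, and it assumes the resource path "tokenizers/punkt" and the package name "punkt" taken from the error message above (newer NLTK releases may ask for a differently named package, in which case use whatever name the LookupError reports).

```python
def ensure_nltk_resource(resource_path, package_name):
    """Download package_name only if resource_path is not found locally.

    Returns True if a download was triggered, False if the resource
    was already available. (Hypothetical helper, not part of NLTK.)
    """
    import nltk  # imported here so the helper can be defined without NLTK loaded

    try:
        # nltk.data.find raises LookupError when the resource is missing
        nltk.data.find(resource_path)
        return False
    except LookupError:
        # Fetch just this one package instead of opening the downloader UI
        nltk.download(package_name)
        return True
```

Calling ensure_nltk_resource("tokenizers/punkt", "punkt") once before nltk.word_tokenize should avoid the LookupError on a fresh machine, provided the machine has network access.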

 

word_tag = nltk.pos_tag(tokens)
print(word_tag)
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'NN'), ('on', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'), ('Arthur', 'NNP'), ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'), ('very', 'RB'), ('good', 'JJ'), ('.', '.')]

For an explanation of what the tags mean, see: http://blog.csdn.net/john159151/article/details/50255101
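As a quick reference, the tags that appear in the output above are Penn Treebank tags. The sketch below pairs each (word, tag) tuple with a short description; the PENN_TAGS table and the describe_tags function are hypothetical names introduced here, and the table covers only the tags seen in this example, not the full tag set.

```python
# Descriptions for the Penn Treebank tags that occur in the example sentence.
# This is a partial table for illustration, not the complete tag set.
PENN_TAGS = {
    "IN": "preposition or subordinating conjunction",
    "CD": "cardinal number",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
    "RB": "adverb",
    "VB": "verb, base form",
    "JJ": "adjective",
    ".": "sentence-final punctuation",
}


def describe_tags(word_tag):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, PENN_TAGS.get(tag, "unknown tag"))
            for word, tag in word_tag]


# A few pairs copied from the pos_tag output above
sample = [("eight", "CD"), ("Arthur", "NNP"), ("did", "VBD"), ("good", "JJ")]
for word, tag, meaning in describe_tags(sample):
    print(word, tag, meaning)
```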

Synonyms:

Use WordNet.

Reference: http://www.cnblogs.com/rcfeng/p/3918544.html
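One way to look up synonyms is through the WordNet interface that ships with NLTK. The following is a minimal sketch, assuming the wordnet corpus has already been downloaded with the NLTK Downloader; the function name synonyms is hypothetical. It collects the lemma names of every synset the word belongs to, so words from unrelated senses are mixed together.

```python
def synonyms(word):
    """Return a sorted list of WordNet lemma names that share a synset with word."""
    from nltk.corpus import wordnet  # needs the 'wordnet' corpus to be installed

    names = set()
    for synset in wordnet.synsets(word):
        for lemma in synset.lemmas():
            # Multi-word lemmas use underscores, e.g. "in_effect"
            names.add(lemma.name().replace("_", " "))
    names.discard(word)  # the word itself is not a useful synonym
    return sorted(names)
```

For example, synonyms("good") returns candidates drawn from all senses of "good" (noun, adjective, and adverb), so for a task like comparing question pairs it may be worth filtering by the part-of-speech tag obtained earlier.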

 

