Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on the part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

1. Stemmer: extracts the stem or root form of a word (which does not necessarily carry the full meaning on its own).

Porter Stemmer, based on the Porter stemming algorithm:
>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()
>>> porter_stemmer.stem('maximum')
u'maximum'
>>> porter_stemmer.stem('presumably')
u'presum'
>>> porter_stemmer.stem('multiply')
u'multipli'
>>> porter_stemmer.stem('provision')
u'provis'
>>> porter_stemmer.stem('owed')
u'owe'
Lancaster Stemmer, based on the Lancaster stemming algorithm:
>>> from nltk.stem.lancaster import LancasterStemmer
>>> lancaster_stemmer = LancasterStemmer()
>>> lancaster_stemmer.stem('maximum')
'maxim'
>>> lancaster_stemmer.stem('presumably')
'presum'
>>> lancaster_stemmer.stem('multiply')
'multiply'
>>> lancaster_stemmer.stem('provision')
'provid'
>>> lancaster_stemmer.stem('owed')
'ow'
Snowball Stemmer, based on the Snowball stemming algorithm:
>>> from nltk.stem import SnowballStemmer
>>> snowball_stemmer = SnowballStemmer("english")
>>> snowball_stemmer.stem('maximum')
u'maximum'
>>> snowball_stemmer.stem('presumably')
u'presum'
>>> snowball_stemmer.stem('multiply')
u'multipli'
>>> snowball_stemmer.stem('provision')
u'provis'
>>> snowball_stemmer.stem('owed')
u'owe'
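To see side by side how aggressively each stemmer truncates, here is a minimal comparison sketch (re-using the three stemmers shown above, with an illustrative word list):

from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem import SnowballStemmer

# Lancaster is typically the most aggressive of the three;
# Porter and Snowball (an improved Porter) mostly agree.
stemmers = [PorterStemmer(), LancasterStemmer(), SnowballStemmer('english')]
for word in ['maximum', 'presumably', 'multiply', 'provision', 'owed']:
    print(word, '->', [stemmer.stem(word) for stemmer in stemmers])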
2. Lemmatization: reduces a word in any inflected form to its base (dictionary) form; it works best when the part of speech is supplied.
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('cars')
'car'
>>> lmtzr.lemmatize('feet')
'foot'
>>> lmtzr.lemmatize('people')
'people'
>>> lmtzr.lemmatize('fantasized', pos='v')  # pos tag supplied
'fantasize'
One problem with this NLTK lemmatizer is that the part of speech has to be specified by hand. For example, for the word 'fantasized' above, if the pos argument is omitted the output is just 'fantasized' itself.
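A minimal check of that behaviour (same lemmatizer instance as above; pos defaults to noun, 'n'):

>>> lmtzr.lemmatize('fantasized')
'fantasized'
>>> lmtzr.lemmatize('fantasized', pos='v')
'fantasize'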
If you want to use NLTK for lemmatization in a real application, a complete solution is:
- Take a complete sentence as input
- Tokenize the sentence and tag parts of speech with the tools NLTK provides
- Convert the resulting POS tags into WordNet's tag format
- Lemmatize each word with WordNetLemmatizer
Tokenization and POS tagging in turn require downloading these data packages:
nltk.download("punkt")
nltk.download("maxnet_treebank_pos_tagger")
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag to the corresponding WordNet POS constant.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

def lemmatize_sentence(sentence):
    res = []
    lemmatizer = WordNetLemmatizer()
    for word, pos in pos_tag(word_tokenize(sentence)):
        # Fall back to noun when the tag has no WordNet equivalent.
        wordnet_pos = get_wordnet_pos(pos) or wordnet.NOUN
        res.append(lemmatizer.lemmatize(word, pos=wordnet_pos))
    return res
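A quick usage check (the sentence is an illustrative input; the exact lemmas depend on the tags the default tagger assigns):

>>> lemmatize_sentence('The cats are sitting on the mats')
['The', 'cat', 'be', 'sit', 'on', 'the', 'mat']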
3. MaxMatch: an algorithm commonly used for word segmentation in Chinese natural language processing. It greedily matches the longest dictionary word starting at the current position. The demo below uses NLTK's words corpus as the dictionary to segment an English string with its spaces removed:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import words

wordlist = set(words.words())
wordnet_lemmatizer = WordNetLemmatizer()

def max_match(text):
    pos2 = len(text)
    result = ''
    while len(text) > 0:
        # Try the longest remaining prefix first; lemmatize it so that
        # inflected forms ('birds') can match dictionary entries ('bird').
        word = wordnet_lemmatizer.lemmatize(text[0:pos2])
        if word in wordlist:
            result = result + text[0:pos2] + ' '
            text = text[pos2:]
            pos2 = len(text)
        else:
            pos2 = pos2 - 1
    return result[0:-1]  # strip the trailing space

>>> string = 'theyarebirds'
>>> print(max_match(string))
they are birds
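Since MaxMatch is mainly motivated by Chinese segmentation, here is a minimal sketch over a toy Chinese dictionary (the max_match_zh helper, the dictionary, and the sentence are illustrative assumptions, not part of the original example):

# max_match_zh is a hypothetical helper for illustration only.
def max_match_zh(text, dictionary, max_len=5):
    result = []
    while text:
        # Try the longest window first, falling back to a single character.
        for length in range(min(max_len, len(text)), 0, -1):
            word = text[:length]
            if length == 1 or word in dictionary:
                result.append(word)
                text = text[length:]
                break
    return result

>>> dictionary = {'我們', '在', '野生動物園', '野生', '動物園', '玩'}
>>> max_match_zh('我們在野生動物園玩', dictionary)
['我們', '在', '野生動物園', '玩']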