Text Feature Extraction with Python (Part 1)

Reposted article. 2016-02-26 16:57. Tags: NLTK / NLP / Python

# Text feature extraction: the bag-of-words model

Friday, February 26, 2016

# 1. Bag-of-words representation

In [9]:

# scikit-learn's CountVectorizer class tokenizes each document and counts how often every token occurs:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['UNC played Duke in basketball',
          'Duke lost the basketball game',
          'I ate a sandwich']
vectorizer = CountVectorizer()
corpusTotoken = vectorizer.fit_transform(corpus).todense()
corpusTotoken
# One row per document, one column per vocabulary token:
#[[0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
# [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
# [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]]
vectorizer.vocabulary_
# Maps each token to its column index (listed here in index order):
#{'ate': 0,
# 'basketball': 1,
# 'duke': 2,
# 'game': 3,
# 'in': 4,
# 'lost': 5,
# 'played': 6,
# 'sandwich': 7,
# 'the': 8,
# 'unc': 9}
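To make the mechanics concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation: the helper names `tokenize` and `bag_of_words` are hypothetical, and the regular expression is an assumption that approximates CountVectorizer's default behaviour of lowercasing the text and keeping only tokens of two or more word characters.

```python
import re

def tokenize(doc):
    # Lowercase and keep tokens of 2+ word characters (assumed to
    # approximate CountVectorizer's default token pattern).
    return re.findall(r"\b\w\w+\b", doc.lower())

def bag_of_words(corpus):
    # Vocabulary: every distinct token, sorted, mapped to a column index.
    tokens = sorted({t for doc in corpus for t in tokenize(doc)})
    vocabulary = {tok: i for i, tok in enumerate(tokens)}
    # One count vector per document.
    vectors = []
    for doc in corpus:
        row = [0] * len(vocabulary)
        for tok in tokenize(doc):
            row[vocabulary[tok]] += 1
        vectors.append(row)
    return vocabulary, vectors

corpus = ['UNC played Duke in basketball',
          'Duke lost the basketball game',
          'I ate a sandwich']
vocabulary, vectors = bag_of_words(corpus)
```

The single-character tokens "I" and "a" never enter the vocabulary, which is why the third document contributes only "ate" and "sandwich".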

In [14]:

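# 2. Compute the Euclidean distance between document vectors with sklearn's euclidean_distances:
from sklearn.metrics.pairwise import euclidean_distances

# toarray() gives a plain ndarray; newer scikit-learn releases no longer accept the np.matrix returned by todense()
counts = vectorizer.fit_transform(corpus).toarray()
for x, y in [[0, 1], [0, 2], [1, 2]]:
    # counts[[x]] keeps the row two-dimensional, as euclidean_distances expects
    dist = euclidean_distances(counts[[x]], counts[[y]])
    print('Distance between document {} and document {}: {}'.format(x, y, dist))
#Distance between document 0 and document 1: [[2.44948974]]
#Distance between document 0 and document 2: [[2.64575131]]
#Distance between document 1 and document 2: [[2.64575131]]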
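The Euclidean distance is the square root of the summed squared differences between two vectors. As a sanity check, the sketch below computes it by hand. `euclidean` is a hypothetical helper, and the three vectors are assumed to be the count vectors of the three documents over the ten-token vocabulary listed in step 1.

```python
import math

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

doc0 = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]  # 'UNC played Duke in basketball'
doc1 = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]  # 'Duke lost the basketball game'
doc2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # 'I ate a sandwich'

d01 = euclidean(doc0, doc1)  # six positions differ, so sqrt(6)
d02 = euclidean(doc0, doc2)  # seven positions differ, so sqrt(7)
d12 = euclidean(doc1, doc2)  # seven positions differ, so sqrt(7)
```

Documents 0 and 1 share the tokens "duke" and "basketball", so they end up closer to each other than either is to document 2.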

In [17]:

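# 3. Stop-word filtering. Stop words are mostly function words that hold a sentence together grammatically but carry little meaning of their own. CountVectorizer can filter them through its stop_words parameter: passing 'english' applies scikit-learn's built-in English stop-word list (the default, None, filters nothing).
vectorizer = CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
#[[0 1 1 0 0 1 0 1]
# [0 1 1 1 1 0 0 0]
# [1 0 0 0 0 0 1 0]]
print(vectorizer.vocabulary_)
# (dict key order may differ)
#{'duke': 2, 'basketball': 1, 'lost': 4, 'played': 5, 'game': 3, 'sandwich': 6, 'unc': 7, 'ate': 0}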
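Conceptually, stop-word filtering just drops the listed tokens before counting. The sketch below illustrates this with a tiny hand-written stop list. `STOP_WORDS` and `remove_stop_words` are hypothetical names, and the list itself is an illustrative assumption, far smaller than scikit-learn's built-in English list.

```python
import re

# Illustrative stop list only; not scikit-learn's built-in list.
STOP_WORDS = {'a', 'i', 'in', 'the'}

def remove_stop_words(doc, stop_words=STOP_WORDS):
    # Lowercase, split into word tokens, and drop the stop words.
    tokens = re.findall(r"\w+", doc.lower())
    return [t for t in tokens if t not in stop_words]

filtered = [remove_stop_words(d) for d in [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]]
```

CountVectorizer's stop_words parameter also accepts a list of strings, so a custom list like this one could be passed to it directly.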

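# 4. Stemming and lemmatization. Many of the words in a feature vector are different forms of the same word; jumping and jumps, for example, are both forms of jump. Stemming and lemmatization reduce inflected and derived forms to a common base form. Python's NLTK (Natural Language Toolkit) library can be used for this.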
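The two techniques differ in approach: a stemmer strips suffixes by rule, while a lemmatizer looks the word up in a dictionary. The toy suffix stripper below is a deliberately crude sketch of the stemming idea, not NLTK's algorithm. `toy_stem` and its suffix list are hypothetical, and real stemmers apply many more rules.

```python
def toy_stem(word, suffixes=('ing', 'ed', 's')):
    # Strip the first matching suffix, but keep at least 3 characters
    # so that short words such as 'is' or 'red' are left alone.
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

stems = [toy_stem(w) for w in ['jumping', 'jumps', 'jumped', 'jump']]
```

All four inputs reduce to "jump". Rule-based stripping like this cannot handle irregular forms (it leaves "ran" untouched rather than mapping it to "run"), which is the gap lemmatization fills.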

In [28]:

import nltk
# Opens the NLTK downloader; the WordNet corpus is needed by the lemmatizer used below
nltk.download()

showing info http://www.nltk.org/nltk_data/

Out[28]:

True

In [26]:

from nltk.stem.wordnet import WordNetLemmatizer
lemm = WordNetLemmatizer()

In [29]:

# lemmatize() takes the part of speech as its second argument: 'v' for verb, 'n' for noun
print(lemm.lemmatize('gathering', 'v'))
print(lemm.lemmatize('gathering', 'n'))

#gather

#gathering
