ML -- Text Data Processing

Natural Language Processing (NLP) has long been one of the major branches of artificial intelligence; it studies how to achieve effective communication between humans and computers through natural language. This article covers one of the fundamentals of NLP: how to process text data.

The main topics covered are:

  • Feature extraction from text data
  • Methods for segmenting Chinese text
  • Optimizing text data with the n-Gram model
  • Improving feature extraction with the tf-idf model
  • Removing stopwords

I. Feature extraction from text, Chinese word segmentation, and the bag-of-words model

1. Extracting features from text with CountVectorizer

Data features can be roughly divided into two kinds: continuous features, which represent numeric values, and categorical features, which indicate the class a sample belongs to. In natural language processing, however, we deal with a third type of data: text.

Text data is usually stored in the computer as strings. Unlike English, Chinese has no spaces between words to act as boundaries, so we must segment Chinese text into words before processing it.

Take, for example, the English sentence "The quick brown fox jumps over a lazy dog", whose Chinese translation is "那只敏捷的棕色狐狸跳過了一只懶惰的狗".

from sklearn.feature_extraction.text import CountVectorizer

vect=CountVectorizer()

# Fit CountVectorizer to the text data
en=['The quick brown fox jumps over a lazy dog']
vect.fit(en)

print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))
Number of words: 8
Vocabulary: {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumps': 3, 'over': 5, 'lazy': 4, 'dog': 1}

[Result analysis] The program did not count the article "a": CountVectorizer's default token pattern only matches words of two or more characters, so single-letter words are dropped.
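If single-letter words matter, the default pattern can be relaxed. A minimal sketch, reusing the en list from above; the regular expression below overrides scikit-learn's default token_pattern of r'(?u)\b\w\w+\b':

# Relax the token pattern so single-character tokens such as 'a' are counted too
vect_all=CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vect_all.fit(en)
print('Number of words: {}'.format(len(vect_all.vocabulary_)))  # 9, now including 'a'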

Now let's see what happens with Chinese.

# Run the same experiment with Chinese text
cn=['那只敏捷的棕色狐狸跳過了一只懶惰的狗']

# Fit the Chinese text data
vect.fit(cn)

print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))
Number of words: 1
Vocabulary: {'那只敏捷的棕色狐狸跳過了一只懶惰的狗': 0}

[Result analysis] The program failed to segment the Chinese sentence and treated the entire sentence as a single token. This is because English words are naturally separated by spaces, while Chinese words are not.

2. Segmenting Chinese text with a word-segmentation tool

We use the jieba module to segment the Chinese sentence from above.

import jieba

cn=jieba.cut('那只敏捷的棕色狐狸跳過了一個只懶惰的狗')

# Use spaces as boundaries between words
cn=[' '.join(cn)]

print(cn)
Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\lenovo\AppData\Local\Temp\jieba.cache
Loading model cost 1.013 seconds.
Prefix dict has been built succesfully.


['那 只 敏捷 的 棕色 狐狸 跳過 了 一個 只 懶惰 的 狗']

With the help of the jieba module, the Chinese sentence has been segmented into words, with a space inserted between each pair of words as a delimiter.

Now let's run CountVectorizer on it again to extract features.

# Vectorize the segmented Chinese text with CountVectorizer
vect.fit(cn)

print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))
Number of words: 6
Vocabulary: {'敏捷': 2, '棕色': 3, '狐狸': 4, '跳過': 5, '一個': 0, '懶惰': 1}

3. Converting text data to an array with the bag-of-words model

CountVectorizer encodes each word as an integer from 0 to 5. With this encoding we can represent the text as a sparse matrix.

# Build the bag-of-words model
bag_of_words=vect.transform(cn)

print('Bag-of-words features:\n{}'.format(repr(bag_of_words)))
Bag-of-words features:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

[Result analysis] The original sentence has been converted into a 1-row, 6-column sparse matrix of 64-bit integers with 6 stored elements.

Let's look at what those 6 elements are:

# Print the dense representation of the bag of words
print('Dense representation of the bag of words:\n{}'.format(bag_of_words.toarray()))
Dense representation of the bag of words:
[[1 1 1 1 1 1]]

[Result analysis] These are the counts of the 6 segmented words in the sentence. Following the vocabulary above, the first element corresponds to "一個" ("a"), which appears once; the second corresponds to "懶惰" ("lazy"), which also appears once.
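To make the mapping between array positions and words explicit, we can sort the vocabulary by the column index CountVectorizer assigned and print each word next to its count. A small sketch reusing the vect and bag_of_words objects from above:

# Pair each word with its count, ordered by column index
counts=bag_of_words.toarray()[0]
for word,idx in sorted(vect.vocabulary_.items(),key=lambda kv:kv[1]):
    print('{}: {}'.format(word,counts[idx]))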

Now let's try a different sentence and see how the result changes, for example "懶惰的狐狸不如敏捷的狐狸敏捷,敏捷的狐狸不如懶惰的狐狸懶惰" ("The lazy fox is less agile than the agile fox; the agile fox is less lazy than the lazy fox").

cn_1=jieba.cut('懶惰的狐狸不如敏捷的狐狸敏捷,敏捷的狐狸不如懶惰的狐狸懶惰')

# Join with spaces as separators
cn2=[' '.join(cn_1)]

print(cn2)
['懶惰 的 狐狸 不如 敏捷 的 狐狸 敏捷 , 敏捷 的 狐狸 不如 懶惰 的 狐狸 懶惰']
vect.fit(cn2)

print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))
Number of words: 4
Vocabulary: {'懶惰': 1, '狐狸': 3, '不如': 0, '敏捷': 2}

Next, we transform this sentence with CountVectorizer:

# Build a new bag-of-words model
new_bag=vect.transform(cn2)

print('Bag-of-words features:\n{}'.format(repr(new_bag)))
print('Dense representation of the bag of words:\n{}'.format(new_bag.toarray()))
Bag-of-words features:
<1x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>
Dense representation of the bag of words:
[[2 3 3 4]]

[Result analysis] Reading the counts against the vocabulary above: "不如" (index 0) appears 2 times, "懶惰" (index 1) and "敏捷" (index 2) appear 3 times each, and "狐狸" (index 3) appears 4 times.

II. Further optimization of text data

In this part we learn how to improve the bag-of-words model with the n-Gram algorithm, how to process text data with the tf-idf algorithm, and how to remove stopwords from the text.

1. Improving the bag-of-words model with n-Grams

Although the bag-of-words model simplifies natural language and is convenient for machine learning algorithms, it has an obvious drawback: because it treats a sentence as a mere collection of words, word order is ignored. As a result, two sentences that contain the same words in different orders, and therefore mean different things, look identical to the machine.

# An example sentence ('The Taoist saw the monk kiss the nun's lips')
joke=jieba.cut('道士看見和尚親吻了尼姑的嘴唇')

# Insert spaces between the words
joke=[' '.join(joke)]

vect.fit(joke)
joke_feature=vect.transform(joke)

print('Feature representation of this sentence:\n{}'.format(joke_feature.toarray()))
Feature representation of this sentence:
[[1 1 1 1 1 1]]

Next we shuffle the word order and get "尼姑看見道士的嘴唇親吻了和尚" ("The nun saw the Taoist's lips kiss the monk").

joke2=jieba.cut('尼姑看見道士的嘴唇親吻了和尚')

joke2=[' '.join(joke2)]

joke2_feature=vect.transform(joke2)

print('Feature representation of this sentence:\n{}'.format(joke2_feature.toarray()))
Feature representation of this sentence:
[[1 1 1 1 1 1]]

[Result analysis] The two results are exactly the same. In other words, these two sentences with completely different meanings look identical to the machine.

To fix this, we can tune the ngram_range parameter of CountVectorizer. An n-Gram is a language model commonly used in large-vocabulary continuous text and speech recognition; it transforms text data using the co-occurrence information of adjacent words. Here n is an integer: with n equal to 2 the model is called a bi-Gram and pairs up adjacent words, and with n equal to 3 it is called a tri-Gram and groups three adjacent words at a time.

# Adjust CountVectorizer's ngram_range parameter
vect=CountVectorizer(ngram_range=(2,2))

cv=vect.fit(joke)
joke_feature=cv.transform(joke)

print('Vocabulary after adjusting the n-Gram parameter: {}'.format(cv.get_feature_names()))
print('New feature representation: {}'.format(joke_feature.toarray()))
Vocabulary after adjusting the n-Gram parameter: ['親吻 尼姑', '和尚 親吻', '尼姑 嘴唇', '看見 和尚', '道士 看見']
New feature representation: [[1 1 1 1 1]]
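Note that ngram_range=(2,2) keeps only the bigrams and drops the individual words. In practice a mixed range such as (1,2) is common, so that the model keeps both word counts and adjacent pairs; a minimal sketch along the same lines:

# Keep both unigrams and bigrams (a common practical compromise)
vect_mixed=CountVectorizer(ngram_range=(1,2))
vect_mixed.fit(joke)
print(vect_mixed.get_feature_names())  # vocabulary now contains single words and adjacent word pairs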

Now let's try the other sentence again: "尼姑看見道士的嘴唇親吻了和尚".

joke2=jieba.cut('尼姑看見道士的嘴唇親吻了和尚')

joke2=[' '.join(joke2)]

joke2_feature=vect.transform(joke2)

print('Feature representation of this sentence:\n{}'.format(joke2_feature.toarray()))
Feature representation of this sentence:
[[0 0 0 0 0]]

[Result analysis] This time the feature vector is all zeros: none of the bigrams in the reordered sentence match the bigrams of the original, so the two sentences are no longer treated as identical.

2. Processing text data with the tf-idf model

tf-idf stands for "term frequency-inverse document frequency". It is a statistic for evaluating how important a word is to a particular document within a corpus: if a word occurs very often in one document but rarely in the other documents, tf-idf treats it as a good discriminator for that document and gives it a high weight. It is computed in three steps:

Step 1: compute the term frequency

Step 2: compute the inverse document frequency

Step 3: compute tf-idf
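In the common textbook convention the three quantities look like this (a sketch; scikit-learn's TfidfTransformer with smooth_idf=False additionally adds 1 to the idf and L2-normalizes each row vector by default):

$$\mathrm{tf}(t,d)=\text{number of times term } t \text{ occurs in document } d$$

$$\mathrm{idf}(t)=\ln\frac{n}{\mathrm{df}(t)}$$

$$\text{tf-idf}(t,d)=\mathrm{tf}(t,d)\times\mathrm{idf}(t)$$

where n is the total number of documents in the corpus and df(t) is the number of documents that contain the term t.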

In scikit-learn, two classes implement the tf-idf method: TfidfTransformer, which transforms the feature matrix extracted from text by CountVectorizer, and TfidfVectorizer, which is used in the same way as CountVectorizer.
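The two routes are interchangeable: TfidfVectorizer is equivalent to running CountVectorizer followed by TfidfTransformer. A minimal sketch, with a hypothetical two-document toy corpus:

from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer,TfidfVectorizer

docs=['the quick brown fox','the lazy dog']   # hypothetical toy corpus

# Route 1: extract counts first, then reweight them with tf-idf
counts=CountVectorizer().fit_transform(docs)
X1=TfidfTransformer().fit_transform(counts)

# Route 2: do both steps in one object
X2=TfidfVectorizer().fit_transform(docs)

print((X1!=X2).nnz==0)   # True: both routes produce the same matrix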

To further introduce how TfidfVectorizer is used and how it differs from CountVectorizer, we turn to a larger and very classic NLP dataset: the IMDB movie review dataset, which can be downloaded online.

To reduce loading time and keep the demonstration manageable, we take 50 positive and 50 negative reviews from each of the train and test folders and save them in a new folder named Imdblite.
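load_files treats each subfolder as one class and uses the subfolder names as target labels. The layout assumed below follows the original IMDB convention of pos/neg subfolders (the exact folder names are an assumption):

Imdblite/
    train/
        pos/   <- 50 positive reviews, one text file per review
        neg/   <- 50 negative reviews
    test/
        pos/
        neg/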

from sklearn.datasets import load_files

train_set=load_files('Imdblite/train')

X_train,y_train=train_set.data,train_set.target

print('Number of files in the training set: {}'.format(len(X_train)))
print('Pick one at random:',X_train[22])
Number of files in the training set: 100
Pick one at random: b"All I could think of while watching this movie was B-grade slop. Many have spoken about it's redeeming quality is how this film portrays such a realistic representation of the effects of drugs and an individual and their subsequent spiral into a self perpetuation state of unfortunate events. Yet really, the techniques used (as many have already mentioned) were overused and thus unconvincing and irrelevant to the film as a whole.<br /><br />As far as the plot is concerned, it was lacklustre, unimaginative, implausible and convoluted. You can read most other reports on this film and they will say pretty much the same as I would.<br /><br />Granted some of the actors and actresses are attractive but when confronted with such boring action... looks can only carry a film so far. The action is poor and intermittent: a few punches thrown here and there, and a final gunfight towards the end. Nothing really to write home about.<br /><br />As others have said, 'BAD' movies are great to watch for the very reason that they are 'bad', you revel in that fact. This film, however, is a void. It's nothing.<br /><br />Furthermore, if one is really in need of an educational movie to scare people away from drug use then I would seriously recommend any number of other movies out there that board such issues in a much more effective way. 'Requiem For A Dream', 'Trainspotting', 'Fear and Loathing in Las Vegas' and 'Candy' are just a few examples. Though one should also check out some more lighthearted films on the same subject like 'Go' (overall, both serious and funny) and 'Halfbaked'.<br /><br />On a final note, the one possibly redeeming line in this movie, delivered by Vinnie Jones was stolen from 'Lock, Stock and Two Smokling Barrels'. To think that a bit of that great movie has been tainted by 'Loaded' is vile.<br /><br />Overall, I strongly suggest that you save you money and your time by NOT seeing this movie."

To keep the "<br />" tags from affecting the machine learning model, we replace them with spaces.

# Replace the <br /> tags in the text with spaces
X_train=[doc.replace(b'<br />',b' ') for doc in X_train]
print('Pick one at random:',X_train[22])
Pick one at random: b"All I could think of while watching this movie was B-grade slop. Many have spoken about it's redeeming quality is how this film portrays such a realistic representation of the effects of drugs and an individual and their subsequent spiral into a self perpetuation state of unfortunate events. Yet really, the techniques used (as many have already mentioned) were overused and thus unconvincing and irrelevant to the film as a whole.  As far as the plot is concerned, it was lacklustre, unimaginative, implausible and convoluted. You can read most other reports on this film and they will say pretty much the same as I would.  Granted some of the actors and actresses are attractive but when confronted with such boring action... looks can only carry a film so far. The action is poor and intermittent: a few punches thrown here and there, and a final gunfight towards the end. Nothing really to write home about.  As others have said, 'BAD' movies are great to watch for the very reason that they are 'bad', you revel in that fact. This film, however, is a void. It's nothing.  Furthermore, if one is really in need of an educational movie to scare people away from drug use then I would seriously recommend any number of other movies out there that board such issues in a much more effective way. 'Requiem For A Dream', 'Trainspotting', 'Fear and Loathing in Las Vegas' and 'Candy' are just a few examples. Though one should also check out some more lighthearted films on the same subject like 'Go' (overall, both serious and funny) and 'Halfbaked'.  On a final note, the one possibly redeeming line in this movie, delivered by Vinnie Jones was stolen from 'Lock, Stock and Two Smokling Barrels'. To think that a bit of that great movie has been tainted by 'Loaded' is vile.  Overall, I strongly suggest that you save you money and your time by NOT seeing this movie."
# Load the test set
test=load_files('Imdblite/test/')

X_test,y_test=test.data,test.target

# Replace the <br /> tags in the text with spaces
X_test=[doc.replace(b'<br />',b' ') for doc in X_test]

len(X_test)
100

Next we extract features from the text, starting with the CountVectorizer we learned about earlier.

# Fit CountVectorizer to the training data
vect=CountVectorizer().fit(X_train)

X_train_vect=vect.transform(X_train)

print('Number of features in the training set: {}'.format(len(vect.get_feature_names())))
print('Last 10 training set features: {}'.format(vect.get_feature_names()[-10:]))
Number of features in the training set: 3941
Last 10 training set features: ['young', 'your', 'yourself', 'yuppie', 'zappa', 'zero', 'zombie', 'zoom', 'zooms', 'zsigmond']

We now use a supervised learning algorithm and evaluate it with cross-validation.

from sklearn.svm import LinearSVC

from sklearn.model_selection import cross_val_score

scores=cross_val_score(LinearSVC(),X_train_vect,y_train)

print('Mean model score: {:.3f}'.format(scores.mean()))
Mean model score: 0.778


# Transform the test set into vectors
X_test_vect=vect.transform(X_test)

# Fit a linear SVC on the training set
clf=LinearSVC().fit(X_train_vect,y_train)

print('Test set score: {}'.format(clf.score(X_test_vect,y_test)))
Test set score: 0.58

Hoping to squeeze out a little more performance, we next try processing the data with the tf-idf algorithm.

# Import the tf-idf transformation tool
from sklearn.feature_extraction.text import TfidfTransformer

# Transform the training and test sets with tf-idf
tfidf=TfidfTransformer(smooth_idf=False)
tfidf.fit(X_train_vect)

X_train_tfidf=tfidf.transform(X_train_vect)
X_test_tfidf=tfidf.transform(X_test_vect)

print('Features without tf-idf processing:\n',X_train_vect[:5,:5].toarray())
print('Features after tf-idf processing:\n',X_train_tfidf[:5,:5].toarray())
Features without tf-idf processing:
 [[0 0 0 0 0]
 [0 0 0 0 0]
 [0 1 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
Features after tf-idf processing:
 [[ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.13862307  0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]]
# Retrain the linear SVC model
clf=LinearSVC().fit(X_train_tfidf,y_train)

# Cross-validate on the tf-idf processed data
scores2=cross_val_score(LinearSVC(),X_train_tfidf,y_train)

print('Cross-validation score on the tf-idf processed training set: {:.3f}'.format(scores2.mean()))
print('Test set score after tf-idf processing: {:.3f}'.format(clf.score(X_test_tfidf,y_test)))
Cross-validation score on the tf-idf processed training set: 0.778
Test set score after tf-idf processing: 0.580



3. Removing stopwords from the text

In natural language processing there is a concept called "stopwords": words that are filtered out during text processing because they occur very frequently yet carry little actual meaning.


The scikit-learn library we are using ships with a built-in English stopword list containing 318 common stopwords.

# Import the built-in stopword list
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

print('Number of stopwords:',len(ENGLISH_STOP_WORDS))

# Print the first 20 and the last 20 stopwords
print('First 20 and last 20:\n',list(ENGLISH_STOP_WORDS)[:20],list(ENGLISH_STOP_WORDS)[-20:])
Number of stopwords: 318
First 20 and last 20:
 ['meanwhile', 'five', 'were', 'because', 'during', 'must', 'eight', 'becomes', 'serious', 'has', 'twenty', 'nowhere', 'amongst', 'himself', 'beyond', 'other', 'take', 'now', 'hundred', 'third'] ['thereby', 'otherwise', 'co', 'find', 'never', 'by', 'a', 'everything', 'on', 'its', 'very', 'wherein', 'each', 'ltd', 'onto', 'see', 'whoever', 'being', 'enough', 'he']

Next we try removing stopwords on the reduced IMDB review dataset and see whether it improves the model's score.

# Import the Tfidf model
from sklearn.feature_extraction.text import TfidfVectorizer

# Activate the English stopword parameter
tfidf=TfidfVectorizer(smooth_idf=False,stop_words='english')

tfidf.fit(X_train)

# Transform the training set text into vectors
X_train_tfidf=tfidf.transform(X_train)

scores3=cross_val_score(LinearSVC(),X_train_tfidf,y_train)
clf.fit(X_train_tfidf,y_train)

# Transform the test set text into vectors
X_test_tfidf=tfidf.transform(X_test)

print('Mean cross-validation score with stopwords removed: {:.3f}'.format(scores3.mean()))
print('Test set score with stopwords removed: {:.3f}'.format(clf.score(X_test_tfidf,y_test)))
Mean cross-validation score with stopwords removed: 0.890
Test set score with stopwords removed: 0.670

[Result analysis] Removing stopwords raises the cross-validation score to 0.890 and the test set score from 0.580 to 0.670.
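Besides the built-in 'english' list, the stop_words parameter also accepts a custom list of words, which can help with domain-specific noise. A minimal sketch, where the extra words are hypothetical additions:

# Extend the built-in list with hypothetical domain-specific stopwords
custom_stops=list(ENGLISH_STOP_WORDS)+['movie','film']
tfidf_custom=TfidfVectorizer(smooth_idf=False,stop_words=custom_stops)
tfidf_custom.fit(X_train)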



