Loading the corpus and preprocessing
The corpus used in this post is the 20newsgroups corpus shipped with sklearn's own API. It contains news articles from many domains, including business, technology, sports and aerospace, which makes it well suited to NLP beginners. The sklearn_20newsgroups documentation gives a very detailed introduction.
For preprocessing, I simply call the NLTK interfaces for lowercasing, tokenization, stop-word removal, POS filtering and stemming. Which of these steps to apply depends entirely on your needs and your data; for example, I often drop stemming or drop POS filtering (usually because the results come out worse with them). The following code loads the 20newsgroups data and performs the text preprocessing.
# 1. Load the data
# The corpus contains news articles from many domains: business, technology, sports, aerospace, etc.
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
data_samples = dataset.data[:2000]  # take only what we need: 2,000 documents
# print(data_samples)
# 2. Text preprocessing (optional)
import nltk
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
# The NLTK data (tokenizers, stop words, POS tagger) must be on the NLTK data path before use
def textPrecessing(text):
    # lowercase
    text = text.lower()
    # replace punctuation with spaces
    for c in string.punctuation:
        text = text.replace(c, ' ')
    # tokenize
    wordLst = nltk.word_tokenize(text)
    # remove stop words
    filtered = [w for w in wordLst if w not in stopwords.words('english')]
    # keep only nouns (or whichever POS tags you need)
    refiltered = nltk.pos_tag(filtered)
    filtered = [w for w, pos in refiltered if pos.startswith('NN')]
    # stemming
    ps = PorterStemmer()
    filtered = [ps.stem(w) for w in filtered]
    return " ".join(filtered)
# This block only needs to run on the first pass to preprocess the raw text; comment it out from the second run onwards
# docList = []
# for desc in data_samples:
#     docList.append(textPrecessing(desc))
# with open('D:/data/LDA/20newsgroups(2000).txt', 'a') as f:
#     for line in docList:
#         f.writelines(str(line) + '\n')
# ==============================================================================
# From the second run onwards, load the preprocessed docList directly and comment out the loading/preprocessing code above
docList = []
with open('D:/data/LDA/20newsgroups(2000).txt', 'r') as f:
    for line in f.readlines():
        if line != '':
            docList.append(line.strip())
# ==============================================================================
Counting term frequencies with CountVectorizer
The training data fed to an LDA model is not the individual documents themselves but a document-word matrix. It can be a dense array or a sparse matrix of shape n_samples * n_features, where n_features is the number of terms. So before training the LDA topic model, we first use CountVectorizer to count the term frequencies and save the vectorizer, as follows:
# 3. Count term frequencies
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.externals import joblib  # pickle or any other serializer works just as well
# Build and save the term-frequency vectorizer; only run on the first pass
# API: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=1500,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(docList)
joblib.dump(tf_vectorizer, 'D:/saved_model/vectorizer_sklearn/vectorizer_sklearn.model')
# ==============================================================================
# # Load the saved tf_vectorizer to skip refitting it
# tf_vectorizer = joblib.load('D:/saved_model/vectorizer_sklearn/vectorizer_sklearn.model')
# tf = tf_vectorizer.transform(docList)
# print(tf)
# ==============================================================================
For the CountVectorizer API, please refer to the sklearn documentation. The code above requires each term to appear in at least min_df=2 documents and keeps at most the top max_features=1500 terms as features. The fitted tf_vectorizer is saved to disk with joblib, so from the second run onwards it can be loaded from the file instead of being recomputed. The resulting tf matrix is a sparse document-term matrix; tf_vectorizer.get_feature_names() gives the term corresponding to each feature dimension.
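As a quick sanity check on what the vectorizer produced, a minimal sketch like the following (assuming tf and tf_vectorizer from the code above are in scope) prints the matrix shape and a few of the feature terms:
# Inspect the document-word matrix (sketch; assumes tf and tf_vectorizer already exist)
print(tf.shape)  # (n_samples, n_features), here roughly (2000, 1500)
feature_names = tf_vectorizer.get_feature_names()  # on newer sklearn use get_feature_names_out()
print(feature_names[:10])  # the terms behind the first few matrix columns
print(tf[0].toarray())  # term counts of the first document as a dense row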
Training the LDA topic model
Finally we get to the most important stage: training the LDA topic model. Although this stage matters most, if the data quality is good and none of the earlier steps were skimped on, it largely takes care of itself; otherwise, problems tend to accumulate and all show up here at once. Two essential prerequisites for a good topic model are data quality and text preprocessing. On that note, two preprocessing packages that are a pleasure to use: jieba for Chinese and spaCy for English. I fell back to NLTK above only because spaCy refused to install on this machine.
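For reference, a spaCy-based version of the preprocessing step might look like the sketch below (not used in this post; it assumes the en_core_web_sm model has been downloaded, and keeps noun lemmas instead of Porter stems):
# Alternative preprocessing with spaCy (sketch; assumes `pip install spacy` and
# `python -m spacy download en_core_web_sm` have been run)
import spacy
nlp = spacy.load("en_core_web_sm")
def spacy_preprocess(text):
    doc = nlp(text.lower())
    # keep lemmas of nouns, drop stop words and non-alphabetic tokens
    tokens = [tok.lemma_ for tok in doc
              if tok.pos_ in ("NOUN", "PROPN") and not tok.is_stop and tok.is_alpha]
    return " ".join(tokens)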
Back on topic. The LDA training code is below; for the parameters, see the appendix at the end: sklearn LDA API explained.
# 4. Train the LDA topic model
# API: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
from sklearn.decomposition import LatentDirichletAllocation
# ===============================================================================
# Train the model and save it; uncomment this block on the first run
lda = LatentDirichletAllocation(n_components=20,  # represent each document as a 20-dimensional topic vector
                                max_iter=200,
                                learning_method='batch',
                                verbose=True)
lda.fit(tf)  # tf is the document-word sparse matrix
joblib.dump(lda, 'D:/saved_model/LDA_sklearn/LDA_sklearn_main.model')
# ===============================================================================
# Load a saved LDA model; on the first run, execute the training block above and keep this line commented out
# lda = joblib.load('D:/saved_model/LDA_sklearn/LDA_sklearn_main.model')
print(lda.perplexity(tf))  # rough indication of how well the model has converged
(4) Showing the results
LDA training time varies widely with the max_iter setting and with how quickly the data converges. For testing, a max_iter of a few dozen usually finishes quickly; for a real application I would recommend at least a thousand iterations.
texts = [
"In this morning's TechBytes, we look back at the technology that changes the world in the past decade.\"5,4...\" As we counted down to 2000, fears Y2K would crash the world's computers had many questioning if we become too dependent on technology. Most of us had no idea just how hooked we get.Google was just a few years old then, a simple search engine with a loyal following. A few months later, it would explode into the world's largest. Today, it is the most visited site on the web, with over 1 billion searches everyday.\"The iPod, it's cute.\" MP3 players were nothing new when the first iPod was introduced in the fall of 2001, but this player from Apple was different.\"You can download 1,000 of your favourite songs from your Apple computer in less than 10 minutes.\"TV was revolutionized, too. HDTV, huge flat screens but the most life changing development— TiVo and the DVR. Now we can watch shows on our time and rewind to see something we missed. Today, more than 38 million US households have a DVR.\"People for 2001 are gonna wanna take it on the roads to see something like the Blackberry.""From this to this tiny thing?""Well...\" Little devices called Blackberries became Crackberries. Now, the office is always at your fingertips.And the decade brought friends closer together. Friendster and MySpace got it started, but Facebook took it mainstream.\"It's everyone's, like Santa, like life.\"At first, it was all college kids, but soon their parents and even grandparents followed. Today, Facebook is the second most visited site on the web with 350 million users.That was a look at some of the biggest tech stories of the past decade. For the latest tech news, log on to the technology page of abcnews.com. Those are your TechBytes. I'm Winnie Tanare.",
"Movement is usually the sign of a person healthy, because only people who love sports will be healthy. I am a love sports, so I was born to now only had a disease. Of the many sports I like table tennis best.Table tennis is a sport, it does not hurt our friendship don't like football, in front of the play is a pair of inseparable friends, when the play is the enemy, the enemy after the play. When playing table tennis, as long as you aim at the ball back and go. If the wind was blowing when playing, curving, touch you, you can only on the day scold: \"it doesn't help me also. If is another person with technical won, you can only blame yourself technology is inferior to him. Table tennis is also a not injured movement, not like basketball, in play when it is pulled down, injured, or the first prize. When playing table tennis, even if be hit will not feel pain. I'm enjoying this movement at the same time, also met many table tennis masters, let my friends every day.",
"While starting out on a business endeavour, following a set of rules is crucial for finding success.Without proper rules a business can go spiralling down and without taking too long at it. Following are golden rules that will ensure your success in business.Map it outMap where you want to head. Plant goals and results all across that mental map and keep checking it off once you start achieving them one by one.Care for your peoplePeople are your biggest asset. They are the ones who will drive your business to the top. Treat them well and they will treat you well, too.Aim for greatness.Build a great company. Build great services or products. Instil a fun culture at your workplace. Inspire innovation. Inspire your people to keep coming with great ideas, because great ideas bring great changes.Be wary.Keep a close eye on the people who you partner with. It doesn’t mean you have to be sceptical of them. But you shouldn’t naively believe everything you hear. Be smart and keep your eyes and ears open all the time.Commit and stick to it.Once you make a decision, commit to it and follow through. Give it your all. If for some reason that decision doesn’t work, retract, go back to the drawing board and pick an alternate route. In business, you will have to make lots of sacrifices. Be prepared for that. It will all be worth it in the end.Be proactive.Be proactive. Just having goals and not doing anything about them will not get you anywhere. If you don’t act, you will not get the results you’re looking for.Perfect timing.Anticipation is the key to succeed in business. You should have the skills to anticipate changes in the market place and, the changing consumer preferences. You have to keep a tab on all this. Never rest on your past laurels and always look to inject newness into your business processes.Not giving up.That’s the difference between those who succeed and those who don’t. As a businessman you should never give up, no matter what the circumstance. Keep on persevering. You will succeed sooner or later. The key is to never quit trying.Follow these rules and you'll find yourself scaling up the ladder of succcess."]
# Preprocess the new texts, vectorize them with the fitted term-frequency model, then pass the resulting list to the LDA model to infer their topic distributions.
processed_texts = []
for text in texts:
    temp = textPrecessing(text)
    processed_texts.append(temp)
vectorizer_texts = tf_vectorizer.transform(processed_texts)
# print(vectorizer_texts)
print(lda.transform(vectorizer_texts))  # document-topic distribution matrix
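If you only need the single most likely topic for each document rather than the full distribution, a small follow-up like the following (a sketch using numpy, where doc_topic is the matrix returned by lda.transform above) does the job:
import numpy as np
# rows are documents, columns are topic probabilities (each row sums to 1)
doc_topic = lda.transform(vectorizer_texts)
dominant_topic = np.argmax(doc_topic, axis=1)  # index of the most probable topic per document
for i, (topic, prob) in enumerate(zip(dominant_topic, doc_topic.max(axis=1))):
    print("doc %d -> topic %d (p=%.3f)" % (i, topic, prob))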
# 5. Results
def print_top_words(model, feature_names, n_top_words):
    # print the highest-weighted terms under each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()
    # print the topic-word distribution matrix
    print(model.components_)

n_top_words = 20
tf_feature_names = tf_vectorizer.get_feature_names()
print_top_words(lda, tf_feature_names, n_top_words)
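Note that model.components_ holds unnormalized pseudo-counts rather than probabilities; if you want the actual topic-word probability distributions, normalize each row, for example with this small sketch (assumes numpy and the fitted lda from above):
import numpy as np
# each row of components_ is proportional to a topic's word distribution;
# dividing by the row sum turns it into probabilities that sum to 1
topic_word_dist = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]
print(topic_word_dist.shape)  # (n_components, n_features)
print(topic_word_dist[0].sum())  # ~1.0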
(Optional) Tuning the parameters
Parameters that can be tuned
n_topics: the number of topics
n_features: the number of features, i.e. the size of the vocabulary kept
doc_topic_prior: the parameter α of the Dirichlet prior on the per-document topic distribution θd
topic_word_prior: the parameter η of the Dirichlet prior on the per-topic word distribution βk
learning_method: the LDA solver, either 'batch' or 'online'
Other parameters offered by sklearn: depending on the solver, a few additional parameters can be tuned; see the appendix at the end, sklearn LDA API explained.
Two workable tuning approaches
First, take n_topics as an example and pick the best model by perplexity. Of course, a different number of topics inevitably changes the computed perplexity, so perplexity should only serve as a reference; in the end the number of topics still has to be set according to the actual application. The n_topics tuning code is as follows:
# Compare different numbers of topics under the same number of iterations
from time import time
docList = []
with open('D:/data/LDA/20newsgroups(2000).txt', 'r') as f:
    for line in f.readlines():
        if line != '':
            docList.append(line.strip())
from sklearn.externals import joblib
tf_vectorizer = joblib.load('D:/saved_model/vectorizer_sklearn/vectorizer_sklearn.model')
tf = tf_vectorizer.fit_transform(docList)
from sklearn.decomposition import LatentDirichletAllocation
n_topics = range(20, 35, 5)
perplexityLst = [1.0] * len(n_topics)
# train the LDA models and print the training time
lda_models = []
for idx, n_topic in enumerate(n_topics):
    lda = LatentDirichletAllocation(n_components=n_topic,
                                    max_iter=20,
                                    learning_method='batch',
                                    evaluate_every=200,
                                    # perp_tol=0.1,               # default
                                    # doc_topic_prior=1/n_topic,  # default
                                    # topic_word_prior=1/n_topic, # default
                                    verbose=0)
    t0 = time()
    lda.fit(tf)
    perplexityLst[idx] = lda.perplexity(tf)
    lda_models.append(lda)
    print("# of Topic: %d, " % n_topics[idx])
    print("done in %0.3fs, N_iter %d, " % ((time() - t0), lda.n_iter_))
    print("Perplexity Score %0.3f" % perplexityLst[idx])
# report the best model
best_index = perplexityLst.index(min(perplexityLst))
best_n_topic = n_topics[best_index]
best_model = lda_models[best_index]
print("Best # of Topic: ", best_n_topic)
import matplotlib.pyplot as plt
import os
# plot perplexity against the number of topics
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(n_topics, perplexityLst)
ax.set_xlabel("# of topics")
ax.set_ylabel("Approximate Perplexity")
plt.grid(True)
os.makedirs('lda_result', exist_ok=True)  # make sure the output directory exists before saving
plt.savefig(os.path.join('lda_result', 'perplexityTrend.png'))
plt.show()
Second, if you want to tune all of the parameters in one go, you can also use sklearn directly for cross-validation, but this is guaranteed to take a very long time. The code below is only for reference; add or remove parameters to suit your own needs.
from sklearn.model_selection import GridSearchCV
parameters = {'learning_method': ('batch', 'online'),
              'n_components': range(20, 75, 5),
              'perp_tol': (0.001, 0.01, 0.1),
              'doc_topic_prior': (0.001, 0.01, 0.05, 0.1, 0.2),
              'topic_word_prior': (0.001, 0.01, 0.05, 0.1, 0.2),
              'max_iter': [1000]}
lda = LatentDirichletAllocation()
model = GridSearchCV(lda, parameters)
model.fit(tf)
sorted(model.cv_results_.keys())
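Once the grid search finishes, the usual GridSearchCV attributes apply, for example (a short sketch assuming the model variable fitted above; by default the candidates are scored with LDA's approximate log-likelihood):
print(model.best_params_)  # best parameter combination found by the search
best_lda = model.best_estimator_  # the refitted LatentDirichletAllocation using those parameters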
Appendix: sklearn LDA API explained
Class sklearn.decomposition.LatentDirichletAllocation(n_topics=10, doc_topic_prior=None, topic_word_prior=None, learning_method=None, learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=1, verbose=0, random_state=None)
Parameters:
1) n_topics: the number of latent topics K, which needs tuning. How large K should be depends on how fine-grained the topics need to be. For a coarse split, say just separating animals, plants and non-living things, K can be very small, even single digits. If the goal is fine-grained, such as distinguishing different animals, different plants and different non-living things, then K has to be much larger, perhaps in the thousands or tens of thousands, which in turn requires a very large number of training documents.
2) doc_topic_prior: the parameter α of the Dirichlet prior on the per-document topic distribution θd. If you have no prior knowledge about the topic distribution, the default value 1/K is fine.
3) topic_word_prior: the parameter η of the Dirichlet prior on the per-topic word distribution βk. If you have no prior knowledge, the default value 1/K is fine.
4) learning_method: the LDA solver, either 'batch' or 'online'. 'batch' is the variational inference EM algorithm described in the theory post, while 'online' is the online variational inference EM algorithm: it builds on 'batch' by splitting the training samples into batches and updating the topic-word distribution batch by batch. The default is 'online', and choosing 'online' also lets you train incrementally with partial_fit; note that in scikit-learn 0.20 the default switches back to 'batch'. If the corpus is small and you are just learning, 'batch' is recommended since there are far fewer parameters to tune; if the corpus is very large, 'online' is the better choice.
5) learning_decay: only meaningful when the solver is 'online'. It should lie in (0.5, 1.0] to guarantee asymptotic convergence of the online algorithm. It mainly controls the learning rate of the online algorithm; the default is 0.7 and usually does not need changing.
6) learning_offset: only meaningful when the solver is 'online'; it must be greater than 1 and downweights the influence of the early training batches on the final model.
7) max_iter: the maximum number of EM iterations.
8) total_samples: only meaningful when the solver is 'online'; it is the total number of documents and is needed when training incrementally with partial_fit.
9) batch_size: only meaningful when the solver is 'online'; the number of documents used in each EM iteration.
10) mean_change_tol: the stopping threshold for the variational parameter updates in the E-step; once all updates fall below this threshold the E-step ends and the algorithm moves to the M-step. The default usually does not need changing.
11) max_doc_update_iter: the maximum number of iterations for the variational parameter updates in the E-step; if the limit is reached, the algorithm moves on to the M-step.
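For the 'online' solver, a minimal partial_fit sketch could look like the following (an illustration only; it assumes tf is the document-word matrix built earlier and feeds it in chunks of 500 rows):
from sklearn.decomposition import LatentDirichletAllocation
# online (mini-batch) training sketch: total_samples should be the full corpus size
lda_online = LatentDirichletAllocation(n_components=20,
                                       learning_method='online',
                                       batch_size=128,
                                       total_samples=tf.shape[0],
                                       random_state=1)
for start in range(0, tf.shape[0], 500):
    lda_online.partial_fit(tf[start:start + 500])  # update the model with one chunk at a time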
Methods:
1) fit(X[, y]): train the model on the training data; X is the document-word count matrix.
2) fit_transform(X[, y]): train the model and return the topic distribution of the training data.
3) get_params([deep]): get the parameters.
4) partial_fit(X[, y]): train the model incrementally on mini-batches in online mode.
5) perplexity(X[, doc_topic_distr, sub_sampling]): compute the approximate perplexity of X.
6) score(X[, y]): compute the approximate log-likelihood.
7) set_params(**params): set the parameters.
8) transform(X): use the trained model to obtain the topic distribution of every document in X.
Link: https://pan.baidu.com/s/1RrckbSNEs1dZB4NItlg07Q
Extraction code: s5p6