Text Processing with Python (Summary of a Student Innovation Project Case Study)

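I previously used Python for some text-processing work, and here I write up one of the cases I worked through. For other, similar text data, the same steps can be applied with minor adaptation.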

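The following topics are covered:

1. Chinese word segmentation;
2. Stop-word removal;
3. TF-IDF computation;
4. Word clouds;
5. A simple Word2Vec implementation;
6. A simple implementation of the LDA topic model.

They are not presented in that order, however; instead they are shown together through a few worked cases.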

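The input is a CSV file, which we will call data.csv. Suppose it looks like this:

[Partial screenshot of data.csv]

Let us first look at Chinese word segmentation and stop-word removal, since both are basic steps. I have tried many approaches, and I find the code below works well (though there may be better ways).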

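1. Chinese word segmentation (jieba) and stop-word removal

Segmentation uses jieba. A segmentation function can be defined like this:

import jieba
mycut = lambda s: ' '.join(jieba.cut(s))  # segment s and join the tokens with spaces

The cases below show how it is used.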

Next comes stop-word removal. First download a stop-word list and load it into the program as the variable stoplists (a list); then proceed as follows:

import codecs
with codecs.open("stopwords.txt", "r", encoding="utf-8") as f:
    text = f.read()
stoplists = text.splitlines()
texts = [[word for word in document.split() if word not in stoplists] for document in documents]
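The filtering step can be tried on toy data before any real documents are loaded. The sketch below is only an illustration: the two sample documents and the five stop words are made up, and converting the list to a set is an optional tweak (not part of the original code) that speeds up membership tests on large stop-word lists.

```python
# Toy data (hypothetical): each document is already segmented and
# space-joined, which is the shape that mycut produces.
documents = ["我 喜歡 的 文本 處理", "這 是 一 個 的 例子"]

# Hypothetical stop-word list; in the article it is read from stopwords.txt.
stoplists = ["的", "是", "這", "一", "個"]

# A set gives constant-time membership tests; a list is scanned linearly.
stopset = set(stoplists)

# Same list comprehension as in the article, with the set swapped in.
texts = [[word for word in document.split() if word not in stopset]
         for document in documents]

print(texts)
```

The result is a list of token lists, one per document, which is the input shape that gensim's corpora.Dictionary expects in the LDA case.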

The documents variable is introduced in the LDA case below.
Next, I summarise the cases in this order: the LDA topic model, Word2Vec, and then TF-IDF and word clouds.

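2. LDA topic model case

2.1 Importing libraries and data

from gensim import corpora, models, similarities
import logging
import jieba
import pandas as pd
# a multi-character separator needs the Python parsing engine
df = pd.read_csv('data.csv', encoding='gbk', header=None, sep="xovm02", engine='python')
df = df[0].dropna()  # [0] because our data is the first column; dropna removes empty rows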
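The article does not explain the unusual separator "xovm02". My reading, which is an assumption, is that it is simply a string that never occurs in the data, so pandas puts each whole line into a single column. Under that assumption, the loading step amounts to "one document per line, empty lines dropped", which can be sketched with the standard library alone (the sample text and temporary file are made up for the illustration):

```python
import os
import tempfile

# Toy file contents (hypothetical): one document per line, one blank line.
raw = "今天天氣很好\n\n我喜歡文本處理\n"

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    with open(csv_path, "w", encoding="gbk") as f:
        f.write(raw)

    # Read every line as one document and skip empty lines; this roughly
    # mirrors taking column 0 of a single-column file and calling dropna().
    with open(csv_path, encoding="gbk") as f:
        documents = [line.strip() for line in f if line.strip()]

print(documents)
```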

2.2 Segmentation

mycut = lambda s: ' '.join(jieba.cut(s))
data = df.apply(mycut)  # df is already a Series after df[0].dropna() in 2.1
documents = data

2.3 Fitting the LDA model (gensim)

# configuration
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# remove stop words
texts = [[word for word in document.split() if word not in stoplists] for document in documents]
# build the dictionary (the id -> word mapping)
dictionary = corpora.Dictionary(texts)
# keep words that appear in at least 40 documents and in no more than 10% of all documents
dictionary.filter_extremes(no_below=40, no_above=0.1)
# save dictionary
dictionary.save('dict_v1.dict')
# build the bag-of-words corpus
corpus = [dictionary.doc2bow(text) for text in texts]
# initialize a TF-IDF model
tfidf = models.TfidfModel(corpus)
# use the model to transform the whole corpus
corpus_tfidf = tfidf[corpus]
# extract 30 LDA topics, using 500 iterations
lda = models.LdaModel(corpus_tfidf, id2word=dictionary, num_topics=30, iterations=500)
# save model to files
lda.save('mylda_v1.pkl')
# print the topic composition, and the scores, for the first document
for index, score in sorted(lda[corpus_tfidf[0]], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda.print_topic(index, 5)))
# print the most contributing words for all 30 topics
lda.print_topics(30)

To handle different situations, adjust the parameters, using the English comments as a guide.
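Since no_below and no_above are the parameters most likely to need tuning, it may help to see what the filtering and the bag-of-words step do on a tiny corpus. The following is a plain-Python sketch of the idea, not gensim's implementation: the helper functions are written here for illustration, and the thresholds are scaled down to fit four toy documents.

```python
from collections import Counter

# Toy corpus (hypothetical): four already-filtered token lists.
texts = [
    ["文本", "處理", "文本"],
    ["文本", "分詞"],
    ["詞雲", "分詞"],
    ["文本", "主題"],
]

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for text in texts for word in set(text))

def filter_extremes(doc_freq, num_docs, no_below, no_above):
    """Keep words found in at least no_below documents and in at most
    a fraction no_above of all documents."""
    return sorted(w for w, df in doc_freq.items()
                  if no_below <= df <= no_above * num_docs)

kept = filter_extremes(doc_freq, len(texts), no_below=2, no_above=0.5)
token2id = {word: i for i, word in enumerate(kept)}

def doc2bow(text, token2id):
    """Turn a token list into sorted (word_id, count) pairs,
    silently dropping words that were filtered out."""
    counts = Counter(token2id[w] for w in text if w in token2id)
    return sorted(counts.items())

corpus = [doc2bow(text, token2id) for text in texts]
print(kept, corpus)
```

Note how aggressive thresholds can leave some documents with an empty bag of words; on a small data set, no_below=40 may remove almost the whole vocabulary.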

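When experimenting, I found that gensim's LDA can also be called in another way:

import gensim
# fit the model with 25 topics
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=25)
# print the 25 topics, showing the 10 highest-contributing words of each
lda.print_topics(num_topics=25, num_words=10)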

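3. Word2Vec case (gensim)

For simplicity, assume here that segmentation and stop-word removal are already done and the result has been saved as a new data.csv.

3.1 Importing libraries and the file

import pandas as pd
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
# a multi-character separator needs the Python parsing engine
df = pd.read_csv('data.csv', encoding='gbk', header=None, sep="xovm02", engine='python')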

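3.2 Training Word2Vec

sentences = df[0]
line_sent = []
for s in sentences:
    line_sent.append(s.split())  # each sentence becomes a list of tokens
# main Word2Vec call; gensim >= 4.0 names the dimension vector_size (older versions: size)
model = Word2Vec(line_sent, vector_size=300, window=5, min_count=2, workers=2)
model.save('./word2vec.model')
print(model.wv.key_to_index)  # dict mapping each word to its index (model.wv.vocab before gensim 4.0)
print(model.wv['分手'])  # print the vector of the word '分手' ("breakup")
model.wv.similarity(u"分手", u"愛情")  # cosine similarity between '分手' and '愛情' ("love")
model.wv.most_similar(u"分手")  # the 10 words with the highest cosine similarity to '分手'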
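The similarity score between two words is the cosine of the angle between their vectors. As a minimal sketch of that measure, independent of gensim, the function below computes it for plain lists of floats; the three short vectors are made-up stand-ins for real 300-dimensional word vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # close to 1: same direction
print(cosine_similarity(vec_a, vec_c))  # 0: unrelated directions
```

A value near 1 means the two words are used in similar contexts; the vector lengths themselves do not affect the score.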

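4. TF-IDF computation and word clouds

4.1 TF-IDF computation

If you need TF-IDF values for the text, the following can serve as a reference:

# import the scikit-learn classes
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
# compute the TF-IDF values (data is your own text data); this is the main call
tfidf = transformer.fit_transform(vectorizer.fit_transform(data))
weight = tfidf.toarray()
# later steps may need:
# the selected words (get_feature_names() before scikit-learn 1.0)
words = vectorizer.get_feature_names_out()
# the term counts
vectorizer.fit_transform(data)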
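To make the numbers in weight easier to interpret, here is a plain-Python sketch of the weighting that TfidfTransformer applies with its default settings: a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The three toy documents are made up. One practical point for Chinese text: CountVectorizer's default token pattern only keeps tokens of two or more characters, so single-character words are dropped unless the pattern is changed, and data must consist of space-separated segmented strings.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): three segmented documents.
docs = [["文本", "處理", "文本"], ["文本", "分詞"], ["詞雲"]]
n_docs = len(docs)

vocab = sorted({w for d in docs for w in d})
doc_freq = Counter(w for d in docs for w in set(d))

def tfidf_row(doc):
    """TF-IDF weights of one document over the whole vocabulary."""
    counts = Counter(doc)
    # term count times smoothed idf: ln((1 + n) / (1 + df)) + 1
    raw = [counts[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
           for w in vocab]
    # L2-normalise so every row has unit length
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

weight = [tfidf_row(d) for d in docs]
for row in weight:
    print([round(x, 3) for x in row])
```

Each row of weight corresponds to one document and each column to one word in vocab, matching the layout of tfidf.toarray().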

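4.2 Word cloud

# scipy.misc.imread has been removed from SciPy; it was only needed to load a mask image
import matplotlib.pyplot as plt
from wordcloud import WordCloud, ImageColorGenerator
font_path = r'E:\simkai.ttf'  # path to a font that contains Chinese glyphs

# main word-cloud setup
# set the word-cloud properties
wc = WordCloud(font_path=font_path,  # font
               background_color="white",  # background colour
               max_words=200,  # maximum number of words shown
               #mask=back_coloring,  # background (mask) image
               max_font_size=200,  # maximum font size
               random_state=42,
               width=1000, height=860, margin=2,  # default image size; when a mask image is used, the saved image takes the mask's size; margin is the spacing around words
               )
# generate the word cloud (data is assumed to be your prepared text, as one string of space-separated words)
wc.generate(data)

# display the word cloud
plt.figure()
plt.imshow(wc)
plt.axis("off")
plt.show()

# save the result to the current folder
from os import path
d = path.dirname('.')
wc.to_file(path.join(d, "詞雲.png"))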
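If you want control over which words enter the cloud, one option is to count the words yourself first. The sketch below uses only the standard library; the sample string and the stop word are made up. The resulting dictionary has the word-to-count shape that WordCloud.generate_from_frequencies accepts, which could be used in place of wc.generate(data); that substitution is a suggestion and not part of the original workflow.

```python
from collections import Counter

# Hypothetical segmented text: tokens separated by spaces.
segmented = "文本 處理 的 文本 分詞 的 詞雲 文本"
stopset = {"的"}  # hypothetical stop words

# Count every token that is not a stop word.
freq = Counter(w for w in segmented.split() if w not in stopset)

# Keep only the most frequent words, similar in spirit to max_words.
top = dict(freq.most_common(2))

print(freq)
print(top)
```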

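5. Closing remarks

These are some of the text-processing methods I used in practice; there are certainly better ones.

This was written in a hurry, and I plan to fill in the parts that lack detail once I am on break.

This round of learning is about to end here; what follows is continued summarising, supplementing and improving.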
