1. I hope the teacher can cover some applications of Python in data mining and data analysis, ideally with concrete examples or a hands-on walkthrough.
2. Chinese word segmentation
- Download a full-length Chinese novel and convert it to UTF-8 encoding.
- Use the jieba library to compute Chinese word frequencies and output the TOP 20 words with their counts.
- **Exclude meaningless words and merge different forms of the same word.
- **Use the wordcloud library to draw a word cloud.
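Before looking at the full script, the counting and stop-word exclusion steps above can be sketched with `collections.Counter`, independent of any particular tokenizer (the token list and stop-word set here are made-up examples standing in for `jieba.lcut` output):

```python
from collections import Counter

# hypothetical token list, standing in for jieba.lcut(txt)
tokens = ["家珍", "的", "家珍", "鳳霞", "我們", "鳳霞", "家珍", "了"]
stopwords = {"我們"}  # words to exclude from the ranking

# keep multi-character words only, then drop the stop words
counts = Counter(w for w in tokens if len(w) > 1 and w not in stopwords)

for word, count in counts.most_common(2):
    print("{:<10}{:>5}".format(word, count))
```

`Counter.most_common(n)` replaces the manual sort-by-value step; the filtering in the generator expression mirrors the `len(word) == 1` skip and the exclusion set in the script below.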
```python
import jieba

book = "活着.txt"
with open(book, "r", encoding="utf-8") as f:
    txt = f.read()

# words to exclude from the ranking (stop words for this novel)
ex = {'有慶', '我們', '知道', '看到', '自己', '起來'}

ls = []
words = jieba.lcut(txt)
counts = {}
for word in words:
    ls.append(word)
    if len(word) == 1:      # skip single characters
        continue
    counts[word] = counts.get(word, 0) + 1

for word in ex:
    counts.pop(word, None)  # pop() avoids a KeyError if a word is absent

items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):         # top 10 shown here; use 20 for the assignment's TOP 20
    word, count = items[i]
    print("{:<10}{:>5}".format(word, count))

with open('ms.txt', 'w', encoding='utf-8') as wz:
    wz.write(str(ls))

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# WordCloud splits its input on whitespace, so feed it the space-joined
# segmented words rather than the raw text. Note: the default font cannot
# render Chinese; if the cloud shows empty boxes, pass font_path= pointing
# at a Chinese font file.
wzhz = WordCloud().generate(" ".join(words))
plt.imshow(wzhz)
plt.axis("off")
plt.show()
```
Output:
```
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.723 seconds.
Prefix dict has been built succesfully.
家珍          575
鳳霞          413
二喜          175
隊長          166
什么          151
他們          148
一個          145
看着          115
孩子          114
沒有          113
```
Word cloud result: