The Road to Machine Learning: Predicting News Categories with the Python Naive Bayes Classifier MultinomialNB


 

This post uses Python 3 to learn the naive Bayes classification API.

It also involves extracting feature vectors from strings.
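
As a small illustration of turning strings into feature vectors, the sketch below runs scikit-learn's CountVectorizer on a toy corpus. The sentences are made up for this example and are not taken from the 20 newsgroups data; the default tokenizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
docs = [
    "the pens beat the devils",
    "the devils lost",
]

vec = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term count matrix
counts = vec.fit_transform(docs)

# vocabulary_ maps each token to its column index
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
dense = counts.toarray()
print(vocab)   # ['beat', 'devils', 'lost', 'pens', 'the']
print(dense)   # [[1 1 0 1 2], [0 1 1 0 1]]

# transform reuses the learned vocabulary; unseen words ("islanders") are ignored
unseen = vec.transform(["the islanders lost"]).toarray()
print(unseen)  # [[0 0 1 0 1]]
```

Each row is one document and each column is the count of one vocabulary term. This is why the test set should only be passed through transform, so that it shares the columns learned from the training set.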

The source code can be downloaded from my GitHub repository: https://github.com/linyi0604/MachineLearning

 

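from sklearn.datasets import fetch_20newsgroups
# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
# Text-to-feature-vector conversion module
from sklearn.feature_extraction.text import CountVectorizer
# Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
# Model evaluation module
from sklearn.metrics import classification_report

'''
Naive Bayes models are widely used for classifying large volumes of internet text.
Because the features are assumed to be conditionally independent given the class,
the number of parameters to estimate drops from exponential to roughly linear
in the number of features, which saves memory and computation time.
However, the model cannot take relationships between features into account,
so it performs poorly on classification tasks where the features are strongly correlated.
'''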

'''
1 Load the data
'''
# fetch_20newsgroups downloads the dataset over the network when it is not cached locally
news = fetch_20newsgroups(subset="all")
# Inspect the size of the data and a sample document
# print(len(news.data))
# print(news.data[0])
'''
18846

From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu

I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!
'''
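
'''
2 Split the data
'''
# Hold out 25% of the documents as the test set; random_state fixes the shuffle
x_train, x_test, y_train, y_test = train_test_split(news.data,
                                                    news.target,
                                                    test_size=0.25,
                                                    random_state=33)

'''
3 Predict news categories with the naive Bayes classifier
'''
# Convert the text into count features; the vocabulary is learned from the training set only
vec = CountVectorizer()
x_train = vec.fit_transform(x_train)
x_test = vec.transform(x_test)
# Initialize the naive Bayes model
mnb = MultinomialNB()
# Fit on the training set to estimate the parameters
mnb.fit(x_train, y_train)
# Predict on the test set and keep the predictions
y_predict = mnb.predict(x_test)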

'''
4 Evaluate the model
'''
print("Accuracy:", mnb.score(x_test, y_test))
print("Other metrics:\n", classification_report(y_test, y_predict, target_names=news.target_names))
'''
Accuracy: 0.8397707979626485
Other metrics:
                          precision    recall  f1-score   support

             alt.atheism       0.86      0.86      0.86       201
           comp.graphics       0.59      0.86      0.70       250
 comp.os.ms-windows.misc       0.89      0.10      0.17       248
comp.sys.ibm.pc.hardware       0.60      0.88      0.72       240
   comp.sys.mac.hardware       0.93      0.78      0.85       242
          comp.windows.x       0.82      0.84      0.83       263
            misc.forsale       0.91      0.70      0.79       257
               rec.autos       0.89      0.89      0.89       238
         rec.motorcycles       0.98      0.92      0.95       276
      rec.sport.baseball       0.98      0.91      0.95       251
        rec.sport.hockey       0.93      0.99      0.96       233
               sci.crypt       0.86      0.98      0.91       238
         sci.electronics       0.85      0.88      0.86       249
                 sci.med       0.92      0.94      0.93       245
               sci.space       0.89      0.96      0.92       221
  soc.religion.christian       0.78      0.96      0.86       232
      talk.politics.guns       0.88      0.96      0.92       251
   talk.politics.mideast       0.90      0.98      0.94       231
      talk.politics.misc       0.79      0.89      0.84       188
      talk.religion.misc       0.93      0.44      0.60       158

             avg / total       0.86      0.84      0.82      4712
'''
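
To make the independence assumption in the script's note more concrete, here is a minimal sketch that reproduces MultinomialNB's decision rule by hand on a tiny count matrix. The data is hypothetical and chosen only for illustration; the attributes feature_log_prob_, class_log_prior_ and classes_ are the fitted parameters scikit-learn exposes on the model.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy count matrix: 4 documents, 3 vocabulary terms
X = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [0, 1, 3],
    [0, 2, 2],
])
y = np.array([0, 0, 1, 1])

mnb = MultinomialNB()  # default alpha=1.0 (Laplace smoothing)
mnb.fit(X, y)

# Under the independence assumption, the log posterior (up to a constant) is
# log P(c) + sum_i count_i * log P(term_i | c)
joint_log_likelihood = X @ mnb.feature_log_prob_.T + mnb.class_log_prior_
manual_pred = mnb.classes_[np.argmax(joint_log_likelihood, axis=1)]

# Stored parameters: one log-probability per (class, term) pair plus one prior per class
n_params = mnb.feature_log_prob_.size + mnb.class_log_prior_.size
print(manual_pred, mnb.predict(X), n_params)
```

The manual prediction should agree with mnb.predict, and the parameter count grows as (number of classes) x (vocabulary size + 1), which is the roughly linear scaling the note describes.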

 

