http://blog.csdn.net/lxg0807/article/details/52960072
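Environment: Python 2.7, Linux
Correction: I have to take back what I said earlier. The official package currently works only on Linux and macOS. Sorry for misleading everyone.
This post tests fastText, Facebook's open-source neural-network-based model for text classification.
Installing the fasttext Python package: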
pip install fasttext
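Step 1: obtain the texts to classify. I used the Tsinghua University news corpus; the download link is in the third article of this text series.
Output format: one sample per line, consisting of the segmented text, a tab, and the sample's label prefixed with __label__.
Note: this step is optional. You can start directly from Step 2, which provides the already processed files. I wrote this step down mainly to remember how the raw text was processed.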
import jieba
import os

basedir = "/home/li/corpus/news/"  # path to my corpus; change it to match your own directory
dir_list = ['affairs', 'constellation', 'economic', 'edu', 'ent', 'fashion', 'game',
            'home', 'house', 'lottery', 'science', 'sports', 'stock']

# Generate the fastText training and test sets
ftrain = open("news_fasttext_train.txt", "w")
ftest = open("news_fasttext_test.txt", "w")

for e in dir_list:
    indir = basedir + e + '/'
    files = os.listdir(indir)
    count = 0
    for fileName in files:
        count += 1
        filepath = indir + fileName
        with open(filepath, 'r') as fr:
            text = fr.read()
        text = text.decode("utf-8")
        # Tabs and newlines become spaces so that each sample stays on one line
        seg_text = jieba.cut(text.replace("\t", " ").replace("\n", " "))
        outline = " ".join(seg_text)
        outline = outline.encode("utf-8") + "\t__label__" + e + "\n"
        # The first files of each class go to the training set, the next 10000 to the test set
        if count < 10000:
            ftrain.write(outline)
            ftrain.flush()
        elif count < 20000:
            ftest.write(outline)
            ftest.flush()
        else:
            break

ftrain.close()
ftest.close()
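To make the line format concrete, here is a minimal sketch (written for Python 3, unlike the Python 2 script above) that builds one line and parses it back. The helper names make_line and parse_line are my own, not part of fasttext or jieba, and unlike the script above this sketch drops whitespace-only tokens.

```python
LABEL_PREFIX = "__label__"


def make_line(tokens, label):
    """Join segmented tokens and append the label: text, tab, __label__<label>."""
    # Assumption: whitespace-only tokens carry no information, so they are skipped.
    text = " ".join(t for t in tokens if t.strip())
    return text + "\t" + LABEL_PREFIX + label + "\n"


def parse_line(line):
    """Split one line of the file back into (text, label)."""
    text, tagged = line.rstrip("\n").rsplit("\t", 1)
    return text, tagged[len(LABEL_PREFIX):]


if __name__ == "__main__":
    line = make_line(["今天", "股市", "上涨"], "stock")
    print(repr(line))
    print(parse_line(line))
```

Keeping the text and the label separated by a tab is what lets the per-class evaluation script later in this post split each test line with split("\t").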
Step 2: classify with fastText, using the fasttext Python package.
Prepared data (download from Baidu Netdisk):
news_fasttext_train.txt
news_fasttext_test.txt
# -*- coding: utf-8 -*-
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import fasttext

# Train the model
classifier = fasttext.supervised("news_fasttext_train.txt", "news_fasttext.model", label_prefix="__label__")

# Load a previously trained model
# classifier = fasttext.load_model('news_fasttext.model.bin', label_prefix='__label__')
# Test the model
result = classifier.test("news_fasttext_test.txt")
print result.precision
print result.recall
0.92240420242
0.92240420242
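The two numbers are identical for a reason: every test line carries exactly one label and the classifier returns exactly one prediction per line, so the overall (micro-averaged) precision and recall share the same numerator and the same denominator. A minimal Python 3 sketch of that arithmetic follows; micro_precision_recall is a hypothetical helper and the toy labels are made up for illustration.

```python
def micro_precision_recall(true_labels, predicted_labels):
    """Micro-averaged precision and recall for single-label data with
    exactly one prediction per sample (k = 1)."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    precision = correct / len(predicted_labels)  # correct / predictions made
    recall = correct / len(true_labels)          # correct / true labels present
    return precision, recall


if __name__ == "__main__":
    truth = ["sports", "stock", "edu", "edu"]
    preds = ["sports", "edu", "edu", "game"]
    print(micro_precision_recall(truth, preds))  # (0.5, 0.5)
```

The two values only diverge when samples have several labels or when more than one prediction per sample is requested.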
fastText appears to report precision and recall only over the whole test set, so to get statistics for each class you have to write the code yourself.
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 18 14:17:27 2017

@author: xiaoguangli
"""
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import fasttext

classifier = fasttext.load_model('news_fasttext.model.bin', label_prefix='__label__')

labels_right = []
texts = []
with open("news_fasttext_test.txt") as fr:
    for line in fr:
        line = line.decode("utf-8").rstrip()
        labels_right.append(line.split("\t")[1].replace("__label__", ""))
        texts.append(line.split("\t")[0])

# predict() returns a two-dimensional result: one list of labels per input text
labels_predict = [e[0] for e in classifier.predict(texts)]

text_labels = list(set(labels_right))
text_predict_labels = list(set(labels_predict))
print text_predict_labels
print text_labels

A = dict.fromkeys(text_labels, 0)          # correctly predicted samples per class
B = dict.fromkeys(text_labels, 0)          # samples per class in the test set
C = dict.fromkeys(text_predict_labels, 0)  # samples per class in the predictions
for i in range(0, len(labels_right)):
    B[labels_right[i]] += 1
    C[labels_predict[i]] += 1
    if labels_right[i] == labels_predict[i]:
        A[labels_right[i]] += 1

print A
print B
print C

# Compute precision, recall and F-score for each class
for key in B:
    try:
        r = float(A[key]) / float(B[key])
        p = float(A[key]) / float(C[key])
        f = p * r * 2 / (p + r)
        print "%s:\t p:%f\t r:%f\t f:%f" % (key, p, r, f)
    except (KeyError, ZeroDivisionError):
        # A class that was never predicted, or never predicted correctly, ends up here
        print "error:", key, "right:", A.get(key, 0), "real:", B.get(key, 0), "predict:", C.get(key, 0)
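The scripts in this post use the old fasttext package for Python 2.7. As far as I know, newer releases of the official package replace supervised() with train_supervised() and add a test_label() method that reports per-label metrics. The following Python 3 sketch assumes such a release is installed; train_and_evaluate is a hypothetical wrapper of my own, and I have not rerun the experiment with it.

```python
def train_and_evaluate(train_path, test_path, model_path):
    """Train, save and evaluate a classifier with the newer fasttext API."""
    # Imported lazily so that this module can be loaded without the package.
    import fasttext

    # label="__label__" is the prefix that marks labels in the data files.
    model = fasttext.train_supervised(input=train_path, label="__label__")
    model.save_model(model_path)

    # test() returns (number of samples, precision at 1, recall at 1).
    n, precision, recall = model.test(test_path)

    # test_label() returns a dict keyed by the full label string; each value
    # is expected to hold per-label precision, recall and F1.
    per_label = model.test_label(test_path)
    return n, precision, recall, per_label


# Example call, using the file names from this post:
# n, p, r, per_label = train_and_evaluate(
#     "news_fasttext_train.txt", "news_fasttext_test.txt", "news_fasttext.model.bin")
```

If test_label() behaves as described, it would make the hand-written counting above unnecessary, although the manual version remains useful for seeing exactly how the numbers are derived.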
Classes in the experiment data
[u'affairs', u'fashion', u'lottery', u'house', u'science', u'sports', u'game', u'economic', u'ent', u'edu', u'home', u'constellation', u'stock']
['affairs', 'fashion', 'house', 'sports', 'game', 'economic', 'ent', 'edu', 'home', 'stock', 'science']
{'science': 8415, 'affairs': 8257, 'fashion': 3173, 'house': 9491, 'sports': 9739, 'game': 9506, 'economic': 9235, 'ent': 9665, 'edu': 9491, 'home': 9315, 'stock': 9015}
{'science': 10000, 'affairs': 10000, 'fashion': 3369, 'house': 10000, 'sports': 10000, 'game': 10000, 'economic': 10000, 'ent': 10000, 'edu': 10000, 'home': 10000, 'stock': 10000}
{u'affairs': 8562, u'fashion': 3585, u'lottery': 96, u'science': 9088, u'edu': 10068, u'sports': 10099, u'game': 10151, u'economic': 10131, u'ent': 10798, u'house': 10000, u'home': 10103, u'constellation': 432, u'stock': 10256}
# Experimental results (p = correct / predicted, r = correct / actual, as computed in the script above)
science:   p:0.925946  r:0.841500  f:0.881706
affairs:   p:0.964377  r:0.825700  f:0.889667
fashion:   p:0.885077  r:0.941822  f:0.912568
house:     p:0.949100  r:0.949100  f:0.949100
sports:    p:0.964353  r:0.973900  f:0.969103
game:      p:0.936459  r:0.950600  f:0.943477
economic:  p:0.911559  r:0.923500  f:0.917490
ent:       p:0.895073  r:0.966500  f:0.929416
edu:       p:0.942690  r:0.949100  f:0.945884
home:      p:0.922003  r:0.931500  f:0.926727
stock:     p:0.878998  r:0.901500  f:0.890107
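The per-class figures can be reproduced from the three count dictionaries printed above, using precision = A / C, recall = A / B and F = 2pr / (p + r). The Python 3 sketch below copies the counts from that output; the function name prf is my own.

```python
# Counts copied from the output above.
A = {'science': 8415, 'affairs': 8257, 'fashion': 3173, 'house': 9491,
     'sports': 9739, 'game': 9506, 'economic': 9235, 'ent': 9665,
     'edu': 9491, 'home': 9315, 'stock': 9015}      # correctly predicted
B = {'science': 10000, 'affairs': 10000, 'fashion': 3369, 'house': 10000,
     'sports': 10000, 'game': 10000, 'economic': 10000, 'ent': 10000,
     'edu': 10000, 'home': 10000, 'stock': 10000}   # present in the test set
C = {'affairs': 8562, 'fashion': 3585, 'lottery': 96, 'science': 9088,
     'edu': 10068, 'sports': 10099, 'game': 10151, 'economic': 10131,
     'ent': 10798, 'house': 10000, 'home': 10103, 'constellation': 432,
     'stock': 10256}                                # predicted


def prf(label):
    """Return (precision, recall, F-score) for one class."""
    p = A[label] / C[label]
    r = A[label] / B[label]
    f = 2 * p * r / (p + r)
    return p, r, f


if __name__ == "__main__":
    for label in sorted(B):
        print("%s:\t p:%f\t r:%f\t f:%f" % ((label,) + prf(label)))
```

For science, for example, this gives 8415 / 9088 for precision and 8415 / 10000 for recall.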
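The results show that fastText classifies quite well: without any parameter tuning, the scores are mostly around 0.9 or higher. However, the predictions somehow contain an extra class, constellation. Could it be... still looking for the cause.
Correction, 2016/11/7: dictionary B shows that the test set labels contain no lottery or constellation samples. Because the first files of each class (up to 10000) were taken for training, these two smaller classes were used up by the training set and none of their samples reached the test set. When preparing the data in Step 1, you can therefore size the training and test sets according to how much data lottery and constellation have, or, more bluntly, simply drop these two classes because they do not reach the required number of samples.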
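One way to act on this correction is to split each class by proportion and only then apply the 10000-per-split cap, so that small classes appear in both files. The Python 3 sketch below is an assumption of mine, not the procedure used for the results above; split_class, the 0.5 ratio and the cap are all illustrative.

```python
def split_class(files, train_ratio=0.5, max_per_split=10000):
    """Split one class's file list into (train, test) parts.

    The class is divided by train_ratio first; each part is then
    truncated to max_per_split items.
    """
    n_train = min(int(len(files) * train_ratio), max_per_split)
    n_test = min(len(files) - n_train, max_per_split)
    return files[:n_train], files[n_train:n_train + n_test]


if __name__ == "__main__":
    small = ["doc%d.txt" % i for i in range(3000)]    # a small class
    large = ["doc%d.txt" % i for i in range(30000)]   # a large class
    print([len(part) for part in split_class(small)])  # [1500, 1500]
    print([len(part) for part in split_class(large)])  # [10000, 10000]
```

In the Step 1 script this would replace the count-based if/elif: call split_class(os.listdir(indir)) once per class and write each part to its own file.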