Sohu News Text Classification and Analysis

Reposted article, originally published 2020-10-06 15:45. Tag: study notes.

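【Experiment Objectives】

Learn data preprocessing methods and preprocess the training data;
Learn text modelling methods and model the documents of the corpus;
Understand how classification algorithms work and train a text classifier with supervised machine learning;
Use the trained classifier to classify unseen text;
Learn how to evaluate the performance of a classifier.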

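【Experiment Requirements】

Number of text categories: >= 10;
Training set size: >= 500,000 documents, about 50,000 per category on average;
Test set size: >= 500,000 documents, about 50,000 per category on average.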

【Experiment Content】

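1. Obtaining the training set

This experiment uses the Sogou news corpus (http://www.sogou.com/labs/resource/list_news.php), specifically its Sohu news data. The full historical version is 3.3 GB after decompression. After downloading and decompressing, the files look like this: [figure: listing of the decompressed corpus files]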

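2. Text preprocessing

Each file in the listing above looks roughly like this: [figure: sample of a raw corpus file]

As can be seen, the files carry a lot of redundant information. The goal of preprocessing is to extract the news text and store it in a folder named after its category. The category label is taken from the <url> field: for example, <url>http://sports.sohu.com/20080128/n254928174.shtml</url> gives the label sports. Note also that the raw files are ANSI-encoded (GB2312 here) and must be converted to UTF-8 before the text is extracted, otherwise later steps fail. The Python code for the encoding conversion and the text extraction follows: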

# -*- coding: utf-8 -*-
# Convert the raw corpus files from GB2312 to UTF-8 in place (Python 3).
import os


def list_folders_files(path):
    """
    Return the full paths of all files under `path`, including subfolders.
    :param path: root folder of the corpus
    :return: list of file paths
    """
    list_files = []
    for root, dirs, files in os.walk(path):
        for name in files:
            # Join with `root`, not `path`, so files in subfolders resolve correctly.
            list_files.append(os.path.join(root, name))
    return list_files


def convert(file_path, in_enc="GB2312", out_enc="UTF-8"):
    """Re-encode one file in place, skipping bytes that cannot be decoded."""
    try:
        print("convert [ " + os.path.basename(file_path) + " ].....From " + in_enc + " --> " + out_enc)
        with open(file_path, 'r', encoding=in_enc, errors='ignore') as f:
            new_content = f.read()
        with open(file_path, 'w', encoding=out_enc) as f:
            f.write(new_content)
    except IOError as err:
        print("I/O error: {0}".format(err))


# Raw string: otherwise "\U" would be parsed as a Unicode escape in Python 3.
path = r'C:\Users\iloveacm\pytorch\sohu\sougou_all\SogouCS'
for file_path in list_folders_files(path):
    convert(file_path, 'GB2312', 'UTF-8')

# -*- coding: utf-8 -*-
# Extract the news text from each converted file and save it under its category folder (Python 3).
import os
import re

# Raw string so that the backslashes of the Windows path are kept literally.
main_config = r'C:\Users\iloveacm\pytorch\sohu\sougou_after2'

# One record: URL, title and body, in that order.
record_re = re.compile(r'<url>(.*?)</url>.*?<contenttitle>(.*?)</contenttitle>.*?<content>(.*?)</content>', re.S)
# The subdomain of the URL is used as the category label.
label_re = re.compile(r'http://(.*?)\.sohu\.com')

# Only scan the top-level files, so the category folders created below are not read again.
for file in os.listdir(main_config):
    file_path = os.path.join(main_config, file)
    if not os.path.isfile(file_path):
        continue
    with open(file_path, 'rb') as f:
        text = f.read().decode('utf-8')
    for url_title, content_title, news_text in record_re.findall(text):
        match = label_re.search(url_title)
        if match is None:
            continue  # not a *.sohu.com URL, so no label can be derived
        title = match.group(1)
        if len(title) > 0 and len(news_text) > 30:
            print('[{}][{}][{}]'.format(file, title, content_title))
            save_config = os.path.join(main_config, title)
            os.makedirs(save_config, exist_ok=True)
            # Number the output files 1.txt, 2.txt, ... within each category.
            out_name = '{}.txt'.format(len(os.listdir(save_config)) + 1)
            with open(os.path.join(save_config, out_name), 'w', encoding='utf-8') as out:
                out.write(news_text)
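The labelling rule described above (take the subdomain of the <url> field as the category) can be checked in isolation. The following is a minimal sketch; the helper name `label_from_url` is hypothetical, and the pattern is a slightly stricter variant of the one in the script: it anchors at the start of the URL and accepts only a single-level subdomain.

```python
import re

# Hypothetical helper, not part of the original script.
# Matches http://<label>.sohu.com... where <label> contains no dot or slash.
_LABEL_RE = re.compile(r'^http://([^./]+)\.sohu\.com')


def label_from_url(url):
    """Return the subdomain of a *.sohu.com URL, or None if the URL does not match."""
    match = _LABEL_RE.match(url.strip())
    return match.group(1) if match else None


print(label_from_url('http://sports.sohu.com/20080128/n254928174.shtml'))  # sports
print(label_from_url('http://example.com/page.shtml'))  # None
```

Returning None for non-matching URLs lets the caller skip such records instead of failing with an IndexError.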

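The directory structure after conversion is as follows: [figure: one folder per category]

Each folder represents one class, 18 classes in total. Each txt file contains one news item. This completes the preprocessing.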

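3. Data cleaning and feature extraction

Inspecting the data

To check how well the data fits a short-text classification setting, the following code examines how the documents in the dataset are distributed by length.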

# Read every txt file in each category folder and count the documents by length.
# lists[i] is the number of documents whose length is i.
import os
import numpy as np

MAX_LEN = 20000
lists = [0] * (MAX_LEN + 1)


def EnumPathFiles(path, callback):
    if not os.path.isdir(path):
        print('Error:"', path, '" is not a directory or does not exist.')
        return
    # os.walk already descends into subfolders, so no explicit recursion is needed
    # (recursing as well would count the same file more than once).
    for root, dirs, files in os.walk(path):
        for f in files:
            callback(root, f)


def callback1(path, filename):
    textpath = os.path.join(path, filename)
    print(textpath)
    with open(textpath, 'rb') as fh:
        text = fh.read()
    # A Chinese character takes 3 bytes in UTF-8, so bytes // 3 approximates the character count.
    length = len(text) // 3
    if length <= MAX_LEN:
        lists[length] += 1


if __name__ == '__main__':
    EnumPathFiles(r'C:\Users\iloveacm\pytorch\sohu\sougou_all', callback1)
    m = np.array(lists)
    np.save('demo.npy', m)
    a = np.load('demo.npy')
    graphTable = a.tolist()
    print(graphTable)

Plotting the array obtained above gives the following result:

# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt

a = np.load('demo.npy')
graphTable = a.tolist()
# x axis: document length; y axis: number of documents with that length
plt.plot(graphTable)
plt.ylabel('the number of texts')
plt.xlabel('the length of text')
plt.axis([0, 3000, 0, 3000])
plt.show()

As the plot shows, the vast majority of documents have a length of 500 or less.
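The observation above is read off the plot. A number can be put on it by summing the length histogram. The sketch below assumes the same layout as the saved array (entry i is the number of documents of length i); the function name `share_at_most` and the toy counts are made up for illustration.

```python
def share_at_most(histogram, max_length):
    """Fraction of documents whose length is <= max_length.

    histogram[i] is the number of documents of length i.
    """
    total = sum(histogram)
    if total == 0:
        return 0.0
    return sum(histogram[:max_length + 1]) / total


# Toy histogram with made-up counts (not the real corpus): lengths 0..5.
toy = [0, 4, 3, 2, 1, 0]
print(share_at_most(toy, 2))  # 0.7
```

With the real data, the same function could be called as `share_at_most(np.load('demo.npy').tolist(), 500)`.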

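Word segmentation

This article uses jieba for word segmentation. During segmentation, Chinese stop words are removed, that is, words that carry little meaning such as 即 and 而已. Punctuation is removed at the same time, because the punctuation marks are included in the stop-word list. The stop-word list used in this experiment is stop_words2.txt. The code that handles the stop words follows: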

# -*- coding: utf-8 -*-
# Walk the corpus and process every file with ProsessofWords (Python 3).
import os
import jieba


def EnumPathFiles(path, callback, stop_words):
    if not os.path.isdir(path):
        print('Error:"', path, '" is not a directory or does not exist.')
        return
    # os.walk already visits subfolders; recursing as well would segment a file twice.
    for root, dirs, files in os.walk(path):
        for f in files:
            callback(root, f, stop_words)


def ProsessofWords(textpath, stop_words):
    """Segment one file with jieba, drop stop words, and overwrite it with space-separated words."""
    with open(textpath, 'r', encoding='utf-8') as f:
        text = f.read()
    kept = []
    for word in jieba.cut(text, cut_all=False):
        if word not in stop_words and word != '\t':
            kept.append(word)
    # Each kept word is followed by one space.
    outstr = ''.join(word + ' ' for word in kept)
    with open(textpath, 'w', encoding='utf-8') as f:
        f.write(outstr)


def callback1(path, filename, stop_words):
    textpath = os.path.join(path, filename)
    print(textpath)
    ProsessofWords(textpath, stop_words)


if __name__ == '__main__':
    stopwords_file = r'C:\Users\iloveacm\pytorch\sohu\stop_words2.txt'
    # A set makes the per-word membership test fast; the file is assumed to be UTF-8.
    stop_words = set()
    with open(stopwords_file, 'r', encoding='utf-8') as stop_f:
        for line in stop_f:
            line = line.strip()
            if line:
                stop_words.add(line)
    print(len(stop_words))

    EnumPathFiles(r'C:\Users\iloveacm\pytorch\sohu\sougou_all', callback1, stop_words)

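A sample of the result: [figure: a document after segmentation and stop-word removal]

To match the input format that Keras reads, the following code merges the data and attaches a label to every item. It produces these result files:

(1) News text data: one news item per line, each made up of words separated by spaces, 428,993 lines in total. Each item is truncated, keeping roughly the first 1,000 Chinese characters;

(2) News label data: one number per line, giving the category ID of the corresponding news item, 428,993 training labels in total.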
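Because the two files are matched purely by line number, it is worth checking that they stay aligned when they are read back. The following is a minimal sketch, not the author's loading code; the function name `load_dataset` is hypothetical, and the demo uses two tiny made-up files in place of train.txt and label.txt.

```python
import os
import tempfile


def load_dataset(text_path, label_path):
    """Read the merged text file and label file into two aligned lists."""
    with open(text_path, 'r', encoding='utf-8') as f:
        texts = [line.rstrip('\n') for line in f]
    with open(label_path, 'r', encoding='utf-8') as f:
        labels = [int(line) for line in f if line.strip()]
    if len(texts) != len(labels):
        raise ValueError('texts and labels differ in length: {} vs {}'.format(len(texts), len(labels)))
    return texts, labels


# Demo on two tiny made-up files; with the real data, pass the paths of train.txt and label.txt.
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, 'train.txt')
    label_path = os.path.join(tmp, 'label.txt')
    with open(text_path, 'w', encoding='utf-8') as f:
        f.write('team match win\nstock market fall\n')
    with open(label_path, 'w', encoding='utf-8') as f:
        f.write('7\n2\n')
    texts, labels = load_dataset(text_path, label_path)

print(texts)   # ['team match win', 'stock market fall']
print(labels)  # [7, 2]
```

Raising an error on a length mismatch catches the case where a document containing a line break has silently shifted every later label.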

The news labels are as follows:

label_dict = {'2008': '1', 'business': '2', 'hourse': '3', 'it': '4', 'learning': '5', 'news': '6', 'sports': '7', 'travel': '8', 'women': '9', 'yule': '10'}

The processing code is as follows:

# -*- coding: utf-8 -*-
# Merge the per-category files into one text file and one label file (Python 3).
import os


def merge_file(path):
    folders = os.listdir(path)
    print(folders)
    # Maps a category folder name to its label (named label_dict to avoid shadowing the built-in dict).
    label_dict = {'2008': '1', 'business': '2', 'hourse': '3', 'it': '4', 'learning': '5', 'news': '6', 'sports': '7', 'travel': '8', 'women': '9', 'yule': '10'}
    outfile_train = r'C:\Users\iloveacm\pytorch\sohu\train.txt'
    outfile_label = r'C:\Users\iloveacm\pytorch\sohu\label.txt'
    # Append mode, as in the original: delete old output files before running again.
    with open(outfile_train, 'a', encoding='utf-8') as result_train, \
            open(outfile_label, 'a', encoding='utf-8') as result_label:
        for folder in folders:
            text_dir = os.path.join(path, folder)
            for text in os.listdir(text_dir):
                txt_file_dir = os.path.join(text_dir, text)
                print(txt_file_dir)
                with open(txt_file_dir, 'r', encoding='utf-8') as f:
                    content = f.read()
                # Truncate long documents to the first 3000 characters.
                content = content[:3000]
                # Keep one document per line so that texts and labels stay aligned.
                content = content.replace('\r', ' ').replace('\n', ' ')
                result_train.write(content + '\n')
                result_label.write(label_dict[folder] + '\n')


if __name__ == "__main__":
    path = r'C:\Users\iloveacm\pytorch\sohu\sougou_all'
    merge_file(path)

This completes data preprocessing, data cleaning, and dataset preparation.

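4. Classification algorithms

The choice of machine learning algorithms in this experiment follows the article 搜狐新聞文本分類：機器學習大亂斗. That article's comparison of the algorithms: [figure: comparison of algorithms from the referenced article]

CNN model:

Model code and architecture diagram: [figure: CNN model code and architecture diagram]

On the first run the validation accuracy was abnormally low. The reason is that the dataset is laid out category by category, so the held-out validation portion contained categories the model had never been trained on.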
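One common way to avoid this problem is to shuffle texts and labels together before holding out the validation portion. This is relevant because Keras's `validation_split` takes the last fraction of the data as given, before any shuffling. The sketch below shows the idea in plain Python. It is not necessarily the fix the author applied, and `shuffled_split`, `val_fraction` and `seed` are hypothetical names.

```python
import random


def shuffled_split(texts, labels, val_fraction=0.2, seed=42):
    """Shuffle texts and labels together, then hold out the last val_fraction as validation."""
    if len(texts) != len(labels):
        raise ValueError('texts and labels must have the same length')
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    n_train = len(indices) - n_val
    train_idx, val_idx = indices[:n_train], indices[n_train:]
    train = ([texts[i] for i in train_idx], [labels[i] for i in train_idx])
    val = ([texts[i] for i in val_idx], [labels[i] for i in val_idx])
    return train, val


# Toy data sorted by class, like the merged file: all 1s first, then all 2s.
texts = ['doc{}'.format(i) for i in range(10)]
labels = [1] * 5 + [2] * 5
(train_x, train_y), (val_x, val_y) = shuffled_split(texts, labels)
print(sorted(set(train_y)), len(val_x))  # [1, 2] 2
```

Shuffling a list of indices rather than the two lists separately keeps every text paired with its own label.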

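After correcting this, the results are as follows:

Training accuracy 0.9792, validation accuracy 0.9020, training loss 0.0596, validation loss 0.3829, training time 22 minutes.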

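CNN_WORD2VEC:

Training accuracy 0.8648, validation accuracy 0.8683, training loss 0.4064, validation loss 0.3978, training time 8.75 minutes.

LSTM:

Training accuracy 0.9957, validation accuracy 0.9684, training loss 0.0158, validation loss 0.1326, training time 32.7 minutes.

LSTM_W2V:

Training accuracy 0.9206, validation accuracy 0.9327, training loss 0.2390, validation loss 0.2021, training time 21.3 minutes.

The four results above show that using word2vec word vectors reduced training time but lowered accuracy for both the CNN and the LSTM.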

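MLP:

Training accuracy 0.9930, validation accuracy 0.9472, training loss 0.0287, validation loss 0.2608, training time about 5 minutes (20 epochs).

Because of insufficient memory, num_words in Tokenizer(num_words) was set to 5000 for the MLP, so that only the 5000 most frequent words are used as features.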
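The effect of capping the vocabulary can be illustrated without Keras. The sketch below keeps only the most frequent words and drops the rest when encoding a document. It is a plain-Python illustration of the idea rather than Keras's Tokenizer, whose exact indexing and cut-off differ in detail; `build_vocab` and `encode` are hypothetical names.

```python
from collections import Counter


def build_vocab(tokenized_docs, num_words):
    """Map the num_words most frequent words to indices 1..num_words (0 is left free for padding)."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    # most_common orders by descending count; ties keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(num_words), start=1)}


def encode(doc, vocab):
    """Replace words by their indices and drop words outside the vocabulary."""
    return [vocab[word] for word in doc if word in vocab]


docs = [['team', 'match', 'team'], ['stock', 'match', 'team']]
vocab = build_vocab(docs, num_words=2)
print(vocab)                   # {'team': 1, 'match': 2}
print(encode(docs[1], vocab))  # [2, 1]
```

A smaller vocabulary shrinks the input representation and therefore the memory needed, at the cost of discarding rare words.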

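The comparison table is as follows:

	Training accuracy	Validation accuracy	Training loss	Validation loss	Training time
CNN	0.9792	0.9020	0.0596	0.3829	22 min
CNN_W2V	0.8648	0.8683	0.4064	0.3978	8.75 min
LSTM	0.9957	0.9684	0.0158	0.1326	32.7 min
LSTM_W2V	0.9206	0.9327	0.2390	0.2021	21.3 min
MLP	0.9930	0.9472	0.0287	0.2608	5 min (20 epochs)

The MLP trains quickly, at about 15 seconds per epoch, so it was trained for 20 epochs.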
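The table reports overall accuracy and loss only. Since evaluating the classifier is one of the stated objectives, per-class precision, recall and F1 are a natural complement, because they show which categories are confused with each other. The following is a minimal sketch on made-up predictions; it is not part of the original experiment, and `per_class_report` is a hypothetical name.

```python
def per_class_report(y_true, y_pred):
    """Return overall accuracy and a dict {label: (precision, recall, f1)}."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
        fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
        fn = sum(t == label and p != label for t, p in pairs)  # label missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1)
    return accuracy, report


# Made-up labels for illustration only.
y_true = [1, 1, 2, 2]
y_pred = [1, 2, 2, 2]
accuracy, report = per_class_report(y_true, y_pred)
print(accuracy)  # 0.75
for label, (precision, recall, f1) in report.items():
    print(label, round(precision, 3), round(recall, 3), round(f1, 3))
```

Averaging the per-class F1 values gives the macro F1, which weights every category equally regardless of how many documents it has.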

References

搜狐新聞文本分類：機器學習大亂斗 (Sohu News Text Classification: A Machine Learning Free-for-All)
