Chinese Short-Text Classification

Reposted article · 2019-12-03 15:47 · NLP

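Text classification is a supervised learning task with applications in many settings. This article works through a small dataset, step by step, to classify Chinese short texts, keeping theory to a minimum and focusing on practice.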

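The data is a set of judicial records (police reports). For each input record, the task is to determine who the other party in the incident is: the caller was beaten by a husband, by a wife, by a son, or by a daughter. This is treated as a supervised four-class text classification problem.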

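The whole process consists of the following steps:

Load the corpus
Tokenize (word segmentation)
Remove stop words
Extract bag-of-words features
Build and train the models
Evaluate on the test set
Compare models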

The basic workflow is shown in the figure below:

[Figure: workflow diagram]

Now for the hands-on part.

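1. Load the corpus. Before that, import the required Python packages, then read the full corpus and the stop-word dictionary into memory.

Step 1: import the dependencies: the random module, jieba for word segmentation, and pandas: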

import random
import jieba
import pandas as pd

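Step 2: load the stop-word dictionary from stopwords.txt. You can add scenario-specific words to that file (for example articles, personal pronouns, or numbers):

# Load the stop words (one per line) and keep them in a set for fast membership tests
stopwords = pd.read_csv('stopwords.txt', index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
stopwords = set(stopwords['stopword'].values)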

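Step 3: load the corpus. It consists of four CSV files, one per class, which pandas can read directly. After loading, drop the rows that contain NaN, then extract the segment column (the text to be tokenized) as a Python list:

# Load the corpus, one CSV file per class
laogong_df = pd.read_csv('beilaogongda.csv', encoding='utf-8', sep=',')
# File name assumed from the naming pattern of the other files;
# the original listing read beilaogongda.csv a second time here.
laopo_df = pd.read_csv('beilaopoda.csv', encoding='utf-8', sep=',')
erzi_df = pd.read_csv('beierzida.csv', encoding='utf-8', sep=',')
nver_df = pd.read_csv('beinverda.csv', encoding='utf-8', sep=',')
# Drop rows that contain NaN
laogong_df.dropna(inplace=True)
laopo_df.dropna(inplace=True)
erzi_df.dropna(inplace=True)
nver_df.dropna(inplace=True)
# Convert the segment column to a list
laogong = laogong_df.segment.values.tolist()
laopo = laopo_df.segment.values.tolist()
erzi = erzi_df.segment.values.tolist()
nver = nver_df.segment.values.tolist()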
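
The loading step above relies on pandas. As a rough illustration of what dropping NaN rows and then taking the segment column amounts to, here is a minimal standard-library sketch. The helper name `load_segments` and the inline sample CSV are made up for this example and do not come from the original corpus; a real file may contain other columns.

```python
import csv
import io

def load_segments(csv_text, column="segment"):
    """Return the values of `column`, skipping rows that have a missing or empty field."""
    reader = csv.DictReader(io.StringIO(csv_text))
    segments = []
    for row in reader:
        values = [row.get(name) for name in reader.fieldnames]
        # Roughly mimic DataFrame.dropna(): skip the row if any field is missing or empty.
        if any(value is None or value == "" for value in values):
            continue
        segments.append(row[column])
    return segments

# Hypothetical sample data, not taken from the original corpus.
sample = "id,segment\n1,报警人被老公打\n2,\n3,报警人称被丈夫殴打\n"
segments = load_segments(sample)
print(segments)
```

The second row has an empty segment field, so it is skipped, just as pandas would read the empty field as NaN and `dropna` would remove the row.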

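2. Tokenization and stop-word removal.

Step 1: define a function that tokenizes, removes stop words, and labels the texts in bulk. It takes three parameters: content_lines is the list of texts; sentences is a list defined beforehand that collects the tokenized, labeled results; category is the label:

# Tokenize and label: preprocess_text
# content_lines is one of the lists built above
# sentences is an initially empty list that collects the labeled data
# category is the class label
def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]  # drop pure-digit tokens
            segs = list(filter(lambda x: x.strip(), segs))  # drop whitespace-only tokens
            segs = list(filter(lambda x: len(x) > 1, segs))  # drop single-character tokens
            segs = list(filter(lambda x: x not in stopwords, segs))  # drop stop words
            sentences.append((" ".join(segs), category))  # attach the label
        except Exception:
            # Print the offending line and move on
            print(line)
            continue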
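
To see what each filter in `preprocess_text` removes without needing jieba, the sketch below applies the same four filters to a list that is already tokenized. The function name `clean_tokens`, the token list, and the stop-word set are hypothetical and only serve the illustration.

```python
def clean_tokens(tokens, stopwords):
    """Apply the same four filters as preprocess_text to an already tokenized list."""
    tokens = [t for t in tokens if not str(t).isdigit()]  # drop pure-digit tokens
    tokens = [t for t in tokens if t.strip()]             # drop whitespace-only tokens
    tokens = [t for t in tokens if len(t) > 1]            # drop single-character tokens
    tokens = [t for t in tokens if t not in stopwords]    # drop stop words
    return tokens

# Hypothetical tokenizer output and stop-word set, for illustration only.
tokens = ["报警人", "称", "2019", " ", "被", "老公", "殴打", "现在", "需要", "处理"]
stopwords = {"现在", "需要"}
cleaned = clean_tokens(tokens, stopwords)
print(" ".join(cleaned))
```

Note that the single-character filter also discards words such as 被 and 打, which may carry signal in this task; whether to keep that filter is a judgment call.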

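Step 2: call the function to build the training data. The judicial corpus used here has four classes (beaten by husband, by wife, by son, by daughter), labeled 0, 1, 2, and 3 respectively:

sentences = []
preprocess_text(laogong, sentences, 0)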
preprocess_text(laopo, sentences, 1)
preprocess_text(erzi, sentences, 2)
preprocess_text(nver, sentences, 3)

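Step 3: shuffle the resulting dataset so that the classes are mixed rather than stored in consecutive blocks, which gives a more reliable training distribution: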

random.shuffle(sentences)
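
`random.shuffle` gives a different order on every run. If a reproducible order is wanted, one option is a private, seeded generator, sketched below. The helper `shuffled_copy`, the seed value, and the toy pairs are assumptions made for this example.

```python
import random

# Hypothetical (text, label) pairs standing in for the real `sentences` list.
pairs = [("text a", 0), ("text b", 0), ("text c", 1), ("text d", 1), ("text e", 2)]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a private, seeded random generator."""
    rng = random.Random(seed)
    result = list(items)
    rng.shuffle(result)
    return result

first = shuffled_copy(pairs, seed=42)
second = shuffled_copy(pairs, seed=42)
print(first)
print(first == second)
```

Each text stays paired with its label because whole tuples are shuffled. Since `train_test_split` also shuffles by default, the explicit shuffle mainly affects the preview printed in the next step.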

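Step 4: print the first 10 records to the console to inspect them:

for sentence in sentences[:10]:
    print(sentence[0], sentence[1])  # index 0 is the space-joined token string, index 1 is the label

The result looks like this:

[Figure: console output of the first 10 labeled records]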

3. Extract bag-of-words features.

Step 1: define the bag-of-words feature extractor for the text:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(
    analyzer='word',    # use word tokens as features (not character n-grams)
    max_features=4000,  # keep the 4000 most frequent tokens
)
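
As a rough picture of what a bag-of-words model with a feature cap does, here is a minimal standard-library sketch: count tokens over the corpus, keep the most frequent ones as the vocabulary, and turn each document into a vector of counts. The functions `fit_vocabulary` and `transform` are illustrative names, and the tie-breaking rule is an assumption of this sketch; `CountVectorizer` may order and select features differently.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Map the max_features most frequent tokens to column indices."""
    counts = Counter(token for doc in docs for token in doc.split())
    # Sort by descending corpus frequency, then alphabetically to break ties.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    kept = sorted(ranked[:max_features])
    return {token: index for index, token in enumerate(kept)}

def transform(docs, vocabulary):
    """Turn each document into a list of token counts, one column per vocabulary entry."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

# Toy documents, already tokenized and joined with spaces like the real pipeline output.
docs = ["husband hit caller", "wife hit caller", "son hit caller again"]
vocab = fit_vocabulary(docs, max_features=3)
matrix = transform(docs, vocab)
print(vocab)
print(matrix)
```

With a cap of 3, the class-specific words husband, wife, and son fall out of the vocabulary here, which shows how a cap that is too low can discard exactly the tokens that separate the classes.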

Step 2: split the data into a training set and a test set with scikit-learn:

from sklearn.model_selection import train_test_split
x, y = zip(*sentences)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1256)
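
By default `train_test_split` holds out 25% of the samples and does not balance the classes between the two sets unless `stratify` is passed. The sketch below shows the idea of a stratified split using only the standard library. The function `stratified_split` and the toy data are assumptions for illustration; in scikit-learn one would normally pass `stratify=y` instead.

```python
import random
from collections import defaultdict

def stratified_split(samples, labels, test_fraction=0.25, seed=1256):
    """Split so that each label contributes about test_fraction of its items to the test set."""
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Hold out at least one item per class.
        n_test = max(1, round(len(items) * test_fraction))
        test.extend((item, label) for item in items[:n_test])
        train.extend((item, label) for item in items[n_test:])
    return train, test

# Hypothetical imbalanced toy data: 8 items of class 0 and 4 items of class 1.
samples = [f"doc{i}" for i in range(12)]
labels = [0] * 8 + [1] * 4
train, test = stratified_split(samples, labels)
print(len(train), len(test))
```

Both sets keep the 2:1 class ratio of the full data, which matters when some classes are much smaller than others.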

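Step 3: fit the bag-of-words vocabulary on the training data (the actual conversion to count vectors happens later through vec.transform):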

vec.fit(x_train)

4. Build and train the models.

Define a naive Bayes model and train it on the training set, using MultinomialNB from sklearn directly:

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)
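
For readers who want to see what a multinomial naive Bayes classifier computes, here is a minimal from-scratch sketch with add-one (Laplace) smoothing, which corresponds to `alpha=1.0`, the `MultinomialNB` default. The function names and the toy documents are made up for this example; it is not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from space-joined token strings."""
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for doc, label in zip(docs, labels):
        tokens = doc.split()
        token_counts[label].update(tokens)
        vocabulary.update(tokens)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            tok: math.log((token_counts[c][tok] + alpha) / total) for tok in vocabulary
        }
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; tokens unseen in training are ignored."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc.split() if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data: label 0 = beaten by husband, label 1 = beaten by wife.
docs = ["husband hit caller", "husband drunk hit", "wife hit caller", "wife threw cup"]
labels = [0, 0, 1, 1]
log_prior, log_likelihood = train_nb(docs, labels)
print(predict_nb("drunk husband", log_prior, log_likelihood))
print(predict_nb("wife threw", log_prior, log_likelihood))
```

Smoothing keeps a token that never occurred in one class from driving that class's probability to zero.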

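5. Evaluation.

Step 1: steps 1-4 above went from the raw corpus to a trained model. Next, score the model on the test set. Note that score on a scikit-learn classifier returns the mean accuracy, not an AUC value: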

print(classifier.score(vec.transform(x_test), y_test))

The resulting score is 0.647331786543.

Step 2: predict on the test set:

pre = classifier.predict(vec.transform(x_test))
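
A single accuracy figure hides which classes get confused with each other. Given true labels and predictions, a confusion matrix and per-class precision and recall can be computed as sketched below. The label lists are invented stand-ins for `y_test` and `pre`, and the helper names are assumptions of this example.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts samples whose true class is i and predicted class is j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, guess in zip(y_true, y_pred):
        matrix[truth][guess] += 1
    return matrix

def per_class_report(matrix):
    """Return (precision, recall) per class; a ratio with a zero denominator is reported as 0.0."""
    report = []
    for k in range(len(matrix)):
        true_positive = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)
        actual_k = sum(matrix[k])
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        report.append((precision, recall))
    return report

# Hypothetical labels and predictions, standing in for y_test and pre.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 2]
matrix = confusion_matrix(y_true, y_pred, n_classes=4)
accuracy = sum(matrix[k][k] for k in range(4)) / len(y_true)
report = per_class_report(matrix)
print(matrix)
print(accuracy)
print(report)
```

Row i of the matrix shows where the samples of class i ended up, so off-diagonal entries point directly at the confused class pairs.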

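6. Model comparison.

Steps 1-5 complete the pipeline from corpus through training and evaluation. Next, let's see how changing the feature representation and the model affects the result.

(1) Changing the feature representation

We can make the features a bit stronger by adding n-gram count features on top of single words and enlarging the vocabulary. The code below uses ngram_range=(1,4), so it covers 2-grams and 3-grams and also 4-grams.

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(
    analyzer='word',     # build n-grams from word tokens (not characters)
    ngram_range=(1, 4),  # use word n-grams of length 1 through 4
    max_features=20000,  # keep the 20000 most frequent n-grams
)
vec.fit(x_train)
# Train with the naive Bayes algorithm
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)
# Score on the test set
print(classifier.score(vec.transform(x_test), y_test))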

The resulting score is 0.649651972158, a slight improvement over the previous 0.647331786543, but not a significant one.
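
To make the effect of `ngram_range` concrete, the sketch below lists the word n-grams of one space-joined token string. The function `word_ngrams` is a hypothetical helper written for this example; it mirrors the idea behind word-level n-grams rather than the exact `CountVectorizer` internals.

```python
def word_ngrams(text, ngram_range=(1, 4)):
    """List all word n-grams of a space-joined token string, shortest first."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens over the sequence.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams("报警人 老公 殴打 处理", ngram_range=(1, 3))
print(grams)
print(len(word_ngrams("报警人 老公 殴打 处理")))
```

Four tokens yield 4 unigrams, 3 bigrams, and 2 trigrams. Short texts therefore produce few higher-order n-grams, and most of those are rare, which may be one reason the gain above is small.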

(2) Changing the model

Training with an SVM:

from sklearn.svm import SVC
svm = SVC(kernel='linear')
svm.fit(vec.transform(x_train), y_train)
print(svm.score(vec.transform(x_test), y_test))

Other options include decision trees, random forests, XGBoost, neural networks, and so on. For XGBoost, start by wrapping the features in DMatrix objects:

import xgboost as xgb
import numpy as np
# Build the XGBoost DMatrix inputs
xgb_train = xgb.DMatrix(vec.transform(x_train), label=np.array(y_train))
xgb_test = xgb.DMatrix(vec.transform(x_test))

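In XGBoost, the parameters below are the main ones to tune:

params = {
    'booster': 'gbtree',  # tree-based booster
    'objective': 'multi:softmax',  # multi-class classification, predicts class labels
    # 'objective': 'multi:softprob',  # multi-class, predicts per-class probabilities
    # 'objective': 'binary:logistic',  # binary classification
    'eval_metric': 'merror',  # multi-class error rate ('mlogloss' is the log-loss alternative)
    'num_class': 4,  # number of classes, required with multi:softmax
    'gamma': 0.1,  # minimum loss reduction needed to split; larger is more conservative, typically around 0.1 or 0.2
    'max_depth': 8,  # tree depth; deeper trees overfit more easily
    'alpha': 0,  # L1 regularization coefficient
    'lambda': 10,  # L2 regularization on the weights; larger values make overfitting less likely
    'subsample': 0.7,  # fraction of training rows sampled for each tree
    'colsample_bytree': 0.5,  # fraction of columns sampled for each tree
    'min_child_weight': 3,
    # The default is 1. It is the minimum sum of hessians h required in each leaf.
    # For imbalanced binary (0-1) classification, if h is around 0.01,
    # min_child_weight = 1 means a leaf must contain at least 100 samples.
    'verbosity': 1,  # 0 silences output; older XGBoost releases used 'silent' for this
    'eta': 0.03,  # learning rate
    'seed': 1000,
    'nthread': -1,  # number of CPU threads
    # 'missing' is an argument of xgb.DMatrix rather than a booster parameter, so it is not set here
}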
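
The comment on `min_child_weight` can be checked with a little arithmetic. For the binary logistic loss, each sample's hessian is p * (1 - p), where p is the predicted probability, and a leaf is only allowed if the hessians of its samples sum to at least `min_child_weight`. The sketch below works that out; the helper names are made up, and it covers the binary case the comment refers to, not the multi-class objective used in the parameters above.

```python
import math

def logistic_hessian(p):
    """Second derivative of the logistic loss with respect to the raw score: p * (1 - p)."""
    return p * (1.0 - p)

def min_samples_per_leaf(min_child_weight, hessian_per_sample):
    """Smallest sample count whose summed hessian reaches min_child_weight."""
    # The small epsilon guards against floating-point noise in the division.
    return math.ceil(min_child_weight / hessian_per_sample - 1e-9)

print(logistic_hessian(0.5))
print(logistic_hessian(0.01))
print(min_samples_per_leaf(1, 0.01))
print(min_samples_per_leaf(3, 0.01))
```

The hessian is largest for uncertain predictions (p near 0.5) and small for confident ones, so on heavily imbalanced data, where most predictions sit near 0 or 1, the same `min_child_weight` demands many more samples per leaf.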

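Summary

The above walked through Chinese short-text classification step by step on real judicial data. The example code can serve as a template. To optimize the pipeline and improve model accuracy, there are two main directions to try:

Feature construction: besides the bag-of-words model, consider word2vec, doc2vec, and similar representations;
Models: try other supervised classification algorithms, ensemble learning, and neural networks.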

