What is Text Classification
Text classification is a very common family of NLP tasks: the input is usually a piece of text, and the output is a predicted class label. Typical text classification tasks include topic classification, sentiment analysis, authorship attribution, and fake-content detection; many other problems can also be recast as classification after some reformulation.
Typical Steps
- Choose a task of interest
- Collect a suitable dataset
- Annotate the data
- Select features
- Choose a machine learning algorithm
- Tune hyperparameters on a validation set
- Try several algorithms and parameter settings
- Train the final model
- Evaluate on the test set (a minimal end-to-end sketch of these steps follows below)
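Purely as an illustration, here is a minimal end-to-end sketch of the workflow above. It assumes scikit-learn, a tiny made-up sentiment dataset, and a TF-IDF plus logistic regression pipeline; none of these choices come from the notes themselves.

```python
# Minimal end-to-end text classification sketch (toy data, scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny hypothetical dataset: texts and their sentiment labels.
texts = ["great movie, loved it", "terrible plot and acting",
         "wonderful performance", "boring and far too long",
         "an absolute delight", "a complete waste of time"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Hold out a test set; a real project would also keep a validation set for tuning.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0)

# Feature extraction (TF-IDF bag-of-words) plus a classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluate the final model on the held-out test set.
print(accuracy_score(y_test, model.predict(X_test)))
```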
Machine Learning Algorithms
Below is a brief introduction to several basic machine learning algorithms.
1. Naive Bayes
Assumes the features are conditionally independent of each other and applies Bayes' rule to find the most probable class.
Pros: fast to "train" and classify; robust, low-variance; good for low-data situations; the optimal classifier if the independence assumption is correct; extremely simple to implement.
Cons: the independence assumption rarely holds; low accuracy compared to similar methods in most situations; smoothing is required for unseen class/feature combinations.
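A minimal sketch of a Naive Bayes text classifier, assuming scikit-learn and made-up spam/ham texts for illustration; the `alpha` parameter is the smoothing mentioned above.

```python
# Multinomial Naive Bayes over bag-of-words counts (toy illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["cheap pills buy now", "meeting at noon tomorrow",
         "win a free prize now", "lunch with the project team"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # word-count features

# alpha=1.0 is Laplace (add-one) smoothing for unseen class/feature combinations.
clf = MultinomialNB(alpha=1.0).fit(X, labels)

print(clf.predict(vectorizer.transform(["free pills now"])))
```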
2. Logistic Regression
Logistic regression is linear regression with a small modification: a link function (the sigmoid) squashes the linear score into a probability between 0 and 1.
Training works much like a regression model: the weights are learned by minimising a cost function, and a regularisation term can be added as a penalty.
Pros: unlike Naive Bayes, it is not confounded by diverse, correlated features.
Cons: high bias; slow to train; some feature scaling issues; often needs a lot of data to work well; choosing the regularisation is a nuisance but important, since overfitting is a big problem.
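A minimal sketch of logistic regression with an L2 penalty, again assuming scikit-learn and toy data; the parameter `C` (inverse regularisation strength) is an illustrative choice, not something from the notes.

```python
# Logistic regression with L2 regularisation (toy illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["loved every minute", "hated every minute",
         "a joy to watch", "painful to sit through"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# C is the inverse regularisation strength: smaller C means a stronger penalty.
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, labels)

# predict_proba returns the 0-1 probabilities produced by the sigmoid link.
print(clf.predict_proba(X)[:, 1])
```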
3. Support Vector Machines (SVM)
Main idea: find a hyperplane that separates the training data and use it to classify the test set; not covered in detail here.
Pros: fast and accurate linear classifier; can handle non-linearity with the kernel trick; works well with huge feature sets.
Cons: multiclass classification is awkward; feature scaling can be tricky; deals poorly with class imbalance; uninterpretable.
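A minimal SVM sketch, assuming scikit-learn and toy topic-classification data; `LinearSVC` illustrates the linear classifier and `SVC(kernel="rbf")` illustrates the kernel trick, both as assumed choices.

```python
# Linear SVM (and a kernelised variant) for text classification (toy illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC

texts = ["stocks fell sharply today", "the team won the final",
         "markets rallied on the news", "a late goal sealed the match"]
labels = ["finance", "sport", "finance", "sport"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# LinearSVC: a fast linear classifier, well suited to huge sparse feature sets.
linear_clf = LinearSVC().fit(X, labels)

# SVC with an RBF kernel demonstrates the kernel trick for non-linear boundaries.
kernel_clf = SVC(kernel="rbf").fit(X, labels)

print(linear_clf.predict(vec.transform(["shares dropped after earnings"])))
```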
4. K-Nearest Neighbours (KNN)
Main idea: measure the distance between a new example and the existing labelled data (e.g. Euclidean or cosine distance) and assign it the majority label of its k nearest neighbours.
Pros: simple, effective; no training required; inherently multiclass; optimal with infinite data.
Cons: k has to be selected; issues with unbalanced classes; often slow (the k nearest neighbours have to be found); features must be selected carefully.
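A minimal KNN sketch, assuming scikit-learn and toy data; k=3 and cosine distance are illustrative choices.

```python
# k-nearest-neighbour classification with cosine distance (toy illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

texts = ["the pitch and the players", "interest rates and inflation",
         "goals, fouls and penalties", "bonds, equities and yields"]
labels = ["sport", "finance", "sport", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# k must be chosen by hand; cosine distance is a common choice for text vectors.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine").fit(X, labels)

print(knn.predict(vec.transform(["the central bank raised rates"])))
```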
5. Decision Tree
Main idea: build a tree from the feature information; each leaf node corresponds to a class.
Pros: in theory, very interpretable; fast to build and test; feature representation/scaling is irrelevant; good for small feature sets; handles non-linearly-separable problems.
Cons: in practice, often not that interpretable; highly redundant sub-trees; not competitive for large feature sets.
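A minimal decision-tree sketch, assuming scikit-learn and toy data; the depth limit and the printed tree are only for illustration.

```python
# Decision tree over bag-of-words features (toy illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

texts = ["free offer click now", "see you at the meeting",
         "claim your free prize", "agenda for tomorrow's meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)

# Each leaf of the learned tree corresponds to a predicted class.
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
```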
6. Random Forest
Main idea: an ensemble of many decision trees whose predictions are combined by voting to decide the final label.
Pros: usually more accurate and more robust than decision trees; a great classifier for small- to moderate-sized feature sets; training is easily parallelised.
Cons: same negatives as decision trees; too slow with large feature sets.
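A minimal random-forest sketch, assuming scikit-learn and toy data; the number of trees and `n_jobs=-1` (parallel training) are illustrative settings.

```python
# Random forest: many decision trees combined by voting (toy illustration).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

texts = ["free offer click now", "see you at the meeting",
         "claim your free prize", "agenda for tomorrow's meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# 100 trees are built on bootstrap samples; n_jobs=-1 trains them in parallel.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, labels)

print(forest.predict(vec.transform(["free prize tomorrow"])))
```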
7. Neural Network
Main idea: stack layers of interconnected nodes; each node combines the weighted outputs of the previous layer and passes the result on to the next layer. Not covered in detail here; each layer is essentially a linear model (much like logistic regression) followed by a non-linear activation.
Pros: extremely powerful; state-of-the-art accuracy on many tasks in natural language processing and vision.
Cons: not an off-the-shelf classifier; very difficult to choose good parameters; slow to train; prone to overfitting.
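A minimal feed-forward neural network sketch, assuming scikit-learn's MLPClassifier and toy data; the hidden-layer size, activation and iteration count are illustrative, not recommendations.

```python
# A small feed-forward neural network (multi-layer perceptron), toy illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

texts = ["loved it", "hated it", "really enjoyable", "truly awful"]
labels = ["pos", "neg", "pos", "neg"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Each hidden layer applies a linear transformation followed by a non-linear activation.
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=500, random_state=0)
mlp.fit(X, labels)

print(mlp.predict(vec.transform(["quite enjoyable"])))
```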
Hyperparameter Tuning
After training on the training set, the validation set is used to tune hyperparameters. Common approaches include k-fold cross-validation and grid search.
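A minimal sketch combining the two, assuming scikit-learn; the pipeline, the candidate parameter grid and cv=3 are illustrative choices.

```python
# Grid search over hyperparameters with k-fold cross-validation (toy illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["great film", "awful film", "really great acting",
         "really awful acting", "loved the story", "hated the story"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# Candidate settings; every combination is scored with 3-fold cross-validation.
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)],
              "clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```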
Evaluation
Common evaluation metrics:
- Accuracy = correct predictions / total predictions
- Precision = tp / (tp + fp)
- Recall = tp / (tp + fn)
- F1-score = 2 * precision * recall / (precision + recall)
There are also the macro F-score (the average of the per-class F-scores) and the micro F-score (computed from the pooled tp, fp and fn counts across classes).
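A minimal sketch of computing these metrics, assuming scikit-learn and made-up predictions.

```python
# Accuracy, precision, recall and F1, plus macro/micro averages (toy illustration).
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = ["pos", "neg", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "neg"]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, pos_label="pos"))
print("recall   :", recall_score(y_true, y_pred, pos_label="pos"))
print("f1       :", f1_score(y_true, y_pred, pos_label="pos"))

# Multi-class style averaging: macro averages the per-class scores,
# micro pools the tp/fp/fn counts across classes first.
print("macro f1 :", f1_score(y_true, y_pred, average="macro"))
print("micro f1 :", f1_score(y_true, y_pred, average="micro"))
```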