Text Classification


What Is Text Classification?

Text classification is a very common family of NLP tasks: the input is usually a piece of text, and the output is a predicted class label. Major text classification tasks include topic classification, sentiment analysis, authorship attribution, and fake-content detection; many other problems can also be solved as classification after some reformulation.

Standard Workflow

  1. Pick a task of interest
  2. Collect a suitable dataset
  3. Annotate the data
  4. Select features
  5. Choose a machine learning method
  6. Tune hyperparameters on a validation set
  7. Try several algorithms and parameter settings
  8. Train the final model
  9. Evaluate on the test set
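As a minimal sketch of steps 4–9, the following Python example wires feature extraction, training, and held-out evaluation together with scikit-learn (the toy sentiment documents and labels are hypothetical; a real task needs a much larger annotated corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical toy sentiment data (steps 1-3 assumed done).
texts = ["great movie", "terrible plot", "loved the acting",
         "boring and slow", "an instant classic", "waste of time"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Hold out a test set for the final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0)

# Step 4: turn text into features (bag-of-words with TF-IDF weighting).
vec = TfidfVectorizer()
X_train_vec = vec.fit_transform(X_train)
X_test_vec = vec.transform(X_test)

# Steps 5-8: choose an algorithm and train the final model.
clf = LogisticRegression().fit(X_train_vec, y_train)

# Step 9: evaluate on the held-out test set.
print(accuracy_score(y_test, clf.predict(X_test_vec)))
```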

Machine Learning Algorithms

This section briefly introduces several basic machine learning algorithms.

1. Naive Bayes

Naive Bayes assumes the features are conditionally independent of one another and applies Bayes' rule to find the most probable class:

\[P(c_n \mid f_1 \dots f_m) \propto p(c_n)\prod_{i=1}^{m} p(f_i \mid c_n) \]

Pros: fast to "train" and classify; robust, low-variance; good for low-data situations; optimal classifier if the independence assumption is correct; extremely simple to implement.

Cons: the independence assumption rarely holds; low accuracy compared to similar methods in most situations; smoothing required for unseen class/feature combinations.
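A minimal sketch with scikit-learn's MultinomialNB (the toy spam/ham documents are hypothetical); the `alpha` parameter is the add-one (Laplace) smoothing the cons above call for:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["cheap pills buy now", "meeting agenda attached",
        "buy cheap pills", "agenda for the meeting"]
y = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)  # raw word counts as features
# alpha=1.0 is add-one (Laplace) smoothing for unseen class/feature pairs.
clf = MultinomialNB(alpha=1.0).fit(X, y)
print(clf.predict(vec.transform(["cheap meeting"])))
```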

2. Logistic Regression

Logistic regression is a small modification of linear regression: a link function transforms the linear output, in a sense turning the curved into the straight, so that the model outputs a probability between 0 and 1:

\[P(c_n \mid f_1 \dots f_m) = \frac{1}{Z}\exp\left(\sum_{i=0}^{m} w_i f_i\right) \]

Here Z is a normalisation constant. Training works much like a regression model: minimise a cost function to find the weights, optionally adding a regularisation term as a penalty.

Pros: unlike Naive Bayes, not confounded by diverse, correlated features.

Cons: high bias; slow to train; some feature scaling issues; often needs a lot of data to work well; choosing regularisation is a nuisance but important, since overfitting is a big problem.
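A minimal scikit-learn sketch (hypothetical toy data); note the regularisation penalty discussed above, controlled here by `C`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["good service", "bad service", "good food", "bad food"]
y = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(docs)
# penalty="l2" adds the regularisation term from the text; C is the
# *inverse* regularisation strength, so smaller C means a stronger penalty.
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print(clf.predict_proba(X)[:, 1])  # probabilities between 0 and 1
```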

3. Support Vector Machines (SVM)

Main idea: find a hyperplane that separates the training data, then use it to classify the test set; we won't go into detail here.

Pros: fast and accurate linear classifier; can handle non-linearity with the kernel trick; works well with huge feature sets.

Cons: multiclass classification is awkward; feature scaling can be tricky; deals poorly with class imbalance; uninterpretable.
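A minimal sketch with scikit-learn's linear SVM (hypothetical toy topic data); the multiclass awkwardness noted above is handled internally by a one-vs-rest scheme:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["stock prices rise", "team wins final",
        "markets fall sharply", "player scores goal"]
y = ["finance", "sport", "finance", "sport"]

X = TfidfVectorizer().fit_transform(docs)
# LinearSVC is a fast linear SVM, well suited to huge sparse feature sets;
# SVC(kernel="rbf") would add non-linearity via the kernel trick.
clf = LinearSVC().fit(X, y)
print(clf.predict(X))
```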

4. K-Nearest Neighbour (KNN)

Main idea: measure the distance between a new observation and the existing data (e.g., Euclidean or cosine distance), and assign the label of the closest examples by majority vote among the k nearest neighbours.

Pros: simple, effective; no training required; inherently multiclass; optimal with infinite data.

Cons: k must be chosen; issues with unbalanced classes; often slow (the k nearest neighbours must be found); features must be selected carefully.
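A minimal scikit-learn sketch (hypothetical toy weather documents); as the cons note, k must be picked by hand, and cosine distance is a common choice for sparse text vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = ["rain tomorrow", "sunny skies", "heavy rain expected", "clear and sunny"]
y = ["wet", "dry", "wet", "dry"]

X = TfidfVectorizer().fit_transform(docs)
# n_neighbors is the k that must be selected; metric="cosine" uses
# cosine distance, one of the two distances mentioned in the text.
clf = KNeighborsClassifier(n_neighbors=3, metric="cosine").fit(X, y)
print(clf.predict(X))
```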

5. Decision Tree

Main idea: build a tree by splitting on feature values; each leaf node corresponds to a class.

Pros: in theory, very interpretable; fast to build and test; feature representation/scaling is irrelevant; good for small feature sets; handles non-linearly-separable problems.

Cons: in practice, often not that interpretable; highly redundant sub-trees; not competitive for large feature sets.
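A minimal scikit-learn sketch (hypothetical toy spam data); printing the tree shows the feature splits, with class labels at the leaves as described above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

docs = ["free prize inside", "quarterly report attached",
        "claim your free prize", "report for the quarter"]
y = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)
clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
# export_text prints the learned splits; the leaves are the class labels.
print(export_text(clf, feature_names=list(vec.get_feature_names_out())))
```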

6. Random Forest

Main idea: an ensemble made up of many decision trees; the final label is chosen by voting across the trees.

Pros: usually more accurate and more robust than decision trees; a great classifier for small- to moderate-sized feature sets; training is easily parallelised.

Cons: same negatives as decision trees; too slow with large feature sets.
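A minimal scikit-learn sketch (hypothetical toy topic data) showing the two properties discussed above: many voting trees, trained in parallel:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

docs = ["goal scored late", "election results in",
        "penalty shootout drama", "new policy announced"]
y = ["sport", "politics", "sport", "politics"]

X = CountVectorizer().fit_transform(docs)
# n_estimators is the number of trees voting on the label;
# n_jobs=-1 parallelises training across all cores, as noted in the pros.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)
print(clf.predict(X))
```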

7. Neural Network


Main idea: connect nodes across multiple layers, with each node passing a weighted combination of the previous layer's outputs to the next layer; we won't go into detail here, but at its core each unit computes a weighted sum, much like linear regression, followed by a non-linearity.

Pros: extremely powerful; state-of-the-art accuracy on many tasks in natural language processing and vision.

Cons: not an off-the-shelf classifier; very difficult to choose good parameters; slow to train; prone to overfitting.
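A minimal sketch with scikit-learn's multi-layer perceptron (hypothetical toy review data); each hidden unit computes a weighted sum of its inputs, as the main idea describes, followed by a non-linearity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

docs = ["love this phone", "battery died fast",
        "amazing camera", "screen cracked easily"]
y = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(docs)
# One hidden layer of 16 units: each unit computes a weighted sum of its
# inputs (as in linear regression) passed through a ReLU non-linearity.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                    random_state=0).fit(X, y)
print(clf.predict(X))
```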

Hyperparameter Tuning

After training on the training set, we can tune hyperparameters on a validation set. Common tuning techniques include k-fold cross-validation and grid search.
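A minimal sketch combining both techniques via scikit-learn's GridSearchCV (hypothetical toy data): every candidate value of the regularisation strength C is scored by 3-fold cross-validation, and the best is kept:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

docs = ["great", "awful", "superb", "dreadful", "fantastic", "horrible"]
y = [1, 0, 1, 0, 1, 0]
X = TfidfVectorizer().fit_transform(docs)

# Grid search over the regularisation strength C, with each candidate
# scored by 3-fold cross-validation on the training data.
grid = GridSearchCV(LogisticRegression(),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```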

Evaluation

Common evaluation metrics:

  1. Accuracy = correct / total

  2. Precision = tp / (tp + fp)

  3. Recall = tp / (tp + fn)

  4. F1-score = 2 * precision * recall / (precision + recall)

There are also the macro F-score and micro F-score for averaging over multiple classes.
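A minimal sketch computing all four metrics with scikit-learn (the gold labels and predictions are hypothetical):

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical gold labels and model predictions.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(accuracy_score(y_true, y_pred))   # correct / total
print(precision_score(y_true, y_pred))  # tp / (tp + fp)
print(recall_score(y_true, y_pred))     # tp / (tp + fn)
print(f1_score(y_true, y_pred))         # 2PR / (P + R)
# For multiclass: average="macro" weights every class equally,
# while average="micro" pools all individual decisions.
```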

