What is Text Classification
Text classification is a very common family of NLP tasks: the input is usually a piece of text, and the output is a predicted class label. Typical text classification tasks include topic classification, sentiment analysis, authorship attribution, and fake-content detection; many other problems can also be recast as classification after some reformulation.
Typical Steps
- Choose a task of interest
- Collect a suitable dataset
- Annotate the data
- Select features
- Choose a machine learning algorithm
- Tune hyperparameters on a validation set
- Try several algorithms and parameter settings
- Train the final model
- Evaluate on the test set (a minimal end-to-end sketch of these steps follows below)
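Purely as an illustration, here is a minimal end-to-end sketch of the workflow above. It assumes scikit-learn, a tiny made-up sentiment dataset, and a TF-IDF plus logistic regression pipeline; none of these choices come from the notes themselves.

```python
# Minimal end-to-end text classification sketch (toy data, scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny hypothetical dataset: texts and their sentiment labels.
texts = ["great movie, loved it", "terrible plot and acting",
         "wonderful performance", "boring and far too long",
         "an absolute delight", "a complete waste of time"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Hold out a test set; a real project would also keep a validation set for tuning.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0)

# Feature extraction (TF-IDF bag-of-words) plus a classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluate the final model on the held-out test set.
print(accuracy_score(y_test, model.predict(X_test)))
```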
Machine Learning Algorithms
Below is a brief introduction to several basic machine learning algorithms.
1. Naive Bayes
Assumes the features are conditionally independent of each other and applies Bayes' rule to find the most probable class.
Pros: fast to "train" and classify; robust, low-variance; good for low-data situations; the optimal classifier if the independence assumption is correct; extremely simple to implement.
Cons: the independence assumption rarely holds; low accuracy compared to similar methods in most situations; smoothing is required for unseen class/feature combinations.
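A minimal sketch of a Naive Bayes text classifier, assuming scikit-learn and made-up spam/ham texts for illustration; the `alpha` parameter is the smoothing mentioned above.

```python
# Multinomial Naive Bayes over bag-of-words counts (toy illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["cheap pills buy now", "meeting at noon tomorrow",
         "win a free prize now", "lunch with the project team"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)  # word-count features

# alpha=1.0 is Laplace (add-one) smoothing for unseen class/feature combinations.
clf = MultinomialNB(alpha=1.0).fit(X, labels)

print(clf.predict(vectorizer.transform(["free pills now"])))
```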
2. Logistic Regression
Logistic regression is linear regression with a small modification: a link function (the sigmoid) squashes the linear score into a probability between 0 and 1.
Training works much like a regression model: the weights are learned by minimising a cost function, and a regularisation term can be added as a penalty.
Pros: unlike Naive Bayes, it is not confounded by diverse, correlated features.
Cons: high bias; slow to train; some feature scaling issues; often needs a lot of data to work well; choosing the regularisation is a nuisance but important, since overfitting is a big problem.
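A minimal sketch of logistic regression with an L2 penalty, again assuming scikit-learn and toy data; the parameter `C` (inverse regularisation strength) is an illustrative choice, not something from the notes.

```python
# Logistic regression with L2 regularisation (toy illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["loved every minute", "hated every minute",
         "a joy to watch", "painful to sit through"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# C is the inverse regularisation strength: smaller C means a stronger penalty.
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, labels)

# predict_proba returns the 0-1 probabilities produced by the sigmoid link.
print(clf.predict_proba(X)[:, 1])
```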
3. Support Vector Machines (SVM)
Main idea: find a hyperplane that separates the training data and use it to classify the test set; not covered in detail here.
Pros: fast and accurate linear classifier; can handle non-linearity with the kernel trick; works well with huge feature sets.
Cons: multiclass classification is awkward; feature scaling can be tricky; deals poorly with class imbalance; uninterpretable.
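A minimal SVM sketch, assuming scikit-learn and toy topic-classification data; `LinearSVC` illustrates the linear classifier and `SVC(kernel="rbf")` illustrates the kernel trick, both as assumed choices.

```python
# Linear SVM (and a kernelised variant) for text classification (toy illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC, LinearSVC

texts = ["stocks fell sharply today", "the team won the final",
         "markets rallied on the news", "a late goal sealed the match"]
labels = ["finance", "sport", "finance", "sport"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# LinearSVC: a fast linear classifier, well suited to huge sparse feature sets.
linear_clf = LinearSVC().fit(X, labels)

# SVC with an RBF kernel demonstrates the kernel trick for non-linear boundaries.
kernel_clf = SVC(kernel="rbf").fit(X, labels)

print(linear_clf.predict(vec.transform(["shares dropped after earnings"])))
```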
4. K-Nearest Neighbours (KNN)
Main idea: measure the distance between a new example and the existing labelled data (e.g. Euclidean or cosine distance) and assign it the majority label of its k nearest neighbours.
Pros: simple, effective; no training required; inherently multiclass; optimal with infinite data.
Cons: k has to be selected; issues with unbalanced classes; often slow (the k nearest neighbours have to be found); features must be selected carefully.
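A minimal KNN sketch, assuming scikit-learn and toy data; k=3 and cosine distance are illustrative choices.

```python
# k-nearest-neighbour classification with cosine distance (toy illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

texts = ["the pitch and the players", "interest rates and inflation",
         "goals, fouls and penalties", "bonds, equities and yields"]
labels = ["sport", "finance", "sport", "finance"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# k must be chosen by hand; cosine distance is a common choice for text vectors.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine").fit(X, labels)

print(knn.predict(vec.transform(["the central bank raised rates"])))
```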
5. Decision Tree
Main idea: build a tree from the feature information; each leaf node corresponds to a class.
Pros: in theory, very interpretable; fast to build and test; feature representation/scaling is irrelevant; good for small feature sets; handles non-linearly-separable problems.
Cons: in practice, often not that interpretable; highly redundant sub-trees; not competitive for large feature sets.
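A minimal decision-tree sketch, assuming scikit-learn and toy data; the depth limit and the printed tree are only for illustration.

```python
# Decision tree over bag-of-words features (toy illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

texts = ["free offer click now", "see you at the meeting",
         "claim your free prize", "agenda for tomorrow's meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

tree = DecisionTreeClassifier(max_depth=3).fit(X, labels)

# Each leaf of the learned tree corresponds to a predicted class.
print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
```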
6. Random Forest
Main idea: an ensemble of many decision trees whose predictions are combined by voting to decide the final label.
Pros: usually more accurate and more robust than decision trees; a great classifier for small- to moderate-sized feature sets; training is easily parallelised.
Cons: same negatives as decision trees; too slow with large feature sets.
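A minimal random-forest sketch, assuming scikit-learn and toy data; the number of trees and `n_jobs=-1` (parallel training) are illustrative settings.

```python
# Random forest: many decision trees combined by voting (toy illustration).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

texts = ["free offer click now", "see you at the meeting",
         "claim your free prize", "agenda for tomorrow's meeting"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

# 100 trees are built on bootstrap samples; n_jobs=-1 trains them in parallel.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, labels)

print(forest.predict(vec.transform(["free prize tomorrow"])))
```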
7. Neural Network
Main idea: stack layers of interconnected nodes; each node combines the weighted outputs of the previous layer and passes the result on to the next layer. Not covered in detail here; each layer is essentially a linear model (much like logistic regression) followed by a non-linear activation.
Pros: extremely powerful; state-of-the-art accuracy on many tasks in natural language processing and vision.
Cons: not an off-the-shelf classifier; very difficult to choose good parameters; slow to train; prone to overfitting.
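A minimal feed-forward neural network sketch, assuming scikit-learn's MLPClassifier and toy data; the hidden-layer size, activation and iteration count are illustrative, not recommendations.

```python
# A small feed-forward neural network (multi-layer perceptron), toy illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

texts = ["loved it", "hated it", "really enjoyable", "truly awful"]
labels = ["pos", "neg", "pos", "neg"]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Each hidden layer applies a linear transformation followed by a non-linear activation.
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="relu",
                    max_iter=500, random_state=0)
mlp.fit(X, labels)

print(mlp.predict(vec.transform(["quite enjoyable"])))
```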
Hyperparameter Tuning
After training on the training set, the validation set is used to tune hyperparameters. Common approaches include k-fold cross-validation and grid search.
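A minimal sketch combining the two, assuming scikit-learn; the pipeline, the candidate parameter grid and cv=3 are illustrative choices.

```python
# Grid search over hyperparameters with k-fold cross-validation (toy illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["great film", "awful film", "really great acting",
         "really awful acting", "loved the story", "hated the story"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# Candidate settings; every combination is scored with 3-fold cross-validation.
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)],
              "clf__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```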
Evaluation
Common evaluation metrics:
- Accuracy = correct predictions / total predictions
- Precision = tp / (tp + fp)
- Recall = tp / (tp + fn)
- F1-score = 2 * precision * recall / (precision + recall)
There are also the macro F-score (the average of the per-class F-scores) and the micro F-score (computed from the pooled tp, fp and fn counts across classes).
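A minimal sketch of computing these metrics, assuming scikit-learn and made-up predictions.

```python
# Accuracy, precision, recall and F1, plus macro/micro averages (toy illustration).
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = ["pos", "neg", "pos", "pos", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "neg", "neg", "neg"]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, pos_label="pos"))
print("recall   :", recall_score(y_true, y_pred, pos_label="pos"))
print("f1       :", f1_score(y_true, y_pred, pos_label="pos"))

# Multi-class style averaging: macro averages the per-class scores,
# micro pools the tp/fp/fn counts across classes first.
print("macro f1 :", f1_score(y_true, y_pred, average="macro"))
print("micro f1 :", f1_score(y_true, y_pred, average="micro"))
```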