Multi-Label Classification


1. Algorithms

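Multi-label classification comes up often in practice; for example, a playlist may carry both the tag "travel" and the tag "driving". Unlike multi-class classification, the labels in multi-label classification are not mutually exclusive. Multi-label algorithms fall roughly into two families:

  • combine several binary base classifiers using One-vs-Rest (or another scheme);
  • adapt a classical single-label classifier, e.g. AdaBoost.MH and ML-KNN.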

One-vs-Rest

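Basic idea: build one binary classifier for each label \(y_i\). Its positive examples are the instances that carry label \(y_i\), and its negative examples are the instances that do not. The outputs of the N binary classifiers are then combined into an N-dimensional vector, which can be read as the scores over the labels. I implemented a Spark version, MultiLabelOneVsRest; the source code is in mllibX.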
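To make the One-vs-Rest construction concrete, here is a minimal pure-Python sketch. It is not the MultiLabelOneVsRest code from mllibX. The tiny gradient-descent logistic regression, the function names, and the toy playlist data are all assumptions made for illustration; any binary classifier could serve as the base learner.

```python
import math

def train_binary_lr(X, y, epochs=500, lr=0.5):
    """Fit a tiny logistic regression by batch gradient descent (illustrative base learner)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_score(model, x):
    """Probability-like score of one binary model for one instance."""
    w, b = model
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fit_one_vs_rest(X, label_sets, labels):
    """One binary model per label: positives carry the label, negatives do not."""
    return {
        label: train_binary_lr(X, [1.0 if label in ys else 0.0 for ys in label_sets])
        for label in labels
    }

def score_one_vs_rest(models, labels, x):
    """Combine the N binary outputs into an N-dimensional score vector."""
    return [predict_score(models[label], x) for label in labels]

# Hypothetical toy data: two features, two labels.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
label_sets = [{"travel"}, {"travel"}, {"driving"}, {"driving"}, {"travel", "driving"}]
labels = ["travel", "driving"]

models = fit_one_vs_rest(X, label_sets, labels)
scores = score_one_vs_rest(models, labels, [1.0, 1.0])
predicted = {label for label, s in zip(labels, scores) if s >= 0.5}
print(scores, predicted)
```

Thresholding each score at 0.5 is only one possible decision rule; because the labels are not mutually exclusive, an instance may end up with several labels or none.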

AdaBoost.MH

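AdaBoost.MH was proposed by Schapire (co-author of AdaBoost) and Singer [1]. The basic idea resembles AdaBoost: adaptively adjust a weight distribution over instance-label pairs. Given training samples \(\langle (x_1, Y_1), \cdots, (x_m, Y_m) \rangle\), where each instance \(x_i \in \mathcal{X}\) has a label set \(Y_i \subseteq \mathcal{Y}\), the algorithm is described with the following notation.

\(D_t(i, \ell)\) is the weight of label \(\ell\) for instance \(x_i\) at iteration \(t\). \(Y[\ell]\) indicates whether label \(\ell\) belongs to instance \((x, Y)\): it is +1 if it does and -1 otherwise (this sign is what makes the weight of a misclassified instance-label pair increase); that is,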

\[Y[\ell] = \begin{cases} +1 & \ell \in Y \\ -1 & \ell \notin Y \end{cases} \]

\(Z_t\) is the normalization factor of each iteration; it ensures that all entries of the weight matrix \(D\) sum to 1:

\[Z_t = \sum_{i=1}^{m} \sum_{\ell \in \mathcal{Y}} D_{t}(i, \ell) \exp \bigl( -\alpha_{t} Y_i[\ell] h_t(x_i, \ell) \bigr) \]
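The following sketch shows one reweighting round built from the quantities above: each weight \(D_t(i, \ell)\) is multiplied by \(\exp(-\alpha_t Y_i[\ell] h_t(x_i, \ell))\) and then divided by \(Z_t\). The function name and the toy matrices are hypothetical. The choice \(\alpha_t = \frac{1}{2} \ln \frac{1 + r_t}{1 - r_t}\) with \(r_t = \sum_{i,\ell} D_t(i,\ell) Y_i[\ell] h_t(x_i,\ell)\) is the usual one for weak hypotheses with outputs in {-1, +1}; it is not stated in this post, so treat it as an assumption.

```python
import math

def adaboost_mh_round(D, Y, H, alpha):
    """One AdaBoost.MH reweighting step.

    D[i][l]: current weight of (instance i, label l); all entries sum to 1
    Y[i][l]: +1 if label l belongs to instance i, otherwise -1
    H[i][l]: weak hypothesis output h_t(x_i, l)
    Returns the renormalized weights and the normalization factor Z_t.
    """
    unnormalized = [
        [D[i][l] * math.exp(-alpha * Y[i][l] * H[i][l]) for l in range(len(D[i]))]
        for i in range(len(D))
    ]
    Z = sum(sum(row) for row in unnormalized)
    return [[w / Z for w in row] for row in unnormalized], Z

# Toy setup (hypothetical): 2 instances, 2 labels, uniform initial weights.
D = [[0.25, 0.25], [0.25, 0.25]]
Y = [[+1, -1], [-1, +1]]
H = [[+1, -1], [+1, +1]]  # the weak hypothesis is wrong only on (instance 1, label 0)

# Assumed alpha for {-1, +1}-valued weak hypotheses.
r = sum(D[i][l] * Y[i][l] * H[i][l] for i in range(2) for l in range(2))
alpha = 0.5 * math.log((1 + r) / (1 - r))

D_next, Z = adaboost_mh_round(D, Y, H, alpha)
print(D_next, Z)
```

In this toy run the single misclassified pair ends up with half of the total weight, so the next weak hypothesis is pushed to get that pair right.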

ML-KNN

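ML-KNN (multi-label K nearest neighbor) is based on the KNN algorithm. Given the label information of the K nearest neighbors, it uses maximum a posteriori (MAP) estimation to decide whether instance \(t\) should be given label \(\ell\):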

\[y_t(\ell) = \mathop{ \arg \max}_{b \in \{0,1\}} P(H_b^{\ell} | E_{C_t(\ell)}^{\ell} ) \]

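Here \(H_0^{\ell}\) denotes the event that instance \(t\) should not be given label \(\ell\), and \(H_1^{\ell}\) the event that it should; \(E_{C_t(\ell)}^{\ell}\) denotes the event that exactly \(C_t(\ell)\) of the K nearest neighbors of \(t\) carry label \(\ell\). By Bayes' theorem the expression above can be rewritten as: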

\[y_t(\ell) = \mathop{ \arg \max}_{b \in \{0,1\}} P(H_b^{\ell}) P(E_{C_t(\ell)}^{\ell} | H_b^{\ell} ) \]

See paper [2] for how these two terms are computed.
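As a rough illustration of the MAP rule, the sketch below estimates the prior and the likelihood from training counts with Laplace smoothing, in the spirit of [2], and then compares the two products. The function name, the argument names, the smoothing parameter `s`, and the toy counts are assumptions; check [2] for the exact estimates.

```python
def mlknn_decide(n_with_label, n_train, c_pos, c_neg, neighbor_count, s=1.0):
    """MAP decision for one label.

    n_with_label:   number of training instances that carry the label
    n_train:        total number of training instances
    c_pos[j]:       number of training instances WITH the label whose K nearest
                    neighbors contain exactly j instances carrying the label
    c_neg[j]:       the same count for training instances WITHOUT the label
    neighbor_count: C_t(l), how many of the test instance's K neighbors carry the label
    s:              Laplace smoothing parameter (assumed, 1.0 by default)
    Returns (decision, score for b=1, score for b=0).
    """
    k = len(c_pos) - 1  # counts are indexed 0..K
    prior_1 = (s + n_with_label) / (2 * s + n_train)
    prior_0 = 1.0 - prior_1
    like_1 = (s + c_pos[neighbor_count]) / (s * (k + 1) + sum(c_pos))
    like_0 = (s + c_neg[neighbor_count]) / (s * (k + 1) + sum(c_neg))
    score_1 = prior_1 * like_1  # P(H_1) * P(E | H_1)
    score_0 = prior_0 * like_0  # P(H_0) * P(E | H_0)
    return (1 if score_1 > score_0 else 0), score_1, score_0

# Toy counts (hypothetical): K = 3, 10 training instances, 4 of them carry the label.
c_pos = [0, 1, 1, 2]
c_neg = [3, 2, 1, 0]
print(mlknn_decide(4, 10, c_pos, c_neg, neighbor_count=3))
print(mlknn_decide(4, 10, c_pos, c_neg, neighbor_count=0))
```

With these counts, a test instance whose three neighbors all carry the label gets the label, while one with no such neighbor does not.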

2. Experiments

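A Spark implementation of AdaBoost.MH is available in sparkboost, and scikit-multilearn implements ML-KNN. I ran a comparison of several algorithms on the siam-competition2007 dataset; the results are as follows: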

Algorithm     Hamming loss   Precision   Recall   F1 measure
LR+OvR        0.0569         0.6252      0.5586   0.5563
AdaBoost.MH   0.0587         0.6280      0.6082   0.5837
ML-KNN        0.0652         0.6204      0.6535   0.5977
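For readers unfamiliar with these metrics, here is a small pure-Python sketch of Hamming loss and of sample-averaged precision, recall and F1 (the averaging mode that the scikit-learn call in the code for this post requests with average='samples'). The helper names and the toy matrices are hypothetical, and undefined ratios are simply treated as 0.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (instance, label) cells that are predicted wrongly."""
    n_cells = len(y_true) * len(y_true[0])
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / n_cells

def sample_averaged_prf(y_true, y_pred):
    """Compute precision, recall and F1 per instance, then average over instances."""
    ps, rs, fs = [], [], []
    for yt, yp in zip(y_true, y_pred):
        tp = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
        n_pred, n_true = sum(yp), sum(yt)
        p = tp / n_pred if n_pred else 0.0
        r = tp / n_true if n_true else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p)
        rs.append(r)
        fs.append(f)
    n = len(y_true)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

# Toy binary indicator matrices (hypothetical): 2 instances, 4 labels.
y_true = [[1, 1, 0, 0], [1, 0, 0, 0]]
y_pred = [[1, 0, 0, 0], [1, 1, 1, 0]]

loss = hamming_loss(y_true, y_pred)
precision, recall, f1 = sample_averaged_prf(y_true, y_pred)
print(loss, precision, recall, f1)
```

Because F1 is computed per instance and only then averaged, it can fall below both the averaged precision and the averaged recall, which is presumably why the LR+OvR row shows an F1 lower than either.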

In addition, Mulan provides many datasets, and Kaggle has hosted a multi-label classification competition, WISE 2014.

Part of the experiment code is shown below:

import numpy as np
from sklearn import metrics
from sklearn.datasets import load_svmlight_file
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

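# load the train/test sets in svmlight format
X_train, y_train = load_svmlight_file('tmc2007_train.svm', dtype=np.float64, multilabel=True)
# pass n_features so the test matrix has the same number of columns as the training matrix
X_test, y_test = load_svmlight_file('tmc2007_test.svm', dtype=np.float64, multilabel=True,
                                    n_features=X_train.shape[1])

# convert the label tuples to a binary indicator matrix
mb = MultiLabelBinarizer()
y_train = mb.fit_transform(y_train)
# reuse the mapping fitted on the training labels; refitting on the test set
# could reorder or drop columns
y_test = mb.transform(y_test)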

# LR + OvR
clf = OneVsRestClassifier(LogisticRegression(), n_jobs=10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# multilabel classification metrics
loss = metrics.hamming_loss(y_test, y_pred)
# precision, recall and F1 are computed per sample and then averaged
prf = metrics.precision_recall_fscore_support(y_test, y_pred, average='samples')


# ML-KNN for multilabel classification (scikit-multilearn)
from skmultilearn.adapt import MLkNN

clf = MLkNN(k=15)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

// AdaBoost.MH for multilabel classification (Scala, using sparkboost)
val labels0Based = true
val binaryProblem = false

val learner = new AdaBoostMHLearner(sc)
learner.setNumIterations(params.numIterations) // 500 iter
learner.setNumDocumentsPartitions(params.numDocumentsPartitions)
learner.setNumFeaturesPartitions(params.numFeaturesPartitions)
learner.setNumLabelsPartitions(params.numLabelsPartitions)
val classifier = learner.buildModel(params.input, labels0Based, binaryProblem)

val testPath = "./tmc2007_test.svm"
val numRows = DataUtils.getNumRowsFromLibSvmFile(sc, testPath)
val testRdd = DataUtils.loadLibSvmFileFormatDataAsList(sc, testPath, labels0Based, binaryProblem, 0, numRows, -1)
val results = classifier.classifyWithResults(sc, testRdd, 20)

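// NOTE: predLabels and goldLabels are not defined in this snippet; they are assumed
// to be the predicted and gold label arrays extracted from `results`.
// MultilabelMetrics is org.apache.spark.mllib.evaluation.MultilabelMetrics.
val predAndLabels = sc.parallelize(predLabels.zip(goldLabels)
  .map { case (pred, gold) => (pred.map(_.toDouble), gold.map(_.toDouble)) })
val metrics = new MultilabelMetrics(predAndLabels)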

3. References

[1] Schapire, Robert E., and Yoram Singer. "BoosTexter: A boosting-based system for text categorization." Machine learning 39.2-3 (2000): 135-168.
[2] Zhang, Min-Ling, and Zhi-Hua Zhou. "ML-KNN: A lazy learning approach to multi-label learning." Pattern recognition 40.7 (2007): 2038-2048.
[3] "Building a recommendation engine on PredictionIO, and an exploration of large-scale multi-label classification" (in Chinese).

