Implementing Multiclass and Multilabel Classification Algorithms with scikit-learn

Reposted from the original article. 2018-08-17 11:14. Category: Data Analysis

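Multilabel classification format

In a multilabel classification problem, a sample may belong to several classes at once; a news article, for example, can cover several topics. In this case the dependent variable $y$ has to be expressed as a binary indicator matrix, with one row per sample and one column per label.

Multiclass classification, by contrast, means that $y$ can take more than two possible values but each sample belongs to exactly one class, which makes it strictly different from multilabel classification. All scikit-learn classifiers support multiclass classification out of the box. When you need to modify an algorithm yourself, however, scikit-learn can also handle the preliminary data preparation for multiclass classification.

For multiclass or multilabel problems there are two strategies for building the classifier: One-vs-All and One-vs-One. The examples below show how to implement both.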

    # convert lists of label sets into a binary indicator matrix
    from sklearn.preprocessing import MultiLabelBinarizer

    y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]
    MultiLabelBinarizer().fit_transform(y)

    array([[0, 0, 1, 1, 1],
           [0, 0, 1, 0, 0],
           [1, 1, 0, 1, 0],
           [1, 1, 1, 1, 1],
           [1, 1, 1, 0, 0]])
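To make the indicator-matrix format concrete, here is a minimal pure-Python sketch of the same transformation. The helper name `binarize` is hypothetical; it only illustrates the idea and does not claim to reproduce how `MultiLabelBinarizer` is implemented internally.

```python
def binarize(label_sets):
    """Turn a list of label sets into (sorted classes, 0/1 indicator matrix)."""
    # Collect every distinct label and fix a column order
    classes = sorted({label for labels in label_sets for label in labels})
    column = {label: j for j, label in enumerate(classes)}

    matrix = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1  # mark each label the sample carries
        matrix.append(row)
    return classes, matrix


y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]
classes, Y = binarize(y)
for row in Y:
    print(row)
```

Each row corresponds to one sample and each column to one label, so the first row `[0, 0, 1, 1, 1]` encodes the label set `[2, 3, 4]`.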

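One-vs-the-Rest strategy

This strategy is also known as One-vs-All. It constructs K discriminants (K being the number of classes), where the $i$-th discriminant decides whether a sample belongs to class $i$ or not.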
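As a rough illustration of this decomposition, the sketch below turns a multiclass target into K binary targets and then picks the class whose discriminant gives the highest score. The function names and the example scores are assumptions made for illustration. Taking the highest score is a common decision rule, and no real classifier is trained here.

```python
def one_vs_rest_targets(y):
    """Build one binary target vector per class: 1 for that class, 0 for the rest."""
    classes = sorted(set(y))
    return {c: [1 if label == c else 0 for label in y] for c in classes}


def predict_from_scores(scores):
    """Return the class whose discriminant produced the highest score."""
    return max(scores, key=scores.get)


y = [0, 1, 2, 1]
binary_targets = one_vs_rest_targets(y)  # K = 3 binary problems

# Hypothetical decision values from the three discriminants for one sample
scores = {0: -1.2, 1: 0.3, 2: -0.1}
predicted = predict_from_scores(scores)
print(binary_targets, predicted)
```

With K classes this needs only K binary classifiers, which is why the strategy scales reasonably to problems with many labels.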

Multiclass learning

    from sklearn import datasets
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    iris = datasets.load_iris()
    X, y = iris.data, iris.target
    OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

Multilabel learning

Kaggle hosts a competition on multilabel classification: Multi-label classification of printed media articles to topics.

The competition is introduced as follows:

This is a multi-label classification competition for articles coming from Greek printed media. Raw data comes from the scanning of print media, article segmentation, and optical character segmentation, and therefore is quite noisy. Each article is examined by a human annotator and categorized to one or more of the topics being monitored. Topics range from specific persons, products, and companies that can be easily categorized based on keywords, to more general semantic concepts, such as environment or economy. Building multi-label classifiers for the automated annotation of articles into topics can support the work of human annotators by suggesting a list of all topics by order of relevance, or even automate the annotation process for media and/or categories that are easier to predict. This saves valuable time and allows a media monitoring company to expand the portfolio of media being monitored.

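We download the corresponding data from that site to use as a case study in multilabel classification.

Data description

This text dataset has already been formalized using the bag-of-words model, with 201561 feature words in total. Each text has one or more labels, and there are 203 class labels overall. The site provides the data in two formats, ARFF and LIBSVM: ARFF is mainly intended for Weka, while the LIBSVM format suits the LIBSVM module in MATLAB. Here we use the LIBSVM-format data.

Each line of the data starts with a comma-separated sequence of integers representing the class labels, followed by \t-separated id:value pairs, where id is the ID of a feature word and value is that word's TF-IDF value in the document.

A line looks like this: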

58,152 833:0.032582 1123:0.003157 1629:0.038548 ...
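To show how such a line breaks down, the following sketch parses it by hand in plain Python. The function name `parse_line` is hypothetical and exists only for illustration. Only the three id:value pairs visible in the sample above are used, since the rest of the line is elided.

```python
def parse_line(line):
    """Split one line into its label list and a {feature_id: value} dict."""
    # str.split() with no argument splits on any whitespace (tabs or spaces)
    label_part, *feature_parts = line.split()
    labels = [int(label) for label in label_part.split(",")]

    features = {}
    for pair in feature_parts:
        feature_id, value = pair.split(":")
        features[int(feature_id)] = float(value)
    return labels, features


line = "58,152 833:0.032582 1123:0.003157 1629:0.038548"
labels, features = parse_line(line)
print(labels)
print(features)
```

Here the document carries the two labels 58 and 152, and every other feature word that is not listed has an implicit value of zero, which is what makes the format compact for sparse bag-of-words data.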

Loading the data

    # load modules
    import os
    import sys
    import numpy as np
    from sklearn.datasets import load_svmlight_file
    from sklearn.preprocessing import LabelBinarizer
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn import metrics

    # set working directory
    os.chdir("D:\\my_python_workfile\\Thesis\\kaggle_multilabel_classification")

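    # read files
    X_train, y_train = load_svmlight_file(
        "./data/wise2014-train.libsvm", dtype=np.float64, multilabel=True
    )
    # pass n_features so the test matrix gets the same number of columns as the training matrix
    X_test, y_test = load_svmlight_file(
        "./data/wise2014-test.libsvm", dtype=np.float64, multilabel=True,
        n_features=X_train.shape[1],
    )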

Model fitting and prediction

    # transform y into a matrix
    mb = MultiLabelBinarizer()
    y_train = mb.fit_transform(y_train)

    # fit the model and predict
    clf = OneVsRestClassifier(LogisticRegression(), n_jobs=-1)
    clf.fit(X_train, y_train)
    pred_y = clf.predict(X_test)

Model evaluation

Because the true labels of the test set are not available, we look at the predictions on the training set instead.

    # training set result
    y_predicted = clf.predict(X_train)

    # report
    # print(metrics.classification_report(y_train, y_predicted))

    # fraction of matching cells in the indicator matrix (element-wise accuracy)
    np.mean(y_predicted == y_train)

0.99604661023482433
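The figure above is computed cell by cell over the indicator matrix, so the many zeros of a sparse label matrix count as correct predictions. The sketch below, on made-up toy data, contrasts this element-wise score with the stricter exact-match ratio, where a sample only counts if its whole label set is predicted correctly. Both function names are hypothetical.

```python
def elementwise_accuracy(y_true, y_pred):
    """Fraction of matrix cells on which prediction and truth agree."""
    cells = [
        t == p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    ]
    return sum(cells) / len(cells)


def exact_match_ratio(y_true, y_pred):
    """Fraction of samples whose entire label row is predicted correctly."""
    matches = [row_t == row_p for row_t, row_p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)


# Toy indicator matrices: 3 samples, 4 labels (illustrative data only)
y_true = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1]]
y_pred = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 0]]

cell_score = elementwise_accuracy(y_true, y_pred)  # 10 of 12 cells agree
row_score = exact_match_ratio(y_true, y_pred)      # 1 of 3 rows agree
print(cell_score, row_score)
```

With 203 labels and only a few of them active per article, the element-wise score can be very high even when many label sets are not fully correct, so it is worth reading it alongside a per-sample or per-label metric.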

Saving the results

    # write the output
    first_id = 64858  # first ArticleId written to the output
    with open("pred.csv", "w") as out_file:
        out_file.write("ArticleId,Labels\n")
        for i in range(pred_y.shape[0]):  # range replaces the Python 2 xrange
            labels = mb.classes_[np.where(pred_y[i, :] == 1)[0]].astype("int")
            label = " ".join(map(str, labels))
            if label == "":  # if the label is empty, fall back to label 103
                label = "103"
            out_file.write(str(first_id + i) + "," + label + "\n")

One-vs-One strategy

The One-vs-One strategy builds one discriminant for every pair of classes, so a total of $K (K - 1) / 2$ discriminants are needed.
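The pair count and the final decision can be sketched in a few lines of plain Python. The helper names and the example pairwise outcomes are assumptions for illustration. Majority voting is a common way to combine the pairwise results, and tie-breaking is ignored here.

```python
from collections import Counter
from itertools import combinations


def class_pairs(classes):
    """List every unordered pair of classes, one pairwise discriminant each."""
    return list(combinations(classes, 2))


def vote(pairwise_winners):
    """Pick the class that wins the most pairwise contests."""
    votes = Counter(pairwise_winners.values())
    return votes.most_common(1)[0][0]


pairs = class_pairs([0, 1, 2])               # K = 3 gives 3 pairs
n_pairs_203 = len(class_pairs(range(203)))   # K = 203 gives 203 * 202 / 2 pairs

# Hypothetical outcome of each pairwise discriminant for one sample
winners = {(0, 1): 1, (0, 2): 2, (1, 2): 1}
predicted = vote(winners)
print(pairs, n_pairs_203, predicted)
```

The quadratic growth in the number of classifiers is the main cost of this strategy: three classes need only 3 discriminants, but 203 classes would need 20503.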

Multiclass learning

    from sklearn import datasets
    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.svm import LinearSVC

    iris = datasets.load_iris()
    X, y = iris.data, iris.target
    OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y).predict(X)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

References

http://yphuang.github.io/blog/2016/04/22/Multiclass-and-Multilabel-algorithms-Implementation-in-sklearn/
