Multiclass classification with sklearn

This article is a repost. 2018-02-05 14:13. Tags: python, machine learning, sklearn, multiclass, regression


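sklearn: multiclass vs. multilabel, one-vs-rest vs. one-vs-one

Classification over more than two classes comes in two kinds: multiclass classification and multilabel classification. In multiclass classification the task contains more than one class, and each sample belongs to exactly one of them, never to several. In multilabel classification each sample may carry more than one class label; a news article, for example, can be filed under several sections.

Whether the task is multiclass or multilabel, there are two strategies for building the classifier: one-vs-the-rest (one-vs-all) and one-vs-one. Both were also mentioned in an earlier introduction to SVMs (http://blog.sina.com.cn/s/blog_7103b28a0102w07f.html).

In the one-vs-all strategy, given n classes, n binary classifiers are built, each separating one class from all the remaining classes. At prediction time all n binary classifiers are applied, each yielding the probability that the sample belongs to its class, and the class with the highest probability is chosen as the final prediction.

In the one-vs-one strategy, again with n classes, a binary classifier is built for every pair of classes, giving k=n*(n-1)/2 classifiers. To classify a new sample, each of the k classifiers is applied in turn. Every result counts as one vote for the class it predicts. After all k classifiers have voted, the class with the most votes is the final result.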

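In scikit-learn, the two strategies are implemented by sklearn.multiclass.OneVsRestClassifier and sklearn.multiclass.OneVsOneClassifier; both must be told which binary classifier to use. For multilabel classification, the training labels Y should be a matrix whose element [i,j] indicates whether label j is present in sample i. For example, np.array([[1, 0, 0], [0, 1, 1], [0, 0, 0]]) says that the first sample has label 0, the second sample has labels 1 and 2, and the third sample has no label. Sometimes the training labels Y are not in this format but look like [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]], where each entry lists the class indices of one sample. Such a Y has to be converted into matrix form, which sklearn.preprocessing.MultiLabelBinarizer does.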
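The conversion described above can be sketched as follows. This is a minimal illustration using the label lists quoted in the paragraph; the variable names are chosen here for the example and are not from the original post.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# One list of class indices per sample, as quoted in the text above.
y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)  # indicator matrix of shape (n_samples, n_classes)

print(mlb.classes_)  # the class index that each column stands for
print(Y)             # Y[i, j] == 1 means sample i carries label mlb.classes_[j]
```

Row 1 of Y, for instance, has a single 1 in the column for class 2, matching the second label list [2].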

A one-vs-all multiclass example follows:
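The original example did not survive in this copy. As a stand-in, here is a minimal sketch that assumes the iris dataset and a LinearSVC base estimator; both are choices made for this illustration, not necessarily what the original author used.

```python
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = datasets.load_iris(return_X_y=True)  # 150 samples, 3 classes

# Wrap a binary classifier; one copy is trained per class (class vs. rest).
ovr = OneVsRestClassifier(LinearSVC(random_state=0, max_iter=10000))
ovr.fit(X, y)

print(len(ovr.estimators_))  # number of binary classifiers = number of classes
print(ovr.predict(X[:5]))    # predicted class for the first five samples
```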

A one-vs-one multiclass example follows:
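The original example did not survive in this copy either. The sketch below assumes a synthetic 4-class dataset from make_classification and a linear-kernel SVC as the base estimator (both are illustrative choices), so that the n*(n-1)/2 count is visible.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Synthetic data with 4 classes (an assumption made for this illustration).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)

# One binary classifier is trained for every pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="linear"))
ovo.fit(X, y)

print(len(ovo.estimators_))  # 4 * 3 / 2 pairwise classifiers
print(ovo.predict(X[:5]))    # class with the most votes for each sample
```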

https://www.cnblogs.com/taceywong/p/5932682.html

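This example simulates a multilabel document classification problem. The dataset is generated randomly by the following process:

Pick the number of labels: n ~ Poisson(n_labels)
n times, choose a class c: c ~ Multinomial(theta)
Pick the document length: k ~ Poisson(length)
k times, choose a word: w ~ Multinomial(theta_c)

In the process above, rejection sampling is used to make sure that n does not exceed the number of classes (2 here) and that the document length is never zero. Likewise, classes that have already been chosen are rejected. Documents assigned to both classes at the same time are plotted surrounded by two circles.

For visualization, the data is projected onto the first two components found by PCA and by CCA, and classification is performed in that space. A meta-classifier (OneVsRestClassifier) then uses two SVCs with linear kernels to learn a discriminative model for each class. Note that PCA performs unsupervised dimensionality reduction, while CCA performs supervised dimensionality reduction.

Note: in the plots below, "unlabeled samples" does not mean that the labels are unknown (as in semi-supervised learning); these samples simply have no label.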

# coding: utf-8
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA


def plot_hyperplane(clf, min_x, max_x, linestyle, label):
    # Get the separating hyperplane
    w = clf.coef_[0]
    a = -w[0] / w[1]
    xx = np.linspace(min_x - 5, max_x + 5)  # make sure the line is long enough
    yy = a * xx - (clf.intercept_[0]) / w[1]
    plt.plot(xx, yy, linestyle, label=label)


def plot_subfigure(X, Y, subplot, title, transform):
    # Project the data onto two dimensions
    if transform == "pca":
        X = PCA(n_components=2).fit_transform(X)
    elif transform == "cca":
        X = CCA(n_components=2).fit(X, Y).transform(X)
    else:
        raise ValueError

    min_x = np.min(X[:, 0])
    max_x = np.max(X[:, 0])
    min_y = np.min(X[:, 1])
    max_y = np.max(X[:, 1])

    # One linear SVC per class
    classif = OneVsRestClassifier(SVC(kernel='linear'))
    classif.fit(X, Y)

    plt.subplot(2, 2, subplot)
    plt.title(title)

    zero_class = np.where(Y[:, 0])
    one_class = np.where(Y[:, 1])
    plt.scatter(X[:, 0], X[:, 1], s=40, c='gray')
    plt.scatter(X[zero_class, 0], X[zero_class, 1], s=160, edgecolors='b',
                facecolors='none', linewidths=2, label='Class 1')
    plt.scatter(X[one_class, 0], X[one_class, 1], s=80, edgecolors='orange',
                facecolors='none', linewidths=2, label='Class 2')

    plot_hyperplane(classif.estimators_[0], min_x, max_x, 'k--',
                    'Boundary\nfor class 1')
    plot_hyperplane(classif.estimators_[1], min_x, max_x, 'k-.',
                    'Boundary\nfor class 2')
    plt.xticks(())
    plt.yticks(())

    plt.xlim(min_x - .5 * max_x, max_x + .5 * max_x)
    plt.ylim(min_y - .5 * max_y, max_y + .5 * max_y)
    if subplot == 2:
        plt.xlabel('First principal component')
        plt.ylabel('Second principal component')
        plt.legend(loc="upper left")


plt.figure(figsize=(8, 6))

X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                      allow_unlabeled=True,
                                      random_state=1)
plot_subfigure(X, Y, 1, "With unlabeled samples + CCA", "cca")
plot_subfigure(X, Y, 2, "With unlabeled samples + PCA", "pca")

X, Y = make_multilabel_classification(n_classes=2, n_labels=1,
                                      allow_unlabeled=False,
                                      random_state=1)
plot_subfigure(X, Y, 3, "Without unlabeled samples + CCA", "cca")
plot_subfigure(X, Y, 4, "Without unlabeled samples + PCA", "pca")

plt.subplots_adjust(.04, .02, .97, .94, .09, .2)
plt.suptitle("Multilabel classification", size=20)
plt.show()

https://www.cnblogs.com/hapjin/p/6085278.html

# Multiclass logistic regression

import pandas as pd

df = pd.read_csv("logistic_data/train.tsv", header=0, delimiter='\t')
print(df.count())
print(df.head())
print(df.Phrase.head(10))
print(df.Sentiment.describe())
print(df.Sentiment.value_counts())
print(df.Sentiment.value_counts() / df.Sentiment.count())

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression()),
])
parameters = {
    'vect__max_df': (0.25, 0.5),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10),
}

df = pd.read_csv("logistic_data/train.tsv", header=0, delimiter='\t')
X, y = df.Phrase, df.Sentiment.values
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)

# cv=3 reproduces the 3-fold search reported in the output
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1,
                           scoring="accuracy", cv=3)
grid_search.fit(X_train, y_train)

print('Best score: %0.3f' % grid_search.best_score_)
print('Best parameter set:')
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print('\t%s:%r' % (param_name, best_parameters[param_name]))

Output:

Fitting 3 folds for each of 24 candidates, totalling 72 fits

[Parallel(n_jobs=-1)]: Done 46 tasks | elapsed: 2.0min
[Parallel(n_jobs=-1)]: Done 72 out of 72 | elapsed: 4.5min finished

Best score: 0.619
Best parameter set:
	clf__C:10
	vect__max_df:0.25
	vect__ngram_range:(1, 2)
	vect__use_idf:False

## Evaluating multiclass classification

predictions = grid_search.predict(X_test)
print('Accuracy', accuracy_score(y_test, predictions))
print('Confusion matrix', confusion_matrix(y_test, predictions))
print('Classification report', classification_report(y_test, predictions))

Output:

Accuracy 0.636614122773
Confusion matrix [[ 1133  1712   595    67     1]
 [  919  6136  6006   553    35]
 [  213  3212 32637  3634   138]
 [   22   420  6548  8155  1274]
 [    4    45   546  2411  1614]]
Classification report              precision    recall  f1-score   support

          0       0.49      0.32      0.39      3508
          1       0.53      0.45      0.49     13649
          2       0.70      0.82      0.76     39834
          3       0.55      0.50      0.52     16419
          4       0.53      0.35      0.42      4620

avg / total       0.62      0.64      0.62     78030
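The numbers in the report can be recovered from the confusion matrix alone. The sketch below copies the matrix from the output above (rows are true classes, columns are predicted classes, as in sklearn's confusion_matrix) and recomputes accuracy, per-class recall and per-class precision with NumPy.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([
    [1133, 1712,   595,   67,    1],
    [ 919, 6136,  6006,  553,   35],
    [ 213, 3212, 32637, 3634,  138],
    [  22,  420,  6548, 8155, 1274],
    [   4,   45,   546, 2411, 1614],
])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)      # per class: correct / true count
precision = np.diag(cm) / cm.sum(axis=0)   # per class: correct / predicted count
support = cm.sum(axis=1)                   # number of true samples per class

print(accuracy)
print(np.round(recall, 2))
print(np.round(precision, 2))
print(support)
```

The row sums reproduce the support column of the report, and the diagonal divided by the total reproduces the reported accuracy.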

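1.11 Multiclass and multilabel classification

Package: sklearn.multiclass

OneVsRestClassifier: one-vs-rest strategy for multiclass (and multilabel) classification
OneVsOneClassifier: one-vs-one strategy for multiclass classification
OutputCodeClassifier: each class is represented by a binary code
Example code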

# coding=utf-8
import numpy as np
from numpy import random
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X = np.arange(15).reshape(5, 3)
y = np.arange(5)            # multiclass target: one class per sample
Y_1 = np.arange(5)
random.shuffle(Y_1)
Y_2 = np.arange(5)
random.shuffle(Y_2)
Y = np.c_[Y_1, Y_2]         # multilabel target: two labels per sample


def multiclassSVM():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    model = OneVsRestClassifier(SVC())
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    print(predicted)


def multilabelSVM():
    # Binarize the label lists into an indicator matrix first
    Y_enc = MultiLabelBinarizer().fit_transform(Y)
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y_enc, test_size=0.2, random_state=0)
    model = OneVsRestClassifier(SVC())
    model.fit(X_train, Y_train)
    predicted = model.predict(X_test)
    print(predicted)


if __name__ == '__main__':
    multiclassSVM()
    # multilabelSVM()

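The code above tests an SVM wrapped in OneVsRestClassifier on a multiclass task and on a multilabel task. Note in particular that in the multilabel case the target must be binarized, so it has to be passed through MultiLabelBinarizer() first.

2 Specific models

2.1 Naive Bayes

Package: sklearn.naive_bayes

[Figure: naive Bayes]

Naive Bayes classifies quickly, though its accuracy is not necessarily the best.

GaussianNB: naive Bayes with a Gaussian distribution
MultinomialNB: naive Bayes with a multinomial distribution
BernoulliNB: naive Bayes with a Bernoulli distribution

Which distribution a naive Bayes variant uses refers to the distribution that P(x_i|y) is assumed to follow. For example, it can be assumed to be Gaussian, and the parameters of that Gaussian are then estimated by maximum likelihood.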

[Figures: formulas for the Gaussian, multinomial and Bernoulli distributions]
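To make the maximum-likelihood remark concrete, here is a minimal sketch on synthetic data (the data and variable names are assumptions made for this illustration). It fits GaussianNB and compares the per-class means it stores in theta_ with the plain sample means, which are the maximum-likelihood estimates of a Gaussian mean.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two synthetic classes, 50 samples each, 2 features (illustrative data).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X, y)

# theta_[c, i] is the estimated mean of feature i within class c.
print(clf.theta_)
print(X[y == 0].mean(axis=0), X[y == 1].mean(axis=0))
```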

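3 Extending scikit-learn

3.0 Overview

A custom extension usually inherits from the classes in the sklearn.base package.

BaseEstimator: base class for estimators
ClassifierMixin: mixin class for classifiers
ClusterMixin: mixin class for clusterers
RegressorMixin: mixin class for regressors
TransformerMixin: mixin class for transformers

For what a Mixin is, the original post points to a Zhihu link. Put simply, a mixin is an interface that comes with implemented methods, and it can be seen as one way of implementing composition. For example, the commonly used TfidfTransformer inherits from BaseEstimator and TransformerMixin, so its basic functionality is the combination of a single-responsibility estimator and a single-responsibility transformer.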

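3.1 Creating your own transformer

During feature extraction you will often find that some preprocessing step you need is not implemented in sklearn, yet modifying the data directly tends to make the code messy and the experiment hard to reproduce. In that case it is best to write your own transformer and later place it in a pipeline, so that everything is managed in one place.
Take the example from the book "Learning Data Mining with Python": we want to accept a numpy array and discretize it by its mean, replacing every feature value above the mean with 1 and every value less than or equal to the mean with 0.
Implementation: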

import numpy as np
from sklearn.base import TransformerMixin
from sklearn.utils import as_float_array


class MeanDiscrete(TransformerMixin):
    # Compute the mean of the dataset and keep it in an instance attribute.
    def fit(self, X, y=None):
        X = as_float_array(X)
        self.mean = np.mean(X, axis=0)
        # Return self so that calls can be chained,
        # e.g. transformer.fit(X).transform(X)
        return self

    def transform(self, X):
        X = as_float_array(X)
        assert X.shape[1] == self.mean.shape[0]
        return X > self.mean
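A quick usage check of such a transformer might look like the sketch below. The class is repeated so the snippet runs on its own; unlike the listing above, this variant also inherits from BaseEstimator and stores the mean as mean_, which are assumptions made here so that it provides get_params and follows sklearn naming. The input array is made up for the example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import as_float_array


class MeanDiscrete(TransformerMixin, BaseEstimator):
    """Variant of the transformer above (BaseEstimator added as an assumption)."""

    def fit(self, X, y=None):
        X = as_float_array(X)
        self.mean_ = X.mean(axis=0)  # per-feature mean learned from the data
        return self

    def transform(self, X):
        X = as_float_array(X)
        assert X.shape[1] == self.mean_.shape[0]
        return X > self.mean_        # True where a value exceeds its column mean


X = np.array([[1, 2],
              [3, 4],
              [5, 0]])               # column means are 3 and 2

out = MeanDiscrete().fit_transform(X)  # fit_transform comes from TransformerMixin
print(out)
```

Values equal to the column mean (the 3 and the 2 here) map to False, matching the "less than or equal to the mean becomes 0" rule.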



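Author: Cer_ml
Link: https://www.jianshu.com/p/516f009c0875
Source: Jianshu
Copyright belongs to the author. For commercial reposts, contact the author for authorization; for non-commercial reposts, credit the source.

sklearn study notes (3): multiclass SVM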

http://blog.csdn.net/babybirdtofly/article/details/72886879

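SVC, NuSVC and LinearSVC can all perform multiclass classification on a dataset.
SVC and NuSVC are similar methods, but they accept different sets of parameters and have different mathematical formulations. LinearSVC, on the other hand, is another implementation of SVC for the linear kernel, so it does not accept the keyword "kernel": the kernel can only be linear.
Binary classification
Like other classifiers, all three take two arrays as input: X of shape [n_samples, n_features] (the training data) and Y of shape [n_samples] (the class labels).

from sklearn import svm
X = [[0, 0], [1, 1]]
Y = [0, 1]

Once fitted, the model can be used for prediction:

clf = svm.SVC()
clf.fit(X, Y)
clf.predict([[2., 2.]])

The decision function of an SVM depends on a subset of the training data, the support vectors. Their properties can be inspected through the following attributes:

# get support vectors
clf.support_vectors_
# get indices of support vectors
clf.support_
# get number of support vectors for each class
clf.n_support_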

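Multiclass classification
SVC and NuSVC implement the "one-vs-one" approach to multiclass classification (Knerr et al., 1990). If n_class is the number of classes, then n_class * (n_class - 1) / 2 classifiers are built. The decision_function_shape option allows the results of the "one-vs-one" classifiers to be aggregated into a decision function of shape (n_samples, n_classes).

# four samples, one per class (implied by the "4 classes" comment in the code)
X = [[0], [1], [2], [3]]
Y = [0, 1, 2, 3]
clf = svm.SVC(decision_function_shape='ovo')
clf.fit(X, Y)
dec = clf.decision_function([[1]])
print(dec.shape[1])  # 4 classes: 4*3/2 = 6
print(clf.predict([[1]]))
clf.decision_function_shape = "ovr"
dec = clf.decision_function([[1]])
print(dec.shape[1])  # 4 classes
print(clf.predict([[2.4]]))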

LinearSVC, in turn, implements the "one vs the rest" multiclass strategy.

lin_clf = svm.LinearSVC()
lin_clf.fit(X, Y)
dec = lin_clf.decision_function([[3]])
print(dec.shape[1])
print(lin_clf.predict([[2.4]]))

Scores and probabilities
The decision_function method of SVC gives each sample a score for every class. When probability is set to True, class membership probabilities can be estimated with predict_proba and predict_log_proba.
Wu, Lin and Weng, "Probability estimates for multi-class classification by pairwise coupling", JMLR 5:975-1005, 2004.
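A minimal sketch of the difference between scores and probabilities, assuming the iris dataset as illustrative data. Because the probabilities are calibrated separately with an internal cross-validation, the class with the highest probability can occasionally differ from what predict returns.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)  # 3 classes (illustrative data)

# probability=True enables predict_proba / predict_log_proba (slower to fit).
clf = SVC(probability=True, random_state=0).fit(X, y)

scores = clf.decision_function(X[:3])  # unbounded scores, one column per class
proba = clf.predict_proba(X[:3])       # each row sums to 1

print(scores.shape, proba.shape)
print(np.round(proba, 3))
```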
Unbalanced problems
The keywords class_weight and sample_weight can be used to adjust the weight of specific classes or of specific samples.
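The two keywords are passed in different places: class_weight goes to the constructor, sample_weight goes to fit. The sketch below uses a made-up imbalanced dataset (100 samples against 10) and a weight of 10 for the minority class; the numbers are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Imbalanced synthetic data: 100 samples of class 0, 10 of class 1.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(10, 2))])
y = np.array([0] * 100 + [1] * 10)

# No reweighting.
plain = SVC(kernel="linear").fit(X, y)

# Per-class weights: errors on class 1 are penalized 10 times more.
by_class = SVC(kernel="linear", class_weight={1: 10}).fit(X, y)

# Per-sample weights: one weight for every training sample, given to fit().
weights = np.where(y == 1, 10.0, 1.0)
by_sample = SVC(kernel="linear").fit(X, y, sample_weight=weights)

# Number of samples each model assigns to the minority class.
for name, model in [("plain", plain), ("class_weight", by_class),
                    ("sample_weight", by_sample)]:
    print(name, int((model.predict(X) == 1).sum()))
```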

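This assignment uses logistic regression and neural networks to recognize handwritten Arabic digits (0-9).

For a programming exercise on logistic regression, see: Stanford Coursera Andrew Ng machine learning course programming assignment (Exercise 2) and summary.

Below, logistic regression is used to solve the multiclass problem of recognizing handwritten digits (0-9). For the neural-network version of the same task, see: neural network implementation.

The data loaded into Matlab has the following format:

There are 5000 training samples. Each sample is a 400-dimensional vector (a 20x20-pixel grayscale image) stored as a row of the matrix X. The labels of the training set are stored in the vector y, a column vector with 5000 rows and 1 column.

For example, y = (1,2,3,4,5,6,7,8,9,10......)^T. Note that because Matlab indices start at 1, 10 is used to represent the digit 0.

① Visualizing the sample data

100 randomly selected samples, visualized in Matlab, look as follows:

② Using logistic regression for the multiclass problem (one-vs-all)

A multiclass problem is one whose outcome falls into three or more classes. For example, tomorrow's weather can be predicted as one of three classes: sunny (y==1), cloudy (y==2) or rainy (y==3).

The idea is very similar to ordinary logistic regression (which by default means binary classification). When classifying "sunny", the other two classes (cloudy and rainy) are treated as a single class (not sunny), which turns the multiclass problem into a binary one. A schematic follows (the circles in the figure denote all the other classes that do not belong to the class in question):

An N-class problem (N>=3) needs N hypothesis functions (prediction models), that is, N sets of model parameters θ (θ is generally a vector).

Then, for each sample, every model is applied in turn, and the prediction of the model with the largest output is taken as the final result.

Because each model's output passes through the sigmoid function, it is in fact a probability. Note that the three hypotheses h_θ⁽¹⁾(x), h_θ⁽²⁾(x) and h_θ⁽³⁾(x) generally have different parameters θ. For example:

h_θ⁽¹⁾(x) outputs the predicted probability of a sunny day (y==1)

h_θ⁽²⁾(x) outputs the predicted probability of a cloudy day (y==2)

h_θ⁽³⁾(x) outputs the predicted probability of a rainy day (y==3)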
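The three-model scheme can be sketched in Python as follows. This is an illustration only: the data is synthetic (make_blobs with three clusters standing in for sunny, cloudy and rainy), and sklearn's LogisticRegression is used for each binary model instead of the course's Matlab code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data; labels shifted to 1, 2, 3 as in the text.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
y = y + 1

# Train one binary model per class: "class c" versus "everything else".
models = {}
for c in (1, 2, 3):
    models[c] = LogisticRegression().fit(X, (y == c).astype(int))

# Column k holds the probability that each sample belongs to class k + 1.
probs = np.column_stack(
    [models[c].predict_proba(X)[:, 1] for c in (1, 2, 3)])

# Final prediction: the class whose model reports the largest probability.
pred = probs.argmax(axis=1) + 1
print("training accuracy:", (pred == y).mean())
```

The three probabilities of a sample come from separate models, so they need not sum to 1; only their ranking matters for the prediction.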

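③ Matlab implementation

For the digit recognition problem above, 10 logistic regression models have to be trained in total, each one recognizing one of the digits.

We have 5000 samples, and the label values are y=(1,2,3,4,5,6,7,8,9,10), where 10 stands for the digit 0.

We use the fmincg function (supplied with the course exercise) to find the model parameters θ that minimize the cost function.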

function [all_theta] = oneVsAll(X, y, num_labels, lambda)
%ONEVSALL trains multiple logistic regression classifiers and returns all
%the classifiers in a matrix all_theta, where the i-th row of all_theta 
%corresponds to the classifier for label i
%   [all_theta] = ONEVSALL(X, y, num_labels, lambda) trains num_labels
%   logistic regression classifiers and returns each of these classifiers
%   in a matrix all_theta, where the i-th row of all_theta corresponds
%   to the classifier for label i

% Some useful variables
m = size(X, 1); % num of samples
n = size(X, 2); % num of features

% You need to return the following variables correctly
all_theta = zeros(num_labels, n + 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

% ====================== YOUR CODE HERE ======================
% Instructions: You should complete the following code to train num_labels
%               logistic regression classifiers with regularization
%               parameter lambda.
%
% Hint: theta(:) will return a column vector.
%
% Hint: You can use y == c to obtain a vector of 1's and 0's that tell you
%       whether the ground truth is true/false for this class.
%
% Note: For this assignment, we recommend using fmincg to optimize the cost
%       function. It is okay to use a for-loop (for c = 1:num_labels) to
%       loop over the different classes.
%
%       fmincg works similarly to fminunc, but is more efficient when we
%       are dealing with large number of parameters.
%
% Example Code for fmincg:
%
%     % Set Initial theta
%     initial_theta = zeros(n + 1, 1);
%
%     % Set options for fminunc
%     options = optimset('GradObj', 'on', 'MaxIter', 50);
%
%     % Run fmincg to obtain the optimal theta
%     % This function will return theta and the cost
%     [theta] = ...
%         fmincg (@(t)(lrCostFunction(t, X, (y == c), lambda)), ...
%                 initial_theta, options);
%

initial_theta = zeros(n + 1, 1);
options = optimset('GradObj', 'on', 'MaxIter', 50);

for c = 1:num_labels % num_labels is the number of logistic regression classifiers
    all_theta(c, :) = fmincg(@(t)(lrCostFunction(t, X, (y == c), lambda)), initial_theta, options);
end

% =========================================================================

end

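For lrCostFunction, see the regularized logistic regression implementation in the costFunctionReg.m file at http://www.cnblogs.com/hapjin/p/6078530.html

Now an explanation of the for loop:

num_labels is the number of classifiers, 10 in total; each classifier (model) recognizes one of the 10 digits.

We have 5000 samples and each sample has 400 features, so the parameter vector θ has 401 elements.

initial_theta = zeros(n + 1, 1); % initial value of the model parameters θ (n == 400)

all_theta is a 10*401 matrix in which each row stores the parameter vector θ of one classifier (model). Running the for loop above calls fmincg once per class and so yields the parameter vectors θ of all the models.

With the parameter vector θ of every model in hand, the trained models can be used to recognize digits. For a given input instance (400 feature variables), the hypothesis function h_θ⁽ⁱ⁾(x) of each model outputs a value (i = 1,2,...10). The class with the largest of these 10 values is the final recognition result. For example, if g(h_θ⁽⁸⁾(x))==0.96 is larger than every other g(h_θ⁽ⁱ⁾(x)) (i = 1,2,...10, with i not equal to 8), the recognized digit is 8.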

 

function p = predictOneVsAll(all_theta, X)
%PREDICT Predict the label for a trained one-vs-all classifier. The labels 
%are in the range 1..K, where K = size(all_theta, 1). 
%  p = PREDICTONEVSALL(all_theta, X) will return a vector of predictions
%  for each example in the matrix X. Note that X contains the examples in
%  rows. all_theta is a matrix where the i-th row is a trained logistic
%  regression theta vector for the i-th class. You should set p to a vector
%  of values from 1..K (e.g., p = [1; 3; 1; 2] predicts classes 1, 3, 1, 2
%  for 4 examples) 

m = size(X, 1);
num_labels = size(all_theta, 1);

% You need to return the following variables correctly 
p = zeros(size(X, 1), 1);

% Add ones to the X data matrix
X = [ones(m, 1) X];

% ====================== YOUR CODE HERE ======================
% Instructions: Complete the following code to make predictions using
%               your learned logistic regression parameters (one-vs-all).
%               You should set p to a vector of predictions (from 1 to
%               num_labels).
%
% Hint: This code can be done all vectorized using the max function.
%       In particular, the max function can also return the index of the 
%       max element, for more information see 'help max'. If your examples 
%       are in rows, then, you can use max(A, [], 2) to obtain the max 
%       for each row.
%       

[~,p] = max( X * all_theta',[],2); % take the maximum of each row of (X*all_theta'); p holds the column index of each row's maximum
% =========================================================================
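For readers working in Python, the same prediction step can be sketched with NumPy. The function name and the tiny parameter matrix below are made up for illustration. As in the Matlab line above, the sigmoid is skipped: it is monotonically increasing, so the largest linear score is also the largest probability.

```python
import numpy as np


def predict_one_vs_all(all_theta, X):
    """Return labels in 1..K for the rows of X (hypothetical NumPy port)."""
    m = X.shape[0]
    X1 = np.hstack([np.ones((m, 1)), X])  # prepend the intercept column
    scores = X1 @ all_theta.T             # shape (m, K): one score per class
    return scores.argmax(axis=1) + 1      # +1 because labels start at 1


# Toy parameters: 3 classifiers, each with an intercept and 2 weights.
all_theta = np.array([[0.0,  1.0,  0.0],
                      [0.0,  0.0,  1.0],
                      [1.0, -1.0, -1.0]])
X = np.array([[ 2.0,  0.0],
              [ 0.0,  3.0],
              [-1.0, -1.0]])

p = predict_one_vs_all(all_theta, X)
print(p)  # [1 2 3]
```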
