Rookie's Road: Understanding the SVM Classifier in Machine Learning, with a Python Implementation

Reposted article. 2018-08-23 21:06. Tags: machine learning / Rookie's Road / SVM / Python

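There is a great deal inside the SVM classifier, far more than in the previous two algorithms in this series. No wonder it is often called the best-performing algorithm before deep learning appeared.

What I learned today is probably only the tip of the iceberg: I now understand some of the principles behind SVM, but I still need to keep studying it in more depth.

Some key terms:

Hyperplane: the goal of SVM is to find a hyperplane that separates the two classes of data with the largest possible margin. If the hyperplane is defined as w*x+b=0, and w and b are scaled so that |w*x+b|=1 at the support vectors, then the distance from the hyperplane to any support vector is 1/||w||. (||w|| is the norm of w, i.e. √(w*w').)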
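To make the 1/||w|| claim concrete, here is a minimal sketch. It assumes the usual scaling in which a support vector satisfies |w*x+b| = 1; the values of `w`, `b` and the point are hand-picked for illustration and do not come from the article.

```python
import numpy as np

def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# Hypothetical hyperplane and support vector, chosen so that w.x + b = 1.
w = np.array([0.5, 0.5])
b = 0.0
support_vector = np.array([1.0, 1.0])

dist = distance_to_hyperplane(w, b, support_vector)
margin = 1.0 / np.linalg.norm(w)
print(dist, margin)  # the two values agree
```

Because the numerator equals 1 at a support vector, the general point-to-hyperplane distance collapses to 1/||w|| there.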

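SVM solves the resulting margin-maximization problem, which in its standard hard-margin form is

$\min_{w,b} \frac{1}{2} ||w||^{2}$ subject to $y_{i}(w \cdot x_{i} + b) \geq 1$ for every training point, with class labels $y_{i} \in \{-1, +1\}$.

Working through the Lagrangian and the KKT conditions gives the decision function $d(X^{T}) = \sum_{i} y_{i} a_{i} X_{i} X^{T} + b_{0}$. Here y_i is the class label, X_i is a support vector, X^T is the transpose of the instance to be predicted, and a_i and b_0 are fixed values obtained from the solution. The sign of d(X^T) decides the class.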
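The decision function can be sketched in a few lines of NumPy. The support vectors, labels, multipliers `alphas` (the a_i) and `b0` below are a hand-made toy solution, not values from the article; `decision_value` is a hypothetical helper name.

```python
import numpy as np

def decision_value(x, support_vectors, labels, alphas, b0):
    """d(x) = sum_i y_i * a_i * (X_i . x) + b0."""
    return float(np.sum(labels * alphas * (support_vectors @ x)) + b0)

# Hypothetical toy solution: two support vectors, one per class.
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])  # X_i
labels = np.array([1.0, -1.0])                           # y_i
alphas = np.array([0.25, 0.25])                          # a_i
b0 = 0.0

d_pos = decision_value(np.array([2.0, 3.0]), support_vectors, labels, alphas, b0)
d_neg = decision_value(np.array([-1.0, -2.0]), support_vectors, labels, alphas, b0)
d_sv = decision_value(support_vectors[0], support_vectors, labels, alphas, b0)
print(d_pos, d_neg, d_sv)  # positive, negative, and exactly on the margin
```

Only the support vectors enter the sum, which is why the trained model needs to store nothing else from the training set.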

That is the short version. Below is a simple program.

Linearly separable case

import numpy as np
import pylab as pl   # plotting
from sklearn import svm

# Randomly generate two groups of 2-D data
np.random.seed(0)  # fix the seed so the random numbers are the same on every run
# Note: np.r_ takes square brackets, np.r_[...], not np.r_(...); I typed it wrong and got
# TypeError: 'RClass' object is not callable
X = np.r_[np.random.randn(20,2)-[2,2],np.random.randn(20,2)+[2,2]]
# np.r_ stacks two matrices vertically (same number of columns required);
# np.c_ joins two matrices side by side (same number of rows required)

Y = [0] * 20+[1] * 20  # list repetition builds the labels: 20 zeros, then 20 ones

clf=svm.SVC(kernel='linear')
clf.fit(X,Y)

w=clf.coef_[0]
a=-w[0]/w[1]  # slope of the separating line
xx=np.linspace(-5,5)  # 50 evenly spaced values from -5 to 5 (np.linspace defaults to num=50)
# The hyperplane is w[0]*x + w[1]*y + b = 0 with b = clf.intercept_[0],
# so y = a*x - b/w[1]; -(clf.intercept_[0])/w[1] is the intercept of the line
yy=a*xx-(clf.intercept_[0])/w[1]

# Lines parallel to the hyperplane that pass through the support vectors
b=clf.support_vectors_[0]
yy_down=a*xx+(b[1]-a*b[0])  # (b[1]-a*b[0]) is simply the intercept of the line through b
b=clf.support_vectors_[-1]
yy_up=a*xx+(b[1]-a*b[0])

print("w:",w) # weight vector
print("a:",a) # slope
print("suport_vectors_:",clf.support_vectors_) # support vectors
print("clf.coef_:",clf.coef_)                  # weight vector again, same as w


# Just plotting. pylab is rarely used now (most code uses matplotlib.pyplot), so no more detail here
pl.plot(xx,yy,'k-')
pl.plot(xx,yy_down,'k--')
pl.plot(xx,yy_up,'k--')

# Circle the support vectors: column 0 is the x coordinate, column 1 the y coordinate
pl.scatter(clf.support_vectors_[:,0],clf.support_vectors_[:,1],s=80,facecolors='none',edgecolors='k')
pl.scatter(X[:,0],X[:,1],c=Y,cmap=pl.cm.Paired)

pl.axis('tight')
pl.show()

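Side note:

np.r_ joins two matrices by stacking them vertically (one on top of the other), which requires the same number of columns; np.c_ joins them side by side, which requires the same number of rows.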
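A quick way to see the difference is to check the resulting shapes. The arrays below are made-up placeholders.

```python
import numpy as np

top = np.ones((2, 3))
bottom = np.zeros((4, 3))
stacked = np.r_[top, bottom]        # rows appended: shape (6, 3)

left = np.ones((2, 3))
right = np.zeros((2, 5))
side_by_side = np.c_[left, right]   # columns appended: shape (2, 8)

print(stacked.shape, side_by_side.shape)
```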

Running it prints

w: [0.90230696 0.64821811]
a: -1.391980476255765
suport_vectors_: [[-1.02126202 0.2408932 ]
[-0.46722079 -0.53064123]
[ 0.95144703 0.57998206]]
clf.coef_: [[0.90230696 0.64821811]]

In the plot, the solid line is the hyperplane, and the dashed lines are the lines on which the support vectors lie.
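The printed numbers can be used to check that the support vectors really sit on the dashed lines. This is a rough sketch: `w` and the support vectors are copied from the output above, while the intercept `b` is not printed there, so it is recovered under the assumption that the last support vector lies exactly on the positive margin.

```python
import numpy as np

# Values copied from the printed output above.
w = np.array([0.90230696, 0.64821811])
support_vectors = np.array([
    [-1.02126202, 0.2408932],
    [-0.46722079, -0.53064123],
    [0.95144703, 0.57998206],
])

# Assumption: w.x + b = +1 at the last support vector, which gives b.
b = 1.0 - w @ support_vectors[-1]

scores = support_vectors @ w + b        # close to -1, -1, +1
half_margin = 1.0 / np.linalg.norm(w)   # distance from the solid line to each dashed line
print(scores, half_margin)
```

The scores are only approximately ±1 because the solver stops at a finite tolerance and the printed values are rounded.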

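Linearly inseparable case

Next is the linearly inseparable case.

As the name suggests, a single straight line cannot separate the two classes.

In that case we use a nonlinear mapping to move the data into a higher-dimensional space. This is easy to picture. Take the three one-dimensional points [-1,0,1], where 0 is one class and (-1,1) is the other. No single threshold on the line can separate them, but mapping with y=x² turns them into [(-1,1),(0,0),(1,1)], which a straight line such as y=0.5 separates. I will not draw it here; sketch it yourself to see it. The same idea holds in higher dimensions.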
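The three-point example can be checked directly. This is a small sketch of that exact example; the label encoding (0 for the point at 0, 1 for the points at -1 and 1) is my own choice.

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])
labels = np.array([1, 0, 1])            # the point 0 is one class, -1 and 1 the other

mapped = np.c_[x, x ** 2]               # each point x becomes (x, x^2)
predicted = (mapped[:, 1] > 0.5).astype(int)  # which side of the line y = 0.5

print(mapped)
print(predicted)
```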

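However, although mapping to a higher dimension makes classification possible, it also makes the computation harder: the higher the dimension, the slower the inner product. So we define K(x_i, y_i) = φ(x_i)·φ(y_i) = x1*y1+x2*y2+......+xn*yn, where φ is the mapping described above and x1...xn, y1...yn denote the components of the mapped vectors φ(x_i) and φ(y_i).

A kernel function reduces the amount of inner-product computation and makes the program run faster: it expresses the inner product in the mapped high-dimensional space as a function evaluated in the original low-dimensional space.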
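As an illustration of that shortcut, the sketch below uses a degree-2 polynomial kernel on 2-D vectors. The feature map `phi` and the test vectors are my own example, not taken from the article.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector: 2-D in, 3-D out."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    """Same inner product, computed in the original 2-D space."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # map first, then take the inner product
via_kernel = poly2_kernel(x, z)     # never leaves the low-dimensional space
print(explicit, via_kernel)
```

Both routes give the same number, but the kernel version never builds the higher-dimensional vectors, which matters once the mapped space is very large.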

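Commonly used kernel functions:

Linear kernel: $K(x, z) = x \cdot z$

Polynomial kernel: $K(x, z) = (\gamma x \cdot z + r)^{d}$, with parameters $\gamma, r, d$

Gaussian kernel: $K(x, z) = \exp(-\gamma ||x - z||^{2})$, also called the radial basis function (RBF) kernel. It is the most widely used kernel for nonlinear SVM classification and is the default kernel in libsvm. γ must be greater than 0 and has to be tuned by hand.

Sigmoid kernel: $K(x, z) = \tanh(\gamma x \cdot z + r)$, with parameters $\gamma, r$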
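The four formulas translate almost one-to-one into NumPy. This is a plain sketch for single vectors; the default parameter values are arbitrary illustrations, not the defaults of any library.

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=3):
    return (gamma * np.dot(x, z) + r) ** d

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, gamma=0.1, r=0.0):
    return np.tanh(gamma * np.dot(x, z) + r)

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.0])
print(linear_kernel(x, z), polynomial_kernel(x, z),
      rbf_kernel(x, z), sigmoid_kernel(x, z))
```

Note that the RBF kernel of a point with itself is always 1 and decays toward 0 as the points move apart, with γ controlling how fast.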

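For multi-class problems, the usual approach is to train, for each class, a binary classifier that separates the current class from all the other classes (one-vs-rest).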
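A hand-rolled sketch of this one-vs-rest idea is shown below. It is written out by hand for illustration and is not how scikit-learn handles multi-class SVC internally; the iris dataset and the variable names are my own choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary classifier per class: "this class" (1) versus "all others" (0).
binary_models = [
    SVC(kernel="linear").fit(X, (y == c).astype(int)) for c in classes
]

# Each model scores every sample; the class with the highest score wins.
scores = np.column_stack([m.decision_function(X) for m in binary_models])
y_pred = classes[np.argmax(scores, axis=1)]

accuracy = np.mean(y_pred == y)
print("training accuracy:", accuracy)
```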

Below is a face recognition program.

#from __future__ import print_function  # the __future__ module brings features of the next Python version into the current one, so they can be tried early
                                        # my Python version is 3.6.4, so this is not needed

from time import time   # for timing the program
import logging           # for printing progress logs
import matplotlib.pyplot as plt  # plotting

from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.svm import SVC

logging.basicConfig(level=logging.INFO,format='%(asctime)s %(message)s')
lfw_people=fetch_lfw_people(min_faces_per_person=70,resize=0.4)  # face dataset of famous people

n_samples,h,w=lfw_people.images.shape  # number of samples, image height, image width

X=lfw_people.data   # feature matrix
n_feature=X.shape[1]  # number of features per sample

Y=lfw_people.target
target_names=lfw_people.target_names
n_classes=target_names.shape[0]     # number of classes
print("Total dataset size")
print("n_samples:",n_samples)
print("n_feature:",n_feature)
print("n_classes:",n_classes)

X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.25)  # hold out 25% as the test set

# Dimensionality reduction
n_components=150  # number of principal components PCA keeps, i.e. the number of features retained
print("Extracting the top %d eigenfaces from %d faces" % (n_components,X_train.shape[0]))
t0=time()
pca=PCA(svd_solver='randomized',n_components=n_components,whiten=True).fit(X_train)  # fit a PCA model

print("Train PCA in %0.3fs" % (time()-t0))

eigenfaces  = pca.components_.reshape((n_components,h,w))  # principal components reshaped back into images (eigenfaces)

print("Prijecting the input data on the eigenfaces orthonarmal basis")
t0=time()
X_train_pca = pca.transform(X_train)     # reduce the dimensionality of the training and test sets
X_test_pca = pca.transform(X_test)
print("Done PCA in %0.3fs" % (time()-t0))


# Finally, the SVM training
print("Fiting the classifier to the training set")
t0=time()
param_grid ={'C':[1e3,5e3,1e4,5e4,1e5],  # C is the penalty for misclassification
             'gamma':[0.0001,0.0005,0.001,0.005,0.01,0.1],}  # gamma is the RBF kernel coefficient (how far one sample's influence reaches); several values are tried for each parameter
clf = GridSearchCV(SVC(kernel='rbf'),param_grid)
clf=clf.fit(X_train_pca,Y_train)
print("Done Fiting in %0.3fs" % (time()-t0))

print("Best estimotor found by grid search:")
print(clf.best_estimator_)

print("Predicting people's names on the test set")
t0=time()
Y_pred = clf.predict(X_test_pca)
print("done Predicting in %0.3fs" % (time()-t0))

print(classification_report(Y_test,Y_pred,target_names=target_names))  # a short per-class report
print(confusion_matrix(Y_test,Y_pred,labels=range(n_classes)))  # entry (i, j) counts samples of class i predicted as class j


# Plot 12 of the classified images.

# Helper that plots a gallery of images
def plot_gallery(images,titles,h,w,n_row=3,n_col=4):
    plt.figure(figsize=(1.8*n_col,2.4*n_row))
    plt.subplots_adjust(bottom=0,left=.01,right=.99,top=.90,hspace=.35)
    for i in range(n_row*n_col):
        plt.subplot(n_row,n_col,i+1)
        plt.imshow(images[i].reshape((h,w)),cmap=plt.cm.gray)
        plt.title(titles[i],size=12)
        plt.xticks(())
        plt.yticks(())

# Helper that builds a title string in a fixed format
def title(y_pred,y_test,target_names,i):
    pred_name=target_names[y_pred[i]].rsplit(' ',1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return "predicted: %s\n true: %s" %(pred_name,true_name)

predicted_titles=[title(Y_pred,Y_test,target_names,i) for i in range(Y_pred.shape[0])]  # a list comprehension keeps this loop concise

plot_gallery(X_test,predicted_titles,h,w)

eigenfaces_titles=["eigenface %d " % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces,eigenfaces_titles,h,w)

plt.show()

The output is

Total dataset size
n_samples: 1288
n_feature: 1850
n_classes: 7
Extracting the top 150 eigenfaces from 966 faces
Train PCA in 0.440s
Prijecting the input data on the eigenfaces orthonarmal basis
Done PCA in 0.019s
Fiting the classifier to the training set
Done Fiting in 52.087s
Best estimotor found by grid search:
SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
Predicting people's names on the test set
done Predicting in 0.147s
                   precision    recall  f1-score   support

     Ariel Sharon       0.75      0.80      0.77        15
     Colin Powell       0.81      0.86      0.84        59
  Donald Rumsfeld       0.86      0.81      0.83        37
    George W Bush       0.90      0.90      0.90       134
Gerhard Schroeder       0.77      0.82      0.79        28
      Hugo Chavez       0.92      0.86      0.89        14
       Tony Blair       0.81      0.71      0.76        35

      avg / total       0.85      0.85      0.85       322

[[ 12 3 0 0 0 0 0]
[ 1 51 0 5 0 0 2]
[ 2 1 30 2 0 1 1]
[ 1 5 5 120 2 0 1]
[ 0 1 0 2 23 0 2]
[ 0 1 0 1 0 12 0]
[ 0 1 0 4 5 0 25]]

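The eigenfaces figure shows the images obtained after extracting features with principal component analysis.

Explanation of the two matrices

                   precision    recall  f1-score   support

      avg / total       0.85      0.85      0.85       322

The columns of this report are built from four counts, taken for one class at a time:

- TP, True Positive: actually positive and predicted positive
- FP, False Positive: actually negative but predicted positive
- TN, True Negative: actually negative and predicted negative
- FN, False Negative: actually positive but predicted negative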

precision = TP / (TP + FP)

recall = TP / (TP + FN)

accuracy = (TP + TN) / (TP + FP + TN + FN)

F1 Score = 2*P*R/(P+R), where P and R are precision and recall respectively
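The sketch below computes these quantities for a binary problem. `binary_metrics` is a hypothetical helper and the label lists are made up; it also assumes at least one positive prediction and one positive label, so no denominator is zero.

```python
def binary_metrics(y_true, y_pred):
    """Count TP/FP/TN/FN for labels in {0, 1} and derive the four scores."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positive, predicted positive
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negative, predicted positive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negative, predicted negative
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positive, predicted negative

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Made-up labels: 4 positives and 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall, accuracy, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, accuracy, f1)
```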

[[ 12 3 0 0 0 0 0]
[ 1 51 0 5 0 0 2]
[ 2 1 30 2 0 1 1]
[ 1 5 5 120 2 0 1]
[ 0 1 0 2 23 0 2]
[ 0 1 0 1 0 12 0]
[ 0 1 0 4 5 0 25]]

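There are seven classes, so this is a 7*7 matrix. Row i, column j counts the samples whose true class is i and that were predicted as class j, so the diagonal holds the samples predicted as their own class, which are clearly the largest entries.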
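The per-class numbers in the report can be recomputed from this confusion matrix. The sketch below copies the matrix from the output above and treats each class in turn as the "positive" class.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([
    [12,  3,  0,   0,  0,  0,  0],
    [ 1, 51,  0,   5,  0,  0,  2],
    [ 2,  1, 30,   2,  0,  1,  1],
    [ 1,  5,  5, 120,  2,  0,  1],
    [ 0,  1,  0,   2, 23,  0,  2],
    [ 0,  1,  0,   1,  0, 12,  0],
    [ 0,  1,  0,   4,  5,  0, 25],
])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)   # column sum = everything predicted as the class
recall = true_positives / cm.sum(axis=1)      # row sum = everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(accuracy, 4))
```

The row sums also reproduce the support column of the report, which is a handy sanity check that the matrix was copied correctly.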

Detailed notes on some of the functions

lfw_people=fetch_lfw_people(min_faces_per_person=70,resize=0.4)

'''
From its official documentation:
def fetch_lfw_people(data_home=None, funneled=True, resize=0.5,
min_faces_per_person=0, color=False,
slice_=(slice(70, 195), slice(78, 172)),
download_if_missing=True):
"""Loader for the Labeled Faces in the Wild (LFW) people dataset

This dataset is a collection of JPEG pictures of famous people
collected on the internet, all details are available on the
official website:

http://vis-www.cs.umass.edu/lfw/

Each picture is centered on a single face. Each pixel of each channel
(color in RGB) is encoded by a float in range 0.0 - 1.0.

The task is called Face Recognition (or Identification): given the
picture of a face, find the name of the person given a training set
(gallery).
The original images are 250 x 250 pixels, but the default slice and resize
arguments reduce them to 62 x 74.

Parameters
----------
data_home : optional, default: None
Specify another download and cache folder for the datasets. By default
all scikit learn data is stored in '~/scikit_learn_data' subfolders.

funneled : boolean, optional, default: True
Download and use the funneled variant of the dataset.
resize : float, optional, default 0.5
Ratio used to resize the each face picture.

min_faces_per_person : int, optional, default None
The extracted dataset will only retain pictures of people that have at
least `min_faces_per_person` different pictures.

color : boolean, optional, default False
Keep the 3 RGB channels instead of averaging them to a single
gray level channel. If color is True the shape of the data has
one more dimension than the shape with color = False.

slice_ : optional
Provide a custom 2D slice (height, width) to extract the
'interesting' part of the jpeg files and avoid use statistical
correlation from the background

download_if_missing : optional, True by default
If False, raise a IOError if the data is not locally available
instead of trying to download the data from the source site.

Returns
-------
dataset : dict-like object with the following attributes:

dataset.data : numpy array of shape (13233, 2914)
Each row corresponds to a ravelled face image of original size 62 x 47
pixels. Changing the ``slice_`` or resize parameters will change the
shape of the output.

dataset.images : numpy array of shape (13233, 62, 47)
Each row is a face image corresponding to one of the 5749 people in
the dataset. Changing the ``slice_`` or resize parameters will change
the shape of the output.

dataset.target : numpy array of shape (13233,)
Labels associated to each face image. Those labels range from 0-5748
and correspond to the person IDs.

dataset.DESCR : string
Description of the Labeled Faces in the Wild (LFW) dataset

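In short, min_faces_per_person keeps only the people who have at least that many different pictures, and resize is the ratio by which each face picture is rescaled.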
'''


pca=PCA(svd_solver='randomized',n_components=n_components,whiten=True).fit(X_train)

'''
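n_components: the number of principal components n that PCA keeps, i.e. the number of features retained.
Type: int, float or string. Defaults to None, in which case all components are kept.
          As an int, e.g. n_components=1, the original data is reduced to one dimension.
          As the string 'mle', the number of components is chosen automatically by maximum likelihood estimation.
          As a float between 0 and 1, enough components are kept to reach that fraction of explained variance.
          
copy: whether to copy the original training data when running the algorithm. If True, the original training data is unchanged after PCA because the computation runs on a copy; if False, the original training data may be modified because the computation runs on it directly.
Type: bool, True or False. Defaults to True.

whiten: whitening, which gives every feature the same variance. Adjacent pixels in an image are strongly correlated, so the raw input is redundant for training. Whitening reduces this redundancy; more formally, after whitening the input to the learning algorithm should have (i) low correlation between features and (ii) the same variance for all features.

Type: bool. Defaults to False.

Attributes of the PCA object
components_: the principal axes, i.e. the directions of maximum variance.
explained_variance_ratio_: the fraction of variance explained by each of the n kept components.
n_components_: the number of components kept.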
mean_：
noise_variance_：

'''
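To see what whitening does numerically, here is a NumPy-only sketch of PCA with whitening via SVD. It is an illustration under my own assumptions (random correlated data, two components) and is not the scikit-learn implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
# Hypothetical correlated data: 200 samples, 5 features.
X = rng.randn(200, 5) @ rng.randn(5, 5)

n_components = 2
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)

components = Vt[:n_components]                        # principal axes
projected = (X - mean) @ components.T                 # plain PCA projection
explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
whitened = projected / np.sqrt(explained_variance)    # rescale each component to unit variance

cov = np.cov(whitened, rowvar=False)
print(np.round(cov, 6))  # close to the identity matrix
```

The covariance of the whitened data is the identity: the components are uncorrelated and all have variance 1, which are the two properties listed for whiten.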

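All of this is fairly easy to follow; look up the details when you actually need them.

While typing out the tutorial code I also ran into several errors.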

1、 DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
DeprecationWarning)
In other words, the old module was deprecated in version 0.18, and all of its refactored classes and functions were moved to the model_selection module.

Solution:

Change

from sklearn.grid_search import GridSearchCV

from sklearn.cross_validation import train_test_split

to:
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import train_test_split
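If the same script has to run on both old and new scikit-learn versions, one option is a guarded import. This is only a sketch of that pattern; on any release from 0.18 onward the first branch succeeds and the fallback is never reached.

```python
try:
    # scikit-learn 0.18 and later
    from sklearn.model_selection import GridSearchCV, train_test_split
except ImportError:
    # Fallback for releases older than 0.18, where the old modules still exist
    from sklearn.grid_search import GridSearchCV
    from sklearn.cross_validation import train_test_split

print(GridSearchCV.__module__)
```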

2、ValueError: class_weight must be dict, 'balanced', or None, got: 'auto'

In the line that builds the grid search,

clf = GridSearchCV(SVC(kernel='rbf'),param_grid) was originally clf = GridSearchCV(SVC(kernel='rbf',class_weight ='auto'),param_grid). However,

class_weight ='auto' no longer exists in this version. Simply delete it, or use 'balanced' or None.
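For reference, scikit-learn documents the 'balanced' mode as weighting each class by n_samples / (n_classes * np.bincount(y)). The sketch below applies that formula by hand. The label array is synthetic, built from the per-class support counts in the report above purely to show the effect of imbalance; these are test-set counts, not the training labels the classifier would actually see.

```python
import numpy as np

# Synthetic labels built from the per-class counts in the report above.
counts = np.array([15, 59, 37, 134, 28, 14, 35])
y = np.repeat(np.arange(len(counts)), counts)

n_samples = len(y)
n_classes = len(counts)
balanced_weights = n_samples / (n_classes * np.bincount(y))

print(np.round(balanced_weights, 3))  # rare classes get larger weights
```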

3、DeprecationWarning: Class RandomizedPCA is deprecated; RandomizedPCA was deprecated in 0.18 and will be removed in 0.20. Use PCA(svd_solver='randomized') instead. The new implementation DOES NOT store whiten ``components_``. Apply transform to get them.
warnings.warn(msg, category=DeprecationWarning)

Another version issue.
Change from sklearn.decomposition import RandomizedPCA to from sklearn.decomposition import PCA

Then change the offending statement to: pca = PCA(svd_solver='randomized', n_components = n_components, whiten = True).fit(X_train)

4、ValueError: Found input variables with inconsistent numbers of samples: [322, 966]

The two arrays simply had different numbers of rows (samples); I had made a typo.
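A cheap guard against this kind of mix-up is to compare sample counts before calling the estimator. The helper name and the placeholder arrays below are hypothetical; the sizes 322 and 966 are the test and training set sizes from the run above, and pairing test features with training labels is just one example of how such a mismatch can arise.

```python
import numpy as np

def assert_same_number_of_samples(features, labels):
    """Hypothetical helper: fail early if X and y disagree on the sample count."""
    if len(features) != len(labels):
        raise ValueError(
            "inconsistent numbers of samples: [%d, %d]" % (len(features), len(labels))
        )

X_test_pca = np.zeros((322, 150))   # placeholder for the reduced test set
Y_test = np.zeros(322)              # placeholder for the test labels
Y_train = np.zeros(966)             # placeholder for the training labels

assert_same_number_of_samples(X_test_pca, Y_test)        # matching sizes pass
try:
    assert_same_number_of_samples(X_test_pca, Y_train)   # mismatched sizes fail
    message = None
except ValueError as err:
    message = str(err)
print(message)
```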

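Today was only an introduction, and I have not gone very deep yet. I will keep studying this when I get the chance.

Here I recommend a series written by an expert:

Support Vector Machine Principles (1): Linear Support Vector Machines

Support Vector Machine Principles (2): Soft-Margin Maximization for Linear Support Vector Machines

Support Vector Machine Principles (3): Linearly Inseparable Support Vector Machines and Kernel Functions

Support Vector Machine Principles (4): The SMO Algorithm

Support Vector Machine Principles (5): Linear Support Vector Regression

I really must study these when I have time.

Also, I have noticed many blog posts online that look just like mine, because we all followed the same tutorial. Some are plain copy-paste, and some are very detailed. My point is that if you are going to learn something, learn it as thoroughly as you can: type every line of code yourself, make sure you understand every function, and write down the principles of the algorithm promptly. Do not just read the tutorial and watch someone else implement it.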

