#Notes on the SVM lectures from Andrew Ng's machine learning course on Coursera
#Note: these notes cover only the parts of this lecture that I found important, hard to understand, or easy to forget, plus some additions; they are not detailed lecture notes or a full summary.
#Passages marked <Supplement> are my own additions rather than lecture content; references are listed at the end. My knowledge is limited, so if you find errors, please point them out.
#---------------------------------------------------------------------------------#
<Supplement> The three elements of the support vector machine method (if the meaning of model, strategy, and algorithm in machine learning is unfamiliar, see the earlier note 機器學習三要素 on the three elements of machine learning):
Model: a linear classifier with the maximum margin; with the kernel trick it becomes, in effect, a non-linear classifier;
Strategy: margin maximization, which can be formalized as a convex quadratic programming problem;
Algorithm: an optimization algorithm for solving that convex quadratic program, such as Sequential Minimal Optimization (SMO);
#---------------------------------------------------------------------------------#
From logistic regression to the SVM
Let z = θᵀx.
Hypothesis: hθ(x) = g(z) = 1 / (1 + e^(−θᵀx));
Graph of the logistic function g(z):
[Figure: sigmoid curve of g(z), rising from 0 toward 1 as z increases]
When θᵀx is much greater than 0, hθ(x) is close to 1; when θᵀx is much less than 0, hθ(x) is close to 0;
Cost function of logistic regression (for a single example):
cost(hθ(x), y) = −y·log(hθ(x)) − (1 − y)·log(1 − hθ(x)), with hθ(x) = 1 / (1 + e^(−z));
When y = 1, the expression above becomes −log(1 / (1 + e^(−z))) = log(1 + e^(−z)); see the plot:
[Figure: cost for y = 1 as a function of z = θᵀx, decreasing smoothly toward 0 as z grows]
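As a concrete reference, here is a minimal numpy sketch (my own addition, not from the lecture) of the hypothesis g(z) and the per-example cost for y = 1 described above:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_y1(z):
    """Per-example logistic cost when y = 1: -log(h) = log(1 + e^{-z})."""
    return np.log(1.0 + np.exp(-z))

z = np.linspace(-3, 3, 7)
print(sigmoid(z))           # rises from ~0 toward ~1 as z grows
print(logistic_cost_y1(z))  # decreases toward 0 as z grows
```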
The SVM cost function modifies the logistic regression cost. For y = 1 the SVM cost, written cost1(θᵀx), has two parts (the magenta line in the figure below): when z ≥ 1, cost1(θᵀx) = 0; when z < 1, cost1(θᵀx) is a straight line. This has two advantages: it is cheaper to compute (a straight line instead of the logistic function), and it is more convenient for the later optimization;
[Figure: logistic cost for y = 1 with the piecewise-linear SVM surrogate cost1 overlaid in magenta]
Similarly, doing the same for y = 0 gives cost0(θᵀx), the magenta line in the figure below.
[Figure: logistic cost for y = 0 with the piecewise-linear SVM surrogate cost0 overlaid in magenta]
This gives us cost0(θᵀx) and cost1(θᵀx):
cost1(z) = 0 for z ≥ 1 and increases linearly as z falls below 1; cost0(z) = 0 for z ≤ −1 and increases linearly as z rises above −1;
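A small numpy sketch (my own addition) writing the two surrogates as hinge functions; the lecture only draws the piecewise-linear curves, so taking the slope to be 1 here is an assumption:

```python
import numpy as np

def cost1(z):
    """SVM cost for y = 1: zero once z >= 1, a straight line for z < 1."""
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    """SVM cost for y = 0: zero once z <= -1, a straight line for z > -1."""
    return np.maximum(0.0, 1.0 + z)

z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(cost1(z))  # [3. 2. 1. 0. 0.]
print(cost0(z))  # [0. 0. 1. 2. 3.]
```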
Starting from minimizing the regularized logistic regression cost function:
min over θ of (1/m) Σ_{i=1..m} [ −y^(i) log hθ(x^(i)) − (1 − y^(i)) log(1 − hθ(x^(i))) ] + (λ/2m) Σ_{j=1..n} θ_j²,
and replacing the two log terms with the SVM surrogates, we get:
min over θ of (1/m) Σ_{i=1..m} [ y^(i) cost1(θᵀx^(i)) + (1 − y^(i)) cost0(θᵀx^(i)) ] + (λ/2m) Σ_{j=1..n} θ_j²;
Finally, let C = 1/λ and drop the factor 1/m (m is a constant, so removing it does not change the optimal θ), which gives the SVM cost function:
min over θ of C Σ_{i=1..m} [ y^(i) cost1(θᵀx^(i)) + (1 − y^(i)) cost0(θᵀx^(i)) ] + (1/2) Σ_{j=1..n} θ_j²;
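Putting the pieces together, a sketch (my own addition) of the full SVM objective above, reusing the hinge surrogates and leaving the intercept θ_0 unregularized as in the lecture:

```python
import numpy as np

def svm_cost(theta, X, y, C):
    """SVM objective: C * sum of hinge costs + (1/2) * sum of theta_j^2 (j >= 1).

    X is (m, n+1) with a leading column of ones, y has labels in {0, 1}.
    """
    z = X @ theta
    hinge = y * np.maximum(0.0, 1.0 - z) + (1 - y) * np.maximum(0.0, 1.0 + z)
    reg = 0.5 * np.sum(theta[1:] ** 2)   # do not regularize the intercept
    return C * np.sum(hinge) + reg

# tiny usage example with made-up numbers
X = np.array([[1.0, 2.0], [1.0, -1.5], [1.0, 0.5]])   # first column = bias term
y = np.array([1, 0, 1])
theta = np.array([0.0, 1.0])
print(svm_cost(theta, X, y, C=1.0))
```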
#---------------------------------------------------------------------------------#
Large margin intuition
Look again at the SVM's cost0(θᵀx) and cost1(θᵀx):
if y = 1, the cost is zero only when θᵀx ≥ 1 (not merely ≥ 0); if y = 0, the cost is zero only when θᵀx ≤ −1 (not merely < 0);
Note: the SVM wants a bit more than that - it doesn't want to *just* get it right, but have the value be quite a bit bigger than zero
- Throws in an extra safety margin factor
For the training data, the SVM not only requires the classification to be correct; it imposes an extra margin requirement so that the separation is "good";
[Figure: a linearly separable 2-D dataset with several candidate linear decision boundaries (green, magenta, black)]
- The green and magenta lines are functional decision boundaries which could be chosen by logistic regression
- But they probably don't generalize too well
- The black line, by contrast, is the one chosen by the SVM because of this safety net imposed by the optimization
- Mathematically, that black line has a larger minimum distance (margin) from any of the training examples
- More robust separator
- By separating with the largest margin, you incorporate robustness into your decision making process
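To make "larger minimum distance from any training example" concrete, here is a small sketch (my own addition, with made-up points and candidate boundaries) that computes the geometric margin of a few linear separators; the SVM prefers the one whose smallest point-to-boundary distance is largest:

```python
import numpy as np

def min_margin(w, b, X):
    """Smallest distance from any point in X to the hyperplane w.x + b = 0."""
    return np.min(np.abs(X @ w + b) / np.linalg.norm(w))

# two separable clusters (made-up data)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])

# three candidate boundaries that all separate the clusters
candidates = {
    "steep":    (np.array([1.0, 0.2]), 0.0),
    "balanced": (np.array([1.0, 1.0]), 0.0),
    "shallow":  (np.array([0.2, 1.0]), 0.0),
}
for name, (w, b) in candidates.items():
    print(name, round(min_margin(w, b, X), 3))
# the "balanced" boundary has the largest minimum margin here
```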
<Supplement> What is a support vector?
In the figure below, two hyperplanes support the gap in the middle; they are equidistant from the separating hyperplane (the red line in the middle), and that distance is the largest geometric margin we can obtain. Some points must "support" these two hyperplanes, and those supporting points are called support vectors.
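A quick way to see which training points end up as support vectors in practice (my own addition; assumes scikit-learn is available):

```python
import numpy as np
from sklearn.svm import SVC

# tiny linearly separable dataset (made-up)
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -2.5], [-3.0, -3.0]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.support_vectors_)        # the points that "support" the margin
print(clf.coef_, clf.intercept_)   # w and b of the separating hyperplane
```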
How the choice of C affects the SVM
When C is chosen appropriately:
[Figure: decision boundary with a sensible margin that tolerates the outlier]
When C is too large, the model overfits (magenta line):
[Figure: decision boundary shifted toward the outlier (magenta) so as to classify it correctly]
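A sketch (my own addition, with scikit-learn and made-up data containing one outlier) of the C = 1/λ trade-off just described: a very large C pushes the boundary to accommodate the outlier, while a small C tolerates the mistake and keeps a wider margin:

```python
import numpy as np
from sklearn.svm import SVC

# linearly separable data plus one "outlier" of the negative class
# sitting close to the positive cluster (made-up points)
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 1.0],
              [-1.0, -1.0], [-2.0, -2.0], [-3.0, -1.0],
              [0.5, 0.5]])
y = np.array([1, 1, 1, 0, 0, 0, 0])

for C in (0.01, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # large C (small lambda) tends to bend the boundary to fit the outlier;
    # small C tolerates the single mistake and keeps the boundary wide
    print(f"C={C}: w={clf.coef_.ravel()}, b={clf.intercept_[0]:.3f}, "
          f"train accuracy={clf.score(X, y):.2f}")
```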
<Supplement> Existence and uniqueness of the maximum-margin separating hyperplane: if the training data are linearly separable (this is a prerequisite), then the maximum-margin separating hyperplane that correctly separates all training samples exists and is unique;
#---------------------------------------------------------------------------------#
Kernels
<Supplement> When the training data are linearly separable or approximately linearly separable, a linear classifier is learned by maximizing the margin; when the training data are not linearly separable, the kernel trick is used to learn a non-linear classifier;
<Supplement> A kernel function gives the inner product between the feature vectors obtained by mapping the inputs from the input space into a feature space. By using a kernel function one can learn a non-linear support vector machine, which is equivalent to implicitly learning a linear support vector machine in a high-dimensional feature space. In other words, given the kernel K(x, z), the methods for solving linear classification problems can be used to train a non-linear SVM. The learning happens implicitly in the feature space; there is no need to define the feature space or the mapping explicitly. This trick is called the kernel trick;
Several commonly used kernel functions
Gaussian kernel (the most widely used): you need to choose σ (σ²); K(x, l) = exp(−‖x − l‖² / (2σ²)), see the sketch after this list;
linear kernel: i.e., "no kernel" (predict with θᵀx directly);
others: Polynomial kernel, String kernel, Chi-squared kernel, ...
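As promised above, a sketch (my own addition) of the Gaussian (RBF) kernel as a similarity function between a point x and a landmark l, which is how the lecture motivates building non-linear features f_i = K(x, l^(i)):

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity between x and landmark l: exp(-||x - l||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(l, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
landmark = np.array([1.0, 2.0])
print(gaussian_kernel(x, landmark, sigma=1.0))        # 1.0: x is exactly at the landmark
print(gaussian_kernel(x + 5.0, landmark, sigma=1.0))  # ~0: far from the landmark
# a smaller sigma makes the similarity fall off faster with distance
```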
#---------------------------------------------------------------------------------#
Logistic regression vs. SVM
If n (features) is large vs. m (training set)
- e.g. text classification problem
- Feature vector dimension is 10 000
- Training set is 10 - 1000
Then use logistic regression or SVM with a linear kernel
If n is small and m is intermediate
- n = 1 - 1000
- m = 10 - 10 000
- Gaussian kernel is good
If n is small and m is large
- n = 1 - 1000
- m = 50 000+
- SVM will be slow to run with a Gaussian kernel
- In that case:
  - Manually create or add more features
  - Use logistic regression or SVM with a linear kernel
Logistic regression and SVM with a linear kernel are pretty similar
- Do similar things
- Get similar performance
A lot of the SVM's power comes from using different kernels to learn complex non-linear functions
For all these regimes a well designed NN should work
- But, for some of these problems a NN might be slower - a well-implemented SVM would be faster
SVM has a convex optimization problem - so you get a global minimum
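The guidelines above reduce to a simple rule of thumb based on n and m; here is a sketch (my own addition) encoding them, with the thresholds taken from the bullet points:

```python
def suggest_model(n_features, m_examples):
    """Rough heuristic from the lecture: choose by the relative sizes of n and m."""
    if n_features >= m_examples:
        # many features, few examples (e.g. text classification)
        return "logistic regression or SVM with a linear kernel"
    if m_examples <= 10_000:
        # n small, m intermediate
        return "SVM with a Gaussian kernel"
    # n small, m large: a Gaussian-kernel SVM would be slow
    return "add features, then logistic regression or linear-kernel SVM"

print(suggest_model(10_000, 1_000))   # n large relative to m
print(suggest_model(100, 5_000))      # m intermediate
print(suggest_model(100, 100_000))    # m large
```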
#---------------------------------------------------------------------------------#
References
統計學習方法 (Statistical Learning Methods), by Li Hang
理解SVM的三層境界——支持向量機通俗導論 (Understanding SVM at three levels: a popular introduction to support vector machines), by July and pluskid
Stanford Machine Learning (Coursera), by Andrew Ng