Machine Learning Techniques: Support Vector Regression (SVR)

Reposted article, 2021-04-28 23:24. Category: Machine Learning Techniques

Kernel Logistic Regression

The Connection Between SVM and Regularization

The primal optimization problem of the soft-margin SVM is:

\[\begin{aligned} \min _ { b , \mathbf { w } , \xi } & \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \cdot \sum _ { n = 1 } ^ { N } \xi _ { n } \\ \text { s.t. } & y _ { n } \left( \mathbf { w } ^ { T } \mathbf { z } _ { n } + b \right) \geq 1 - \xi _ { n } \text { and } \xi _ { n } \geq 0 \text { for all } n \end{aligned} \]

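It can be rewritten as an unconstrained problem, because at the optimum each \(\xi_n\) equals \(\max \left( 1 - y_n \left( \mathbf{w}^T \mathbf{z}_n + b \right), 0 \right)\):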

\[\min _ { b , \mathbf { w } } \quad \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \underbrace{\sum _ { n = 1 } ^ { N } \max \left( 1 - y _ { n } \left( \mathbf { w } ^ { T } \mathbf { z } _ { n } + b \right) , 0 \right)}_{\widehat { \mathrm { err } }} \]

In short:

\[\min \quad \frac { 1 } { 2 } \mathbf { w } ^ { \top } \mathbf { w } + C \sum \widehat { \mathrm { err } } \]
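The equivalence between the constrained and unconstrained forms can be checked numerically. The sketch below is illustrative only: the function name `soft_margin_objective` and the toy data are assumptions, not part of the original article. For a fixed \((\mathbf{w}, b)\) it computes the hinge errors, which are exactly the smallest slack values \(\xi_n\) that satisfy the constraints.

```python
import numpy as np

def soft_margin_objective(w, b, Z, y, C):
    """Unconstrained soft-margin SVM objective: 0.5 * w'w + C * sum of hinge errors."""
    margins = y * (Z @ w + b)
    hinge = np.maximum(1.0 - margins, 0.0)  # smallest feasible xi_n for this (w, b)
    return 0.5 * w @ w + C * hinge.sum(), hinge

# Toy data, made up for illustration.
Z = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.2, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

value, xi = soft_margin_objective(w, b, Z, y, C=1.0)
print(value, xi)
```

Any smaller \(\xi_n\) would violate a constraint and any larger one would only increase the objective, which is why the slack variables can be eliminated.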

For comparison, L2-regularized learning minimizes:

\[\min \quad \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + \frac { 1 } { N } \sum \mathrm { err } \]

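The two forms are very similar. Why not solve the unconstrained problem directly, then? Because:

It is not a QP, so the dual derivation, and with it the kernel trick, does not apply directly.
The max function is not differentiable everywhere, which makes the problem hard to solve.

A brief comparison of SVM with related models: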

\[\begin{array} { c | c | c } & \text { minimize } & \text { constraint } \\ \hline \text { regularization by constraint } & E _ { \text {in } } & \mathbf { w } ^ { T } \mathbf { w } \leq C \\ \hline \text { hard-margin SVM } & \mathbf { w } ^ { T } \mathbf { w } & E _ { \text {in } } = 0 \text { [and more] } \\ \hline \hline \text { L2 regularization } & \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + E _ { \text {in } } & \\ \hline \text { soft-margin SVM } & \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C N \widehat {E _ { \text {in } }} & \\ \end{array} \]

The following correspondences can be observed:

\[\begin{array} { c } \text { large margin } \Longleftrightarrow \text { fewer hyperplanes } \Longleftrightarrow L 2 \text { regularization of short w } \\ \text { soft margin } \Longleftrightarrow \text { special } \widehat { \text { err } } \\ \text { larger } C \Longleftrightarrow \text { smaller } \lambda \Longleftrightarrow \text { less regularization } \end{array} \]

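That is, a larger margin means fewer candidate hyperplanes, which is similar to the weight shrinkage of L2 regularization. A larger \(C\) corresponds to a smaller \(\lambda\), i.e. weaker regularization.

Viewing SVM as a form of regularization makes it easier to see how it extends and connects to other learning models.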
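The correspondence between \(C\) and \(\lambda\) can be made explicit with a short calculation (a sketch that ignores the fact that \(b\) is not regularized). Scaling an objective by a positive constant does not change its minimizer, so divide the soft-margin objective by \(CN\):

\[\frac{1}{CN} \left( \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{n=1}^{N} \widehat{\mathrm{err}}_n \right) = \frac{1}{2CN} \mathbf{w}^T \mathbf{w} + \frac{1}{N} \sum_{n=1}^{N} \widehat{\mathrm{err}}_n \]

Matching the first term with \(\frac{\lambda}{N} \mathbf{w}^T \mathbf{w}\) gives \(\lambda = \frac{1}{2C}\), so doubling \(C\) halves \(\lambda\).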

The Connection Between SVM and Logistic Regression

Now let the linear score be \(s = \mathbf { w } ^ { T } \mathbf { z } _ { n } + b\). The different error measures can then be written as:

\[\begin{array} { l } \operatorname { err } _ { 0 / 1 } ( s , y ) = [ y s \leq 0 ] \\ \operatorname { err } _ { \text {svm} } ( s , y ) = \max ( 1 - y s , 0 ) \\ \operatorname { err } _ { \text {sce} } ( s , y ) = \log _ { 2 } ( 1 + \exp ( - y s ) ) \end{array} \]

Here \(\operatorname { err } _ { \text {svm} }\) and \(\operatorname { err } _ { \text {sce} }\) are both convex upper bounds of \(\operatorname{ err } _ { 0 / 1 }\), and \(\operatorname { err } _ { \text {sce} }\) (scaled cross-entropy) is the error measure used in Logistic Regression.
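The upper-bound relationship is easy to verify numerically. The following is a minimal sketch that transcribes the three error measures directly; the function names are chosen here for illustration.

```python
import math

def err_01(s, y):
    """0/1 error: 1 when the sign of the score disagrees with the label (or ys = 0)."""
    return 1.0 if y * s <= 0 else 0.0

def err_svm(s, y):
    """Hinge error used by the soft-margin SVM."""
    return max(1.0 - y * s, 0.0)

def err_sce(s, y):
    """Scaled cross-entropy error (base-2 logarithm)."""
    return math.log2(1.0 + math.exp(-y * s))

# Print the three errors for a few values of ys (label fixed to +1).
for ys in (-3.0, -1.0, 0.0, 0.5, 1.0, 3.0):
    print(ys, err_01(ys, 1), err_svm(ys, 1), round(err_sce(ys, 1), 4))
```

At every printed value both convex errors are at least as large as the 0/1 error, and all three coincide at \(ys = 0\).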

The original figure comparing these error measures is not reproduced here. Their limiting behavior is:

\[\begin{array} { ccccc } - \infty & \longleftarrow & y s & \longrightarrow & + \infty \\ \approx - y s & & \widehat { \mathrm { err } } _ { \mathrm { svm } } ( s , y ) & & = 0 \\ \approx - y s & & ( \ln 2 ) \cdot \operatorname { err }_{\text{sce}} ( s , y ) & & \approx 0 \end{array} \]

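So regularized LogReg behaves very much like SVM.

Two-Level Learning

How can the two be combined? In short, logistic regression is run on the data after it has been transformed by the SVM, which is why the approach is called two-level learning. The steps are:

Use SVM to find the separating hyperplane.
Around the hyperplane, use Logistic Regression to learn the actual scores from the distance to the hyperplane. Concretely, the distance is linked to the score through scaling (\(A\), followed by the logistic function \(\theta\)) and shifting (\(B\)).

Mathematically: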

\[g ( \mathbf { x } ) = \theta \left( A \cdot \left( \mathbf { w} _ { \mathrm { svm } } ^ { T } \mathbf { \Phi } ( \mathbf { x } ) + b _ { \mathrm { SVM } } \right) + B \right) \]

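Usually \(A > 0\) and \(B \approx 0\) are to be expected. \(A > 0\) means that the SVM classification is largely correct, and \(B \approx 0\) means that the separating hyperplane is close to the true boundary.

The explicit formulation is: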

\[\min _ { A , B } \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \log \left( 1 + \exp \left( - y _ { n } ( A \cdot ( \underbrace { \mathbf { w } _ { \mathrm { SVM } } ^ { T } \mathbf { \Phi } \left( \mathbf { x } _ { n } \right) + b _ { \mathrm { SVM } } } _ { \Phi _ { \mathrm { SVM } } \left( \mathbf { x } _ { n } \right) } ) + B ) \right) \right) \]

The complete procedure of the two-level learning model is:

\[\begin{array} { l } 1. \text { run SVM on } \mathcal { D } \text { to get } \left( b _ { \mathrm { SVM } } , \mathbf { w } _ { \mathrm { SVM } } \right) [ \text { or the equivalent } \alpha ] , \text { and transform } \mathcal { D } \text { to } \mathbf { z } _ { n } ^ { \prime } = \mathbf { w } _ { \mathrm { SVM } } ^ { T } \mathbf { \Phi } \left( \mathbf { x } _ { n } \right) + b _ { \mathrm { SVM } } \\ \quad \text { - the actual model performs this step in a more complicated manner } \\ 2. \text { run LogReg on } \left\{ \left( \mathbf { z } _ { n } ^ { \prime } , y _ { n } \right) \right\} _ { n = 1 } ^ { N } \text { to get } ( A , B ) \\ \quad \text { - the actual model adds some special regularization here } \\ 3. \text { return } g ( \mathbf { x } ) = \theta \left( A \cdot \left( \mathbf { w } _ { \mathrm { SVM } } ^ { T } \mathbf { \Phi } ( \mathbf { x } ) + b _ { \mathrm { SVM } } \right) + B \right) \end{array} \]
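Steps 2 and 3 can be sketched in a few lines of NumPy. This is a simplified illustration, not the actual model: it assumes the SVM scores \(\mathbf{z}'_n\) from step 1 are already available (here they are simulated), it uses plain gradient descent, and it leaves out the special regularization mentioned above. The names `fit_scale_shift` and `predict_proba` are hypothetical.

```python
import numpy as np

def fit_scale_shift(scores, y, lr=0.1, steps=2000):
    """Fit A, B in theta(A * score + B) by gradient descent on the logistic loss."""
    A, B = 1.0, 0.0
    for _ in range(steps):
        margin = y * (A * scores + B)
        # derivative of log(1 + exp(-margin)) w.r.t. (A * score + B)
        g = -y / (1.0 + np.exp(margin))
        A -= lr * np.mean(g * scores)
        B -= lr * np.mean(g)
    return A, B

def predict_proba(scores, A, B):
    """Step 3: squash the scaled and shifted SVM score with the logistic function."""
    return 1.0 / (1.0 + np.exp(-(A * scores + B)))

# Simulated stand-in for the SVM scores z'_n of step 1 (assumption for illustration).
rng = np.random.default_rng(0)
y = np.where(rng.random(200) < 0.5, 1.0, -1.0)
scores = y * 1.0 + rng.normal(scale=1.0, size=200)

A, B = fit_scale_shift(scores, y)
proba = predict_proba(scores, A, B)
print(A, B)
```

Because the simulated scores mostly agree in sign with the labels, the fitted \(A\) comes out positive, in line with the expectation \(A > 0\) stated earlier.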

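Kernel Logistic Regression

The two-level learning model above is only an approximation of kernel logistic regression. How can true kernel logistic regression be obtained?

The key is that the optimal solution \(\mathbf { w } _ { * }\) satisfies the following condition: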

\[\mathbf { w } _ { * } = \sum _ { n = 1 } ^ { N } \beta _ { n } \mathbf { z } _ { n } \]

This is because \(\mathbf { w } _ { * } ^ { T } \mathbf { z } = \sum _ { n = 1 } ^ { N } \beta _ { n } \mathbf { z } _ { n } ^ { T } \mathbf { z } = \sum _ { n = 1 } ^ { N } \beta _ { n } K \left( \mathbf { x } _ { n } , \mathbf { x } \right)\), so the kernel trick becomes applicable.

Now consider any L2-regularized linear model:

\[\min _ { \mathbf { w } } \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \operatorname { err } \left( y _ { n } , \mathbf { w } ^ { T } \mathbf { z } _ { n } \right) \]

Suppose its optimal solution decomposes into two parts, \(\mathbf { w } _ { \| } \in \operatorname { span } \left( \mathbf { z } _ { n } \right)\) and \(\mathbf { w } _ { \perp } \perp \operatorname { span } \left( \mathbf { z } _ { n } \right)\):

\[\mathbf { w } _ { * } = \mathbf { w } _ { \| } + \mathbf { w } _ { \perp } \]

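If \(\mathbf { w } _ { \perp } \neq \mathbf { 0 }\), then, since the cross term \(\mathbf { w } _ { \| } ^ { T } \mathbf { w } _ { \perp }\) is zero: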

\[\mathbf { w } _ { * } ^ { T } \mathbf { w } _ { * } = \mathbf { w } _ { \| } ^ { T } \mathbf { w } _ { \| } + 2 \mathbf { w } _ { \| } ^ { T } \mathbf { w } _ { \perp } + \mathbf { w } _ { \perp } ^ { T } \mathbf { w } _ { \perp } \quad > \mathbf { w } _ { \| } ^ { T } \mathbf { w } _ { \| } \]

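At the same time, \(\mathbf { w } _ { \perp } ^ { T } \mathbf { z } _ { n } = 0\) for every \(n\), so \(\mathbf { w } _ { \| }\) yields exactly the same err values as \(\mathbf { w } _ { * }\). Hence \(\mathbf { w } _ { \| }\) would be strictly better than \(\mathbf { w } _ { * }\), which contradicts the assumption that \(\mathbf { w } _ { * }\) is optimal. Therefore \(\mathbf { w } _ { \perp } = \mathbf { 0 }\) and \(\mathbf { w } _ { * } = \mathbf { w } _ { \| }\): the optimal \(\mathbf { w } _ { * }\) is a linear combination of the \(\mathbf{z}_n\) and lies in their span within the \(\mathcal{Z}\) space.

So any L2-regularized linear model can be kernelized. For logistic regression the problem becomes: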
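The argument can be illustrated numerically. The sketch below uses made-up random data: it builds a vector inside the span of the \(\mathbf{z}_n\), adds a component orthogonal to the span, and checks that the scores on the training points do not change while \(\mathbf{w}^T\mathbf{w}\) grows.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 5))         # N = 3 samples in a 5-dimensional Z space
beta = np.array([0.5, -1.0, 2.0])   # arbitrary coefficients, for illustration
w_par = Z.T @ beta                  # component inside span(z_n)

# Build a component orthogonal to every z_n by removing the in-span part of v.
v = rng.normal(size=5)
P = np.linalg.pinv(Z) @ Z           # orthogonal projector onto span(z_n)
w_perp = v - P @ v

w = w_par + w_perp
same_scores = np.allclose(Z @ w, Z @ w_par)  # the err terms cannot tell w from w_par
longer = w @ w > w_par @ w_par               # but the regularizer is strictly larger
print(same_scores, longer)
```

Dropping `w_perp` therefore always lowers the regularized objective, which is the contradiction used in the proof.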

\[\min _ { \beta } \frac { \lambda } { N } \sum _ { n = 1 } ^ { N } \sum _ { m = 1 } ^ { N } \beta _ { n } \beta _ { m } K \left( \mathbf { x } _ { n } , \mathbf { x } _ { m } \right) + \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \log \left( 1 + \exp \left( - y _ { n } \sum _ { m = 1 } ^ { N } \beta _ { m } K \left( \mathbf { x } _ { m } , \mathbf { x } _ { n } \right) \right) \right) \]

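This problem can now be optimized with GD, SGD, or similar methods. Note that, although the formulation resembles SVM, the coefficients \(\beta_n\) are usually nonzero, unlike in SVM.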
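A minimal gradient-descent sketch of this optimization is given below. The RBF kernel, the toy two-cluster data, the step size, and the function names (`rbf_kernel`, `klr_objective`, `klr_fit`) are all assumptions made for illustration. The gradient used is \(\frac{2\lambda}{N}\mathbf{K}\boldsymbol\beta + \frac{1}{N}\mathbf{K}\mathbf{v}\) with \(v_n = -y_n / (1 + \exp(y_n s_n))\) and \(\mathbf{s} = \mathbf{K}\boldsymbol\beta\).

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian kernel matrix between the rows of X1 and X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def klr_objective(beta, K, y, lam):
    """Regularizer (lam/N) beta'K beta plus mean logistic loss on the kernel scores."""
    N = len(y)
    s = K @ beta
    return lam / N * beta @ K @ beta + np.mean(np.logaddexp(0.0, -y * s))

def klr_fit(K, y, lam=0.1, lr=0.1, steps=1000):
    """Plain gradient descent on beta, starting from zero."""
    N = len(y)
    beta = np.zeros(N)
    for _ in range(steps):
        s = K @ beta
        v = -y / (1.0 + np.exp(y * s))           # derivative of the loss w.r.t. s
        grad = (2.0 * lam / N) * (K @ beta) + (K @ v) / N
        beta -= lr * grad
    return beta

# Toy data: two Gaussian clusters (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.5, 0.5, size=(20, 2)),
               rng.normal(-1.5, 0.5, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

K = rbf_kernel(X, X, gamma=1.0)
beta = klr_fit(K, y)
train_acc = np.mean(np.sign(K @ beta) == y)
print(train_acc, np.count_nonzero(beta))
```

Printing `np.count_nonzero(beta)` makes the density of the solution visible: gradient descent gives no coefficient a reason to be exactly zero.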

Kernel Ridge Regression

The primal ridge regression problem is:

\[\min _ { \mathbf { w } } \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \left( y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } \right) ^ { 2 } \]

It was shown above that any L2-regularized linear model can be kernelized, so kernel ridge regression can be written as:

\[\begin{aligned} \min _ { \boldsymbol { \beta } } & \underbrace{\frac { \lambda } { N } \sum _ { n = 1 } ^ { N } \sum _ { m = 1 } ^ { N } \beta _ { n } \beta _ { m } K \left( \mathbf { x } _ { n } , \mathbf { x } _ { m } \right)}_{ \text {regularization of } \boldsymbol { \beta } \text{ (K-based regularizer)} } + \frac { 1 } { N } \underbrace { \sum _ { n = 1 } ^ { N } \left( y _ { n } - \sum _ { m = 1 } ^ { N } \beta _ { m } K \left( \mathbf { x } _ { n } , \mathbf { x } _ { m } \right) \right) ^ { 2 } } _ { \text {linear regression of } \boldsymbol { \beta } \text{ on K-based features} }\\ & = \frac { \lambda } { N } \boldsymbol { \beta } ^ { T } \mathbf { K } \boldsymbol { \beta } + \frac { 1 } { N } \left( \boldsymbol { \beta } ^ { T } \mathbf { K } ^ { T } \mathbf { K } \boldsymbol { \beta } - 2 \boldsymbol { \beta } ^ { T } \mathbf { K } ^ { T } \mathbf { y } + \mathbf { y } ^ { T } \mathbf { y } \right) \end{aligned} \]

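In other words, all the kernel tricks introduced earlier can be used here.

The coefficients \(\beta\) can be found analytically by setting the gradient of the objective to zero. The gradient is: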

\[\nabla E _ { \mathrm { aug } } ( \beta ) = \frac { 2 } { N } \left( \lambda \mathrm { K } ^ { T } \mathrm { I } \beta + \mathrm { K } ^ { T } \mathrm { K } \beta - \mathrm { K } ^ { T } \mathbf { y } \right) = \frac { 2 } { N } \mathrm { K } ^ { T } ( ( \lambda \mathrm { I } + \mathrm { K } ) \beta - \mathrm { y } ) \]

Setting it to zero, one solution is:

\[\beta = ( \lambda I + K ) ^ { - 1 } y \]

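Because \(K\) is positive semi-definite (by Mercer's condition), \(\lambda I + K\) is positive definite whenever \(\lambda > 0\), so \(( \lambda I + K ) ^ { - 1 }\) always exists. Inverting this dense \(N \times N\) matrix takes \(O(N^3)\) time. As with kernel logistic regression, and unlike SVM, the coefficients \(\beta_n\) are usually nonzero.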
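The closed form is short enough to check directly. The sketch below uses made-up data and a linear kernel \(\mathbf{K} = \mathbf{Z}\mathbf{Z}^T\), so that the kernel solution can be compared with the primal ridge solution \(\mathbf{w} = (\mathbf{Z}^T\mathbf{Z} + \lambda\mathbf{I})^{-1}\mathbf{Z}^T\mathbf{y}\). The function name `krr_fit` is chosen here for illustration.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Solve (lam * I + K) beta = y without forming the inverse explicitly."""
    return np.linalg.solve(lam * np.eye(len(y)) + K, y)

# Toy regression data, made up for illustration.
rng = np.random.default_rng(0)
Z = rng.normal(size=(30, 3))
y = Z @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)
lam = 0.5

K = Z @ Z.T                          # linear kernel
beta = krr_fit(K, y, lam)
w_from_beta = Z.T @ beta             # w = sum_n beta_n z_n

# Primal ridge regression solution for the same objective.
w_primal = np.linalg.solve(Z.T @ Z + lam * np.eye(3), Z.T @ y)
print(w_from_beta, w_primal)
```

With a nonlinear kernel only `K` changes; prediction at a new point \(\mathbf{x}\) is then \(\sum_n \beta_n K(\mathbf{x}_n, \mathbf{x})\).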

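Support Vector Regression

Tube Regression

Tube regression counts no error within a fixed range around the regression line. That is, if the difference between the actual value and the prediction is within that range, the prediction is treated as error-free: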

\[\begin{array} { l } | s - y | \leq \epsilon : 0 \\ | s - y | > \epsilon : | s - y | - \epsilon \end{array} \]

That is,

\[\operatorname { err } ( y , s ) = \max ( 0 , | s - y | - \epsilon ) \]

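This error is called the \(\epsilon\)-insensitive error.

Compared with the squared error \(\operatorname { err } ( y , s ) = ( s - y ) ^ { 2 }\), the two are similar when \(| s - y |\) is small. When \(| s - y |\) is large, however, the \(\epsilon\)-insensitive error grows only linearly, so it is less affected by noise.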
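A small numeric comparison makes the difference in growth concrete. This is a minimal sketch; the function names and the choice \(\epsilon = 0.5\) are assumptions for illustration.

```python
def err_tube(y, s, eps):
    """Epsilon-insensitive error: zero inside the tube, linear outside."""
    return max(0.0, abs(s - y) - eps)

def err_sqr(y, s):
    """Squared error."""
    return (s - y) ** 2

# Compare the two errors as the residual |s - y| grows (y fixed at 0).
for residual in (0.0, 0.5, 1.0, 2.0, 10.0):
    print(residual, err_tube(0.0, residual, eps=0.5), err_sqr(0.0, residual))
```

For the largest residual the squared error is an order of magnitude bigger than the tube error, which is why a single outlier pulls a squared-error fit much harder.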

The optimization objective of L2-regularized tube regression is then:

\[\min _ { \mathbf { w } } \frac { \lambda } { N } \mathbf { w } ^ { T } \mathbf { w } + \frac { 1 } { N } \sum _ { n = 1 } ^ { N } \max \left( 0 , \left| \mathbf { w } ^ { T } \mathbf { z } _ { n } - y _ { n } \right| - \epsilon \right) \]

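Standard Support Vector Regression

To obtain the advantage of SVM (sparse coefficients), L2-regularized tube regression is reparameterized in the SVM style, with \(C\) in place of \(\lambda\) and the bias \(b\) separated out: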

\[\min _ { b , \mathbf { w } } \quad \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \sum _ { n = 1 } ^ { N } \max \left( 0 , \left| \mathbf { w } ^ { T } \mathbf { z } _ { n } + b - y _ { n } \right| - \epsilon \right) \]

Because the max operation here is not differentiable, the problem is converted into a constrained form:

\[\begin{aligned} \min _ { b , \mathbf { w } , \xi ^ { \vee } , \xi ^ { \wedge } } & \frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \sum _ { n = 1 } ^ { N } \left( \xi _ { n } ^ { \vee } + \xi _ { n } ^ { \wedge } \right) \\ \text { s.t. } & - \epsilon - \xi _ { n } ^ { \vee } \leq y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \leq \epsilon + \xi _ { n } ^ { \wedge } \\ & \xi _ { n } ^ { \vee } \geq 0 , \xi _ { n } ^ { \wedge } \geq 0 \end{aligned} \]

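Because the tube has both an upper and a lower boundary, two slack variables per sample are needed, \(\xi _ { n } ^ { \vee }\) and \(\xi _ { n } ^ { \wedge }\), one more than in SVM. The problem is now a QP and can be solved with quadratic programming.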
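For a fixed \((\mathbf{w}, b)\), the smallest slacks allowed by the constraints are \(\xi_n^\wedge = \max(0, y_n - \mathbf{w}^T\mathbf{z}_n - b - \epsilon)\) and \(\xi_n^\vee = \max(0, \mathbf{w}^T\mathbf{z}_n + b - y_n - \epsilon)\), and their sum is exactly the \(\epsilon\)-insensitive error. The sketch below checks this on made-up data; the function names are hypothetical.

```python
import numpy as np

def svr_slacks(w, b, Z, y, eps):
    """Smallest slack values satisfying the SVR constraints for a given (w, b)."""
    r = y - (Z @ w + b)                  # residual y_n - w'z_n - b
    xi_up = np.maximum(r - eps, 0.0)     # xi^wedge: target above the tube
    xi_lo = np.maximum(-r - eps, 0.0)    # xi^vee: target below the tube
    return xi_lo, xi_up

def svr_primal_objective(w, b, Z, y, C, eps):
    """0.5 * w'w + C * sum(xi^vee + xi^wedge) at the smallest feasible slacks."""
    xi_lo, xi_up = svr_slacks(w, b, Z, y, eps)
    return 0.5 * w @ w + C * (xi_lo + xi_up).sum()

# Toy data, made up for illustration.
Z = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.0, 3.0, 2.0])
w = np.array([1.0])
b, eps = 0.0, 0.5

xi_lo, xi_up = svr_slacks(w, b, Z, y, eps)
tube_err = np.maximum(0.0, np.abs(Z @ w + b - y) - eps)
obj = svr_primal_objective(w, b, Z, y, C=1.0, eps=eps)
print(xi_lo, xi_up, obj)
```

At most one of the two slacks is nonzero for any sample, since a point cannot lie above and below the tube at the same time.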

Dual Support Vector Regression

To use the kernel trick, the dual problem still has to be solved. Two Lagrange multipliers, \(\alpha_ { n } ^ { \vee }\) and \(\alpha_ { n } ^ { \wedge }\), are introduced:

\[\begin{array} { l l l } \text { objective function } & &\frac { 1 } { 2 } \mathbf { w } ^ { T } \mathbf { w } + C \sum _ { n = 1 } ^ { N } \left( \xi _ { n } ^ { \vee } + \xi _ { n } ^ { \wedge } \right) \\ \text { Lagrange multiplier } \alpha _ { n } ^ { \wedge } & \text { for } & y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \leq \epsilon + \xi _ { n } ^ { \wedge } \\ \text { Lagrange multiplier } \alpha _ { n } ^ { \vee } & \text { for } & - \epsilon - \xi _ { n } ^ { \vee } \leq y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \end{array} \]

Some of the KKT conditions are:

\[\begin{array} { l } \frac { \partial \mathcal { L } } { \partial w _ { i } } = 0 : \mathbf { w } = \sum _ { n = 1 } ^ { N } \underbrace { \left( \alpha _ { n } ^ { \wedge } - \alpha _ { n } ^ { \vee } \right) } _ { \beta_n } \mathbf { z } _ { n } \\ \frac { \partial \mathcal { L } } { \partial b } = 0 : \sum _ { n = 1 } ^ { N } \left( \alpha _ { n } ^ { \wedge } - \alpha _ { n } ^ { \vee } \right) = 0 \\ \alpha _ { n } ^ { \wedge } \left( \epsilon + \xi _ { n } ^ { \wedge } - y _ { n } + \mathbf { w } ^ { T } \mathbf { z } _ { n } + b \right) = 0 \\ \alpha _ { n } ^ { \vee } \left( \epsilon + \xi _ { n } ^ { \vee } + y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \right) = 0 \end{array} \]

Following a derivation similar to that of SVM, the dual problem is:

\[\begin{aligned} \min & \frac { 1 } { 2 } \sum _ { n = 1 } ^ { N } \sum _ { m = 1 } ^ { N } \left( \alpha _ { n } ^ { \wedge } - \alpha _ { n } ^ { \vee } \right) \left( \alpha _ { m } ^ { \wedge } - \alpha _ { m } ^ { \vee } \right) k _ { n , m } \\ & + \sum _ { n = 1 } ^ { N } \left( \left( \epsilon - y _ { n } \right) \cdot \alpha _ { n } ^ { \wedge } + \left( \epsilon + y _ { n } \right) \cdot \alpha _ { n } ^ { \vee } \right) \\ \text { s.t. } & \sum _ { n = 1 } ^ { N } \left( \alpha _ { n } ^ { \wedge } - \alpha _ { n } ^ { \vee } \right) = 0 \\ & 0 \leq \alpha _ { n } ^ { \wedge } \leq C , 0 \leq \alpha _ { n } ^ { \vee } \leq C \end{aligned} \]

Sparsity of the coefficients:
When \(\left| \mathbf { w } ^ { T } \mathbf { z } _ { n } + b - y _ { n } \right| < \epsilon\), i.e. the sample lies strictly inside the tube, we have:

\[\begin{array} { l } \Longrightarrow \xi _ { n } ^ { \wedge } = 0 \text { and } \xi _ { n } ^ { \vee } = 0 \\ \Longrightarrow \left( \epsilon + \xi _ { n } ^ { \wedge } - y _ { n } + \mathbf { w } ^ { T } \mathbf { z } _ { n } + b \right) \neq 0 \text { and } \left( \epsilon + \xi _ { n } ^ { \vee } + y _ { n } - \mathbf { w } ^ { T } \mathbf { z } _ { n } - b \right) \neq 0 \\ \Longrightarrow \alpha _ { n } ^ { \wedge } = 0 \text { and } \alpha _ { n } ^ { \vee } = 0 \\ \Longrightarrow \beta _ { n } = 0 \end{array} \]

So \(\beta\) is sparse. In SVR, the samples on or outside the tube (those with \(\beta_n \neq 0\)) are called support vectors.
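Once the dual is solved, prediction needs only kernel evaluations on the support vectors. The following is a sketch of the standard consequence of the KKT conditions above, where a multiplier strictly between \(0\) and \(C\) forces the corresponding slack to zero, so the sample lies exactly on the tube boundary and determines \(b\):

\[\begin{aligned} g ( \mathbf { x } ) & = \sum _ { n = 1 } ^ { N } \beta _ { n } K \left( \mathbf { x } _ { n } , \mathbf { x } \right) + b , \quad \beta _ { n } = \alpha _ { n } ^ { \wedge } - \alpha _ { n } ^ { \vee } \\ 0 < \alpha _ { n } ^ { \wedge } < C & \Longrightarrow \xi _ { n } ^ { \wedge } = 0 \Longrightarrow b = y _ { n } - \epsilon - \sum _ { m = 1 } ^ { N } \beta _ { m } K \left( \mathbf { x } _ { m } , \mathbf { x } _ { n } \right) \\ 0 < \alpha _ { n } ^ { \vee } < C & \Longrightarrow \xi _ { n } ^ { \vee } = 0 \Longrightarrow b = y _ { n } + \epsilon - \sum _ { m = 1 } ^ { N } \beta _ { m } K \left( \mathbf { x } _ { m } , \mathbf { x } _ { n } \right) \end{aligned} \]

Since \(\beta_m = 0\) for every sample strictly inside the tube, the sums run over the support vectors only.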

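Summary of Linear and Kernel Models

(Figure: a four-row map of linear and kernel models; the image is not reproduced here.)

Row 1 is rarely used because of its poor performance.
Row 2 is commonly handled with the LIBLINEAR toolbox.
Row 3 is also rarely used, because of its dense \(\beta\).
Row 4 is commonly handled with the LIBSVM toolbox.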
