Regularization
The Problem of Overfitting
What overfitting is, and how regularization can be used to improve or reduce it.
Example: Linear regression (housing prices)
Fit a linear regression model to a training set of five examples, in the three ways shown in the figures below.
Fitting a straight line to the training data (figure 1) gives underfitting, also called high bias: the model fails to capture the training data well.
Fitting a quadratic curve (figure 2) works well; this case is called "just right".
Fitting a fourth-order polynomial (figure 3) matches the training data very closely, but the resulting curve is clearly unrealistic. This case is called overfitting, or high variance.
Overfitting: If we have too many features, the learned hypothesis may fit the training set very well (\(\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 \approx 0\)), but fail to generalize to new examples, e.g. fail to predict prices for new houses.
Overfitting: with too many features, the learned hypothesis can always fit the training data very well (the cost function will be very close to \(0\)), but the curve tries so hard to match the training examples that it fails to generalize to new ones ("generalize" refers to a model's ability to apply to new examples).
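A quick way to see this numerically: fit polynomials of increasing degree to a handful of points and compare training error. This is a sketch using NumPy with made-up data (the numbers are only illustrative, not from the course).

```python
import numpy as np

# Five training points (x, y) roughly following a quadratic trend
# (values are invented for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 3.9, 9.3, 15.8, 25.1])

def train_error(degree):
    """Sum of squared residuals of a degree-`degree` polynomial fit."""
    coeffs = np.polyfit(x, y, degree)
    residuals = np.polyval(coeffs, x) - y
    return np.sum(residuals ** 2)

# A degree-4 polynomial through 5 points interpolates them exactly,
# so its training error is numerically zero: the overfitting scenario.
print(train_error(1))  # straight line: large error (underfitting)
print(train_error(4))  # quartic: essentially zero error (overfitting)
```

Low training error alone says nothing about how the quartic behaves between or beyond the training points, which is exactly the problem described above.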
Overfitting in logistic regression
Fit a logistic regression model to the following training set, again in the three ways shown below.
Figure 1: underfitting.
Figure 2: just right.
Figure 3: overfitting.
Addressing overfitting
- Reduce the number of features.
  - Manually select which features to keep.
  - Model selection algorithms (later in the course) automatically choose which features to keep and which to discard.
- Regularization.
  - Keep all the features, but reduce the magnitude of the parameters \(\theta_j\).
  - Works well when we have many features, each of which contributes a little to predicting \(y\).
Cost Function: implementing regularization through the cost function
Intuition
Suppose we fit a fourth-order polynomial. By adding penalty terms for \(\theta_3\) and \(\theta_4\) to the cost function, we can control how large \(\theta_3\) and \(\theta_4\) become.
Suppose we penalize \(\theta_3\) and \(\theta_4\) and make them really small:
\(\rightarrow\) \(\min_\theta\frac{1}{2m}\sum_{i = 1}^m(h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2\)
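As a concrete sketch of this idea (with invented data, not from the course): a squared penalty on \(\theta_3\) and \(\theta_4\) only gives the closed-form solution \(\theta = (X^TX + D)^{-1}X^Ty\) with \(D = \mathrm{diag}(0,0,0,1000,1000)\), and it visibly drives those two coefficients toward zero.

```python
import numpy as np

# Quartic design matrix: columns 1, x, x^2, x^3, x^4 (illustrative data).
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 2.3, 4.1, 7.2, 11.5])
X = np.vander(x, 5, increasing=True)

# Penalize only theta_3 and theta_4 with weight 1000:
# theta = (X'X + D)^{-1} X'y, with D = diag(0, 0, 0, 1000, 1000).
D = np.diag([0.0, 0.0, 0.0, 1000.0, 1000.0])
theta_plain = np.linalg.solve(X.T @ X, X.T @ y)    # ordinary least squares
theta_pen = np.linalg.solve(X.T @ X + D, X.T @ y)  # penalized version

print(theta_plain[3:])  # cubic/quartic coefficients of the unpenalized fit
print(theta_pen[3:])    # the same coefficients, now shrunk toward zero
```

The penalized fit is therefore close to a quadratic, even though the model still has all five parameters.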
Regularization
Small values for the parameters \(\theta_0,\theta_1,\ldots,\theta_n\) give us:
  - a "simpler" hypothesis;
  - a hypothesis less prone to overfitting (the smaller the \(\theta\) values, the smoother the curve).
Housing example:
  - Features: \(x_1, x_2, \ldots, x_{100}\)
  - Parameters: \(\theta_0,\theta_1,\theta_2,\ldots,\theta_{100}\)
In regularized linear regression, we choose \(\theta\) to minimize
\(J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2+\lambda\sum_{j=1}^n\theta_j^2\right]\).
The last term is called the regularization term, and \(\lambda\) is called the regularization parameter.
What if \(\lambda\) is set to an extremely large value (perhaps too large for our problem, say \(\lambda = 10^{10}\))?
If \(\lambda\) is too large, \(\theta_1, \ldots, \theta_n\) will all be close to \(0\); the hypothesis then becomes nearly a horizontal line, which underfits the data.
Regularized Linear Regression
Gradient descent
We modify the original algorithm slightly: pull the update for \(\theta_0\) out on its own, and add the regularization term to the updates for the remaining parameters:
Repeat{
\(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\)
\(\theta_j := \theta_j - \alpha\frac{1}{m}\left[ \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\lambda\theta_j\right]\) (j = 1,2,3,...,n)
}
The reason \(\theta_0\) is updated separately is that in regularized linear regression we do not penalize \(\theta_0\).
The update for \(\theta_j\) above can equivalently be written as: \(\theta_j := \theta_j\left(1- \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\)
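The update rule can be sketched in code as follows: a minimal NumPy version with made-up data (with \(\lambda = 0\) it reduces to ordinary gradient descent).

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent step for linear regression."""
    m = len(y)
    grad = (X.T @ (X @ theta - y)) / m   # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                         # theta_0 is not penalized
    return theta - alpha * (grad + reg)

# Tiny example: X has an intercept column of ones; data lie on y = 1 + x.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = np.zeros(2)
for _ in range(5000):
    theta = gradient_step(theta, X, y, alpha=0.1, lam=0.0)
print(theta)  # approaches [1, 1], the exact fit
```

Setting `lam` to a positive value shrinks `theta[1:]` toward zero while leaving the intercept unpenalized, mirroring the separate \(\theta_0\) update above.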
Normal equation
\(X= \left[ \begin{matrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T \end{matrix} \right]\) \(y = \left[ \begin{matrix} y^{(1)} \\ \vdots \\ y^{(m)} \end{matrix} \right]\) \(\rightarrow\) \(\min_\theta J(\theta)\)
\(\rightarrow\) \(\theta = \left(X^TX + \lambda\,\mathrm{diag}(0,1,1,\ldots,1)_{(n+1)}\right)^{-1}X^Ty\)
Non-invertibility (optional/advanced)
Suppose \(m \leq n\). Then \(X^TX\) is non-invertible (singular), so \(\theta = (X^TX)^{-1}X^Ty\) cannot be computed.
If \(\lambda > 0\), the matrix \(X^TX + \lambda\,\mathrm{diag}(0,1,1,\ldots,1)_{(n+1)}\) is invertible, and \(\theta = \left(X^TX + \lambda\,\mathrm{diag}(0,1,1,\ldots,1)_{(n+1)}\right)^{-1}X^Ty\).
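This can be sketched in NumPy with illustrative data: for \(m \le n\) the matrix \(X^TX\) alone is singular, but adding \(\lambda\,\mathrm{diag}(0,1,\ldots,1)\) with \(\lambda > 0\) makes the system solvable.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    """Regularized normal equation; the intercept term is not penalized."""
    n = X.shape[1]
    D = np.eye(n)
    D[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

# m <= n case: 2 examples, 3 parameters. X'X is a singular 3x3 matrix,
# yet X'X + lambda*D is invertible for lambda > 0.
X = np.array([[1.0, 2.0, 3.0], [1.0, 4.0, 5.0]])
y = np.array([1.0, 2.0])
theta = ridge_normal_equation(X, y, lam=1.0)
print(theta)
```

Regularization thus also fixes the numerical problem: the system has a unique solution even when there are fewer examples than parameters.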
Regularized Logistic Regression
How to adapt gradient descent and the advanced optimization methods so they can be applied to regularized logistic regression.
Cost function:
\(J(\theta) = -\frac{1}{m}\left[ \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2\)
The gradient descent updates are as follows, where \(h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\). Although the updates look identical to those for regularized linear regression, the hypothesis \(h_\theta\) is different.
Repeat{
\(\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\)
\(\theta_j := \theta_j - \alpha\frac{1}{m}\left[ \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}+\lambda\theta_j\right]\) (j = 1,2,3,...,n)
}
Advanced optimization
A user-defined cost function (pseudocode):
function [jVal, gradient] = costFunction(theta)
jVal = [code to compute J(\(\theta\))]
gradient(1) = [code to compute \(\frac{\partial}{\partial\theta_0}J(\theta)\)]
gradient(2) = [code to compute \(\frac{\partial}{\partial\theta_1}J(\theta)\)]
gradient(3) = [code to compute \(\frac{\partial}{\partial\theta_2}J(\theta)\)]
\(\vdots\)
gradient(n+1) = [code to compute \(\frac{\partial}{\partial\theta_n}J(\theta)\)]
where:
- code to compute \(J(\theta)\): \(J(\theta) = -\frac{1}{m}\left[ \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^n\theta_j^2\)
- code to compute \(\frac{\partial}{\partial\theta_0} J(\theta)\): \(\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}\)
- code to compute \(\frac{\partial}{\partial\theta_1} J(\theta)\): \(\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_1^{(i)} + \frac{\lambda}{m}\theta_1\)
- code to compute \(\frac{\partial}{\partial\theta_2} J(\theta)\): \(\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_2^{(i)} + \frac{\lambda}{m}\theta_2\)
All that remains is to pass this custom function to fminunc.
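For comparison, here is a Python/NumPy sketch of the same cost-and-gradient computation (the function and variable names here are my own, not from the course code; the Octave original returns `[jVal, gradient]`, this version returns the pair `(J, grad)`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_function_reg(theta, X, y, lam):
    """Regularized logistic-regression cost and gradient."""
    m = len(y)
    h = sigmoid(X @ theta)
    reg_cost = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # skip theta_0
    J = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m + reg_cost
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]                    # skip theta_0
    return J, grad

# Quick sanity check with all-zero theta: h = 0.5 everywhere and the
# regularization term vanishes, so J = log(2).
X = np.array([[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])
J, grad = cost_function_reg(np.zeros(2), X, y, lam=1.0)
print(J)  # ≈ 0.693 (log 2)
```

Such a function can be handed directly to a generic optimizer (the Python analogue of fminunc would be something like `scipy.optimize.minimize` with `jac=True`).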
Review
Quiz
- You are training a classification model with logistic regression. Which of the following statements are true? Check all that apply.
- [x] Adding a new feature to the model always results in equal or better performance on the training set. (A new feature makes the hypothesis more expressive, so it can fit the training set at least as well.)
- [ ] Adding a new feature to the model always results in equal or better performance on examples not in the training set. (If overfitting occurs, the model generalizes worse to new examples.)
- [ ] Introducing regularization to the model always results in equal or better performance on the training set. (If \(\lambda\) is too large, the model underfits, which hurts performance on the training set as well as on new examples.)
- [ ] Introducing regularization to the model always results in equal or better performance on examples not in the training set.
- [ ] Adding many new features to the model helps prevent overfitting on the training set. (More features let the model fit the data more closely, which makes overfitting more likely.)
- Suppose you ran logistic regression twice, once with \(\lambda = 0\), and once with \(\lambda = 1\). One of the times, you got parameters \(\theta = \left[ \begin{matrix} 74.81 \\ 45.05 \end{matrix} \right]\), and the other time you got \(\theta = \left[ \begin{matrix} 1.37 \\ 0.51 \end{matrix} \right]\). However, you forgot which value of \(\lambda\) corresponds to which value of \(\theta\). Which one do you think corresponds to \(\lambda = 1\)?
- [x] \(\theta = \left[ \begin{matrix} 1.37 \\ 0.51 \end{matrix} \right]\).
- [ ] \(\theta = \left[ \begin{matrix} 74.81 \\ 45.05 \end{matrix} \right]\).
- Which of the following statements about regularization are true? Check all that apply.
- [ ] Because logistic regression outputs values \(0 \leq h_\theta(x) \leq 1\), its range of output values can only be "shrunk" slightly by regularization anyway, so regularization is generally not helpful for it. (Regularization addresses overfitting, not the output range.)
- [ ] Using a very large value of \(\lambda\) cannot hurt the performance of your hypothesis; the only reason we do not set \(\lambda\) to be too large is to avoid numerical problems. (If \(\lambda\) is too large, \(\theta_1, \ldots, \theta_n\) will be close to \(0\); the hypothesis then becomes nearly a horizontal line and underfits the data.)
- [ ] Using too large a value of \(\lambda\) can cause your hypothesis to overfit the data; this can be avoided by reducing \(\lambda\). (Too large a \(\lambda\) causes underfitting, not overfitting.)
- [x] Consider a classification problem. Adding regularization may cause your classifier to incorrectly classify some training examples (which it had correctly classified when not using regularization, i.e. when \(\lambda = 0\)). (With a poorly chosen \(\lambda\), the regularized model can do worse on the training set than the unregularized one.)
- [ ] Because regularization causes \(J(\theta)\) to no longer be convex, gradient descent may not always converge to the global minimum (when \(\lambda > 0\), and when using an appropriate learning rate \(\alpha\)). (The regularized cost functions for both logistic and linear regression remain convex, so gradient descent still converges to the global minimum.)
Programming exercises
- plotData.m

```matlab
% Find indices of positive and negative examples
pos = find(y == 1);
neg = find(y == 0);
% Plot examples
plot(X(pos, 1), X(pos, 2), 'k+', 'LineWidth', 2, 'MarkerSize', 7);
plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7);
```
- sigmoid.m

```matlab
g = 1 ./ (1 + exp(-z));
```
- costFunction.m

```matlab
J = 1 / m * (-y' * log(sigmoid(X * theta)) - (1 - y)' * log(1 - sigmoid(X * theta)));
grad = 1 / m * X' * (sigmoid(X * theta) - y);
```
- predict.m

```matlab
p = sigmoid(X * theta) >= 0.5;
```
- costFunctionReg.m

```matlab
J = 1 / m * (-y' * log(sigmoid(X * theta)) - (1 - y)' * log(1 - sigmoid(X * theta))) ...
    + lambda / (2 * m) * theta(2:end)' * theta(2:end);
grad = 1 / m * X' * (sigmoid(X * theta) - y) + lambda / m * theta;
grad(1) = grad(1) - lambda / m * theta(1);
```