Logistic Regression
Classification
Examples
Email: Spam / Not Spam?
Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?
\(y \in \{0, 1\}\): the variable \(y\) to be predicted takes only the two values \(0\) and \(1\).
\(0\): "Negative Class" (e.g., benign tumor). The class labeled \(0\) is usually called the "negative class".
\(1\): "Positive Class" (e.g., malignant tumor). The class labeled \(1\) is usually called the "positive class".
Using linear regression for a classification problem
If \(h_\theta(x) \geq 0.5\), predict "\(y = 1\)"
If \(h_\theta(x) < 0.5\), predict "\(y = 0\)"
The problem
In a classification problem the target \(y\) can only be \(0\) or \(1\), whereas \(h_\theta(x)\) can be \(>1\) or \(<0\).
--> Logistic regression: \(0 \leq h_\theta(x) \leq 1\) (despite the "regression" in its name, it is actually a classification algorithm).
Hypothesis Representation
What kind of function should we use to represent our hypothesis in a classification problem?
Logistic Regression Model
Want \(0 \leq h_\theta(x) \leq 1\)
--> Let \(h_\theta(x) = g(\theta^Tx)\), where \(g(z) = \frac{1}{1 + e^{-z}}\) is called the sigmoid function (or logistic function, which is where the name "logistic regression" comes from).
--> \(h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}\)
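As a minimal Octave sketch (the file and function name sigmoid.m are my choice, not something the course prescribes):

% sigmoid.m -- g(z) = 1 / (1 + e^(-z)), applied element-wise
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end

Given a design matrix X whose first column is all ones and a parameter vector theta, the hypothesis values for every training example are then sigmoid(X * theta).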

Interpretation of Hypothesis Output
- \(h_\theta(x)\) = estimated probability that \(y = 1\) on input \(x\).
- Example: If \(x = \left[ \begin{matrix} x_0 \\ x_1 \end{matrix} \right] = \left[ \begin{matrix} 1 \\ tumorSize \end{matrix} \right]\) and \(h_\theta(x) = 0.7\), tell the patient there is a 70% chance of the tumor being malignant.
- \(h_\theta(x) = P(y = 1|x; \theta)\), the "probability that \(y = 1\), given \(x\), parameterized by \(\theta\)". In the example above, \(x\) holds my patient's feature (the tumor size).
- \(P(y = 0|x; \theta) + P(y = 1|x; \theta) = 1\), so \(P(y = 0|x; \theta) = 1 - P(y = 1|x; \theta)\).
Decision Boundary
Logistic regression
\(h_\theta(x) = g(\theta^Tx)\), \(g(z) = \frac{1}{1 + e^{-z}}\).
Suppose predict "\(y = 1\)" if \(h_\theta(x) \geq 0.5\), predict "\(y = 0\)" if \(h_\theta(x) < 0.5\)
\(\because g(z) \geq 0.5\) when \(z \geq 0\)
\(\therefore h_\theta(x) = g(\theta^Tx) \geq 0.5\) when \(\theta^Tx \geq 0\).
\(\rightarrow\) Predict \(y = 1\) when \(\theta^Tx \geq 0\); predict \(y = 0\) when \(\theta^Tx < 0\).
Decision Boundary
Suppose we have the training set shown in the figure below, with hypothesis \(h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)\). Given \(\theta = \left[ \begin{matrix} -3 \\ 1 \\ 1 \end{matrix} \right]\), for which inputs does the classifier predict "y = 1"?

If we visualize the hypothesis, we get the boundary line shown in the figure below; this line is called the decision boundary.

Specifically, the points on this line are those where \(h_\theta(x) = 0.5\), i.e. where \(-3 + x_1 + x_2 = 0\). It divides the plane into two regions: the region where the hypothesis predicts \(y = 1\) (\(x_1 + x_2 \geq 3\)) and the region where it predicts \(y = 0\).

[Note] The decision boundary is a property of the hypothesis and its parameters \(\theta_0, \theta_1, \theta_2\); it is not a property of the data set.
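As a rough Octave check (the test point is an arbitrary illustration of mine): with \(\theta = [-3; 1; 1]\), the classifier predicts y = 1 exactly when \(x_1 + x_2 \geq 3\).

theta = [-3; 1; 1];
x = [1; 2; 2.5];                  % x0 = 1, x1 = 2, x2 = 2.5, so x1 + x2 = 4.5 >= 3
h = 1 / (1 + exp(-theta' * x));   % sigmoid(1.5), roughly 0.82
prediction = (h >= 0.5)           % 1, i.e. predict y = 1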
Non-linear decision boundaries
For the example in the figure below, let the hypothesis be \(h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2)\). Given \(\theta = \left[ \begin{matrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{matrix} \right]\), for which inputs is y = 1 predicted?

Visualizing the decision boundary: with this \(\theta\), y = 1 is predicted whenever \(-1 + x_1^2 + x_2^2 \geq 0\), i.e. on and outside the unit circle \(x_1^2 + x_2^2 = 1\).

Cost Function
A first attempt at a cost function
Linear regression: \(J(\theta) = \frac{1}{m}\sum_{i=1}^m\frac{1}{2}(h_\theta(x^{(i)})-y^{(i)})^2\)
Let \(Cost(h_\theta(x), y) = \frac{1}{2}(h_\theta(x)-y)^2\)
Because the sigmoid makes \(h_\theta(x)\) non-linear, with this choice \(J(\theta)\) is non-convex (left figure below); but to use gradient descent we need \(J(\theta)\) to be convex (right figure below) so that it is guaranteed to reach the global minimum.

To address this, we define a new cost function.
Logistic regression cost function
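For a single training example the cost is defined piecewise (the compact one-line form in the next section is equivalent):
- If \(y = 1\): \(Cost(h_\theta(x), y) = -\log(h_\theta(x))\)
- If \(y = 0\): \(Cost(h_\theta(x), y) = -\log(1 - h_\theta(x))\)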
Detailed explanation for the \(y = 1\) case: when the hypothesis outputs \(1\) and the actual label is \(1\), the cost is \(0\); but when the hypothesis outputs \(0\) and the label is \(1\), the cost approaches \(\infty\). The \(y = 0\) case works the same way, with the curve mirrored.
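A quick Octave sketch to plot the two cost curves (purely illustrative; the plotting choices are mine):

h = linspace(0.001, 0.999, 500);   % possible hypothesis outputs in (0, 1)
plot(h, -log(h), 'b');             % cost when y = 1: zero at h = 1, grows without bound as h -> 0
hold on;
plot(h, -log(1 - h), 'r');         % cost when y = 0: zero at h = 0, grows without bound as h -> 1
xlabel('h_\theta(x)'); ylabel('Cost'); legend('y = 1', 'y = 0');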

Simplified Cost Function and Gradient Descent
Find a simpler way to write the cost function, and work out how to use gradient descent to fit the parameters of logistic regression.
An equivalent way to write the logistic regression cost function
\(Cost(h_\theta(x), y) = -y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))\)
\(\rightarrow\) \(J(\theta) = \frac{1}{m}\sum_{i=1}^m Cost(h_\theta(x^{(i)}), y^{(i)}) = -\frac{1}{m}\left[\sum_{i=1}^m y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\)
This expression is derived from the maximum likelihood principle in statistics, which is a way of efficiently finding parameters for different models. It also has a very useful property: it is convex.
What we do next is fit parameters \(\theta\) to the training set so that \(J(\theta)\) is minimized, and the way to minimize \(J(\theta)\) is gradient descent.
Gradient descent
- Original form. Want \(\min_\theta J(\theta)\):
  Repeat {
  \(\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)\)
  }
- Substituting the simplified cost function above. Want \(\min_\theta J(\theta)\):
  Repeat {
  \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)
  }
- The difference from linear regression is the hypothesis function (a minimal Octave sketch of the update loop follows this list):
  - Linear regression: \(h_\theta(x) = \theta^Tx\)
  - Logistic regression: \(h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}\)
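A minimal vectorized Octave sketch of this update loop, on a tiny made-up data set (the data, learning rate, and iteration count are arbitrary choices of mine):

% Toy training set: one feature x1 plus the intercept column of ones
X = [1 1; 1 2; 1 3; 1 4];
y = [0; 0; 1; 1];
[m, n] = size(X);

alpha = 0.1;                          % learning rate
theta = zeros(n, 1);                  % initial parameters
for iter = 1:1000
  h = 1 ./ (1 + exp(-X * theta));     % h_theta(x^(i)) for every example at once
  grad = (1 / m) * (X' * (h - y));    % vector of partial derivatives of J(theta)
  theta = theta - alpha * grad;       % simultaneous update of every theta_j
end
theta                                 % display the learned parameters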
Advanced Optimization
With advanced optimization algorithms and ideas, logistic regression can run much faster, which also makes it better suited to large-scale machine learning problems.
Optimization algorithm
Given \(\theta\), we have code that can compute \(J(\theta)\) and \(\frac{\partial}{\partial \theta_j}J(\theta)\) (for \(j = 0, 1, \dots, n\))
Optimization algorithms:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
Advantages of the latter three algorithms:
- No need to manually pick \(\alpha\)
- Often faster than gradient descent
Disadvantage of the latter three algorithms:
- More complex (their internals are harder to understand)
Advice on calling these functions
If a software library already implements them, call the library function rather than writing your own.
Example
\(\theta = \left[ \begin{matrix} \theta_1 \\ \theta_2 \end{matrix} \right]\)
\(J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2\)
\(\frac{\partial}{\partial\theta_1}J(\theta) = 2(\theta_1 - 5)\)
\(\frac{\partial}{\partial\theta_2}J(\theta) = 2(\theta_2 - 5)\)
Write the cost function:
function [jVal, gradient] = costFunction(theta)
  % Cost J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2, minimized at theta = [5; 5]
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  % Gradient: the two partial derivatives of J with respect to theta1 and theta2
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);
  gradient(2) = 2 * (theta(2) - 5);
Run the code:
fminunc is a built-in advanced optimization routine; in Octave its name stands for "function minimization unconstrained". To use it, first create an options structure holding the settings you want: setting 'GradObj' to 'on' tells the algorithm that you will supply a gradient, and 'MaxIter' sets the maximum number of iterations, 100 in the example below.
The @ symbol creates a handle (a pointer) to the costFunction defined above. fminunc automatically chooses a learning rate \(\alpha\) for you and then runs these advanced optimization algorithms, like a souped-up version of gradient descent, to find the best value of \(\theta\).
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
functionVal is the final value of the cost function, and exitFlag indicates whether the algorithm has converged.
For more details, use help (e.g., help fminunc).
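The same pattern applied to logistic regression itself, as a rough sketch (the toy data and the name logisticCostFunction are my own; the cost and gradient are the formulas given earlier):

% logisticCostFunction.m -- cost J(theta) and its gradient for logistic regression
function [jVal, gradient] = logisticCostFunction(theta, X, y)
  m = size(X, 1);
  h = 1 ./ (1 + exp(-X * theta));                               % hypothesis for every example
  jVal = -(1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h));   % J(theta)
  gradient = (1 / m) * (X' * (h - y));                          % vector of partial derivatives
end

% Usage with a tiny, non-separable toy data set:
X = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6];   % leading column of ones plus one feature
y = [0; 0; 1; 0; 1; 1];
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) logisticCostFunction(t, X, y), initialTheta, options)

Because 'GradObj' is 'on', fminunc uses the analytic gradient returned by the function instead of approximating it numerically.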
Multiclass Classification: One-vs-all
Multiclass classification
Examples of multiclass classification problems:
- Email foldering/tagging: Work, Friends, Family, Hobby. Suppose you want a learning algorithm that automatically sorts your email into these folders or tags it.
- Medical diagnosis: Not ill, Cold, Flu. A patient comes to you with a stuffy nose; they may not be ill, may have a cold, or may have the flu.
- Weather: Sunny, Cloudy, Rain, Snow. You are building a machine learning classifier for the weather and want to distinguish sunny, cloudy, rainy, and snowy days.
One-vs-all
Illustration of a multiclass problem:

Approach: build new "artificial" training sets.
Take the triangle class, for example: label it as the positive class and the other two classes as the negative class, which gives a new binary training set. Fit a classifier to it, denoted \(h_\theta^{(1)}(x)\).
Then take the square class as positive and the other two as negative, and so on, which yields \(h_\theta^{(2)}(x)\) and \(h_\theta^{(3)}(x)\).

Train a logistic regression classifier \(h_\theta^{(i)}(x)\) for each class \(i\) to predict the probability that \(y = i\).
On a new input \(x\), to make a prediction, run each classifier on \(x\) and pick the class \(i\) that maximizes \(h_\theta^{(i)}(x)\), i.e. \(\max_i h_\theta^{(i)}(x)\).
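A hedged Octave sketch of the prediction step only (the matrix all_theta, which holds one row of trained parameters per class, and the toy numbers are mine; training each row is just ordinary logistic regression on the relabeled training sets described above):

% all_theta: K x (n+1) matrix, row k holds the parameters of classifier h^(k)
all_theta = [  1 -2  0;     % class 1: large response when x1 is small
              -3  2  0;     % class 2: large response when x1 is large
              -3  0  2 ];   % class 3: large response when x2 is large
X = [1 0.5 0;               % three new examples, leading column of ones
     1 3   0;
     1 0   3];
probs = 1 ./ (1 + exp(-X * all_theta'));     % m x K matrix, probs(i, k) = h^(k)(x^(i))
[maxProb, predictions] = max(probs, [], 2)   % class with the largest probability: predictions = [1; 2; 3]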
Review
Quiz
- Suppose that you have trained a logistic regression classifier, and it outputs on a new example \(x\) a prediction \(h_\theta(x)\) = 0.7. This means (check all that apply):
- [x] Our estimate for \(P(y=1|x;\theta)\) is 0.7.
- [x] Our estimate for \(P(y=0|x;\theta)\) is 0.3.
- [ ] Our estimate for \(P(y=0|x;\theta)\) is 0.7.
- [ ] Our estimate for \(P(y=1|x;\theta)\) is 0.3.
- Suppose you have the following training set, and fit a logistic regression classifier \(h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)\). Which of the following are true? Check all that apply.

- [x] Adding polynomial features (e.g., instead using \(h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)\)) could increase how well we can fit the training data. A linear decision boundary does not fit this data well; adding polynomial features lets the hypothesis fit the data better.
- [x] At the optimal value of \(\theta\) (e.g., found by fminunc), we will have \(J(\theta) \geq 0\).
- [ ] Adding polynomial features (e.g., instead using \(h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)\) ) would increase \(J(\theta)\) because we are now summing over more terms. Adding features can only decrease (or leave unchanged) the optimal \(J(\theta)\).
- [ ] If we train gradient descent for enough iterations, for some examples \(x^{(i)}\) in the training set it is possible to obtain \(h_\theta(x^{(i)}) > 1\). The sigmoid guarantees \(0 \lt h_\theta(x^{(i)}) \lt 1\).
- [x] \(J(\theta)\) will be a convex function, so gradient descent should converge to the global minimum.
- [ ] The positive and negative examples cannot be separated using a straight line. So, gradient descent will fail to converge.
- [ ] Because the positive and negative examples cannot be separated using a straight line, linear regression will perform as well as logistic regression on this data.
- For logistic regression, the gradient is given by \(\frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum^m_{i=1}(h_\theta(x^{(i)})-y^{(i)})x_j^{(i)}\). Which of these is a correct gradient descent update for logistic regression with a learning rate of \(\alpha\)? Check all that apply.
- [x] \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum^m_{i=1}(\frac{1}{1+e^{-\theta^Tx^{(i)}}}-y^{(i)})x_j^{(i)}\) (simultaneously update for all \(j\)).
- [x] \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\) (simultaneously update for all \(j\)).
- [ ] \(\theta := \theta - \alpha\frac{1}{m}\sum^m_{i=1}(\theta^Tx - y^{(i)})x^{(i)}\). This is the linear regression update.
- [ ] \(\theta_j := \theta_j - \alpha\frac{1}{m}\sum^m_{i=1}(h_\theta(x^{(i)}) - y^{(i)})x^{(i)}\) (simultaneously update for all \(j\)).
- Which of the following statements are true? Check all that apply.
- [x] The one-vs-all technique allows you to use logistic regression for problems in which each \(y^{(i)}\) comes from a fixed, discrete set of values. Treat each class in turn as the positive class and all the others as the negative class.
- [x] The cost function \(J(\theta)\) for logistic regression trained with \(m \geq 1\) examples is always greater than or equal to zero. \(J(\theta) \geq 0\).
- [ ] For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to find the global minimum). This is the reason we prefer more advanced optimization algorithms such as fminunc (conjugate gradient/BFGS/L-BFGS/etc). \(J(\theta)\) is convex, so gradient descent does not get stuck in local minima; the advanced algorithms are preferred because no learning rate \(\alpha\) has to be chosen and they usually run faster.
- [ ] Since we train one classifier when there are two classes, we train two classifiers when there are three classes (and we do one-vs-all classification). With 3 classes, one-vs-all trains 3 classifiers.
- [x] The sigmoid function \(g(z)=\frac{1}{1+e^{-z}}\) is never greater than one (>1). The range of the sigmoid function is (0, 1).
- [ ] Linear regression always works well for classification if you classify by using a threshold on the prediction made by linear regression. Thresholding a linear regression prediction does not work well for classification in general.
- Suppose you train a logistic classifier \(h_\theta(x) = g(\theta_0 + \theta_1x_1 + \theta_2 x_2)\). Suppose \(\theta_0 = 6, \theta_1 = -1, \theta_2 = 0\). Which of the following figures represents the decision boundary found by your classifier? (Here \(h_\theta(x) = g(6 - x_1)\), so y = 1 is predicted when \(x_1 \leq 6\); the decision boundary is the vertical line \(x_1 = 6\).)
- [x]

- [ ]

- [ ]

- [ ]

- [x]
