MLPClassifier: the hidden layers do not include the input and output layers

Reposted article, 2018-05-26 10:28. Tags: python / machine learning

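A brief introduction to the multilayer perceptron (MLP)

The multilayer perceptron (MLP) is a basic kind of artificial neural network (ANN). Between the input layer and the output layer it can have several hidden layers. The simplest MLP has a single hidden layer, giving a three-layer structure (the figure from the original post is not reproduced here).

In an MLP, adjacent layers are fully connected: every neuron in one layer is connected to every neuron in the next layer. The bottom layer is the input layer, the middle layers are hidden layers, and the last layer is the output layer.

There is not much to say about the input layer: whatever you feed in is what it holds. If the input is an n-dimensional vector, the input layer has n neurons.

How are the hidden-layer neurons computed? The hidden layer is fully connected to the input layer. If the input is the vector X, the output of the hidden layer is

f(W1X+b1), where W1 is the weight matrix (also called the connection coefficients), b1 is the bias, and f is an activation function such as the commonly used sigmoid or tanh.

Finally, the output layer. How does it relate to the hidden layer? The step from the hidden layer to the output layer can be viewed as a multi-class logistic regression, that is, softmax regression, so the output of the output layer is softmax(W2X1+b2), where X1 = f(W1X+b1) is the output of the hidden layer.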
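The two formulas above, f(W1X+b1) followed by softmax(W2X1+b2), can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random weights, and the helper names `softmax` and `mlp_forward` are assumptions made for the example, not part of the original post.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mlp_forward(X, W1, b1, W2, b2, f=np.tanh):
    # Hidden layer: X1 = f(W1 X + b1)
    X1 = f(W1 @ X + b1)
    # Output layer: softmax(W2 X1 + b2), one probability per class
    return softmax(W2 @ X1 + b2)

# Hypothetical sizes: 4 inputs, 5 hidden units, 3 classes.
rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 4, 5, 3
X = rng.normal(size=n_in)
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_classes, n_hidden)), np.zeros(n_classes)

probs = mlp_forward(X, W1, b1, W2, b2)
print(probs, probs.sum())
```

The output is a vector of class probabilities that sums to 1, which is what makes the last layer behave like a softmax regression on the hidden features.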

`sklearn.neural_network.MLPClassifier`

class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100,), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08, n_iter_no_change=10)

Multi-layer Perceptron classifier.

This model optimizes the log-loss function using LBFGS or stochastic gradient descent.

New in version 0.18.

Parameters:

hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)

The ith element represents the number of neurons in the ith hidden layer.
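The point in the title, that `hidden_layer_sizes` counts only hidden layers, can be checked on a fitted model. The sketch below assumes scikit-learn is installed and uses made-up random data. The tuple `(5, 3)` and the 4-feature, 3-class setup are arbitrary choices for illustration.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Made-up data: 60 samples, 4 features, 3 classes.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = np.arange(60) % 3

# Two hidden layers with 5 and 3 units; input and output sizes are inferred from the data.
clf = MLPClassifier(hidden_layer_sizes=(5, 3), max_iter=50, random_state=0)
with warnings.catch_warnings():
    # The short training run may not converge; that does not matter for inspecting shapes.
    warnings.simplefilter("ignore", ConvergenceWarning)
    clf.fit(X, y)

print(clf.n_layers_)                   # input + 2 hidden + output
print([w.shape for w in clf.coefs_])   # one weight matrix per pair of adjacent layers
```

With two entries in `hidden_layer_sizes`, `n_layers_` is 4, matching "length = n_layers - 2", and the weight matrices connect 4 to 5, 5 to 3, and 3 to 3 units.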

activation : {'identity', 'logistic', 'tanh', 'relu'}, default 'relu'

Activation function for the hidden layer.

'identity', no-op activation, useful to implement linear bottleneck, returns f(x) = x
'logistic', the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).
'tanh', the hyperbolic tan function, returns f(x) = tanh(x).
'relu', the rectified linear unit function, returns f(x) = max(0, x)
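The four activation formulas listed above can be written directly in NumPy. This is a minimal sketch of the formulas as stated in the documentation, not scikit-learn's internal implementation. The dictionary name `ACTIVATIONS` is chosen here for illustration.

```python
import numpy as np

# Each entry implements the formula given for that option.
ACTIVATIONS = {
    "identity": lambda x: x,                          # f(x) = x
    "logistic": lambda x: 1.0 / (1.0 + np.exp(-x)),   # f(x) = 1 / (1 + exp(-x))
    "tanh": np.tanh,                                  # f(x) = tanh(x)
    "relu": lambda x: np.maximum(0.0, x),             # f(x) = max(0, x)
}

x = np.array([-2.0, 0.0, 2.0])
for name, f in ACTIVATIONS.items():
    print(name, f(x))
```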

solver : {'lbfgs', 'sgd', 'adam'}, default 'adam'

The solver for weight optimization.

'lbfgs' is an optimizer in the family of quasi-Newton methods.
'sgd' refers to stochastic gradient descent.
'adam' refers to a stochastic gradient-based optimizer proposed by Diederik Kingma and Jimmy Ba.

Note: The default solver 'adam' works pretty well on relatively large datasets (with thousands of training samples or more) in terms of both training time and validation score. For small datasets, however, 'lbfgs' can converge faster and perform better.

alpha : float, optional, default 0.0001

L2 penalty (regularization term) parameter.

batch_size : int, optional, default 'auto'

Size of minibatches for stochastic optimizers. If the solver is 'lbfgs', the classifier will not use minibatch. When set to 'auto', batch_size=min(200, n_samples).

learning_rate : {'constant', 'invscaling', 'adaptive'}, default 'constant'

Learning rate schedule for weight updates.

'constant' is a constant learning rate given by 'learning_rate_init'.
'invscaling' gradually decreases the learning rate learning_rate_ at each time step 't' using an inverse scaling exponent of 'power_t'. effective_learning_rate = learning_rate_init / pow(t, power_t)
'adaptive' keeps the learning rate constant to 'learning_rate_init' as long as training loss keeps decreasing. Each time two consecutive epochs fail to decrease training loss by at least tol, or fail to increase validation score by at least tol if 'early_stopping' is on, the current learning rate is divided by 5.

Only used when solver='sgd'.

learning_rate_init : double, optional, default 0.001

The initial learning rate used. It controls the step-size in updating the weights. Only used when solver='sgd' or 'adam'.

power_t : double, optional, default 0.5

The exponent for inverse scaling learning rate. It is used in updating effective learning rate when the learning_rate is set to 'invscaling'. Only used when solver='sgd'.
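To see how quickly the 'invscaling' schedule decays, the documented formula effective_learning_rate = learning_rate_init / pow(t, power_t) can be evaluated for a few steps. The function name `invscaling_rate` is hypothetical, and the code follows the formula exactly as quoted above with t starting at 1. The library's internal step counting may differ.

```python
def invscaling_rate(learning_rate_init, t, power_t=0.5):
    # Documented formula; t is assumed to start at 1 to avoid division by zero.
    if t < 1:
        raise ValueError("t must be >= 1")
    return learning_rate_init / pow(t, power_t)

# With the defaults learning_rate_init=0.001 and power_t=0.5,
# the rate shrinks in proportion to 1 / sqrt(t).
steps = (1, 4, 100)
rates = [invscaling_rate(0.001, t) for t in steps]
for t, r in zip(steps, rates):
    print(t, r)
```

With power_t=0.5 the rate halves after 4 steps and drops by a factor of 10 after 100 steps, so the decay is fast at first and slow later.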

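Overview

Take supervised learning as an example. Suppose we have a training set $\textstyle (x^{(i)}, y^{(i)})$. A neural network then gives us a complex, nonlinear hypothesis $\textstyle h_{W,b}(x)$ with parameters $\textstyle W, b$ that we can fit to our data.

To describe neural networks, we start with the simplest one, which consists of a single "neuron" (the diagram from the original post is not reproduced here).

This "neuron" is a computational unit that takes $\textstyle x_1, x_2, x_3$ and an intercept term $\textstyle +1$ as inputs and outputs $\textstyle h_{W,b}(x) = f(W^Tx + b) = f(\sum_{i=1}^3 W_{i}x_i +b)$ , where $\textstyle f : \Re \mapsto \Re$ is called the "activation function". In this tutorial we use the sigmoid function as the activation function $\textstyle f(\cdot)$ :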

$f(z) = \frac{1}{1+\exp(-z)}.$

So the input-output mapping of this single "neuron" is exactly the one defined by logistic regression.
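A single sigmoid "neuron" of this kind takes only a few lines. The weights and inputs below are made-up values chosen so that the weighted sum is exactly zero. The function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, W, b):
    # h_{W,b}(x) = f(sum_i W_i x_i + b) with f = sigmoid
    return sigmoid(np.dot(W, x) + b)

x = np.array([1.0, 2.0, 3.0])       # three inputs x_1, x_2, x_3
W = np.array([0.5, -0.25, 0.0])     # hypothetical weights
b = 0.0                             # intercept, kept as a separate parameter

out = neuron(x, W, b)
print(out)  # weighted sum is 0.5 - 0.5 + 0 = 0, and sigmoid(0) = 0.5
```

Reading the output as a probability of the positive class gives the logistic regression model.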

Although this tutorial series uses the sigmoid function, you can also choose the hyperbolic tangent (tanh):

$f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}},$

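The original post shows plots of the sigmoid and tanh functions at this point.

The $\textstyle \tanh(z)$ function is a rescaled version of the sigmoid: its output range is $\textstyle (-1,1)$ rather than the sigmoid's $\textstyle (0,1)$ .

Note that, unlike some other sources (including the OpenClassroom videos and Stanford's CS229 course), we do not use the convention $\textstyle x_0=1$ here. Instead, we handle the intercept with a separate parameter $\textstyle b$ .

Finally, one identity will be used often later: if we choose $\textstyle f(z) = 1/(1+\exp(-z))$ , the sigmoid function, its derivative is $\textstyle f'(z) = f(z) (1-f(z))$ ; if we choose tanh, its derivative is $\textstyle f'(z) = 1- (f(z))^2$ . You can derive both identities from the definitions of the sigmoid and tanh functions.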
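Both derivative identities can be sanity-checked numerically before deriving them by hand. The sketch below compares each closed form with a central finite difference. The step size and the grid of test points are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def numerical_derivative(f, z, eps=1e-6):
    # Central difference approximation of f'(z)
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

z = np.linspace(-3.0, 3.0, 13)

# sigmoid: f'(z) = f(z) (1 - f(z))
sig_err = np.max(np.abs(numerical_derivative(sigmoid, z) - sigmoid(z) * (1.0 - sigmoid(z))))

# tanh: f'(z) = 1 - f(z)^2
tanh_err = np.max(np.abs(numerical_derivative(np.tanh, z) - (1.0 - np.tanh(z) ** 2)))

print(sig_err, tanh_err)  # both should be tiny (finite-difference error only)
```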

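Neural network model

A neural network is built by connecting many single "neurons" so that the output of one "neuron" can be the input of another. The original post shows a small example network at this point (figure not reproduced here).

We use circles to denote the inputs to the network. The circles labeled " $\textstyle +1$ " are called bias units and correspond to the intercept term. The leftmost layer of the network is called the input layer and the rightmost layer the output layer (which has a single node in this example). The layer of nodes in the middle is called the hidden layer, because its values are not observed in the training set. The example network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.

We write $\textstyle {n}_l$ for the number of layers in the network; in this example $\textstyle n_l=3$ . We label layer $\textstyle l$ as $\textstyle L_l$ , so $\textstyle L_1$ is the input layer and $\textstyle L_{n_l}$ is the output layer. The example network has parameters $\textstyle (W,b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$ , where $\textstyle W^{(l)}_{ij}$ (used in the equations below) is the parameter associated with the connection between unit $\textstyle j$ in layer $\textstyle l$ and unit $\textstyle i$ in layer $\textstyle l+1$ (that is, the weight on that connection; note the order of the indices), and $\textstyle b^{(l)}_i$ is the bias associated with unit $\textstyle i$ in layer $\textstyle l+1$ . In this example, therefore, $\textstyle W^{(1)} \in \Re^{3\times 3}$ and $\textstyle W^{(2)} \in \Re^{1\times 3}$ . Note that bias units have no incoming connections, since they always output $\textstyle +1$ . We also let $\textstyle s_l$ denote the number of nodes in layer $\textstyle l$ , not counting the bias unit.

We write $\textstyle a^{(l)}_i$ for the activation (output value) of unit $\textstyle i$ in layer $\textstyle l$ . For $\textstyle l=1$ , $\textstyle a^{(1)}_i = x_i$ , the $\textstyle i$ -th input (the $\textstyle i$ -th feature of the input). Given a fixed setting of the parameters $\textstyle W,b$ , the network computes its output according to the function $\textstyle h_{W,b}(x)$ . For the example network, the computation is: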

$\begin{align} a_1^{(2)} &= f(W_{11}^{(1)}x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}) \\ a_2^{(2)} &= f(W_{21}^{(1)}x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}) \\ a_3^{(2)} &= f(W_{31}^{(1)}x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}) \\ h_{W,b}(x) &= a_1^{(3)} = f(W_{11}^{(2)}a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}) \end{align}$

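We let $\textstyle z^{(l)}_i$ denote the total weighted sum of inputs to unit $\textstyle i$ in layer $\textstyle l$ , including the bias term; for example, $\textstyle z_i^{(2)} = \sum_{j=1}^n W^{(1)}_{ij} x_j + b^{(1)}_i$ , so that $\textstyle a^{(l)}_i = f(z^{(l)}_i)$ .

This admits a more compact notation. If we extend the activation function $\textstyle f(\cdot)$ to apply to vectors element-wise, i.e. $\textstyle f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$ , the equations above can be written more compactly as: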

$\begin{align} z^{(2)} &= W^{(1)} x + b^{(1)} \\ a^{(2)} &= f(z^{(2)}) \\ z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)} \\ h_{W,b}(x) &= a^{(3)} = f(z^{(3)}) \end{align}$

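We call this computation forward propagation. Recall that we use $\textstyle a^{(1)} = x$ to denote the values of the input layer. Then, given the activations $\textstyle a^{(l)}$ of layer $\textstyle l$ , the activations $\textstyle a^{(l+1)}$ of layer $\textstyle l+1$ are computed as follows: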

$\begin{align} z^{(l+1)} &= W^{(l)} a^{(l)} + b^{(l)} \\ a^{(l+1)} &= f(z^{(l+1)}) \end{align}$
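The recurrence $z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$, $a^{(l+1)} = f(z^{(l+1)})$ translates directly into a loop over layers. The sketch below uses the 3-3-1 example network with random weights. The function name `forward` and the weight values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    # weights[k] has shape (s_{l+1}, s_l) with l = k + 1, matching the W^{(l)}_{ij} index order.
    a = x                       # a^{(1)} = x
    activations = [a]
    for W, b in zip(weights, biases):
        z = W @ a + b           # z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}
        a = f(z)                # a^{(l+1)} = f(z^{(l+1)}), applied element-wise
        activations.append(a)
    return activations

# The 3-3-1 example: W^{(1)} is 3x3, W^{(2)} is 1x3.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 3))]
biases = [np.zeros(3), np.zeros(1)]
x = np.array([0.5, -1.0, 2.0])

acts = forward(x, weights, biases)
print(acts[-1])  # h_{W,b}(x) = a^{(3)}
```

Keeping every layer's activations in a list is a design choice for this sketch. A training procedure would typically need the intermediate activations, while pure prediction only needs the last one.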

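By organizing the parameters into matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to evaluate the network quickly.

So far we have discussed one particular network, but we can also build networks with other architectures (meaning the pattern of connectivity between neurons), including ones with multiple hidden layers. The most common choice is an $\textstyle n_l$ -layer network in which layer $\textstyle 1$ is the input layer, layer $\textstyle n_l$ is the output layer, and each layer $\textstyle l$ is densely connected to layer $\textstyle l+1$ . In this setting, to compute the output of the network we apply the forward propagation equations step by step, computing all the activations in layer $\textstyle L_2$ , then layer $\textstyle L_3$ , and so on up to layer $\textstyle L_{n_l}$ . This is an example of a feedforward neural network, since the connectivity graph has no directed loops or cycles.

Neural networks can also have multiple output units. For example, the network shown in the original post (figure not reproduced here) has two hidden layers, $\textstyle L_2$ and $\textstyle L_3$ , and an output layer $\textstyle L_4$ with two output units.

To train such a network, we need training examples $\textstyle (x^{(i)}, y^{(i)})$ with $\textstyle y^{(i)} \in \Re^2$ . This kind of network is useful when you want to predict several outputs at once. (For example, in a medical diagnosis application, the vector $\textstyle x$ might hold a patient's clinical measurements, and the different outputs $\textstyle y_i$ might indicate the presence or absence of different diseases.)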
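Tying this back to `MLPClassifier`: one way to get a network with two hidden layers and two output units is to pass a two-column binary indicator matrix as the target, which scikit-learn treats as a multilabel problem. The sketch below assumes scikit-learn is installed and uses synthetic data. The hidden layer sizes `(3, 2)` and the rules generating the two labels are made up for illustration.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic data: 3 input features, 2 binary labels per sample
# (think of each label as "disease present or not").
rng = np.random.RandomState(0)
X = rng.rand(80, 3)
Y = np.column_stack([X[:, 0] > 0.5, X[:, 1] + X[:, 2] > 1.0]).astype(int)

# Two hidden layers (L_2 and L_3); the output layer L_4 gets one unit per label.
clf = MLPClassifier(hidden_layer_sizes=(3, 2), max_iter=200, random_state=0)
with warnings.catch_warnings():
    # Convergence is not the point of this sketch.
    warnings.simplefilter("ignore", ConvergenceWarning)
    clf.fit(X, Y)

pred = clf.predict(X[:5])
print(clf.n_layers_, clf.n_outputs_)
print(pred.shape)  # one row per sample, one column per label
```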