A Classic a Day: Reducing the Dimensionality of Data with Neural Networks [Science 2006]

Reposted article. 2016-08-29 16:24. Tags: Pretraining / Neural Network / Dimensionality Reduction / Nonlinear / Deep Learning / PCA / Layer-by-layer learning

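The paper is only a few pages long. Following the idea that a classic deserves several readings, I printed it in color, and it turned out to be far more comfortable to read than on a screen. If I circled the key points with a blue pen, almost the whole paper would end up blue. Every word counts.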

Reducing the Dimensionality of Data with Neural Networks

G.E. Hinton and R.R. Salakhutdinov

Abstract

High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. In such an autoencoder network, gradient descent can be used to fine-tune the weights, but this works well only if the initial weights are close to a good solution. The paper describes an effective way of initializing the weights, so that the low-dimensional codes learned by a deep autoencoder network work better than the result of PCA.

Background

PCA (Principal Components Analysis) is a widely used method for dimensionality reduction. It finds the directions of greatest variance in the data set and represents each data point by its coordinates along each of these directions.
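To make the PCA description concrete, here is a minimal NumPy sketch that finds the directions of greatest variance with an SVD and represents each point by its coordinates along them. The function names and the toy data are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k directions of greatest variance."""
    mean = X.mean(axis=0)
    Xc = X - mean  # PCA works on centered data
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    codes = Xc @ components.T  # coordinates along each direction
    return codes, components, mean

def pca_reconstruct(codes, components, mean):
    """Map low-dimensional codes back to the original space."""
    return codes @ components + mean

rng = np.random.default_rng(0)
# Toy data (made up): 200 points lying close to a 2-D plane in 5-D space
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

codes, components, mean = pca_project(X, k=2)
X_hat = pca_reconstruct(codes, components, mean)
print(codes.shape, float(np.mean((X - X_hat) ** 2)))
```

Because the toy data is almost exactly planar, two linear directions reconstruct it well. The autoencoder targets the case where the low-dimensional structure is nonlinear and a linear projection like this one falls short.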

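Autoencoder: the paper describes a nonlinear generalization of PCA that uses an adaptive multilayer encoder network to transform high-dimensional data into a low-dimensional code, and a similar decoder network to recover the data from the code. Starting from random weights, the two networks are trained together by minimizing the discrepancy between the original data and its reconstruction. The gradients are computed with the chain rule and backpropagated to update the network weights.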
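As a rough illustration of this training scheme, the sketch below trains a one-hidden-layer autoencoder from random weights by backpropagating the reconstruction error with the chain rule. It is deliberately much shallower than the networks in the paper; the layer sizes, learning rate, and toy data are all assumptions chosen to keep the example short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
# Toy data (made up): 8-D points generated from a 2-D latent variable
X = sigmoid(rng.normal(size=(100, 2)) @ rng.normal(size=(2, 8)))

n_in, n_code = X.shape[1], 2
W_enc = 0.1 * rng.normal(size=(n_in, n_code))  # random initial weights
b_enc = np.zeros(n_code)
W_dec = 0.1 * rng.normal(size=(n_code, n_in))
b_dec = np.zeros(n_in)

lr, losses = 0.1, []
for step in range(500):
    # Forward pass: encode to a low-dimensional code, then decode
    code = sigmoid(X @ W_enc + b_enc)
    X_hat = code @ W_dec + b_dec  # linear output layer (a simplification)
    err = X_hat - X
    losses.append(0.5 * np.sum(err ** 2) / len(X))

    # Backward pass: chain rule, decoder first, then encoder
    d_out = err / len(X)
    grad_W_dec = code.T @ d_out
    grad_b_dec = d_out.sum(axis=0)
    d_code = (d_out @ W_dec.T) * code * (1.0 - code)  # sigmoid derivative
    grad_W_enc = X.T @ d_code
    grad_b_enc = d_code.sum(axis=0)

    # Plain gradient descent update
    W_dec -= lr * grad_W_dec
    b_dec -= lr * grad_b_dec
    W_enc -= lr * grad_W_enc
    b_enc -= lr * grad_b_enc

print(losses[0], losses[-1])
```

With a single hidden layer, random initialization is workable. The difficulty the paper addresses shows up when many hidden layers are stacked.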

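Here comes the problem.

For an autoencoder with several hidden layers, the initial weights matter a great deal. If they are too large, training typically ends up in a poor local minimum; if they are too small, the gradients in the early layers are tiny and training becomes very difficult. Gradient descent converges to a good solution only when the initial weights are already close to one. Finding such initial weights requires a very different type of algorithm, one that learns one layer of features at a time, which is why the paper introduces the pretraining procedure.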

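Method overview

The pretraining procedure is explained using binary data as the example.

An RBM (Restricted Boltzmann Machine), made up of a visible layer and a hidden layer, is used to model binary data. For images, the pixels correspond to the visible units of the RBM and the feature detectors correspond to the hidden units. The energy of a joint configuration (v, h) of the visible and hidden units is defined as: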

$E(v,h)=-\sum_{i\in \text{pixels}}b_iv_i-\sum_{j\in\text{features}}b_jh_j-\sum_{i,j}v_ih_jw_{ij}$
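The energy function translates almost term by term into NumPy. The sketch below evaluates it for one small configuration; the weights, biases, and unit states are made-up numbers used only to show the computation.

```python
import numpy as np

def rbm_energy(v, h, W, b_vis, b_hid):
    """E(v,h) = -sum_i b_i v_i - sum_j b_j h_j - sum_ij v_i h_j w_ij."""
    return -(b_vis @ v) - (b_hid @ h) - (v @ W @ h)

# Made-up parameters: 3 visible units (pixels), 2 hidden units (features)
W = np.array([[1.0, -1.0],
              [0.5,  0.0],
              [0.0,  2.0]])
b_vis = np.array([0.1, 0.2, 0.3])
b_hid = np.array([-0.5, 0.5])

v = np.array([1.0, 0.0, 1.0])  # binary pixel states
h = np.array([0.0, 1.0])       # binary feature states

energy = rbm_energy(v, h, W, b_vis, b_hid)
print(energy)  # -0.4 - 0.5 - 1.0 = -1.9
```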

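Through this energy function the network assigns a probability to every possible image. Adjusting the weights and biases lowers the energy of real images and raises the energy of confabulated ones, so the network comes to prefer the real data. The structure of the RBM is convenient: given h, the visible units are conditionally independent of each other, and given v, the hidden units are conditionally independent of each other. This gives the following formulas: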

$\begin{aligned} &P(v_i=1|h)=\sigma (b_i+\sum_jh_jw_{ij})\\ &P(h_j=1|v)=\sigma (b_j+\sum_iv_iw_{ij}) \end{aligned}$
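Because the units within a layer are conditionally independent, each layer can be computed and sampled in one vectorized step. The following sketch implements the two conditionals; the helper names and the random parameters are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_given_visible(v, W, b_hid):
    # P(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij)
    return sigmoid(b_hid + v @ W)

def p_visible_given_hidden(h, W, b_vis):
    # P(v_i = 1 | h) = sigma(b_i + sum_j h_j w_ij)
    return sigmoid(b_vis + h @ W.T)

def sample_bernoulli(p, rng):
    # Each unit is switched on independently with its own probability
    return (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(6, 3))  # 6 visible units, 3 hidden units
b_vis, b_hid = np.zeros(6), np.zeros(3)

v = np.array([1, 0, 1, 1, 0, 0], dtype=float)
p_h = p_hidden_given_visible(v, W, b_hid)
h = sample_bernoulli(p_h, rng)
p_v = p_visible_given_hidden(h, W, b_vis)
print(p_h, h, p_v)
```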

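Weight update: the weights w are adjusted with the rule given below, where $\varepsilon$ is the learning rate, $\left \langle v_ih_j \right \rangle_{\text{data}}$ is the fraction of times that pixel i and feature detector j are on together when the feature detectors are driven by real images, and $\left \langle v_ih_j \right \rangle_{\text{recon}}$ is the corresponding fraction for confabulated images. Note that this rule does not exactly follow the gradient of the log probability implied by the energy function.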

$\Delta w_{ij}=\varepsilon \left ( \left \langle v_ih_j \right \rangle_{\text{data}}-\left \langle v_ih_j \right \rangle_{\text{recon}} \right )$
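The update rule can be sketched as a single function applied to a batch of binary vectors. This is a minimal version written under assumptions: it uses one reconstruction step, uses hidden probabilities rather than sampled states when forming the statistics, and applies an analogous simplified rule to the biases. The function name, toy patterns, and hyperparameters are made up.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b_vis, b_hid, lr, rng):
    """One update of delta_w = lr * (<v h>_data - <v h>_recon) on a batch."""
    n = v_data.shape[0]
    # Drive the feature detectors with the data
    p_h_data = sigmoid(b_hid + v_data @ W)
    h_data = (rng.random(p_h_data.shape) < p_h_data).astype(float)
    # Produce a confabulation (reconstruction) from the sampled features
    v_recon = sigmoid(b_vis + h_data @ W.T)
    p_h_recon = sigmoid(b_hid + v_recon @ W)
    # Pairwise statistics averaged over the batch
    pos = v_data.T @ p_h_data / n    # <v_i h_j>_data
    neg = v_recon.T @ p_h_recon / n  # <v_i h_j>_recon
    W = W + lr * (pos - neg)
    b_vis = b_vis + lr * (v_data - v_recon).mean(axis=0)
    b_hid = b_hid + lr * (p_h_data - p_h_recon).mean(axis=0)
    recon_error = float(np.mean((v_data - v_recon) ** 2))
    return W, b_vis, b_hid, recon_error

rng = np.random.default_rng(0)
# Toy data (made up): two complementary binary patterns, 10 copies each
patterns = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.repeat(patterns, 10, axis=0)

W = 0.01 * rng.normal(size=(6, 4))
b_vis, b_hid = np.zeros(6), np.zeros(4)
errors = []
for _ in range(2000):
    W, b_vis, b_hid, err = cd1_update(data, W, b_vis, b_hid, lr=0.1, rng=rng)
    errors.append(err)
print(errors[0], errors[-1])
```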

The pretraining procedure: learn one layer of feature detectors, then treat their activities as data for learning a second layer of features. In other words, the first layer of feature detectors becomes the visible units for learning the next RBM. This layer-by-layer learning can be repeated as many times as desired.
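The layer-by-layer procedure amounts to a loop in which each trained RBM's hidden activities become the training data for the next one. The sketch below assumes a compact one-step training routine like the rule above; the layer sizes, epoch count, and random binary data are placeholders and not the settings used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=200, lr=0.1):
    """Train one RBM with a one-step reconstruction rule (simplified)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        p_h = sigmoid(b_hid + data @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        v_recon = sigmoid(b_vis + h @ W.T)
        p_h_recon = sigmoid(b_hid + v_recon @ W)
        W += lr * (data.T @ p_h - v_recon.T @ p_h_recon) / len(data)
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (p_h - p_h_recon).mean(axis=0)
    return W, b_vis, b_hid

def pretrain_stack(data, layer_sizes, rng):
    """Greedy pretraining: each RBM learns from the layer below it."""
    layers, activities = [], data
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm(activities, n_hidden, rng)
        layers.append((W, b_vis, b_hid))
        # Hidden activation probabilities become the next RBM's visible data
        activities = sigmoid(b_hid + activities @ W)
    return layers, activities

rng = np.random.default_rng(0)
data = (rng.random((40, 8)) < 0.5).astype(float)  # made-up binary data
layers, codes = pretrain_stack(data, layer_sizes=[6, 4, 2], rng=rng)
print([W.shape for W, _, _ in layers], codes.shape)
```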

It can be shown that adding an extra layer always improves a lower bound on the log probability that the model assigns to the training data, provided the number of feature detectors per layer does not decrease and their weights are initialized correctly.

The layer-by-layer learning algorithm is a very effective way to pretrain the weights of a deep autoencoder. Each layer of features captures strong, high-order correlations between the activities of the units in the layer below. For a wide variety of data sets, this is an efficient way to progressively reveal low-dimensional, nonlinear structure.

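Overall, building a deep autoencoder has three stages: pretraining, unfolding, and fine-tuning. Pretraining is the layer-by-layer learning described above. Unfolding takes the weights learned during pretraining and uses them for both the encoder and the decoder, which yields the actual autoencoder structure. Fine-tuning uses backpropagation to adjust the weights of the whole autoencoder.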
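The unfolding stage can be pictured as copying each pretrained weight matrix twice: once into the encoder and once, transposed, into the decoder. The sketch below shows only that step and a forward pass. The random matrices stand in for pretrained RBM weights, every layer is logistic for simplicity, and the fine-tuning loop is omitted; all of these are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unfold(rbm_layers):
    """Turn a stack of RBMs (W, b_vis, b_hid) into encoder and decoder layers."""
    encoder = [(W.copy(), b_hid.copy()) for W, b_vis, b_hid in rbm_layers]
    # The decoder reuses the same weights, transposed, in reverse order.
    # Copies are made so that fine-tuning can later change them independently.
    decoder = [(W.T.copy(), b_vis.copy())
               for W, b_vis, b_hid in reversed(rbm_layers)]
    return encoder, decoder

def forward(x, layers):
    for W, b in layers:
        x = sigmoid(x @ W + b)
    return x

rng = np.random.default_rng(0)
sizes = [8, 6, 4, 2]  # made-up layer sizes
# Stand-ins for pretrained RBM parameters
rbm_layers = [(0.1 * rng.normal(size=(n_in, n_out)),
               np.zeros(n_in), np.zeros(n_out))
              for n_in, n_out in zip(sizes[:-1], sizes[1:])]

encoder, decoder = unfold(rbm_layers)
x = rng.random((5, 8))
code = forward(x, encoder)       # data space -> code space
x_hat = forward(code, decoder)   # code space -> data space
print(code.shape, x_hat.shape)
```

After unfolding, the encoder and decoder hold separate copies of the weights, so a backpropagation pass over the whole network is free to move them apart during fine-tuning.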

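Generalizing to continuous data

For continuous data, the visible units are changed from stochastic binary variables to linear units with Gaussian noise, while the hidden units of the first-level RBM remain binary. In the experiments, the visible units of every RBM had real-valued activities.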
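In code, the change for continuous data affects only how the visible layer is reconstructed: the logistic function is dropped and Gaussian noise is added to a linear mean. The sketch below assumes unit-variance noise (that is, data scaled to unit variance); the function names and parameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_given_visible(v, W, b_hid):
    # Hidden units stay binary, so this conditional is unchanged
    return sigmoid(b_hid + v @ W)

def reconstruct_gaussian_visible(h, W, b_vis, rng):
    """Linear visible units with Gaussian noise (unit variance assumed)."""
    mean = b_vis + h @ W.T  # no sigmoid: the visible units are linear
    return mean + rng.normal(size=mean.shape), mean

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(5, 3))  # 5 visible units, 3 hidden units
b_vis, b_hid = np.zeros(5), np.zeros(3)

v = rng.normal(size=(4, 5))  # made-up real-valued data
p_h = p_hidden_given_visible(v, W, b_hid)
h = (rng.random(p_h.shape) < p_h).astype(float)
v_recon, v_mean = reconstruct_gaussian_visible(h, W, b_vis, rng)
print(v_recon.shape)
```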

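Experiments:

Synthetic data

MNIST data set

Reuters Corpus: The autoencoder clearly outperformed latent semantic analysis, a well known document retrieval method based on PCA. (Presumably the link is in the optimization objective: LSA is computed with a truncated SVD, much as PCA is.)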

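Summary:

Benefit of pretraining: it generalizes well, because most of the information in the network weights comes from modeling the raw data itself.

The limited information contained in the labels is only suited to slightly adjusting the weights.

Earlier experimental experience bears this out.

It has been clear since the 1980s that backpropagation through deep autoencoders would be very effective for nonlinear dimensionality reduction. The three conditions it depends on have only now been met: (1) computers are fast enough; (2) data sets are big enough; (3) the initial weights are close enough to a good solution.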

Autoencoders give mappings in both directions between data and code spaces.
