[Machine Learning] Unsupervised Learning: Autoencoder and VAE

Reposted article, 2018-10-17 11:01. Tags: python / machine learning / unsupervised learning / TensorFlow

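As is well known, training data for machine learning is expensive largely because supervised learning needs large amounts of manually labeled data.

An autoencoder has input and output layers of the same dimension, and it is trained to reproduce its input at the output. The input itself serves as the training target, so no labels are needed and training is unsupervised. An autoencoder can also reduce dimensionality: although the input and output have the same size, the middle layer can be much smaller, which yields a condensed representation of the data.

Autoencoders can be used for pretraining: for a deep model that is hard to train, the network structure and initial weights are first settled with an autoencoder, and the model is then fine-tuned on the training data.

Certain types of autoencoder can act as generative models and produce new content, for example automatically composed poetry.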

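Data representation:

Human memory is closely tied to patterns in data. A skilled chess player asked to memorize a position from a real game shows remarkable recall, but when the pieces are placed at random, the same player's memory is no better than an ordinary person's. This shows the power of patterns: memorizing through the relationships in the data is far more efficient.

Because the middle layer of an autoencoder has fewer dimensions, the network is forced to find patterns inside the data, which lets it memorize the training data efficiently.

As the figure below shows, the middle layer is usually chosen to be small so that it acts as an efficient representation.

If the network is purely linear (no activation functions) and the cost function is MSE, the trained autoencoder behaves like PCA: its codings span the same subspace as the leading principal components.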

# Build the dataset: noisy 3D points that lie close to a 2D plane
import numpy as np
import numpy.random as rnd

rnd.seed(4)
m = 200
w1, w2 = 0.1, 0.3
noise = 0.1
angles = rnd.rand(m) * 3 * np.pi / 2 - 0.5
data = np.empty((m, 3))
data[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * rnd.randn(m) / 2
data[:, 1] = np.sin(angles) * 0.7 + noise * rnd.randn(m) / 2
data[:, 2] = data[:, 0] * w1 + data[:, 1] * w2 + noise * rnd.randn(m)

# Standardize the data (the scaler is fitted on the training split only)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(data[:100])
X_test = scaler.transform(data[100:])

# Build the autoencoder (this code targets the TensorFlow 1.x API; tf.contrib does not exist in 2.x)
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected
n_inputs = 3 # 3D inputs
n_hidden = 2 # 2D codings
# Force the output layer to have the same size as the input layer
n_outputs = n_inputs
learning_rate = 0.01
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
# The hidden layer is fully connected to the input layer
# No nonlinearity is applied: activation_fn=None
hidden = fully_connected(X, n_hidden, activation_fn=None)
outputs = fully_connected(hidden, n_outputs, activation_fn=None)
# The loss function is the mean squared error (MSE)
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) # MSE
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)
init = tf.global_variables_initializer()

# Run the training
# X_train and X_test are the standardized splits prepared above
n_iterations = 1000
# the output of the hidden layer provides the codings
codings = hidden
with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        # no labels (unsupervised)
        training_op.run(feed_dict={X: X_train})
    codings_val = codings.eval(feed_dict={X: X_test})

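The figure below shows what the hidden layer does: it picks the best plane through the 3D data in the left plot and projects the data onto that 2D plane.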
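The link between a linear autoencoder and PCA can be checked without TensorFlow. The following is a minimal NumPy sketch, not the article's own code: it rebuilds a dataset of the same shape as the one above, computes the PCA projection with an SVD, and treats the top two principal directions as "encoder" weights and their transpose as "decoder" weights. The names `W_enc`, `pca_mse` and `expected_mse` are illustrative assumptions.

```python
import numpy as np

# Same kind of dataset as in the article: 3D points close to a 2D plane.
rng = np.random.RandomState(4)
m = 200
angles = rng.rand(m) * 3 * np.pi / 2 - 0.5
data = np.empty((m, 3))
data[:, 0] = np.cos(angles) + np.sin(angles) / 2 + 0.1 * rng.randn(m) / 2
data[:, 1] = np.sin(angles) * 0.7 + 0.1 * rng.randn(m) / 2
data[:, 2] = data[:, 0] * 0.1 + data[:, 1] * 0.3 + 0.1 * rng.randn(m)

# PCA assumes zero-mean features, so center the data first.
X = data - data.mean(axis=0)

# Rows of Vt are the principal directions, ordered by singular value.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W_enc = Vt[:2].T            # (3, 2): plays the role of the encoder weights
codings = X @ W_enc         # (m, 2): 2D codings
X_rec = codings @ W_enc.T   # (m, 3): the decoder is the transposed encoder

# Reconstruction MSE of the best 2D linear projection.
pca_mse = np.mean((X - X_rec) ** 2)
# The squared error is exactly the energy of the discarded third component.
expected_mse = s[2] ** 2 / X.size
```

A linear autoencoder trained to its MSE optimum cannot do better than `pca_mse`; its codings may differ from `codings` by an invertible linear map, but they describe the same plane.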

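Stacked autoencoder

A stacked autoencoder has several hidden layers, arranged symmetrically from input to output as in the figure below: the path from the input to the middle is the encoder, and the path from the middle to the output is the decoder.

As the network gets deeper, training becomes harder. The code below, for example, uses l2 regularization and the ELU activation function: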

n_inputs = 28 * 28 # for MNIST
n_hidden1 = 300
n_hidden2 = 150 # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.01
l2_reg = 0.001
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
# arg_scope sets shared default arguments for fully_connected (e.g. the same l2_regularizer everywhere),
# so the four fully_connected calls below all use the defaults written in this with statement
with tf.contrib.framework.arg_scope(
        [fully_connected],
        activation_fn=tf.nn.elu,
        weights_initializer=tf.contrib.layers.variance_scaling_initializer(),
        weights_regularizer=tf.contrib.layers.l2_regularizer(l2_reg)):
    hidden1 = fully_connected(X, n_hidden1)
    hidden2 = fully_connected(hidden1, n_hidden2) # codings
    hidden3 = fully_connected(hidden2, n_hidden3)
    # The last layer passes activation_fn=None to override the default set above
    outputs = fully_connected(hidden3, n_outputs, activation_fn=None)

# Because regularizers were attached above, the per-layer regularization losses can be fetched from the REGULARIZATION_LOSSES collection and added to reconstruction_loss
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) # MSE
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([reconstruction_loss] + reg_losses)
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)
init = tf.global_variables_initializer()

# `mnist` is assumed to be an MNIST dataset object loaded beforehand; the original does not show this step
n_epochs = 5
batch_size = 150
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            # Only the inputs are fed; the labels are not used
            sess.run(training_op, feed_dict={X: X_batch})

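Since the autoencoder's architecture is symmetric, the decoder weights can be tied to the encoder weights (each decoder weight matrix is the transpose of the corresponding encoder matrix). This roughly halves the number of weights, which reduces the risk of overfitting and speeds up training.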
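Weight tying can be illustrated with a small NumPy forward pass. This is only a sketch under assumed names (`W1` to `W4`, `b1` to `b4`, `elu`, `forward`) with the layer sizes of the stacked example; it shows that the decoder matrices are transposed views of the encoder matrices rather than separate parameters.

```python
import numpy as np

rng = np.random.RandomState(0)
n_inputs, n_hidden1, n_hidden2 = 28 * 28, 300, 150

# Only the encoder weight matrices are free parameters.
W1 = rng.randn(n_inputs, n_hidden1) * 0.01
W2 = rng.randn(n_hidden1, n_hidden2) * 0.01
# Decoder weights are transposed views of the encoder weights, not new variables.
W3, W4 = W2.T, W1.T
# Biases are not tied: every layer keeps its own.
b1, b2 = np.zeros(n_hidden1), np.zeros(n_hidden2)
b3, b4 = np.zeros(n_hidden1), np.zeros(n_inputs)

def elu(z):
    # ELU activation; the minimum() avoids overflow in exp for large positive z.
    return np.where(z > 0, z, np.exp(np.minimum(z, 0)) - 1)

def forward(X):
    h1 = elu(X @ W1 + b1)
    h2 = elu(h1 @ W2 + b2)   # codings
    h3 = elu(h2 @ W3 + b3)
    return h3 @ W4 + b4      # linear output layer

X_batch = rng.rand(5, n_inputs)
outputs = forward(X_batch)

tied_weight_count = W1.size + W2.size
untied_weight_count = 2 * tied_weight_count
```

Only the weight matrices are halved; the bias vectors are still one per layer, which is why the saving is "roughly" half of all parameters.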

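A common training approach is layer-by-layer training: train hidden layer 1 first and freeze its weights, then train hidden2, then mirror the result (hidden3 corresponds exactly to hidden1) to obtain the final network.

Alternatively, define separate name scopes and train in different phases: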

[...] # Build the whole stacked autoencoder normally.
# In this example, the weights are not tied.
optimizer = tf.train.AdamOptimizer(learning_rate)
with tf.name_scope("phase1"):
    phase1_outputs = tf.matmul(hidden1, weights4) + biases4
    phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
    phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
    phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
    phase1_training_op = optimizer.minimize(phase1_loss)

# While phase 2 is trained, the phase 1 variables stay frozen (only train_vars are updated)
with tf.name_scope("phase2"):
    phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
    phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
    phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
    train_vars = [weights2, biases2, weights3, biases3]
    phase2_training_op = optimizer.minimize(phase2_loss, var_list=train_vars)

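Pretraining

When most of the data is unlabeled and only a small part is labeled, the large unlabeled set is used in a first stage for unsupervised pretraining. The encoder part is then taken out as is, and the output part is replaced directly, for example by the fully connected layer and softmax output in the figure below. This reduces the overfitting caused by having too little labeled data.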
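The reuse step can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the article's code: `W_enc` and `b_enc` stand in for encoder parameters that would come from unsupervised pretraining (here they are random), the labeled data is random, and `encode`, `softmax`, `cross_entropy` and `train_step` are hypothetical helpers. Only the new softmax layer is updated; the encoder stays frozen.

```python
import numpy as np

rng = np.random.RandomState(0)
n_inputs, n_hidden, n_classes = 28 * 28, 150, 10

# Stand-ins for encoder parameters learned during unsupervised pretraining.
W_enc = rng.randn(n_inputs, n_hidden) * 0.01
b_enc = np.zeros(n_hidden)

def encode(X):
    # Frozen encoder; ReLU is used here only for brevity.
    return np.maximum(X @ W_enc + b_enc, 0)

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(X, y, W_out, b_out):
    probs = softmax(encode(X) @ W_out + b_out)
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

def train_step(X, y, W_out, b_out, lr=0.1):
    # One gradient step on the softmax layer only; W_enc and b_enc are untouched.
    H = encode(X)
    grad_logits = softmax(H @ W_out + b_out)
    grad_logits[np.arange(len(y)), y] -= 1.0
    grad_logits /= len(y)
    return W_out - lr * H.T @ grad_logits, b_out - lr * grad_logits.sum(axis=0)

# A small labeled set (random here, purely for illustration).
X_small = rng.rand(20, n_inputs)
y_small = rng.randint(0, n_classes, size=20)

W_out = np.zeros((n_hidden, n_classes))
b_out = np.zeros(n_classes)
W_enc_before = W_enc.copy()
loss_before = cross_entropy(X_small, y_small, W_out, b_out)
for _ in range(50):
    W_out, b_out = train_step(X_small, y_small, W_out, b_out)
loss_after = cross_entropy(X_small, y_small, W_out, b_out)
```

In practice the encoder is often unfrozen later for fine-tuning; keeping it frozen here just makes the reuse explicit.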

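Denoising autoencoder

As shown below, noise is deliberately added to the input, while the network learns to output the noise-free version. A dropout layer can also be added during training to remove part of the network at random (it is not applied at test time). Both make training harder, which improves the network's robustness and makes the model more stable.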
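The two ways of corrupting the input can be sketched in NumPy. The helper names (`add_gaussian_noise`, `dropout_inputs`, `reconstruction_mse`) and the noise levels are assumptions chosen for illustration. The key point is that the loss compares the network output with the clean input, never with the corrupted one.

```python
import numpy as np

rng = np.random.RandomState(42)

def add_gaussian_noise(X, noise_std=0.5):
    # Corrupt every feature with zero-mean Gaussian noise.
    return X + noise_std * rng.randn(*X.shape)

def dropout_inputs(X, drop_rate=0.3):
    # Zero out a random subset of features; the survivors are rescaled
    # (inverted dropout) so the expected value of each feature is unchanged.
    keep_mask = rng.rand(*X.shape) >= drop_rate
    return X * keep_mask / (1.0 - drop_rate)

def reconstruction_mse(outputs, X_clean):
    # The target is the clean input, not the corrupted one.
    return np.mean((outputs - X_clean) ** 2)

X_clean = rng.rand(4, 6)
X_noisy = add_gaussian_noise(X_clean)      # fed to the network during training
X_dropped = dropout_inputs(X_clean, 0.3)   # alternative corruption

# An identity "network" that just copies its corrupted input scores badly,
# which is why the autoencoder has to learn the structure of the data.
identity_loss = reconstruction_mse(X_noisy, X_clean)
```

At test time the clean input is fed directly, with no noise and no dropout.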

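Sparse autoencoder

The number of active neurons in the middle layer is constrained by an upper limit, so the middle layer is very sparse and only a few neurons carry a signal. Being concise yet comprehensive in this way strengthens the middle layer's ability to summarize information.

The sparsity penalty can be either a squared error or a KL divergence; the figure below compares the KL divergence with the MSE.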

def kl_divergence(p, q):
    return p * tf.log(p / q) + (1 - p) * tf.log((1 - p) / (1 - q))
learning_rate = 0.01
sparsity_target = 0.1
sparsity_weight = 0.2
[...] # Build a normal autoencoder (the coding layer is hidden1)
optimizer = tf.train.AdamOptimizer(learning_rate)
hidden1_mean = tf.reduce_mean(hidden1, axis=0) # batch mean
sparsity_loss = tf.reduce_sum(kl_divergence(sparsity_target, hidden1_mean))
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X)) # MSE
loss = reconstruction_loss + sparsity_weight * sparsity_loss
training_op = optimizer.minimize(loss)

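# The KL divergence needs activations strictly between 0 and 1 (the log of 0 is undefined),
# so tanh, whose range is (-1, 1), cannot be used for the coding layer; use sigmoid, whose range is (0, 1)
hidden1 = tf.nn.sigmoid(tf.matmul(X, weights1) + biases1)
# [...]
logits = tf.matmul(hidden1, weights2) + biases2
outputs = tf.nn.sigmoid(logits)
reconstruction_loss = tf.reduce_sum(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits))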
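To see how the two sparsity penalties differ, the sketch below evaluates both for a few mean activation values with a sparsity target of 0.1. It reuses the `kl_divergence` formula from the code above in NumPy; the sample activations are arbitrary values picked for illustration.

```python
import numpy as np

def kl_divergence(p, q):
    # KL divergence between two Bernoulli distributions with means p and q.
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

sparsity_target = 0.1
# Hypothetical mean activations of four coding neurons over a batch.
mean_activations = np.array([0.1, 0.3, 0.5, 0.9])

kl_penalty = kl_divergence(sparsity_target, mean_activations)
mse_penalty = (mean_activations - sparsity_target) ** 2
```

Both penalties vanish when the mean activation equals the target, but for these values the KL penalty is larger everywhere else and keeps growing without bound as the activation approaches 1, so it pushes harder against neurons that are active too often.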

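Variational Autoencoder

The output is determined by sampling, so using the model involves probabilistic randomness. It is a generative model: what it produces is related to the training set and resembles it, but each output is an entirely new instance.

As the figure below shows, the middle layer adds Gaussian noise whose distribution is parameterized by a mean and a variance. What the network learns is therefore not a plain encoding but the pattern of the data: the training data is mapped onto a normal distribution, so the output layer can produce data that closely resembles the input without being identical to it.

At generation time the encoder is dropped: random Gaussian noise is fed in as the coding, and the decoder produces a completely new output.

In other words, after passing through the NN encoder the input yields two codings. One of them is transformed (in the code below, exponentiated to obtain a standard deviation), multiplied by Gaussian noise (values drawn from a normal distribution), and added to the other coding, which acts as the mean; the result is the sampled middle-layer coding. The next figure shows the same idea as the previous one. The final output is trained to minimize the reconstruction loss, which should be as close to 0 as possible.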
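The sampling step described above can be sketched in NumPy. The values of `coding_mean` and `coding_gamma` are made-up encoder outputs for a single input, with `coding_gamma` assumed to hold the log-variance; drawing many samples shows that "mean plus sigma times standard normal noise" really produces codings with the requested mean and standard deviation.

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_codings = 10000, 2

# Hypothetical encoder outputs for one input.
coding_mean = np.array([1.0, -2.0])
coding_gamma = np.log(np.array([0.25, 4.0]))   # gamma = log(sigma^2)
coding_sigma = np.exp(0.5 * coding_gamma)      # standard deviations 0.5 and 2.0

# Standard normal noise, scaled by sigma and shifted by the mean.
noise = rng.randn(n_samples, n_codings)
codings = coding_mean + coding_sigma * noise

sample_mean = codings.mean(axis=0)
sample_std = codings.std(axis=0)
```

Because the randomness lives entirely in `noise`, the coding is a deterministic function of the mean and sigma, which is what lets gradients flow back into the encoder during training.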

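# smoothing term to avoid computing log(0)
eps = 1e-10
# Minimizing this loss maps the data from the input space onto a regular normal distribution
# (it is the KL divergence between the coding distribution and a standard normal)
latent_loss = 0.5 * tf.reduce_sum(
    tf.square(hidden3_sigma) + tf.square(hidden3_mean) - 1 - tf.log(eps + tf.square(hidden3_sigma)))

# Equivalent variant in which the encoder outputs gamma = log(sigma^2) instead of sigma
latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)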
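The two forms of the latent loss can be compared numerically. The sketch below uses random NumPy arrays in place of the TensorFlow tensors, under the assumption that gamma is the log-variance, so that sigma = exp(0.5 * gamma).

```python
import numpy as np

rng = np.random.RandomState(1)
eps = 1e-10

# Random stand-ins for the encoder outputs of a batch of 5 inputs, 20 codings each.
mean = rng.randn(5, 20)
gamma = rng.randn(5, 20)        # gamma = log(sigma^2)
sigma = np.exp(0.5 * gamma)

# Form 1: written in terms of sigma, with eps guarding the logarithm.
loss_sigma = 0.5 * np.sum(
    np.square(sigma) + np.square(mean) - 1 - np.log(eps + np.square(sigma)))

# Form 2: written in terms of gamma; no logarithm, hence no eps needed.
loss_gamma = 0.5 * np.sum(np.exp(gamma) + np.square(mean) - 1 - gamma)

# The loss is zero exactly when the codings already follow a standard normal.
loss_at_prior = 0.5 * np.sum(np.exp(np.zeros(20)) + np.zeros(20) - 1 - np.zeros(20))
```

The gamma form avoids taking the logarithm of a value that may underflow to zero, which is a common reason to prefer it.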

n_inputs = 28 * 28 # for MNIST
n_hidden1 = 500
n_hidden2 = 500
n_hidden3 = 20 # codings
n_hidden4 = n_hidden2
n_hidden5 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.001

with tf.contrib.framework.arg_scope(
    [fully_connected],
    activation_fn=tf.nn.elu,
    weights_initializer=tf.contrib.layers.variance_scaling_initializer()):
    X = tf.placeholder(tf.float32, [None, n_inputs])
    hidden1 = fully_connected(X, n_hidden1)
    hidden2 = fully_connected(hidden1, n_hidden2)
    # The middle layer represents a distribution (mean and log-variance), and noise is added to it
    hidden3_mean = fully_connected(hidden2, n_hidden3, activation_fn=None)
    hidden3_gamma = fully_connected(hidden2, n_hidden3, activation_fn=None)
    hidden3_sigma = tf.exp(0.5 * hidden3_gamma)
    noise = tf.random_normal(tf.shape(hidden3_sigma), dtype=tf.float32)
    # Use the noisy layer to build the layers that follow
    hidden3 = hidden3_mean + hidden3_sigma * noise
    hidden4 = fully_connected(hidden3, n_hidden4)
    hidden5 = fully_connected(hidden4, n_hidden5)
    logits = fully_connected(hidden5, n_outputs, activation_fn=None)
    outputs = tf.sigmoid(logits)

reconstruction_loss = tf.reduce_sum(
             tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits))
latent_loss = 0.5 * tf.reduce_sum(
             tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)
cost = reconstruction_loss + latent_loss

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(cost)

init = tf.global_variables_initializer()

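# Generate data
import numpy as np
import matplotlib.pyplot as plt

# Assumed helper (not defined in the original): draw one 28x28 digit
def plot_image(image, shape=(28, 28)):
    plt.imshow(image.reshape(shape), cmap="Greys", interpolation="nearest")
    plt.axis("off")

n_digits = 60
n_epochs = 50
batch_size = 150
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})
    # Feed random codings straight into hidden3, bypassing the encoder
    codings_rnd = np.random.normal(size=[n_digits, n_hidden3])
    outputs_val = outputs.eval(feed_dict={hidden3: codings_rnd})
for iteration in range(n_digits):
    plt.subplot(n_digits // 10, 10, iteration + 1)  # 6 rows x 10 columns
    plot_image(outputs_val[iteration])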

The generated results are shown below; none of them are images that appear in the training set.
