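Learning Deep Learning with Tencent Cloud GPU, Part 2: A Concise Look at How Tensorflow Works [Repost]

Reposted article, 2017-07-20 21:09. Tags: [Artificial Intelligence] / [Algorithms] / [Thinking]

Reposted from: https://www.qcloud.com/community/article/598765?fromSource=gwzcw.117333.117333.117333

This is the second article in the series "Learning Deep Learning with Tencent Cloud GPU". It explains the principles behind Tensorflow and shows how to implement them in the simplest possible Python code. The series as a whole covers how to run deep learning workloads on Tencent Cloud GPU servers: the early articles focus on principles, and the later ones on practice.

Previous articles:

Learning Deep Learning with Tencent Cloud GPU, Part 1: A Review of Traditional Machine Learning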

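1. Neural network principles

A neural network model is a typical supervised learning problem of the kind discussed in the previous article: we have a set of inputs and their corresponding target outputs, and we look for the best model. With that model, a new input yields a predicted output close to the true one.

Let us first see how to implement a simple neural network like this:

Input x = [1, 2, 3]
Target output y = [-0.85, 0.72]
A hidden layer with four units in between.
The structure is shown in the figure:

We want the parameters w1_0, w2_0, b1_0 and b2_0 that minimize the mean squared error (MSE) between the output produced for the given input x and the target output y.

First, how many parameters are there? This is a two-layer neural network with the structure shown below (image source: http://stackoverflow.com/questions/22054877/backpropagation-training-stuck), where the input layer has 3 units, the hidden layer 4, and the output layer 2:

[Figure: a two-layer network with 3 input units, 4 hidden units and 2 output units]

So there are (3*4 + 4) + (4*2 + 2) = 26 parameters to train in total. We can initialize them as shown in the figure. Parameters can be initialized randomly or set to arbitrary values: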

# python
import numpy as np

# Weights: input -> hidden (3x4) and hidden -> output (4x2)
w1_0 = np.array([[0.1, 0.2, 0.3, 0.4],
                 [0.5, 0.6, 0.7, 0.8],
                 [0.9, 1.0, 1.1, 1.2]])
w2_0 = np.array([[1.3, 1.4],
                 [1.5, 1.6],
                 [1.7, 1.8],
                 [1.9, 2.0]])

# Biases: one per hidden unit and one per output unit
b1_0 = np.array([-2.0, -6.0, -1.0, -7.0])
b2_0 = np.array([-2.5, -5.0])
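As a quick sanity check on the count of 26, the sketch below tallies the weights and biases of a fully connected network from its layer sizes. The helper name count_parameters is made up for this illustration and is not part of any library.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    `layer_sizes` lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix of shape (n_in, n_out)
        total += n_out         # one bias per unit of the next layer
    return total


n_params = count_parameters([3, 4, 2])
print(n_params)  # (3*4 + 4) + (4*2 + 2) = 26
```

The same number comes out of the arrays above by adding up the sizes of w1_0, b1_0, w2_0 and b2_0.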

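We run one forward pass:

# python
x = np.array([1, 2, 3])
y = np.array([-0.85, 0.72])

o1 = np.dot(x, w1_0) + b1_0               # hidden layer, before activation
os1 = np.power(1 + np.exp(o1 * -1), -1)   # sigmoid activation
o2 = np.dot(os1, w2_0) + b2_0             # output layer, before activation
os2 = np.tanh(o2)                         # tanh activation

Then one backward pass:

# python
alpha = 0.1

grad_os2 = (y - os2) * (1 - np.power(os2, 2))
grad_os1 = np.dot(w2_0, grad_os2.T).T * (1 - os1) * os1

# The remaining gradients and updates are deliberately left out here.
grad_w2 = ...
grad_b2 = ...
...
...

w2_0 = w2_0 + alpha * grad_w2
b2_0 = b2_0 + alpha * grad_b2
...
...

We repeat this many times until the error converges. During backpropagation, the derivative with respect to every parameter has to be written out, and the parameters are then updated from those derivatives. I have not written them all out here, because deriving them layer by layer is simply too tedious. More importantly, whenever we want to train a new network structure, all of this has to be derived again, which costs time and effort.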
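For readers who want to see the elided steps, here is one possible way to fill them in. It is a sketch under assumptions, not the author's code: it takes the loss to be 0.5 * sum((y - os2)^2), so that grad_os2 and grad_os1 are the negative gradients with respect to the pre-activations o2 and o1, and it completes grad_w2, grad_b2, grad_w1 and grad_b1 accordingly. The 200 steps are an arbitrary choice.

```python
import numpy as np

# Same initial parameters, input and target as in the article.
w1 = np.array([[0.1, 0.2, 0.3, 0.4],
               [0.5, 0.6, 0.7, 0.8],
               [0.9, 1.0, 1.1, 1.2]])
w2 = np.array([[1.3, 1.4],
               [1.5, 1.6],
               [1.7, 1.8],
               [1.9, 2.0]])
b1 = np.array([-2.0, -6.0, -1.0, -7.0])
b2 = np.array([-2.5, -5.0])

x = np.array([1.0, 2.0, 3.0])
y = np.array([-0.85, 0.72])
alpha = 0.1


def forward(x, w1, b1, w2, b2):
    os1 = 1.0 / (1.0 + np.exp(-(np.dot(x, w1) + b1)))  # sigmoid hidden layer
    os2 = np.tanh(np.dot(os1, w2) + b2)                # tanh output layer
    return os1, os2


losses = []
for step in range(200):
    os1, os2 = forward(x, w1, b1, w2, b2)
    losses.append(0.5 * np.sum((y - os2) ** 2))  # assumed loss

    # Negative gradients of the loss w.r.t. the pre-activations o2 and o1.
    grad_os2 = (y - os2) * (1 - os2 ** 2)
    grad_os1 = np.dot(w2, grad_os2) * (1 - os1) * os1

    # Negative gradients w.r.t. the parameters (assumed completion).
    grad_w2 = np.outer(os1, grad_os2)
    grad_b2 = grad_os2
    grad_w1 = np.outer(x, grad_os1)
    grad_b1 = grad_os1

    # Adding alpha * (negative gradient) is a gradient descent step.
    w2 = w2 + alpha * grad_w2
    b2 = b2 + alpha * grad_b2
    w1 = w1 + alpha * grad_w1
    b1 = b1 + alpha * grad_b1

print(losses[0], losses[-1])
```

Even for this tiny network, four gradient formulas had to be derived by hand, which is the tedium the text describes.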

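Looking more closely, though, the derivation does follow a pattern. The gradient produced at one level is used as the input for computing the gradient at the next level, just as the output of one layer is the input of the next layer in the forward direction. So we ask: can this process be abstracted into a framework that differentiates automatically? The family of deep learning frameworks represented by Tensorflow grew out of exactly this idea.

2. Deep learning frameworks

Which deep learning framework has been the hottest in recent years? Without question, Tensorflow wins by a wide margin.

In practice, though, these deep learning frameworks share some common features. According to Gokula Krishnan Santhanam, most deep learning frameworks contain the following five core components:

Tensors
Operations on tensors
Computation graph
Automatic differentiation tools
Extension packages such as BLAS, cuBLAS and cuDNN

A tensor can be understood as an array with any number of dimensions: a one-dimensional array is called a vector, a two-dimensional one a matrix, and both are tensors. Given tensors, there are corresponding basic operations, such as taking the value at a given row and column, or multiplying a tensor by a constant. Using the extension packages essentially means relying on low-level numerical libraries to speed up computation.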
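To make the notion of a tensor and its basic operations concrete, here is a small NumPy sketch. The variable names are chosen for this illustration only.

```python
import numpy as np

vector = np.array([1.0, 2.0, 3.0])           # 1-D tensor
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2-D tensor
cube = np.zeros((2, 3, 4))                   # 3-D tensor

# Number of dimensions of each tensor.
print(vector.ndim, matrix.ndim, cube.ndim)

row = matrix[1]        # take the second row
col = matrix[:, 0]     # take the first column
scaled = 2.0 * matrix  # multiply every element by a constant
```

NumPy itself hands operations such as np.dot to a BLAS library when one is available, which is the role the extension packages play in the list above.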

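Our focus today is on two of these: the computation graph model and automatic differentiation. We first look at how automatic differentiation is implemented, using the Torch framework as an example, and then implement both parts in the simplest way possible.

2.1. How deep learning frameworks implement automatic differentiation

For getting started with a framework like Tensorflow, there is plenty of material online of the "a few lines of code, a few minutes" kind, which gets simple tasks such as handwritten digit recognition working quickly. Understanding the principles behind Tensorflow in depth is not so easy. Here we briefly discuss that part.

When we take data and train a neural network, all the parameters in the network are variables. Training the model means finding the set of variables that makes the predictions most accurate. In practice, the input data goes through forward propagation to become a prediction, and the error between the prediction and the ground truth is then propagated backward to update the variables. Repeating this many times yields the optimal parameters. This raises a question: with so many layers in the network, how do we make sure that both forward and backward propagation run correctly?

Note that both kinds of propagation behave like a pipeline. Forward propagation simply computes layer by layer, with the result of one layer serving as the input to the next. Backpropagation can use the chain rule to work from back to front, attributing the error to each parameter in turn.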

[Two figures. Image source: Colah's blog]
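The pipeline view of the chain rule can be checked on a tiny scalar example. The sketch below is an illustration under assumptions, not taken from the original article: it chains a sigmoid and a tanh, runs the backward pass by multiplying local derivatives from the last stage to the first, and compares the result with a finite-difference estimate.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


x = 0.5

# Forward pass: each stage consumes the output of the previous one.
a = sigmoid(x)
b = math.tanh(a)

# Backward pass: start from db/db = 1 and multiply by the local
# derivative of each stage, last stage first.
grad_b = 1.0
grad_a = grad_b * (1.0 - b * b)  # d tanh(a) / da
grad_x = grad_a * a * (1.0 - a)  # d sigmoid(x) / dx

# Central finite difference as an independent check.
eps = 1e-6
numeric = (math.tanh(sigmoid(x + eps)) - math.tanh(sigmoid(x - eps))) / (2 * eps)

print(grad_x, numeric)
```

The two printed numbers should agree to several decimal places.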

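After this abstraction, we find that every module in a deep learning framework needs two functions, one for the forward connection and one for the backward connection. Forward and backward here are like the Ren and Du meridians in martial arts novels. Training a model, where data flows forward to produce predictions and the error flows backward to update the parameters, is like letting qi circulate through the body along these two meridians. As the training error shrinks and converges, the deep neural network opens up its Ren and Du meridians too.

Next, we first look at how the Torch source code implements these two parts, and then write a minimal deep learning framework directly in Python.

We use Torch's nn project as the example because its file structure is fairly simple. Tensorflow follows a similar pattern, but its file structure is more complex; interested readers can study the related articles.

Almost every .lua file in the Torch nn module's GitHub source directory has these two functions: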

-- lua
function xxx:updateOutput(input)
   input.THNN.xxx_updateOutput(
      input:cdata(),
      self.output:cdata()
   )
   return self.output
end

function xxx:updateGradInput(input, gradOutput)
   input.THNN.xxx_updateGradInput(
      input:cdata(),
      gradOutput:cdata(),
      self.gradInput:cdata(),
      self.output:cdata()
   )
   return self.gradInput
end

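These are essentially just the definitions of two methods, without the actual computation. The computation itself is implemented in C in the ./lib/THNN/generic directory. We take the Sigmoid function as an example.

The Sigmoid function has the form:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

It is implemented like this: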

// c
void THNN_(Sigmoid_updateOutput)(
          THNNState *state,
          THTensor *input,
          THTensor *output)
{
  THTensor_(resizeAs)(output, input);

  TH_TENSOR_APPLY2(real, output, real, input,
    *output_data = 1./(1.+ exp(- *input_data));
  );
}

The derivative of the Sigmoid function is:

$$\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)$$

So the implementation is:

// c
void THNN_(Sigmoid_updateGradInput)(
          THNNState *state,
          THTensor *input,
          THTensor *gradOutput,
          THTensor *gradInput,
          THTensor *output)
{
  THNN_CHECK_NELEMENT(input, gradOutput);
  THTensor_(resizeAs)(gradInput, output);
  TH_TENSOR_APPLY3(real, gradInput, real, gradOutput, real, output,
    real z = *output_data;
    *gradInput_data = *gradOutput_data * (1. - z) * z;
  );
}

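Notice that in updateOutput, output_data is on the left of the assignment and input_data on the right, whereas in updateGradInput, gradInput_data is on the left and gradOutput_data on the right. In other words, forward propagation computes the output from the input, and backpropagation computes the gradient with respect to the input from the gradient with respect to the output.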
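The same forward and backward pair can be mirrored in Python. The class below is a hypothetical analogue written for this illustration; SigmoidModule, update_output and update_grad_input are made-up names, not part of Torch. It caches the forward output so that the backward pass can reuse it, which is why the C function above receives output as an argument.

```python
import numpy as np


class SigmoidModule:
    """Hypothetical Python analogue of the two Torch functions above."""

    def update_output(self, x):
        # Forward: output = 1 / (1 + exp(-x)), cached for the backward pass.
        self.output = 1.0 / (1.0 + np.exp(-x))
        return self.output

    def update_grad_input(self, grad_output):
        # Backward: gradInput = gradOutput * (1 - z) * z, with z the cached output.
        z = self.output
        self.grad_input = grad_output * (1.0 - z) * z
        return self.grad_input


module = SigmoidModule()
out = module.update_output(np.array([0.0, 2.0]))
grad_in = module.update_grad_input(np.array([1.0, 1.0]))
print(out, grad_in)
```

At x = 0 the sigmoid is 0.5 and its derivative is 0.25, which gives a quick way to verify the first element of each result.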

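2.2. Writing a minimal deep learning framework directly in Python

This part is a case of "reinventing the wheel", and it borrows from MiniFlow, a small project from Udacity.

Data structures

First we implement a parent class Node, and then, based on it, implement the Input, Linear, Sigmoid and other modules in turn, using simple Python class inheritance. Each module overrides the two methods forward and backward.

The code is as follows: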

# python
import numpy as np


class Node(object):
    """
    Base class for nodes in the network.

    Arguments:

        `inbound_nodes`: A list of nodes with edges into this node.
    """
    def __init__(self, inbound_nodes=None):
        """
        Node's constructor (runs when the object is instantiated). Sets
        properties that all nodes need.
        """
        # A list of nodes with edges into this node.
        # (None rather than [] avoids a shared mutable default argument.)
        self.inbound_nodes = inbound_nodes if inbound_nodes is not None else []
        # The eventual value of this node. Set by running
        # the forward() method.
        self.value = None
        # A list of nodes that this node outputs to.
        self.outbound_nodes = []
        # Keys are the inputs to this node and
        # their values are the partials of this node with
        # respect to that input.
        self.gradients = {}
        # Sets this node as an outbound node for all of
        # this node's inputs.
        for node in self.inbound_nodes:
            node.outbound_nodes.append(self)

    def forward(self):
        """
        Every node that uses this class as a base class will
        need to define its own `forward` method.
        """
        raise NotImplementedError

    def backward(self):
        """
        Every node that uses this class as a base class will
        need to define its own `backward` method.
        """
        raise NotImplementedError


class Input(Node):
    """
    A generic input into the network.
    """
    def __init__(self):
        Node.__init__(self)

    def forward(self):
        # The value is set from outside (see topological_sort).
        pass

    def backward(self):
        # Sum the gradients coming from every node that uses this input.
        self.gradients = {self: 0}
        for n in self.outbound_nodes:
            self.gradients[self] += n.gradients[self]


class Linear(Node):
    """
    Represents a node that performs a linear transform.
    """
    def __init__(self, X, W, b):
        Node.__init__(self, [X, W, b])

    def forward(self):
        """
        Performs the math behind a linear transform.
        """
        X = self.inbound_nodes[0].value
        W = self.inbound_nodes[1].value
        b = self.inbound_nodes[2].value
        self.value = np.dot(X, W) + b

    def backward(self):
        """
        Calculates the gradient based on the output values.
        """
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        for n in self.outbound_nodes:
            grad_cost = n.gradients[self]
            # Partials with respect to X, W and b, in that order.
            self.gradients[self.inbound_nodes[0]] += np.dot(grad_cost, self.inbound_nodes[1].value.T)
            self.gradients[self.inbound_nodes[1]] += np.dot(self.inbound_nodes[0].value.T, grad_cost)
            self.gradients[self.inbound_nodes[2]] += np.sum(grad_cost, axis=0, keepdims=False)


class Sigmoid(Node):
    """
    Represents a node that performs the sigmoid activation function.
    """
    def __init__(self, node):
        Node.__init__(self, [node])

    def _sigmoid(self, x):
        """
        This method is separate from `forward` because it
        will be used with `backward` as well.

        `x`: A numpy array-like object.
        """
        return 1. / (1. + np.exp(-x))

    def forward(self):
        """
        Perform the sigmoid function and set the value.
        """
        input_value = self.inbound_nodes[0].value
        self.value = self._sigmoid(input_value)

    def backward(self):
        """
        Calculates the gradient using the derivative of
        the sigmoid function.
        """
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        for n in self.outbound_nodes:
            grad_cost = n.gradients[self]
            sigmoid = self.value
            self.gradients[self.inbound_nodes[0]] += sigmoid * (1 - sigmoid) * grad_cost


class Tanh(Node):
    def __init__(self, node):
        """
        Represents a node that performs the tanh activation function.
        """
        Node.__init__(self, [node])

    def forward(self):
        """
        Calculates the tanh.
        """
        input_value = self.inbound_nodes[0].value
        self.value = np.tanh(input_value)

    def backward(self):
        """
        Calculates the gradient using the derivative of tanh.
        """
        self.gradients = {n: np.zeros_like(n.value) for n in self.inbound_nodes}
        for n in self.outbound_nodes:
            grad_cost = n.gradients[self]
            tanh = self.value
            # d tanh(x) / dx = 1 - tanh(x)^2 = (1 + tanh) * (1 - tanh)
            self.gradients[self.inbound_nodes[0]] += (1 + tanh) * (1 - tanh) * grad_cost


class MSE(Node):
    def __init__(self, y, a):
        """
        The mean squared error cost function.
        Should be used as the last node for a network.
        """
        Node.__init__(self, [y, a])

    def forward(self):
        """
        Calculates the mean squared error.
        """
        y = self.inbound_nodes[0].value.reshape(-1, 1)
        a = self.inbound_nodes[1].value.reshape(-1, 1)
        # m is the number of samples (rows of y), not the number of elements.
        self.m = self.inbound_nodes[0].value.shape[0]
        self.diff = y - a
        self.value = np.mean(self.diff**2)

    def backward(self):
        """
        Calculates the gradient of the cost.
        """
        # Gradients are reshaped to match the shape of each input node.
        y_node, a_node = self.inbound_nodes
        self.gradients[y_node] = ((2.0 / self.m) * self.diff).reshape(y_node.value.shape)
        self.gradients[a_node] = ((-2.0 / self.m) * self.diff).reshape(a_node.value.shape)

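Scheduling and optimization

Optimization will be covered in detail separately later in the series. Here we briefly discuss how the computation graph is scheduled. The modules in Tensorflow in effect form a directed acyclic graph, as in the figure below (source: http://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/):

[Figure: an example directed acyclic graph]

During computation, the modules depend on one another. For example, computing module 1 requires modules 3 and 4 to be finished first, and finishing module 3 requires modules 5 and 2 to be finished before it, in that order. We can therefore use Kahn's algorithm as the scheduling algorithm (the topological_sort function below) to derive from the computation graph a computation order such as 5->2->3->4->1.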

# python
def topological_sort(feed_dict):
    """
    Sort the nodes in topological order using Kahn's Algorithm.

    `feed_dict`: A dictionary where the key is a `Input` Node and
    the value is the respective value feed to that Node.

    Returns a list of sorted nodes.
    """
    input_nodes = [n for n in feed_dict.keys()]

    # Build the graph G: for every node, its incoming and outgoing edges.
    G = {}
    nodes = [n for n in input_nodes]
    while len(nodes) > 0:
        n = nodes.pop(0)
        if n not in G:
            G[n] = {'in': set(), 'out': set()}
        for m in n.outbound_nodes:
            if m not in G:
                G[m] = {'in': set(), 'out': set()}
            G[n]['out'].add(m)
            G[m]['in'].add(n)
            nodes.append(m)

    # Repeatedly take a node with no remaining incoming edges.
    L = []
    S = set(input_nodes)
    while len(S) > 0:
        n = S.pop()
        if isinstance(n, Input):
            n.value = feed_dict[n]
        L.append(n)
        for m in n.outbound_nodes:
            G[n]['out'].remove(m)
            G[m]['in'].remove(n)
            if len(G[m]['in']) == 0:
                S.add(m)
    return L


def forward_and_backward(graph):
    """
    Performs a forward pass and a backward pass through a list of sorted Nodes.

    Arguments:

        `graph`: The result of calling `topological_sort`.
    """
    for n in graph:
        n.forward()

    for n in graph[::-1]:
        n.backward()


def sgd_update(trainables, learning_rate=1e-2):
    """
    Updates the value of each trainable with SGD.

    Arguments:

        `trainables`: A list of `Input` Nodes representing weights/biases.
        `learning_rate`: The learning rate.
    """
    for t in trainables:
        t.value = t.value - learning_rate * t.gradients[t]
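Kahn's algorithm is easier to follow in isolation. The sketch below applies it to the five-module example from the text. The function name kahn_order is made up for this illustration, and the edge list is an assumption reconstructed from the dependencies described above (5 before 2, 2 before 3, and both 3 and 4 before 1).

```python
from collections import deque


def kahn_order(edges):
    """Return one topological order for a list of (u, v) edges, u before v."""
    nodes = {n for edge in edges for n in edge}
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Start from the nodes that depend on nothing.
    ready = deque(sorted(n for n in nodes if indegree[n] == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        # Removing n may free up the nodes that depend on it.
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


# Assumed edges, matching the dependencies described in the text.
edges = [(5, 2), (2, 3), (3, 1), (4, 1)]
order = kahn_order(edges)
print(order)  # [4, 5, 2, 3, 1]
```

A graph usually has several valid topological orders. The order printed here and the 5->2->3->4->1 order from the text both place every module after the modules it depends on.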

Using the model

# python
import numpy as np
from sklearn.utils import resample

np.random.seed(0)

w1_0 = np.array([[0.1, 0.2, 0.3, 0.4],
                 [0.5, 0.6, 0.7, 0.8],
                 [0.9, 1.0, 1.1, 1.2]])
w2_0 = np.array([[1.3, 1.4],
                 [1.5, 1.6],
                 [1.7, 1.8],
                 [1.9, 2.0]])
b1_0 = np.array([-2.0, -6.0, -1.0, -7.0])
b2_0 = np.array([-2.5, -5.0])

X_ = np.array([[1.0, 2.0, 3.0]])
y_ = np.array([[-0.85, 0.72]])

W1_, b1_ = w1_0, b1_0
W2_, b2_ = w2_0, b2_0

# Build the graph: Linear -> Sigmoid -> Linear -> Tanh -> MSE
X, y = Input(), Input()
W1, b1 = Input(), Input()
W2, b2 = Input(), Input()

l1 = Linear(X, W1, b1)
s1 = Sigmoid(l1)
l2 = Linear(s1, W2, b2)
t1 = Tanh(l2)
cost = MSE(y, t1)

feed_dict = {
    X: X_, y: y_,
    W1: W1_, b1: b1_,
    W2: W2_, b2: b2_
}

epochs = 10
m = X_.shape[0]
batch_size = 1
steps_per_epoch = m // batch_size

graph = topological_sort(feed_dict)
trainables = [W1, b1, W2, b2]

# History of weights, outputs and losses, one entry per epoch.
l_Mat_W1 = [w1_0]
l_Mat_W2 = [w2_0]
l_Mat_out = []
l_val = []

for i in range(epochs):
    loss = 0
    for j in range(steps_per_epoch):
        X_batch, y_batch = resample(X_, y_, n_samples=batch_size)
        X.value = X_batch
        y.value = y_batch
        forward_and_backward(graph)
        sgd_update(trainables, 0.1)
        loss += cost.value
    l_val.append(loss / steps_per_epoch)

    # Read values from the nodes directly: the position of a node in
    # `graph` is not fixed, because topological_sort pops from a set.
    l_Mat_W1.append(W1.value)
    l_Mat_W2.append(W2.value)
    l_Mat_out.append(t1.value)

Let us take a look at the result. There are of course fancier visualization methods: https://jizhi.im/blog/post/v_nn_learn

# python
import matplotlib.pyplot as plt
# %matplotlib inline   (uncomment when running in a Jupyter notebook)

fig = plt.figure(figsize=(14, 10))

# Network output per epoch
ax0 = fig.add_subplot(131)
c0 = ax0.imshow(np.array(l_Mat_out).reshape([-1, 2]).T,
                interpolation='nearest', aspect='auto',
                cmap="Reds", vmax=1, vmin=-1)
ax0.set_title("Output")
cbar = fig.colorbar(c0, ticks=[-1, 0, 1])

# First weight matrix per epoch, flattened to 12 values
ax1 = fig.add_subplot(132)
c1 = ax1.imshow(np.array(l_Mat_W1).reshape(len(l_Mat_W1), 12).T,
                interpolation='nearest', aspect='auto', cmap="Reds")
ax1.set_title("w1")
cbar = fig.colorbar(c1, ticks=[np.min(np.array(l_Mat_W1)), np.max(np.array(l_Mat_W1))])

# Second weight matrix per epoch, flattened to 8 values
ax2 = fig.add_subplot(133)
c2 = ax2.imshow(np.array(l_Mat_W2).reshape(len(l_Mat_W2), 8).T,
                interpolation='nearest', aspect='auto', cmap="Reds")
ax2.set_title("w2")
cbar = fig.colorbar(c2, ticks=[np.min(np.array(l_Mat_W2)), np.max(np.array(l_Mat_W2))])

ax0.set_yticks([0, 1])
ax0.set_yticklabels(["out0", "out1"])
ax1.set_xlabel("epochs")

plt.show()

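We can see that as the number of training epochs grows, the Output value moves from its initial [0.72, -0.88] steadily toward y = [-0.85, 0.72]. The reason is that the model parameters keep changing from their initial values as they are updated, as the w1 and w2 matrices in the figure show.

[Figure: heat maps of Output, w1 and w2 over the training epochs]

That's it: the simplest wheel is built. Our wheel implements the Input, Linear, Sigmoid, Tanh and MSE modules. In the articles that follow, we will build on Tensorflow, currently the most popular wheel, and introduce more modules in detail.

Finally, this article only built the most basic wheel. Our Jizhi column on Zhihu has a series of articles on writing a deep learning framework by hand in Matlab. Link: "matDL framework development live, part 2: implementing and optimizing the fully connected layer". Everyone is welcome to follow along.

Tencent Cloud GPU servers are currently still in closed beta. Readers who have not yet been granted beta access can run the code in this article on an ordinary cloud server. Starting from Part 3, however, we will gradually begin to use the Tensorflow framework to analyze data, and the amount of computation grows substantially, so a cloud GPU server will be needed to get results quickly. How to rent a server and set up the Python programming environment will be covered in detail in upcoming articles, using Tencent Cloud GPU as the example.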
