Introduction
Neural networks are usually optimized with first-order, gradient-based methods such as gradient descent. (As an aside, see an earlier post, http://www.cnblogs.com/simplex/p/6777705.html, which optimizes a very simple neural network with the downhill simplex method and therefore needs no gradients at all.) To compute the first derivatives with respect to the variables being optimized, the most common method is the back propagation (BP) algorithm. BP can be viewed as a special case of automatic differentiation (AD). However, many tutorials on applying BP to neural networks first derive the analytic expression of every parameter's derivative by hand and then write code to evaluate those expressions. This makes the derivation tedious and also obscures the point of BP, since one important goal of BP (and of automatic differentiation in general) is precisely to avoid deriving complicated analytic derivative expressions by hand. This article introduces automatic differentiation, covering the relatively simple forward-mode algorithm and the reverse-mode algorithm built on a computation graph (i.e., the BP algorithm), together with Python implementations of both.
A brief introduction to automatic differentiation
Automatic differentiation (AD) computes the value of a function's derivative at a given point. It can be regarded as sitting between symbolic differentiation and numerical differentiation, and its core idea is the chain rule of differentiation:
Suppose \(\mathbf{y}=v_n(v_{n-1}(v_{n-2}\cdots( v_1(\mathbf{x}))))\). Then the derivative of \(\mathbf{y}\) with respect to \(\mathbf{x}\), \(\frac{d\mathbf{y}}{d\mathbf{x}}\), is:
\(\frac{d\mathbf{y}}{d\mathbf{x}}=\frac{d\mathbf{y}}{dv_n}\frac{dv_n}{d\mathbf{x}}=\frac{d\mathbf{y}}{dv_n}\frac{dv_n}{dv_{n-1}}\frac{dv_{n-1}}{d\mathbf{x}}=\frac{d\mathbf{y}}{dv_{n}}\frac{dv_n}{dv_{n-1}}\cdots \frac{dv_{2}}{dv_1}\frac{dv_1}{d\mathbf{x}}\)
According to the chain rule, to compute \(\frac{d\mathbf{y}}{d\mathbf{x}}\) we can evaluate the factors in the expression above one by one and then multiply them together to obtain the final result. There are two ways to do this:
- Forward mode (also called tangent-linear mode): accumulate the product above from right to left, starting with \(\frac{dv_1}{d\mathbf{x}}\) (the seed). This order coincides with the order of function evaluation, so the function values needed for differentiation can be computed together with the corresponding derivatives.
- Reverse mode (also called adjoint mode): accumulate the product above from left to right, starting with \(\frac{d\mathbf{y}}{dv_{n}}\) (the seed). This order is the opposite of the order of function evaluation, so the function values needed for differentiation cannot be obtained in the same sweep as the derivatives; the function values and the derivative values must be computed in separate passes.
Taking one step of the chain rule, \(\frac{dv_{i+1}}{dv_{i-1}}=\frac{dv_{i+1}}{dv_{i}}\frac{dv_{i}}{dv_{i-1}}\), as an example, the difference between the forward and reverse modes is (a short worked example follows this list):
- In forward mode, \(\frac{dv_{i}}{dv_{i-1}}\) is already known and \(\frac{dv_{i+1}}{dv_{i}}\) is what needs to be computed; that is, the derivative of the independent variable (input) of the function being differentiated is known.
- In reverse mode, \(\frac{dv_{i+1}}{dv_{i}}\) is already known and \(\frac{dv_{i}}{dv_{i-1}}\) is what needs to be computed; that is, the derivative of the dependent variable (output) of the function being differentiated is known.
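As a concrete illustration (this small example is added here for clarity and is not one of the test cases below), take \(y=\sin(x^2)\) with the intermediate variable \(v_1=x^2\). Forward mode starts from the input side,
$$\frac{dv_1}{dx}=2x,\qquad \frac{dy}{dx}=\frac{dy}{dv_1}\frac{dv_1}{dx}=\cos(v_1)\cdot 2x,$$
while reverse mode starts from the output side,
$$\bar{v}_1=\frac{dy}{dv_1}=\cos(v_1),\qquad \frac{dy}{dx}=\bar{v}_1\,\frac{dv_1}{dx}=\cos(x^2)\cdot 2x.$$
Both accumulate the same product \(2x\cos(x^2)\), only in opposite orders.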
Note that, by multivariate calculus, the derivative of a function \(f\) from an \(M\)-dimensional space to an \(N\)-dimensional space is an \(N\times M\) linear operator (the Jacobian matrix). In the multidimensional case the derivative products above are therefore matrix products, which is also the origin of the rule, mentioned in many back propagation tutorials, that the derivative at a node must be summed over all paths.
The two modes of automatic differentiation suit different situations. In the multivariate setting, let \(\mathbf{x}\) and \(\mathbf{y}\) be \(M\)- and \(N\)-dimensional variables respectively. When \(M\ll N\), forward mode is the better fit: each forward pass computes the derivatives \(d\mathbf{\cdot}/dx_i\) of all relevant nodes with respect to a single input component \(x_i\), so only \(M\) passes are needed to obtain the full Jacobian matrix. Conversely, when \(M\gg N\), reverse mode is preferable: each reverse pass computes the derivatives \(dy_i/d\mathbf{\cdot}\) of one output \(y_i(\mathbf{x})\) with respect to all relevant nodes, so the full Jacobian requires \(N\) passes. In practice the dimension of the input \(\mathbf{x}\) is often far larger than that of the output \(\mathbf{y}\) (for example, in physics problems \(\mathbf{x}\) typically represents a discretized computational grid, and the same holds for neural network applications), which is why reverse mode is the more commonly used one.
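In matrix terms (a short derivation added here for clarity): writing \(J=\frac{d\mathbf{y}}{d\mathbf{x}}\in\mathbb{R}^{N\times M}\), one forward pass seeded with the unit vector \(\mathbf{e}_i\in\mathbb{R}^{M}\) produces the \(i\)-th column of the Jacobian,
$$J\,\mathbf{e}_i\in\mathbb{R}^{N},$$
while one reverse pass seeded with \(\mathbf{e}_j\in\mathbb{R}^{N}\) produces the \(j\)-th row,
$$\mathbf{e}_j^{T}J\in\mathbb{R}^{1\times M},$$
so assembling the full Jacobian costs \(M\) forward passes or \(N\) reverse passes.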
A detailed tutorial on automatic differentiation can be found in reference [1], which covers everything from the underlying theory to the implementation.
Python implementation
The description of forward and reverse mode in the previous section is convenient for a conceptual understanding; this section first gives an equivalent description that maps more directly onto code. Suppose \(\mathbf{y}=f(\mathbf{x})\). Introduce two auxiliary variables \(s\) and \(t\), and define:
\(\mathbf{x}^{(1)}=\frac{\partial \mathbf{x}}{\partial s}\)
\(\mathbf{y}^{(1)}=\frac{\partial \mathbf{y}}{\partial s}\)
and
\(\mathbf{x}_{(1)}=\frac{\partial t}{\partial \mathbf{x}}\)
\(\mathbf{y}_{(1)}=\frac{\partial t}{\partial \mathbf{y}}\)
With these definitions, forward-mode differentiation can be stated as: given \(\mathbf{x}^{(1)}\), compute \(\mathbf{y}^{(1)}\) via \(\mathbf{y}^{(1)}=\frac{d\mathbf{y}}{d\mathbf{x}}\mathbf{x}^{(1)}\); reverse-mode differentiation is: given \(\mathbf{y}_{(1)}\), compute \(\mathbf{x}_{(1)}\) via \(\mathbf{x}_{(1)}=\mathbf{y}_{(1)}\frac{d\mathbf{y}}{d\mathbf{x}}\). Both modes require the derivative \(\frac{d\mathbf{y}}{d\mathbf{x}}\); the difference is that one starts from a known derivative of \(\mathbf{x}\) while the other starts from a known derivative of \(\mathbf{y}\). In terms of the directed graph \(s\)->\(\mathbf{x}\)->\(\mathbf{y}\)->\(t\): once the derivative of the \(\mathbf{x}\)->\(\mathbf{y}\) part is available, forward mode evaluates the derivative along \(s\)->\(\mathbf{x}\)->\(\mathbf{y}\), while reverse mode evaluates the derivative along \(\mathbf{x}\)->\(\mathbf{y}\)->\(t\).
Furthermore, setting \(s=\mathbf{x}\) and \(t=\mathbf{y}\) respectively, one sees that in forward mode \(\mathbf{x}^{(1)}=1\) and \(\mathbf{y}^{(1)}=\frac{d\mathbf{y}}{d\mathbf{x}}\) is the desired derivative, while in reverse mode \(\mathbf{y}_{(1)}=1\) and \(\mathbf{x}_{(1)}=\frac{d\mathbf{y}}{d\mathbf{x}}\) is the desired derivative. This shows up very clearly in the code below.
Forward mode
The cleanest way to implement forward-mode automatic differentiation is operator overloading: replace the original operators, which only compute function values, with operators that compute the function value and the derivative value together. The derivative then falls out of the ordinary evaluation of the function.
First, define the ADT class containing the overloaded operators:
```python
import numpy as np
from prettytable import PrettyTable


class ADT:
    """Forward-mode AD number: carries a value x together with its derivative dx."""

    def __init__(self, x, dx):
        self.x = x
        self.dx = dx

    def __str__(self):
        return "value:%s\nderivative:%s" % (self.x, self.dx)

    def __add__(self, other):
        if isinstance(other, ADT):
            x = self.x + other.x
            dx = self.dx + other.dx
        elif isinstance(other, (int, float)):
            x = self.x + other
            dx = self.dx
        else:
            return NotImplemented
        return ADT(x, dx)

    def __radd__(self, other):
        if isinstance(other, (int, float)):
            x = self.x + other
            dx = self.dx
        else:
            return NotImplemented
        return ADT(x, dx)

    def __iadd__(self, other):
        if isinstance(other, ADT):
            self.x += other.x
            self.dx += other.dx
        elif isinstance(other, (int, float)):
            self.x += other
        else:
            return NotImplemented
        return self

    def __sub__(self, other):
        if isinstance(other, ADT):
            x = self.x - other.x
            dx = self.dx - other.dx
        elif isinstance(other, (int, float)):
            x = self.x - other
            dx = self.dx
        else:
            return NotImplemented
        return ADT(x, dx)

    def __rsub__(self, other):
        if isinstance(other, (int, float)):
            x = other - self.x
            dx = -self.dx
        else:
            return NotImplemented
        return ADT(x, dx)

    def __isub__(self, other):
        if isinstance(other, ADT):
            self.x -= other.x
            self.dx -= other.dx
        elif isinstance(other, (int, float)):
            self.x -= other
        else:
            return NotImplemented
        return self

    def __mul__(self, other):
        if isinstance(other, ADT):
            x = self.x * other.x
            dx = self.x * other.dx + self.dx * other.x  # product rule
        elif isinstance(other, (int, float)):
            x = self.x * other
            dx = self.dx * other
        else:
            return NotImplemented
        return ADT(x, dx)

    def __rmul__(self, other):
        if isinstance(other, (int, float)):
            x = self.x * other
            dx = self.dx * other
        else:
            return NotImplemented
        return ADT(x, dx)

    def __imul__(self, other):
        if isinstance(other, ADT):
            # update dx first so the old value of self.x is used
            self.dx = self.x * other.dx + self.dx * other.x
            self.x *= other.x
        elif isinstance(other, (int, float)):
            self.x *= other
            self.dx *= other
        else:
            return NotImplemented
        return self

    def __truediv__(self, other):
        if isinstance(other, ADT):
            x = self.x / other.x
            dx = (self.dx * other.x - self.x * other.dx) / other.x**2  # quotient rule
        elif isinstance(other, (int, float)):
            x = self.x / other
            dx = self.dx / other
        else:
            return NotImplemented
        return ADT(x, dx)

    def __rtruediv__(self, other):
        if isinstance(other, (int, float)):
            x = other / self.x
            dx = -other / self.x**2 * self.dx
        else:
            return NotImplemented
        return ADT(x, dx)

    def __itruediv__(self, other):
        if isinstance(other, ADT):
            # update dx first so the old value of self.x is used
            self.dx = (self.dx * other.x - self.x * other.dx) / other.x**2
            self.x /= other.x
        elif isinstance(other, (int, float)):
            self.x /= other
            self.dx /= other
        else:
            return NotImplemented
        return self

    def __pow__(self, n):
        x = self.x**n
        dx = n * self.x**(n - 1) * self.dx
        return ADT(x, dx)

    # some frequently used functions
    def exp(this):
        x = np.exp(this.x)
        dx = this.dx * np.exp(this.x)
        return ADT(x, dx)

    def log(this):
        x = np.log(this.x)
        dx = 1 / this.x * this.dx
        return ADT(x, dx)

    def sin(this):
        x = np.sin(this.x)
        dx = this.dx * np.cos(this.x)
        return ADT(x, dx)

    def cos(this):
        x = np.cos(this.x)
        dx = -this.dx * np.sin(this.x)
        return ADT(x, dx)
```
Next, define a helper function for printing the results:
```python
def table_show(rows):
    table = PrettyTable()
    table.field_names = ["Method", "Value", "Derivative"]
    for row in rows:
        table.add_row(row)
    print('\n')
    print(table)
```
Finally, test the code:
```python
## one-dimensional example
x = 0.5
dx = 1
z = ADT(x, dx)
result = ADT.exp((z + 2)**2)

y_analytical = np.exp((x + 2)**2)
dy_analytical = 2 * (x + 2) * np.exp((x + 2)**2)

rows = [['auto diff', result.x, result.dx],
        ['analytical', y_analytical, dy_analytical]]
table_show(rows)
```
The result is:
```
+------------+---------------+---------------+
|   Method   |     Value     |   Derivative  |
+------------+---------------+---------------+
| auto diff  | 518.012824668 | 2590.06412334 |
| analytical | 518.012824668 | 2590.06412334 |
+------------+---------------+---------------+
```
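Since each forward pass carries derivatives with respect to a single seed, partial derivatives of a function of several variables are obtained by running one pass per input, seeding that input's dx with 1 and the others with 0. Below is a small sketch using the ADT class above; the function \(f(x_1,x_2)=x_1 x_2+\sin(x_1)\) and the input values are chosen here purely for illustration:

```python
x1_val, x2_val = 1.5, 0.5

# pass 1: seed x1 (dx=1 for x1, dx=0 for x2) -> df/dx1
a = ADT(x1_val, 1.0)
b = ADT(x2_val, 0.0)
f = a * b + ADT.sin(a)
print('df/dx1:', f.dx, 'expected:', x2_val + np.cos(x1_val))

# pass 2: seed x2 -> df/dx2
a = ADT(x1_val, 0.0)
b = ADT(x2_val, 1.0)
f = a * b + ADT.sin(a)
print('df/dx2:', f.dx, 'expected:', x1_val)
```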
Reverse mode
Reverse mode is more involved than forward mode, because the function values must first be computed in a forward sweep and the derivatives are then computed in a backward sweep. This is usually implemented with a computation graph: every intermediate node created during evaluation is pushed onto a stack in evaluation order, which yields the computation graph of the expression; the nodes are then popped one by one and the corresponding derivatives are accumulated.
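For intuition, here is the forward trace and the reverse sweep for the same function used in the forward-mode test, \(y=e^{(x+2)^2}\) (the intermediate names \(v_1,v_2,v_3\) and the adjoint notation \(\bar{v}_k=\frac{dy}{dv_k}\) are introduced here just for this illustration):
$$v_1=x+2,\qquad v_2=v_1^2,\qquad v_3=e^{v_2}=y,$$
$$\bar{v}_3=1,\qquad \bar{v}_2=\bar{v}_3\,e^{v_2},\qquad \bar{v}_1=\bar{v}_2\cdot 2v_1,\qquad \frac{dy}{dx}=\bar{v}_1\cdot 1=2(x+2)\,e^{(x+2)^2}.$$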
First, define the computation-graph class CGraph and the graph-node class Node. CGraph contains the forward-evaluation and reverse-differentiation routines, while Node contains the overloaded operators.
```python
import numpy as np
from enum import Enum
from prettytable import PrettyTable
import tensorflow as tf


class CGraph:
    """Computation graph: stores the nodes of an expression in evaluation order."""

    def __init__(self):
        self.Nodec = 0
        self.NodeList = []

    def connect(self):
        Node.cgraph = self
        return self

    def clear(self):
        self.Nodec = 0
        self.NodeList = []

    def disconnect(self):
        Node.cgraph = None
        return self

    def append(self, node):
        self.Nodec += 1
        self.NodeList.append(node)

    def compute_value(self):
        # forward sweep: evaluate every node in the order it was recorded
        for node in self.NodeList:
            if node.op is Node.operators.add:
                node.value = self.NodeList[node.arg1].value + self.NodeList[node.arg2].value
            elif node.op is Node.operators.sub:
                node.value = self.NodeList[node.arg1].value - self.NodeList[node.arg2].value
            elif node.op is Node.operators.mul:
                node.value = self.NodeList[node.arg1].value * self.NodeList[node.arg2].value
            elif node.op is Node.operators.truediv:
                node.value = self.NodeList[node.arg1].value / self.NodeList[node.arg2].value
            elif node.op is Node.operators.power:
                node.value = self.NodeList[node.arg1].value**self.NodeList[node.arg2].value
            elif node.op is Node.operators.sin:
                node.value = np.sin(self.NodeList[node.arg1].value)
            elif node.op is Node.operators.cos:
                node.value = np.cos(self.NodeList[node.arg1].value)
            elif node.op is Node.operators.log:
                node.value = np.log(self.NodeList[node.arg1].value)
            elif node.op is Node.operators.exp:
                node.value = np.exp(self.NodeList[node.arg1].value)

    def compute_gradient(self, seedID):
        # backward sweep: seed the chosen output node, then accumulate adjoints
        # while walking the node list in reverse order
        for node in self.NodeList:
            node.derivative = 0.0
        self.NodeList[seedID].derivative = 1.0
        for node in self.NodeList[-1::-1]:
            if node.op is Node.operators.add:
                self.NodeList[node.arg1].derivative += node.derivative
                self.NodeList[node.arg2].derivative += node.derivative
            elif node.op is Node.operators.sub:
                self.NodeList[node.arg1].derivative += node.derivative
                self.NodeList[node.arg2].derivative += -node.derivative
            elif node.op is Node.operators.mul:
                self.NodeList[node.arg1].derivative += node.derivative * self.NodeList[node.arg2].value
                self.NodeList[node.arg2].derivative += node.derivative * self.NodeList[node.arg1].value
            elif node.op is Node.operators.truediv:
                self.NodeList[node.arg1].derivative += node.derivative / self.NodeList[node.arg2].value
                self.NodeList[node.arg2].derivative += -node.derivative * self.NodeList[
                    node.arg1].value / (self.NodeList[node.arg2].value)**2
            elif node.op is Node.operators.power:
                # note: only the derivative w.r.t. the base is propagated;
                # the exponent is treated as a constant
                self.NodeList[node.arg1].derivative += node.derivative * self.NodeList[
                    node.arg2].value * self.NodeList[node.arg1].value**(
                        self.NodeList[node.arg2].value - 1)
            elif node.op is Node.operators.sin:
                self.NodeList[node.arg1].derivative += node.derivative * np.cos(
                    self.NodeList[node.arg1].value)
            elif node.op is Node.operators.cos:
                self.NodeList[node.arg1].derivative += -node.derivative * np.sin(
                    self.NodeList[node.arg1].value)
            elif node.op is Node.operators.log:
                self.NodeList[node.arg1].derivative += node.derivative / self.NodeList[node.arg1].value
            elif node.op is Node.operators.exp:
                self.NodeList[node.arg1].derivative += node.derivative * np.exp(
                    self.NodeList[node.arg1].value)


class Node:
    cgraph = None
    operators = Enum('operators', ('add', 'sub', 'mul', 'truediv', 'power',
                                   'sin', 'cos', 'log', 'exp'))

    def __init__(self, value=np.nan, derivative=0.0, op=None,
                 arg1=None, arg2=None, name=None):
        self.value = value
        self.derivative = derivative
        self.op = op
        self.arg1 = arg1
        self.arg2 = arg2
        self.name = name
        # if a graph is connected, allocate an ID for every newly created Node
        # and add it to the graph
        if Node.cgraph is not None:
            self.ID = Node.allocate_ID()
            Node.cgraph.append(self)

    def __str__(self):
        return '\nvalue:%s\nderivative:%s\nop:%s\narg1:%s\narg2:%s\nname:%s' % (
            self.value, self.derivative, self.op, self.arg1, self.arg2, self.name)

    @classmethod
    def allocate_ID(cls):
        if cls.cgraph is not None:
            return cls.cgraph.Nodec
        else:
            return None

    @classmethod
    def tome(cls, x):
        if isinstance(x, cls):
            return x
        else:
            # x is always a constant
            return cls(x, op='constant')

    def __add__(self, other):
        other = Node.tome(other)
        return Node(self.value + other.value, op=Node.operators.add,
                    arg1=self.ID, arg2=other.ID)

    def __sub__(self, other):
        other = Node.tome(other)
        return Node(self.value - other.value, op=Node.operators.sub,
                    arg1=self.ID, arg2=other.ID)

    def __mul__(self, other):
        other = Node.tome(other)
        return Node(self.value * other.value, op=Node.operators.mul,
                    arg1=self.ID, arg2=other.ID)

    def __truediv__(self, other):
        other = Node.tome(other)
        return Node(self.value / other.value, op=Node.operators.truediv,
                    arg1=self.ID, arg2=other.ID)

    def __radd__(self, other):
        other = Node.tome(other)
        return Node(other.value + self.value, op=Node.operators.add,
                    arg1=other.ID, arg2=self.ID)

    def __rsub__(self, other):
        other = Node.tome(other)
        return Node(other.value - self.value, op=Node.operators.sub,
                    arg1=other.ID, arg2=self.ID)

    def __rmul__(self, other):
        other = Node.tome(other)
        return Node(other.value * self.value, op=Node.operators.mul,
                    arg1=other.ID, arg2=self.ID)

    def __rtruediv__(self, other):
        other = Node.tome(other)
        return Node(other.value / self.value, op=Node.operators.truediv,
                    arg1=other.ID, arg2=self.ID)

    def __pow__(self, other):
        other = Node.tome(other)
        return Node(self.value**other.value, op=Node.operators.power,
                    arg1=self.ID, arg2=other.ID)

    def sin(self):
        return Node(np.sin(self.value), op=Node.operators.sin, arg1=self.ID)

    def cos(self):
        return Node(np.cos(self.value), op=Node.operators.cos, arg1=self.ID)

    def log(self):
        return Node(np.log(self.value), op=Node.operators.log, arg1=self.ID)

    def exp(self):
        return Node(np.exp(self.value), op=Node.operators.exp, arg1=self.ID)
```
Now use these classes to compute derivatives. Note that the input variables are not assigned values when they are created; the values are assigned only after the computation graph has been built, and compute_value is then called to perform the forward evaluation. Of course, if the input values are already known, they can be passed in at construction time, in which case the values of all intermediate nodes are computed automatically and there is no need to call compute_value.
```python
graph = CGraph()
graph.connect()

x1 = Node()
x2 = Node()
x3 = Node()
x4 = Node()
y1 = Node.sin(x1 * x2) + Node.exp(x1 / x2) + x3**2 - x4**3
y2 = x3 * x4

# assign a value to each input
x1.value = 1.234
x2.value = 2.345
x3.value = 3.456
x4.value = 4.567

# forward pass: compute the function values
graph.compute_value()

# backward passes: compute the derivatives (one pass per output)
result = []
graph.compute_gradient(y1.ID)
result.append(x1.derivative)
result.append(x2.derivative)
result.append(x3.derivative)
result.append(x4.derivative)
graph.compute_gradient(y2.ID)
result.append(x3.derivative)
result.append(x4.derivative)
```
Next, compare the results against TensorFlow's gradient computation:
```python
xx1 = tf.Variable(x1.value, dtype=tf.float32)
xx2 = tf.Variable(x2.value, dtype=tf.float32)
xx3 = tf.Variable(x3.value, dtype=tf.float32)
xx4 = tf.Variable(x4.value, dtype=tf.float32)
yy1 = tf.sin(xx1 * xx2) + tf.exp(xx1 / xx2) + xx3**2 - xx4**3
yy2 = xx3 * xx4

dyy1dxx1 = tf.gradients(yy1, xx1)
dyy1dxx2 = tf.gradients(yy1, xx2)
dyy1dxx3 = tf.gradients(yy1, xx3)
dyy1dxx4 = tf.gradients(yy1, xx4)
dyy2dxx3 = tf.gradients(yy2, xx3)
dyy2dxx4 = tf.gradients(yy2, xx4)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
result_tf = sess.run(
    [dyy1dxx1, dyy1dxx2, dyy1dxx3, dyy1dxx4, dyy2dxx3, dyy2dxx4])
result_tf = [a[0] for a in result_tf]
```
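The snippet above relies on the TensorFlow 1.x graph API (tf.gradients, tf.Session). For reference, a rough equivalent under the TensorFlow 2.x eager API would look like the following sketch (assuming a TensorFlow 2.x installation; this variant is not part of the comparison above):

```python
# rough TF 2.x equivalent (assumption: TensorFlow 2.x is installed)
import tensorflow as tf

xx1 = tf.Variable(1.234)
xx2 = tf.Variable(2.345)
xx3 = tf.Variable(3.456)
xx4 = tf.Variable(4.567)

# record the computation on a tape, then query gradients of each output
with tf.GradientTape(persistent=True) as tape:
    yy1 = tf.sin(xx1 * xx2) + tf.exp(xx1 / xx2) + xx3**2 - xx4**3
    yy2 = xx3 * xx4

grads_y1 = tape.gradient(yy1, [xx1, xx2, xx3, xx4])  # dy1/dx1 ... dy1/dx4
grads_y2 = tape.gradient(yy2, [xx3, xx4])            # dy2/dx3, dy2/dx4
```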
Finally, print the two sets of results side by side:
```python
table = PrettyTable()
table.field_names = ["Method", "dy1/dx1", "dy1/dx2", "dy1/dx3",
                     "dy1/dx4", "dy2/dx3", "dy2/dx4"]
row1 = ['this code']
row1.extend(result)
row2 = ['tensorflow']
row2.extend(result_tf)
table.add_row(row1)
table.add_row(row2)
print('\n')
print(table)
```
The result is:
```
+------------+----------------+---------------+---------+------------+---------+---------+
|   Method   |    dy1/dx1     |    dy1/dx2    | dy1/dx3 |  dy1/dx4   | dy2/dx3 | dy2/dx4 |
+------------+----------------+---------------+---------+------------+---------+---------+
| this code  | -1.55157212465 | -1.5760978298 |  6.912  | -62.572467 |  4.567  |  3.456  |
| tensorflow |    -1.55157    |    -1.5761    |  6.912  |  -62.5725  |  4.567  |  3.456  |
+------------+----------------+---------------+---------+------------+---------+---------+
```
References
[1] The Art of Differentiating Computer Programs: An Introduction to Algorithmic Differentiation
[2] Deep Learning
Appendix
Suppose \(\mathbf{z}=g(\mathbf{y})\) and \(\mathbf{y}=f(\mathbf{x})\), where \(\mathbf{x}\), \(\mathbf{y}\), \(\mathbf{z}\) are \(M\)-, \(N\)-, and \(P\)-dimensional variables respectively. Then \(\frac{d\mathbf{y}}{d\mathbf{x}}\), \(\frac{d\mathbf{z}}{d\mathbf{y}}\), \(\frac{d\mathbf{z}}{d\mathbf{x}}\) are matrices of size \(N\times M\), \(P\times N\), and \(P\times M\) respectively, and \(\frac{d\mathbf{z}}{d\mathbf{x}}=\frac{d\mathbf{z}}{d\mathbf{y}}\frac{d\mathbf{y}}{d\mathbf{x}}\). In component form:
$$\frac{\partial z_i}{\partial x_j}=\sum_{k=1}^{N}\frac{\partial z_i}{\partial y_k}\frac{\partial y_k}{\partial x_j}$$
In the computation-graph picture, the sum in the formula above can be read as a sum over all distinct paths from \(x_j\) to \(z_i\) (i.e., paths that reach \(z_i\) through different \(y_k\)). The derivative accumulation in compute_gradient above (the `+=` statements) implements exactly this path sum; a small check follows.
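As a quick sanity check (an example added here, using the reverse-mode classes defined above): for \(y=x\sin(x)\) the input \(x\) reaches \(y\) along two paths, directly and through \(\sin(x)\), so the accumulated adjoint must equal \(\sin(x)+x\cos(x)\).

```python
# two-path example: x contributes to y directly and through sin(x)
graph = CGraph()
graph.connect()

x = Node()
y = x * Node.sin(x)

x.value = 0.8
graph.compute_value()
graph.compute_gradient(y.ID)

# the += accumulation sums the two paths: sin(x) + x*cos(x)
print(x.derivative, np.sin(0.8) + 0.8 * np.cos(0.8))
```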