import numpy as np
from scipy import stats
import matplotlib.pyplot as plt


# Construct the training data
x = np.arange(0., 10., 0.2)
m = len(x)                          # number of training points
x0 = np.full(m, 1.0)
input_data = np.vstack([x0, x]).T   # treat the bias b as the first component of the weight vector
target_data = 2 * x + 5 + np.random.randn(m)


# Two termination conditions
loop_max = 10000   # maximum number of iterations (guards against an infinite loop)
epsilon = 1e-3

# Initialize the weights
np.random.seed(0)
w = np.random.randn(2)
#w = np.zeros(2)

alpha = 0.001      # step size (too large causes oscillation, too small slows convergence)
diff = 0.
error = np.zeros(2)
count = 0          # loop counter
finish = 0         # termination flag

# ------------------------- stochastic gradient descent -------------------------
'''
while count < loop_max:
    count += 1

    # Sweep through the training set, updating the weights as we go
    for i in range(m):
        diff = np.dot(w, input_data[i]) - target_data[i]   # error on this training sample

        # Stochastic gradient descent: each update uses only one training sample
        w = w - alpha * diff * input_data[i]

        # If not terminated, keep reading samples; once all samples have been used,
        # the loop starts over from the beginning of the training set.

    # ------------------------------ termination check ------------------------------
    # Note: there are several possible termination conditions and places to test them.
    # The check can go after a single weight update or after m updates.
    if np.linalg.norm(w - error) < epsilon:   # stop when consecutive weight vectors are sufficiently close
        finish = 1
        break
    else:
        error = w
    print('loop count = %d' % count, '\tw:[%f, %f]' % (w[0], w[1]))
'''


# ----------------------------- batch gradient descent -----------------------------
while count < loop_max:
    count += 1

    # Standard (batch) gradient descent sums the error over all samples before updating
    # the weights, whereas stochastic gradient descent updates the weights from a single
    # training sample. Each batch update therefore sums over many samples and needs more
    # computation per step.
    sum_m = np.zeros(2)
    for i in range(m):
        dif = (np.dot(w, input_data[i]) - target_data[i]) * input_data[i]
        sum_m = sum_m + dif   # if alpha is too large, sum_m overflows during the iteration

    w = w - alpha * sum_m     # mind the step size alpha: too large a value causes oscillation
    #w = w - 0.005 * sum_m    # alpha = 0.005 oscillates; alpha has to be reduced

    # Check for convergence
    if np.linalg.norm(w - error) < epsilon:
        finish = 1
        break
    else:
        error = w
    print('loop count = %d' % count, '\tw:[%f, %f]' % (w[0], w[1]))

# Check against scipy's linear regression
slope, intercept, r_value, p_value, slope_std_error = stats.linregress(x, target_data)
print('intercept = %s slope = %s' % (intercept, slope))

plt.plot(x, target_data, 'k+')
plt.plot(x, w[1] * x + w[0], 'r')
plt.show()
The Learning Rate
An important consideration is the learning rate µ (alpha in the listing above), which determines by how much we change the weights w at each step. If µ is too small, the algorithm will take a long time to converge.
Conversely, if µ is too large, we may end up bouncing around the error surface out of control: the algorithm diverges. This usually ends with an overflow error in the computer's floating-point arithmetic.
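To make this concrete, here is a minimal sketch that reruns the summed (batch) update from the listing above with three different step sizes on the same synthetic data; the particular alpha values and the 500-step budget are illustrative choices, not tuned settings.

# Minimal sketch: rerun the batch update with three step sizes to see slow
# convergence, healthy convergence, and divergence. The alpha values and the
# 500-step budget are illustrative, not tuned.
import numpy as np

x = np.arange(0., 10., 0.2)
m = len(x)
input_data = np.vstack([np.full(m, 1.0), x]).T
target_data = 2 * x + 5 + np.random.randn(m)

for alpha in (0.0001, 0.001, 0.01):              # too small, reasonable, too large
    w = np.zeros(2)
    for step in range(1, 501):
        grad = input_data.T.dot(input_data.dot(w) - target_data)   # summed gradient
        w = w - alpha * grad
        if not np.all(np.isfinite(w)):            # overflow / NaN: the iteration diverged
            print('alpha = %g diverged after %d steps' % (alpha, step))
            break
    else:
        print('alpha = %g -> w = [%f, %f] after %d steps' % (alpha, w[0], w[1], step))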
Batch vs. Online Learning
Above we have accumulated the gradient contributions for all data points in the training set before updating the weights. This method is often referred to as batch learning. An alternative approach is online learning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradient, this is also called stochastic gradient descent.
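The commented-out block in the listing above is exactly this online update. The sketch below is a minimal runnable version of it on the same synthetic data, with the common extra step of shuffling the sample order on each pass; the step size and the number of passes are illustrative.

# Minimal sketch of the online (stochastic) update: the weights change after
# every single data point rather than after a full pass over the training set.
# The step size and the number of passes are illustrative choices.
import numpy as np

x = np.arange(0., 10., 0.2)
m = len(x)
input_data = np.vstack([np.full(m, 1.0), x]).T
target_data = 2 * x + 5 + np.random.randn(m)

alpha = 0.001
w = np.zeros(2)
for epoch in range(100):                       # repeated passes over the data
    for i in np.random.permutation(m):         # shuffling reduces ordering bias
        diff = np.dot(w, input_data[i]) - target_data[i]   # error on one sample
        w = w - alpha * diff * input_data[i]                # immediate update
print('online estimate: w = [%f, %f]' % (w[0], w[1]))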
Online learning has a number of advantages:
- it is often much faster, especially when the training set is redundant (contains many similar data points),
- it can be used when there is no fixed training set (new data keeps coming in),
- it is better at tracking nonstationary environments (where the best model gradually changes over time),
- the noise in the gradient can help to escape from local minima (which are a problem for gradient descent in nonlinear models).
These advantages, however, come at a price: many powerful optimization techniques (such as conjugate-gradient and second-order methods, support vector machines, Bayesian methods, etc.) are batch methods that cannot be used online. A compromise between batch and online learning is the use of "mini-batches": the weights are updated after every n data points, where n is greater than 1 but smaller than the training set size, as sketched below.
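Here is a minimal sketch of the mini-batch variant on the same synthetic data; the batch size n, the step size, and the number of passes are illustrative choices.

# Minimal sketch of mini-batch gradient descent: the weights are updated after
# every n data points, with 1 < n < m. The values of n, alpha, and the number
# of passes are illustrative choices.
import numpy as np

x = np.arange(0., 10., 0.2)
m = len(x)
input_data = np.vstack([np.full(m, 1.0), x]).T
target_data = 2 * x + 5 + np.random.randn(m)

alpha = 0.001
n = 10                                          # mini-batch size
w = np.zeros(2)
for epoch in range(500):
    order = np.random.permutation(m)            # reshuffle each pass
    for start in range(0, m, n):
        idx = order[start:start + n]
        batch_x, batch_y = input_data[idx], target_data[idx]
        grad = batch_x.T.dot(batch_x.dot(w) - batch_y)   # gradient summed over the mini-batch
        w = w - alpha * grad
print('mini-batch estimate: w = [%f, %f]' % (w[0], w[1]))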
References:
http://www.tuicool.com/articles/MRbee2i
https://www.willamette.edu/~gorr/classes/cs449/linear2.html
http://www.bogotobogo.com/python/python_numpy_batch_gradient_descent_algorithm.php