Line Search and Quasi-Newton Methods

Reposted article, 2016-05-22 18:30. Tags: Quasi-Newton Methods / Machine Learning / Line Search

Gradient Descent

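Parameter estimation for many machine learning models relies on optimization algorithms, and gradient descent is among the simplest and most widely used of them. Gradient descent [3], also known as steepest descent, finds a local minimum of a function. The idea is that a function decreases fastest along the negative gradient: moving a sufficiently small distance from the current point in the direction opposite to the gradient yields a new point whose function value is guaranteed not to increase, as shown in Figure 1.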

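Mathematically, the idea of gradient descent reads:

$$b = a - \alpha \nabla f(a) \Rightarrow f(a) \ge f(b) \tag{1}$$

where $a$ is the current point and $\alpha > 0$ is a sufficiently small step size. Starting from an initial point $x_0$ and repeating this update gives

$$x_{k+1} = x_k - \alpha_k \nabla f(x_k), \quad 0 \le k \le n \tag{2}$$

so that

$$f(x_0) \ge f(x_1) \ge f(x_2) \ge \dots \ge f(x_n) \tag{3}$$

More generally, $d_k$ is a descent direction if a sufficiently small step along it decreases the function:

$$f(x_k + \alpha d_k) < f(x_k) \tag{4}$$

Such directions can be written as

$$d_k = -B_k \nabla f(x_k) \tag{5}$$

with $B_k$ a symmetric positive definite matrix; $B_k = I$ recovers gradient descent.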
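As a minimal sketch of iteration (2), assuming a fixed step size `alpha` and a small quadratic objective (both chosen here purely for illustration, as are the names `gradient_descent`, `f` and `grad_f`), the loop below records the function values so that the monotone sequence in (3) can be inspected:

```python
def gradient_descent(grad, x0, alpha, n_iter):
    """Run x_{k+1} = x_k - alpha * grad(x_k) and return every iterate."""
    xs = [list(x0)]
    for _ in range(n_iter):
        g = grad(xs[-1])
        xs.append([xi - alpha * gi for xi, gi in zip(xs[-1], g)])
    return xs


# Illustrative objective (an assumption, not from the article):
# f(x) = x_0^2 + 10 * x_1^2, minimized at the origin.
def f(x):
    return x[0] ** 2 + 10.0 * x[1] ** 2


def grad_f(x):
    return [2.0 * x[0], 20.0 * x[1]]


iterates = gradient_descent(grad_f, x0=[1.0, 1.0], alpha=0.05, n_iter=50)
values = [f(x) for x in iterates]  # f(x_0), f(x_1), ..., f(x_50)
```

With a step this small the values never increase; a larger fixed step (for example `alpha=0.11` here) would break that guarantee, which is the motivation for choosing the step by a line search.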

Line Search

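Given a search direction $d_k$, a line search chooses the step size $\alpha$ along it, ideally by minimizing $h(\alpha) = f(x_k + \alpha d_k)$:

$$\alpha = \arg\min_{\alpha \ge 0} h(\alpha) = \arg\min_{\alpha \ge 0} f(x_k + \alpha d_k) \tag{6}$$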
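To make the line-search subproblem concrete, the sketch below minimizes $h(\alpha) = f(x_k + \alpha d_k)$ for a small quadratic, once with the closed-form minimizer that exists for quadratics and once by brute force on a grid. The objective, the grid and all helper names are assumptions made for illustration only.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Illustrative quadratic (an assumption): f(x) = 0.5 * x^T A x with A = diag(2, 20).
A_DIAG = [2.0, 20.0]


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))


def grad(x):
    return [a * xi for a, xi in zip(A_DIAG, x)]


def h(alpha, x, d):
    """h(alpha) = f(x + alpha * d)."""
    return f([xi + alpha * di for xi, di in zip(x, d)])


x_k = [1.0, 1.0]
g_k = grad(x_k)
d_k = [-gi for gi in g_k]  # steepest-descent direction

# For a quadratic, h'(alpha) = 0 gives alpha = -(g^T d) / (d^T A d).
Ad = [a * di for a, di in zip(A_DIAG, d_k)]
alpha_exact = -dot(g_k, d_k) / dot(d_k, Ad)

# Brute-force check over candidate steps in [0, 0.2] with spacing 1e-4.
grid = [i * 1e-4 for i in range(2001)]
alpha_grid = min(grid, key=lambda a: h(a, x_k, d_k))
```

For general objectives no closed form is available and a grid is far too expensive, so the methods in the following sections look for a step that is merely good enough.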

Bisection Search

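Bisection line search [2] finds a root of a function. The idea is simple: repeatedly split the current interval in half and keep the half that must contain a point where $h'(\alpha) = 0$. Starting from an initial interval of length $\hat{\alpha}$, the interval length after $n$ halvings is

$$L = \left(\frac{1}{2}\right)^n \hat{\alpha} \tag{7}$$

so the number of halvings $k$ needed to reach a tolerance $\epsilon$ satisfies

$$L \le \epsilon \Rightarrow k \le \left\lceil \log_2\left(\frac{\hat{\alpha}}{\epsilon}\right) \right\rceil \tag{8}$$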
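As a quick arithmetic check of (7) and (8), the sketch below computes the bound and compares it with a literal halving loop. The function names are hypothetical and the rounding is assumed to be a ceiling.

```python
import math


def halvings_needed(alpha_hat, eps):
    """Smallest n with (1/2)**n * alpha_hat <= eps, from the bound in (8)."""
    return max(0, math.ceil(math.log2(alpha_hat / eps)))


def halvings_by_loop(alpha_hat, eps):
    """Count halvings directly, mirroring (7)."""
    n, length = 0, alpha_hat
    while length > eps:
        length /= 2.0
        n += 1
    return n


n_bound = halvings_needed(1.0, 1e-6)
n_loop = halvings_by_loop(1.0, 1e-6)
```

For an initial interval of length 1 and a tolerance of 1e-6, both give 20 halvings, since 2^-20 is the first power of one half below 1e-6.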

import numpy as np

def bisection(dfun,theta,args,d,low,high,maxiter=1e4):
    """
    #Functionality:find a root of h'(a)=f'(theta+a*d)^T d in the interval [low,high]
    #@Parameters
    #dfun:compute the gradient of function f(x)
    #theta:Parameters of the model
    #args:other variables needed to compute the value of dfun
    #d:search direction
    #[low,high]:the interval which contains the root
    #maxiter:the max number of iterations
    """
    eps=1e-6
    #np.sum(g*d.T) is the directional derivative g^T d
    #(valid for 1-D arrays and for np.matrix row vectors)
    val_low=np.sum(dfun(theta+low*d,args)*d.T)
    val_high=np.sum(dfun(theta+high*d,args)*d.T)
    if val_low*val_high>0:
        raise Exception('Invalid interval!')
    iter_num=1
    while iter_num<maxiter:
        mid=(low+high)/2.0
        val_mid=np.sum(dfun(theta+mid*d,args)*d.T)
        if abs(val_mid)<eps or abs(high-low)<eps:
            return mid
        elif val_mid*val_low>0: #root lies in the right half
            low=mid
        else: #root lies in the left half
            high=mid
        iter_num+=1
    return (low+high)/2.0 #iteration budget exhausted

Backtracking

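Backtracking line search [1] uses the Armijo condition to find the largest acceptable step size along the search direction. The idea is to start from a relatively large step-size estimate along the search direction and shrink it iteratively until the function value $f(x_k + \alpha d_k)$ has decreased sufficiently relative to $f(x_k)$, that is, until the Armijo condition holds:

$$f(x_k + \alpha d_k) \le f(x_k) + c_1 \alpha f'(x_k)^T d_k \tag{9}$$

where $0 < c_1 < 1$. Write $h(\alpha) = f(x_k + \alpha d_k)$. Because $d_k$ is a descent direction,

$$h'(0) < c_1 h'(0) < 0 \tag{10}$$

and by the definition of the derivative

$$h'(0) = \lim_{\alpha \to 0} \frac{h(\alpha) - h(0)}{\alpha} = \lim_{\alpha \to 0} \frac{f(x_k + \alpha d_k) - f(x_k)}{\alpha} = f'(x_k)^T d_k \tag{11}$$

Hence for all sufficiently small $\alpha > 0$

$$\frac{f(x_k + \alpha d_k) - f(x_k)}{\alpha} < c_1 f'(x_k)^T d_k \tag{12}$$

which is exactly (9), so the backtracking loop is guaranteed to terminate.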

def ArmijoBacktrack(fun,dfun,theta,args,d,stepsize=1,tau=0.5,c1=1e-3):
    """
    #Functionality:find an acceptable stepsize via backtrack under Armijo rule
    #@Parameters
    #fun:compute the value of objective function
    #dfun:compute the gradient of objective function
    #theta:a vector of parameters of the model
    #args:other variables needed for fun and dfun
    #d:search direction
    #stepsize:initial step size
    #c1:sufficient decrease parameter
    #tau:rate of shrink of stepsize
    """
    slope=np.sum(dfun(theta,args)*d.T) #h'(0)=f'(x)^Td
    obj_old=fun(theta,args)
    theta_new=theta+stepsize*d
    obj_new=fun(theta_new,args)
    while obj_new>obj_old+c1*stepsize*slope: #Armijo condition violated
        stepsize*=tau
        theta_new=theta+stepsize*d
        obj_new=fun(theta_new,args)
    return stepsize
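The following self-contained sketch traces the backtracking loop on a one-dimensional example so the sequence of rejected steps is visible. The objective `f(x) = x**4`, the starting point and the helper name `backtrack` are assumptions chosen so the arithmetic is exact; the condition tested is the Armijo condition (9).

```python
def backtrack(f, slope, x, d, alpha=1.0, tau=0.5, c1=1e-3):
    """Shrink alpha by tau until f(x + alpha*d) <= f(x) + c1*alpha*slope.

    slope is h'(0) = f'(x) * d, negative for a descent direction.
    Returns the accepted step and the list of all trial steps.
    """
    fx = f(x)
    trials = [alpha]
    while f(x + alpha * d) > fx + c1 * alpha * slope:
        alpha *= tau
        trials.append(alpha)
    return alpha, trials


f = lambda x: x ** 4          # illustrative objective (assumption)
x = 1.0
g = 4.0 * x ** 3              # f'(1) = 4
d = -g                        # steepest-descent direction
step, trials = backtrack(f, g * d, x, d)
```

Here the trial steps 1 and 0.5 land at x = -3 and x = -1, where the function has not decreased enough, and the step 0.25 lands exactly on the minimizer x = 0 and is accepted.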

Interpolation

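The convergence speed of Armijo backtracking is not guaranteed, especially when many backtracking steps are needed before the step falls into the interval that satisfies the Armijo condition. If we instead fit a polynomial to the function values and derivatives already available (interpolation [12,6,5,9]) and estimate the minimizer from that polynomial, choosing a suitable step becomes much more efficient. Suppose we only have $h(0)$, $h'(0)$ and $h(\alpha_0)$. The quadratic that matches these three quantities is

$$h_q(\alpha) = \left(\frac{h(\alpha_0) - h(0) - \alpha_0 h'(0)}{\alpha_0^2}\right)\alpha^2 + h'(0)\alpha + h(0) \tag{13}$$

and its minimizer is

$$\alpha_1 = \frac{h'(0)\alpha_0^2}{2\left[h(0) + h'(0)\alpha_0 - h(\alpha_0)\right]} \tag{14}$$

If more trial steps are needed, a cubic can be fitted to $h(0)$, $h'(0)$ and the two most recent values $h(\alpha_{i-1})$ and $h(\alpha_i)$:

$$h_c(\alpha) = a\alpha^3 + b\alpha^2 + h'(0)\alpha + h(0) \tag{15}$$

with coefficients

$$\begin{bmatrix} a \\ b \end{bmatrix} = \frac{1}{\alpha_{i-1}^2 \alpha_i^2 (\alpha_i - \alpha_{i-1})} \begin{bmatrix} \alpha_{i-1}^2 & -\alpha_i^2 \\ -\alpha_{i-1}^3 & \alpha_i^3 \end{bmatrix} \begin{bmatrix} h(\alpha_i) - h(0) - h'(0)\alpha_i \\ h(\alpha_{i-1}) - h(0) - h'(0)\alpha_{i-1} \end{bmatrix} \tag{16}$$

The minimizer of the cubic is

$$\alpha_{i+1} = \frac{-b + \sqrt{b^2 - 3ah'(0)}}{3a} \tag{17}$$

When derivatives are cheap to evaluate, a cubic Hermite polynomial through $h(\alpha_{i-1})$, $h'(\alpha_{i-1})$, $h(\alpha_i)$ and $h'(\alpha_i)$ can be used instead:

$$\begin{aligned} H_3(\alpha) = {} & \left[1 + 2\frac{\alpha_i - \alpha}{\alpha_i - \alpha_{i-1}}\right]\left[\frac{\alpha - \alpha_{i-1}}{\alpha_i - \alpha_{i-1}}\right]^2 h(\alpha_i) + \left[1 + 2\frac{\alpha - \alpha_{i-1}}{\alpha_i - \alpha_{i-1}}\right]\left[\frac{\alpha_i - \alpha}{\alpha_i - \alpha_{i-1}}\right]^2 h(\alpha_{i-1}) \\ & + (\alpha - \alpha_i)\left[\frac{\alpha - \alpha_{i-1}}{\alpha_i - \alpha_{i-1}}\right]^2 h'(\alpha_i) + (\alpha - \alpha_{i-1})\left[\frac{\alpha_i - \alpha}{\alpha_i - \alpha_{i-1}}\right]^2 h'(\alpha_{i-1}) \end{aligned} \tag{18}$$

Its minimizer is

$$\alpha_{i+1} = \alpha_i - (\alpha_i - \alpha_{i-1})\left[\frac{h'(\alpha_i) + d_2 - d_1}{h'(\alpha_i) - h'(\alpha_{i-1}) + 2d_2}\right] \tag{19}$$

where

$$d_1 = h'(\alpha_i) + h'(\alpha_{i-1}) - 3\left[\frac{h(\alpha_i) - h(\alpha_{i-1})}{\alpha_i - \alpha_{i-1}}\right] \tag{20}$$

$$d_2 = \mathrm{sign}(\alpha_i - \alpha_{i-1})\sqrt{d_1^2 - h'(\alpha_{i-1})h'(\alpha_i)} \tag{21}$$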
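A simple sanity check for the quadratic interpolation step: when $h$ itself is a quadratic, the fitted model is exact, so a single interpolation step should land on the true minimizer. The sketch below assumes the illustrative function $h(\alpha) = (\alpha - 0.3)^2$ and a hypothetical helper name `quadratic_step`.

```python
def quadratic_step(alpha0, h_alpha0, h0, g0):
    """Minimizer of the quadratic that matches h(0), h'(0) and h(alpha0)."""
    return g0 * alpha0 ** 2 / (2.0 * (h0 + g0 * alpha0 - h_alpha0))


# Illustrative 1-D function (assumption): minimum at alpha = 0.3.
h = lambda a: (a - 0.3) ** 2
dh0 = -0.6                    # h'(0) = 2 * (0 - 0.3)

alpha1 = quadratic_step(1.0, h(1.0), h(0.0), dh0)
```

For a non-quadratic $h$ the step is only an estimate, which is why the line search re-checks the acceptance condition after every interpolation.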

def quadraticInterpolation(a,h,h0,g0):
    """
    #Functionality:Approximate h(a) with a quadratic function and return its stationary point
    #@Parameters
    #a:current stepsize
    #h:function value at the current stepsize,h(a)=f(x_k+a*d)
    #h0:h(0)=f(x_k)
    #g0:h'(0)=f'(x_k)^Td
    """
    numerator=g0*a**2
    denominator=2*(g0*a+h0-h)
    if abs(denominator)<1e-12:#degenerate fit,e.g. a is almost 0
        return a
    return numerator/denominator

def cubicInterpolation(a0,h0,a1,h1,h,g):
    """
    #Functionality:Approximate h(x) with a cubic function and return its stationary point
    #This version of cubic interpolation computes h'(x) as few times as possible,suitable for the case in which computing derivatives is more expensive than computing function values
    #@Parameters
    #a0 and a1 are the stepsizes of the previous two iterations
    #h0:h(a0)
    #h1:h(a1)
    #h:h(0)=f(x)
    #g:h'(0)
    """
    mat=np.array([[a0**2,-a1**2],[-a0**3,a1**3]])
    vec=np.array([h1-h-g*a1,h0-h-g*a0])
    a,b=mat.dot(vec)/(a0**2*a1**2*(a1-a0)) #cubic coefficients
    if abs(a)<1e-12:#a=0 and cubic function is a quadratic one
        return -g/(2*b)
    return (-b+np.sqrt(b**2-3*a*g))/(3*a)

def cubicInterpolationHermite(a0,h0,g0,a1,h1,g1):
    """
    #Functionality:Approximate h(a) with a cubic Hermite polynomial function and return its stationary point
    #This version of cubic interpolation computes h(a) as few times as possible,suitable for the case in which computing derivatives is easier than computing function values
    #@Parameters
    #a0 and a1 are the stepsizes of the previous two iterations
    #h0:h(a0)
    #g0:h'(a0)
    #h1:h(a1)
    #g1:h'(a1)
    """
    d1=g0+g1-3*(h1-h0)/(a1-a0)
    d2=np.sign(a1-a0)*np.sqrt(d1**2-g0*g1)
    res=a1-(a1-a0)*(g1+d2-d1)/(g1-g0+2*d2) #numerator is h'(a1)+d2-d1
    return res
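The Hermite step can be checked the same way on a function that is itself a cubic, since the Hermite interpolant then reproduces it exactly. The sketch below is a pure-Python variant (the name `hermite_cubic_step` and the test function $h(\alpha) = \alpha^3 - 3\alpha$, whose local minimum is at $\alpha = 1$, are assumptions for illustration).

```python
import math


def hermite_cubic_step(a0, h0, g0, a1, h1, g1):
    """Minimizer of the cubic Hermite interpolant through (a0,h0,g0) and (a1,h1,g1).

    Raises ValueError if d1**2 - g0*g1 is negative (no real stationary point).
    """
    d1 = g0 + g1 - 3.0 * (h1 - h0) / (a1 - a0)
    d2 = math.copysign(math.sqrt(d1 * d1 - g0 * g1), a1 - a0)
    return a1 - (a1 - a0) * (g1 + d2 - d1) / (g1 - g0 + 2.0 * d2)


# Illustrative cubic (assumption): h(a) = a^3 - 3a, h'(a) = 3a^2 - 3.
h = lambda a: a ** 3 - 3.0 * a
dh = lambda a: 3.0 * a ** 2 - 3.0

# Interpolate between a = 0 and a = 2: d1 = 3, d2 = 6, step = 2 - 2*12/24.
alpha_next = hermite_cubic_step(0.0, h(0.0), dh(0.0), 2.0, h(2.0), dh(2.0))
```

The result is the minimizer 1, and not the local maximum at -1, because the sign chosen for `d2` selects the stationary point with positive curvature.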

The Armijo line search algorithm is described in [4]; the corresponding Python code is as follows:

def ArmijoLineSearch(fun,dfun,theta,args,d,a0=1,c1=1e-3,a_min=1e-7,max_iter=1e5):
    """
    #Functionality:Line search under Armijo condition with quadratic and cubic interpolation
    #@Parameters
    #fun:objective Function
    #dfun:compute the gradient of fun
    #theta:a vector of parameters of the model
    #args:other variables needed for fun and dfun
    #d:search direction
    #a0:initial stepsize
    #c1:constant used in Armijo condition
    #a_min:minimum of stepsize
    #max_iter:maximum of the number of iterations
    """
    eps=1e-6
    c1=min(c1,0.5)#c1 should<=0.5
    a_pre=h_pre=g_pre=0
    a_cur=a0
    f_val=fun(theta,args) #h(0)=f(x)
    g_val=np.sum(dfun(theta,args)*d.T) #h'(0)=f'(x)^Td
    k=0
    while a_cur>a_min and k<max_iter:
        h_cur=fun(theta+a_cur*d,args)
        g_cur=np.sum(dfun(theta+a_cur*d,args)*d.T)
        if h_cur<=f_val+c1*a_cur*g_val: #meet Armijo condition
            return a_cur
        if not k: #k=0,use quadratic interpolation
            a_new=quadraticInterpolation(a_cur,h_cur,f_val,g_val)
        else: #k>0,use cubic Hermite interpolation
            a_new=cubicInterpolationHermite(a_pre,h_pre,g_pre,a_cur,h_cur,g_cur)
        #safeguard procedure:fall back to halving when the interpolated step is NaN/inf or barely moves
        if not np.isfinite(a_new) or abs(a_new-a_cur)<eps or abs(a_new)<eps:
            a_new=a_cur/2
        a_pre=a_cur
        a_cur=a_new
        h_pre=h_cur
        g_pre=g_cur
        k+=1
    return a_min #failed search

Wolfe Search

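As noted earlier, the Armijo condition alone (without a backtracking strategy) may accept steps that are too small. To exclude such tiny steps we add a curvature condition (see Figure 5):

$$h'(\alpha) = f'(x_k + \alpha d_k)^T d_k \ge c_2 f'(x_k)^T d_k \tag{22}$$

where $c_1 < c_2 < 1$. The Armijo condition and the curvature condition together form the Wolfe conditions:

$$\begin{cases} f(x_k + \alpha d_k) \le f(x_k) + c_1 \alpha f'(x_k)^T d_k \\ f'(x_k + \alpha d_k)^T d_k \ge c_2 f'(x_k)^T d_k \end{cases} \tag{23}$$

The strong Wolfe conditions additionally exclude steps where the slope is too positive:

$$\begin{cases} f(x_k + \alpha d_k) \le f(x_k) + c_1 \alpha f'(x_k)^T d_k \\ \left|f'(x_k + \alpha d_k)^T d_k\right| \le c_2 \left|f'(x_k)^T d_k\right| \end{cases} \tag{24}$$

A step satisfying the Wolfe conditions always exists when $f$ is smooth and bounded below along $d_k$. Let $\alpha'$ be the smallest positive step at which the function meets the Armijo line:

$$f(x_k + \alpha' d_k) = f(x_k) + \alpha' c_1 f'(x_k)^T d_k \tag{25}$$

By the mean value theorem there is some $\alpha'' \in (0, \alpha')$ with

$$f(x_k + \alpha' d_k) - f(x_k) = \alpha' f'(x_k + \alpha'' d_k)^T d_k \tag{26}$$

Combining (25) and (26), and using $c_1 < c_2$ and $f'(x_k)^T d_k < 0$:

$$f'(x_k + \alpha'' d_k)^T d_k = c_1 f'(x_k)^T d_k > c_2 f'(x_k)^T d_k \tag{27}$$

So $\alpha''$ satisfies the curvature condition, and since $\alpha'' < \alpha'$ it satisfies the Armijo condition as well.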
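To see how the two parts of the strong Wolfe conditions filter candidate steps, the sketch below evaluates them separately for a one-dimensional $h(\alpha)$. The function $h(\alpha) = (\alpha - 1)^2$, the candidate steps and the helper name `satisfies_strong_wolfe` are illustrative assumptions.

```python
def satisfies_strong_wolfe(h, dh, alpha, c1=1e-4, c2=0.9):
    """Return (armijo_ok, curvature_ok) for h(alpha) = f(x_k + alpha * d_k)."""
    armijo_ok = h(alpha) <= h(0.0) + c1 * alpha * dh(0.0)
    curvature_ok = abs(dh(alpha)) <= c2 * abs(dh(0.0))
    return armijo_ok, curvature_ok


# Illustrative h (assumption): minimum at alpha = 1, with h'(0) = -2.
h = lambda a: (a - 1.0) ** 2
dh = lambda a: 2.0 * (a - 1.0)

checks = {a: satisfies_strong_wolfe(h, dh, a) for a in (0.01, 1.0, 2.5)}
```

The tiny step 0.01 passes the Armijo test but fails the curvature test, the step 2.5 overshoots and fails both, and the step 1.0 passes both.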

In Algorithm 5, ...

This is easy to understand with the help of Figure 7, where the red and green points mark ...

def WolfeLineSearch(fun,dfun,theta,args,d,a0=1,c1=1e-4,c2=0.9,a_min=1e-7,max_iter=1e5):
    """
    #Functionality:find a stepsize meeting the strong Wolfe conditions
    #@Parameters
    #fun:objective Function
    #dfun:compute the gradient of fun
    #theta:a vector of parameters of the model
    #args:other variables needed for fun and dfun
    #d:search direction
    #a0:initial stepsize
    #c1:constant used in Armijo condition
    #c2:constant used in curvature condition
    #a_min:minimum of stepsize
    #max_iter:maximum of the number of iterations
    """
    eps=1e-16
    c1=min(c1,0.5)
    a_pre=0
    a_cur=a0
    f_val=fun(theta,args) #h(0)=f(x)
    g_val=np.sum(dfun(theta,args)*d.T) #h'(0)=f'(x)^Td
    h_pre=f_val
    k=0
    while k<max_iter and abs(a_cur-a_pre)>=eps:
        h_cur=fun(theta+a_cur*d,args) #f(x+ad)
        #Armijo fails,or the value stopped decreasing:a suitable step lies in [a_pre,a_cur]
        if h_cur>f_val+c1*a_cur*g_val or (h_cur>=h_pre and k>0):
            return zoom(fun,dfun,theta,args,d,a_pre,a_cur,c1,c2)
        g_cur=np.sum(dfun(theta+a_cur*d,args)*d.T)
        if abs(g_cur)<=-c2*g_val:#satisfy Wolfe condition
            return a_cur
        if g_cur>=0: #slope changed sign:a minimizer lies in [a_pre,a_cur]
            return zoom(fun,dfun,theta,args,d,a_pre,a_cur,c1,c2)
        a_new=quadraticInterpolation(a_cur,h_cur,f_val,g_val)
        #safeguard:the next trial step must grow,otherwise [a_pre,a_cur] is not a valid interval for zoom
        if not np.isfinite(a_new) or a_new<=a_cur:
            a_new=2*a_cur
        a_pre=a_cur
        a_cur=a_new
        h_pre=h_cur
        k+=1
    return a_min

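The zoom function is described in Algorithm 6. It takes a search interval [a_low, a_high] that contains a step size satisfying the Wolfe conditions.

The Python code for the zoom function is as follows: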

def zoom(fun,dfun,theta,args,d,a_low,a_high,c1=1e-3,c2=0.9,max_iter=1e4):
    """
    #Functionality:shrink the interval to find a stepsize meeting the strong Wolfe conditions
    #@Parameters
    #fun:objective Function
    #dfun:compute the gradient of fun
    #theta:a vector of parameters of the model
    #args:other variables needed for fun and dfun
    #d:search direction
    #[a_low,a_high]:interval containing a stepsize satisfying Wolfe condition
    #c1:constant used in Armijo condition
    #c2:constant used in curvature condition
    #max_iter:maximum of the number of iterations
    """
    if a_low>a_high:
        print('low:%f,high:%f'%(a_low,a_high))
        raise Exception('Invalid interval of stepsize in zoom procedure')
    eps=1e-16
    h=fun(theta,args) #h(0)=f(x)
    g=np.sum(dfun(theta,args)*d.T) #h'(0)=f'(x)^Td
    k=0
    h_low=fun(theta+a_low*d,args)
    if h_low>h+c1*a_low*g:
        raise Exception('Left endpoint violates Armijo condition in zoom procedure')
    while k<max_iter and abs(a_high-a_low)>=eps:
        a_new=(a_low+a_high)/2.0 #bisect the interval
        h_new=fun(theta+a_new*d,args)
        if h_new>h+c1*a_new*g or h_new>h_low: #midpoint is too far:keep the left half
            a_high=a_new
        else:
            g_new=np.sum(dfun(theta+a_new*d,args)*d.T)
            if abs(g_new)<=-c2*g: #satisfy Wolfe condition
                return a_new
            if g_new*(a_high-a_low)>=0: #slope is non-negative:keep the left half
                a_high=a_new
            else: #still descending:keep the right half
                a_low=a_new
                h_low=h_new
        k+=1
    return a_low #a_low always satisfies the Armijo condition

Newton's Method

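Newton's method [8] finds a root of a function iteratively. Starting from an initial point, it repeatedly approximates the function around the current point $x_k$ by a Taylor expansion and moves to the solution of the approximated problem. For minimization, the second-order expansion is

$$f(x_k + \Delta x) \approx f(x_k) + f'(x_k)^T \Delta x + \frac{1}{2}\Delta x^T B_k \Delta x \tag{28}$$

where $B_k$ is the Hessian at $x_k$. Differentiating this model with respect to $\Delta x$ and writing $x_{k+1} = x_k + \Delta x$ gives

$$f'(x_{k+1}) = f'(x_k) + B_k (x_{k+1} - x_k) \tag{29}$$

and setting $f'(x_{k+1}) = 0$ yields the Newton update

$$x_{k+1} = x_k - B_k^{-1} f'(x_k) \tag{30}$$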
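In one dimension the update (30) reduces to dividing the first derivative by the second. The sketch below applies it to the illustrative function $f(x) = x - \ln x$ (an assumption, chosen because its minimizer $x = 1$ is known and the error squares at every step); `newton_minimize` is a hypothetical helper name.

```python
def newton_minimize(df, d2f, x0, tol=1e-12, max_iter=50):
    """Iterate x_{k+1} = x_k - f'(x_k) / f''(x_k) until the step is below tol."""
    x = x0
    history = [x]
    for _ in range(max_iter):
        step = df(x) / d2f(x)  # B_k^{-1} f'(x_k) in one dimension
        x -= step
        history.append(x)
        if abs(step) < tol:
            break
    return x, history


# f(x) = x - ln(x): f'(x) = 1 - 1/x, f''(x) = 1/x^2, minimizer x = 1.
x_star, history = newton_minimize(
    lambda x: 1.0 - 1.0 / x,
    lambda x: 1.0 / x ** 2,
    x0=0.5,
)
```

Starting from 0.5 the iterates are 0.75, 0.9375, 0.99609375 and so on, so the distance to the minimizer is squared at each step. This fast local convergence is what quasi-Newton methods try to retain without computing second derivatives.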

Quasi-Newton Method

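Quasi-Newton methods [11] find local optima of a function, that is, stationary points where the derivative is zero. When Newton's method is applied to optimization, it assumes that the function can be approximated by a quadratic and uses the first and second derivatives to locate a local optimum. Quasi-Newton methods do not compute the Hessian exactly. Instead, they build an approximation $B_{k+1}$ from two successive gradient vectors by requiring the quasi-Newton (secant) condition:

$$f'(x_{k+1}) - f'(x_k) \approx B_{k+1}(x_{k+1} - x_k) \tag{31}$$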

from numpy import matlib

def BFGS(fun,dfun,theta,args,H=None,mode=0,eps=1e-12,max_iter=1e4):
    """
    #Functionality:find the minimum of objective function f(x)
    #@Parameters
    #fun:objective function f(x)
    #dfun:compute the gradient of f(x)
    #args:parameters needed by fun and dfun
    #theta:start vector of parameters of the model
    #H:initial inverse Hessian approximation
    #mode:index of line search algorithm
    #Assumption:theta and the gradients are 1*n np.matrix row vectors,so * is matrix multiplication
    """
    x_pre=x_cur=theta
    g=dfun(x_cur,args)
    if H is None:#initialize H as an identity matrix
        H=matlib.eye(theta.size)
    k=0
    while k<max_iter and np.sum(np.abs(g))>eps:
        d=-g*H #search direction -H*g (H is symmetric)
        #LineSearch selects one of the line searches above by mode;it is defined elsewhere
        step=LineSearch(fun,dfun,x_pre,args,d,1,mode)
        x_cur=x_pre+step*d
        s=step*d
        g_new=dfun(x_cur,args)
        y=g_new-g
        ys=np.sum(y*s.T)
        if abs(ys)<eps:
            return x_cur
        #BFGS update of the inverse Hessian approximation
        change=(ys+np.sum(y*H*y.T))*(s.T*s)/(ys**2)-(H*y.T*s+s.T*y*H)/ys
        H=H+change
        g=g_new
        x_pre=x_cur
        k+=1
    return x_cur
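The update applied to `H` in the listing above is the BFGS update of the inverse Hessian approximation, and its defining property is the secant condition in inverse form: the updated matrix maps `y` back to `s`. The pure-Python sketch below checks this on a 2x2 example. The helper name `bfgs_inverse_update` and the sample vectors are assumptions, and `H` is assumed symmetric.

```python
def bfgs_inverse_update(H, s, y):
    """Return H + (s.y + y^T H y) s s^T / (s.y)^2 - (H y s^T + s y^T H) / (s.y).

    H is a symmetric matrix given as a list of rows; s and y are lists.
    """
    n = len(s)
    ys = sum(yi * si for yi, si in zip(y, s))
    Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
    yHy = sum(y[i] * Hy[i] for i in range(n))
    coef = (ys + yHy) / ys ** 2
    return [
        [H[i][j] + coef * s[i] * s[j] - (Hy[i] * s[j] + s[i] * Hy[j]) / ys
         for j in range(n)]
        for i in range(n)
    ]


H0 = [[1.0, 0.0], [0.0, 1.0]]   # identity as the initial approximation
s = [1.0, 0.5]                  # x_{k+1} - x_k (illustrative values)
y = [2.0, 0.5]                  # f'(x_{k+1}) - f'(x_k), with s.y = 2.25 > 0

H1 = bfgs_inverse_update(H0, s, y)
H1_y = [sum(H1[i][j] * y[j] for j in range(2)) for i in range(2)]  # should equal s
```

The condition `s.y > 0` matters: it keeps the updated matrix positive definite, and a line search satisfying the Wolfe conditions guarantees it.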

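Next we analyze how to construct the L-BFGS algorithm [10,13]. Suppose we are at iteration $k$ of the optimization and keep only the $m$ most recent pairs $s_i = x_{i+1} - x_i$, $y_i = f'(x_{i+1}) - f'(x_i)$. With $\rho_k = 1/(y_k^T s_k)$ and $V_k = I - \rho_k y_k s_k^T$, the BFGS update of the inverse Hessian approximation is $H_k = V_{k-1}^T H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^T$, so the product needed for the search direction unrolls as

$$\begin{aligned} H_k g_k &= V_{k-1}^T H_{k-1} V_{k-1} g_k + s_{k-1}\rho_{k-1} s_{k-1}^T g_k \\ &= V_{k-1}^T V_{k-2}^T H_{k-2} V_{k-2} V_{k-1} g_k + V_{k-1}^T s_{k-2}\rho_{k-2} s_{k-2}^T V_{k-1} g_k + s_{k-1}\rho_{k-1} s_{k-1}^T g_k \\ &= \cdots \end{aligned} \tag{32}$$

Define, with $q_0 = g_k$,

$$q_i = (V_{k-i} \dots V_{k-2} V_{k-1}) g_k \tag{33}$$

$$a_i = \rho_{k-i} s_{k-i}^T q_{i-1} \tag{34}$$

so that each $q_i$ follows from the previous one:

$$q_i = V_{k-i} q_{i-1} = q_{i-1} - \rho_{k-i} y_{k-i} s_{k-i}^T q_{i-1} = q_{i-1} - a_i y_{k-i} \tag{35}$$

The expansion can then be evaluated from the inside out:

$$H_k g_k = P_1 = V_{k-1}^T P_2 + s_{k-1} a_1 \tag{36}$$

$$P_2 = V_{k-2}^T P_3 + s_{k-2} a_2 \tag{37}$$

$$P_i = V_{k-i}^T P_{i+1} + s_{k-i} a_i = P_{i+1} + \left(a_i - \rho_{k-i} y_{k-i}^T P_{i+1}\right) s_{k-i} \tag{38}$$

In Algorithm 9, the matrix $H_{k-m}$ at which the recursion stops must be supplied. A common choice is the scaled identity $\gamma_k I$ with

$$\gamma_k = \frac{y_{k-1}^T s_{k-1}}{y_{k-1}^T y_{k-1}} \tag{39}$$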
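The recursion in (33) to (39) can be sketched as a two-loop procedure that never forms a matrix. The version below is a pure-Python illustration with hypothetical names (`two_loop`, `dot`), storing the pairs oldest first and assuming the scaled identity from (39) as the initial matrix. A convenient check is that the implied matrix maps the most recent `y` back to the most recent `s`, as any BFGS update must.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def two_loop(g, s_list, y_list):
    """Return H_k g from stored pairs (oldest first) without forming H_k."""
    m = len(s_list)
    rho = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    a = [0.0] * m
    q = list(g)
    for j in reversed(range(m)):                 # newest pair first, as in (34)-(35)
        a[j] = rho[j] * dot(s_list[j], q)
        q = [qi - a[j] * yi for qi, yi in zip(q, y_list[j])]
    gamma = 1.0                                  # plain identity when no history exists
    if m:
        gamma = dot(y_list[-1], s_list[-1]) / dot(y_list[-1], y_list[-1])  # (39)
    p = [gamma * qi for qi in q]
    for j in range(m):                           # oldest pair first, unwinding (36)-(37)
        b = rho[j] * dot(y_list[j], p)
        p = [pi + (a[j] - b) * si for pi, si in zip(p, s_list[j])]
    return p


# Illustrative history (assumption): two pairs with positive curvature y.s > 0.
s_list = [[1.0, 0.0], [0.5, 1.0]]
y_list = [[2.0, 0.5], [1.0, 3.0]]

recovered_s = two_loop(y_list[-1], s_list, y_list)   # should equal s_list[-1]
direction = [-p for p in two_loop([1.0, 2.0], s_list, y_list)]
```

The cost is a handful of dot products per stored pair, which is why L-BFGS scales to problems where the dense matrix used by BFGS would not fit in memory.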

import copy

def LBFGS(fun,dfun,theta,args,mode=0,eps=1e-12,max_iter=1e4):
    """
    #Functionality:find the minimum of objective function f(x) with LBFGS
    #@Parameters
    #fun:objective function f(x)
    #dfun:compute the gradient of f(x)
    #args:parameters needed by fun and dfun
    #theta:start vector of parameters of the model
    #mode:index of line search algorithm
    """
    x_pre=x_cur=theta
    s_arr=[]
    y_arr=[]
    Hscale=1
    k=0
    while k<max_iter:
        g=dfun(x_cur,args)
        d=LBFGSSearchDirection(y_arr,s_arr,Hscale,-g)
        #LineSearch selects one of the line searches above by mode;it is defined elsewhere
        step=LineSearch(fun,dfun,x_pre,args,d,1,mode)
        s=step*d
        x_cur=x_pre+s
        y=dfun(x_cur,args)-g
        if np.sum(np.abs(s))<eps:
            return x_cur
        x_pre=x_cur
        k+=1
        y_arr,s_arr,scale=LBFGSUpdate(y,s,y_arr,s_arr)
        if scale>=1e-12:#keep the previous scale when the update was skipped
            Hscale=scale
    return x_cur


def LBFGSSearchDirection(y_arr,s_arr,Hscale,g):
    """
    #Functionality:estimate search direction with LBFGS
    #@Parameters
    #y_arr:list of m vectors,where y_arr[i]=f'(x_{i+1})-f'(x_i)
    #s_arr:list of m vectors,where s_arr[i]=x_{i+1}-x_i
    #Hscale:a scale to initialize the inverse of Hessian matrix
    #g:a row vector representing -f'(x_{k})
    """
    histNum=len(s_arr)#number of update data stored
    if not histNum:
        return g
    a_arr=[0 for i in range(histNum)]
    rho=[0 for i in range(histNum)]
    q=g.copy()
    for i in range(1,histNum+1):#newest pair first
        s=s_arr[histNum-i]
        y=y_arr[histNum-i]
        rho[histNum-i]=1/np.sum(s*y.T)
        a_arr[i-1]=rho[histNum-i]*np.sum(s*q.T)
        q-=(a_arr[i-1]*y)
    P=Hscale*q
    for i in range(histNum,0,-1):#oldest pair first
        y=y_arr[histNum-i]
        s=s_arr[histNum-i]
        beta=rho[histNum-i]*np.sum(y*P.T)
        P+=s*(a_arr[i-1]-beta)
    return P


def LBFGSUpdate(y,s,oldy,olds,m=100):
    """
    #Functionality:refresh the historical update data
    #@Parameters
    #y:f'(x_{k+1})-f'(x_k)
    #s:x_{k+1}-x_k
    #oldy:[y0,y1,...],which is a list
    #olds:[s0,s1,...],which is a list
    #m:number of historical data to store(default:100)
    """
    eps=1e-12
    Hscale=np.sum(y*s.T)/np.sum(y*y.T) #y^Ts/y^Ty,a scale to initialize H_{k-m}
    if Hscale<eps:#skip update
        return oldy,olds,Hscale

    cur_m=len(oldy)
    if cur_m>=m:#drop the oldest pair
        oldy.pop(0)
        olds.pop(0)
    oldy.append(copy.deepcopy(y))
    olds.append(copy.deepcopy(s))
    return oldy,olds,Hscale

References

[1] Backtracking line search. http://en.wikipedia.org/wiki/Backtracking_line_search.

[2] Bisection method. http://en.wikipedia.org/wiki/Bisection_method.

[3] Gradient descent. http://en.wikipedia.org/wiki/Gradient_descent.

[4] Limited-memory BFGS. http://en.wikipedia.org/wiki/Limited-memory_BFGS.

[5] Line search methods. http://pages.cs.wisc.edu/~ferris/cs730/chap3.pdf.

[6] Line search methods: step length selection. http://terminus.sdsu.edu/SDSU/Math693a_f2013/Lectures/06/lecture.pdf.

[7] Math 408a line search methods. https://www.math.washington.edu/~burke/crs/408/lectures/L7-line-search.pdf.

[8] Newton’s method. http://en.wikipedia.org/wiki/Newton%27s_method.

[9] Nonlinear programming algorithms. http://www.math.bme.hu/~bog/GlobOpt/Chapter5.pdf.

[10] Overview of quasi-Newton optimization methods. https://homes.cs.washington.edu/~galen/files/quasi-newton-notes.pdf.

[11] Quasi-Newton method. http://en.wikipedia.org/wiki/Quasi-Newton_method.

[12] Unconstrained minimization. http://www.ing.unitn.it/~bertolaz/2-teaching/2011-2012/AA-2011-2012-OPTIM/lezioni/slides-mND.pdf.

[13] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3):503–528, 1989.

Author: JeromeWang
Email: yunfeiwang@hust.edu.cn
Source: http://www.cnblogs.com/jeromeblog/
