Reinforcement Learning (9): Policy Gradient


Policy Gradient Methods

Almost all of the reinforcement learning methods covered so far are so-called 'action-value' methods: they first learn the value of each action in each state, and then, in each state, select actions according to these estimated action values. This can be seen as an 'indirect' approach, because the goal of reinforcement learning is to make decisions, and these methods use the value of each action as an intermediate quantity that supports the decision. It is an intuitive and easy-to-understand way of thinking. There is also a more 'direct' alternative: learn the policy itself, without the help of an auxiliary construct such as a value function. This is more direct because what learning produces is the decision rule itself, but it is less intuitive. With a value function as a guide there is hardly any interpretability problem: whichever action has the highest value gets chosen, which is easy to understand. Learning the policy directly offers no such reference point, so it is harder to see why a particular action is selected, and this is the main reason these methods feel less transparent than action-value methods. In fact, just as in deep learning, we succeed as long as we find a function that fits the decision process well. Since the decision process can be regarded as a decision function, we only need to apply reinforcement learning to approximate that function as closely as possible. And once we are learning a decision function, it is natural to learn its parameters, which makes the policy a parameterized policy. A parameterized policy selects actions without consulting a value function; a value function may still be used, however, when learning the policy parameters.

The popularity of deep learning has made gradient-based algorithms widely used, so gradient-based policy learning has become the mainstream approach. Gradient-based learning requires an objective, here a scalar performance measure of the parameterized policy, which we denote \(J(\theta)\). The rest is straightforward: we maximize this performance measure by gradient ascent on the parameters:

\[\theta_{t+1} = \theta_t + \alpha \widehat{\nabla J (\theta_t)} \]

This is the general form of policy gradient algorithms, where \(\widehat{\nabla J (\theta_t)}\) is a stochastic estimate whose expectation approximates the gradient of the performance measure with respect to \(\theta_t\).

Policy Approximation and its Advantages

If the action space is discrete and not too large, we can describe each state-action pair by a preference function \(h(s,a,\theta)\): in every state, the action with the largest preference gets the largest probability of being selected. The most common way to achieve this is a soft-max over the preferences:

\[\pi(a|s,\theta) \dot = \frac{e^{h(s,a,\theta)}}{\sum_b e^{h(s,b,\theta)}} \]

The preference function \(h(s,a,\theta)\) can be parameterized by an ANN or simply as a linear combination of features.
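
For concreteness, here is a minimal Python/NumPy sketch of a soft-max policy over linear preferences h(s,a,theta) = theta . x(s,a), together with the gradient of ln pi, which the algorithms below need repeatedly; for a linear soft-max this gradient is x(s,a) - sum_b pi(b|s,theta) x(s,b). The feature array x and its shape are assumptions made purely for illustration.

# Python sketch: soft-max policy with linear action preferences
import numpy as np

def softmax_policy(theta, x):
    # pi(.|s, theta) for linear preferences h(s, a, theta) = theta . x[a],
    # where x[a] is the (assumed) feature vector of the pair (s, a)
    h = x @ theta
    z = np.exp(h - h.max())          # subtract max for numerical stability
    return z / z.sum()

def grad_log_pi(theta, x, a):
    # grad_theta ln pi(a|s, theta) = x[a] - sum_b pi(b|s, theta) x[b]
    pi = softmax_policy(theta, x)
    return x[a] - pi @ x

# usage: 3 actions, 4 features per state-action pair
theta = np.zeros(4)
x = np.random.randn(3, 4)
print(softmax_policy(theta, x))      # uniform when theta = 0
print(grad_log_pi(theta, x, a=1))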

Parameterized policies have several advantages. First, a parameterized policy can approach a deterministic policy, something that ε-greedy selection over estimated action values never does.

Second, a soft-max over action preferences can assign actions arbitrary probabilities, so stochastic optimal policies can be represented.

Third, compared with action-value methods, the policy may be a simpler function to approximate.

Finally, policy parameterization offers a convenient way to inject prior knowledge about the desired form of the policy.

The Policy Gradient Theorem

The advantages above are practical reasons to prefer policy parameterization over action-value methods. In theory, policy parameterization has another important advantage: the policy gradient theorem, which gives an analytic expression for the gradient of the performance measure:

\[\nabla J(\theta) \propto \sum_s \mu(s)\sum_a q_{\pi}(s,a)\nabla \pi(a|s,\theta) \]

where \(\mu(s)\) is the on-policy distribution under \(\pi\). To prove the theorem, we first consider \(\nabla v_{\pi}(s)\).

\[\begin{array}\\ \nabla v_{\pi}(s)& = & \nabla\Big[\sum_a \pi(a|s)q_{\pi}(s,a) \Big] \qquad \forall s \in S\\ &=& \sum_a \Big[\nabla \pi(a|s)q_{\pi}(s,a) + \pi(a|s)\nabla q_{\pi}(s,a) \Big]\\ &=& \sum_a \Big[\nabla \pi(a|s)q_{\pi}(s,a) + \pi(a|s)\nabla\sum_{s',r}p(s',r|s,a)\big(r+v_{\pi}(s')\big) \Big]\\ & =& \sum_a\Big[\nabla \pi(a|s)q_{\pi}(s,a) + \pi(a|s)\sum_{s'}p(s'|s,a)\nabla v_{\pi}(s') \Big]\\ \end{array} \]

This gives a recursion that expresses \(\nabla v_{\pi}(s)\) in terms of \(\nabla v_{\pi}(s')\), which is an important intermediate result.

Next we need one more observation. Viewed at the level of states, an MDP is a process of moving from one state s to another state s'. There may be many ways to get from s to s', but the probability of doing so in exactly one step can be written down directly:

\[p_{\pi}(s\rightarrow s',n =1) = \sum_a \pi(a|s)p(s'|s,a) \]

where n denotes the number of steps. What about n = k? We cannot write it down directly, and it is predictably complicated, but it can be obtained recursively: suppose the probability for n = k, \(p_{\pi}(s\rightarrow s',n = k)\), is known; then the probability for n = k + 1 is

\[\begin{array}\\ p_{\pi}(s\rightarrow s',n = k+1) &=& \sum_{s''}p(s'|s'')p_{\pi}(s\rightarrow s'',n=k)\\ &=& \sum_{s''} \sum_a \pi(a|s'')p(s'|s'',a)p_{\pi}(s\rightarrow s'',n=k)\\ &= & \sum_{s''}p_{\pi}(s\rightarrow s'',n = k)p_{\pi}(s''\rightarrow s',n =1) \end{array} \]

With this, we can continue the derivation of \(\nabla v_{\pi}(s)\):

\[\begin{array}\\ \nabla v_{\pi}(s)& = & \sum_a\Big[\nabla \pi(a|s)q_{\pi}(s,a) + \pi(a|s)\sum_{s'}p(s'|s,a)\nabla v_{\pi}(s') \Big]\\ & =& \sum_a\nabla \pi(a|s)q_{\pi}(s,a) + \sum_{s'}\sum_a\pi(a|s) p(s'|s,a)\nabla v_{\pi}(s')\\ & & \big (for\ simplicity, define: \phi(s) =\sum_a\nabla \pi(a|s)q_{\pi}(s,a) \big)\\ &=& \phi(s) + \sum_{s'}p_{\pi}(s\rightarrow s',1)\nabla v_{\pi}(s')\\ & = & \phi(s) + \sum_{s'}p_{\pi}(s\rightarrow s',1)\Big( \phi(s') + \sum_{s''}p_{\pi}(s'\rightarrow s'',1)\nabla v_{\pi}(s'') \Big)\\ & = & \phi(s) + \sum_{s'}p_{\pi}(s\rightarrow s',1) \phi(s') + \sum_{s'}p_{\pi}(s\rightarrow s',1)\sum_{s''}p_{\pi}(s'\rightarrow s'',1)\nabla v_{\pi} (s'')\\ & =& \phi(s) + \sum_{s'}p_{\pi}(s\rightarrow s',1) \phi(s') + \sum_{s''}p_{\pi}(s\rightarrow s'',2)\nabla v_{\pi} (s'')\\ &=& \dots\\ & = & \sum_x\sum_{k=0}^{\infty}p_{\pi}(s\rightarrow x,k)\phi(x) \end{array} \]

As mentioned above, \(J(\theta)\) is the performance measure. A common choice in the episodic case is the value of the start state:

\[J(\theta) \dot = v_{\pi}(s_0) \]

Therefore:

\[\begin{array}\\ \nabla J(\theta) &=& \nabla v_{\pi}(s_0)\\ &=& \sum_s\sum_{k = 0}^{\infty}p_{\pi}(s_0\rightarrow s,k)\phi(s)\\ &=& \sum_s \eta(s) \phi(s)\\ &=& \sum_{s'}\eta(s') \sum_s\frac{\eta(s)}{\sum_{s'}\eta(s')} \phi(s)\\ & \propto & \sum_{s}\frac{\eta(s)}{\sum_{s'}\eta(s')} \phi(s)\qquad\qquad \qquad (as \ \sum_{s'}\eta(s') \ is \ a\ constant.)\\ &=& \sum_{s} \mu(s) \sum_a\nabla \pi(a|s)q_{\pi}(s,a) \qquad (define \ \mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')}, \ which\ proves\ the\ theorem)\\ & = &  \sum_{s} \mu(s) \sum_a\pi(a|s) q_{\pi}(s,a)\frac{\nabla \pi(a|s)}{\pi(a|s)}\\ & =& E_{\pi}\big[ q_{\pi}(s,a)\nabla \ln \pi(a|s) \big] \qquad \qquad (E_{\pi}\ \ refers\ to\ E_{s\sim \mu(s),a\sim \pi_{\theta}})\\ & = & E_{\pi}\Big[G_t \nabla \ln \pi(a|s) \Big]\qquad \qquad (as \ E_{\pi}[G_t|S_t,A_t] = q_{\pi}(S_t,A_t)) \end{array} \]
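As a sanity check on this result, the following sketch compares the gradient given by the theorem (in the exact form derived above, with weights \(\eta(s)\) equal to the expected number of visits to s) against a finite-difference estimate of \(\nabla v_{\pi}(s_0)\). The two-state episodic MDP, its rewards, and the tabular soft-max parameterization are all made up purely for illustration.

# Python sketch: numerical check of the policy gradient theorem on a toy MDP
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Hypothetical episodic MDP (gamma = 1), start state s_0 = 0:
#   state 0: action 0 -> state 1, reward 0;  action 1 -> terminal, reward 1
#   state 1: action 0 -> terminal, reward 2; action 1 -> terminal, reward 0
def value_and_q(theta):
    pi0, pi1 = softmax(theta[0]), softmax(theta[1])
    v1 = 2.0 * pi1[0]                           # v_pi(1)
    q = np.array([[v1, 1.0],                    # q_pi(0, .)
                  [2.0, 0.0]])                  # q_pi(1, .)
    return pi0 @ q[0], q                        # v_pi(0) = J(theta), and q_pi

def analytic_grad(theta):
    # policy gradient theorem: grad J = sum_s eta(s) sum_a q(s,a) grad pi(a|s)
    _, q = value_and_q(theta)
    pis = [softmax(theta[0]), softmax(theta[1])]
    eta = [1.0, pis[0][0]]                      # expected visits starting from s_0
    grad = np.zeros_like(theta)
    for s in range(2):
        for a in range(2):
            # d pi(a|s) / d theta[s, b] = pi(a|s) * (1[a == b] - pi(b|s))
            dpi = pis[s][a] * ((np.arange(2) == a) - pis[s])
            grad[s] += eta[s] * q[s, a] * dpi
    return grad

theta = np.array([[0.3, -0.2], [0.1, 0.5]])
num_grad = np.zeros_like(theta)
eps = 1e-6
for s in range(2):
    for b in range(2):
        d = np.zeros_like(theta); d[s, b] = eps
        num_grad[s, b] = (value_and_q(theta + d)[0] - value_and_q(theta - d)[0]) / (2 * eps)

print(analytic_grad(theta))   # matches the finite-difference gradient below
print(num_grad)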

REINFORCE: Monte Carlo Policy Gradient

\[\begin{array}\\ \nabla J(\theta)& \propto& \sum_s \mu(s)\sum_a q_{\pi}(s,a)\nabla \pi(a|s,\theta) \\ &=& E_{\pi}\left[\sum_a \pi(a|S_t,\theta)q_{\pi}(S_t,a) \frac{\nabla\pi(a|S_t,\theta)}{\pi(a|S_t,\theta)}\right]\\ &=& E_{\pi}\left[q_{\pi}(S_t,A_t)\frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right]\qquad\qquad (\text{replacing a by the sample} \ A_t \sim \pi)\\ &=& E_{\pi}\left[ G_t \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\right]\qquad\qquad\qquad (\text{because}\ E_{\pi}[G_t | S_t,A_t] = q_{\pi}(S_t,A_t)) \end{array} \]

where \(G_t\) is the return. From the above, we obtain the parameter update rule:

\[\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \]

# REINFORCE: Monte-Carlo Policy-Gradient Control (episodic) for pi*
Algorithm parameter: step size alpha > 0
Initialize policy parameter theta (a vector)
Loop forever (for each episode):
    Generate an episode S_0,A_0,R_1,...,S_{T-1},A_{T-1},R_T, following pi(.|.,theta)
    Loop for each step of the episode t = 0,1,...,T-1:
        G = sum_{k=t+1}^T gamma^{k-t-1} R_k
        theta = theta + alpha gamma^t G grad(ln pi(A_t|S_t,theta))
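
Below is a sketch of the same algorithm in Python. It assumes a gym-style environment env (reset returning (observation, info), step returning the usual 5-tuple) and a hypothetical feature function feat(s, a) for a linear soft-max policy; it is meant to illustrate the update above, not to be a tuned implementation.

# Python sketch: REINFORCE with a linear soft-max policy
import numpy as np

def reinforce(env, feat, num_actions, num_features,
              alpha=1e-3, gamma=0.99, episodes=1000):
    theta = np.zeros(num_features)

    def pi(s):
        h = np.array([feat(s, a) @ theta for a in range(num_actions)])
        z = np.exp(h - h.max())
        return z / z.sum()

    for _ in range(episodes):
        # generate an episode S_0, A_0, R_1, ... following pi(.|., theta)
        s, _ = env.reset()
        states, actions, rewards, done = [], [], [], False
        while not done:
            p = pi(s)
            a = np.random.choice(num_actions, p=p)
            s2, r, terminated, truncated, _ = env.step(a)
            states.append(s); actions.append(a); rewards.append(r)
            s, done = s2, terminated or truncated

        # backward pass: compute G_t and apply the gradient-ascent update
        G = 0.0
        for t in reversed(range(len(states))):
            G = rewards[t] + gamma * G
            p = pi(states[t])
            x = np.array([feat(states[t], b) for b in range(num_actions)])
            grad_ln_pi = x[actions[t]] - p @ x
            theta += alpha * (gamma ** t) * G * grad_ln_pi
    return theta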

REINFORCE with Baseline

As a Monte Carlo method, REINFORCE suffers from relatively high variance, which makes learning slow. Introducing a baseline can reduce this variance:

\[\nabla J(\theta) \propto \sum_s \mu(s)\sum_a \big( q_{\pi}(s,a)-b(s)\big)\nabla \pi(a|s,\theta) \]

Hence:

\[\theta_{t+1} = \theta_t + \alpha (G_t-b(S_t)) \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)} \]

A natural choice of baseline is an estimate of the state-value function \(v(s,w)\).

# REINFORCE with baseline(episodic), for estimating pi_theta = pi*
Input: a differentiable policy parameterization pi(a|s,theta)
Input: a differentiable state-value function parameterization v(s,w)
Algorithm parameters: step sizes alpha_theta > 0, alpha_w > 0
Initialize policy parameter theta and state-value weights w

Loop forever (for each episode):
    Generate an episode S_0,A_0,R_1,...,S_{T-1},A_{T-1},R_T following pi(.|.,theta)
    Loop for each step of the episode t = 0,1,...,T-1:
        G = sum_{k=t+1}^T gamma^{k-t-1} R_k
        delta = G - v(S_t,w)
        w = w + alpha_w gamma^t delta grad(v(S_t,w))
        theta = theta + alpha_theta gamma^t delta grad(ln(pi(A_t|S_t,theta)))
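
Relative to plain REINFORCE only the inner-loop update changes. The small sketch below shows that step in isolation, assuming a value-feature function x_v(s) so that v(s,w) = w . x_v(s), and grad_ln_pi computed as in the earlier soft-max sketch.

# Python sketch: one inner-loop step of REINFORCE with a linear value baseline
import numpy as np

def baseline_step(theta, w, G, t, s_t, grad_ln_pi, x_v,
                  alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
    delta = G - w @ x_v(s_t)                              # G_t - v(S_t, w)
    w = w + alpha_w * (gamma ** t) * delta * x_v(s_t)     # grad v(S_t, w) = x_v(s_t)
    theta = theta + alpha_theta * (gamma ** t) * delta * grad_ln_pi
    return theta, w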

Actor-Critic Methods

In REINFORCE with baseline the return is a full Monte Carlo return. Replacing it with the one-step (bootstrapped) return \(G_{t:t+1} = R_{t+1}+\gamma \hat v(S_{t+1},w)\) yields the one-step actor-critic update, in which the learned value function (the critic) evaluates the actions taken by the parameterized policy (the actor):

\[\begin{array}\\ \theta_{t+1} &\dot =& \theta_t + \alpha (G_{t:t+1}-\hat v(S_t,w)) \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\\ &=& \theta_t + \alpha(R_{t+1} + \gamma\hat v(S_{t+1},w) - \hat v(S_t,w)) \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\\ &=& \theta_t + \alpha \delta_t \frac{\nabla\pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}\\ \end{array} \]

# One-step Actor-Critic (episodic), for estimating pi_theta = pi*
Input: a differentiable policy parameterization pi(a|s,theta)
Input: a differentiable state-value function parameterization v(s,w)
Parameters: step sizes alpha_theta > 0, alpha_w > 0
Initialize policy parameter theta and state value weights w
Loop forever (for each episode):
    Initialize S (first state of episode)
    I = 1
    Loop while S is not terminal (for each time step):
        A ~ pi(.|S,theta)
        take action A, observe S', R
        delta = R + gamma v(S',w) - v(S,w)    (if S' is terminal, v(S',w) = 0)
        w = w + alpha_w I delta grad(v(S,w))
        theta = theta + alpha_theta I delta grad(ln pi(A|S,theta))
        I = gamma I
        S = S'
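
A Python sketch of the one-step actor-critic, under the same assumptions as the REINFORCE sketch above (gym-style env, policy features feat(s, a), value features x_v(s), and explicit dimensions d_theta and d_w):

# Python sketch: one-step actor-critic with linear critic and linear soft-max actor
import numpy as np

def one_step_actor_critic(env, feat, x_v, num_actions, d_theta, d_w,
                          alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99, episodes=500):
    theta, w = np.zeros(d_theta), np.zeros(d_w)

    def pi(s):
        h = np.array([feat(s, a) @ theta for a in range(num_actions)])
        z = np.exp(h - h.max())
        return z / z.sum()

    for _ in range(episodes):
        s, _ = env.reset()
        I, done = 1.0, False
        while not done:
            p = pi(s)
            a = np.random.choice(num_actions, p=p)
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            v_s2 = 0.0 if terminated else w @ x_v(s2)           # v(terminal) = 0
            delta = r + gamma * v_s2 - w @ x_v(s)               # one-step TD error
            w += alpha_w * I * delta * x_v(s)                   # critic update
            x = np.array([feat(s, b) for b in range(num_actions)])
            theta += alpha_theta * I * delta * (x[a] - p @ x)   # actor update
            I *= gamma
            s = s2
    return theta, w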
# Actor-Critic with Eligibility Traces(episodic), for estimating pi_theta = pi*
Input: a differentiable policy parameterization pi(a|s,theta)
Input: a differentiable state-value function parameterization v(s,w)
Parameters: trace-decay rates lambda_theta in [0,1], lambda_w in [0,1], step sizes alpha_theta > 0, alpha_w > 0
Initialize policy parameter theta and state-value weights w

Loop forever(for each episode):
    Initialize S (first state of episode)
    z_theta = 0 (d'-component eligibility trace vector)
    z_w = 0 (d-component eligibility trace vector)
    I = 1
    Loop while S is not terminal (for each time step):
         A ~ pi(.|S,theta)
         take action A, observe S',R
         delta = R + gamma v(S',w) - v(S,w)
         z_w = gamma lambda_w z_w + I grad(v(S,w))
         z_theta = gamma lambda_theta z_theta + I grad(ln(pi(A|S,theta)))
         w = w + alpha_w delta z_w
         theta = theta + alpha_theta delta z_theta
         I = gamma I
         S = S'                             
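
Only the bookkeeping of the two trace vectors differs from the one-step version. The following is a sketch of that single per-step update, with arguments named after the quantities in the box above (delta, I, the value features x_v_s of the current state, and grad_ln_pi as before):

# Python sketch: per-step trace update for the episodic actor-critic with traces
import numpy as np

def trace_step(theta, w, z_theta, z_w, delta, I, x_v_s, grad_ln_pi,
               gamma=0.99, lambda_theta=0.9, lambda_w=0.9,
               alpha_theta=1e-3, alpha_w=1e-2):
    # accumulate the eligibility traces, then update both sets of weights with delta
    z_w = gamma * lambda_w * z_w + I * x_v_s                    # critic trace
    z_theta = gamma * lambda_theta * z_theta + I * grad_ln_pi   # actor trace
    w = w + alpha_w * delta * z_w
    theta = theta + alpha_theta * delta * z_theta
    return theta, w, z_theta, z_w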

Policy Gradient for Continuing Problems

For continuing problems, the performance measure is defined in terms of the average reward per time step:

\[\begin{array}\\ J(\theta) \dot= r(\pi) &\dot =& \lim_{h\rightarrow\infty}\frac{1}{h}\sum_{t=1}^h E[R_t| A_{0:t-1}\sim \pi]\\ &=&\lim_{t\rightarrow \infty} E[R_t| A_{0:t-1} \sim \pi]\\ &=& \sum_{s}\mu(s)\sum_a \pi(a|s) \sum_{s',r}p(s',r|s,a)r \end{array} \]

where \(\mu\) is the steady-state distribution under \(\pi\): \(\mu(s)\dot = \lim_{t\rightarrow\infty}P\{S_t = s| A_{0:t}\sim \pi\}\), which is assumed to exist and to be independent of \(S_0\).

This is a special distribution: if states are distributed according to \(\mu\) and actions are then selected according to \(\pi\), the resulting distribution over next states is again \(\mu\):

\[\sum_s\mu(s)\sum_{a}\pi(a|s,\theta)p(s'|s,a) = \mu(s'), \qquad s' \in S \]
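
The following sketch illustrates this property numerically for a hypothetical three-state continuing chain: the matrix P below plays the role of \(\sum_a \pi(a|s,\theta)p(s'|s,a)\), and \(\mu\) is recovered as its left eigenvector with eigenvalue 1.

# Python sketch: steady-state distribution of a made-up 3-state chain
import numpy as np

# P[s, s'] = sum_a pi(a|s, theta) p(s'|s, a) for an assumed policy and MDP
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2],
              [0.5, 0.2, 0.3]])

evals, evecs = np.linalg.eig(P.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()                 # normalize to a probability distribution

print(mu)                          # steady-state distribution mu
print(mu @ P)                      # equals mu, as the equation above states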

# Actor-Critic with Eligibility Traces(continuing), for estimating pi_theta = pi*
Input: a differentiable policy parameterization pi(a|s,theta)
Input: a differentiable state-value function parameterization v(s,w)
Parameters: trace-decay rates lambda_theta in [0,1], lambda_w in [0,1], step sizes alpha_theta > 0, alpha_w > 0, alpha_R_bar > 0
Initialize R_bar
Initialize policy parameter theta and state-value weights w
Initialize S
z_w = 0 (eligibility trace vector)
z_theta = 0 (eligibility trace vector)

Loop forever(for each time step):
    Select A from pi(.|S,theta)
    take action A and observe S', R
    delta = R - R_bar + v(S',w) - v(S,w)
    R_bar = R_bar + alpha_{R_bar} delta
    z_w = lambda_w z_w + grad(v(S,w))
    z_theta = lambda_theta z_theta + grad(ln(pi(A|S,theta)))
    w = w + alpha_w delta z_w
    theta = theta + alpha_theta delta z_theta
    S = S'

Policy Parameterization for Continuous Actions

When the action space is continuous, policy-based methods instead learn the statistics of a probability distribution over actions. If each action is a real-valued scalar, the policy can be defined as a normal probability density over the actions:

\[\pi(a|s,\pmb \theta) \dot = \frac{1}{\sigma(s,\pmb\theta)\sqrt{2\pi}}\exp \bigg(-\frac{(a - \mu(s,\pmb\theta))^2}{2\sigma(s,\pmb\theta)^2}\bigg) \]
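
For a linear mean and a log-linear standard deviation (a common choice; the feature vectors x_mu and x_sigma below are assumptions), sampling an action and computing \(\nabla_{\pmb\theta}\ln\pi(a|s,\pmb\theta)\) can be sketched as follows:

# Python sketch: Gaussian policy for a 1-D continuous action
import numpy as np

def gaussian_policy_step(theta_mu, theta_sigma, x_mu, x_sigma):
    # mu(s, theta) = theta_mu . x_mu(s); sigma(s, theta) = exp(theta_sigma . x_sigma(s))
    mu = theta_mu @ x_mu
    sigma = np.exp(theta_sigma @ x_sigma)       # exp keeps sigma positive
    a = np.random.normal(mu, sigma)             # sample a real-valued action
    # score function (gradient of ln pi(a|s)) for the two parameter blocks
    grad_mu = (a - mu) / sigma ** 2 * x_mu
    grad_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * x_sigma
    return a, grad_mu, grad_sigma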

