Deep Reinforcement Learning: TRPO


TRPO

1. Algorithm Derivation

Since we want each policy update to produce a new policy \(\tilde\pi\) that is better than the current policy \(\pi\), we would like to write \(\eta(\tilde\pi)\) in the form \(\eta(\pi)+\cdots\); then we only need to ensure \((\cdots)\geq 0\) to guarantee that \(\eta(\tilde\pi)\) increases monotonically.

Using the advantage \(A_\pi(s_t,a_t)\), we can rewrite \(\eta(\tilde\pi)\) as \(\eta(\tilde\pi)=\eta(\pi)+E_{s_0,a_0,\cdots\sim\tilde\pi}[\sum\limits_{t=0}^\infty \gamma^tA_\pi(s_t,a_t)]\) (1)

Writing the expectation as explicit sums: \(\eta(\tilde\pi)=\eta(\pi)+\sum\limits_{t=0}^\infty\sum\limits_sP(s_t=s|\tilde\pi)\sum\limits_a\tilde\pi(a|s)\gamma^tA_\pi(s,a)\) (2)

Reordering the sums: \(\eta(\tilde\pi)=\eta(\pi)+\sum\limits_s\sum\limits_{t=0}^\infty\gamma^tP(s_t=s|\tilde\pi)\sum\limits_a\tilde\pi(a|s)A_\pi(s,a)\) (3)

Define the discounted state visitation frequencies \(\rho_{\tilde\pi}(s)=P(s_0=s|\tilde\pi)+\gamma P(s_1=s|\tilde\pi)+\gamma^2P(s_2=s|\tilde\pi)+\cdots\)

Then (3) becomes: \(\eta(\tilde\pi)=\eta(\pi)+\sum\limits_s\rho_{\tilde\pi}(s)\sum\limits_a\tilde\pi(a|s)A_\pi(s,a)\) (4)

Next comes the first approximation:

Since \(\rho_{\tilde\pi}(s)\) is defined by samples from the new policy, and the visitation probabilities of a policy we have not yet deployed are extremely difficult to obtain, we replace \(\rho_{\tilde\pi}(s)\) with \(\rho_\pi(s)\), giving \(L(\tilde\pi)=\eta(\pi)+\sum\limits_s\rho_\pi(s)\sum\limits_a\tilde\pi(a|s)A_\pi(s,a)\) (5)

It can be shown that \(L(\pi_{old})=\eta(\pi_{old})\) and \(\triangledown L(\pi_\theta)|_{\theta=\theta_{old}}=\triangledown \eta(\pi_\theta)|_{\theta=\theta_{old}}\) (6)

Therefore, for the functions \(L(\pi_\theta)\) and \(\eta(\pi_\theta)\) around \(\theta_{old}\): a sufficiently small update step from \(\theta_{old}\) that improves \(L_{\theta_{old}}(\pi_\theta)\) also improves \(\eta(\pi_\theta)\).

What we still need is a way to measure the distance between \(\theta_{old}\) and \(\theta_{new}\), and a way to judge how much improvement in \(\eta(\pi_\theta)\) a policy update brings.

From the proof in the 2002 conservative policy iteration paper (Kakade and Langford), when the policy is updated as the mixture \(\tilde\pi=(1-\alpha)\pi+\alpha\pi^{'}\), we have:

\(\eta(\tilde\pi)\geq L(\tilde\pi)-\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2\), where \(\epsilon=\max\limits_s|E_{a\sim\pi^{'}}[A_\pi(s,a)]|\) (7)

Here \(\alpha\) is the mixing weight in the update above, but it can also be interpreted as a measure of divergence between the new and old policies.

This yields the bound \(\eta(\tilde\pi)\geq L(\tilde\pi)-\frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2\) (8), where now \(\alpha=D_{TV}^{max}(\pi_{old},\pi_{new})\).

[\(D_{TV}\) is the total variation divergence, and max denotes the maximum over states. They are defined as \(D_{TV}(q,p)=\frac{1}{2}\sum\limits_i|p_i-q_i|\) and \(D_{TV}^{max}(p,q)=\max\limits_sD_{TV}(p(\cdot|s),q(\cdot|s))\).]

Moreover \(D_{TV}(q\|p)^2\leq D_{KL}(q\|p)\), so we obtain \(\eta(\tilde\pi)\geq L(\tilde\pi)-C\cdot D_{KL}^{max}(\pi,\tilde\pi)\), where \(C=\frac{4\epsilon\gamma}{(1-\gamma)^2}\) (9)
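As a quick numerical illustration of the relation \(D_{TV}(q\|p)^2\leq D_{KL}(q\|p)\), the following minimal numpy sketch compares the two quantities for a pair of made-up discrete distributions (the specific numbers are arbitrary and only for illustration):

```python
import numpy as np

# Two made-up categorical distributions over four outcomes (illustrative only).
p = np.array([0.1, 0.4, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.25, 0.25])

d_tv = 0.5 * np.sum(np.abs(p - q))   # total variation divergence
d_kl = np.sum(q * np.log(q / p))     # D_KL(q || p) in nats

print(d_tv ** 2, d_kl)               # D_TV^2 should not exceed D_KL
assert d_tv ** 2 <= d_kl
```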

From inequality (9) we can conclude that policy updates driven by this bound are monotonically improving. (10)

The update procedure for \(\theta\) is then: find the \(\theta\) that maximizes the right-hand side of the bound and set \(\theta_{old}=\theta\), i.e. \(\max\limits_\theta[L_{\theta_{old}}(\theta)-C\cdot D_{KL}^{max}(\theta_{old},\theta)]\) (11)

However, with the theoretically derived coefficient \(C=\frac{4\epsilon\gamma}{(1-\gamma)^2}\), the step sizes would be very small and learning would be slow. Focusing on the \(D_{KL}^{max}\) term instead, we can turn the penalty into an explicit constraint, converting the problem into a constrained optimization that allows much larger update steps.

This gives the optimization problem: \(\max\limits_\theta L_{\theta_{old}}(\theta)\ \ \ \text{subject to}\ \ D_{KL}^{max}(\theta_{old},\theta)\leq\delta\) (12)

However, the constraint \(D_{KL}^{max}\leq\delta\) applies to every state, and checking every single state is intractable.

Hence the second approximation: replace the original \(D_{KL}^{max}\) constraint with the average KL divergence.

Define \(\overline{D}_{KL}^{\rho}(\theta_1,\theta_2)=E_{s\sim\rho}[D_{KL}(\pi_{\theta_1}(\cdot|s)\|\pi_{\theta_2}(\cdot|s))]\) (13)

The problem then becomes \(\max\limits_\theta L_{\theta_{old}}(\theta)\ \ \ \text{subject to}\ \ \overline{D}_{KL}^{\rho}(\theta_{old},\theta)\leq\delta\) (14). Experiments show that \(\overline{D}_{KL}^\rho\) behaves similarly to \(D_{KL}^{max}\).
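As a minimal sketch (not the paper's code) of how the average KL constraint can be estimated from sampled states, assuming discrete action distributions stored as arrays `pi_old_probs` and `pi_new_probs` of shape (N, num_actions) for N states drawn from \(\rho\):

```python
import numpy as np

def average_kl(pi_old_probs, pi_new_probs, eps=1e-8):
    """Monte-Carlo estimate of E_{s~rho}[ D_KL(pi_old(.|s) || pi_new(.|s)) ].

    pi_old_probs, pi_new_probs: arrays of shape (N, num_actions), one row of
    action probabilities per sampled state.
    """
    ratio = (pi_old_probs + eps) / (pi_new_probs + eps)           # avoid division by zero
    kl_per_state = np.sum(pi_old_probs * np.log(ratio), axis=1)   # KL for each sampled state
    return kl_per_state.mean()                                    # average over sampled states
```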

To make the method usable in practice, a third approximation is performed:

Expand the objective: \(L_{\theta_{old}}(\theta)=\sum\limits_s\rho_{\theta_{old}}(s)\sum\limits_a\pi_\theta(a|s)A_{\theta_{old}}(s,a)\)

First, replace \(\sum\limits_s\rho_{\theta_{old}}(s)(\cdots)\) with an expectation over samples: \(\frac{1}{1-\gamma}E_{s\sim\rho_{\theta_{old}}}[\cdots]\) (15)

Next, estimate \(\sum\limits_a\pi_\theta(a|s)A_{\theta_{old}}(s,a)\) with importance sampling, giving \(E_{a\sim q}[\frac{\pi_\theta(a|s_n)}{q(a|s_n)}A_{\theta_{old}}(s_n,a)]\) (16)

Finally, replace \(A_{\theta_{old}}(s,a)\) with \(Q_{\theta_{old}}(s,a)\) (17); this replacement changes the objective only by a constant.

The final form of the objective: \(\max\limits_\theta E_{s\sim\rho_{\theta_{old}},a\sim q}[\frac{\pi_\theta(a|s)}{q(a|s)}Q_{\theta_{old}}(s,a)]\ \ \ \text{subject to}\ \ E_{s\sim\rho_{\theta_{old}}}[D_{KL}(\pi_{\theta_{old}}(\cdot|s)\|\pi_\theta(\cdot|s))]\leq\delta\)

In practice: (1) the expectations are replaced by sample means; (2) the Q values are replaced by empirical estimates.
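Putting both substitutions together, here is a minimal numpy sketch of the sampled surrogate objective; the array names `pi_new`, `q_behavior`, and `adv` are illustrative assumptions, holding per-sample probabilities under the new policy, probabilities under the behavior policy \(q\), and empirical advantage (or Q-value) estimates respectively:

```python
import numpy as np

def surrogate_objective(pi_new, q_behavior, adv):
    """Sample-mean estimate of E[ pi_theta(a|s)/q(a|s) * A_hat(s,a) ].

    pi_new, q_behavior, adv: 1-D arrays of length N, one entry per sampled (s, a) pair.
    """
    ratio = pi_new / q_behavior   # importance sampling weights
    return np.mean(ratio * adv)   # Monte-Carlo estimate of the surrogate objective
```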

2. Solving the Optimization Problem

For the problem \(\max\limits_\theta\ L_{\theta_{old}}(\theta)\ \ \ \text{subject to}\ \ \overline{D}_{KL}^{\rho}(\theta_{old},\theta)\leq\delta\):

After estimating the objective and the constraint with Monte Carlo samples, we consider how to solve this constrained optimization problem.

Expand \(l(\theta)\) and \(kl(\theta)\), the sampled objective and the sampled KL constraint, in a Taylor series around \(\theta_{old}\):

\(l(\theta)\approx l(\theta_{old})+\triangledown l(\theta_{old})^T(\theta-\theta_{old})+\frac{1}{2}(\theta-\theta_{old})^TH(l)(\theta_{old})(\theta-\theta_{old})\approx g^T(\theta-\theta_{old})\) (18), where \(g=\triangledown l(\theta_{old})\) (the first term is 0 and the third term is negligibly small)

\(kl(\theta)\approx kl(\theta_{old})+\triangledown kl(\theta_{old})^T(\theta-\theta_{old})+\frac{1}{2}(\theta-\theta_{old})^TH(kl)(\theta_{old})(\theta-\theta_{old})\approx \frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\) (19), where \(F=H(kl)(\theta_{old})\) is the Hessian of the KL at \(\theta_{old}\), i.e. the Fisher information matrix (the first and second terms are both 0)

The optimization problem is then approximated by: \(\max\limits_\theta g^T(\theta-\theta_{old})\ \ \ \text{subject to}\ \ \frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\leq\delta\) (20)

Construct the Lagrangian: \(L(\theta,\lambda)=g^T(\theta-\theta_{old})-\frac{\lambda}{2}[(\theta-\theta_{old})^TF(\theta-\theta_{old})-\delta]\) (21)

Because the constraint is an inequality, the KKT conditions must also be satisfied:

\[\left\{ \begin{array}{l} \frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\leq\delta, \\ \lambda\geq 0, \\ \frac{\lambda}{2}[(\theta-\theta_{old})^TF(\theta-\theta_{old})-\delta]=0 \end{array} \right. \]

Solving these jointly (the constraint is active at the optimum, so \(\lambda>0\)) gives \(\frac{1}{2}s^TFs=\delta\), where \(s=\theta-\theta_{old}\).
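Spelling this out: setting the gradient of the Lagrangian with respect to \(\theta\) to zero gives the direction of the step, and the active constraint fixes its length,

\[\triangledown_\theta L(\theta,\lambda)=g-\lambda F(\theta-\theta_{old})=0\ \Rightarrow\ s=\frac{1}{\lambda}F^{-1}g,\qquad \frac{1}{2}s^TFs=\delta\ \Rightarrow\ s=\sqrt{\frac{2\delta}{g^TF^{-1}g}}F^{-1}g \]

which is exactly the search direction and rescaling used in Section 4 below.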

3. Hessian-Free

Compared with Newton's method, the conjugate gradient method cannot solve the system in a single step and needs multiple iterations, but it does not require inverting the Hessian matrix; it would, however, still need the Hessian to be computed and stored, which the following trick avoids.

We have the relation \((Hv)^{(i)}=\sum\limits_{j=1}^N\frac{\partial^2f}{\partial x_i\partial x_j}(x)v_j=[\triangledown\frac{\partial f(x)}{\partial x_i}]\cdot v\), which is exactly the directional derivative of the function \(g_i=\frac{\partial f}{\partial x_i}\) in the direction \(v\).

Hence \(\triangledown_vg=\lim\limits_{\epsilon\rightarrow0}\frac{g(x+\epsilon v)-g(x)}{\epsilon}\approx\frac{g(x+\epsilon v)-g(x)}{\epsilon}\), which gives \(Hv\approx\frac{g(x+\epsilon v)-g(x)}{\epsilon}\) with \(g=\triangledown f\), so the Hessian never has to be formed explicitly.
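A minimal sketch of this finite-difference Hessian-vector product, assuming a gradient function `grad_f(x)` is available (the function name and the fixed step `eps` are illustrative assumptions):

```python
def hessian_vector_product(grad_f, x, v, eps=1e-5):
    """Approximate H(x) @ v with only two gradient evaluations:
    Hv ~= (grad_f(x + eps*v) - grad_f(x)) / eps,
    so the Hessian itself is never formed or stored.
    x and v are parameter vectors (e.g. numpy arrays) of the same shape.
    """
    return (grad_f(x + eps * v) - grad_f(x)) / eps
```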

4. Update Procedure

The Lagrangian we now need to solve is

\[L(\theta,\lambda)=g^T(\theta-\theta_{old})-\frac{\lambda}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\ \ \ \text{subject to}\ \ \frac{1}{2}s^TFs=\delta \]

Using the conjugate gradient method (which only needs the Hessian-vector products above), we can compute the direction from the current point toward the optimum, \(s_u=\frac{1}{\lambda}F^{-1}g\).

To satisfy the constraint, rescale \(s_u\): \(s=\sqrt{\frac{2\delta}{s_u^TFs_u}}s_u\)

Use this rescaled vector \(s\) in a backtracking line search: add \(s,\frac{s}{2},\frac{s}{4},\cdots\) in turn to the current iterate \(\theta_{old}\) until the optimization objective improves.
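A minimal end-to-end sketch of this update (a simplification, not the authors' reference implementation), assuming callables `fvp(v)` for the Fisher-vector product \(Fv\) (e.g. the finite-difference trick above applied to the average KL), `surrogate(theta)` for the sampled objective, and `kl(theta)` for the sampled average KL; the names, iteration counts, and the extra KL acceptance check in the line search are assumptions based on common practice:

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) ~ F @ v."""
    x = np.zeros_like(g)
    r = g.copy()                       # residual g - F x (x starts at 0)
    p = g.copy()                       # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def trpo_step(theta_old, g, fvp, surrogate, kl, delta=0.01, backtracks=10):
    """Compute s_u ~ F^{-1} g, rescale it to the trust region, then backtrack."""
    s_u = conjugate_gradient(fvp, g)
    step = np.sqrt(2.0 * delta / (s_u @ fvp(s_u))) * s_u   # s = sqrt(2*delta / s_u^T F s_u) * s_u
    old_obj = surrogate(theta_old)
    for k in range(backtracks):
        theta_new = theta_old + step / (2 ** k)            # try s, s/2, s/4, ...
        if surrogate(theta_new) > old_obj and kl(theta_new) <= delta:
            return theta_new
    return theta_old                                       # reject the update if no step helps
```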

5. Proofs

Proof of Equation (1)

\[\begin{align*} E_{\tau\sim\tilde\pi}[\sum\limits_{t=0}^\infty\gamma^tA_\pi(s_t,a_t)] &= E_{\tau\sim\tilde\pi}[\sum\limits_{t=0}^\infty\gamma^t(r_t+\gamma V_\pi(s_{t+1})-V_\pi(s_t))] \\ &= E_{\tau\sim\tilde\pi}[-V_\pi(s_0)+\sum\limits_{t=0}^\infty\gamma^tr_t] \\ &= -E_{\tau\sim\tilde\pi}[V_\pi(s_0)] + E_{\tau\sim\tilde\pi}[\sum\limits_{t=0}^\infty\gamma^tr_t] \\ &=-\eta(\pi)+\eta(\tilde\pi) \end{align*} \]

Here the first equality expands \(A_\pi\); the second writes out the sum term by term and cancels the telescoping terms; the last equality uses the fact that the distribution of the initial state \(s_0\) is the same under \(\pi\) and \(\tilde\pi\), so \(E[V_\pi(s_0)]=\eta(\pi)\).
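Explicitly, the telescoping in the second equality is

\[\sum\limits_{t=0}^\infty\gamma^t(\gamma V_\pi(s_{t+1})-V_\pi(s_t))=\sum\limits_{t=1}^\infty\gamma^{t}V_\pi(s_{t})-\sum\limits_{t=0}^\infty\gamma^t V_\pi(s_t)=-V_\pi(s_0) \]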

Proof of Equation (6)

\[L(\pi_{old})=\eta(\pi_{old}),\triangledown L(\pi_\theta)|_{\theta=\theta_{old}}=\triangledown\eta(\pi_\theta)|_{\theta=\theta_{old}} \]

Clearly \(L(\pi_{old})=\eta(\pi_{old})\).

For the gradients, \(\triangledown L(\pi_\theta)|_{\theta=\theta_{old}}=\sum\limits_s\rho_{\theta_{old}}(s)\sum\limits_a\triangledown_\theta\pi_\theta(a|s)A_\pi(s,a)|_{\theta=\theta_{old}}\)

\(\triangledown \eta(\pi_\theta)|_{\theta=\theta_{old}}=\sum\limits_s\rho_{\theta}(s)\sum\limits_a\triangledown_\theta\pi(a|s)A_\pi(s,a)|_{\theta=\theta_{old}}\)

In practice, \(\sum\limits_s\rho_\pi(s)\) is obtained from sampled data; when sampling with \(\theta=\theta_{old}\), i.e. under \(\pi_{\theta_{old}}\), we have \(\rho_\theta(s)=\rho_{\theta_{old}}(s)\), so the two gradients are equal.

Proof of Equations (7) and (8)

\(\pi_{new}(a|s)=(1-\alpha)\pi_{old}+\alpha\pi^{'}(a|s)\),會有\(\eta(\pi_{new})\geq L_{\pi_{old}}-\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2;\ \ \ where\ \ \epsilon=\max\limits_s|E_{a\sim\pi^{'}}[A_\pi(a,s)]|\)

Define \(\overline{A}(s)=E_{a\sim\tilde\pi(\cdot|s)}[A_\pi(s,a)]\); \(\overline A(s)\) measures, at state \(s\), the improvement of \(\tilde\pi\) over the previous policy.

\(\eta(\tilde\pi)=\eta(\pi)+E_{\tau\sim\tilde\pi}[\sum\limits_{t=0}^\infty\gamma^t\overline{A}(s_t)],\ \ L(\tilde\pi)=\eta(\pi)+E_{\tau\sim\pi}[\sum\limits_{t=0}^\infty\gamma^t\overline A(s_t)]\)

We can view the two policies as a coupled policy pair \((\pi,\tilde\pi)\) that produces an action pair \((a,\tilde a)\) at each state, with \(P(a\neq\tilde a|s)\leq\alpha\).

\(\overline{A}(s)=E_{\tilde a\sim\tilde\pi}[A_\pi(s,\tilde a)]=E_{(a,\tilde a)\sim(\pi,\tilde\pi)}[A_\pi(s,\tilde a)-A_\pi(s,a)]=P(a\neq\tilde a|s)E_{(a,\tilde a)\sim(\pi,\tilde\pi)|a\neq\tilde a}[A_\pi(s,\tilde a)-A_\pi(s,a)]\)

because \(E_{a\sim\pi}[A_\pi(s,a)]=0\) and \(P(a=\tilde a|s)E_{(a,\tilde a)\sim(\pi,\tilde\pi)|a=\tilde a}[A_\pi(s,\tilde a)-A_\pi(s,a)]=0\) (on the event \(a=\tilde a\) the difference vanishes).

Therefore \(|\overline{A}(s)|\leq 2\alpha\max\limits_{s,a}|A_\pi(s,a)|\).
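Combining the two facts above makes the bound explicit:

\[|\overline{A}(s)|=P(a\neq\tilde a|s)\,|E_{(a,\tilde a)\sim(\pi,\tilde\pi)|a\neq\tilde a}[A_\pi(s,\tilde a)-A_\pi(s,a)]|\leq\alpha\cdot2\max\limits_{s,a}|A_\pi(s,a)| \]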

We can further obtain \(|E_{s_t\sim\tilde\pi}[\overline A(s_t)]-E_{s_t\sim\pi}[\overline A(s_t)]|\leq2(1-(1-\alpha)^t)\max\limits_s|\overline{A}(s)|\leq4\alpha(1-(1-\alpha)^t)\max\limits_{s,a}|A_\pi(s,a)|\).

The proof is as follows. Let \(n_t\) be the number of time steps before time \(t\) at which \(\pi\) and \(\tilde\pi\) produced different actions.

\(E_{s_t\sim\tilde\pi}[\overline{A}(s_t)]=P(n_t>0)E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]+P(n_t=0)E_{s_t\sim\tilde\pi|n_t=0}[\overline{A}(s_t)]\)

\(E_{s_t\sim\pi}[\overline{A}(s_t)]=P(n_t>0)E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]+P(n_t=0)E_{s_t\sim\pi|n_t=0}[\overline{A}(s_t)]\)

\(E_{s_t\sim\tilde\pi|n_t=0}[\overline{A}(s_t)]=E_{s_t\sim\pi|n_t=0}[\overline{A}(s_t)]\), because before time \(t\) the two policies produced identical actions and therefore reached the same states.

Therefore \(E_{s_t\sim\tilde\pi}[\overline A(s_t)]-E_{s_t\sim\pi}[\overline{A}(s_t)]=P(n_t>0)(E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]-E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)])\).

Since \(P(n_t>0)\leq1-(1-\alpha)^t\) and, by the triangle inequality,

\[|E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]-E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]|\leq|E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]|+|E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]| \]

Each term on the right-hand side is bounded by \(\max\limits_s|\overline A(s)|\leq2\alpha\max\limits_{s,a}|A_\pi(s,a)|\), so \(|E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]-E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]|\leq4\alpha\max\limits_{s,a}|A_\pi(s,a)|\); multiplying by \(P(n_t>0)\leq1-(1-\alpha)^t\) then gives \(|E_{s_t\sim\tilde\pi}[\overline{A}(s_t)]-E_{s_t\sim\pi}[\overline{A}(s_t)]|\leq4\alpha(1-(1-\alpha)^t)\max\limits_{s,a}|A_\pi(s,a)|\), as claimed.

So finally we obtain (with \(\epsilon=\max\limits_{s,a}|A_\pi(s,a)|\)):

\[\begin{align*} |\eta(\tilde\pi)-L_\pi(\tilde\pi)| &\leq\sum\limits_{t=0}^\infty\gamma^t|E_{\tau\sim\tilde\pi}[\overline A(s_t)]-E_{\tau\sim\pi}[\overline{A}(s_t)]| \\ &\leq\sum\limits_{t=0}^\infty\gamma^t\cdot4\epsilon\alpha(1-(1-\alpha)^t) \\ &=4\epsilon\alpha(\frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)}) \\ &=\frac{4\alpha^2\gamma\epsilon}{(1-\gamma)(1-\gamma(1-\alpha))} \\ &\leq\frac{4\gamma\epsilon\alpha^2}{(1-\gamma)^2} \end{align*} \]
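The step from the second to the third line uses two geometric series:

\[\sum\limits_{t=0}^\infty\gamma^t(1-(1-\alpha)^t)=\sum\limits_{t=0}^\infty\gamma^t-\sum\limits_{t=0}^\infty(\gamma(1-\alpha))^t=\frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)}=\frac{\alpha\gamma}{(1-\gamma)(1-\gamma(1-\alpha))} \]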

Proof of Equation (10)

Define \(M_i(\pi)=L_{\pi_i}(\pi)-C\cdot D_{KL}^{max}(\pi_i,\pi)\).

By improving the lower bound \(M_i(\pi)\) of \(\eta(\pi)\), we improve \(\eta(\pi)\):

\(\eta(\pi_{i+1})\geq M_i(\pi_{i+1})\)

\(\eta(\pi_i)=M_i(\pi_i)=L_{\pi_i}(\pi_i)\)

Therefore \(\eta(\pi_{i+1})-\eta(\pi_i)\geq M_i(\pi_{i+1})-M_i(\pi_i)\).

Hence improving \(M_i(\pi)\) is enough to improve \(\eta(\pi)\), so the updates are monotonically improving.

Proof of Equation (17)

\[\begin{align*} \sum\limits_a\pi(a|s)A_\pi(s,a) &= \sum\limits_a\pi(a|s)[Q_\pi(s,a)-V_\pi(s)] \\ &= \sum\limits_a(\pi(a|s)Q_\pi(s,a))-V_\pi(s)\sum\limits_a\pi(a|s) \\ &= \sum\limits_a(\pi(a|s)Q_\pi(s,a))-V_\pi(s) \end{align*} \]

Since the subtracted \(V_\pi(s)\) term is evaluated under the old policy and does not depend on the parameters being optimized, replacing \(A\) with \(Q\) changes the objective only by a constant, which is claim (17).

