Deep Reinforcement Learning: TRPO


TRPO

1. Algorithm Derivation

After each policy update we want the new policy \(\tilde\pi\) to be better than the current policy \(\pi\). We therefore want to write \(\eta(\tilde\pi)\) in the form \(\eta(\pi)+\cdots\); then we only need to ensure \((\cdots)\geq 0\) to guarantee that \(\eta(\tilde\pi)\) increases monotonically.

Using the advantage \(A_\pi(s_t,a_t)\), \(\eta(\tilde\pi)\) can be rewritten as \(\eta(\tilde\pi)=\eta(\pi)+E_{s_0,a_0,\cdots\sim\tilde\pi}[\sum\limits_{t=0}^\infty \gamma^tA_\pi(s_t,a_t)]\)  (1)

Writing the expectation as explicit sums: \(\eta(\tilde\pi)=\eta(\pi)+\sum\limits_{t=0}^\infty\sum\limits_sP(s_t=s|\tilde\pi)\sum\limits_a\tilde\pi(a|s)\gamma^tA_\pi(s,a)\)  (2)

Swapping the order of summation: \(\eta(\tilde\pi)=\eta(\pi)+\sum\limits_s\sum\limits_{t=0}^\infty\gamma^tP(s_t=s|\tilde\pi)\sum\limits_a\tilde\pi(a|s)A_\pi(s,a)\)  (3)

Define the discounted state visitation frequency \(\rho_{\tilde\pi}(s)=P(s_0=s|\tilde\pi)+\gamma P(s_1=s|\tilde\pi)+\gamma^2P(s_2=s|\tilde\pi)+\cdots\)

Then (3) becomes \(\eta(\tilde\pi)=\eta(\pi)+\sum\limits_s\rho_{\tilde\pi}(s)\sum\limits_a\tilde\pi(a|s)A_\pi(s,a)\)  (4)

Next comes the first approximation.

Since \(\rho_{\tilde\pi}(s)\) is the state distribution induced by the new policy, and obtaining the visitation distribution of a policy we have not yet run is very difficult, we replace \(\rho_{\tilde\pi}(s)\) with \(\rho_\pi(s)\) and obtain \(L_\pi(\tilde\pi)=\eta(\pi)+\sum\limits_s\rho_\pi(s)\sum\limits_a\tilde\pi(a|s)A_\pi(s,a)\)  (5)

It can be shown that \(L_{\pi_{old}}(\pi_{old})=\eta(\pi_{old})\) and \(\nabla_\theta L_{\pi_{old}}(\pi_\theta)|_{\theta=\theta_{old}}=\nabla_\theta \eta(\pi_\theta)|_{\theta=\theta_{old}}\)  (6)

Thus \(L\) and \(\eta\) agree to first order at \(\theta_{old}\): for a sufficiently small step away from \(\theta_{old}\), an improvement of \(L_{\theta_{old}}(\pi_\theta)\) is also an improvement of \(\eta(\pi_\theta)\).

What we still need is a way to measure the distance between \(\theta_{old}\) and \(\theta_{new}\), and to judge how large an improvement in \(\eta(\pi_\theta)\) an update will bring.

From the proof in Kakade and Langford's 2002 paper, when the policy is updated as the mixture \(\tilde\pi=(1-\alpha)\pi+\alpha\pi^{'}\), we have:

\(\eta(\tilde\pi)\geq L(\tilde\pi)-\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2\), where \(\epsilon=\max\limits_s|E_{a\sim\pi^{'}}[A_\pi(s,a)]|\)  (7)

Here \(\alpha\) is the mixture weight used in the update above, but it can also be viewed as a measure of divergence between the new policy and the old one.

This leads to the bound \(\eta(\tilde\pi)\geq L(\tilde\pi)-\frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2\)  (8), where now \(\alpha=D_{TV}^{max}(\pi_{old},\pi_{new})\) and \(\epsilon=\max\limits_{s,a}|A_\pi(s,a)|\).

[\(D_{TV}\) is the total variation divergence, and the max is taken over states: \(D_{TV}(q,p)=\frac{1}{2}\sum\limits_i|p_i-q_i|\), \(D_{TV}^{max}(p,q)=\max\limits_sD_{TV}(p(\cdot|s),q(\cdot|s))\)]

Since \(D_{TV}(q\|p)^2\leq D_{KL}(q\|p)\) (a consequence of Pinsker's inequality), we obtain \(\eta(\tilde\pi)\geq L(\tilde\pi)-C\cdot D_{KL}^{max}(\pi,\tilde\pi)\), where \(C=\frac{4\epsilon\gamma}{(1-\gamma)^2}\)  (9)

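As a quick sanity check of the inequality \(D_{TV}(q\|p)^2\leq D_{KL}(q\|p)\), here is a small numerical example with two Bernoulli distributions (the numbers are purely illustrative, not taken from the paper):

\[\begin{align*} p=(0.5,\ 0.5),\quad q=(0.8,\ 0.2):\qquad D_{TV}(q,p)&=\tfrac{1}{2}(|0.8-0.5|+|0.2-0.5|)=0.3 \\ D_{KL}(q\|p)&=0.8\ln\tfrac{0.8}{0.5}+0.2\ln\tfrac{0.2}{0.5}\approx0.376-0.183=0.193 \\ D_{TV}(q,p)^2&=0.09\leq0.193=D_{KL}(q\|p) \end{align*} \]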
From the inequality above we can show that updates of this form improve the policy monotonically.  (10)

The update procedure for \(\theta\) is then: find the \(\theta\) that maximizes the right-hand side of the inequality and set \(\theta_{old}=\theta\), i.e. \(\max\limits_\theta[L_{\theta_{old}}(\theta)-C\cdot D_{KL}^{max}(\theta_{old},\theta)]\)  (11)

However, with the theoretically derived coefficient \(C=\frac{4\epsilon\gamma}{(1-\gamma)^2}\) the resulting steps are very small and learning is slow. Focusing on the \(D_{KL}^{max}\) term instead, we can turn the penalty into a constraint; this converts the problem into a constrained optimization that permits much larger steps.

This gives the optimization problem \(\max\limits_\theta L_{\theta_{old}}(\theta)\ \ \ \text{subject to}\ \ D_{KL}^{max}(\theta_{old},\theta)\leq\delta\)  (12)

However, the constraint \(D_{KL}^{max}\leq\delta\) is imposed on every state, and checking every state individually is impractical.

Hence the second approximation: replace the \(D_{KL}^{max}\) constraint with the average KL divergence.

Define \(\overline{D}_{KL}^{\rho}(\theta_1,\theta_2)=E_{s\sim\rho}[D_{KL}(\pi_{\theta_1}(\cdot|s)\|\pi_{\theta_2}(\cdot|s))]\)  (13)

The problem then becomes \(\max\limits_\theta L_{\theta_{old}}(\theta)\ \ \ \text{subject to}\ \ \overline{D}_{KL}^{\rho}(\theta_{old},\theta)\leq\delta\)  (14). Empirically, \(\overline{D}_{KL}^\rho\) behaves similarly to \(D_{KL}^{max}\).

To make this usable in practice, a third approximation is made.

Expand the objective: \(L_{\theta_{old}}(\theta)=\sum\limits_s\rho_{\theta_{old}}(s)\sum\limits_a\pi_\theta(a|s)A_{\theta_{old}}(s,a)\)

First, replace \(\sum\limits_s\rho_{\theta_{old}}(s)(\cdots)\) with an expectation over sampled states: \(\frac{1}{1-\gamma}E_{s\sim\rho_{\theta_{old}}}[\cdots]\)  (15)

Next, estimate \(\sum\limits_a\pi_\theta(a|s)A_{\theta_{old}}(s,a)\) by importance sampling under a behavior distribution \(q\): \(E_{a\sim q}[\frac{\pi_\theta(a|s_n)}{q(a|s_n)}A_{\theta_{old}}(s_n,a)]\)  (16)

Finally, replace \(A_{\theta_{old}}(s,a)\) with \(Q_{\theta_{old}}(s,a)\)  (17); this changes the objective only by a constant.

The final form of the optimization problem: \(\max\limits_\theta E_{s\sim\rho_{\theta_{old}},a\sim q}[\frac{\pi_\theta(a|s_n)}{q(a|s_n)}Q_{\theta_{old}}(s_n,a)]\ \ \ \text{subject to}\ \ E_{s\sim\rho_{\theta_{old}}}[D_{KL}(\pi_{\theta_{old}}(\cdot|s)\|\pi_\theta(\cdot|s))]\leq\delta\)

In practice: ① expectations are replaced by sample averages; ② the Q values are replaced by empirical estimates. A minimal sketch of these two steps is given below.

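To make the sample-based objective concrete, here is a minimal sketch (plain NumPy, not the code from the paper) of how the importance-sampled surrogate and the average KL are estimated for a discrete policy. The arrays `states`, `actions`, and `advantages` are assumed to come from rollouts of the old policy, and the behavior distribution \(q\) is taken to be \(\pi_{\theta_{old}}\) itself:

```python
import numpy as np

def surrogate_and_kl(pi_new, pi_old, states, actions, advantages):
    """Monte Carlo estimates of the surrogate objective and the average KL.

    pi_new, pi_old: arrays of shape (n_states, n_actions) holding the action
    probabilities of the new and old policies.
    states, actions: indices of visited state/action pairs (sampled with pi_old).
    advantages: advantage estimates A_{theta_old}(s, a) for those pairs.
    """
    # Importance-sampled surrogate: mean of pi_new(a|s) / pi_old(a|s) * A(s, a).
    ratio = pi_new[states, actions] / pi_old[states, actions]
    surrogate = np.mean(ratio * advantages)

    # Average KL over visited states: mean of KL(pi_old(.|s) || pi_new(.|s)).
    p, q = pi_old[states], pi_new[states]
    mean_kl = np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1))
    return surrogate, mean_kl

# Toy usage with random data (2 states, 3 actions).
rng = np.random.default_rng(0)
pi_old = rng.dirichlet(np.ones(3), size=2)
pi_new = rng.dirichlet(np.ones(3), size=2)
states = rng.integers(0, 2, size=100)
actions = np.array([rng.choice(3, p=pi_old[s]) for s in states])
advantages = rng.normal(size=100)
print(surrogate_and_kl(pi_new, pi_old, states, actions, advantages))
```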
2. Solving the Constrained Optimization

For \(\max\limits_\theta\ L_{\theta_{old}}(\theta)\ \ \ \text{subject to}\ \ \overline{D}_{KL}(\pi_{\theta_{old}}(\cdot|s)\|\pi_\theta(\cdot|s))\leq\delta\):

After estimating the objective and the constraint from Monte Carlo samples, we consider how to solve this constrained optimization problem.

Expand \(l(\theta)\) (the objective) and \(kl(\theta)\) (the constraint) as Taylor series around \(\theta_{old}\):

\(l(\theta)\approx l(\theta_{old})+\nabla l(\theta_{old})^T(\theta-\theta_{old})+\frac{1}{2}(\theta-\theta_{old})^TH_l(\theta_{old})(\theta-\theta_{old})\approx g^T(\theta-\theta_{old})\)  (18) (the first term is a constant that does not affect the maximization, and the third term is negligible)

\(kl(\theta)\approx kl(\theta_{old})+\nabla kl(\theta_{old})^T(\theta-\theta_{old})+\frac{1}{2}(\theta-\theta_{old})^TH_{kl}(\theta_{old})(\theta-\theta_{old})\approx \frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\)  (19) (the first and second terms are both zero; \(F=H_{kl}(\theta_{old})\) is the Fisher information matrix)

The optimization problem is therefore approximated by \(\max\limits_\theta g^T(\theta-\theta_{old})\ \ \ \text{subject to}\ \ \frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\leq\delta\)  (20)

Construct the Lagrangian: \(L(\theta,\lambda)=g^T(\theta-\theta_{old})-\lambda[\frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})-\delta]\)  (21)

Because the constraint is an inequality, the KKT conditions must also hold:

\[\left\{ \begin{array}{l} \frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\leq\delta \\ \lambda\geq 0 \\ \lambda[\frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})-\delta]=0 \end{array} \right. \]

Solving these jointly (the constraint is active at the optimum, so \(\lambda>0\)) gives \(\frac{1}{2}s^TFs=\delta\), where \(s=\theta-\theta_{old}\). A short derivation of the resulting step is given below.

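Carrying the KKT conditions one step further yields the closed-form step used in Section 4; the derivation is standard, but spelling it out keeps the later formulas self-explanatory:

\[\begin{align*} \nabla_\theta L(\theta,\lambda)=g-\lambda Fs=0 &\Rightarrow s=\frac{1}{\lambda}F^{-1}g \\ \frac{1}{2}s^TFs=\delta &\Rightarrow \frac{1}{2\lambda^2}g^TF^{-1}g=\delta \Rightarrow \lambda=\sqrt{\frac{g^TF^{-1}g}{2\delta}} \\ &\Rightarrow s=\sqrt{\frac{2\delta}{g^TF^{-1}g}}\,F^{-1}g \end{align*} \]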
3. Hessian-Free Optimization

Compared with Newton's method, the conjugate gradient method cannot solve the system in one step and needs multiple iterations, but it avoids inverting the Hessian; it would, however, still seem to require storing the Hessian and computing with it.

The following relation holds: \((Hv)^{(i)}=\sum\limits_{j=1}^N\frac{\partial^2f}{\partial x_i\partial x_j}(x)v_j=[\nabla\frac{\partial f(x)}{\partial x_i}]\cdot v\), which is exactly the directional derivative of \(g_i=\frac{\partial f}{\partial x_i}\) along the direction \(v\).

Hence \(\nabla_vg=\lim\limits_{\epsilon\rightarrow0}\frac{g(x+\epsilon v)-g(x)}{\epsilon}\approx\frac{g(x+\epsilon v)-g(x)}{\epsilon}\), giving \(Hv\approx\frac{g(x+\epsilon v)-g(x)}{\epsilon}\) with \(g=\nabla f\): a Hessian-vector product costs only two gradient evaluations, so the Hessian never needs to be formed. A sketch of conjugate gradient built on this trick follows.

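Below is a minimal sketch of this idea, assuming plain NumPy and a generic gradient function (real TRPO implementations usually use an analytic Fisher-vector product instead, but the finite-difference form matches the derivation above):

```python
import numpy as np

def hvp_fd(grad_fn, x, v, eps=1e-5):
    """Finite-difference Hessian-vector product: Hv ~ (g(x + eps*v) - g(x)) / eps."""
    return (grad_fn(x + eps * v) - grad_fn(x)) / eps

def conjugate_gradient(mvp, b, iters=10, tol=1e-10):
    """Solve F x = b given only the matrix-vector product mvp(v) = F v."""
    x = np.zeros_like(b)
    r = b.copy()                      # residual b - F x (x starts at 0)
    p = r.copy()                      # current search direction
    rs_old = r.dot(r)
    for _ in range(iters):
        Fp = mvp(p)
        alpha = rs_old / p.dot(Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r.dot(r)
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy check: f(x) = 0.5 x^T A x, so grad f(x) = A x and H = A.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
grad_fn = lambda x: A @ x
g = np.array([1.0, 2.0])
x0 = np.zeros(2)
direction = conjugate_gradient(lambda v: hvp_fd(grad_fn, x0, v), g)
print(direction, np.linalg.solve(A, g))  # the two results should (approximately) agree
```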
4. Update Procedure

The Lagrangian we now need to handle is

\[L(\theta,\lambda)=g^T(\theta-\theta_{old})-\frac{\lambda}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\ \ \ \text{with the active constraint}\ \ \frac{1}{2}s^TFs=\delta \]

Using the conjugate gradient method we can compute the (unscaled) direction from the current point toward the optimum: \(s_u=\frac{1}{\lambda}F^{-1}g\) (the factor \(\frac{1}{\lambda}\) can be dropped in practice, since the direction is rescaled in the next step anyway).

To satisfy the constraint, \(s_u\) is rescaled: \(s=\sqrt{\frac{2\delta}{s_u^TFs_u}}s_u\)

Using this rescaled vector \(s\), perform a backtracking line search: add \(s,\frac{s}{2},\frac{s}{4},\cdots\) in turn to the current iterate until the optimization objective improves. A compact sketch of the whole update follows.

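Putting the pieces together, here is a minimal, self-contained sketch of the update (plain NumPy with an explicit \(F\) and a hypothetical quadratic surrogate; a real implementation would obtain \(F^{-1}g\) via conjugate gradient with Fisher-vector products as in the previous sketch, and would evaluate the surrogate and KL with the actual policies):

```python
import numpy as np

def trpo_step(theta_old, surrogate, kl, g, F, delta=0.01, backtracks=10):
    """One simplified TRPO update.

    surrogate(theta): objective to maximize; kl(theta): KL(pi_old || pi_theta);
    g: gradient of the surrogate at theta_old; F: Fisher matrix at theta_old.
    """
    s_u = np.linalg.solve(F, g)                        # unscaled direction F^{-1} g
    s = np.sqrt(2.0 * delta / (s_u @ F @ s_u)) * s_u   # rescale onto the KL boundary
    old_obj = surrogate(theta_old)
    for k in range(backtracks):                        # try s, s/2, s/4, ...
        theta_new = theta_old + s / (2.0 ** k)
        if surrogate(theta_new) > old_obj and kl(theta_new) <= delta:
            return theta_new
    return theta_old                                   # no acceptable step found

# Toy usage with a hypothetical quadratic surrogate and quadratic KL model.
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([0.5, -0.2])
theta0 = np.zeros(2)
surrogate = lambda th: g @ (th - theta0) - 0.05 * np.sum((th - theta0) ** 2)
kl = lambda th: 0.5 * (th - theta0) @ F @ (th - theta0)
print(trpo_step(theta0, surrogate, kl, g, F))
```

The acceptance test also rechecks the KL constraint, which practical implementations do in addition to requiring an improvement in the surrogate.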
5. Proofs

Proof of Equation (1)

\[\begin{align*} E_{\tau\sim\tilde\pi}[\sum\limits_{t=0}^\infty\gamma^tA_\pi(s_t,a_t)] &= E_{\tau\sim\tilde\pi}[\sum\limits_{t=0}^\infty\gamma^t(r_t+\gamma V_\pi(s_{t+1})-V_\pi(s_t))] \\ &= E_{\tau\sim\tilde\pi}[-V_\pi(s_0)+\sum\limits_{t=0}^\infty\gamma^tr_t] \\ &= E_{\tau\sim\tilde\pi}[-V_\pi(s_0)] + E_{\tau\sim\tilde\pi}[\sum\limits_{t=0}^\infty\gamma^tr_t] \\ &=-\eta(\pi)+\eta(\tilde\pi) \end{align*} \]

The first equality expands \(A_\pi\); the second writes out the sum and telescopes the value-function terms; the last equality uses the fact that the distribution of the initial state \(s_0\) is the same under \(\pi\) and \(\tilde\pi\), so \(E_{\tau\sim\tilde\pi}[V_\pi(s_0)]=\eta(\pi)\).

Proof of Equation (6)

\[L_{\pi_{old}}(\pi_{old})=\eta(\pi_{old}),\qquad\nabla_\theta L_{\pi_{old}}(\pi_\theta)|_{\theta=\theta_{old}}=\nabla_\theta\eta(\pi_\theta)|_{\theta=\theta_{old}} \]

The first part, \(L_{\pi_{old}}(\pi_{old})=\eta(\pi_{old})\), is immediate, since \(\sum\limits_a\pi_{old}(a|s)A_{\pi_{old}}(s,a)=0\).

For the gradients, \(\nabla_\theta L_{\pi_{old}}(\pi_\theta)|_{\theta=\theta_{old}}=\sum\limits_s\rho_{\theta_{old}}(s)\sum\limits_a\nabla_\theta\pi_\theta(a|s)A_\pi(s,a)|_{\theta=\theta_{old}}\)

\(\nabla_\theta \eta(\pi_\theta)|_{\theta=\theta_{old}}=\sum\limits_s\rho_{\theta}(s)\sum\limits_a\nabla_\theta\pi_\theta(a|s)A_\pi(s,a)|_{\theta=\theta_{old}}\)

In practice \(\rho_\pi(s)\) is estimated from samples; when we evaluate at \(\theta=\theta_{old}\), i.e. sample with \(\pi_{\theta_{old}}\), we have \(\rho_{\theta}(s)=\rho_{\theta_{old}}(s)\), so the two gradients coincide.

Proof of Equations (7) and (8)

When \(\pi_{new}(a|s)=(1-\alpha)\pi_{old}(a|s)+\alpha\pi^{'}(a|s)\), we have \(\eta(\pi_{new})\geq L_{\pi_{old}}(\pi_{new})-\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2\), where \(\epsilon=\max\limits_s|E_{a\sim\pi^{'}}[A_\pi(s,a)]|\).

Define \(\overline{A}(s)=E_{a\sim\tilde\pi(\cdot|s)}[A_\pi(s,a)]\); \(\overline A(s)\) measures, in state \(s\), the improvement of \(\tilde\pi\) over the previous policy.

\(\eta(\tilde\pi)=\eta(\pi)+E_{\tau\sim\tilde\pi}[\sum\limits_{t=0}^\infty\gamma^t\overline{A}(s_t)],\ \ L(\tilde\pi)=\eta(\pi)+E_{\tau\sim\pi}[\sum\limits_{t=0}^\infty\gamma^t\overline A(s_t)]\)

We can couple the two policies into a pair \((\pi,\tilde\pi)\) that generates an action pair \((a,\tilde a)\) in each state, with \(P(a\neq\tilde a|s)\leq\alpha\).

\(\overline{A}(s)=E_{\tilde a\sim\tilde\pi}[A_\pi(s,\tilde a)]=E_{(a,\tilde a)\sim(\pi,\tilde\pi)}[A_\pi(s,\tilde a)-A_\pi(s,a)]=P(a\neq\tilde a|s)E_{(a,\tilde a)\sim(\pi,\tilde\pi)|a\neq\tilde a}[A_\pi(s,\tilde a)-A_\pi(s,a)]\)

because \(E_{a\sim\pi}[A_\pi(s,a)]=0\) and the terms with \(a=\tilde a\) contribute nothing: \(P(a=\tilde a|s)E_{(a,\tilde a)\sim(\pi,\tilde\pi)|a=\tilde a}[A_\pi(s,\tilde a)-A_\pi(s,a)]=0\).

Therefore \(|\overline{A}(s)|\leq 2\alpha\max\limits_{s,a}|A_\pi(s,a)|\).

We can further obtain \(|E_{s_t\sim\tilde\pi}[\overline A(s_t)]-E_{s_t\sim\pi}[\overline A(s_t)]|\leq4\alpha(1-(1-\alpha)^t)\max\limits_{s,a}|A_\pi(s,a)|\)

The proof is as follows. Let \(n_t\) be the number of times before time \(t\) at which \(\pi\) and \(\tilde\pi\) produce different actions.

\(E_{s_t\sim\tilde\pi}[\overline{A}(s_t)]=P(n_t>0)E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]+P(n_t=0)E_{s_t\sim\tilde\pi|n_t=0}[\overline{A}(s_t)]\)

\(E_{s_t\sim\pi}[\overline{A}(s_t)]=P(n_t>0)E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]+P(n_t=0)E_{s_t\sim\pi|n_t=0}[\overline{A}(s_t)]\)

\(E_{s_t\sim\tilde\pi|n_t=0}[\overline{A}(s_t)]=E_{s_t\sim\pi|n_t=0}[\overline{A}(s_t)]\), since before time \(t\) the two policies took identical actions and therefore visited the same states.

Therefore \(E_{s_t\sim\tilde\pi}[\overline A(s_t)]-E_{s_t\sim\pi}[\overline{A}(s_t)]=P(n_t>0)(E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]-E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)])\)

Since \(P(n_t>0)\leq1-(1-\alpha)^t\) and

\[|E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]-E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]|\leq|E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]|+|E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]| \]

we get \(|E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]-E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]|\leq2\max\limits_s|\overline{A}(s)|\leq4\alpha\max\limits_{s,a}|A_\pi(s,a)|\), and combining this with the two facts above gives \(|E_{s_t\sim\tilde\pi}[\overline A(s_t)]-E_{s_t\sim\pi}[\overline{A}(s_t)]|\leq4\alpha(1-(1-\alpha)^t)\max\limits_{s,a}|A_\pi(s,a)|\).

So finally we obtain:

\[\begin{align*} |\eta(\tilde\pi)-L_\pi(\tilde\pi)| &\leq\sum\limits_{t=0}^\infty\gamma^t|E_{\tau\sim\tilde\pi}[\overline A(s_t)]-E_{\tau\sim\pi}[\overline{A}(s_t)]| \\ &\leq\sum\limits_{t=0}^\infty\gamma^t\cdot4\epsilon\alpha(1-(1-\alpha)^t) \\ &=4\epsilon\alpha(\frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)}) \\ &=\frac{4\alpha^2\gamma\epsilon}{(1-\gamma)(1-\gamma(1-\alpha))} \\ &\leq\frac{4\gamma\epsilon\alpha^2}{(1-\gamma)^2} \end{align*} \]

Proof of Equation (10)

Define \(M_i(\pi)=L_{\pi_i}(\pi)-C\cdot D_{KL}^{max}(\pi_i,\pi)\).

We improve \(\eta(\pi)\) by improving its lower bound \(M_i(\pi)\):

\(\eta(\pi_{i+1})\geq M_i(\pi_{i+1})\)

\(\eta(\pi_i)=M_i(\pi_i)=L_{\pi_i}(\pi_i)\)

Therefore \(\eta(\pi_{i+1})-\eta(\pi_i)\geq M_i(\pi_{i+1})-M_i(\pi_i)\)

Hence, maximizing \(M_i(\pi)\) at each iteration guarantees that the true objective \(\eta(\pi)\) does not decrease.

Proof of Equation (17)

\[\begin{align*} \sum\limits_a\pi_\theta(a|s)A_{\theta_{old}}(s,a) &= \sum\limits_a\pi_\theta(a|s)[Q_{\theta_{old}}(s,a)-V_{\theta_{old}}(s)] \\ &= \sum\limits_a\pi_\theta(a|s)Q_{\theta_{old}}(s,a)-V_{\theta_{old}}(s)\sum\limits_a\pi_\theta(a|s) \\ &= \sum\limits_a\pi_\theta(a|s)Q_{\theta_{old}}(s,a)-V_{\theta_{old}}(s) \end{align*} \]

The difference \(V_{\theta_{old}}(s)\) does not depend on \(\theta\), so replacing \(A_{\theta_{old}}\) with \(Q_{\theta_{old}}\) changes the objective only by a constant.

