TRPO
1. Algorithm Derivation
We want every policy update to produce a new policy \(\tilde\pi\) that is better than the current policy \(\pi\). To that end, we would like to write \(\eta(\tilde\pi)\) in the form \(\eta(\pi)+(\cdots)\); then we only need to ensure \((\cdots)\geq 0\) to guarantee that \(\eta(\tilde\pi)\) increases monotonically.
Using the advantage \(A_\pi(s_t,a_t)\), we can rewrite \(\eta(\tilde\pi)\) as \(\eta(\tilde\pi)=\eta(\pi)+E_{s_0,a_0,\cdots\sim\tilde\pi}[\sum\limits_{t=0}^\infty \gamma^tA_\pi(s_t,a_t)]\) (1)
Writing the expectation as explicit sums: \(\eta(\tilde\pi)=\eta(\pi)+\sum\limits_{t=0}^\infty\sum\limits_sP(s_t=s|\tilde\pi)\sum\limits_a\tilde\pi(a|s)\gamma^tA_\pi(s,a)\) (2)
Exchanging the order of the sums over \(t\) and \(s\): \(\eta(\tilde\pi)=\eta(\pi)+\sum\limits_s\sum\limits_{t=0}^\infty\gamma^tP(s_t=s|\tilde\pi)\sum\limits_a\tilde\pi(a|s)A_\pi(s,a)\) (3)
Define the discounted state visitation frequencies \(\rho_{\tilde\pi}(s)=P(s_0=s|\tilde\pi)+\gamma P(s_1=s|\tilde\pi)+\gamma^2P(s_2=s|\tilde\pi)+\cdots\)
Then (3) becomes: \(\eta(\tilde\pi)=\eta(\pi)+\sum\limits_s\rho_{\tilde\pi}(s)\sum\limits_a\tilde\pi(a|s)A_\pi(s,a)\) (4)
Next comes the first approximation:
Since \(\rho_{\tilde\pi}(s)\) requires sampling from the new policy, and obtaining the new policy's state visitation distribution is extremely difficult, we replace \(\rho_{\tilde\pi}(s)\) with \(\rho_\pi(s)\), giving the surrogate \(L(\tilde\pi)=\eta(\pi)+\sum\limits_s\rho_\pi(s)\sum\limits_a\tilde\pi(a|s)A_\pi(s,a)\) (5)
It can be shown that \(L(\pi_{old})=\eta(\pi_{old})\) and \(\triangledown L(\pi_\theta)|_{\theta=\theta_{old}}=\triangledown \eta(\pi_\theta)|_{\theta=\theta_{old}}\) (6)
Hence \(L(\pi_\theta)\) matches \(\eta(\pi_\theta)\) to first order at \(\theta_{old}\): for a sufficiently small update step from \(\theta_{old}\), an improvement in \(L_{\theta_{old}}(\pi_\theta)\) corresponds to an improvement in \(\eta(\pi_\theta)\).
What remains is to measure the distance between \(\theta_{old}\) and \(\theta_{new}\), and to determine how much improvement in \(\eta(\pi_\theta)\) a policy update brings.
From the proof in the 2002 conservative policy iteration paper, when the policy is updated as a mixture \(\tilde\pi=(1-\alpha)\pi+\alpha\pi^{'}\), we have:
\(\eta(\tilde\pi)\geq L(\tilde\pi)-\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2\), where \(\epsilon=\max\limits_s|E_{a\sim\pi^{'}}[A_\pi(s,a)]|\) (7)
Here \(\alpha\) is the mixing weight in the update above, but it can also be viewed as a measure of divergence between the new and old policies.
This leads to \(\eta(\tilde\pi)\geq L(\tilde\pi)-\frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2\) (8), where now \(\alpha=D_{TV}^{max}(\pi_{old},\pi_{new})\) and \(\epsilon=\max\limits_{s,a}|A_\pi(s,a)|\).
[\(D_{TV}\) is the total variation divergence, and max denotes the maximum over states. It is defined by \(D_{TV}(p,q)=\frac{1}{2}\sum\limits_i|p_i-q_i|\) and \(D_{TV}^{max}(p,q)=\max\limits_sD_{TV}(p(\cdot|s),q(\cdot|s))\).]
Moreover, \(D_{TV}(q||p)^2\leq D_{KL}(q||p)\), so we obtain \(\eta(\tilde\pi)\geq L(\tilde\pi)-C\cdot D_{KL}^{max}(\pi,\tilde\pi)\), where \(C=\frac{4\epsilon\gamma}{(1-\gamma)^2}\) (9)
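As a quick numerical sanity check of the inequality \(D_{TV}(q||p)^2\leq D_{KL}(q||p)\) used above, here is a minimal Python sketch (assuming NumPy; the random categorical distributions are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def d_tv(p, q):
    # Total variation divergence: half the L1 distance between the distributions.
    return 0.5 * np.abs(p - q).sum()

def d_kl(p, q):
    # KL divergence D_KL(p || q) for categorical distributions.
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = rng.dirichlet(np.ones(4))  # random 4-way categorical distributions
    q = rng.dirichlet(np.ones(4))
    assert d_tv(p, q) ** 2 <= d_kl(p, q)  # the relation used to obtain (9)
    print(f"D_TV^2 = {d_tv(p, q) ** 2:.4f} <= D_KL = {d_kl(p, q):.4f}")
```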
This inequality implies that updates based on it improve \(\eta\) monotonically. (10)
The update of \(\theta\) then proceeds as follows: find the \(\theta\) that maximizes the right-hand side of the inequality, then set \(\theta_{old}=\theta\), i.e., \(\max\limits_\theta[L_{\theta_{old}}(\theta)-C\cdot D_{KL}^{max}(\theta_{old},\theta)]\) (11)
If we used the theoretically derived coefficient \(C=\frac{4\epsilon\gamma}{(1-\gamma)^2}\), the step sizes would be very small and learning would be slow. Focusing instead on the \(D_{KL}^{max}\) term and turning it into a constraint converts the problem into a constrained optimization, which permits much larger steps.
This gives the optimization objective: \(\max\limits_\theta L_{\theta_{old}}(\theta)\ \ \ \ subject\ \ to\ \ D_{KL}^{max}(\theta_{old},\theta)\leq\delta\) (12)
However, the constraint \(D_{KL}^{max}\leq\delta\) is imposed over all states, and checking every single state is impractical.
Hence the second approximation: replace the \(D_{KL}^{max}\) term with the average KL divergence.
Define \(\overline{D}_{KL}^{\rho}(\theta_1,\theta_2)=E_{s\sim\rho}[D_{KL}(\pi_{\theta_1}(\cdot|s)||\pi_{\theta_2}(\cdot|s))]\) (13)
The problem then becomes \(\max\limits_\theta L_{\theta_{old}}(\theta)\ \ \ \ subject\ \ to\ \ \overline{D}_{KL}^{\rho}(\theta_{old},\theta)\leq\delta\) (14). Experiments show that \(\overline{D}_{KL}^\rho\) and \(D_{KL}^{max}\) behave similarly.
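To illustrate how \(\overline{D}_{KL}^{\rho}\) is estimated from samples in practice, a minimal sketch assuming discrete-action policies; the arrays of action probabilities at states visited under the old policy are hypothetical:

```python
import numpy as np

def mean_kl(old_probs, new_probs):
    """Sample estimate of E_{s~rho}[D_KL(pi_old(.|s) || pi_new(.|s))].

    old_probs, new_probs: arrays of shape (num_sampled_states, num_actions)
    with the action probabilities of the old and new policies at states
    collected while rolling out the old policy.
    """
    kl_per_state = np.sum(old_probs * np.log(old_probs / new_probs), axis=1)
    return kl_per_state.mean()

# Hypothetical data: 3 sampled states, 2 actions.
old_probs = np.array([[0.5, 0.5], [0.8, 0.2], [0.3, 0.7]])
new_probs = np.array([[0.6, 0.4], [0.7, 0.3], [0.4, 0.6]])
print(mean_kl(old_probs, new_probs))  # this estimate must stay below delta
```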
To make this usable in practice, a third approximation follows:
Expand the objective: \(L_{\theta_{old}}(\theta)=\sum\limits_s\rho_{\theta_{old}}(s)\sum\limits_a\pi_\theta(a|s)A_{\theta_{old}}(s,a)\)
First, replace \(\sum\limits_s\rho_{\theta_{old}}(s)(\cdots)\) with an expectation over sampled states: \(\frac{1}{1-\gamma}E_{s\sim\rho_{\theta_{old}}}[\cdots]\) (15)
Next, estimate \(\sum\limits_a\pi_\theta(a|s)A_{\theta_{old}}(s,a)\) by importance sampling with a sampling distribution \(q\), giving \(E_{a\sim q}[\frac{\pi_\theta(a|s_n)}{q(a|s_n)}A_{\theta_{old}}(s_n,a)]\) (16)
Finally, replace \(A_{\theta_{old}}(s,a)\) with \(Q_{\theta_{old}}(s,a)\) (17); this replacement only shifts the objective by a constant.
The final form of the objective: \(\max\limits_\theta E_{s\sim\rho_{\theta_{old}},a\sim q}[\frac{\pi_\theta(a|s)}{q(a|s)}Q_{\theta_{old}}(s,a)]\ \ \ subject\ to\ E_{s\sim\rho_{\theta_{old}}}[D_{KL}(\pi_{\theta_{old}}(\cdot|s)||\pi_\theta(\cdot|s))]\leq\delta\)
In practice: ① replace the expectations \(E\) with sample averages; ② replace the \(Q\)-values with empirical estimates.
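Combining ① and ②, a minimal sketch of the sample estimate of the objective (names such as `logp_new` and `q_estimates` are illustrative; the sampling distribution \(q\) is taken to be \(\pi_{\theta_{old}}\), so the importance ratio is \(\pi_\theta/\pi_{\theta_{old}}\)):

```python
import numpy as np

def surrogate_objective(logp_new, logp_old, q_estimates):
    """Sample estimate of E_{s~rho_old, a~q}[pi_theta(a|s)/q(a|s) * Q_old(s,a)].

    With q = pi_old, the importance ratio is exp(logp_new - logp_old).
    All inputs are 1-D arrays indexed by the sampled (s, a) pairs;
    q_estimates holds empirical Q-value (or advantage) estimates.
    """
    ratios = np.exp(logp_new - logp_old)
    return np.mean(ratios * q_estimates)

# Hypothetical sampled data.
logp_old = np.log(np.array([0.5, 0.8, 0.3]))   # log pi_old(a|s) at sampled pairs
logp_new = np.log(np.array([0.6, 0.7, 0.4]))   # log pi_theta(a|s) at sampled pairs
q_estimates = np.array([1.2, -0.3, 0.5])       # empirical return-based estimates
print(surrogate_objective(logp_new, logp_old, q_estimates))
```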
2. Optimization Details
Consider \(\max\limits_\theta\ \ L_{\theta_{old}}(\theta)\ \ \ subject\ to\ D_{KL}(\pi_{\theta_{old}}(\cdot|s)||\pi_\theta(\cdot|s))\leq\delta\)
After estimating the objective and the constraint from Monte Carlo samples, we turn to how to solve this constrained optimization problem.
Expand \(l(\theta)\) and \(kl(\theta)\) in Taylor series around \(\theta_{old}\):
\(l(\theta)\approx l(\theta_{old})+\triangledown l(\theta_{old})^T(\theta-\theta_{old})+\frac{1}{2}(\theta-\theta_{old})^TH(l)(\theta_{old})(\theta-\theta_{old})\approx g^T(\theta-\theta_{old})\) (18) (the first term is zero and the third term is negligible; \(g=\triangledown l(\theta_{old})\))
\(kl(\theta)\approx kl(\theta_{old})+\triangledown kl(\theta_{old})^T(\theta-\theta_{old})+\frac{1}{2}(\theta-\theta_{old})^TH(kl)(\theta_{old})(\theta-\theta_{old})\approx \frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\) (19) (the first and second terms are zero; here \(F\) denotes \(H(kl)(\theta_{old})\))
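The matrix \(F\) can be identified with the Fisher information matrix; this is a standard identity (not spelled out in the original note): the Hessian of the KL term at \(\theta_{old}\) reduces to the expected outer product of score functions,
\[
\triangledown_\theta^2\,E_{s}\big[D_{KL}(\pi_{\theta_{old}}(\cdot|s)||\pi_\theta(\cdot|s))\big]\Big|_{\theta=\theta_{old}}
=E_{s}E_{a\sim\pi_{\theta_{old}}}\big[-\triangledown_\theta^2\log\pi_\theta(a|s)\big]\Big|_{\theta=\theta_{old}}
=E_{s,a}\big[\triangledown_\theta\log\pi_{\theta_{old}}(a|s)\,\triangledown_\theta\log\pi_{\theta_{old}}(a|s)^T\big]=F .
\]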
The optimization problem is then approximated as: \(\max\limits_\theta g^T(\theta-\theta_{old})\ \ \ subject\ \ to\ \ \frac{1}{2}(\theta-\theta_{old})^TF(\theta-\theta_{old})\leq\delta\) (20)
Construct the Lagrangian: \(L(\theta,\lambda)=g^T(\theta-\theta_{old})-\frac{\lambda}{2}[(\theta-\theta_{old})^TF(\theta-\theta_{old})-\delta]\) (21)
Since the constraint is an inequality, the KKT conditions must also hold: stationarity, primal and dual feasibility (\(\lambda\geq0\)), and complementary slackness.
Solving these together gives \(\frac{1}{2}s^TFs=\delta\), where \(s=\theta-\theta_{old}\).
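Writing out the solution of (20)–(21) explicitly (a standard derivation, included here because Section 4 uses the result): stationarity of the Lagrangian gives the direction, and the active constraint fixes its length,
\[
\triangledown_\theta L(\theta,\lambda)=g-\lambda F(\theta-\theta_{old})=0
\;\Rightarrow\;
s=\frac{1}{\lambda}F^{-1}g,
\qquad
\frac{1}{2}s^TFs=\delta
\;\Rightarrow\;
s=\sqrt{\frac{2\delta}{g^TF^{-1}g}}\,F^{-1}g .
\]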
3. Hessian-Free
Compared with Newton's method, the conjugate gradient method cannot solve for the step in one shot and needs multiple iterations, but it avoids inverting the Hessian; naively, however, it still requires computing and storing the Hessian, which motivates the Hessian-free (Hessian-vector product) trick.
We have the relation \((Hv)_i=\sum\limits_{j=1}^N\frac{\partial^2f}{\partial x_i\partial x_j}(x)v_j=[\triangledown\frac{\partial f(x)}{\partial x_i}]\cdot v\), which is exactly the directional derivative of \(g_i=\frac{\partial f}{\partial x_i}\) in the direction \(v\).
Hence, with \(g=\triangledown f\), \(\triangledown_vg=\lim\limits_{\epsilon\rightarrow0}\frac{g(x+\epsilon v)-g(x)}{\epsilon}\approx\frac{g(x+\epsilon v)-g(x)}{\epsilon}\), which gives \(Hv\approx\frac{\triangledown f(x+\epsilon v)-\triangledown f(x)}{\epsilon}\).
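A minimal Python sketch of this Hessian-free idea (NumPy; the quadratic \(f\) and its gradient are toy stand-ins for the KL term): approximate \(Hv\) with a finite difference of gradients, then run conjugate gradient using only such products to solve \(Hx=g\).

```python
import numpy as np

def hessian_vector_product(grad_f, x, v, eps=1e-5):
    # Hv ~ (grad f(x + eps*v) - grad f(x)) / eps; the Hessian is never formed.
    return (grad_f(x + eps * v) - grad_f(x)) / eps

def conjugate_gradient(hvp, b, iters=10, tol=1e-10):
    """Solve H x = b using only Hessian-vector products hvp(v)."""
    x = np.zeros_like(b)
    r = b.copy()            # residual b - Hx (x starts at 0)
    p = r.copy()            # current search direction
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy example: f(x) = 0.5 x^T A x, so grad f(x) = A x and the true Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad_f = lambda x: A @ x
x0 = np.array([1.0, -1.0])
g = np.array([1.0, 0.5])
hvp = lambda v: hessian_vector_product(grad_f, x0, v)
print(conjugate_gradient(hvp, g))   # approximately A^{-1} g
print(np.linalg.solve(A, g))        # reference solution
```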
4. Update Procedure
At this point the Lagrangian we need to solve is (21): \(L(\theta,\lambda)=g^T(\theta-\theta_{old})-\frac{\lambda}{2}[(\theta-\theta_{old})^TF(\theta-\theta_{old})-\delta]\)
Using the conjugate gradient method, we can compute the vector pointing from the current iterate toward the optimum, \(s_u=\frac{1}{\lambda}F^{-1}g\).
To satisfy the constraint, rescale \(s_u\): \(s=\sqrt{\frac{2\delta}{s_u^TFs_u}}s_u\)
Then search along the rescaled vector \(s\): add \(s,\frac{s}{2},\frac{s}{4},\cdots\) in turn to the current iterate until the optimization objective improves.
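Putting the pieces together, a minimal sketch of one update (NumPy; `fvp`, `surrogate`, and `kl` are assumed user-supplied callables, and `conjugate_gradient` is the routine sketched in Section 3):

```python
import numpy as np

def trpo_step(theta_old, g, fvp, surrogate, kl, delta, max_backtracks=10):
    """One simplified TRPO-style update.

    theta_old : current parameter vector
    g         : gradient of the surrogate objective at theta_old
    fvp       : callable v -> F v (KL-Hessian / Fisher vector product)
    surrogate : callable theta -> sample estimate of L_{theta_old}(theta)
    kl        : callable theta -> sample estimate of the average KL
    delta     : trust-region size
    """
    s_u = conjugate_gradient(fvp, g)                       # s_u ~ F^{-1} g
    s = np.sqrt(2.0 * delta / (s_u @ fvp(s_u))) * s_u      # rescale to the KL boundary
    f_old = surrogate(theta_old)
    for i in range(max_backtracks):
        theta_new = theta_old + s / (2 ** i)               # try s, s/2, s/4, ...
        # Accept the first candidate that improves the objective
        # (checking the KL constraint as well, as commonly done in practice).
        if surrogate(theta_new) > f_old and kl(theta_new) <= delta:
            return theta_new
    return theta_old                                        # fall back to no update
```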
5. Proofs
Proof of Equation (1)
In the derivation below, the first equality expands \(A_\pi\); the second writes out each term of the sum and cancels the telescoping terms; the fourth uses the fact that the initial state \(s_0\) has the same distribution under \(\pi\) and \(\tilde\pi\).
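The chain of equalities, reconstructed from the standard argument (using \(A_\pi(s,a)=E_{s'}[r(s)+\gamma V_\pi(s')-V_\pi(s)]\)):
\[
\begin{aligned}
E_{\tau\sim\tilde\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}A_\pi(s_t,a_t)\Big]
&=E_{\tau\sim\tilde\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\big(r(s_t)+\gamma V_\pi(s_{t+1})-V_\pi(s_t)\big)\Big]\\
&=E_{\tau\sim\tilde\pi}\Big[-V_\pi(s_0)+\sum_{t=0}^{\infty}\gamma^{t}r(s_t)\Big]\\
&=-E_{s_0}\big[V_\pi(s_0)\big]+E_{\tau\sim\tilde\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(s_t)\Big]\\
&=-\eta(\pi)+\eta(\tilde\pi).
\end{aligned}
\]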
Proof of Equation (6)
Clearly \(L(\pi_{old})=\eta(\pi_{old})\).
For the surrogate, \(\triangledown L(\pi_\theta)|_{\theta=\theta_{old}}=\sum\limits_s\rho_{\theta_{old}}(s)\sum\limits_a\triangledown_\theta\pi_\theta(a|s)A_\pi(s,a)|_{\theta=\theta_{old}}\)
\(\triangledown \eta(\pi_\theta)|_{\theta=\theta_{old}}=\sum\limits_s\rho_{\theta}(s)\sum\limits_a\triangledown_\theta\pi_\theta(a|s)A_\pi(s,a)|_{\theta=\theta_{old}}\)
In practice, \(\rho_\pi(s)\) is obtained from sample information; when sampling with \(\theta=\theta_{old}\), i.e., with \(\pi_{\theta_{old}}\), we have \(\rho_{\theta}(s)=\rho_{\theta_{old}}(s)\), so the two gradients are equal.
Proof of Equations (7) and (8)
With \(\pi_{new}(a|s)=(1-\alpha)\pi_{old}(a|s)+\alpha\pi^{'}(a|s)\), we have \(\eta(\pi_{new})\geq L_{\pi_{old}}(\pi_{new})-\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2,\ \ \ where\ \ \epsilon=\max\limits_s|E_{a\sim\pi^{'}}[A_\pi(s,a)]|\)
Define \(\overline{A}(s)=E_{a\sim\tilde\pi(\cdot|s)}[A_\pi(s,a)]\); \(\overline A(s)\) measures, at state \(s\), the improvement of \(\tilde\pi\) over the previous policy.
\(\eta(\tilde\pi)=\eta(\pi)+E_{\tau\sim\tilde\pi}[\sum\limits_{t=0}^\infty\gamma^t\overline{A}(s_t)],\ \ L(\tilde\pi)=\eta(\pi)+E_{\tau\sim\pi}[\sum\limits_{t=0}^\infty\gamma^t\overline A(s_t)]\)
We can couple the two policies into a pair \((\pi,\tilde\pi)\) that generates action pairs \((a,\tilde a)\), with \(P(a\neq\tilde a|s)\leq\alpha\).
\(\overline{A}(s)=E_{\tilde a\sim\tilde\pi}[A_\pi(s,\tilde a)]=E_{(a,\tilde a)\sim(\pi,\tilde\pi)}[A_\pi(s,\tilde a)-A_\pi(s,a)]=P(a\neq\tilde a|s)E_{(a,\tilde a)\sim(\pi,\tilde\pi)|a\neq\tilde a}[A_\pi(s,\tilde a)-A_\pi(s,a)]\)
because \(E_{a\sim\pi}[A_\pi(s,a)]=0\) and the terms with \(a=\tilde a\) contribute \(E_{(a,\tilde a)\sim(\pi,\tilde\pi)|a=\tilde a}[A_\pi(s,\tilde a)-A_\pi(s,a)]=0\).
Therefore \(|\overline{A}(s)|\leq 2\alpha\max\limits_{s,a}|A_\pi(s,a)|\).
This further yields \(|E_{s_t\sim\tilde\pi}[\overline A(s_t)]-E_{s_t\sim\pi}[\overline A(s_t)]|\leq2(1-(1-\alpha)^t)\max\limits_s|\overline{A}(s)|\leq4\alpha(1-(1-\alpha)^t)\max\limits_{s,a}|A_\pi(s,a)|\)
The proof is as follows. Let \(n_t\) denote the number of time steps before time \(t\) at which \(\pi\) and \(\tilde\pi\) produced different actions.
\(E_{s_t\sim\tilde\pi}[\overline{A}(s_t)]=P(n_t>0)E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]+P(n_t=0)E_{s_t\sim\tilde\pi|n_t=0}[\overline{A}(s_t)]\)
\(E_{s_t\sim\pi}[\overline{A}(s_t)]=P(n_t>0)E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]+P(n_t=0)E_{s_t\sim\pi|n_t=0}[\overline{A}(s_t)]\)
Moreover, \(E_{s_t\sim\tilde\pi|n_t=0}[\overline{A}(s_t)]=E_{s_t\sim\pi|n_t=0}[\overline{A}(s_t)]\), since before time \(t\) the two policies produced the same actions and therefore visited the same states.
Therefore \(E_{s_t\sim\tilde\pi}[\overline A(s_t)]-E_{s_t\sim\pi}[\overline{A}(s_t)]=P(n_t>0)(E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]-E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)])\)
Because \(P(n_t>0)\leq1-(1-\alpha)^t\) and \(|\overline{A}(s)|\leq 2\alpha\max\limits_{s,a}|A_\pi(s,a)|\) for every state,
we have \(|E_{s_t\sim\tilde\pi|n_t>0}[\overline{A}(s_t)]-E_{s_t\sim\pi|n_t>0}[\overline{A}(s_t)]|\leq2\max\limits_s|\overline A(s)|\leq4\alpha\max\limits_{s,a}|A_\pi(s,a)|\)
so finally we obtain \(|E_{s_t\sim\tilde\pi}[\overline A(s_t)]-E_{s_t\sim\pi}[\overline{A}(s_t)]|\leq4\alpha(1-(1-\alpha)^t)\max\limits_{s,a}|A_\pi(s,a)|\).
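To pass from this per-time-step bound to the bound in (8), sum over \(t\) with the discount (the final step, written out following the same argument):
\[
|\eta(\tilde\pi)-L(\tilde\pi)|
\leq\sum_{t=0}^{\infty}\gamma^{t}\big|E_{s_t\sim\tilde\pi}[\overline A(s_t)]-E_{s_t\sim\pi}[\overline A(s_t)]\big|
\leq\sum_{t=0}^{\infty}\gamma^{t}\cdot4\epsilon\alpha\big(1-(1-\alpha)^{t}\big)
=4\epsilon\alpha\Big(\frac{1}{1-\gamma}-\frac{1}{1-\gamma(1-\alpha)}\Big)
=\frac{4\epsilon\gamma\alpha^{2}}{(1-\gamma)\big(1-\gamma(1-\alpha)\big)}
\leq\frac{4\epsilon\gamma\alpha^{2}}{(1-\gamma)^{2}},
\]
where \(\epsilon=\max\limits_{s,a}|A_\pi(s,a)|\).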
Proof of Equation (10)
Define \(M_i(\pi)=L_{\pi_i}(\pi)-C\cdot D_{KL}^{max}(\pi_i,\pi)\).
We improve \(\eta(\pi)\) by improving its lower bound \(M_i(\pi)\):
\(\eta(\pi_{i+1})\geq M_i(\pi_{i+1})\)
\(\eta(\pi_i)=M_i(\pi_i)=L_{\pi_i}(\pi_i)\)
Therefore \(\eta(\pi_{i+1})-\eta(\pi_i)\geq M_i(\pi_{i+1})-M_i(\pi_i)\).
Hence, improving \(M_i(\pi)\) at each iteration improves \(\eta(\pi)\), so the updates are monotonic.