Proximal Policy Optimization Algorithms (PPO)
Author: 凱魯嘎吉 - 博客園 (cnblogs) http://www.cnblogs.com/kailugaji/
This post is a reading note on Schulman, Wolski, Dhariwal, Radford, and Klimov, "Proximal Policy Optimization Algorithms" [1]. It introduces the PPO optimization methods and derives some of their formulas. The paper presents three methods: the third is an extension of the first, and these two are widely used; the second performs poorly in the paper's experiments, but it is still a useful trick. Before reading this post you should be familiar with Trust Region Policy Optimization (TRPO). The word "Proximal" also recalls the soft thresholding/shrinkage operator that appears in a class of matrix-norm optimization problems, as well as the use of Proximal Gradient Descent (PGD) to solve the Lasso subproblem in Graphical Lasso estimation of the inverse covariance matrix. For more reinforcement learning content, see the blog category: 隨筆分類 - Reinforcement Learning.
1. Preliminaries
Policy Gradient Methods and Trust Region Policy Optimization (TRPO)
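As a brief recap of the notation used in the paper, policy gradient methods estimate the gradient of the expected return and plug it into stochastic gradient ascent. The most commonly used gradient estimator is

$$\hat{g}=\hat{\mathbb{E}}_t\left[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\hat{A}_t\right],$$

where $\pi_\theta$ is a stochastic policy and $\hat{A}_t$ is an estimate of the advantage function at timestep $t$. This estimator is obtained by differentiating the surrogate objective $L^{PG}(\theta)=\hat{\mathbb{E}}_t\left[\log\pi_\theta(a_t\mid s_t)\,\hat{A}_t\right]$. Running many optimization steps on $L^{PG}$ with the same batch of trajectories, however, tends to produce destructively large policy updates, which is what motivates TRPO and PPO.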

TRPO addresses this by imposing a hard KL-divergence constraint on each policy update, but with such a hard constraint it is difficult to choose a single constraint value that performs well across different problems.
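Concretely, TRPO maximizes a surrogate objective subject to a trust-region constraint on the KL divergence between the old and new policies:

$$\max_\theta\ \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}\,\hat{A}_t\right]\quad\text{subject to}\quad\hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{\mathrm{old}}}(\cdot\mid s_t),\ \pi_\theta(\cdot\mid s_t)\right]\right]\le\delta,$$

where $\theta_{\mathrm{old}}$ denotes the policy parameters before the update. The fixed trust-region size $\delta$ is exactly the quantity that is hard to tune; the three PPO variants below instead fold the constraint into the objective itself.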
2. Method 1: Clipped Surrogate Objective
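Let $r_t(\theta)=\dfrac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$ denote the probability ratio, so that $r_t(\theta_{\mathrm{old}})=1$. The paper's clipped surrogate objective is

$$L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],$$

where $\epsilon$ is a hyperparameter (the paper uses $\epsilon=0.2$ as a default). The clip term removes the incentive for moving $r_t$ outside the interval $[1-\epsilon,\,1+\epsilon]$, and taking the minimum of the clipped and unclipped terms makes $L^{CLIP}$ a pessimistic lower bound on the unclipped objective: a change in the ratio is ignored only when it would make the objective look better, never when it makes it worse.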


3. Method 2: Adaptive KL Penalty Coefficient
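Here the KL divergence is used as a penalty rather than a constraint, and the penalty coefficient is adapted so that the realized KL divergence stays near a target value $d_{\mathrm{targ}}$. Each policy update optimizes

$$L^{KLPEN}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}\,\hat{A}_t-\beta\,\mathrm{KL}\left[\pi_{\theta_{\mathrm{old}}}(\cdot\mid s_t),\ \pi_\theta(\cdot\mid s_t)\right]\right],$$

and afterwards computes $d=\hat{\mathbb{E}}_t\left[\mathrm{KL}\left[\pi_{\theta_{\mathrm{old}}}(\cdot\mid s_t),\ \pi_\theta(\cdot\mid s_t)\right]\right]$: if $d<d_{\mathrm{targ}}/1.5$ then $\beta\leftarrow\beta/2$; if $d>d_{\mathrm{targ}}\times 1.5$ then $\beta\leftarrow\beta\times 2$. The updated $\beta$ is used for the next policy update. As noted above, the paper finds that this variant performs worse than clipping in its experiments, but it remains a useful baseline.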

4. Method 3: Actor-Critic-Style Algorithm
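When the policy and value function share network parameters, the paper combines the clipped surrogate with a value-function error term and an entropy bonus into a single objective that is maximized each iteration:

$$L_t^{CLIP+VF+S}(\theta)=\hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta)-c_1 L_t^{VF}(\theta)+c_2 S\left[\pi_\theta\right](s_t)\right],$$

where $L_t^{VF}=\left(V_\theta(s_t)-V_t^{\mathrm{targ}}\right)^2$ is a squared-error value loss, $S$ denotes the entropy of the policy, and $c_1, c_2$ are coefficients. The advantage estimates $\hat{A}_t$ are computed with truncated generalized advantage estimation (GAE) over fixed-length trajectory segments, and each iteration collects data with $N$ parallel actors and then runs several epochs of minibatch SGD on the combined objective, in the actor-critic style of A3C [3].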


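To make the update rule concrete, below is a minimal Python/PyTorch-style sketch of a single minibatch step on the combined objective. It is only an illustration under my own assumptions, not code from the paper: names such as policy, value_fn, clip_eps, vf_coef, and ent_coef are hypothetical placeholders, policy is assumed to return a torch.distributions object, and the batch is assumed to already contain advantages and value targets computed from rollouts.

import torch

def ppo_clip_update(policy, value_fn, optimizer, batch,
                    clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    # One minibatch step on the combined PPO objective (hypothetical helper).
    # `batch` is assumed to hold tensors: obs, actions, old_log_probs
    # (log pi_old(a|s) stored at collection time), advantages (e.g. truncated
    # GAE), and returns (value targets V_t^targ).
    dist = policy(batch["obs"])                               # pi_theta(. | s_t)
    log_probs = dist.log_prob(batch["actions"])
    ratio = torch.exp(log_probs - batch["old_log_probs"])     # r_t(theta)

    adv = batch["advantages"]
    surr_unclipped = ratio * adv
    surr_clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    # Pessimistic minimum of clipped and unclipped terms, negated for a minimizer.
    policy_loss = -torch.min(surr_unclipped, surr_clipped).mean()

    # Squared-error value loss against the empirical returns.
    value_loss = (value_fn(batch["obs"]).squeeze(-1) - batch["returns"]).pow(2).mean()

    # Entropy bonus to encourage exploration.
    entropy = dist.entropy().mean()

    loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The negative sign on the surrogate term and the positive sign on the value loss appear because the optimizer minimizes, whereas $L^{CLIP+VF+S}$ is defined as an objective to be maximized.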

5. References
[1] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
[2] Proximal Policy Optimization — Spinning Up documentation https://spinningup.openai.com/en/latest/algorithms/ppo.html
[3] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. ICML, 2016.
[4] Proximal Policy Optimization Algorithms, slides, https://dvl.in.tum.de/slides/automl-ss19/01_stadler_ppo.pdf
