Trust Region Policy Optimization (TRPO)
Author: 凱魯嘎吉 - 博客園 http://www.cnblogs.com/kailugaji/
This post is a set of reading notes on Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P., Trust Region Policy Optimization, Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:1889-1897, 2015 [1]. It introduces the TRPO policy optimization method and walks through the derivations of its formulas. TRPO is a policy-gradient-based reinforcement learning method; apart from Theorem 1, whose proof is not reproduced here, the origin and derivation of every formula is explained in detail, providing a foundation for further study of other reinforcement learning methods. For more reinforcement learning content, see the blog category: Reinforcement Learning.
1. Preliminaries
KL divergence (Kullback–Leibler divergence, or relative entropy), total variation (TV) divergence, and the relationship between the two (Pinsker's inequality)
The conjugate gradient algorithm
The difference in expected discounted reward between the new and old policies (these quantities are summarized below)
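For reference, a compact summary of these quantities in the notation of [1] (a sketch only; the detailed derivations are given in the sections below). For discrete distributions p and q,

D_{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}, \qquad D_{TV}(p \,\|\, q) = \frac{1}{2} \sum_i |p_i - q_i|,

and Pinsker's inequality [3][4] relates the two: D_{TV}(p \,\|\, q) \le \sqrt{D_{KL}(p \,\|\, q)/2}, which in particular implies the looser bound D_{TV}(p \,\|\, q)^2 \le D_{KL}(p \,\|\, q) used in [1]. For an old policy \pi and a new policy \tilde{\pi}, the difference in expected discounted reward is [6]

\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}} \left[ \sum_{t=0}^{\infty} \gamma^t A_{\pi}(s_t, a_t) \right], \qquad A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s).

The conjugate gradient algorithm [5] is used later, in Section 6, to compute the natural-gradient direction; a code sketch appears under Section 7.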
2. A Local Approximation to η
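The local approximation in question replaces the state-visitation frequencies of the new policy with those of the old one (notation of [1]):

L_{\pi}(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s) A_{\pi}(s, a), \qquad \rho_{\pi}(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s).

For a parameterized policy \pi_\theta, L matches \eta to first order at the current parameters: L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}) and \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_0}, so a sufficiently small step that improves L also improves \eta.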
3. Monotonic Improvement Guarantee for General Stochastic Policies
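The guarantee is Theorem 1 of [1], stated here for reference:

\eta(\tilde{\pi}) \ge L_{\pi}(\tilde{\pi}) - C \, D_{KL}^{\max}(\pi, \tilde{\pi}), \qquad C = \frac{4 \epsilon \gamma}{(1-\gamma)^2}, \quad \epsilon = \max_{s,a} |A_{\pi}(s, a)|,

where D_{KL}^{\max}(\pi, \tilde{\pi}) = \max_s D_{KL}(\pi(\cdot|s) \,\|\, \tilde{\pi}(\cdot|s)). Maximizing the right-hand side M_i(\pi) = L_{\pi_i}(\pi) - C \, D_{KL}^{\max}(\pi_i, \pi) at every iteration gives a monotonically improving sequence of policies, since \eta(\pi_{i+1}) \ge M_i(\pi_{i+1}) \ge M_i(\pi_i) = \eta(\pi_i) (a minorization-maximization argument).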
4. The Optimization Problem for Parameterized Policies
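In parameterized form, the penalty coefficient C from Theorem 1 leads to impractically small steps, so [1] instead imposes a hard trust-region constraint, with the maximum KL divergence heuristically replaced by the average KL over states visited under the old policy:

\max_{\theta} \; L_{\theta_{old}}(\theta) \quad \text{s.t.} \quad \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta, \qquad \bar{D}_{KL}^{\rho}(\theta_1, \theta_2) = \mathbb{E}_{s \sim \rho}\left[ D_{KL}(\pi_{\theta_1}(\cdot|s) \,\|\, \pi_{\theta_2}(\cdot|s)) \right].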
5. Sample-Based Estimation of the Objective and Constraint
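With expectations replaced by sample averages and an importance-sampling correction over actions, the problem becomes (notation of [1]):

\max_{\theta} \; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\left[ \frac{\pi_{\theta}(a|s)}{q(a|s)} Q_{\theta_{old}}(s, a) \right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\theta_{old}}}\left[ D_{KL}(\pi_{\theta_{old}}(\cdot|s) \,\|\, \pi_{\theta}(\cdot|s)) \right] \le \delta,

where q is the action sampling distribution: q = \pi_{\theta_{old}} for the single-path scheme, while the vine scheme performs extra rollouts from a set of sampled states.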
6. Solving the Constrained Optimization Problem
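To solve this numerically, the objective is linearized and the KL constraint is replaced by its second-order (Fisher-information) approximation around \theta_{old}. Writing g for the gradient of the surrogate objective and F for the Fisher information matrix (the Hessian of the average KL), the approximate problem and its closed-form solution are (see also [7]):

\max_{\theta} \; g^{T}(\theta - \theta_{old}) \quad \text{s.t.} \quad \frac{1}{2} (\theta - \theta_{old})^{T} F (\theta - \theta_{old}) \le \delta, \qquad \theta = \theta_{old} + \sqrt{\frac{2\delta}{g^{T} F^{-1} g}} \, F^{-1} g.

The product F^{-1} g is computed approximately by the conjugate gradient algorithm using only Fisher-vector products (so F is never formed explicitly), and the resulting step is refined by a backtracking line search that checks the exact KL constraint and the improvement of the surrogate objective.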
7. Overall Algorithm Flow
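Each TRPO iteration collects trajectories with the current policy (single path or vine), estimates Q-values or advantages, builds the sample-based objective and constraint above, and then performs the conjugate-gradient/line-search update. Below is a minimal NumPy sketch of that update step; it is not the authors' implementation, and fvp, surrogate, and kl are hypothetical callables supplied by the user: fvp(v) returns the Fisher-vector product F v, surrogate(theta) the sampled surrogate objective, and kl(theta) the average KL divergence from the old policy.

import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    # Approximately solve F x = g using only Fisher-vector products fvp(v) = F v.
    x = np.zeros_like(g)
    r = g.copy()                       # residual r = g - F x (x = 0 initially)
    p = r.copy()                       # search direction
    rdotr = r.dot(r)
    for _ in range(iters):
        z = fvp(p)
        alpha = rdotr / (p.dot(z) + 1e-8)
        x = x + alpha * p
        r = r - alpha * z
        new_rdotr = r.dot(r)
        if new_rdotr < tol:
            break
        p = r + (new_rdotr / rdotr) * p
        rdotr = new_rdotr
    return x

def trpo_step(theta, g, fvp, surrogate, kl, delta=0.01, backtracks=10):
    # One TRPO update: natural-gradient direction, maximal step length, backtracking line search.
    x = conjugate_gradient(fvp, g)                          # x ~= F^{-1} g
    beta = np.sqrt(2.0 * delta / (x.dot(fvp(x)) + 1e-8))    # so that 0.5 * beta^2 * x^T F x = delta
    old_obj = surrogate(theta)
    for k in range(backtracks):
        theta_new = theta + (0.5 ** k) * beta * x           # shrink the step geometrically
        # accept the first step that improves the surrogate and satisfies the KL constraint
        if surrogate(theta_new) > old_obj and kl(theta_new) <= delta:
            return theta_new
    return theta                                            # no acceptable step found; keep the old parameters

Working with Fisher-vector products instead of the full matrix F keeps the memory cost linear in the number of parameters, which is what makes this approach practical for neural-network policies [1][7].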
8. References
[1] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust Region Policy Optimization. Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:1889-1897, 2015.
[2] Gray, R. M. Entropy and Information Theory. http://www.cis.jhu.edu/~bruno/s06-466/GrayIT.pdf. Lemma 5.2.8, p. 88.
[3] Gibbs, A. L. and Su, F. E. On Choosing and Bounding Probability Metrics. International Statistical Review, 2002, 70(3). https://arxiv.org/pdf/math/0209021.pdf
[4] Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. http://home.ustc.edu.cn/~luke2001/pdf/concentration.pdf. Theorem 4.19 (Pinsker's inequality), p. 103.
[5] Nocedal, J. and Wright, S. J. Numerical Optimization. New York, NY: Springer, 2006 (Zbl 1104.65059). http://www.apmath.spbu.ru/cnsa/pdf/monograf/Numerical_Optimization2006.pdf
[6] Kakade, S. and Langford, J. Approximately Optimal Approximate Reinforcement Learning. In ICML, volume 2, pp. 267-274, 2002.
[7] Trust Region Policy Optimization, OpenAI Spinning Up. https://spinningup.openai.com/en/latest/algorithms/trpo.html