Reinforcement Learning Reading Notes - 13 - Policy Gradient Methods

Study notes on:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto © 2014, 2015, 2016

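See also

For the mathematical notation used in reinforcement learning, start here:

Reinforcement Learning Reading Notes - 00 - Terminology and Mathematical Notation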

Policy Gradient Methods

The value-function-based approach

\[\text{Reinforcement Learning} \doteq \pi_* \\ \quad \updownarrow \\ \pi_* \doteq \{ \pi(s) \}, \ s \in \mathcal{S} \\ \quad \updownarrow \\ \begin{cases} \pi(s) = \underset{a}{argmax} \ v_{\pi}(s' | s, a), \ s' \in S(s), \quad \text{or} \\ \pi(s) = \underset{a}{argmax} \ q_{\pi}(s, a) \\ \end{cases} \\ \quad \updownarrow \\ \begin{cases} v_*(s), \quad \text{or} \\ q_*(s, a) \\ \end{cases} \\ \quad \updownarrow \\ \text{approximation cases:} \\ \begin{cases} \hat{v}(s, \theta) \doteq \theta^T \phi(s), \quad \text{state value function} \\ \hat{q}(s, a, \theta) \doteq \theta^T \phi(s, a), \quad \text{action value function} \\ \end{cases} \\ where \\ \theta \text{ - value function's weight vector} \\ \]

The new approach: policy gradient methods

\[\text{Reinforcement Learning} \doteq \pi_* \\ \quad \updownarrow \\ \pi_* \doteq \{ \pi(s) \}, \ s \in \mathcal{S} \\ \quad \updownarrow \\ \pi(s) = \underset{a}{argmax} \ \pi(a|s, \theta) \\ \text{where} \\ \pi(a|s, \theta) \in [0, 1] \\ s \in \mathcal{S}, \ a \in \mathcal{A} \\ \quad \updownarrow \\ \pi(a|s, \theta) \doteq \frac{\exp(h(s,a,\theta))}{\sum_b \exp(h(s,b,\theta))} \\ \quad \updownarrow \\ h(s,a,\theta) \doteq \theta^T \phi(s,a) \\ \text{where} \\ h(s,a,\theta) \text{ - numerical preference for action } a \text{ in state } s \\ \theta \text{ - policy weight vector} \\ \]
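To make the softmax parameterization concrete, here is a minimal sketch in plain Python. It assumes linear action preferences \(h(s,a,\theta) = \theta^T \phi(s,a)\); the function name `softmax_policy` and the example feature vectors are hypothetical, chosen only for illustration.

```python
import math

def softmax_policy(theta, features):
    """Return the list pi(.|s, theta), where features[a] is phi(s, a)."""
    # Linear action preferences h(s, a, theta) = theta^T phi(s, a).
    prefs = [sum(t * x for t, x in zip(theta, phi)) for phi in features]
    # Subtract the largest preference before exponentiating (numerical stability;
    # it does not change the resulting probabilities).
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical state with three actions and two-dimensional features (assumption).
phi_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.5, -0.5]
probs = softmax_policy(theta, phi_s)
print(probs)  # three probabilities that sum to 1
```

Actions with a larger preference get a larger probability, but every action keeps a non-zero probability, which is what lets the policy keep exploring while \(\theta\) is being learned.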

The Policy Gradient Theorem

Episodic tasks

How to measure a policy's performance \(\eta\)

\[\eta(\theta) \doteq v_{\pi_\theta}(s_0) \\ where \\ \eta \text{ - the performance measure} \\ v_{\pi_\theta} \text{ - the true value function for } \pi_\theta \text{, the policy determined by } \theta \\ s_0 \text{ - some particular state} \\ \]

The policy gradient theorem

\[\nabla \eta(\theta) = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s,a) \nabla_\theta \pi(a|s, \theta) \\ \text{where} \\ d_{\pi}(s) \text{ - on-policy distribution, the fraction of time spent in } s \text{ under the target policy } \pi \\ \sum_s d_{\pi}(s) = 1 \\ \]

REINFORCE: Monte Carlo Policy Gradient

Formula for the gradient of the policy's performance

\[\begin{align} \nabla \eta(\theta) & = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s,a) \nabla_\theta \pi(a|s, \theta) \\ & = \mathbb{E}_\pi \left [ \gamma^t \sum_a q_\pi(S_t,a) \nabla_\theta \pi(a|S_t, \theta) \right ] \\ & = \mathbb{E}_\pi \left [ \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \right ] \end{align} \]
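The last equality rests on two steps that the chain leaves implicit. A sketch of them, using only symbols already defined in these notes: multiply and divide by \(\pi(a|S_t,\theta)\) so that the sum over actions becomes an expectation over the sampled action \(A_t\), then replace \(q_\pi(S_t, A_t)\) by the sampled return \(G_t\), whose conditional expectation it is.

```latex
\begin{align}
\sum_a q_\pi(S_t,a) \nabla_\theta \pi(a|S_t,\theta)
  &= \sum_a \pi(a|S_t,\theta) \, q_\pi(S_t,a)
     \frac{\nabla_\theta \pi(a|S_t,\theta)}{\pi(a|S_t,\theta)} \\
  &= \mathbb{E}_\pi \left[ q_\pi(S_t,A_t)
     \frac{\nabla_\theta \pi(A_t|S_t,\theta)}{\pi(A_t|S_t,\theta)}
     \,\middle|\, S_t \right] \\
q_\pi(S_t,A_t) &= \mathbb{E}_\pi \left[ G_t \mid S_t, A_t \right]
\end{align}
```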

Update rule

\[\begin{align} \theta_{t+1} & \doteq \theta_t + \alpha \gamma^t G_t \frac{\nabla_\theta \pi(A_t|S_t, \theta)}{\pi(A_t|S_t, \theta)} \\ & = \theta_t + \alpha \gamma^t G_t \nabla_\theta \log \pi(A_t|S_t, \theta) \\ \end{align} \]
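For the softmax policy with linear preferences, the gradient in this update has a closed form: \(\nabla_\theta \log \pi(a|s,\theta) = \phi(s,a) - \sum_b \pi(b|s,\theta) \phi(s,b)\). The sketch below applies one update step under that assumption; the function names and the two-action example are hypothetical.

```python
import math

def policy(theta, feats):
    """Softmax over linear preferences; feats[b] is phi(s, b)."""
    prefs = [sum(t * x for t, x in zip(theta, phi)) for phi in feats]
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, feats, a):
    """grad_theta log pi(a|s, theta) = phi(s, a) - sum_b pi(b|s, theta) * phi(s, b)."""
    probs = policy(theta, feats)
    return [feats[a][i] - sum(p * phi[i] for p, phi in zip(probs, feats))
            for i in range(len(theta))]

def reinforce_step(theta, feats, a, G, t, alpha=0.1, gamma=1.0):
    """theta_{t+1} = theta_t + alpha * gamma^t * G_t * grad log pi(A_t|S_t, theta)."""
    g = grad_log_pi(theta, feats, a)
    scale = alpha * (gamma ** t) * G
    return [th + scale * gi for th, gi in zip(theta, g)]

# Hypothetical two-action state with one-hot action features (assumption).
feats = [[1.0, 0.0], [0.0, 1.0]]
theta = [0.0, 0.0]
new_theta = reinforce_step(theta, feats, a=1, G=2.0, t=0)
print(new_theta, policy(new_theta, feats))
```

With a positive return, the step raises the probability of the action that was taken; with a negative return it lowers it.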

Algorithm description (REINFORCE: A Monte Carlo Policy Gradient Method (episodic))
See the original book; it is not repeated here.
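As a rough illustration of how the pieces fit together (not the book's pseudocode), the sketch below runs REINFORCE on a made-up three-state corridor. The environment, the one-hot features, the episode cap, and all names are assumptions introduced for this example.

```python
import math
import random

N_STATES, N_ACTIONS = 3, 2  # hypothetical corridor: states 0..2, reaching 3 ends the episode

def features(s, a):
    """One-hot feature vector phi(s, a)."""
    phi = [0.0] * (N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi

def policy(theta, s):
    """Softmax over linear preferences theta^T phi(s, a)."""
    prefs = [sum(t * x for t, x in zip(theta, features(s, a))) for a in range(N_ACTIONS)]
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, s, a):
    """phi(s, a) - sum_b pi(b|s, theta) * phi(s, b)."""
    probs = policy(theta, s)
    grad = features(s, a)
    for b in range(N_ACTIONS):
        for i, x in enumerate(features(s, b)):
            grad[i] -= probs[b] * x
    return grad

def generate_episode(theta, rng, max_steps=50):
    """Follow the policy: action 1 moves right, action 0 moves left, reward -1 per step.
    Episodes that hit max_steps are simply truncated (a simplification)."""
    s, episode = 0, []
    for _ in range(max_steps):
        a = rng.choices(range(N_ACTIONS), weights=policy(theta, s))[0]
        episode.append((s, a, -1.0))  # (S_t, A_t, R_{t+1})
        s = s + 1 if a == 1 else max(0, s - 1)
        if s == N_STATES:
            break
    return episode

def reinforce_update(theta, episode, alpha, gamma):
    """Apply the REINFORCE update for every step t of a finished episode."""
    G, returns = 0.0, []
    for _, _, r in reversed(episode):  # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for t, ((s, a, _), G_t) in enumerate(zip(episode, returns)):
        g = grad_log_pi(theta, s, a)
        theta = [th + alpha * gamma ** t * G_t * gi for th, gi in zip(theta, g)]
    return theta

def train(episodes=300, alpha=0.05, gamma=1.0, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * N_ACTIONS)
    for _ in range(episodes):
        theta = reinforce_update(theta, generate_episode(theta, rng), alpha, gamma)
    return theta
```

Calling `train()` returns the learned weights, and `policy(theta, 0)[1]` is then the probability of moving right from the start state. Since every step costs -1, one would expect that probability to grow in this toy corridor, though the exact numbers depend on the seed and step size.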

REINFORCE with Baseline

Formula for the gradient of the policy's performance

\[\begin{align} \nabla \eta(\theta) & = \sum_s d_{\pi}(s) \sum_{a} q_{\pi}(s,a) \nabla_\theta \pi(a|s, \theta) \\ & = \sum_s d_{\pi}(s) \sum_{a} \left ( q_{\pi}(s,a) - b(s)\right ) \nabla_\theta \pi(a|s, \theta) \\ \end{align} \\ \because \\ \sum_{a} b(s) \nabla_\theta \pi(a|s, \theta) \\ \quad = b(s) \nabla_\theta \sum_{a} \pi(a|s, \theta) \\ \quad = b(s) \nabla_\theta 1 \\ \quad = 0 \\ where \\ b(s) \text{ - an arbitrary baseline function, e.g. } b(s) = \hat{v}(s, w) \\ \]
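The argument that the baseline term vanishes can be checked numerically. The sketch below assumes the softmax policy with linear preferences, for which \(\nabla_\theta \pi(a|s,\theta) = \pi(a|s,\theta) \left( \phi(s,a) - \sum_b \pi(b|s,\theta) \phi(s,b) \right)\); the feature values and function names are hypothetical.

```python
import math

def policy(theta, feats):
    """Softmax over linear preferences; feats[b] is phi(s, b)."""
    prefs = [sum(t * x for t, x in zip(theta, phi)) for phi in feats]
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_pi(theta, feats, a):
    """grad_theta pi(a|s, theta) = pi(a|s) * (phi(s, a) - sum_b pi(b|s) * phi(s, b))."""
    probs = policy(theta, feats)
    mean_phi = [sum(p * phi[i] for p, phi in zip(probs, feats))
                for i in range(len(theta))]
    return [probs[a] * (feats[a][i] - mean_phi[i]) for i in range(len(theta))]

def baseline_term(theta, feats, b):
    """sum_a b(s) * grad_theta pi(a|s, theta); expected to be the zero vector."""
    total = [0.0] * len(theta)
    for a in range(len(feats)):
        g = grad_pi(theta, feats, a)
        total = [tot + b * gi for tot, gi in zip(total, g)]
    return total

# Hypothetical phi(s, a) for three actions, and an arbitrary baseline value (assumptions).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta = [0.3, -1.2]
print(baseline_term(theta, feats, b=5.0))  # entries are zero up to rounding error
```

Because the term is zero for any \(b(s)\), the baseline leaves the expected gradient unchanged; its purpose is to reduce the variance of the sampled update.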

Update rule

\[\delta = G_t - \hat{v}(S_t, w) \\ w_{t+1} = w_{t} + \beta \delta \nabla_w \hat{v}(S_t, w) \\ \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta) \\ \]

Algorithm description
See the original book; it is not repeated here.
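The update rule above can be sketched for one finished episode as follows. This assumes a linear state-value estimate \(\hat{v}(s, w) = w^T x(s)\), so that \(\nabla_w \hat{v}(s, w) = x(s)\), and the softmax policy with linear preferences; the episode format and all names are hypothetical.

```python
import math

def policy(theta, feats):
    """Softmax over linear preferences; feats[b] is phi(s, b)."""
    prefs = [sum(t * x for t, x in zip(theta, phi)) for phi in feats]
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, feats, a):
    """phi(s, a) - sum_b pi(b|s, theta) * phi(s, b)."""
    probs = policy(theta, feats)
    return [feats[a][i] - sum(p * phi[i] for p, phi in zip(probs, feats))
            for i in range(len(theta))]

def v_hat(w, x):
    """Linear state-value estimate w^T x(s) (assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def baseline_update(theta, w, episode, alpha=0.1, beta=0.1, gamma=1.0):
    """episode: list of (x, feats, a, r) with x = x(S_t), feats[b] = phi(S_t, b),
    a = A_t and r = R_{t+1}. Returns the updated (theta, w)."""
    G, returns = 0.0, []
    for *_, r in reversed(episode):  # G_t = R_{t+1} + gamma * G_{t+1}
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for t, ((x, feats, a, _), G_t) in enumerate(zip(episode, returns)):
        delta = G_t - v_hat(w, x)                             # return minus baseline
        w = [wi + beta * delta * xi for wi, xi in zip(w, x)]  # grad_w v_hat = x(s)
        g = grad_log_pi(theta, feats, a)
        theta = [th + alpha * gamma ** t * delta * gi for th, gi in zip(theta, g)]
    return theta, w

# Hypothetical one-step episode: one state feature, two actions, reward +1.
episode = [([1.0], [[1.0, 0.0], [0.0, 1.0]], 0, 1.0)]
theta, w = baseline_update([0.0, 0.0], [0.5], episode)
print(theta, w)
```

When the baseline already predicts the return exactly, \(\delta = 0\) and neither weight vector moves.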

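Actor-Critic Methods

This algorithm is essentially:

A TD generalization of REINFORCE with baseline,
plus eligibility traces.

Note: Monte Carlo methods require the current episode to finish; only then can the return \(G_t\) be computed exactly.
TD removes this requirement (and so is more efficient): it bootstraps a one-step approximation of the return, \(G_t^{(1)} \approx G_t\), at the cost of some inaccuracy.
Eligibility traces replace the single-step gradient of the value function used in the weight update with a decaying sum of past gradients, \(e_t \doteq \nabla_w \hat{v}(S_t, w_t) + \gamma \lambda \ e_{t-1}\).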

Update rule

\[\delta = G_t^{(1)} - \hat{v}(S_t, w) \\ \quad = R_{t+1} + \gamma \hat{v}(S_{t+1}, w) - \hat{v}(S_t, w) \\ w_{t+1} = w_{t} + \beta \delta \nabla_w \hat{v}(S_t, w) \\ \theta_{t+1} = \theta_t + \alpha \gamma^t \delta \nabla_\theta \log \pi(A_t|S_t, \theta) \\ \]
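A single step of this one-step actor-critic update might look like the sketch below. It assumes a linear critic \(\hat{v}(s, w) = w^T x(s)\), a softmax actor with linear preferences, and a value of zero for terminal states; the function signature is hypothetical.

```python
import math

def policy(theta, feats):
    """Softmax over linear preferences; feats[b] is phi(s, b)."""
    prefs = [sum(t * x for t, x in zip(theta, phi)) for phi in feats]
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, feats, a):
    """phi(s, a) - sum_b pi(b|s, theta) * phi(s, b)."""
    probs = policy(theta, feats)
    return [feats[a][i] - sum(p * phi[i] for p, phi in zip(probs, feats))
            for i in range(len(theta))]

def v_hat(w, x):
    """Linear critic w^T x(s) (assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def actor_critic_step(theta, w, x, feats, a, r, x_next, done, t,
                      alpha=0.1, beta=0.1, gamma=0.9):
    """One online update from the transition (S_t, A_t, R_{t+1}, S_{t+1})."""
    v_next = 0.0 if done else v_hat(w, x_next)  # terminal states are worth 0
    delta = r + gamma * v_next - v_hat(w, x)    # one-step TD error
    new_w = [wi + beta * delta * xi for wi, xi in zip(w, x)]
    g = grad_log_pi(theta, feats, a)
    new_theta = [th + alpha * gamma ** t * delta * gi for th, gi in zip(theta, g)]
    return new_theta, new_w, delta

# Hypothetical transition between two one-hot states with two actions.
feats = [[1.0, 0.0], [0.0, 1.0]]
theta, w, delta = actor_critic_step([0.0, 0.0], [0.0, 2.0], [1.0, 0.0], feats,
                                    a=1, r=1.0, x_next=[0.0, 1.0], done=False, t=0)
print(theta, w, delta)
```

Unlike the Monte Carlo version, this update can be applied right after each transition, without waiting for the episode to end.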

Update rule with eligibility traces

\[\delta = R + \gamma \hat{v}(s', w) - \hat{v}(s, w) \\ e^w = \lambda^w e^w + \gamma^t \nabla_w \hat{v}(s, w) \\ w_{t+1} = w_{t} + \beta \delta e^w \\ e^{\theta} = \lambda^{\theta} e^{\theta} + \gamma^t \nabla_\theta \log \pi(A_t|S_t, \theta) \\ \theta_{t+1} = \theta_t + \alpha \delta e^{\theta} \\ \text{where} \\ R + \gamma \hat{v}(s', w) = G_t^{(1)} \\ \delta \text{ - TD error} \\ e^w \text{ - eligibility trace of the state value function} \\ e^{\theta} \text{ - eligibility trace of the policy} \\ \]

Algorithm description
See the original book; it is not repeated here.
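A sketch of one step of the eligibility-trace update, following the form of the update rule written in these notes (trace decay by \(\lambda\), increments scaled by \(\gamma^t\)). The linear critic, the softmax actor, the zero value for terminal states, and all names are assumptions for illustration.

```python
import math

def policy(theta, feats):
    """Softmax over linear preferences; feats[b] is phi(s, b)."""
    prefs = [sum(t * x for t, x in zip(theta, phi)) for phi in feats]
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, feats, a):
    """phi(s, a) - sum_b pi(b|s, theta) * phi(s, b)."""
    probs = policy(theta, feats)
    return [feats[a][i] - sum(p * phi[i] for p, phi in zip(probs, feats))
            for i in range(len(theta))]

def v_hat(w, x):
    """Linear critic w^T x(s) (assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def traced_step(theta, w, e_theta, e_w, x, feats, a, r, x_next, done, t,
                alpha=0.1, beta=0.1, gamma=0.9, lam_theta=0.5, lam_w=0.5):
    """One actor-critic step with separate traces for the critic and the actor."""
    v_next = 0.0 if done else v_hat(w, x_next)
    delta = r + gamma * v_next - v_hat(w, x)  # TD error
    discount = gamma ** t
    # Critic: decay the trace, add the current gradient, then move w along the trace.
    e_w = [lam_w * e + discount * xi for e, xi in zip(e_w, x)]
    w = [wi + beta * delta * e for wi, e in zip(w, e_w)]
    # Actor: same pattern with grad log pi in place of grad v_hat.
    g = grad_log_pi(theta, feats, a)
    e_theta = [lam_theta * e + discount * gi for e, gi in zip(e_theta, g)]
    theta = [th + alpha * delta * e for th, e in zip(theta, e_theta)]
    return theta, w, e_theta, e_w

# Hypothetical two-step episode over two one-hot states with two actions.
feats = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
theta, w, e_theta, e_w = traced_step(zero, zero, zero, zero, [1.0, 0.0], feats,
                                     a=0, r=1.0, x_next=[0.0, 1.0], done=False, t=0)
theta2, w2, e_theta2, e_w2 = traced_step(theta, w, e_theta, e_w, [0.0, 1.0], feats,
                                         a=1, r=0.0, x_next=[0.0, 0.0], done=True, t=1)
print(e_w2, w2)
```

The traces should be reset to zero at the start of every episode, so that credit does not leak across episode boundaries.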

Policy Gradient for Continuing Problems (Average Reward Rate)

Formula for the policy's performance
For a continuing task, the policy's performance is the average reward per time step.

\[\begin{align} \eta(\theta) \doteq r(\theta) & \doteq \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^n \mathbb{E} [R_t|\theta_0=\theta_1=\dots=\theta_{t-1}=\theta] \\ & = \lim_{t \to \infty} \mathbb{E} [R_t|\theta_0=\theta_1=\dots=\theta_{t-1}=\theta] \\ \end{align} \]

Update rule: Actor-Critic with eligibility traces (continuing)

\[\delta = R - \bar{R} + \hat{v}(s', w) - \hat{v}(s, w) \\ \bar{R} = \bar{R} + \eta \delta \\ e^w = \lambda^w e^w + \nabla_w \hat{v}(s, w) \\ w_{t+1} = w_{t} + \beta \delta e^w \\ e^{\theta} = \lambda^{\theta} e^{\theta} + \nabla_\theta \log \pi(A_t|S_t, \theta) \\ \theta_{t+1} = \theta_t + \alpha \delta e^{\theta} \\ \text{where} \\ \bar{R} \text{ - running estimate of the average reward } r(\theta) \\ \eta \text{ - here, the step size for } \bar{R} \\ \delta \text{ - TD error (undiscounted, relative to } \bar{R} \text{)} \\ e^w \text{ - eligibility trace of the state value function} \\ e^{\theta} \text{ - eligibility trace of the policy} \\ \]

Algorithm description (Actor-Critic with eligibility traces (continuing))
See the original book; it is not repeated here.
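For illustration, one step of the continuing-task update could be sketched as below. This assumes the undiscounted average-reward form (no \(\gamma\): the TD error is measured relative to the running average reward \(\bar{R}\)), a linear critic, and a softmax actor with linear preferences; `eta` is the step size for \(\bar{R}\), and all names are hypothetical.

```python
import math

def policy(theta, feats):
    """Softmax over linear preferences; feats[b] is phi(s, b)."""
    prefs = [sum(t * x for t, x in zip(theta, phi)) for phi in feats]
    m = max(prefs)
    exps = [math.exp(h - m) for h in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(theta, feats, a):
    """phi(s, a) - sum_b pi(b|s, theta) * phi(s, b)."""
    probs = policy(theta, feats)
    return [feats[a][i] - sum(p * phi[i] for p, phi in zip(probs, feats))
            for i in range(len(theta))]

def v_hat(w, x):
    """Linear critic w^T x(s) (assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x))

def continuing_step(theta, w, e_theta, e_w, avg_r, x, feats, a, r, x_next,
                    alpha=0.1, beta=0.1, eta=0.01, lam_theta=0.5, lam_w=0.5):
    """One average-reward actor-critic step; there is no terminal state."""
    delta = r - avg_r + v_hat(w, x_next) - v_hat(w, x)  # TD error relative to avg_r
    avg_r = avg_r + eta * delta                         # track the average reward
    e_w = [lam_w * e + xi for e, xi in zip(e_w, x)]
    w = [wi + beta * delta * e for wi, e in zip(w, e_w)]
    g = grad_log_pi(theta, feats, a)
    e_theta = [lam_theta * e + gi for e, gi in zip(e_theta, g)]
    theta = [th + alpha * delta * e for th, e in zip(theta, e_theta)]
    return theta, w, e_theta, e_w, avg_r

# Hypothetical transition between two one-hot states with two actions.
feats = [[1.0, 0.0], [0.0, 1.0]]
zero = [0.0, 0.0]
theta, w, e_theta, e_w, avg_r = continuing_step(zero, zero, zero, zero, 0.0,
                                                [1.0, 0.0], feats, a=0, r=1.0,
                                                x_next=[0.0, 1.0])
print(theta, w, avg_r)
```

Because the task never terminates, the traces are never reset and the same step is simply repeated on every transition.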
The original book is not finished yet, so this chapter stops here for now.
