Reinforcement Learning Reading Notes - 10 - On-policy Control with Approximation

Study notes on:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto © 2014, 2015, 2016

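References

For the mathematical notation used in reinforcement learning, read this first:

Reinforcement Learning Reading Notes - 00 - Terminology and Mathematical Notation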

On-policy Control with Approximation

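Approximate control methods estimate a policy's action-value function \(q_{\pi}(s, a)\) with a parameterized approximation \(\hat{q}(s, a, \theta)\), where \(\theta\) is the weight vector being learned.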
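As a minimal sketch of what \(\hat{q}(s, a, \theta)\) can look like, the block below assumes a linear approximator over one-hot state-action features. The feature layout and the helper names (`features`, `q_hat`, `grad_q_hat`) are illustrative assumptions, not something the book prescribes.

```python
from typing import List


def features(state: int, action: int, n_states: int, n_actions: int) -> List[float]:
    """One-hot feature vector x(s, a); a hypothetical choice made for illustration."""
    x = [0.0] * (n_states * n_actions)
    x[state * n_actions + action] = 1.0
    return x


def q_hat(state: int, action: int, theta: List[float],
          n_states: int, n_actions: int) -> float:
    """Linear approximation: q_hat(s, a, theta) = theta . x(s, a)."""
    x = features(state, action, n_states, n_actions)
    return sum(t * xi for t, xi in zip(theta, x))


def grad_q_hat(state: int, action: int, theta: List[float],
               n_states: int, n_actions: int) -> List[float]:
    """For a linear q_hat, the gradient with respect to theta is simply x(s, a)."""
    return features(state, action, n_states, n_actions)


# 2 states x 2 actions, so theta has 4 components.
theta = [0.0, 0.5, 1.0, -1.0]
print(q_hat(1, 0, theta, 2, 2))       # 1.0
print(grad_q_hat(0, 1, theta, 2, 2))  # [0.0, 1.0, 0.0, 0.0]
```

With one-hot features this reduces to a lookup table, which makes the updates easy to check by hand; richer features (tile coding, for example) would plug into the same two functions.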

Episodic Semi-gradient Sarsa for Control

Input: a differentiable function \(\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^n \to \mathbb{R}\)

Initialize value-function weights \(\theta \in \mathbb{R}^n\) arbitrarily (e.g., \(\theta = 0\))
Repeat (for each episode):
    \(S, A \gets\) initial state and action of episode (e.g., \(\epsilon\)-greedy)
    Repeat (for each step of episode):
        Take action \(A\), observe \(R, S'\)
        If \(S'\) is terminal:
            \(\theta \gets \theta + \alpha [R - \hat{q}(S, A, \theta)] \nabla \hat{q}(S, A, \theta)\)
            Go to next episode
        Choose \(A'\) as a function of \(\hat{q}(S', \cdot, \theta)\) (e.g., \(\epsilon\)-greedy)
        \(\theta \gets \theta + \alpha [R + \gamma \hat{q}(S', A', \theta) - \hat{q}(S, A, \theta)] \nabla \hat{q}(S, A, \theta)\)
        \(S \gets S'\)
        \(A \gets A'\)
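The procedure above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the corridor environment (`step`), the one-hot features (`x`), and all hyperparameter values are hypothetical and chosen to keep the example small; they do not come from the book.

```python
import random
from typing import List, Tuple

N_STATES = 5         # non-terminal states 0..4 of a hypothetical corridor
N_ACTIONS = 2        # 0 = left, 1 = right
TERMINAL = N_STATES  # stepping right from state 4 ends the episode


def step(s: int, a: int) -> Tuple[float, int]:
    """Toy environment (assumption): reward -1 per step, wall on the left."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    return -1.0, s_next


def x(s: int, a: int) -> List[float]:
    """One-hot features; for a linear q_hat this is also its gradient."""
    v = [0.0] * (N_STATES * N_ACTIONS)
    v[s * N_ACTIONS + a] = 1.0
    return v


def q_hat(s: int, a: int, theta: List[float]) -> float:
    return sum(t * xi for t, xi in zip(theta, x(s, a)))


def epsilon_greedy(s: int, theta: List[float], eps: float, rng: random.Random) -> int:
    if rng.random() < eps:
        return rng.randrange(N_ACTIONS)
    qs = [q_hat(s, a, theta) for a in range(N_ACTIONS)]
    best = max(qs)
    return rng.choice([a for a, q in enumerate(qs) if q == best])  # random tie-break


def semi_gradient_sarsa(episodes: int = 200, alpha: float = 0.1, gamma: float = 1.0,
                        eps: float = 0.1, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * N_ACTIONS)
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(s, theta, eps, rng)
        while True:
            r, s_next = step(s, a)
            grad = x(s, a)
            if s_next == TERMINAL:
                # Terminal update: the target is just R.
                delta = r - q_hat(s, a, theta)
                theta = [t + alpha * delta * g for t, g in zip(theta, grad)]
                break
            a_next = epsilon_greedy(s_next, theta, eps, rng)
            # Semi-gradient: differentiate only q_hat(S, A), not the target.
            delta = r + gamma * q_hat(s_next, a_next, theta) - q_hat(s, a, theta)
            theta = [t + alpha * delta * g for t, g in zip(theta, grad)]
            s, a = s_next, a_next
    return theta


theta = semi_gradient_sarsa()
# (state 4, action right) always terminates with reward -1, so its value tends to -1.
print(round(theta[4 * N_ACTIONS + 1], 3))  # -1.0
```

The printed weight is the one case that is easy to verify by hand: that state-action pair only ever receives the terminal update with target \(-1\), once per episode.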

n-step Semi-gradient Sarsa for Control

See the original book; it is not repeated here.

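Average Reward (for continuing tasks)

The discounting rate \(\gamma\) causes problems when combined with function approximation (the book says the next chapter explains what these problems are).
For continuing tasks, the average reward \(\eta(\pi)\) is therefore introduced: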

\[\begin{align} \eta(\pi) & \doteq \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E} [R_t | A_{0:t-1} \sim \pi] \\ & = \lim_{t \to \infty} \mathbb{E} [R_t | A_{0:t-1} \sim \pi] \\ & = \sum_s d_{\pi}(s) \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a)r \end{align} \]
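To make the last line of the definition concrete, here is a small worked sketch that computes \(d_{\pi}\) by power iteration and then evaluates \(\eta(\pi)\). The two-state MDP, its rewards, and the uniform random policy are all made-up assumptions for illustration, and power iteration is only expected to converge when the induced chain is ergodic (which holds for this toy case).

```python
from typing import Dict, List, Tuple

N_STATES, N_ACTIONS = 2, 2

# Hypothetical MDP: dynamics[(s, a)] lists (probability, next_state, reward),
# i.e. a tabulation of p(s', r | s, a).
dynamics: Dict[Tuple[int, int], List[Tuple[float, int, float]]] = {
    (0, 0): [(1.0, 0, 1.0)],  # stay in state 0, reward 1
    (0, 1): [(1.0, 1, 0.0)],  # move to state 1, reward 0
    (1, 0): [(1.0, 1, 0.0)],  # stay in state 1, reward 0
    (1, 1): [(1.0, 0, 2.0)],  # move back to state 0, reward 2
}

# policy[s][a] = pi(a | s); a uniform random policy, chosen for easy arithmetic.
policy = [[0.5, 0.5], [0.5, 0.5]]


def stationary_distribution(iters: int = 1000) -> List[float]:
    """Approximate d_pi by repeatedly pushing a distribution through the chain."""
    d = [1.0 / N_STATES] * N_STATES
    for _ in range(iters):
        new_d = [0.0] * N_STATES
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                for prob, s_next, _reward in dynamics[(s, a)]:
                    new_d[s_next] += d[s] * policy[s][a] * prob
        d = new_d
    return d


def average_reward(d: List[float]) -> float:
    """eta(pi) = sum_s d(s) sum_a pi(a|s) sum_{s',r} p(s',r|s,a) r."""
    return sum(
        d[s] * policy[s][a] * prob * reward
        for s in range(N_STATES)
        for a in range(N_ACTIONS)
        for prob, _s_next, reward in dynamics[(s, a)]
    )


d_pi = stationary_distribution()
eta = average_reward(d_pi)
print(d_pi)  # [0.5, 0.5]
print(eta)   # 0.75
```

By hand: the policy makes both states equally likely, state 0 yields an expected reward of 0.5 and state 1 yields 1.0, so \(\eta(\pi) = 0.5 \cdot 0.5 + 0.5 \cdot 1.0 = 0.75\).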

Differential return (each reward minus the average reward)

\[G_t \doteq R_{t+1} - \eta(\pi) + R_{t+2} - \eta(\pi) + \cdots \]

Value functions of a policy

\[v_{\pi}(s) = \sum_{a} \pi(a|s) \sum_{r,s'} p(s',r|s,a)[r - \eta(\pi) + v_{\pi}(s')] \\ q_{\pi}(s,a) = \sum_{r,s'} p(s',r|s,a)[r - \eta(\pi) + \sum_{a'} \pi(a'|s') q_{\pi}(s',a')] \\ \]

Optimal value functions

\[v_{*}(s) = \max_{a} \sum_{r,s'} p(s',r|s,a)[r - \eta(\pi) + v_{*}(s')] \\ q_{*}(s,a) = \sum_{r,s'} p(s',r|s,a)[r - \eta(\pi) + \max_{a'} q_{*}(s',a')] \\ \]

TD errors

\[\delta_t \doteq R_{t+1} - \bar{R} + \hat{v}(S_{t+1},\theta) - \hat{v}(S_{t},\theta) \\ \delta_t \doteq R_{t+1} - \bar{R} + \hat{q}(S_{t+1},A_{t+1},\theta) - \hat{q}(S_{t},A_t,\theta) \\ \text{where } \bar{R} \text{ is an estimate of the average reward } \eta(\pi) \]

Average-reward version of semi-gradient Sarsa

\[\theta_{t+1} \doteq \theta_t + \alpha \delta_t \nabla \hat{q}(S_{t},A_t,\theta) \]

Average-reward (differential) semi-gradient Sarsa algorithm (for continuing tasks)

Input: a differentiable function \(\hat{q} : \mathcal{S} \times \mathcal{A} \times \mathbb{R}^n \to \mathbb{R}\)
Parameters: step sizes \(\alpha, \beta > 0\)

Initialize value-function weights \(\theta \in \mathbb{R}^n\) arbitrarily (e.g., \(\theta = 0\))
Initialize average reward estimate \(\bar{R}\) arbitrarily (e.g., \(\bar{R} = 0\))
Initialize state \(S\), and action \(A\)

Repeat (for each step):
    Take action \(A\), observe \(R, S'\)
    Choose \(A'\) as a function of \(\hat{q}(S', \cdot, \theta)\) (e.g., \(\epsilon\)-greedy)
    \(\delta \gets R - \bar{R} + \hat{q}(S', A', \theta) - \hat{q}(S, A, \theta)\)
    \(\bar{R} \gets \bar{R} + \beta \delta\)
    \(\theta \gets \theta + \alpha \delta \nabla \hat{q}(S, A, \theta)\)
    \(S \gets S'\)
    \(A \gets A'\)
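A rough Python sketch of this loop is given below. The two-state continuing task (`NEXT_STATE`, `REWARD`), the one-hot features, and the step sizes are assumptions invented for the example. The single-step `update` function mirrors the three assignments to \(\delta\), \(\bar{R}\), and \(\theta\) in the pseudocode, using the old \(\bar{R}\) when computing \(\delta\).

```python
import random
from typing import List, Tuple

N_STATES, N_ACTIONS = 2, 2

# Hypothetical deterministic continuing task (no terminal state).
NEXT_STATE = [[0, 1], [1, 0]]      # NEXT_STATE[s][a]
REWARD = [[1.0, 0.0], [0.0, 3.0]]  # REWARD[s][a]


def x(s: int, a: int) -> List[float]:
    """One-hot features; also the gradient of the linear q_hat."""
    v = [0.0] * (N_STATES * N_ACTIONS)
    v[s * N_ACTIONS + a] = 1.0
    return v


def q_hat(s: int, a: int, theta: List[float]) -> float:
    return sum(t * xi for t, xi in zip(theta, x(s, a)))


def epsilon_greedy(s: int, theta: List[float], eps: float, rng: random.Random) -> int:
    if rng.random() < eps:
        return rng.randrange(N_ACTIONS)
    qs = [q_hat(s, a, theta) for a in range(N_ACTIONS)]
    best = max(qs)
    return rng.choice([a for a, q in enumerate(qs) if q == best])


def update(theta: List[float], r_bar: float, s: int, a: int, r: float,
           s_next: int, a_next: int, alpha: float, beta: float
           ) -> Tuple[List[float], float]:
    """One differential Sarsa step: compute delta, then update R_bar and theta."""
    delta = r - r_bar + q_hat(s_next, a_next, theta) - q_hat(s, a, theta)
    new_r_bar = r_bar + beta * delta
    new_theta = [t + alpha * delta * g for t, g in zip(theta, x(s, a))]
    return new_theta, new_r_bar


def differential_sarsa(steps: int = 20000, alpha: float = 0.1, beta: float = 0.01,
                       eps: float = 0.1, seed: int = 0) -> Tuple[List[float], float]:
    rng = random.Random(seed)
    theta = [0.0] * (N_STATES * N_ACTIONS)
    r_bar = 0.0
    s = 0
    a = epsilon_greedy(s, theta, eps, rng)
    for _ in range(steps):
        r, s_next = REWARD[s][a], NEXT_STATE[s][a]
        a_next = epsilon_greedy(s_next, theta, eps, rng)
        theta, r_bar = update(theta, r_bar, s, a, r, s_next, a_next, alpha, beta)
        s, a = s_next, a_next
    return theta, r_bar


theta, r_bar = differential_sarsa()
print("weights:", theta)
print("average reward estimate:", r_bar)
```

A single step is easy to check by hand: starting from \(\theta = 0\) and \(\bar{R} = 0\), taking the action with reward 3 gives \(\delta = 3\), so \(\bar{R}\) becomes \(0.01 \cdot 3 = 0.03\) and the corresponding weight becomes \(0.1 \cdot 3 = 0.3\).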

n-step Semi-gradient Sarsa for Control - average-reward version (for continuing tasks)

See the original book; it is not repeated here.
