[Reinforcement Learning] Model-Free Control

本文轉載自查看原文 2018-10-31 10:40 1355 RL/ Algorithm/ Machine Learning/ ML

上篇總結了 Model-Free Predict 問題及方法，本文內容介紹 Model-Free Control 方法，即 "Optimise the value function of an unknown MDP"。

在這里說明下，Model-Free Predict/Control 不僅適用於 Model-Free 的情況，其同樣適用於 MDP 已知的問題：

MDP model is unknown, but experience can be sampled.
MDP model is known, but is too big to use, except by samples.

在正式介紹 Model-Free Control 方法之前，我們先介紹下 On-policy Learning 及 Off-policy Learning。

On-policy Learning vs. Off-policy Learning

On-policy Learning：

"Learn on the job"
Learn about policy \(\pi\) from experience sampled from \(\pi\)（即采樣的策略與學習的策略一致）

Off-policy Learning：

"Look over someone's shoulder"
Learn about policy \(\pi\) from experience sampled from \(\mu\)（即采樣的策略與學習的策略不一致）

On-Policy Monte-Carlo Learning

Generalized Policy Iteration

具體的 Control 方法，在《動態規划》一文中我們提到了 Model-based 下的廣義策略迭代 GPI 框架，那在 Model-Free 情況下是否同樣適用呢？
如下圖為 Model-based 下的廣義策略迭代 GPI 框架，主要分兩部分：策略評估及基於 Greedy 策略的策略提升。

![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094138877-1709997119.png)

Model-Free 策略評估

在《Model-Free Predict》中我們分別介紹了兩種 Model-Free 的策略評估方法：MC 和 TD。我們先討論使用 MC 情況下的 Model-Free 策略評估。
如上圖GPI框架所示：

基於 \(V(s)\) 的貪婪策略提升需要 MDP 已知：

\[\pi'(s) = \arg\max_{a\in A}\Bigl(R_{s}^{a}+P_{ss'}^{a}V(s')\Bigr) \]

基於 \(Q(s, a)\) 的貪婪策略提升不需要 MDP 已知，即 Model-Free：

\[\pi'(s) = \arg\max_{a\in A}Q(s, a) \]

因此 Model-Free 下需要對 \(Q(s, a)\) 策略評估，整個GPI策略迭代也要基於 \(Q(s, a)\)。

Model-Free 策略提升

確定了策略評估的對象，那接下來要考慮的就是如何基於策略評估的結果 \(Q(s, a)\) 進行策略提升。
由於 Model-Free 的策略評估基於對經驗的 samples（即評估的 \(q(s, a)\) 存在 bias），因此我們在這里不采用純粹的 greedy 策略，防止因為策略評估的偏差導致整個策略迭代進入局部最優，而是采用具有 explore 功能的 \(\epsilon\)-greedy 算法：

\[\pi(a|s) = \begin{cases} &\frac{\epsilon}{m} + 1 - \epsilon, &\text{if } a^*=\arg\max_{a\in A}Q(s, a)\\ &\frac{\epsilon}{m}, &\text{otherwise} \end{cases} \]

因此，我們確定了 Model-Free 下的 Monto-Carlo Control：

![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094153831-1123656335.png)

GLIE

先直接貼下David的課件，GLIE 介紹如下：

![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094208835-610523224.png)

對於 \(\epsilon\)-greedy 算法而言，如果 \(\epsilon\) 隨着迭代次數逐步減為0，那么 \(\epsilon\)-greedy 是 GLIE，即：

\[\epsilon_{k} = \frac{1}{k} \]

GLIE Monto-Carlo Control

GLIE Monto-Carlo Control：

對於 episode 中的每個狀態 \(S_{t}\) 和動作 \(A_t\)：

\[N(S_t, A_t) ← N(S_t, A_t) + 1 \\ Q(S_t, A_t) ← Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t)) \]

基於新的動作價值函數提升策略：

\[\epsilon ← \frac{1}{k}\\ \pi ← \epsilon\text{-greedy}(Q) \]

定理：GLIE Monto-Carlo Control 收斂到最優的動作價值函數，即：\(Q(s, a) → q_*(s, a)\)。

On-Policy Temporal-Difference Learning

Sarsa

我們之前總結過 TD 相對 MC 的優勢：

低方差
Online
非完整序列

那么一個很自然的想法就是在整個控制閉環中用 TD 代替 MC：

使用 TD 來計算 \(Q(S, A)\)
仍然使用 \(\epsilon\)-greedy 策略提升
每一個 step 進行更新

通過上述改變就使得 On-Policy 的蒙特卡洛方法變成了著名的 Sarsa。

更新動作價值函數

![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094221840-942626866.png)

- Control

![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094230851-277293106.png)

Sarsa算法的偽代碼如下：

Sarsa(λ)

n-step Sarsa returns 可以表示如下：
\(n=1\) 時：\(q_{t}^{(1)} = R_{t+1} + \gamma Q(S_{t+1})\)
\(n=2\) 時：\(q_{t}^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2})\)
...
\(n=\infty\) 時：\(q_{t}^{\infty} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-1} R_T\)
因此，n-step return \(q_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n}Q(S_{t+n})\)

n-step Sarsa 更新公式：

\[Q(S_t, A_t) ← Q(S_t, A_t) + \alpha (q_t^{(n)} - Q(S_t, A_t)) \]

具體的 Sarsa(λ) 算法偽代碼如下：

其中 \(E(s, a)\) 為資格跡。

下圖為 Sarsa(λ) 用於 Gridworld 例子的示意圖：

![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094258849-595223970.png)

Off-Policy Learning

Off-Policy Learning 的特點是評估目標策略 \(\pi(a|s)\) 來計算 \(v_{\pi}(s)\) 或者 \(q_{\pi}(s, a)\)，但是跟隨行為策略 \(\{S_1, A_1, R_2, ..., S_T\}\sim\mu(a|s)\)。

Off-Policy Learning 有什么意義？

Learn from observing humans or other agents
Re-use experience generated from old policies \(\pi_1, \pi_2, ..., \pi_{t-1}\)
Learn about optimal policy while following exploratory policy
Learn about multiple policies while following one policy

重要性采樣

重要性采樣的目的是：Estimate the expectation of a different distribution。

\[\begin{align} E_{X\sim P}[f(X)] &= \sum P(X)f(X)\\ &= \sum Q(X)\frac{P(X)}{Q(X)}f(X)\\ &= E_{X\sim Q}[\frac{P(X)}{Q(X)}f(X)] \end{align} \]

Off-Policy MC 重要性采樣

使用策略 \(\pi\) 產生的 return 來評估 \(\mu\)：

\[G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})}...\frac{\pi(A_T|S_T)}{\mu(A_T|S_T)}G_t \]

朝着正確的 return 方向去更新價值：

\[V(S_t) ← V(S_t) + \alpha\Bigl(\color{Red}{G_t^{\pi/\mu}}-V(S_t)\Bigr) \]

需要注意兩點：

Cannot use if \(\mu\) is zero when \(\pi\) is non-zero
重要性采樣會顯著性地提升方差

Off-Policy TD 重要性采樣

TD 是單步的，所以使用策略 \(\pi\) 產生的 TD targets 來評估 \(\mu\)：

\[V(S_t) ← V(S_t) + \alpha\Bigl(\frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}(R_{t+1}+\gamma V(S_{t+1}))-V(S_t)\Bigr) \]

方差比MC版本的重要性采樣低很多

Q-Learning

前面分別介紹了對價值函數 \(V(s)\) 進行 off-policy 學習，現在我們討論如何對動作價值函數 \(Q(s, a)\) 進行 off-policy 學習：

不需要重要性采樣
使用行為策略選出下一步的動作：\(A_{t+1}\sim\mu(·|S_t)\)
但是仍需要考慮另一個后繼動作：\(A'\sim\pi(·|S_t)\)
朝着另一個后繼動作的價值更新 \(Q(S_t, A_t)\)：

\[Q(S_t, A_t) ← Q(S_t, A_t) + \alpha\Bigl(R_{t+1}+\gamma Q(S_{t+1}, A')-Q(S_t, A_t)\Bigr) \]

討論完對動作價值函數的學習，我們接着看如何通過 Q-Learning 進行 Control：

行為策略和目標策略均改進
目標策略 \(\pi\) 以greedy方式改進：

\[\pi(S_t) = \arg\max_{a'}Q(S_{t+1}, a') \]

行為策略 \(\mu\) 以 \(\epsilon\)-greedy 方式改進
Q-Learning target：

\[\begin{align} &R_{t+1}+\gamma Q(S_{t+1}, A')\\ =&R_{t+1}+\gamma Q\Bigl(S_{t+1}, \arg\max_{a'}Q(S_{t+1}, a')\Bigr)\\ =&R_{t+1}+\max_{a'}\gamma Q(S_{t+1}, a') \end{align} \]

Q-Learning 的 backup tree 如下所示：

![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094313903-820222072.png)

關於 Q-Learning 的結論：

Q-learning control converges to the optimal action-value function, \(Q(s, a)→q_*(s, a)\)

Q-Learning 算法具體的偽代碼如下：

![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094323853-141348712.png)

對比 Sarsa 與 Q-Learning 可以發現兩個最重要的區別：

TD target 公式不同
Q-Learning 中下一步的動作從行為策略中選出，而不是目標策略

DP vs. TD

兩者的區別見下表：

![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094333833-1951913901.png) ![](https://img2018.cnblogs.com/blog/764050/201810/764050-20181030094341855-1641614291.png)

Reference

[1] Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2018
[2] David Silver's Homepage

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 [Reinforcement Learning] Model-Free Prediction 【強化學習RL】model-free的prediction和control — MC, TD(λ), SARSA, Q-learning等 Reinforcement Learning github上DQN代碼的環境搭建，及運行（Human-Level Control through Deep Reinforcement Learning）conda配置深度強化學習介紹【PPT】 Human-level control through deep reinforcement learning （DQN） Continuous Control with Deep Reinforcement A Distributional Perspective on Reinforcement Learning Learning to Reinforcement Learn 在Matlab 上使用 Reinforcement learning RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning