Reinforcement Learning Reading Notes - 08 - Planning Methods and Learning Methods

Study notes on:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016

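For the mathematical notation used in reinforcement learning, see this post first:

Reinforcement Learning Reading Notes - 00 - Terminology and Mathematical Notation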

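What is a model?

A model is a model of the environment: the agent can use it to predict how the environment will respond to an action.
For a stochastic environment, there are two kinds of model:

distribution model - returns every possible outcome of an action together with its probability.
sample model - returns a single outcome of an action, sampled according to those probabilities.

Mathematical form of a sample model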

\[(R, S') = model(S, A) \]
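To make the difference between the two kinds of model concrete, here is a minimal Python sketch. The toy dynamics table `DYNAMICS`, the state and action names, and both function names are assumptions made up for illustration; they do not come from the book.

```python
import random

# Hypothetical toy dynamics (an assumption for illustration):
# (state, action) -> list of (probability, reward, next_state)
DYNAMICS = {
    ("s0", "go"): [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
}

def distribution_model(state, action):
    """Distribution model: return every possible outcome with its probability."""
    return DYNAMICS[(state, action)]

def sample_model(state, action, rng=random):
    """Sample model: return one (R, S') pair drawn according to the probabilities."""
    outcomes = distribution_model(state, action)
    weights = [p for p, _, _ in outcomes]
    _, reward, next_state = rng.choices(outcomes, weights=weights, k=1)[0]
    return reward, next_state

# Usage: one simulated transition, matching (R, S') = model(S, A)
R, S_next = sample_model("s0", "go", random.Random(0))
```

A sample model can always be built from a distribution model, as here, but not the other way round without estimating the probabilities from many samples.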

Planning methods and learning methods (Planning and Learning with Tabular Methods)

planning methods - obtain value information through a model (state transitions for actions, rewards, and so on).
Examples: dynamic programming and heuristic search.
Planning with a model amounts to model simulation.
learning methods - obtain value information through experience.
Examples: Monte Carlo methods and temporal-difference methods.

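Monte Carlo tree search is a planning method and requires a sample model, whereas the Monte Carlo method is a learning method.
There is no contradiction here: it only means that the experience a learning method needs can be generated by a model, which yields simulated experience.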

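Similarity between planning methods and learning methods
Both planning methods and learning methods improve the policy by computing value functions, so the two can be combined.
See the example in the book: Random-sample one-step tabular Q-planning.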
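Random-sample one-step tabular Q-planning can be sketched roughly as follows: pick a state-action pair at random, ask a sample model for a reward and next state, and apply the one-step Q-learning update to that simulated transition. The function name `q_planning`, the toy `chain_model`, and all parameter values are assumptions for illustration only.

```python
import random
from collections import defaultdict

def q_planning(sample_model, states, actions, n_updates,
               alpha=0.1, gamma=0.9, seed=0):
    """Random-sample one-step tabular Q-planning (illustrative sketch)."""
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(state, action)], initialised to 0
    for _ in range(n_updates):
        # 1. Select a state and an action at random.
        s = rng.choice(states)
        a = rng.choice(actions)
        # 2. Ask the sample model for a simulated reward and next state.
        r, s2 = sample_model(s, a)
        # 3. Apply the one-step tabular Q-learning update.
        best_next = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q

def chain_model(s, a):
    """Hypothetical deterministic chain 0 -> 1 -> 2, where 2 is terminal."""
    if a == "right":
        s2 = s + 1
        return (1.0 if s2 == 2 else 0.0), s2
    return 0.0, s  # "stay" leaves the state unchanged

# Usage: plan over the two nonterminal states only.
Q = q_planning(chain_model, states=[0, 1], actions=["stay", "right"],
               n_updates=5000)
```

No real environment is touched here: every update comes from the model, which is what makes this planning rather than learning.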

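Planning methods

Planning means learning through a model, that is, using the model to improve the policy. There are two kinds:

state-space planning
This is the kind covered in the book.
plan-space planning
Not covered in the book.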

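Dyna - integrating model learning and direct reinforcement learning

model learning - the process of improving the model from experience.
direct reinforcement learning - the process of improving the policy from experience.

The idea is to use experience both to improve the policy directly and to improve the model, which is then used to improve the policy again. See the corresponding diagram in the book.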

Tabular Dyna-Q

Initialize $Q(s, a)$ and $Model(s, a)$ for all $s \in S$ and $a \in A(s)$
Do forever:
  (a) $S \gets$ current (nonterminal) state
  (b) $A \gets \epsilon\text{-greedy}(S, Q)$
  (c) Execute action $A$; observe resultant reward, $R$, and state, $S'$
  (d) $Q(S, A) \gets Q(S, A) + \alpha [R + \gamma \underset{a}{\max} \ Q(S', a) - Q(S, A)]$
  (e) $Model(S, A) \gets R, S'$ (assuming deterministic environment)
  (f) Repeat $n$ times:
    $S \gets$ random previously observed state
    $A \gets$ random action previously taken in $S$
    $R, S' \gets Model(S, A)$
    $Q(S, A) \gets Q(S, A) + \alpha [R + \gamma \underset{a}{\max} \ Q(S', a) - Q(S, A)]$
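The pseudocode above can be turned into a short Python sketch. The `step(s, a) -> (reward, next_state)` interface, the `corridor_step` toy environment, and all hyperparameter values are assumptions chosen for illustration; the step labels in the comments refer to the pseudocode.

```python
import random
from collections import defaultdict

def dyna_q(step, start, terminal, actions, n, episodes,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q for a deterministic environment (illustrative sketch)."""
    rng = random.Random(seed)
    Q = defaultdict(float)          # Q[(s, a)], initialised to 0
    model = defaultdict(dict)       # model[s][a] = (r, s')
    real_steps = 0                  # number of real environment interactions

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])

    def update(s, a, r, s2):
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s = start                                              # (a)
        while s != terminal:
            explore = rng.random() < epsilon
            a = rng.choice(actions) if explore else greedy(s)  # (b)
            r, s2 = step(s, a)                                 # (c)
            real_steps += 1
            update(s, a, r, s2)                                # (d)
            model[s][a] = (r, s2)                              # (e)
            for _ in range(n):                                 # (f)
                ps = rng.choice(list(model))      # previously observed state
                pa = rng.choice(list(model[ps]))  # action previously taken there
                pr, ps2 = model[ps][pa]
                update(ps, pa, pr, ps2)
            s = s2
    return Q, real_steps

def corridor_step(s, a, goal=6):
    """Hypothetical corridor 0..goal; reward 1.0 only on reaching the goal."""
    s2 = min(max(s + a, 0), goal)
    return (1.0 if s2 == goal else 0.0), s2

# Usage: n planning updates per real step; n=0 gives plain Q-learning.
Q, steps = dyna_q(corridor_step, start=0, terminal=6, actions=[-1, 1],
                  n=10, episodes=20)
```

Running it with different values of `n` and comparing `steps` is a simple way to see how much real experience the planning updates save on a given problem.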

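Understanding
If $n = 0$, the algorithm above reduces to one-step tabular Q-learning. The advantage of Dyna-Q is performance: it needs fewer real interactions with the environment to reach a good policy.
I think the main reason is that building a model reduces how often step (c) has to be executed: the model has learned $Model(S, A) \gets R, S'$, so step (f) can replay those transitions without acting in the environment.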

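Prioritized Sweeping

The algorithm below is a performance optimization: during planning it updates only those state-action values whose estimated change exceeds a threshold $\theta$, largest first.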

Initialize $Q(s, a)$, $Model(s, a)$ for all $s$ and $a$, and $PQueue$ to empty
Do forever:
  (a) $S \gets$ current (nonterminal) state
  (b) $A \gets policy(S, Q)$
  (c) Execute action $A$; observe resultant reward, $R$, and state, $S'$
  (d) $Model(S, A) \gets R, S'$
  (e) $P \gets |R + \gamma \underset{a}{\max} \ Q(S', a) - Q(S, A)|$
  (f) if $P > \theta$, then insert $S, A$ into $PQueue$ with priority $P$
  (g) Repeat $n$ times, while $PQueue$ is not empty:
    $S, A \gets first(PQueue)$ (this also removes the first element)
    $R, S' \gets Model(S, A)$
    $Q(S, A) \gets Q(S, A) + \alpha [R + \gamma \underset{a}{\max} \ Q(S', a) - Q(S, A)]$
    Repeat, for all $\overline{S}, \overline{A}$ predicted to lead to $S$:
      $\overline{R} \gets$ predicted reward for $\overline{S}, \overline{A}, S$
      $P \gets |\overline{R} + \gamma \underset{a}{\max} \ Q(S, a) - Q(\overline{S}, \overline{A})|$
      if $P > \theta$, then insert $\overline{S}, \overline{A}$ into $PQueue$ with priority $P$
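A rough Python sketch of the pseudocode above, using `heapq` as the priority queue. Two simplifications are assumptions of this sketch rather than part of the algorithm as stated: a pair may appear in the queue more than once (stale entries just cause an extra update), and the policy in step (b) is taken to be epsilon-greedy. The `step` interface and the `corridor_step` environment are hypothetical.

```python
import heapq
import random
from collections import defaultdict

def prioritized_sweeping(step, start, terminal, actions, n, episodes, alpha=0.5,
                         gamma=0.95, epsilon=0.1, theta=1e-4, seed=0):
    """Prioritized sweeping for a deterministic environment (illustrative sketch)."""
    rng = random.Random(seed)
    Q = defaultdict(float)            # Q[(s, a)], initialised to 0
    model = {}                        # (s, a) -> (r, s')
    predecessors = defaultdict(set)   # s' -> set of (s, a) observed to lead to s'
    pqueue = []                       # heap of (-priority, tie_breaker, s, a)
    pushes = 0                        # tie-breaker so states are never compared

    def target(r, s2):
        return r + gamma * max(Q[(s2, b)] for b in actions)

    def push_if_large(s, a, r, s2):
        nonlocal pushes
        p = abs(target(r, s2) - Q[(s, a)])
        if p > theta:
            heapq.heappush(pqueue, (-p, pushes, s, a))  # negate: heapq is a min-heap
            pushes += 1

    for _ in range(episodes):
        s = start                                        # (a)
        while s != terminal:
            if rng.random() < epsilon:                   # (b) assumed epsilon-greedy
                a = rng.choice(actions)
            else:
                best = max(Q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if Q[(s, b)] == best])
            r, s2 = step(s, a)                           # (c)
            model[(s, a)] = (r, s2)                      # (d)
            predecessors[s2].add((s, a))
            push_if_large(s, a, r, s2)                   # (e), (f)
            for _ in range(n):                           # (g)
                if not pqueue:
                    break
                _, _, ps, pa = heapq.heappop(pqueue)
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (target(pr, ps2) - Q[(ps, pa)])
                # Re-prioritise every pair predicted to lead to ps.
                for bs, ba in predecessors[ps]:
                    br, _ = model[(bs, ba)]
                    push_if_large(bs, ba, br, ps)
            s = s2
    return Q

def corridor_step(s, a, goal=6):
    """Hypothetical corridor 0..goal; reward 1.0 only on reaching the goal."""
    s2 = min(max(s + a, 0), goal)
    return (1.0 if s2 == goal else 0.0), s2

Q = prioritized_sweeping(corridor_step, start=0, terminal=6, actions=[-1, 1],
                         n=5, episodes=20)
```

Note that, as in the pseudocode, $Q$ is only ever updated inside the planning loop (g), so this sketch learns nothing when `n` is 0.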

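Monte Carlo tree search

I have another blog post that covers this algorithm:
Monte Carlo Tree Search Algorithm (UCT): The Evolution Story of a Code Monkey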
