Reinforcement Learning Reading Notes - 06~07 - Temporal-Difference Learning
Study notes for:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016
If the mathematical notation is unfamiliar, see the notation reference first:
Temporal-Difference Learning in Brief
Temporal-difference (TD) learning combines ideas from dynamic programming and Monte Carlo methods, and is a core idea of reinforcement learning.
The term "temporal difference" is not very intuitive. Reading it as "learning from the current difference" is more vivid: the method learns from the difference signal computed at the current step.
Monte Carlo methods simulate (or experience) an episode and, only after the episode ends, use the values observed along the episode to estimate state values.
TD learning also simulates (or experiences) an episode, but after every step (or every few steps) it uses the value of the new state to update the estimate of the preceding state's value.
Monte Carlo can therefore be viewed as TD learning with the maximum possible number of steps.
This chapter considers only one-step TD learning; multi-step TD learning is covered in the next chapter.
Mathematical Formulation
From what we already know: if we can compute a policy's value (the state value \(v_{\pi}(s)\) or the action value \(q_{\pi}(s, a)\) under policy \(\pi\)), we can improve the policy.
In Monte Carlo methods, computing a policy's value requires completing an episode and using the episode's target return \(G_t\) to compute state values. The update is:
\[ V(S_t) \gets V(S_t) + \alpha [G_t - V(S_t)] \]
The idea of temporal difference is to compute a state's value from the value of the next state, which gives an iterative update:
\[ V(S_t) \gets V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)] \]
Note: the book points out that the TD error is only an estimate of the true error, whereas the Monte Carlo error is exact. This is worth knowing, but we will not elaborate here.
Temporal-Difference Learning Methods
This chapter covers the one-step TD methods; multi-step methods are introduced in the next chapter.
- TD learning of the policy's state value \(v_{\pi}\) (one-step / multi-step)
- On-policy TD learning of the policy's action value \(q_{\pi}\): Sarsa (one-step / multi-step)
- Off-policy TD learning of the policy's action value \(q_{\pi}\): Q-learning (one-step)
- Double Q-learning (one-step)
- Off-policy TD learning of the policy's action value \(q_{\pi}\) with importance sampling: Sarsa (multi-step)
- Off-policy TD learning of the policy's action value \(q_{\pi}\) without importance sampling: Tree Backup Algorithm (multi-step)
- Off-policy TD learning of the policy's action value \(q_{\pi}\): \(Q(\sigma)\) (multi-step)
TD Learning of the Policy's State Value \(v_{\pi}\)
One-step TD learning: TD(0)
- Flowchart
- Algorithm description
Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}^+\)
Repeat (for each episode):
  Initialize \(S\)
  Repeat (for each step of episode):
    \(A \gets\) action given by \(\pi\) for \(S\)
    Take action \(A\), observe \(R, S'\)
    \(V(S) \gets V(S) + \alpha [R + \gamma V(S') - V(S)]\)
    \(S \gets S'\)
  Until \(S\) is terminal
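The TD(0) update above can be sketched in Python. This is a minimal illustration, not the book's code: the `td0_prediction` function, the `step`/`policy` interfaces, and the toy deterministic chain are all my own assumptions.

```python
import random  # unused here, but typical TD code needs it for stochastic policies

def td0_prediction(policy, step, start, n_episodes=2000, alpha=0.1, gamma=0.9):
    """Tabular TD(0) policy evaluation: V(S) += alpha*(R + gamma*V(S') - V(S))."""
    V = {}  # unseen states default to 0.0; the terminal state's value stays 0
    for _ in range(n_episodes):
        s = start
        while s is not None:               # None marks the terminal state
            a = policy(s)
            r, s2 = step(s, a)
            v2 = 0.0 if s2 is None else V.get(s2, 0.0)
            V[s] = V.get(s, 0.0) + alpha * (r + gamma * v2 - V.get(s, 0.0))
            s = s2
    return V

# Toy deterministic chain 0 -> 1 -> 2 -> 3 -> terminal, reward 1 on the last step.
def step(s, a):
    return (1.0, None) if s == 3 else (0.0, s + 1)

V = td0_prediction(policy=lambda s: 0, step=step, start=0)
print({s: round(v, 2) for s, v in sorted(V.items())})  # ≈ {0: 0.73, 1: 0.81, 2: 0.9, 3: 1.0}
```

On this chain each value converges toward \(\gamma^{k}\) where \(k\) is the number of steps remaining before the rewarded transition.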
Multi-step (n-step) TD learning
- Flowchart
- Algorithm description
Input: the policy \(\pi\) to be evaluated
Initialize \(V(s)\) arbitrarily \(\forall s \in \mathcal{S}\)
Parameters: step size \(\alpha \in (0, 1]\), a positive integer \(n\)
All store and access operations (for \(S_t\) and \(R_t\)) can take their index mod \(n\)
Repeat (for each episode):
  Initialize and store \(S_0 \ne terminal\)
  \(T \gets \infty\)
  For \(t = 0, 1, 2, \cdots\):
    If \(t < T\), then:
      Take an action according to \(\pi(\cdot | S_t)\)
      Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
      If \(S_{t+1}\) is terminal, then \(T \gets t+1\)
    \(\tau \gets t - n + 1\) (\(\tau\) is the time whose state's estimate is being updated)
    If \(\tau \ge 0\):
      \(G \gets \sum_{i = \tau + 1}^{\min(\tau + n, T)} \gamma^{i-\tau-1} R_i\)
      If \(\tau + n < T\), then: \(G \gets G + \gamma^{n} V(S_{\tau + n}) \qquad (G_{\tau}^{(n)})\)
      \(V(S_{\tau}) \gets V(S_{\tau}) + \alpha [G - V(S_{\tau})]\)
  Until \(\tau = T - 1\)
Note how the updates are staggered: the update of \(V(S_0)\) uses the rewards \(R_1, \dots, R_n\) and the bootstrap value \(V(S_n)\); the update of \(V(S_1)\) uses \(R_2, \dots, R_{n+1}\) and \(V(S_{n+1})\); and so on.
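The loop above transcribes directly into Python. As a simplification of my own (not the book's), whole trajectories are stored instead of mod-\(n\) index buffers; the toy chain environment is the same illustrative assumption as before:

```python
def n_step_td(policy, step, start, n=3, n_episodes=2000, alpha=0.1, gamma=0.9):
    """n-step TD prediction; stores whole trajectories instead of mod-n buffers."""
    V = {}
    for _ in range(n_episodes):
        S, R = [start], [0.0]            # S[t] and R[t]; R[0] is unused
        T, t = float('inf'), 0
        while True:
            if t < T:
                r, s2 = step(S[t], policy(S[t]))
                R.append(r)
                S.append(s2)
                if s2 is None:           # None marks the terminal state
                    T = t + 1
            tau = t - n + 1              # time whose state estimate is updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:          # bootstrap with V(S_{tau+n})
                    G += gamma ** n * V.get(S[tau + n], 0.0)
                V[S[tau]] = V.get(S[tau], 0.0) + alpha * (G - V.get(S[tau], 0.0))
            if tau == T - 1:
                break
            t += 1
    return V

# Same toy chain: 0 -> 1 -> 2 -> 3 -> terminal, reward 1 on the final step.
def step(s, a):
    return (1.0, None) if s == 3 else (0.0, s + 1)

V = n_step_td(policy=lambda s: 0, step=step, start=0)
```

On this deterministic chain the fixed point is the same as for TD(0): \(V(3) \to 1\), \(V(0) \to \gamma^3 = 0.729\).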
On-policy TD Learning of the Policy's Action Value \(q_{\pi}\): Sarsa
One-step TD learning
- Flowchart
- Algorithm description
Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(terminal, \cdot) = 0\)
Repeat (for each episode):
  Initialize \(S\)
  Choose \(A\) from \(S\) using the policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
  Repeat (for each step of episode):
    Take action \(A\), observe \(R, S'\)
    Choose \(A'\) from \(S'\) using the policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
    \(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]\)
    \(S \gets S'; A \gets A'\)
  Until \(S\) is terminal
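A minimal tabular sketch of the one-step Sarsa loop above. The corridor environment, the `sarsa` function, and its interfaces are illustrative assumptions of mine, not from the book:

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def sarsa(step, start, actions, n_episodes=500, alpha=0.1, gamma=1.0, eps=0.1):
    """One-step Sarsa: on-policy TD control, target R + gamma*Q(S', A')."""
    Q = {}  # Q[(s, a)] defaults to 0.0; Q(terminal, .) is treated as 0
    for _ in range(n_episodes):
        s = start
        a = epsilon_greedy(Q, s, actions, eps)
        while s is not None:
            r, s2 = step(s, a)
            if s2 is None:                       # episode ends
                target, a2 = r, None
            else:
                a2 = epsilon_greedy(Q, s2, actions, eps)
                target = r + gamma * Q.get((s2, a2), 0.0)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
            s, a = s2, a2
    return Q

# Corridor: states 0..4, action 1 = right, 0 = left; reward -1 per step, goal past 4.
def step(s, a):
    s2 = s + 1 if a == 1 else max(0, s - 1)
    return (-1.0, None) if s2 == 5 else (-1.0, s2)

random.seed(0)
Q = sarsa(step, start=0, actions=[0, 1])
print(max([0, 1], key=lambda a: Q.get((a and (0, 1) or (0, 0)), 0.0) if False else Q.get((0, a), 0.0)))  # 1 (move right)
```

After training, the greedy action from every corridor state is "right", as expected for this reward structure.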
Multi-step (n-step) TD learning
- Flowchart
- Algorithm description
Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to \(Q\), or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\), small \(\epsilon > 0\), a positive integer \(n\)
All store and access operations (for \(S_t\), \(A_t\) and \(R_t\)) can take their index mod \(n\)
Repeat (for each episode):
  Initialize and store \(S_0 \ne terminal\)
  Select and store an action \(A_0 \sim \pi(\cdot | S_0)\)
  \(T \gets \infty\)
  For \(t = 0, 1, 2, \cdots\):
    If \(t < T\), then:
      Take action \(A_t\)
      Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
      If \(S_{t+1}\) is terminal, then:
        \(T \gets t+1\)
      Else:
        Select and store an action \(A_{t+1} \sim \pi(\cdot | S_{t+1})\)
    \(\tau \gets t - n + 1\) (\(\tau\) is the time whose estimate is being updated)
    If \(\tau \ge 0\):
      \(G \gets \sum_{i = \tau + 1}^{\min(\tau + n, T)} \gamma^{i-\tau-1} R_i\)
      If \(\tau + n < T\), then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad (G_{\tau}^{(n)})\)
      \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
      If \(\pi\) is being learned, then ensure that \(\pi(\cdot | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
  Until \(\tau = T - 1\)
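The n-step Sarsa loop can be sketched by combining the n-step bookkeeping with an \(\epsilon\)-greedy action choice. As before, this is my own simplified transcription: whole trajectories are stored rather than mod-\(n\) buffers, and the corridor environment is an illustrative assumption:

```python
import random

def n_step_sarsa(step, start, actions, n=3, n_episodes=800,
                 alpha=0.1, gamma=1.0, eps=0.1):
    """On-policy n-step Sarsa; whole trajectories are stored for clarity."""
    Q = {}

    def choose(s):                                   # epsilon-greedy w.r.t. Q
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda x: Q.get((s, x), 0.0))

    for _ in range(n_episodes):
        S, A, R = [start], [choose(start)], [0.0]    # R[0] is unused
        T, t = float('inf'), 0
        while True:
            if t < T:
                r, s2 = step(S[t], A[t])
                R.append(r)
                S.append(s2)
                if s2 is None:                       # terminal
                    T = t + 1
                    A.append(None)
                else:
                    A.append(choose(s2))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:                      # bootstrap with Q(S_{tau+n}, A_{tau+n})
                    G += gamma ** n * Q.get((S[tau + n], A[tau + n]), 0.0)
                key = (S[tau], A[tau])
                Q[key] = Q.get(key, 0.0) + alpha * (G - Q.get(key, 0.0))
            if tau == T - 1:
                break
            t += 1
    return Q

# Corridor: states 0..4, action 1 = right, 0 = left; reward -1 per step, goal past 4.
def step(s, a):
    s2 = s + 1 if a == 1 else max(0, s - 1)
    return (-1.0, None) if s2 == 5 else (-1.0, s2)

random.seed(0)
Q = n_step_sarsa(step, start=0, actions=[0, 1])
```

With \(n = 3\) on this corridor, the learned greedy policy is again "move right" from every state.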
Off-policy TD Learning of the Policy's Action Value \(q_{\pi}\): Q-learning
The Q-learning algorithm (Watkins, 1989) was a breakthrough. It performs off-policy learning through the update below.
One-step TD learning
- Algorithm description
Initialize \(Q(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily, and \(Q(terminal, \cdot) = 0\)
Repeat (for each episode):
  Initialize \(S\)
  Repeat (for each step of episode):
    Choose \(A\) from \(S\) using the policy derived from \(Q\) (e.g. \(\epsilon\)-greedy)
    Take action \(A\), observe \(R, S'\)
    \(Q(S, A) \gets Q(S, A) + \alpha [R + \gamma \max_{a} Q(S', a) - Q(S, A)]\)
    \(S \gets S'\)
  Until \(S\) is terminal
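A minimal sketch of the Q-learning loop above, on the same illustrative corridor environment (the function and environment are my own assumptions, not the book's code):

```python
import random

def q_learning(step, start, actions, n_episodes=500, alpha=0.1, gamma=1.0, eps=0.1):
    """One-step Q-learning: the target uses max_a Q(S', a), independently of the
    action the behaviour policy takes next (hence off-policy)."""
    Q = {}
    for _ in range(n_episodes):
        s = start
        while s is not None:
            if random.random() < eps:                # epsilon-greedy behaviour
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q.get((s, x), 0.0))
            r, s2 = step(s, a)
            best = 0.0 if s2 is None else max(Q.get((s2, x), 0.0) for x in actions)
            Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best - Q.get((s, a), 0.0))
            s = s2
    return Q

# Corridor: states 0..4, action 1 = right, 0 = left; reward -1 per step, goal past 4.
def step(s, a):
    s2 = s + 1 if a == 1 else max(0, s - 1)
    return (-1.0, None) if s2 == 5 else (-1.0, s2)

random.seed(0)
Q = q_learning(step, start=0, actions=[0, 1])
print(round(Q[(4, 1)], 2))  # -1.0 (one step from the goal)
```

Note that \(Q(4, 1)\) converges to exactly \(-1\) here, since its target is always \(-1 + \gamma \cdot 0\).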
- Q-learning's use of the max operator introduces a maximization bias problem.
For details, see Example 6.7 in the book.
Double Q-learning eliminates this problem.
Double Q-learning
One-step TD learning
Initialize \(Q_1(s, a)\) and \(Q_2(s, a), \forall s \in \mathcal{S}, a \in \mathcal{A}(s)\) arbitrarily
Initialize \(Q_1(terminal, \cdot) = Q_2(terminal, \cdot) = 0\)
Repeat (for each episode):
  Initialize \(S\)
  Repeat (for each step of episode):
    Choose \(A\) from \(S\) using the policy derived from \(Q_1\) and \(Q_2\) (e.g. \(\epsilon\)-greedy in \(Q_1 + Q_2\))
    Take action \(A\), observe \(R, S'\)
    With 0.5 probability:
      \(Q_1(S, A) \gets Q_1(S, A) + \alpha [R + \gamma Q_2(S', \arg\max_{a} Q_1(S', a)) - Q_1(S, A)]\)
    Else:
      \(Q_2(S, A) \gets Q_2(S, A) + \alpha [R + \gamma Q_1(S', \arg\max_{a} Q_2(S', a)) - Q_2(S, A)]\)
    \(S \gets S'\)
  Until \(S\) is terminal
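The two-table update above can be sketched as follows; the corridor environment and the function interface are illustrative assumptions of mine:

```python
import random

def double_q_learning(step, start, actions, n_episodes=1000,
                      alpha=0.1, gamma=1.0, eps=0.1):
    """Double Q-learning: one table picks the argmax, the other evaluates it,
    which removes the maximization bias of plain Q-learning."""
    Q1, Q2 = {}, {}
    for _ in range(n_episodes):
        s = start
        while s is not None:
            # behave epsilon-greedily with respect to Q1 + Q2
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q1.get((s, x), 0.0) + Q2.get((s, x), 0.0))
            r, s2 = step(s, a)
            # with probability 0.5 update Q1 using Q2's evaluation, else the reverse
            A, B = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if s2 is None:
                target = r
            else:
                a_star = max(actions, key=lambda x: A.get((s2, x), 0.0))
                target = r + gamma * B.get((s2, a_star), 0.0)
            A[(s, a)] = A.get((s, a), 0.0) + alpha * (target - A.get((s, a), 0.0))
            s = s2
    return Q1, Q2

# Corridor: states 0..4, action 1 = right, 0 = left; reward -1 per step, goal past 4.
def step(s, a):
    s2 = s + 1 if a == 1 else max(0, s - 1)
    return (-1.0, None) if s2 == 5 else (-1.0, s2)

random.seed(0)
Q1, Q2 = double_q_learning(step, start=0, actions=[0, 1])
```

The greedy policy with respect to \(Q_1 + Q_2\) again moves right from every state; the benefit of the double estimator shows up on noisier problems such as Example 6.7.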
Off-policy TD Learning of the Policy's Action Value \(q_{\pi}\) (with importance sampling): Sarsa
Taking importance sampling into account, substituting the ratio \(\rho\) into the Sarsa algorithm yields an off-policy method.
\(\rho\) - the importance sampling ratio
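The ratio can be computed directly from a stored trajectory. The sketch below is my own packaging (the names `pi`, `mu`, and `traj` are assumptions), assuming each policy is given as a probability function:

```python
def importance_ratio(pi, mu, traj, tau, n, T):
    """rho = product over i = tau+1 .. min(tau+n-1, T-1) of pi(A_i|S_i) / mu(A_i|S_i)."""
    rho = 1.0
    for i in range(tau + 1, min(tau + n - 1, T - 1) + 1):
        s, a = traj[i]                    # traj[i] = (S_i, A_i)
        rho *= pi(a, s) / mu(a, s)
    return rho

# Target policy pi always moves right; behaviour mu is uniform over two actions.
pi = lambda a, s: 1.0 if a == 1 else 0.0
mu = lambda a, s: 0.5
traj = [(0, 1), (1, 1), (2, 1), (3, 1)]
print(importance_ratio(pi, mu, traj, tau=0, n=3, T=10))  # (1/0.5)*(1/0.5) = 4.0
```

The ratio re-weights the Sarsa update so that experience generated by \(\mu\) can evaluate \(\pi\).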
Multi-step (n-step) TD learning
- Algorithm description
Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)
Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to \(Q\), or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\), small \(\epsilon > 0\), a positive integer \(n\)
All store and access operations (for \(S_t\), \(A_t\) and \(R_t\)) can take their index mod \(n\)
Repeat (for each episode):
  Initialize and store \(S_0 \ne terminal\)
  Select and store an action \(A_0 \sim \mu(\cdot | S_0)\)
  \(T \gets \infty\)
  For \(t = 0, 1, 2, \cdots\):
    If \(t < T\), then:
      Take action \(A_t\)
      Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
      If \(S_{t+1}\) is terminal, then:
        \(T \gets t+1\)
      Else:
        Select and store an action \(A_{t+1} \sim \mu(\cdot | S_{t+1})\)
    \(\tau \gets t - n + 1\) (\(\tau\) is the time whose estimate is being updated)
    If \(\tau \ge 0\):
      \(\rho \gets \prod_{i = \tau + 1}^{\min(\tau + n - 1, T - 1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)} \qquad (\rho_{\tau+1}^{\tau+n-1})\)
      \(G \gets \sum_{i = \tau + 1}^{\min(\tau + n, T)} \gamma^{i-\tau-1} R_i\)
      If \(\tau + n < T\), then: \(G \gets G + \gamma^{n} Q(S_{\tau + n}, A_{\tau + n}) \qquad (G_{\tau}^{(n)})\)
      \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
      If \(\pi\) is being learned, then ensure that \(\pi(\cdot | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q\)
  Until \(\tau = T - 1\)
Expected Sarsa
- Flowchart
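Expected Sarsa replaces the sampled \(Q(S', A')\) in the Sarsa target with its expectation under the policy. A sketch of the target computation for an \(\epsilon\)-greedy \(\pi\) (the function name and the tie-breaking choice are my assumptions):

```python
def expected_sarsa_target(Q, s2, actions, r, gamma, eps):
    """r + gamma * E_pi[Q(S', .)] for an epsilon-greedy pi (ties broken arbitrarily)."""
    if s2 is None:                        # terminal next state
        return r
    qs = [Q.get((s2, a), 0.0) for a in actions]
    # epsilon-greedy: eps/|A| on every action, plus (1 - eps) on the greedy one
    expectation = sum(eps / len(actions) * q for q in qs) + (1 - eps) * max(qs)
    return r + gamma * expectation

Q = {(1, 0): 0.0, (1, 1): 1.0}
print(expected_sarsa_target(Q, s2=1, actions=[0, 1], r=0.0, gamma=1.0, eps=0.1))  # ≈ 0.95
```

Averaging over the next action removes the sampling variance that Sarsa's \(A'\) introduces, at the cost of computing the expectation.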
Off-policy TD Learning of the Policy's Action Value \(q_{\pi}\) (without importance sampling): Tree Backup Algorithm
The idea of the Tree Backup Algorithm is to use the expected action value at every step.
Taking the expectation of the action values means evaluating every possible action \(a\) once, weighted by the policy's probabilities.
Multi-step (n-step) TD learning
- Flowchart
- Algorithm description
Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to \(Q\), or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\), small \(\epsilon > 0\), a positive integer \(n\)
All store and access operations can take their index mod \(n\)
Repeat (for each episode):
  Initialize and store \(S_0 \ne terminal\)
  Select and store an action \(A_0 \sim \pi(\cdot | S_0)\)
  \(Q_0 \gets Q(S_0, A_0)\)
  \(T \gets \infty\)
  For \(t = 0, 1, 2, \cdots\):
    If \(t < T\), then:
      Take action \(A_t\)
      Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
      If \(S_{t+1}\) is terminal, then:
        \(T \gets t+1\)
        \(\delta_t \gets R_{t+1} - Q_t\)
      Else:
        \(\delta_t \gets R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q_t\)
        Select arbitrarily and store an action as \(A_{t+1}\)
        \(Q_{t+1} \gets Q(S_{t+1}, A_{t+1})\)
        \(\pi_{t+1} \gets \pi(A_{t+1}|S_{t+1})\)
    \(\tau \gets t - n + 1\) (\(\tau\) is the time whose estimate is being updated)
    If \(\tau \ge 0\):
      \(E \gets 1\)
      \(G \gets Q_{\tau}\)
      For \(k = \tau, \dots, \min(\tau + n - 1, T - 1)\):
        \(G \gets G + E \delta_k\)
        \(E \gets \gamma E \pi_{k+1}\)
      \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha [G - Q(S_{\tau}, A_{\tau})]\)
      If \(\pi\) is being learned, then ensure that \(\pi(a | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau}, \cdot)\)
  Until \(\tau = T - 1\)
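The inner \(G\)/\(E\) recursion above can be isolated as a small helper. This is a sketch of my own packaging: `deltas` holds the stored \(\delta_k\) values and `pis` the stored \(\pi_{k+1} = \pi(A_{k+1}|S_{k+1})\) values:

```python
def tree_backup_return(q_tau, deltas, pis, gamma, tau, n, T):
    """G = Q_tau + sum_k E_k * delta_k, with E shrunk by gamma * pi(A_{k+1}|S_{k+1})."""
    G, E = q_tau, 1.0
    for k in range(tau, min(tau + n - 1, T - 1) + 1):
        G += E * deltas[k]
        E *= gamma * pis[k + 1]    # pis[k+1] = pi(A_{k+1} | S_{k+1})
    return G

# Two-step example: delta_0 = 1, delta_1 = 2, pi_1 = 0.5, gamma = 0.9
print(tree_backup_return(0.0, [1.0, 2.0], [None, 0.5, 0.5], 0.9, tau=0, n=2, T=10))  # 1.9
```

Each extra step's correction \(\delta_k\) is discounted both by \(\gamma\) and by the probability the target policy would have taken the sampled action, which is why no importance sampling ratio is needed.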
Off-policy TD Learning of the Policy's Action Value \(q_{\pi}\): \(Q(\sigma)\)
\(Q(\sigma)\) unifies Sarsa (with importance sampling), Expected Sarsa, and the Tree Backup algorithm, and takes importance sampling into account.
When \(\sigma = 1\), it becomes the Sarsa algorithm with importance sampling.
When \(\sigma = 0\), it becomes the Tree Backup algorithm's expected-value update.
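The way \(\sigma\) interpolates between the two targets is visible in the \(\delta_t\) line of the algorithm below; here is a sketch of just that line (function and parameter names are my own):

```python
def q_sigma_delta(r, gamma, sigma, q_next, expected_next, q_t):
    """delta_t = R + gamma*[sigma*Q(S',A') + (1-sigma)*sum_a pi(a|S')Q(S',a)] - Q_t."""
    return r + gamma * (sigma * q_next + (1 - sigma) * expected_next) - q_t

# sigma=1 reproduces the sampled (Sarsa) target, sigma=0 the expected (Tree Backup) one.
print(q_sigma_delta(1.0, 0.9, 1.0, q_next=2.0, expected_next=4.0, q_t=0.0))  # 2.8
print(q_sigma_delta(1.0, 0.9, 0.0, q_next=2.0, expected_next=4.0, q_t=0.0))  # 4.6
```

Intermediate values \(0 < \sigma < 1\) (and per-step schedules for \(\sigma\)) blend the two targets.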
Multi-step (n-step) TD learning
- Flowchart
- Algorithm description
Input: behavior policy \(\mu\) such that \(\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}\)
Initialize \(Q(s, a)\) arbitrarily \(\forall s \in \mathcal{S}, \forall a \in \mathcal{A}\)
Initialize \(\pi\) to be \(\epsilon\)-greedy with respect to \(Q\), or to a fixed given policy
Parameters: step size \(\alpha \in (0, 1]\), small \(\epsilon > 0\), a positive integer \(n\)
All store and access operations can take their index mod \(n\)
Repeat (for each episode):
  Initialize and store \(S_0 \ne terminal\)
  Select and store an action \(A_0 \sim \mu(\cdot | S_0)\)
  \(Q_0 \gets Q(S_0, A_0)\)
  \(T \gets \infty\)
  For \(t = 0, 1, 2, \cdots\):
    If \(t < T\), then:
      Take action \(A_t\)
      Observe and store the next reward as \(R_{t+1}\) and the next state as \(S_{t+1}\)
      If \(S_{t+1}\) is terminal, then:
        \(T \gets t+1\)
        \(\delta_t \gets R_{t+1} - Q_t\)
      Else:
        Select and store an action \(A_{t+1} \sim \mu(\cdot | S_{t+1})\)
        Select and store \(\sigma_{t+1}\)
        \(Q_{t+1} \gets Q(S_{t+1}, A_{t+1})\)
        \(\delta_t \gets R_{t+1} + \gamma \sigma_{t+1} Q_{t+1} + \gamma (1 - \sigma_{t+1}) \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q_t\)
        \(\pi_{t+1} \gets \pi(A_{t+1}|S_{t+1})\)
        \(\rho_{t+1} \gets \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})}\)
    \(\tau \gets t - n + 1\) (\(\tau\) is the time whose estimate is being updated)
    If \(\tau \ge 0\):
      \(\rho \gets 1\)
      \(E \gets 1\)
      \(G \gets Q_{\tau}\)
      For \(k = \tau, \dots, \min(\tau + n - 1, T - 1)\):
        \(G \gets G + E \delta_k\)
        \(E \gets \gamma E [(1 - \sigma_{k+1}) \pi_{k+1} + \sigma_{k+1}]\)
        \(\rho \gets \rho (1 - \sigma_{k} + \sigma_{k} \rho_{k})\)
      \(Q(S_{\tau}, A_{\tau}) \gets Q(S_{\tau}, A_{\tau}) + \alpha \rho [G - Q(S_{\tau}, A_{\tau})]\)
      If \(\pi\) is being learned, then ensure that \(\pi(a | S_{\tau})\) is \(\epsilon\)-greedy wrt \(Q(S_{\tau}, \cdot)\)
  Until \(\tau = T - 1\)
Summary
A limitation of TD learning methods: a reward signal must be obtainable within the number of learning steps.
For example, can a reward be computed for every single move in a game of chess? With a Monte Carlo method, which simulates to the end of the game, a reward outcome can certainly be obtained.
