Reading Notes on *Reinforcement Learning: An Introduction*



Chapter 1

RL is learning what to do.

Two defining features of RL: trial-and-error search and delayed reward.

An MDP is meant to capture three aspects of the problem: sensation, action, and goal.

RL regards itself as belonging to neither supervised learning nor unsupervised learning.

A problem unique to RL: the trade-off between exploration and exploitation. (Other fields face similar tensions, e.g., diversity vs. fidelity in image generation.)

One key feature: RL explicitly considers the whole problem of a goal-directed agent interacting with its environment.

Besides the agent and the environment, the main elements are:

policy: may be a lookup table, a function, or a search process

reward signal: defines the goal of the problem

The value of a state is the total reward the agent can expect to accumulate over the rest of its life from that state onward (possibly with decay, since a reward now is worth more than the same reward later).

Chapter 2

Learning from evaluative feedback vs. instructive feedback

Supervised learning gives instructive feedback: it indicates the correct action, independent of the action actually taken.

RL gives evaluative feedback: based on the action you took, it only tells you how good or bad that action was.

Multi-armed bandits

A multi-armed bandit has only one situation; a single action constitutes an entire episode.

For each lever the reward is drawn from a probability distribution. \(A_t\) denotes the action taken at time step t; the value of an action a is the expected reward given that a is selected:

\(q_*(a) = E[R_t|A_t=a]\)

Unfortunately we do not know \(q_*(a)\); we can only maintain an estimate \(Q_t(a)\).

\(Q_t(a)\) is estimated from past experience, so it may be inaccurate, and this is where the conflict between exploration and exploitation arises.

Many factors influence how to balance exploring and exploiting, such as the number of remaining steps, the current estimates, and their uncertainty. There are many methods for striking the balance, but most rest on assumptions that usually do not hold in the full RL problem.

Action-value methods

Estimating action values by sample averages:

\(Q_t(a) = \frac{\sum_{i=1}^{t-1}R_i \cdot I_{A_i=a}}{\sum_{i=1}^{t-1} I_{A_i=a}}\)

\(I_{A_i=a}\) is the indicator function (1 if \(A_i=a\), 0 otherwise).

greedy action:

\(A_t= \underset{a}{argmax}Q_t(a)\)

\(\epsilon\)-greedy: with probability \(\epsilon\) choose an action at random; with the remaining probability \(1-\epsilon\) choose greedily.

Incremental implementation

Incremental computation of the sample average:

\(Q_{n+1} = Q_{n} + \frac{1}{n}(R_{n}-Q_{n})\)

\(NewEstimate=OldEstimate+StepSize[Target-OldEstimate]\)
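As a concrete illustration of the incremental update and \(\epsilon\)-greedy selection, here is a minimal sketch of a bandit agent on a stationary k-armed testbed (the Gaussian reward model, parameter values, and function names are my own assumptions, not from the book):

```python
import numpy as np

def epsilon_greedy_bandit(q_true, steps=1000, epsilon=0.1, rng=None):
    """Run one epsilon-greedy agent on a stationary k-armed bandit.

    q_true: true action values; rewards are sampled as N(q_true[a], 1).
    Returns the list of received rewards.
    """
    rng = rng or np.random.default_rng(0)
    k = len(q_true)
    Q = np.zeros(k)                 # incremental sample-average estimates
    N = np.zeros(k, dtype=int)      # visit counts
    rewards = []
    for _ in range(steps):
        if rng.random() < epsilon:                       # explore
            a = rng.integers(k)
        else:                                            # exploit, break ties randomly
            a = rng.choice(np.flatnonzero(Q == Q.max()))
        r = q_true[a] + rng.normal()                     # sampled reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                        # NewEstimate = Old + (1/n)(Target - Old)
        rewards.append(r)
    return rewards

# usage: a 10-armed testbed with random true values (an assumption for the demo)
rewards = epsilon_greedy_bandit(np.random.default_rng(1).normal(size=10))
```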

Nonstationary Problem

In a nonstationary setting, recent rewards matter more than old ones, so use a constant step size:

\(Q_{n+1}(a) = Q_{n}(a) + \alpha(R_{n}-Q_{n}), \alpha\in(0,1]\)

Unrolling the recursion gives an exponential recency-weighted average: \(Q_{n+1} = (1-\alpha)^nQ_1+\sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_i\)

More generally the step size may vary from step to step: \(\alpha_n(a)\).

Conditions for convergence with probability 1: \(\sum_{n=1}^{\infty}\alpha_n(a)=\infty\) and \(\sum_{n=1}^{\infty}\alpha_n(a)^2<\infty\).

A constant step size \(\alpha_n(a)=\alpha\) violates the second condition, but in a nonstationary environment we do not want convergence anyway; we just want the estimates to keep tracking the changes. Step sizes that do satisfy the conditions converge slowly and are mostly of theoretical interest.

optimistic initial values

Set the initial values \(Q_1\) optimistically high. Whichever action is selected, its reward disappoints the estimate, so the learner moves on to actions it has not tried yet; this drives early exploration.

It only provides exploration at the start, so it cannot cope with nonstationary problems where the need for exploration recurs.

UCB(Upper confidence bound)

Spread the attention around: the less often an action has been selected, the larger the exploration bonus it receives.

\(A_t = \underset{a}{argmax}\bigl[Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}}\bigr]\)
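A small sketch of UCB action selection implementing the formula above; treating untried actions (\(N_t(a)=0\)) as maximizing is an assumption consistent with the usual convention:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick the UCB action: argmax_a Q[a] + c*sqrt(ln(t)/N[a]).

    Q: current value estimates; N: selection counts; t: current time step (starting at 1).
    Actions with N[a] == 0 are considered maximizing and are tried first.
    """
    untried = np.flatnonzero(N == 0)
    if untried.size > 0:
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / N)
    return int(np.argmax(Q + bonus))
```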

UCB does not extend well to large state spaces or to nonstationary problems.

Gradient bandit algorithms

\(H_t(a)\) preference for each action a

\(Pr\{A_t=a\} = \frac{e^{H_t(a)}}{\sum_{b=1}^{k}e^{H_t(b)}}= \pi_t{(a)}\)

After taking \(A_t\) and receiving \(R_t\), the preferences are updated:

\(H_{t+1}(A_t) = H_t(A_t) +\alpha(R_t-\bar{R}_t)(1-\pi_t(A_t))\)

\(H_{t+1}(a) = H_t(a) -\alpha(R_t-\bar{R}_t)\pi_t(a)\) for all \(a \neq A_t\)

\(\bar{R}_t\) is the average of the rewards up to but not including \(R_t\); it is the baseline.

The two update rules above can be derived via gradient ascent.

Each preference is changed in proportion to its effect on the expected reward:

\(H_{t+1}(a) = H_{t}(a) + \alpha\frac{\partial E[R_t]}{\partial H_t(a)}\)

For the bandit, the expected reward is the sum of the action values weighted by their selection probabilities:

\(E[R_t] = \sum_x \pi_t(x)q_*(x)\)

\(\begin{align*} \frac{\partial E[R_t]}{\partial H_t(a)}=&\frac{\partial }{\partial H_t(a)}{[\sum_x \pi_t(x)q_*(x)]}\\=&\sum_xq_*(x)\frac{\partial \pi_t(x)}{\partial H_t(a)}\\=&\sum_x(q_*(x)-B_t)\frac{\partial \pi_t(x)}{\partial H_t(a)}\end{align*}\)

\(B_t\) is a baseline, a value independent of x; since \(\sum_x \frac{\partial \pi_t(x)}{\partial H_t(a)}=0\) (the probabilities always sum to one), the baseline changes nothing.

To turn the sum into an expectation, multiply and divide by \(\pi_t(x)\) so that x can be replaced by the random variable \(A_t\):

\(\begin{align*}\frac{\partial E[R_t]}{\partial H_t(a)}=&\sum_x \pi_t(x)(q_*(x)-B_t)\frac{\partial \pi_t(x)}{\partial H_t(a)}/\pi_t(x)\\=&E[(q_*(A_t)-B_t)\frac{\partial \pi_t(A_t)}{\partial H_t(a)}/\pi_t(A_t)]\\=&E[(R_t-\bar{R}_t)\frac{\partial \pi_t(A_t)}{\partial H_t(a)}/\pi_t(A_t)]\end{align*}\)

using \(E[R_t|A_t]=q_*(A_t)\) and choosing the baseline \(B_t=\bar{R}_t\). Substituting \(\frac{\partial \pi_t(A_t)}{\partial H_t(a)} = \pi_t(A_t)\bigl(I_{a=A_t}-\pi_t(a)\bigr)\):

\(\begin{align*}\frac{\partial E[R_t]}{\partial H_t(a)}=&E[(R_t-\bar{R}_t)\pi_t(A_t)(I_{a=A_t}-\pi_t(a))/\pi_t(A_t)]\\=&E[(R_t-\bar{R}_t)(I_{a=A_t}-\pi_t(a))]\end{align*}\)

Replacing the expectation by its sample at each step gives the update:

\(H_{t+1}(a) = H_t(a) +\alpha (R_t-\bar{R}_t)(I_{a=A_t}-\pi_t(a))\) for all a
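A sketch of a single gradient-bandit step combining the softmax policy with the preference update above; the reward-function interface and the incremental baseline are assumptions for the example:

```python
import numpy as np

def gradient_bandit_step(H, avg_reward, t, reward_fn, alpha=0.1, rng=None):
    """One step of the gradient bandit algorithm.

    H: preference vector (modified in place).
    avg_reward: running baseline (average of rewards before this step).
    t: 1-based step count, used to update the baseline incrementally.
    reward_fn: callable a -> sampled reward (the environment; assumed given).
    Returns the updated baseline.
    """
    rng = rng or np.random.default_rng()
    pi = np.exp(H - H.max())
    pi /= pi.sum()                                   # softmax policy pi_t
    a = rng.choice(len(H), p=pi)
    r = reward_fn(a)
    one_hot = np.zeros(len(H))
    one_hot[a] = 1.0
    # H_{t+1}(a) = H_t(a) + alpha*(R_t - baseline)*(1{a=A_t} - pi_t(a)) for all a
    H += alpha * (r - avg_reward) * (one_hot - pi)
    return avg_reward + (r - avg_reward) / t         # incremental baseline update
```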

contextual bandit

In the multi-armed bandit there is no notion of state; at every time step we simply pick an action.

In a contextual bandit, each round presents a bandit together with some information about it; that information plays the role of a state, and the action is chosen according to it. Contextual bandits sit between multi-armed bandits and full reinforcement learning.

Chapter 3 Finite MDPs

The Agent–Environment Interface

MDPs are a framework for learning from interaction to achieve a goal; the framework fits most of reinforcement learning, but not all of it.

agent: the learner and decision maker

environment: everything outside the agent

dynamics:

\(p(s',r|s,a)=Pr\{S_t=s',R_t=r|S_{t-1}=s,A_{t-1}=a\}\)

p defines the dynamics of the MDP; it is a conditional probability distribution \(p:\mathcal{S}\times\mathcal{R}\times\mathcal{S}\times\mathcal{A}\to[0,1]\).

From the four-argument function, other quantities can be derived:

\(p(s'|s,a),r(s,a),r(s,a,s')\)

In general, actions can be any decisions we want to learn how to make, and the states can be anything we can know that might be useful in making them.

Goals and rewards

Goals and purposes are formalized as maximizing the expected cumulative reward.

The reward can act as a punishment as well as a reward.

The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved.

The reward should not encode prior knowledge about how to achieve the goal.

Returns and Episodes

Return: \(G_t= R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T\), where T is the final time step.

Interaction with the environment breaks into subsequences called episodes.

Each episode ends in a terminal state; what happens next does not depend on how the previous episode ended.

Tasks of this kind are episodic tasks.

The set of nonterminal states is denoted \(S\); the set of all states including the terminal state is \(S^+\).

Continuing tasks: e.g., an on-going process-control task that never terminates.

These require discounting:

\(\begin{align*}G_t=& R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ··· \\=&\sum_{k=0}^{\infty }\gamma^k R_{t+k+1},\gamma\in [0,1]\end{align*}\)

\(\gamma\) determines whether the agent is myopic (small \(\gamma\)) or farsighted (\(\gamma\) close to 1).

\(G_t = R_{t+1}+\gamma G_{t+1}\)

\(G_t\) is finite when \(\gamma<1\) and the rewards are bounded, since \(\sum_{k=0}^{\infty}\gamma^k=\frac{1}{1-\gamma}\).

Unified notation for episodic and continuing tasks

Both kinds of task can be written in a single form.

For episodic tasks we usually do not need to distinguish between episodes in the notation.

Moreover, an episodic task can be viewed as a continuing task that ends in an absorbing state looping to itself forever with reward 0.

\(G_t =\sum_{k=t+1}^{T}\gamma^{k-t-1}R_k\)

Either \(T=\infty\) or \(\gamma=1\) is allowed, but not both at once.

Policies and value functions

value function: estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state)

"how good" :feature rewards that can be expected

policy:\(\pi(a|s)\)

value function under a policy \(\pi\), \(v_\pi(s)=E_\pi[G_t|S_t=s]\)

action-value function:\(q_\pi(s,a)=E_\pi[G_t|S_t=s,A_t=a]\)

\(\begin{align*}v_\pi(s)= &E_\pi[R_{t+1}+\gamma G_{t+1}|S_t=s]\\=&\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma E_\pi[G_{t+1}|S_{t+1}=s']]\\=&\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma v_\pi(s')]\end{align*}\)
\(q_\pi(s,a) = \sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma \sum_{a'}\pi(a'|s')q_\pi(s',a')]\)

optimal policies and optimal value functions

\(v_*(s) = \underset{\pi}{max} \,v_\pi(s)\)

\(q_*(s,a)=\underset{\pi}{max}q_\pi(s,a)\)

\(q_*(s,a) = E_\pi[R_{t+1}+\gamma v_*(S_{t+1})|S_t=s,A_t=a]\)

\(\begin{align*}v_*(s)=&\underset{a\in A(s)}{max}q_{\pi_*}(s,a)\\=&\underset{a}{max} E_\pi[R_{t+1}+\gamma v_*(S_{t+1})|S_t=s,A_t=a]\\=&\underset{a}{max}\sum_{s',r}p(s',r|s,a)(r+\gamma v_*(s')) \end{align*}\)

\(\begin{align*}q_*(s,a)=&E[R_{t+1}+\gamma \underset{a'}{max}q_*(S_{t+1},a')|S_t=s,A_t=a]\\=&\sum_{s',r}p(s',r|s,a)(r+\gamma \underset{a'}{max}q_*(s',a'))\end{align*}\)

optimality and approximation

The Bellman optimality equation is solvable in principle when the model is known, but limits on computation and memory (the state space is huge) mean we can usually only approximate the solution.

Chapter 4 DP

Finite MDPs: the sets of states, actions, and rewards are all finite.

\(v_*(s)=\underset{a}{max}\sum_{s',r}p(s',r|s,a)(r+\gamma v_*(s'))\)

\(q_*(s,a)=\sum_{s',r}p(s',r|s,a)(r+\gamma \underset{a'}{max}q_*(s',a'))\)

policy evaluation (prediction)

Given a policy \(\pi\), compute \(v_\pi\).

\(\begin{align*}v_\pi(s)=& E[R_{t+1}+\gamma G_{t+1}|S_t=s]\\=&\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma E_\pi[G_{t+1}|S_{t+1}=s']]\\=&\sum_{a}\pi(a|s)\sum_{s'}\sum_{r}p(s',r|s,a)[r+\gamma v_\pi(s')]\end{align*}\)

If the dynamics of the system are known, this is just a system of \(|S|\) linear equations in \(|S|\) unknowns; it could be solved directly, but nobody actually does it that way.

Instead use an iterative scheme, producing \(v_0,v_1,\cdots,v_k\).

The initial values can be set arbitrarily, except that the terminal state (if any) must be given value 0.

\(v_{k+1}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_k(s')]\)

\(v_\pi\) is a fixed point of this update; by a contraction-mapping argument the iteration is guaranteed to converge, \(v_{k}\to v_\pi\) as \(k\to \infty\).

This procedure is called iterative policy evaluation.

One sweep of updates of this kind is called an expected update, because it is based on an expectation over all possible next states rather than on a single sample.
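A minimal sketch of iterative policy evaluation with expected updates. The model format `P[s][a] = [(prob, s_next, reward, done), ...]` is an assumed (gym-style) convention, not something defined in these notes:

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation (prediction).

    P[s]: dict mapping each action a to a list of (prob, s_next, reward, done) transitions.
    policy[s][a]: pi(a|s).
    """
    n_states = len(P)
    V = np.zeros(n_states)          # terminal states keep value 0
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward, done in P[s][a]:
                    v_new += pi_sa * prob * (reward + gamma * (0.0 if done else V[s_next]))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```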

policy improvement

Policy improvement theorem: for two deterministic policies \(\pi\) and \(\pi'\), if for all \(s\in S\)

\(q_\pi(s,\pi'(s))\geq v_\pi(s)\)

then for all states \(s\in S\)

\(v_{\pi'}(s)\geq v_\pi(s)\) (the proof repeatedly expands the left-hand side and applies the assumption),

so \(\pi'\) is at least as good as \(\pi\).

How do we find such a \(\pi'\)? Act greedily with respect to \(v_\pi\):

\(\pi'(s)=\underset{a}{argmax}\sum_{s',r}p(s',r|s,a)[r+\gamma v_\pi(s')]\)

Unless \(\pi\) is already optimal, the greedy policy is a strict improvement somewhere (strict inequality).

This holds for deterministic policies and extends to stochastic policies.

policy iteration

Evaluation and improvement alternate; in practice evaluation need not be run to convergence before doing an improvement step.

value iteration

Value iteration performs only a single sweep of policy evaluation between improvements:

\(v_{k+1}(s)=\underset{a}{max}\sum_{s',r}p(s',r|s,a)(r+\gamma v_k(s'))\)

It can also be viewed as turning the Bellman optimality equation into an update rule.
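A corresponding value-iteration sketch under the same assumed model format, sweeping with the max-backup and then extracting the greedy policy:

```python
import numpy as np

def value_iteration(P, gamma=0.9, theta=1e-8):
    """Value iteration: repeated max-backups, then greedy policy extraction.

    P[s]: dict mapping each action a to a list of (prob, s_next, reward, done) transitions.
    """
    n_states = len(P)
    V = np.zeros(n_states)

    def q_value(s, a):
        return sum(prob * (reward + gamma * (0.0 if done else V[s_next]))
                   for prob, s_next, reward, done in P[s][a])

    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q_value(s, a) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    policy = {s: max(P[s], key=lambda a: q_value(s, a)) for s in range(n_states)}
    return V, policy
```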

GPI (generalized policy iteration)

Evaluation and improvement can be interleaved in almost any order. Evaluation pushes the value function toward consistency with the current policy; improvement changes the policy to be greedy with respect to the value function, which makes the value function inconsistent with the new policy again.

The evaluation and improvement processes in GPI can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy no longer to be greedy. In the long run, however, these two processes interact to find a single joint solution: the optimal value function and an optimal policy.

Bootstrapping: estimating the value of the current state from the value estimates of successor states.

Chapter 5 Monte Carlo method

No model of the dynamics p is needed; only experience is required.

Learning from actual experience is striking: it requires no prior knowledge of the environment's dynamics.

Learning from simulated experience is also powerful: only sample transitions are needed, not the full distributions.

MC methods are based on averaging sample returns, and are defined here only for episodic tasks.

MC prediction

each occurrence of state s is called a visit to s

first-visit-MC

every-visit-MC

The two differ in their theoretical properties,

but both converge to \(v_\pi(s)\).
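A sketch of first-visit MC prediction; the episode format (a list of (state, reward) pairs, with the reward being \(R_{t+1}\) received after \(S_t\)) is an assumption for the example:

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """First-visit Monte Carlo prediction of v_pi from complete episodes."""
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    for episode in episodes:
        # record the first time each state is visited
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit:
                first_visit[s] = t
        # compute returns G_t by working backwards
        G = 0.0
        G_from = {}
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = r + gamma * G
            G_from[t] = G
        # average the return following each first visit
        for s, t in first_visit.items():
            returns_sum[s] += G_from[t]
            returns_cnt[s] += 1
    return {s: returns_sum[s] / returns_cnt[s] for s in returns_sum}
```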

MC does not bootstrap.

MC can evaluate just the states of interest and ignore the rest.

MC estimation of action values

Without a model, knowing \(v_\pi\) alone does not tell us how to choose actions, so we need to learn \(q_\pi(s,a)\).

MC needs sufficient exploration; a deterministic policy will not do.

One option is to start each episode in a randomly chosen state-action pair: exploring starts (not very practical; the alternative is to use stochastic policies).

MC control

MC estimation is simply one way of carrying out the evaluation step of GPI.

\(\pi(s) =\underset{a}{argmax}\,q(s,a)\)

\(q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}(s,\underset{a}{argmax}\,q_{\pi_k}(s,a))= \underset{a}{max}q_{\pi_k}(s,a)\geq q_{\pi_k}(s,\pi_k{(s)}) \geq v_{\pi_k}(s)\)

There is no formal proof yet that this converges to the optimal policy.

Monte Carlo Control without Exploring Starts

Exploring starts is unrealistic in practice.

on policy: attempt to evaluate or improve the policy that is used to generate the data.

off policy: attempt to evaluate or improve a policy different from that used to generate the data

On-policy methods generally use soft policies: \(\pi(a|s)>0,\ \forall s\in S, a\in A\).

The policy improvement theorem still goes through for such policies.

Off-policy prediction via importance sampling

Evaluation needs plenty of exploratory data to learn from, while improvement keeps making the policy greedier. If the learning data are generated by the current policy, learning is on-policy, and the policy must stay exploratory, so it can only approximate optimality.

The other approach uses two policies: a behavior policy whose only job is to generate data, and a target policy that is evaluated and improved from those data. In DP, evaluation bootstraps by computing expectations; in MC, it averages returns, relying on the law of large numbers.

MC averaging, however, assumes the data were generated by the policy being evaluated.

Since the data-generating distribution is not the distribution we care about,

we correct for the mismatch by importance sampling.

On-policy learning can then be seen as the special case of off-policy learning in which the two policies coincide.

This section considers only the prediction problem.

behavior policy \(b\)

target policy \(\pi\)

Coverage: \(\pi(a|s)>0\) implies \(b(a|s)>0\).

\(\begin{align*}Pr\{A_t,S_{t+1},\cdots ,S_T |S_t,A_{t:T-1}\sim \pi\}=&\pi(A_t|S_t)p(S_{t+1}|S_t,A_t)\pi(A_{t+1}|S_{t+1})\cdots p(S_T|S_{T-1},A_{T-1})\\=&\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)\end{align*}\)

\(\rho_{t:T-1}= \frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k,A_k)}{\prod_{k=t}^{T-1}b(A_k|S_k)p(S_{k+1}|S_k,A_k)}=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}\)

\(E[\rho_{t:T -1}G_t | S_t=s] = v_\pi(s)\)

Let \(J(s)\) be the set of time steps at which state s is visited, with all episodes laid end to end on one time axis.

ordinary importance sampling: O

\(V(s)=\frac{\sum_{t\in J(s)}\rho_{t:T-1}G_t}{|J(s)|}\)

weighted importance sampling: W

\(V(s)=\frac{\sum_{t\in J(s)}\rho_{t:T-1}G_t}{\sum_{t\in J(s)}\rho_{t:T-1}}\)

first visit:

  • O: unbiased, but high (possibly unbounded) variance
  • W: biased (though the bias converges to zero), much lower variance

every visit:

both estimators are biased.

incremental implementation

For ordinary IS, the incremental averaging of Chapter 2 applies directly.

weighted IS

\(V_n=\frac{\sum_{k=1}^{n-1}W_{k}G_k}{\sum_{k=1}^{n-1}W_k},n\geq2\)

\(V_{n+1} = V_{n} + \frac{W_n}{C_n}(G_n-V_n),n\geq 1,\quad C_{n+1} = C_n+W_{n+1},\ C_0=0\)

Learning always proceeds from the tail of each episode, which makes it slow.
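A sketch of the incremental weighted-importance-sampling update above, processing one episode from its tail; the episode format and the callable policies `pi(a, s)` / `b(a, s)` are assumptions:

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi, b, gamma=1.0):
    """Off-policy MC prediction of v_pi via weighted importance sampling.

    episodes: list of episodes, each a list of (state, action, reward) triples
              generated by the behavior policy b (reward is R_{t+1}).
    pi(a, s), b(a, s): action probabilities under target and behavior policies.
    """
    V = defaultdict(float)
    C = defaultdict(float)           # cumulative sum of weights per state
    for episode in episodes:
        G, W = 0.0, 1.0              # W accumulates rho_{t:T-1}
        for s, a, r in reversed(episode):   # learn from the tail of the episode
            G = gamma * G + r
            W *= pi(a, s) / b(a, s)  # state values need the ratio for A_t as well
            if W == 0.0:             # all earlier steps would also get weight 0
                break
            C[s] += W
            V[s] += (W / C[s]) * (G - V[s])
    return dict(V)
```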

Per-decision Importance Sampling

\(\begin{align*}\rho_{t:T-1}G_t=&\rho_{t:T-1}(R_{t+1}+\gamma R_{t+2}+\cdots )\\=&\rho_{t:T-1}R_{t+1}+\gamma \rho_{t:T-1} R_{t+2}+\cdots \end{align*}\)

\(\rho_{t:T-1}R_{t+1}=\frac{\pi(A_t|S_t)\pi(A_{t+1}|S_{t+1})\cdots}{b(A_t|S_t)b(A_{t+1}|S_{t+1})\cdots}R_{t+1}\)

\(R_{t+1}\) depends only on the first ratio; the remaining factors are independent of it and have expectation 1, so

\(E[\rho_{t:T-1}R_{t+1}]=E[\rho_{t:t}R_{t+1}]\)

\(E[\rho_{t:T-1}G_t]=E[\overset{\sim}G_t]\)

\(\tilde{G}_t = \rho_{t:t}R_{t+1} + \gamma \rho_{t:t+1}R_{t+2} + \gamma^2 \rho_{t:t+2}R_{t+3} + \cdots\)

Per-decision IS can reduce variance.

\(V(s) =\frac{\sum_{t\in J(s)}\tilde{G}_t}{|J(s)|}\)

Is there a per-decision version of weighted importance sampling? That is not yet clear.

Chapter 6 TD learning

TD is a method between DP and MC: it needs no model, and it bootstraps.

MC :\(V(s_t) = V(s_t)+ \alpha (G_t-V(s_t))\)

TD:\(V(s_t) = V(s_t)+ \alpha (R_{t+1} +\gamma V(s_{t+1})-V(s_t))\)

TD(0), one-step TD

\(\begin{align*}v_\pi(s) =& E_\pi[G_t|S_t=s]\\ =& E_\pi[R_{t+1} +\gamma G_{t+1}|S_t=s]\\=& E_\pi[R_{t+1}+ \gamma v_\pi(S_{t+1})|S_t=s]\end{align*}\)

TD combines the sampling of MC with the bootstrapping of DP.

TD-error: \(\delta_t = R_{t+1}+\gamma V(S_{t+1})-V(S_t)\)
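A sketch of tabular TD(0) prediction built around this TD error; the `env.reset()` / `env.step()` interface returning a 3-tuple and the callable `policy(state)` are assumptions:

```python
from collections import defaultdict

def td0_prediction(env, policy, episodes=500, alpha=0.1, gamma=0.99):
    """Tabular TD(0): V(S_t) += alpha * (R_{t+1} + gamma*V(S_{t+1}) - V(S_t))."""
    V = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)           # assumed 3-tuple interface
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])         # step size times TD error
            s = s_next
    return V
```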

Advantages of TD(0)

Batch updating: for a fixed batch of data, TD(0) repeatedly applies its updates (summing TD errors) or MC repeatedly averages returns, until the value function converges.

Under batch updating, TD(0) does better than MC.

Batch MC finds the estimates that minimize mean-squared error on the training set,

whereas batch TD(0) finds the estimates that would be exactly correct for the maximum-likelihood model of the MDP: the certainty-equivalence estimate.

Sarsa (on-policy TD control) and Q-learning (off-policy TD control)

\(Q(S_t,A_t) = Q(S_t,A_t)+ \alpha [R_{t+1}+\gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)]\)

\(Q(S_t,A_t) = Q(S_t,A_t)+ \alpha [R_{t+1}+\gamma \, \underset{a}{max} \,Q(S_{t+1},a)-Q(S_t,A_t)]\)

Expected Sarsa (lower variance; if the target policy is greedy, it becomes Q-learning):

\(Q(S_t,A_t) = Q(S_t,A_t)+ \alpha [R_{t+1}+\gamma \, \underset{a}{\sum}\pi(a|S_{t+1}) \,Q(S_{t+1},a)-Q(S_t,A_t)]\)
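For comparison, minimal sketches of the three one-step targets (Sarsa, Q-learning, expected Sarsa) operating on a tabular Q array; the \(\epsilon\)-greedy expectation and the array layout are assumptions:

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma):
    # on-policy: bootstrap from the action actually taken next
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma):
    # off-policy: bootstrap from the greedy next action
    return r + gamma * Q[s_next].max()

def expected_sarsa_target(Q, r, s_next, gamma, epsilon):
    # expectation over an assumed epsilon-greedy policy pi(.|s_next)
    n_a = Q.shape[1]
    probs = np.full(n_a, epsilon / n_a)
    probs[Q[s_next].argmax()] += 1.0 - epsilon
    return r + gamma * probs @ Q[s_next]

def td_update(Q, s, a, target, alpha=0.1):
    Q[s, a] += alpha * (target - Q[s, a])
```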

Maximization bias: taking the maximum over estimated values typically introduces a positive bias.

Roughly, the return of a state-action pair is a distribution with good and bad outcomes whose expectation may be unremarkable; by always wanting the good outcome, the maximum over noisy estimates overestimates the value.

The root of the problem is that the same estimates are used both to decide which action is maximal and to evaluate its value. The fix is to use two estimates: one to select, the other to evaluate.

double q-learning

\(Q_1(S_t,A_t) = Q_1(S_t,A_t)+ \alpha [R_{t+1}+\gamma \, Q_2(S_{t+1},\underset{a}{argmax} \,Q_1(S_{t+1},a))-Q_1(S_t,A_t)]\)
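A sketch of a single double Q-learning update matching the formula above (a coin flip decides which table selects the argmax and which evaluates it; tabular numpy arrays are assumed):

```python
import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, done,
                    alpha=0.1, gamma=0.99, rng=None):
    """One double Q-learning update: one table selects argmax, the other evaluates it."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:
        selector, evaluator = Q1, Q2
    else:
        selector, evaluator = Q2, Q1
    if done:
        target = r
    else:
        a_star = selector[s_next].argmax()          # selection by one table
        target = r + gamma * evaluator[s_next, a_star]   # evaluation by the other
    selector[s, a] += alpha * (target - selector[s, a])
```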

Most tasks are handled with action-value functions.

But there are special cases, such as board games, where we know exactly what state our action leads to, and different state-action pairs can even lead to the same resulting state.

An afterstate value function evaluates the position reached immediately after the agent's move.

Chapter 7 n-step bootstrapping

n-step TD and n-step Sarsa


\(G_t= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ··· + \gamma^{T-t-1}R_T\)

\(G_{t:t+1} = R_{t+1} +\gamma V_t(S_{t+1})\)

\(G_{t:t+2} = R_{t+1} +\gamma R_{t+2}+\gamma^2 V_{t+1}(S_{t+2})\)

\(G_{t:t+n} = R_{t+1} +\gamma R_{t+2}+\cdots +\gamma^{n-1}R_{t+n} +\gamma^{n} V_{t+n-1}(S_{t+n})\)

\(V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha [G_{t:t+n}-V_{t+n-1}(S_t)]\)
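A sketch of on-policy n-step TD prediction for one episode, following the usual buffering scheme; the environment and policy interfaces are the same assumptions as before:

```python
from collections import defaultdict

def n_step_td_episode(env, policy, V=None, n=4, alpha=0.1, gamma=0.99):
    """Run one episode of n-step TD prediction, updating V in place."""
    V = defaultdict(float) if V is None else V
    states, rewards = [env.reset()], [0.0]     # rewards[t] holds R_t (index 0 unused)
    T = float('inf')
    t = 0
    while True:
        if t < T:
            s_next, r, done = env.step(policy(states[t]))   # assumed interface
            states.append(s_next)
            rewards.append(r)
            if done:
                T = t + 1
        tau = t - n + 1                        # the time whose estimate is updated
        if tau >= 0:
            G = sum(gamma ** (i - tau - 1) * rewards[i]
                    for i in range(tau + 1, min(tau + n, T) + 1))
            if tau + n < T:
                G += gamma ** n * V[states[tau + n]]
            V[states[tau]] += alpha * (G - V[states[tau]])
        if tau == T - 1:
            return V
        t += 1
```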

The off-policy version uses importance sampling:

\(V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \rho_{t:t+n-1}[G_{t:t+n} -V_{t+n-1}(S_t)]\)

For action values, the action in the first state has already been taken, so the ratio covers only the subsequent actions:

\(Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n}[G_{t:t+n} - Q_{t+n-1}(S_t, A_t)]\)

Note that the importance sampling ratio here starts and ends one step later than for n-step TD (7.9). This is because here we are updating a state–action pair. We do not have to care how likely we were to select the action; now that we have selected it we want to learn fully from what happens, with importance sampling only for subsequent actions.

\(\rho_{t:h} =\prod_{k=t}^{min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}\)

When \(\rho_{t:h}=0\) the target becomes 0 (a single zero factor zeroes the whole product), which introduces large variance.

For the state-value function, the recursive return

\(G_{t:h} = R_{t+1}+\gamma G_{t+1:h},G_{h:h} = V_{h-1}(S_h)\)

becomes, with a control variate,

\(G_{t:h} = \rho_t(R_{t+1}+\gamma G_{t+1:h})+(1-\rho_t)V_{h-1}(S_t),G_{h:h} = V_{h-1}(S_h)\)

Since \(E[\rho_t]=1\), the extra term does not change the expected update. When \(\rho_t=0\) the target falls back to \(V_{h-1}(S_t)\) instead of 0, so the TD error is 0 rather than a large jump; \((1-\rho_t)V_{h-1}(S_t)\) is the control variate.

For the action-value function:

\(\begin{align*}G_{t:h} =&R_{t+1}+ \gamma (\rho_{t+1} G_{t+1:h}+\bar{V}_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1}))\\=&R_{t+1}+ \gamma \rho_{t+1} (G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma\bar{V}_{h-1}(S_{t+1})\end{align*}\)

\(\bar{V}_{h-1}(S_{t+1})-\rho_{t+1}Q_{h-1}(S_{t+1},A_{t+1})\) is the control variate.

Off-policy learning without importance sampling

tree backup algorithm

It is similar to expected Sarsa.

The behavior policy b can be arbitrary (even purely random).

\(\begin{align*}G_{t:h}=& R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q_{h-1}(S_{t+1}, a)\\=& R_{t+1} + \gamma \sum_{a\neq A_{t+1}} \pi(a|S_{t+1})Q_{h-1}(S_{t+1}, a) + \gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\end{align*}\)

Summary

n-step sarsa

n-step expected sarsa

tree-backup algorithm

\(\rho\) marks the places where off-policy learning requires importance sampling.

Tree-backup return:

\(\begin{align*}G_{t:h}=& R_{t+1} + \gamma \sum_{a\neq A_{t+1}} \pi(a|S_{t+1})Q_{h-1}(S_{t+1}, a) + \gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\\=&R_{t+1} + \gamma \bar{V}_{h-1}(S_{t+1})-\gamma\pi(A_{t+1}|S_{t+1})Q_{h-1}(S_{t+1}, A_{t+1}) + \gamma \pi(A_{t+1}|S_{t+1})G_{t+1:h}\\=&R_{t+1} + \gamma \pi(A_{t+1}|S_{t+1})(G_{t+1:h}-Q_{h-1}(S_{t+1}, A_{t+1}))+ \gamma \bar{V}_{h-1}(S_{t+1})\end{align*}\)

n-step Sarsa return (with control variate):

\(G_{t:h}=R_{t+1}+ \gamma \rho_{t+1} (G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma\bar{V}_{h-1}(S_{t+1})\)

\(Q(\sigma)\) return:

\(G_{t:h}=R_{t+1}+ \gamma \bigl(\sigma_{t+1}\rho_{t+1}+(1-\sigma_{t+1})\pi(A_{t+1}|S_{t+1})\bigr) \bigl(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1})\bigr)+\gamma\bar{V}_{h-1}(S_{t+1})\)

\(\sigma_t\in[0,1]\) blends the two: full sampling via \(\rho\) versus full expectation via \(\pi\).

for \(t<h \leq T\). The recursion ends with \(G_{h:h}=Q_{h-1}(S_h, A_h)\) if \(h<T\), or with \(G_{T-1:T}=R_T\) if \(h = T\). Then the general off-policy update for n-step Sarsa (7.11 in the book) is used.

Chapter 8 Planning and learning

Model: given a state and an action, a model produces a prediction of the resultant next state and next reward.

Sample model: returns one next state and reward sampled according to the probabilities.

Distribution model: returns the probabilities \(p(s',r|s,a)\) of all next states and rewards.

model is used to simulate the environment and produce simulated experience.

State-space planning: search through the state space for an optimal policy.

Plan-space planning: search through the space of plans (not considered further here).

model\(\to\) simulated experience \(\to\) values \(\to\) policy

basic idea:

  • improve the policy by computing value functions
  • compute the values from (simulated) experience via updates or backups (a minimal sketch of this loop follows the list)
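A minimal tabular Dyna-Q-style sketch of this loop: real experience drives a Q-learning update and a learned (deterministic) sample model, and extra planning steps replay stored model transitions; the interfaces and hyperparameters are assumptions:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=50, n_planning=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning from simulated experience."""
    Q = defaultdict(float)                  # Q[(s, a)]
    model = {}                              # model[(s, a)] = (r, s_next, done)
    actions = list(range(env.n_actions))    # assumed attribute on the environment

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)                     # assumed interface
            target = r if done else r + gamma * Q[(s_next, greedy(s_next))]
            Q[(s, a)] += alpha * (target - Q[(s, a)])         # direct RL (Q-learning)
            model[(s, a)] = (r, s_next, done)                 # model learning
            for _ in range(n_planning):                       # planning with simulated experience
                (ps, pa), (pr, ps_next, pdone) = random.choice(list(model.items()))
                ptarget = pr if pdone else pr + gamma * Q[(ps_next, greedy(ps_next))]
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
```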

when model is wrong

When the model is correct, model-based methods are typically more sample-efficient than model-free methods.

When the model is wrong, it has to be corrected. Two kinds of error arise.

In the first, the model disagrees with the environment at states the agent actually visits; the error gets corrected once the agent experiences those states in the real environment.

In the second, the model agrees with the environment locally and believes the current behavior is optimal, but the environment has changed elsewhere so that a better policy now exists; the model never finds out and the agent is stuck in a local optimum.

prioritized sweeping

Sample updates vs. expected updates

The choice of one-step update varies along three binary dimensions, giving \(2^3=8\) kinds of value-function update, one of which is not useful.

expected update

\(Q(s,a) =\sum_{s',r}p(s',r|s,a)[r + \gamma \underset{a'}{max} Q(s',a')]\)

sample updates

\(Q(s,a) = Q(s,a) + \alpha [r + \gamma \underset{a'}{max} Q(s',a') - Q(s,a)]\)

With a limited computation budget, sample updates are often more efficient than expected updates.

Moreover, with sample updates the successor values Q(s',a') keep getting updated along the way, so the estimates being bootstrapped from are more accurate.

trajectory sampling

Classical DP sweeps update every state (or state-action pair) exhaustively.

Alternatively, updates can be distributed by sampling states from some distribution.

Sampling from the uniform distribution is essentially the same as exhaustive sweeps.

on-policy distribution, that is, according to the distribution observed when following the current policy.

Sampling according to the on-policy distribution, i.e., along simulated trajectories, is called trajectory sampling.

Real-time DP (RTDP)

\(v_{k+1}(s) = \underset{a}{max}\sum_{s',r}p(s',r|s,a)[r+ \gamma v_{k}(s')]\)

RTDP is an on-policy trajectory-sampling version of asynchronous value-iteration DP.

It need not update every state infinitely often; it focuses on the subset of states relevant to the current policy.

It attends only to the states actually encountered, and once the value function stabilizes the greedy policy is optimal on those relevant states, which conventional DP does not offer.

planning at decision time

Planning comes in two flavors.

Background planning uses data from the model to improve the value function or policy, and then, given the current \(S_t\), simply picks the best action; since the improvement covers all states it is called background planning.

Decision-time planning: given \(S_t\), produce \(A_t\).

Heuristic Search

Heuristic search focuses computation on the current state, searching deeper from it.

rollout algorithm

Rollout algorithms perform Monte Carlo control on simulated trajectories that all start from the current state.

MCTS

MCTS repeats four steps:

Step 1, selection: follow the tree policy (which usually trades off exploration and exploitation, e.g., a UCB rule) from the root to a leaf node.

Step 2, expansion: expand the selected leaf, usually by adding one child node at random.

Step 3, simulation: from there, simulate a full trajectory using the rollout policy.

Step 4, backup: use the simulated trajectory's return to back up the values of the leaf and its ancestors; the simulated trajectory beyond the tree is not stored.

When the environment moves to the next state, the relevant subtree of the current tree can be reused.

A third dimension along which methods vary is on-policy vs. off-policy.

Chapter 9 Approximate Solution Methods

With approximate value functions, changing a parameter affects the values of many states to different degrees.

So we need a single measure of error across all states:

\(\overline{VE}(w) = \sum_{s}\mu(s)(\hat{v}(s,w)-v_\pi(s))^2\)

\(\mu(s)\) is the on-policy distribution; it expresses how much we care about each state.

\(\eta(s) = h(s) + \sum_{\overline{s}} \eta(\overline{s}) \sum_{a}\pi(a|\overline{s})p(s|a,\overline{s})\)

\(\mu(s) = \frac{\eta(s)}{\sum_{s'}\eta(s')},\ \forall s\in S\)

\(\eta(s)\) is the expected time spent in s: the probability h(s) of starting an episode in s, plus the expected number of visits arriving from predecessor states \(\overline{s}\).

The goal is to find \(w^*\) with \(\overline{VE}(w^*)\leq \overline{VE}(w),\forall w\),

or, failing that, a local optimum.

SGD and semi-gradient methods

\(\begin{align*}w_{t+1} =& w_{t} -\tfrac{1}{2}\alpha \nabla [v_\pi(S_t)-\hat{v}(S_t,w_t)]^2\\=&w_t+\alpha [v_\pi(S_t)-\hat{v}(S_t,w_t)]\nabla \hat v(S_t,w_t)\end{align*}\)

(The sign flips when the gradient of the squared error is expanded; this is how the book writes it.)

Gradient MC, which uses the return \(G_t\) as the target, is unbiased.

Bootstrapping targets give semi-gradient methods,

because the target itself contains \(w_t\) but is not differentiated:

\(w_{t+1}=w_t+\alpha [R_{t+1}+\gamma \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_t)]\nabla \hat v{(S_t,w_t)}\)
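A sketch of semi-gradient TD(0) with a linear value function, where \(\nabla\hat v(s,w)=x(s)\); the feature function and environment interface are assumptions:

```python
import numpy as np

def semi_gradient_td0(env, policy, features, d, episodes=200, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with a linear value function v_hat(s, w) = w @ features(s)."""
    w = np.zeros(d)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))      # assumed interface
            x, x_next = features(s), features(s_next)
            v_next = 0.0 if done else w @ x_next       # target uses w but is not differentiated
            delta = r + gamma * v_next - w @ x         # TD error
            w += alpha * delta * x                     # gradient of v_hat is just x(s)
            s = s_next
    return w
```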

linear model

\(\begin{align*}\hat{v}(s,w)=&w^Tx(s)\\=& \sum_{i=1}^{d}w_ix_i(s)\end{align*}\)

x(s) is the feature vector of state s.

\(\nabla v(s,w) = x(s)\)

Gradient MC converges to the global optimum of \(\overline{VE}\) in the linear case,

but semi-gradient TD(0) does not; it converges to a different point, the TD fixed point.

\(\begin{align*}w_{t+1}=&w_t + \alpha [R_{t+1}+\gamma w_{t}^Tx_{t+1}-w_t^Tx_{t} ] x_t\\=&w_t + \alpha [R_{t+1}x_t -x_t(x_t-\gamma x_{t+1})^Tw_t]\end{align*}\)

\(E(w_{t+1}|w_t) = w_t+\alpha(b-Aw_t)\)

\(b = E[R_{t+1}x_t]\)

\(A=E[x_t(x_t-\gamma x_{t+1})^T]\)

If it converges, then

\(b-Aw_{TD}=0\)

\(w_{TD}=A^{-1}b\)

This is the TD fixed point; the existence of \(A^{-1}\) can be proven.

\(\overline{VE}(w_{TD})\leq \frac{1}{1-\gamma} min_{w} \overline{VE}(w)\)

Since \(\gamma\) is often close to 1, the factor \(\frac{1}{1-\gamma}\) can be large, so this bound can be loose.

Feature Construction for Linear Methods

polynomials

A state can be represented as a vector \((s_1,s_2,\cdots,s_k)^T\).

x(s) can consist of polynomial terms of the state components, which lets the different dimensions of the state interact.

With polynomials of order n,

\(x_i(s)= \prod_{j=1}^{k}s_j^{c_{i,j}}\)

\(c_{i,j}\in \{0,1,\ldots,n\}\)

there are \((n+1)^k\) features in total.

This is infeasible when the state dimension or the order is high.

Fourier series

\(s =(s_1,s_2,\cdots,s_k)^T\)

\(x_i(s) = cos(\pi s^T c_i)\)

\(c_i=(c_1^i,c_2^i,\cdots,c_k^i)\)

\(c_j^i\in\{0,1,...,n\}\)

\(i=1,\cdots,(n+1)^k\)

Discontinuities are hard to represent and produce ringing; avoiding this requires very high-frequency terms (a high sampling rate).

Coarse Coding

Tile Coding

Tile coding suits multi-dimensional continuous spaces: it generalizes well and is computationally cheap.

Different tiling offsets produce different generalization patterns; asymmetric offsets work somewhat better than uniform ones.

Radial Basis Functions

\(x_i(s)=\exp\bigl(-\frac{\|s-c_i\|^2}{2\sigma_i^2}\bigr)\)

RBFs extend coarse coding from binary 0/1 features to graded values in [0,1] that incorporate the distance to a set of chosen centers; the distances to the n centers form the feature vector. In practice the benefit is marginal.
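A small sketch of RBF feature construction for a continuous state; the grid of centers and the width are assumptions chosen for the example:

```python
import numpy as np

def rbf_features(s, centers, sigma=0.5):
    """Radial basis features: x_i(s) = exp(-||s - c_i||^2 / (2*sigma^2))."""
    s = np.atleast_1d(s)
    diffs = centers - s                    # centers: array of shape (num_features, state_dim)
    return np.exp(-np.sum(diffs ** 2, axis=1) / (2.0 * sigma ** 2))

# usage: 25 centers on a 5x5 grid over [0,1]^2 (an assumed layout)
grid = np.linspace(0.0, 1.0, 5)
centers = np.array([[cx, cy] for cx in grid for cy in grid])
x = rbf_features(np.array([0.3, 0.7]), centers, sigma=0.25)
```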

Selecting Step-Size Parameters Manually

\(\alpha = \bigl(\tau E[x^Tx]\bigr)^{-1}\)

Suppose you wanted to learn in about \(\tau\) experiences with substantially the same feature vector

Nonlinear Function Approximation: Artificial Neural Networks

Least-Squares TD

\(E[w_{t+1}|w_t]=w_t+\alpha (b-Aw_t)\)

\(w_{TD}=A^{-1}b\)

\(A = E[x_t(x_t-\gamma x_{t+1})^T]\)

\(b= E[R_{t+1}x_t]\)

\(\hat{A}_t = \sum_{k=0}^{t-1}x_k(x_k-\gamma x_{k+1})^T + \epsilon I\)

\(\hat{b}_t = \sum_{k=0}^{t-1}R_{k+1}x_k\)

There is no need to divide by t,

because the factors cancel in \(w_t = \hat{A}_t^{-1}\hat{b}_t\).

\(\begin{align*}\hat{A}_t^{-1} =& (\hat{A}_{t-1} + x_{t-1}(x_{t-1}-\gamma x_t)^{T})^{-1}\\=&\hat{A}_{t-1}^{-1}- \frac{\hat{A}_{t-1}^{-1} x_{t-1} (x_{t-1}-\gamma x_t )^T \hat{A}_{t-1}^{-1}}{1+(x_{t-1}-\gamma x_t)^T \hat{A}_{t-1}^{-1}x_{t-1}}\end{align*}\)

Sherman-Morrison formula
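A sketch of LSTD using the Sherman-Morrison update above to maintain \(\hat A_t^{-1}\) incrementally; the batch-of-transitions format and feature function are assumptions:

```python
import numpy as np

def lstd(transitions, features, d, gamma=0.99, epsilon=1e-3):
    """Least-squares TD from a batch of (s, r, s_next, done) transitions under a fixed policy.

    Maintains A_inv = (sum_t x_t (x_t - gamma*x_{t+1})^T + eps*I)^{-1} via Sherman-Morrison.
    """
    A_inv = np.eye(d) / epsilon        # inverse of eps*I
    b = np.zeros(d)
    for s, r, s_next, done in transitions:
        x = features(s)
        x_next = np.zeros(d) if done else features(s_next)
        u = x - gamma * x_next
        # Sherman-Morrison update of A_inv for the rank-one term x u^T
        Av = A_inv @ x
        A_inv -= np.outer(Av, u @ A_inv) / (1.0 + u @ Av)
        b += r * x
    return A_inv @ b                   # w_TD = A^{-1} b
```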

Memory-based Function Approximation

lazy learning

Store the training examples and defer computation until a query arrives, similar to k-nearest neighbors.

local learning

Distance between states can be defined in many ways.

Nearest neighbor: use the value of the closest stored example.

Weighted average: weight the stored values by distance.

Locally weighted regression: fit a surface locally around the query state.

Memory-based methods can help avoid the curse of dimensionality.

Kernel-based Function Approximation

A kernel function replaces raw distance as the measure of relatedness between states, i.e., of how knowledge generalizes between them.

Kernel regression \(\hat{v}(s,D)=\sum_{s'\in D} k(s,s')g(s')\)

D is the stored data and \(g(s')\) is the target value recorded for \(s'\).

Choosing \(k(s,s')=x(s)^Tx(s')\) gives exactly the same result as linear regression on the features x(s).

Looking Deeper at On-policy Learning: Interest and Emphasis

\(w_{t+n} = w_{t+n-1} +\alpha M_t[G_{t:t+n}-\hat{v}(S_t,w_{t+n-1})] \nabla \hat{v}(S_t,w_{t+n-1})\)

\(M_t\) is the emphasis: how much we care about the update made at time t.

\(M_t=I_t+\gamma^nM_{t-n}\)

\(I_t\) is the interest: how much we care about accurately valuing the state encountered at time t.

Chapter 10 on-policy control with approximation

First, move from TD prediction to Sarsa control:

\(w_{t+1}=w_t + \alpha [U_t -\hat{q}(S_t,A_t,w_t)]\nabla \hat{q}(S_t,A_t,w_t)\)

\(w_{t+1}=w_t + \alpha [R_{t+1} +\gamma \hat{q}(S_{t+1},A_{t+1},w_t) -\hat{q}(S_t,A_t,w_t)]\nabla \hat{q}(S_t,A_t,w_t)\)

episodic semi-gradient one step sarsa

It converges in the same sense as semi-gradient TD(0).

Policy improvement is then done (in the on-policy case treated in this chapter) by changing the estimation policy to a soft approximation of the greedy policy, such as the \(\epsilon\)-greedy policy.

\(G_{t:t+n} = R_{t+1} + \gamma R_{t+2}+ \cdots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{q}(S_{t+n},A_{t+n},w_{t+n-1})\)

\(w_{t+n}=w_{t+n-1} + \alpha [G_{t:t+n} -\hat{q}(S_t,A_t,w_{t+n-1})]\nabla \hat{q}(S_t,A_t,w_{t+n-1})\)

Continuing tasks: the average-reward setting

Discounting is not well suited to tasks that never terminate:

\(\begin{align*}r(\pi)=&\underset{h\to \infty}{lim}\frac{1}{h}\sum_{t=1}^{h}E[R_t|S_0,A_{0:t-1} \sim \pi]\\=&\underset{t\to \infty}{lim}E[R_t|S_0,A_{0:t-1} \sim \pi]\\=&\sum_s \mu_\pi(s)\sum_a \pi{(a|s)} \sum_{s',r}p(s',r|s,a)r\end{align*}\)

\(\mu_\pi(s)=lim_{t\to\infty}Pr(S_t=s|A_{0:t-1} \sim \pi)\)

Eventually the MDP reaches a stationary distribution, independent of the initial state and of the early burn-in portion of the trajectory.

At the stationary distribution:

\(\sum_s\mu_\pi(s)\sum_a \pi{(a|s)} p(s'|s,a) = \mu_\pi(s')\)

In the average-reward setting, returns are defined in terms of differences between rewards and the average reward:

differential return \(G_t = R_{t+1}-r(\pi) + R_{t+2}-r(\pi)+\cdots\)

For the differential value functions \(v_\pi, q_\pi, v_*, q_*\) there are still Bellman equations:

\(v_\pi(s) = \sum_{a} \pi(a|s)\sum_{s',r}p(s',r|s,a)[r-r(\pi)+v_\pi(s')]\)

\(q_\pi(s,a) = \sum_{s',r}p(s',r|s,a)[r-r(\pi)+\sum_{a'}\pi(a'|s')q_\pi(s',a')]\)

\(v_*(s) = \max_{a} \sum_{s',r}p(s',r|s,a)[r-max_\pi r(\pi)+v_*(s')]\)

\(q_*(s,a) = \sum_{s',r}p(s',r|s,a)[r-max_\pi r(\pi)+max_{a'}q_*(s',a')]\)

Differential TD errors:

\(\delta_t = R_{t+1}-\overline{R}_t + \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_t)\)

\(\delta_t = R_{t+1}-\overline{R}_t + \hat{q}(S_{t+1},A_{t+1},w_t)-\hat{q}(S_t,A_t,w_t)\)

\(w_{t+1} = w_{t} + \alpha \delta_t \nabla \hat{q}(S_t,A_t,w_t)\)

In the tabular steady-state analysis, adding a discount \(\gamma\) merely scales the objective to \(\frac{1}{1-\gamma}r(\pi)\) (since \(1+\gamma +\gamma^2+\cdots = \frac{1}{1-\gamma}\)), so discounting appears to make no difference to the ranking of policies. With function approximation, however, that reasoning cannot be used.

Function approximation no longer improves the policy state by state: once the parameters change, the policy changes everywhere, and there is no way to prove an overall improvement.

In the book's words:

The root cause of the difficulties with the discounted control setting is that with function approximation we have lost the policy improvement theorem (Section 4.2). It is no longer true that if we change the policy to improve the discounted value of one state then we are guaranteed to have improved the overall policy in any useful sense.

In short, the setting lacks the theoretical support of the policy improvement theorem.

Chapter 11 off policy methods with approximation

Off-policy learning with approximation faces two difficulties:

1. The target of the update

    The data used for the target policy are generated by the behavior policy.

2. The distribution of the updates

    The states are distributed according to the behavior policy, not the on-policy distribution.

Something more is needed for the second part of the challenge of off-policy learning with function approximation because the distribution of updates in the off-policy case is not according to the on-policy distribution. The on-policy distribution is important to the stability of semi-gradient methods.

Semi-gradient Methods

\(\rho_t = \rho_{t:t}=\frac{\pi(A_t|S_t)}{b(A_t|S_t)}\)

\(w_{t+1} = w_t + \alpha \rho_t \delta_t \nabla \hat{v}(S_t,w_t)\)

\(\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_t)\) episodic, discounted

\(\delta_t = R_{t+1} -\overline{R}+ \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_t)\),continuing, undiscounted

action-value

semi-gradient Expected Sarsa:

It uses only \((S_t,A_t,R_{t+1})\) and takes an expectation over the next action under \(\pi\), so it does not seem to need an importance-sampling correction.

\(w_{t+1} = w_t + \alpha \delta_t \nabla \hat{q}(S_t,A_t,w_t)\)

\(\delta_t = R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})\hat{q}(S_{t+1},a,w_t)-\hat{q}(S_t,A_t,w_t)\)

\(\delta_t = R_{t+1} -\overline{R} + \sum_a \pi(a|S_{t+1})\hat{q}(S_{t+1},a,w_t)-\hat{q}(S_t,A_t,w_t)\)

The multi-step versions do require importance sampling:

\(w_{t+n} =w_{t+n-1}+ \alpha \rho_{t+1} \cdots \rho_{t+n-1}[G_{t:t+n}-\hat{q}(S_t,A_t,w_{t+n-1})]\nabla \hat{q}(S_t,A_t,w_{t+n-1})\)

\(G_{t:t+n} = R_{t+1}+\cdots +\gamma^{n-1}R_{t+n}+ \gamma^n \hat{q}(S_{t+n},A_{t+n},w_{t+n-1})\)

\(G_{t:t+n} = R_{t+1}-\overline{R}_t+\cdots +R_{t+n}-\overline{R}_{t+n-1}+ \hat{q}(S_{t+n},A_{t+n},w_{t+n-1})\)

n-step tree-backup

\(w_{t+n} =w_{t+n-1}+ \alpha [G_{t:t+n}-\hat{q}(S_t,A_t,w_{t+n-1})]\nabla \hat{q}(S_t,A_t,w_{t+n-1})\)

\(G_{t:t+n}=\hat{q}(S_t,A_t,w_{t-1})+\sum_{k=t}^{t+n-1}\delta_{k}\prod_{i=t+1}^{k}\gamma \pi(A_i|S_i)\)

TODO

The second source of instability for off-policy function approximation: on-policy learning keeps correcting its prediction errors on later visits, but off-policy the update distribution does not match the target policy's distribution (for example \(\rho=0\) on some transitions), so a wrong target estimate may never get corrected.

The deadly triad

Function approximation

Bootstrapping

Off-policy training

Combining all three produces instability and possible divergence.

Linear value-function geometry

Suppose the state space is \(\{s_0,s_1,s_2\}\), three states in total; then every value function is a point in 3-dimensional space.

Use a linear model with two parameters \(w_1,w_2\); clearly such a model can only represent a plane inside that 3-dimensional space. For a policy \(\pi\), its true \(v_\pi\) usually lies off the plane, and the best representable solution is the projection of \(v_\pi\) onto the plane, \(\Pi v_\pi\).

Define the norm \(\|v\|^2_\mu =\sum_{s\in S}\mu(s)v(s)^2\), where \(\mu(s)\) weights how much we care about each state.

\(\overline{VE}(w) = \|{v_w}-v_\pi\|^2_\mu\)

\(\Pi v =v_w\), where \(w=\underset{w\in \mathbb{R}^d}{argmin}\|v-v_w\|^2_\mu\)

\(\Pi = X(X^TDX)^{-1}X^TD\), where \(D\) is the \(|S|\times|S|\) diagonal matrix with \(\mu(s)\) on the diagonal, and \(X\) is the \(|S|\times d\) feature matrix.

\(|v|_\mu^2=v^TDv\)

\(v_w=Xw\)

\(v_\pi(s) = \sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)(r+\gamma v_\pi(s'))\)

Bellman error at state s

\(\begin{align*}\overline{\delta}_w(s) =& [\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)(r+\gamma v_w(s'))]-v_w(s)\\=&E_\pi[R_{t+1}+\gamma v_w(S_{t+1})-v_w(S_t)|S_t=s]\end{align*}\)

The Mean Squared Bellman Error measures this error over all states:

\(\overline{BE}(w) = |\overline{\delta}_w|^2_\mu\)

Bellman operator\(B_\pi\)

\((B_\pi v)(s) = \sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)(r+\gamma v(s'))\)

\(\overline{\delta}_w = B_\pi v_w-v_w\)

In DP (without function approximation), repeatedly applying \(B_\pi\) converges to the fixed point \(v_\pi = B_\pi v_\pi\).

In the book's figure this is the gray line: repeated applications of \(B_\pi\) eventually reach \(v_\pi\).

With function approximation, however, anything off the representable plane cannot be expressed: \(B_\pi v_w\) generally leaves the plane, and only its projection back onto the plane can be represented.

\(\overline{PBE}(w) = \|\Pi \overline{\delta}_w\|^2_\mu\)

In the linear case \(\overline{PBE}\) can be driven exactly to zero, namely at the TD fixed point \(w_{TD}\).

Gradient Descent in the bellman error

The TD error (it need not be defined exactly this way):

\(\delta_\theta(s,a,s')=R(s,a,s')+\gamma v_\theta(s')-v_\theta(s)\)

The advantage:

\(\begin{align*}A_\theta(s,a)=& E_{s'\sim P}[\delta_\theta(s,a,s')]\\ =& E_{s'\sim P}[R(s,a,s')+\gamma v_\theta(s')]-v_\theta(s)\end{align*}\)

The Bellman error:

\(\begin{align*}\epsilon_\theta(s)=&E_{a\sim\pi}[A_\theta(s,a)]\\=&E_{a\sim\pi,s'\sim P}[R(s,a,s')+\gamma v_\theta(s')]-v_\theta(s)\end{align*}\)

SGD needs an objective to minimize.

First consider minimizing the TD error.

\(\delta_t=R_{t+1} + \gamma \hat{v}(S_{t+1},w_t)- \hat{v}(S_t,w_t)\)

\(\overline{TDE}(w) =\sum_{s}\mu(s)E[\delta^2_t|S_t=s,A_t\sim \pi]\)

\(\overline{TDE}(w) =\sum_{s}\mu(s)E[\rho_t \delta^2_t|S_t=s,A_t\sim b]\)

\(\overline{TDE}(w)=E_b[\rho_t\delta_t^2]\) if \(\mu\) is the distribution encountered under b, in which case no extra weighting is needed.

\(\begin{align*}w_{t+1} =& w_t -\tfrac{1}{2}\alpha \nabla(\rho_t\delta_t^2)\\=&w_t-\alpha\rho_t \delta_t \nabla \delta_t\\=&w_t +\alpha \rho_t \delta_t (\nabla\hat{v}(S_t,w_t)-\gamma \nabla\hat{v}(S_{t+1},w_t))\end{align*}\)

This is the naive residual-gradient algorithm: it converges, but not to the place we want.

Minimizing the TDE does not give the true value function; the true value function can have a larger TDE than other functions.

The Bellman error for a state is the expected TD error in that state

\(w_{t+1} = w_t -\frac{1}{2}\alpha \nabla(E_\pi[\delta_t]^2)\)

\(w_{t+1} = w_t -\frac{1}{2}\alpha \nabla(E_b[\rho_t\delta_t]^2)\)

\(w_{t+1} = w_t -\alpha E_b[\rho_t\delta_t]\, \nabla E_b[\rho_t\delta_t]\)

\(w_{t+1} = w_t -\alpha E_b[\rho_t(R_{t+1}+\gamma \hat{v}(S_{t+1},w) -\hat{v}(S_t,w))]\,E_b[\rho_t\nabla \delta_t]\)

\(w_{t+1} = w_t +\alpha \bigl[E_b[\rho_t(R_{t+1}+\gamma \hat{v}(S_{t+1},w))]-\hat{v}(S_t,w)\bigr]\bigl[\nabla \hat{v}(S_t,w)- \gamma E_b[\rho_t\nabla \hat{v}(S_{t+1},w)]\bigr]\)

This is called the residual-gradient algorithm.

\(S_{t+1}\) appears in both expectations of the product; for the gradient to be unbiased the two samples of \(S_{t+1}\) must be independent, which requires a deterministic environment or a simulator that can produce two independent next states.

This algorithm also converges,

but it is unsatisfying in three respects:

1. It is slow.

2. It can still converge to the wrong values; since the BE at a state is just the expected TD error, in a deterministic environment the result is the same as minimizing the TDE.

3. Learnability: conventionally the term means solvable in polynomial time, but here it refers to properties that cannot be learned from experience data at all, even though they could be computed given the internal structure of the environment; the \(\overline{BE}\) turns out to be of this kind.

Chapter 12 Eligibility Traces

The \(\lambda\)-return

\(G_{t:t+n}=R_{t+1}+ \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} +\gamma ^n \hat{v}(S_{t+n},w_{t+n-1}),\quad 0\leq t\leq T-n\)

\(G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t:t+n}\)

\(G_t^\lambda = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}G_{t:t+n} +\lambda^{T-t-1}G_t\)

The offline \(\lambda\)-return algorithm:

\(w_{t+1} = w_{t} + \alpha [G_t^\lambda -\hat{v}(S_t,w_t)]\nabla\hat{v}(S_t,w_t)\)

TD(\(\lambda\))

The forward view can only be computed after the episode ends.

The backward view can be computed online and works for continuing tasks.

z is the eligibility trace: a decaying accumulation of the gradients of the recently visited states' values, so each update also credits the earlier states for the current TD error.

TD(\(\lambda\)) is another way of unifying TD and MC.

With \(\lambda = 0\) it is exactly TD(0);

with \(\lambda = 1\) it is a version of MC with decaying credit, called TD(1).

TD(1) is more general than classical MC (e.g., it can be applied online and to continuing tasks).
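A sketch of backward-view semi-gradient TD(\(\lambda\)) with accumulating traces and a linear value function (interfaces as assumed earlier):

```python
import numpy as np

def semi_gradient_td_lambda(env, policy, features, d, episodes=200,
                            alpha=0.01, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) with accumulating eligibility traces, linear v_hat."""
    w = np.zeros(d)
    for _ in range(episodes):
        s, done = env.reset(), False
        z = np.zeros(d)                                # eligibility trace
        while not done:
            s_next, r, done = env.step(policy(s))      # assumed interface
            x, x_next = features(s), features(s_next)
            z = gamma * lam * z + x                    # accumulate gradient of v_hat(s, w)
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - w @ x         # TD error
            w += alpha * delta * z                     # credit earlier states via the trace
            s = s_next
    return w
```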

n-step Truncated \(\lambda\)-return Methods

\(G_{t:h}^\lambda = (1-\lambda)\sum_{n=1}^{h-t-1}\lambda^{n-1}G_{t:t+n} +\lambda^{h-t-1}G_{t:h},0\leq t\leq h\leq T\)

\(w_{t+n} = w_{t+n-1} + \alpha [G^\lambda_{t:t+n} -\hat{v}(S_t,w_{t+n-1})]\nabla\hat{v}(S_t,w_{t+n-1})\)

The k-step \(\lambda\)-return can be computed iteratively:

\(G_{t:t+k}^\lambda =\hat{v}(S_t,w_{t-1})+ \sum_{i=t}^{t+k-1}(\gamma\lambda )^{i-t}\delta_i'\)

\(\delta'_t=R_{t+1}+\gamma \hat{v}(S_{t+1},w_t)-\hat{v}(S_t,w_{t-1})\)

The remaining material in this chapter is not covered in these notes.

Chapter 13 Policy Gradient Methods

\(\pi(a|s,\theta)\)



REINFORCE with baseline is not an actor-critic algorithm, because its value function is not used for bootstrapping.

