Reinforcement Learning: An Introduction - Exercise Solutions (Chapters 3 and 4)


Reinforcement Learning: An Introduction (second edition) - Chapter 3,4


Chapter 3

3.1 Devise three example tasks of your own that fit into the MDP framework, identifying for each its states, actions, and rewards. Make the three examples as different from each other as possible. The framework is abstract and flexible and can be applied in many different ways. Stretch its limits in some way in at least one of your examples.

  • Atari games and first-person 3D games. The states are the image inputs, the actions are the game's available controls, and the reward is the score defined by the game's objective. A distinctive feature of these tasks is that the input is usually raw pixels, with no hand-crafted state representation.
  • Board games such as Gomoku or Go. The state is the board position, the action is where to place a stone, and the reward is +1 for a win, -1 for a loss, and 0 otherwise. One distinctive feature is sparse reward: only terminal states yield a non-zero reward. Another is that the dynamics \(p(s^\prime,r|s,a)\) are deterministic: there are no stochastic transitions, only a single successor state and reward.
  • Autonomous driving, aircraft control, and the like. The states are radar and other sensor signals, the actions are speed and steering commands, and the reward reflects how well the vehicle is driven. A distinctive feature is that there is usually no terminal state and no win/lose condition (setting crashes aside), so learning and acting continue indefinitely, which has a life-long/continual learning flavor.

3.2 Is the MDP framework adequate to usefully represent all goal-directed learning tasks? Can you think of any clear exceptions?

  • Clearly not; otherwise there would be no need for the Markov assumption in the first place. Concrete exceptions are tasks where decisions cannot be made from the most recent state alone, for example games with fog of war such as StarCraft or League of Legends (partially observable Markov decision processes).

3.3 Consider the problem of driving. You could define the actions in terms of the accelerator, steering wheel, and brake, that is, where your body meets the machine. Or you could define them farther out—say, where the rubber meets the road, considering your actions to be tire torques. Or you could define them farther in—say, where your brain meets your body, the actions being muscle twitches to control your limbs. Or you could go to a really high level and say that your actions are your choices of where to drive. What is the right level, the right place to draw the line between agent and environment? On what basis is one location of the line to be preferred over another? Is there any fundamental reason for preferring one location over another, or is it a free choice?

  • This question is fairly open-ended. If we want to control the driver, the first formulation is reasonable. If we want to control lower-level machinery, the second is more suitable. If we care about controlling the driver's limbs and joints, the third fits better. If we care about higher-level decisions, the fourth is appropriate. Once the problem to be solved is clearly defined, the boundary between agent and environment follows from it. This reflects the fact that reinforcement learning can be applied at many different scales and levels of abstraction.

3.4 Give a table analogous to that in Example 3.3, but for \(p(s^\prime,r|s,a)\). It should have columns for \(s, a, s^\prime, r\), and \(p(s^\prime, r|s, a)\), and a row for every 4-tuple for which \(p(s^\prime, r|s, a) > 0\).

  • Following Example 3.3, the table is as follows (a short code sketch that encodes and checks it appears after the table):

| \(s\) | \(a\) | \(s^\prime\) | \(r\) | \(p(s^\prime,r\|s,a)\) |
|---|---|---|---|---|
| high | search | high | \(r_{search}\) | \(\alpha\) |
| high | search | low | \(r_{search}\) | \(1-\alpha\) |
| high | wait | high | \(r_{wait}\) | \(1\) |
| low | wait | low | \(r_{wait}\) | \(1\) |
| low | recharge | high | \(0\) | \(1\) |
| low | search | high | \(-3\) | \(1-\beta\) |
| low | search | low | \(r_{search}\) | \(\beta\) |
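
The table can also be written down as a small data structure and sanity-checked: for every state–action pair the probabilities must sum to 1, which is exactly what (3.3) and Exercise 3.5 require. A minimal Python sketch; the numerical values of \(\alpha\), \(\beta\), \(r_{search}\), \(r_{wait}\) are placeholders, not taken from the book:

```python
# Recycling-robot dynamics p(s', r | s, a) from the table above, stored as a dict.
# alpha, beta, r_search, r_wait are placeholder values chosen only for illustration.
alpha, beta = 0.8, 0.6
r_search, r_wait = 1.0, 0.5

# Keys: (s, a); values: list of (s', r, probability) triples.
p = {
    ("high", "search"):  [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("high", "wait"):    [("high", r_wait, 1.0)],
    ("low", "wait"):     [("low", r_wait, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
    ("low", "search"):   [("high", -3.0, 1 - beta), ("low", r_search, beta)],
}

# Normalization required by (3.3): the probabilities for each (s, a) sum to 1.
for (s, a), outcomes in p.items():
    total = sum(prob for _, _, prob in outcomes)
    assert abs(total - 1.0) < 1e-12, (s, a, total)
print("all (s, a) rows sum to 1")
```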

3.5
The equations in Section 3.1 are for the continuing case and need to be modified (very slightly) to apply to episodic tasks. Show that you know the modifications needed by giving the modified version of (3.3).

  • The original (3.3): \(\sum_{s^\prime \in \mathcal{S}}\sum_{r \in \mathcal{R}}p(s^\prime,r|s,a)=1, \ \ \text{for all} \ \ s \in \mathcal{S}, a \in \mathcal{A}(s).\) For an episodic task the next state may be the terminal state, so the inner sum over next states must run over \(\mathcal{S}^+\), the state set including the terminal state: \(\sum_{s^\prime \in \mathcal{S}^+}\sum_{r \in \mathcal{R}}p(s^\prime,r|s,a)=1, \ \ \text{for all} \ \ s \in \mathcal{S}, a \in \mathcal{A}(s).\)

3.6
Suppose you treated pole-balancing as an episodic task but also used discounting, with all rewards zero except for −1 upon failure. What then would the return be at each time? How does this return differ from that in the discounted, continuing formulation of this task?

  • The text has already described two formulations. The first treats pole-balancing as an episodic task with termination: a reward of +1 is given on every time step on which failure does not occur. The second treats it as a continuing task with discounting: a reward of -1 on each failure and 0 at all other times. The question asks about treating the task as episodic while also using discounting, with -1 on failure and 0 otherwise. In that case, if failure never occurs the return is 0; if the episode fails at its final time step \(T\), the return at time \(t\) is \(-\gamma^{T-t-1}\), i.e. proportional to \(-\gamma^K\) with \(K<T\) the number of steps remaining before failure. Compared with the discounted continuing formulation, the episodic version has a final time step \(T\) at which the sequence terminates; up to the first failure the returns have the same form, but in the continuing case the return also accumulates a \(-\gamma^{K_k}\) term for every later failure rather than only the first.

3.7 Imagine that you are designing a robot to run a maze. You decide to give it a reward of +1 for escaping from the maze and a reward of zero at all other times. The task seems to break down naturally into episodes—the successive runs through the maze—so you decide to treat it as an episodic task, where the goal is to maximize expected total reward (3.7). After running the learning agent for a while, you find that it is showing no improvement in escaping from the maze. What is going wrong? Have you effectively communicated to the agent what you want it to achieve?

  • Equation (3.7) is the undiscounted return \(G_t \overset{.}{=} R_{t+1}+R_{t+2}+R_{t+3}+...+R_T\). Since only the final success yields +1 and every other step yields 0 regardless of how the robot moves, and the return is undiscounted, every action sequence that eventually escapes has exactly the same return, so there is nothing to improve. Adding a discount factor makes earlier successes yield larger returns, so different action sequences can be distinguished and the agent can improve. A separate concern is sparse reward: if the task is so hard that the agent never finds a successful trajectory, all returns are 0 and actions are again indistinguishable; in that case the algorithm or the reward design must be changed to increase exploration.

3.8 Suppose \(\gamma=0.5\) and the following sequence of rewards is received \(R_1 = -1, R_2 = 2, R_3 = 6, R_4 = 3\), and \(R_5 = 2\), with \(T = 5\). What are \(G_0, G_1,..., G_5\)? Hint: Work backwards.

  • Working backwards (a short code check follows the calculation):

\[\begin{array}{l} G_5= 0 \\ G_4 = R_5 + \gamma G_5 = 2+0.5 \times 0 = 2 \\ G_3 = R_4 + \gamma G_4 = 3 + 0.5 \times 2 = 4\\ G_2 = R_3 + \gamma G_3 = 6 + 0.5 \times 4 = 8\\ G_1 = R_2 + \gamma G_2 = 2+0.5 \times 8 = 6 \\ G_0 = R_1 + \gamma G_1 = -1 + 0.5 \times 6 = 2 \end{array} \]
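
The same backward recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) is easy to verify in code; a quick sketch:

```python
# Returns G_0..G_5 for Exercise 3.8, computed backwards from G_T = 0.
gamma = 0.5
rewards = [-1, 2, 6, 3, 2]            # R_1, ..., R_5, so T = 5

G = [0.0] * (len(rewards) + 1)        # G[5] = G_T = 0
for t in reversed(range(len(rewards))):
    G[t] = rewards[t] + gamma * G[t + 1]   # G_t = R_{t+1} + gamma * G_{t+1}

print(G)                              # [2.0, 6.0, 8.0, 4.0, 2.0, 0.0]
```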

3.9
Suppose \(\gamma = 0.9\) and the reward sequence is \(R_1 = 2\) followed by an infinite sequence of 7s. What are \(G_1\) and \(G_0\)?

  • Using (3.8) and (3.10):

\[\begin{array}{l} G_1 = \sum^{\infty}_{k=0}\gamma^k R_{k+2} = 7\times \sum^{\infty}_{k=0}\gamma^k=7 \times \frac{1}{1-\gamma} = 70 \\ G_0 = R_1 + \gamma G_1 = 2 + 0.9 \times 70 = 65 \end{array} \]

3.10
Prove the second equality in (3.10).

  • Expanding (3.10) via the finite geometric sum:

\[\begin{array}{l} G_t = \sum^{\infty}_{k=0}\gamma^k \\ \quad \ = \lim_{n\rightarrow \infty}(1 + \gamma + \gamma^2 + \cdots + \gamma^{n-1}) \\ \quad \ = \lim_{n\rightarrow \infty}\frac{(1-\gamma)(1+\gamma+\cdots+\gamma^{n-1})}{1-\gamma} \\ \quad \ = \lim_{n\rightarrow \infty}\frac{1-\gamma^n}{1-\gamma} \\ \quad \ = \frac{1}{1-\gamma} \end{array} \]

3.11
If the current state is \(S_t\), and actions are selected according to stochastic policy \(\pi\), then what is the expectation of \(R_{t+1}\) in terms of \(\pi\) and the four-argument function \(p\) (3.2)?

  • Writing it out directly:

\[E_{\pi}[R_{t+1}|S_t=s_t]=\sum_{s_{t+1} \in \mathcal{S}}\sum_{r_{t+1} \in \mathcal{R}}\sum_{a_t \in \mathcal{A}} r_{t+1}p(s_{t+1},r_{t+1}|s_t,a_t)\pi(a_t|s_t) \]

3.12
Give an equation for \(v_{\pi}\) in terms of \(q_{\pi}\) and \(\pi\).

  • Expanding this from the earlier definitions gets messy, so write it directly as an expectation:

\[v_{\pi}(s) = E_{a \sim \pi}[q_{\pi}(s,a)] = \sum_{a \in \mathcal{A}}\pi(a|s)q_{\pi}(s,a) \]

3.13
Give an equation for \(q_{\pi}\) in terms of \(v_{\pi}\) and the four-argument \(p\).

  • As above:

\[q_{\pi}(s,a) = E_{r,s^\prime}[r+\gamma v_\pi(s^\prime)]=\sum_{r,s^\prime}p(s^\prime,r|s,a)[r+\gamma v_{\pi}(s^\prime)] \]

3.14
The Bellman equation (3.14) must hold for each state for the value function \(v_\pi\) shown in Figure 3.2 (right) of Example 3.5. Show numerically that this equation holds
for the center state, valued at +0.7, with respect to its four neighboring states, valued at +2.3, +0.4, −0.4, and +0.7. (These numbers are accurate only to one decimal place.)

  • Writing out (3.14) and substituting (a numeric check in code follows):

\[\begin{array}{l} v_\pi(s) = \sum_a\pi(a|s)\sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma v_\pi(s^\prime)] \\ \qquad \ \ = \frac{1}{4}\times 0.9\times 2.3 + \frac{1}{4}\times 0.9\times 0.4 + \frac{1}{4}\times 0.9\times (-0.4) + \frac{1}{4}\times 0.9\times 0.7 \\ \qquad \ \ = 0.675 \approx 0.7 \end{array} \]
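
The arithmetic can be checked in a couple of lines of Python (values taken from Figure 3.2, accurate to one decimal place):

```python
# Bellman-equation check for the center state of Example 3.5 (Figure 3.2 values).
gamma = 0.9
neighbor_values = [2.3, 0.4, -0.4, 0.7]   # values of the four neighboring states
v_center = sum(0.25 * (0.0 + gamma * v) for v in neighbor_values)  # reward 0 per move
print(v_center)                           # 0.675, which rounds to 0.7
```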

3.15
In the gridworld example, rewards are positive for goals, negative for running into the edge of the world, and zero the rest of the time. Are the signs of these rewards important, or only the intervals between them? Prove, using (3.8), that adding a constant \(c\) to all the rewards adds a constant, \(v_c\), to the values of all states, and thus does not affect the relative values of any states under any policies. What is \(v_c\) in terms of \(c\) and \(\gamma\)?

  • The wording is a bit convoluted on first read. The question is: for these positive and negative rewards, is it the actual values that matter, or only the intervals between them? Then prove that adding a constant \(c\) to every reward simply adds a constant \(v_c\) to the value of every state, and therefore does not change the relative values of any states under any policy; finally, express \(v_c\) in terms of \(c\) and \(\gamma\).
  • For the first question, as the rest of the exercise makes clear, only the relative sizes matter, since actions are distinguished by the relative sizes of their returns. Using (3.8), define \(G^c_t\) as the return when \(c\) is added to every reward:

\[G^c_t = \sum^{\infty}_{k=0}\gamma^k (c + R_{t+k+1}) = G_t + c\times\frac{1}{1-\gamma} \]

So the constant is \(v_c=\frac{c}{1-\gamma}\); that is, every state's value is increased by the same constant \(v_c\).

3.16 Now consider adding a constant \(c\) to all the rewards in an episodic task, such as maze running. Would this have any effect, or would it leave the task unchanged as in the continuing task above? Why or why not? Give an example.

  • For the continuing case we just showed that every state value shifts by the same constant, independently of the policy. In the episodic case the added amount can differ, because episode lengths vary with the policy, and the length enters \(G^c_t\):

\[G^c_t = \sum^{T-t-1}_{k=0}\gamma^k (c + R_{t+k+1}) = G_t + c\times\frac{1-\gamma^{T-t}}{1-\gamma} \]

In this case, if the added constant \(c>0\), policies that produce longer episodes gain a larger increase in return than policies that produce shorter episodes. For example, in the maze task a large enough \(c>0\) can make wandering without escaping yield a higher return than escaping quickly, so the task is genuinely changed.

3.17
What is the Bellman equation for action values, that is, for \(q_{\pi}\)? It must give the action value \(q_\pi(s, a)\) in terms of the action values, \(q_\pi(s^\prime, a^\prime)\), of possible successors to the state–action pair \((s, a)\). Hint: The backup diagram to the right corresponds to this equation. Show the sequence of equations analogous to (3.14), but for action values.

  • Read it off the backup diagram: first \(p(s^\prime,r|s,a)\) transitions to \(s^\prime\) and yields \(r\); then \(\pi\) selects an action \(a^\prime\), giving \(G_{t+1}\), i.e. \(q_\pi(s^\prime,a^\prime)\):

\[\begin{array}{l} q_\pi(s,a) \overset{.}{=} E_\pi[G_t|S_t=s,A_t=a] \\ \qquad \quad \ = E_\pi[R_{t+1}+\gamma G_{t+1}|S_t=s,A_t=a] \\ \qquad \quad \ = \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma \sum_{a^\prime}\pi(a^\prime|s^\prime)E_\pi[G_{t+1}|S_{t+1}=s^\prime,A_{t+1}=a^\prime]] \\ \qquad \quad \ = \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma \sum_{a^\prime}\pi(a^\prime|s^\prime)q_\pi(s^\prime,a^\prime)] \end{array} \]

3.18 The value of a state depends on the values of the actions possible in that state and on how likely each action is to be taken under the current policy. We can think of this in terms of a small backup diagram rooted at the state and considering each possible action:

Give the equation corresponding to this intuition and diagram for the value at the root node, \(v_\pi(s)\), in terms of the value at the expected leaf node, \(q_\pi(s, a)\), given \(S_t = s\). This equation should include an expectation conditioned on following the policy, \(\pi\). Then give a second equation in which the expected value is written out explicitly in terms of \(\pi(a|s)\) such that no expected value notation appears in the equation.

  • The prompt spells this out: first write \(v_\pi(s)\) as an expectation of \(q_\pi(s, a)\), then expand the expectation in terms of \(\pi(a|s)\):

\[\begin{array}{l} v_\pi(s)=E_{a\sim\pi}[q_\pi(s,a)] \\ \qquad \ \ = \sum_a \pi(a|s)q_\pi(s,a) \end{array} \]

3.19
The value of an action, \(q_\pi(s, a)\), depends on the expected next reward and the expected sum of the remaining rewards. Again we can think of this in terms of a small backup diagram, this one rooted at an action (state–action pair) and branching to the possible next states:

Give the equation corresponding to this intuition and diagram for the action value, \(q_\pi(s, a)\), in terms of the expected next reward, \(R_{t+1}\), and the expected next state value, \(v_\pi(S_{t+1})\), given that \(S_t=s\) and \(A_t=a\). This equation should include an expectation but not one conditioned on following the policy. Then give a second equation, writing out the expected value explicitly in terms of \(p(s^\prime, r|s, a)\) defined by (3.2), such that no expected value notation appears in the equation.

  • Same as the previous exercise:

\[\begin{array}{l} q_\pi(s,a)=E[R_{t+1}+\gamma v_\pi(S_{t+1})|S_t=s,A_t=a] \\ \qquad \ \ = \sum_{s^\prime,r} p(s^\prime,r|s,a)[r+\gamma v_\pi(s^\prime)] \end{array} \]

  • Here the \(G_{t+1}\) inside the expectation has been replaced by \(v_\pi\), so the quantity inside no longer depends on \(\pi\); the outer expectation is therefore only over \(s^\prime\) and \(r\), with no expectation over \(\pi\).

3.20
Draw or describe the optimal state-value function for the golf example.

  • The main task here is understanding the golf example (never having played golf, it took me a while). There are two actions, putter and driver. The putter is a short, accurate stroke; the driver is a long but less accurate swing. The goal is to get the ball into the hole. The optimal policy is to use the driver outside the green and the putter on the green. So the optimal state-value function matches the \(q_\star(s,driver)\) part of Figure 3.3 outside the green and the \(v_{putt}\) part on the green.

3.21 Draw or describe the contours of the optimal action-value function for putting, \(q_\star(s, putter)\), for the golf example.

  • Figure 3.3 shows \(q_\star(s,driver)\): take the driver first and then act optimally. Here we want \(q_\star(s, putter)\): take the putter first and then act optimally. Compare the contours of \(v_{putt}\) and \(q_\star(s,driver)\), referring to locations by the \(v_{putt}\) contour labels. From the \(-6\) contour, a putt only reaches the \(-5\) region; two drivers and one more putt are still needed, so the value at \(-6\) is \(-4\). From the \(-5\) contour, a putt reaches the \(-4\) region, then one driver reaches the green and one putt holes out, so the value at \(-5\) is \(-3\). The rest follow similarly. The sand needs care: a putt from the sand stays in the sand, then one driver reaches the green and one putt holes out, so the value is \(-3\). The resulting mapping from location to value:

| state | \(q_\star(s,putter)\) |
|---|---|
| \(-6\) | \(-4\) |
| \(-5\) | \(-3\) |
| \(-4\) | \(-3\) |
| \(-3\) | \(-2\) |
| \(-2\) | \(-2\) |
| green | \(-1\) |
| sand | \(-3\) |

3.22 Consider the continuing MDP shown on to the right. The only decision to be made is that in the top state, where two actions are available, left and right. The numbers show the rewards that are received deterministically after each action. There are exactly two deterministic policies, left and right. What policy is optimal if \(\gamma = 0\)? If \(\gamma = 0.9\)? If \(\gamma = 0.5\)?

  • This exercise sets up a rather unusual MDP: a decision is made only in the top state; in the two lower states there is nothing to decide, but the rewards received there still count toward the return. So

\[v_{\pi_{left}}=1+\gamma\times0+\gamma^2\times 1+\gamma^3\times 0 +\gamma^4\times1 \cdots=\sum_{k=0}^\infty\gamma^{2k} \]

\[v_{\pi_{right}}=0+\gamma\times2+\gamma^2\times0+\gamma^3\times 2 +\gamma^4\times0 \cdots=2\sum_{k=0}^\infty\gamma^{2k+1} \]

Therefore (a numerical check follows below):

\[\begin{array}{l} \gamma = 0 \rightarrow v_{\pi_{left}}=1 > v_{\pi_{right}}=0 \\ \gamma = 0.9 \rightarrow v_{\pi_{left}}\approx5.26 < v_{\pi_{right}}\approx 9.47 \\ \gamma = 0.5 \rightarrow v_{\pi_{left}}\approx1.33 = v_{\pi_{right}}\approx 1.33 \end{array} \]
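
A quick numerical check of the two geometric series for the three discount factors (a small sketch; the series are simply truncated at a large number of terms):

```python
# v_left and v_right for the two deterministic policies of Exercise 3.22,
# approximated by truncating the geometric series at a large number of terms.
def v_left(gamma, n_terms=1000):
    # reward sequence 1, 0, 1, 0, ...  ->  sum of gamma^(2k)
    return sum(gamma ** (2 * k) for k in range(n_terms))

def v_right(gamma, n_terms=1000):
    # reward sequence 0, 2, 0, 2, ...  ->  2 * sum of gamma^(2k+1)
    return sum(2 * gamma ** (2 * k + 1) for k in range(n_terms))

for gamma in (0.0, 0.9, 0.5):
    print(gamma, round(v_left(gamma), 3), round(v_right(gamma), 3))
# 0.0 1.0 0.0
# 0.9 5.263 9.474
# 0.5 1.333 1.333
```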

3.23
Give the Bellman equation for \(q_*\) for the recycling robot.

  • Write the Bellman optimality equation for \(q_*\), equation (3.20):

\[q_*(s,a) = \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma\max_{a^\prime}q_*(s^\prime,a^\prime)] \]

Then use the table from Exercise 3.4:

| \(s\) | \(a\) | \(s^\prime\) | \(r\) | \(p(s^\prime,r\|s,a)\) |
|---|---|---|---|---|
| high | search | high | \(r_{search}\) | \(\alpha\) |
| high | search | low | \(r_{search}\) | \(1-\alpha\) |
| high | wait | high | \(r_{wait}\) | \(1\) |
| low | wait | low | \(r_{wait}\) | \(1\) |
| low | recharge | high | \(0\) | \(1\) |
| low | search | high | \(-3\) | \(1-\beta\) |
| low | search | low | \(r_{search}\) | \(\beta\) |

Substituting into the equation gives the system:

\[\begin{array}{l} q_*(\text{high},\text{search}) = p(\text{high},r_{\text{search}}|\text{high},\text{search})[r_{\text{search}}+\gamma\max_{a^\prime}q_*(\text{high},a^\prime)] + \\ \qquad \qquad \qquad \qquad \ \ p(\text{low},r_{\text{search}}|\text{high},\text{search})[r_{\text{search}}+\gamma\max_{a^\prime}q_*(\text{low},a^\prime)] \\ q_*(\text{high},\text{wait}) = p(\text{high},r_{\text{wait}}|\text{high},\text{wait})[r_{\text{wait}}+\gamma\max_{a^\prime}q_*(\text{high},a^\prime)] \\ q_*(\text{low},\text{wait}) = p(\text{low},r_{\text{wait}}|low,\text{wait})[r_{\text{wait}}+\gamma\max_{a^\prime}q_*(\text{low},a^\prime)] \\ q_*(\text{low},\text{recharge}) = p(\text{high},0|\text{low},\text{recharge})[0+\gamma\max_{a^\prime}q_*(\text{high},a^\prime)] \\ q_*(\text{low},\text{search}) = p(\text{high},-3|\text{low},\text{search})[-3 + \gamma\max_{a^\prime}q_*(\text{high},a^\prime)] + \\ \qquad \qquad \qquad \qquad p(\text{low},r_{\text{search}}|\text{low},\text{search})[r_{\text{search}}+\gamma\max_{a^\prime}q_*(\text{low},a^\prime)] \end{array} \]

Simplifying (a numerical solution sketch follows after the system):

\[\begin{array}{l} q_*(\text{high},\text{search}) = \alpha[r_{\text{search}}+\gamma\max_{a^\prime}q_*(\text{high},a^\prime)] + (1-\alpha)[r_{\text{search}}+\gamma\max_{a^\prime}q_*(\text{low},a^\prime)] \\ q_*(\text{high},\text{wait}) = r_{\text{wait}}+\gamma\max_{a^\prime}q_*(\text{high},a^\prime) \\ q_*(\text{low},\text{wait}) = r_{\text{wait}}+\gamma\max_{a^\prime}q_*(\text{low},a^\prime) \\ q_*(\text{low},\text{recharge}) = \gamma\max_{a^\prime}q_*(\text{high},a^\prime) \\ q_*(\text{low},\text{search}) = (1-\beta)[-3 + \gamma\max_{a^\prime}q_*(\text{high},a^\prime)] + \beta[r_{\text{search}}+\gamma\max_{a^\prime}q_*(\text{low},a^\prime)] \end{array} \]
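
The system can also be solved numerically by repeatedly applying the Bellman optimality backup to \(q\) until it stops changing. A minimal sketch; the values of \(\alpha\), \(\beta\), \(\gamma\), \(r_{search}\), \(r_{wait}\) are placeholders chosen only for illustration:

```python
# Solve the q* system for the recycling robot by iterating the Bellman optimality backup.
# alpha, beta, gamma, r_search, r_wait are placeholder values chosen only for illustration.
alpha, beta, gamma = 0.8, 0.6, 0.9
r_search, r_wait = 1.0, 0.5

# p(s', r | s, a) as lists of (s', r, probability), same table as Exercise 3.4.
p = {
    ("high", "search"):  [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
    ("high", "wait"):    [("high", r_wait, 1.0)],
    ("low", "wait"):     [("low", r_wait, 1.0)],
    ("low", "recharge"): [("high", 0.0, 1.0)],
    ("low", "search"):   [("high", -3.0, 1 - beta), ("low", r_search, beta)],
}
actions = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

q = {sa: 0.0 for sa in p}
for _ in range(1000):
    q = {(s, a): sum(prob * (r + gamma * max(q[(s2, a2)] for a2 in actions[s2]))
                     for (s2, r, prob) in p[(s, a)])
         for (s, a) in p}

for sa, value in sorted(q.items()):
    print(sa, round(value, 3))
```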

3.24
Figure 3.5 gives the optimal value of the best state of the gridworld as 24.4, to one decimal place. Use your knowledge of the optimal policy and (3.8) to express this value symbolically, and then to compute it to three decimal places.

  • The value 24.4 belongs to state \(\text{A}\). The optimal policy there is to take any action, receive reward 10, and jump to \(\text{A}^\prime\); then walk back up to \(\text{A}\), collecting reward 0 along the way, and repeat. Writing this out with (3.8) (a quick numeric check follows the derivation):

\[\begin{array}{l} G_t \overset{.}{=} R_{t+1} + \gamma R_{t+2} + \gamma^2R_{t+3} + \cdots \\ \quad \ \ = 10 + \gamma^5\times10 + \gamma^{10}\times10 + \cdots \\ \quad \ \ = \frac{10}{1-0.9^5} \approx 24.419 \end{array} \]
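
Checking the closed form against the truncated series (a one-off sketch):

```python
# Value of the best state A in Figure 3.5: +10 every 5 steps, discounted.
gamma = 0.9
closed_form = 10 / (1 - gamma ** 5)
partial_sum = sum(10 * gamma ** (5 * k) for k in range(1000))
print(round(closed_form, 3), round(partial_sum, 3))   # 24.419 24.419
```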

3.25
Give an equation for \(v_*\) in terms of \(q_*\).

  • Isn't this exactly what equation (3.19) already gives? Or am I misreading the question?

\[v_*(s)=\max_{a \in \mathcal{A}(s)}q_{\pi_*}(s,a) \]

3.26
Give an equation for \(q_*\) in terms of \(v_*\) and the four-argument \(p\).

  • Continuing from equation (3.20):

\[q_*(s,a) = \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma\max_{a^\prime}q_*(s^\prime,a^\prime)] = \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma v_*(s^\prime)] \]

3.27 Give an equation for \(\pi_*\) in terms of \(q_*\).

  • Isn't this just the argmax, or am I missing the point?

\[\pi_*(a|s)= \left\{\begin{array}{ll} 1, & \text{if } a=\arg\max\limits_{a^\prime \in \mathcal{A}(s)}q_*(s,a^\prime)\\ 0, & \text{otherwise} \end{array}\right. \]

3.28 Give an equation for \(\pi_*\) in terms of \(v_*\) and the four-argument \(p\).

  • From Exercise 3.26:

\[q_*(s,a) = \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma v_*(s^\prime)] \]

  • Substituting this directly into the answer to Exercise 3.27:

\[\pi_*(a|s)= \left\{\begin{array}{ll} 1, & \text{if } a=\arg\max\limits_{a^\prime \in \mathcal{A}(s)}\sum_{s^\prime,r}p(s^\prime,r|s,a^\prime)[r+\gamma v_*(s^\prime)]\\ 0, & \text{otherwise} \end{array}\right. \]

3.29 Rewrite the four Bellman equations for the four value functions \((v_\pi, v_*, q_\pi, \text{and } q_*)\) in terms of the three-argument function \(p\) (3.4) and the two-argument function \(r\) (3.5).

  • First write out the two definitions:

\[p(s^\prime|s,a) = \sum_{r \in \mathcal{R}}p(s^\prime,r|s,a) \]

\[r(s,a) = \sum_{r \in \mathcal{R}}r\sum_{s^\prime \in \mathcal{S}} p(s^\prime,r|s,a) \]

  • Then substitute them in turn:

\[\begin{array}{l} v_\pi(s) = \sum_a \pi(a|s)\sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma v_\pi(s^\prime)] \\ \qquad \quad \ =\sum_a \pi(a|s)[\sum_{r}\sum_{s^\prime}p(s^\prime,r|s,a)r+\sum_{s^\prime}\sum_{r}p(s^\prime,r|s,a)\gamma v_\pi(s^\prime)] \\ \qquad \quad \ =\sum_a \pi(a|s)[r(s,a)+\gamma\sum_{s^\prime}p(s^\prime|s,a) v_\pi(s^\prime)] \end{array} \]

\[\begin{array}{l} v_*(s) = \max_{a}\sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma v_*(s^\prime)] \\ \qquad \ \ =\max_{a}[\sum_{r}\sum_{s^\prime}p(s^\prime,r|s,a)r+\sum_{s^\prime}\sum_{r}p(s^\prime,r|s,a)\gamma v_*(s^\prime)] \\ \qquad \ \ =\max_a[r(s,a)+\gamma\sum_{s^\prime}p(s^\prime|s,a) v_*(s^\prime)] \end{array} \]

\[\begin{array}{l} q_\pi(s,a) = \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma \sum_{a^\prime}\pi(a^\prime|s^\prime)q_\pi(s^\prime,a^\prime)] \\ \qquad \quad \ \ = \sum_{r}\sum_{s^\prime}p(s^\prime,r|s,a)r+\sum_{s^\prime}\sum_{r}p(s^\prime,r|s,a)\gamma \sum_{a^\prime}\pi(a^\prime|s^\prime)q_\pi(s^\prime,a^\prime) \\ \qquad \quad \ \ = r(s,a)+\gamma\sum_{s^\prime}p(s^\prime|s,a) \sum_{a^\prime}\pi(a^\prime|s^\prime)q_\pi(s^\prime,a^\prime) \end{array} \]

\[\begin{array}{l} q_*(s,a) = \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma\max_{a^\prime}q_*(s^\prime,a^\prime)] \\ \qquad \quad \ = \sum_{r}\sum_{s^\prime}p(s^\prime,r|s,a)r+\sum_{s^\prime}\sum_{r}p(s^\prime,r|s,a)\gamma\max_{a^\prime}q_*(s^\prime,a^\prime) \\ \qquad \quad \ = r(s,a)+\gamma\sum_{s^\prime}p(s^\prime|s,a)\max_{a^\prime}q_*(s^\prime,a^\prime) \end{array} \]


Chapter 4

4.1
In Example 4.1, if \(\pi\) is the equiprobable random policy, what is \(q_\pi(11, \text{down})\)? What is \(q_\pi(7, \text{down})\)?

  • Figure 4.1 gives \(v_\pi\), so compute \(q\) directly: \(q_\pi(11, \text{down}) = E_{s^\prime,r}[r+\gamma v_\pi(s^\prime)]=-1+0=-1\), and \(q_\pi(7, \text{down}) = -1-14=-15\).

4.2
In Example 4.1, suppose a new state 15 is added to the gridworld just below state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Assume that the transitions from the original states are unchanged. What, then, is \(v_\pi(15)\) for the equiprobable random policy? Now suppose the dynamics of state 13 are also changed, such that action down from state 13 takes the agent to the new state 15. What is \(v_\pi(15)\) for the equiprobable random policy in this case?

  • If the other state values are unchanged, just substitute into the Bellman equation:

\[\begin{array}{l} v_\pi(15)=\sum_a\pi(a|s)\sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma v_\pi(s^\prime)] \\ \qquad \quad = \frac{1}{4}(-1-22)+\frac{1}{4}(-1-20)+\frac{1}{4}(-1-14)+\frac{1}{4}(-1+v_\pi(15)) \\ \\ \Rightarrow v_\pi(15) = -20 \end{array} \]

  • If state 13 also changes, we get a system of two equations (solved numerically in the sketch below):

\[\begin{array}{l} \Rightarrow \left\{\begin{array}{l} v_\pi(15) = \frac{1}{4}(-1-22)+\frac{1}{4}(-1+v_\pi(13))+\frac{1}{4}(-1-14)+\frac{1}{4}(-1+v_\pi(15)) \\ \\ v_\pi(13)= \frac{1}{4}(-1-22)+\frac{1}{4}(-1-20)+\frac{1}{4}(-1-14)+\frac{1}{4}(-1+v_\pi(15)) \end{array}\right. \\ \\ \Rightarrow \left\{\begin{array}{l} v_\pi(15) = \frac{1}{4}(-39+v_\pi(13)+v_\pi(15)) \\ \\ v_\pi(13)= \frac{1}{4}(-60+v_\pi(15)) \end{array}\right. \\ \\ \Rightarrow \left\{\begin{array}{l} v_\pi(15) \approx -19.6 \\ \\ v_\pi(13) \approx -19.9 \end{array}\right. \end{array} \]
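
The 2x2 linear system can be double-checked with a solver; a sketch using numpy:

```python
import numpy as np

# Exercise 4.2, second case:
#   v15 = (1/4) * (-39 + v13 + v15)   ->   3*v15 -   v13 = -39
#   v13 = (1/4) * (-60 + v15)         ->    -v15 + 4*v13 = -60
A = np.array([[3.0, -1.0],
              [-1.0, 4.0]])
b = np.array([-39.0, -60.0])
v15, v13 = np.linalg.solve(A, b)
print(round(v15, 2), round(v13, 2))   # -19.64 -19.91
```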

4.3
What are the equations analogous to (4.3), (4.4), and (4.5) for the action-value function \(q_\pi\) and its successive approximation by a sequence of functions \(q_0, q_1, q_2,\cdots\)?

  • Following the same pattern as before:

\[\begin{array}{l} q_\pi(s,a) = E_\pi[R_{t+1}+\gamma G_{t+1}|S_t=s,A_t=a] \\ \qquad \quad \ \ = \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma\sum_{a^\prime}\pi(a^\prime|s^\prime) q_\pi(s^\prime,a^\prime)] \\ \Rightarrow q_{k+1}(s,a)=\sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma\sum_{a^\prime}\pi(a^\prime|s^\prime) q_k(s^\prime,a^\prime)] \end{array} \]

4.4 The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is ok for pedagogy, but not for actual use. Modify the pseudocode so that convergence is guaranteed.

  • The point is that if two or more policies are equally good, the policy-stability test can flip between them forever and the loop never terminates. One fix is to change the termination test to compare successive policy evaluations: if the value estimates no longer change from one iteration to the next, terminate (a sketch of this fix follows below).
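
One way to express that fix in code: keep the evaluation/improvement loop, but terminate when the greedy value function stops improving rather than when the policy stops changing, so ties between equally good policies cannot cause an endless flip-flop. A minimal sketch on a generic finite MDP; the `P[s][a]` interface (a list of `(prob, next_state, reward)` triples) is a hypothetical convention, not the book's notation:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Policy iteration that terminates even when several policies are equally good.

    P[s][a] is a list of (prob, next_state, reward) triples (hypothetical interface).
    Instead of testing whether the policy changed, we test whether the greedy value
    function changed, so ties between equally good policies cannot cause a loop.
    """
    V = np.zeros(n_states)
    pi = np.zeros(n_states, dtype=int)

    def backup(s, a, values):
        return sum(prob * (r + gamma * values[s2]) for prob, s2, r in P[s][a])

    while True:
        # Policy evaluation of the current deterministic policy pi (in-place sweeps).
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = backup(s, pi[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement; stop when the greedy values no longer improve.
        V_old = V.copy()
        for s in range(n_states):
            qs = [backup(s, a, V_old) for a in range(n_actions)]
            pi[s] = int(np.argmax(qs))
            V[s] = max(qs)
        if np.max(np.abs(V - V_old)) < theta:
            return V, pi
```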

4.5
How would policy iteration be defined for action values? Give a complete algorithm for computing \(q_*\), analogous to that on page 80 for computing \(v_*\). Please pay special attention to this exercise, because the ideas involved will be used throughout the rest of the book.

  • The idea is the same, still based on \(q_\pi(s,\pi^\prime(s)) \geq v_\pi(s)\); the only difference is that we maintain \(q\) instead of \(v\). Since \(v_\pi(s)=\sum_a\pi(a|s)q_\pi(s,a)\), the computation is simply unrolled over actions.

  • In policy evaluation, update \(q\):

\[q_{k+1}(s,a) \leftarrow \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma\sum_{a^\prime}\pi(a^\prime|s^\prime) q_k(s^\prime,a^\prime)] \]

  • In policy improvement, the greedy policy can be read directly off the evaluated action values (a complete sketch of the algorithm follows below):

\[\pi(s) \leftarrow \arg\max_a q(s,a) \]
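
Putting the two steps together, a minimal sketch of policy iteration that maintains \(q\) instead of \(v\), using the same hypothetical `P[s][a]` interface as the Exercise 4.4 sketch and, for simplicity, the textbook-style policy-stability test:

```python
import numpy as np

def q_policy_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    """Policy iteration maintaining action values q(s, a) instead of v(s).

    P[s][a] is a list of (prob, next_state, reward) triples (hypothetical interface).
    Uses the textbook-style policy-stability test; see Exercise 4.4 for its caveat.
    """
    Q = np.zeros((n_states, n_actions))
    pi = np.zeros(n_states, dtype=int)

    while True:
        # 2. Policy evaluation: q(s, a) <- sum_{s', r} p * (r + gamma * q(s', pi(s')))
        while True:
            delta = 0.0
            for s in range(n_states):
                for a in range(n_actions):
                    q_new = sum(prob * (r + gamma * Q[s2, pi[s2]])
                                for prob, s2, r in P[s][a])
                    delta = max(delta, abs(q_new - Q[s, a]))
                    Q[s, a] = q_new
            if delta < theta:
                break
        # 3. Policy improvement: greedy with respect to the evaluated action values.
        pi_old = pi.copy()
        pi = Q.argmax(axis=1)
        if np.array_equal(pi, pi_old):
            return Q, pi
```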

4.6
Suppose you are restricted to considering only policies that are \(\epsilon\)-soft, meaning that the probability of selecting each action in each state, \(s\), is at least \(\epsilon/|\mathcal{A}(s)|\). Describe qualitatively the changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration algorithm for \(v_*\) on page 80.

  • Step 3 needs the action selection modified:

\[\pi(s) = \left\{\begin{array}{ll} \arg\max_a\sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma V(s^\prime)] & \text{with probability } 1-\epsilon \\ \text{a random action} & \text{with probability } \epsilon \end{array}\right. \]

  • Step 2 must take the action probabilities induced by \(\epsilon\) into account when updating:

\[V(s) \leftarrow \sum_a\pi(a|s)\sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma V(s^\prime)] \]

  • Step 1 is unchanged, apart from also initializing \(\epsilon\).

4.7 (programming) Write a program for policy iteration and re-solve Jack’s car rental problem with the following changes. One of Jack’s employees at the first location rides a bus home each night and lives near the second location. She is happy to shuttle one car to the second location for free. Each additional car still costs $2, as do all cars moved in the other direction. In addition, Jack has limited parking space at each location. If more than 10 cars are kept overnight at a location (after any moving of cars), then an additional cost of $4 must be incurred to use a second parking lot (independent of how many cars are kept there). These sorts of nonlinearities and arbitrary dynamics often occur in real problems and cannot easily be handled by optimization methods other than dynamic programming. To check your program, first replicate the results given for the original problem.

4.8
Why does the optimal policy for the gambler’s problem have such a curious form? In particular, for capital of 50 it bets it all on one flip, but for capital of 51 it does not. Why is this a good policy?

  • This result does look odd; without running the code it would be hard to guess. Some intuition: consider capital 50. To win, the gambler must eventually reach 100, and whatever path is taken, the final flip succeeds with probability \(p_h\). If the gambler stakes all 50, that single flip decides the game, win or lose, so the value of this action is exactly \(p_h=0.4\). For any smaller stake, the gambler cannot reach 100 in one flip and must win several flips before the final one; the probability of that whole sequence is a product \(p_1p_2...p_tp_h \leq p_h\), so no other strategy from 50 does better than staking everything, and betting 50 is optimal there. As for why the bet suddenly drops to 1 at capital 51: one intuition is that staking 1 and losing returns the gambler to 50, the situation just analyzed, while winning moves to 52, from which there is additional probability of winning; so the value at 51 is at least as large as at 50. Whether this particular stake is optimal is harder to see; staking 49 also seems to do at least as well as staking 50, and which of the two is better is not obvious (they may be equally good). A value-iteration check of the values at 50 and 51 follows below.
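
The claims about capitals 50 and 51 can be checked by running value iteration on the gambler's problem (this overlaps with the programming Exercise 4.9). A small sketch with \(p_h = 0.4\):

```python
# Value iteration for the gambler's problem (Example 4.3), checking v(50) and v(51).
p_h = 0.4
V = [0.0] * 101
V[100] = 1.0          # dummy terminal state: reaching 100 counts as a win

for _ in range(10000):
    delta = 0.0
    for s in range(1, 100):
        best = 0.0
        for stake in range(1, min(s, 100 - s) + 1):
            value = p_h * V[s + stake] + (1 - p_h) * V[s - stake]
            best = max(best, value)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < 1e-12:
        break

print(round(V[50], 4), round(V[51], 4))   # v(50) = 0.4; v(51) is slightly larger
```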

4.9
(programming) Implement value iteration for the gambler’s problem and solve it for \(p_h = 0.25\) and \(p_h = 0.55\). In programming, you may find it convenient to introduce two dummy states corresponding to termination with capital of 0 and 100, giving them values of 0 and 1 respectively. Show your results graphically, as in Figure 4.3. Are your results stable as \(\theta \rightarrow 0\)?

4.10
What is the analog of the value iteration update (4.10) for action values, \(q_{k+1}(s, a)\)?

  • Following Exercise 4.5, take \(q_{k+1}(s,a) \leftarrow \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma\sum_{a^\prime}\pi(a^\prime|s^\prime) q_k(s^\prime,a^\prime)]\) and replace the expectation over \(\pi\) with a \(\max\):

\[q_{k+1}(s,a) \leftarrow \sum_{s^\prime,r}p(s^\prime,r|s,a)[r+\gamma \max_{a^\prime} q_k(s^\prime,a^\prime)] \]

