Reinforcement Learning Reading Notes - 00 - Terminology and Mathematical Notation

Study notes on:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto © 2014, 2015, 2016

Basic Concepts

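Agent - the learner and decision maker.
Environment - everything outside the agent.
\(s\) - state. A piece of data that describes the environment.
\(S, \mathcal{S}\) - the set of all states, i.e. every possible state of the environment.
\(a\) - action. Something the agent can do.
\(A, \mathcal{A}\) - the set of all actions the agent can take.
\(A(s), \mathcal{A}(s)\) - the action set of state \(s\): all actions available to the agent in state \(s\).
\(r\) - reward. What the agent receives after taking an action.
\(\mathcal{R}\) - the set of all rewards the agent can receive.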

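\(S_t\) - the state at step t. \(t\) starts from 0.
\(A_t\) - the action selected at step t. \(t\) starts from 0.
\(R_t\) - the reward at step t. \(t\) starts from 1.
\(G_t\) - the long-term return from step t. \(t\) starts from 0. Goal 1 of reinforcement learning: maximize the return.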

\[G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\ \text{where} \\ k \text{ - the sequence number of an action,} \\ \gamma \text{ - discount rate,} \ 0 \leqslant \gamma \leqslant 1 \]

As the formula shows, when \(\gamma=0\) only the immediate reward is considered; when \(\gamma=1\) future rewards are not discounted at all.
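
The effect of \(\gamma\) can be checked numerically. The following is a minimal sketch, assuming a finite list of rewards (so the infinite sum is truncated); the function name `discounted_return` is made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_k gamma**k * R_{t+k+1} for a finite reward list.

    `rewards` holds R_{t+1}, R_{t+2}, ... in order.
    """
    g = 0.0
    # Work backwards using the recursion G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 2.0, 3.0]
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 1.0))  # plain, undiscounted sum
print(discounted_return(rewards, 0.5))  # something in between
```

The backward loop avoids computing powers of \(\gamma\) explicitly.
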
\(G_t^{(n)}\) - the n-step return from step t. An approximation of the return.

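\[G_t^{(n)} \doteq \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n \hat{v}(S_{t+n}, \theta_{t+n-1}) \\ \text{where} \\ k \text{ - the sequence number of an action,} \\ \gamma \text{ - discount rate,} \ 0 \leqslant \gamma \leqslant 1, \\ \hat{v} \text{ - approximate state value function (defined in the approximation section)} \]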
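
A small sketch of the n-step return, assuming the form used in the forward-view example later in these notes: n discounted rewards followed by a bootstrap from the estimated value of \(S_{t+n}\). The list layout (`rewards[i]` holds \(R_{i+1}\), `values[i]` holds the estimate for \(S_i\)) is an assumption made for this example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return from time t.

    rewards[i] is R_{i+1}; values[i] is the estimated value of S_i.
    Requires t + n <= len(rewards) and t + n < len(values).
    """
    g = 0.0
    # Sum of the first n discounted rewards: R_{t+1} ... R_{t+n}.
    for k in range(n):
        g += gamma ** k * rewards[t + k]
    # Bootstrap from the estimated value of the state reached after n steps.
    return g + gamma ** n * values[t + n]


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 10.0, 0.0]
print(n_step_return(rewards, values, t=0, n=2, gamma=0.5))
```
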

\(G_t^{\lambda}\) - the \(\lambda\)-return from step t. An approximation of the return; it can be seen as a refinement of \(G_t^{(n)}\).

\[\text{Continuing tasks: } \\ G_t^{\lambda} \doteq (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}G_t^{(n)} \\ \text{Episodic tasks: } \\ G_t^{\lambda} \doteq (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1}G_t^{(n)} + \lambda^{T-t-1}G_t \\ \text{where} \\ \lambda \in [0, 1] \\ (1 - \lambda) \sum_{n=1}^{\infty}\lambda^{n-1} = 1 \quad (\lambda < 1) \\ (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} + \lambda^{T-t-1} = 1 \\ \text{if } \lambda = 0, \text{ it reduces to the 1-step TD algorithm} \\ \text{if } \lambda = 1, \text{ it reduces to the Monte Carlo algorithm} \]
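
The episodic formula and its two limiting cases can be checked with a short sketch. It assumes each n-step return bootstraps from an estimated state value (as in the forward-view example later in these notes), and the list layout (`rewards[i]` holds \(R_{i+1}\), `values[i]` holds the estimate for \(S_i\)) is an assumption made for this example.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n discounted rewards plus a bootstrap from the value of S_{t+n}."""
    g = sum(gamma ** k * rewards[t + k] for k in range(n))
    return g + gamma ** n * values[t + n]


def lambda_return(rewards, values, t, lam, gamma):
    """Episodic lambda-return; the episode ends at T = len(rewards)."""
    T = len(rewards)
    # Full (Monte Carlo) return G_t up to the end of the episode.
    full = sum(gamma ** k * rewards[t + k] for k in range(T - t))
    # Weighted sum of the n-step returns for n = 1 .. T-t-1.
    total = 0.0
    for n in range(1, T - t):
        total += lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # The remaining weight lam**(T-t-1) goes to the full return.
    return (1 - lam) * total + lam ** (T - t - 1) * full


rewards = [1.0, 2.0, 3.0]
values = [0.0, 4.0, 1.0, 0.0]
print(lambda_return(rewards, values, 0, 0.0, 1.0))  # equals the 1-step return
print(lambda_return(rewards, values, 0, 1.0, 1.0))  # equals the Monte Carlo return
print(lambda_return(rewards, values, 0, 0.5, 1.0))
```
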

Policy

\(\pi\) - policy. Goal 2 of reinforcement learning: find the optimal policy.
A policy specifies which action \(a\) to choose in state \(s\).

\[\pi = [\pi(s_1), \cdots, \pi(s_n)] \]

\(\pi(s)\) - the action chosen by policy \(\pi\) in state \(s\).
\(\pi_*\) - the optimal policy.
\(\pi(a | s)\) - the probability that the stochastic policy \(\pi\) chooses action \(a\) in state \(s\).

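\(r(s, a)\) - the reward for choosing action \(a\) in state \(s\).
\(r(s, a, s')\) - the reward for choosing action \(a\) in state \(s\) and moving to state \(s'\).
\(p(s', r | s, a)\) - the probability of moving to state \(s'\) with reward \(r\), given state \(s\) and action \(a\).
\(p(s' | s, a)\) - the probability of moving to state \(s'\), given state \(s\) and action \(a\).
\(v_{\pi}(s)\) - state value. The expected long-term return \(G_t\) from state \(s\) when following policy \(\pi\).
\(q_{\pi}(s, a)\) - action value. The expected long-term return \(G_t\) from state \(s\) after taking action \(a\) and then following policy \(\pi\).
\(v_{*}(s)\) - optimal state value.
\(q_{*}(s, a)\) - optimal action value.
\(V(s)\) - the collection (table) of estimated \(v_{\pi}(s)\) values.
\(Q(s, a)\) - the collection (table) of estimated \(q_{\pi}(s, a)\) values.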

\[\text{For continuing tasks: } \\ G_t \doteq \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\ \text{For episodic tasks: } \\ G_t \doteq \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1} \\ v_{\pi}(s) \doteq \mathbb{E}_{\pi} [G_t | S_t=s] = \mathbb{E}_{\pi} \left [ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}|S_t = s \right ] \\ q_{\pi}(s,a) \doteq \mathbb{E}_{\pi} [G_t | S_t=s,A_t=a] = \mathbb{E}_{\pi} \left [ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}|S_t = s, A_t=a \right ] \\ v_{*}(s) = \max_{a \in \mathcal{A}} q_{*}(s,a) \\ \pi(s) = \underset{a}{\operatorname{argmax}} \ v_{\pi}(s' | s, a) \\ \pi(s) \text{ is the action that leads to the next state with the highest value.} \\ \pi(s) = \underset{a}{\operatorname{argmax}} \ q_{\pi}(s, a) \\ \pi(s) \text{ is the action with the highest action value in the current state.} \\ \]

As the formulas above show, \(\pi(s)\) can be determined from either \(v_{\pi}(s)\) or \(q_{\pi}(s,a)\).
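
Both ways of deriving \(\pi(s)\) can be sketched on small tables. The dictionary layouts and the deterministic transition table `next_state` are assumptions made for this example; with stochastic transitions the state-value version would need an expectation over \(s'\).

```python
def greedy_from_q(q):
    """pi(s) = argmax_a q(s, a); q maps state -> {action: value}."""
    return {s: max(a_vals, key=a_vals.get) for s, a_vals in q.items()}


def greedy_from_v(v, next_state):
    """pi(s) = the action whose successor state has the highest value.

    next_state[s][a] is the (assumed deterministic) state reached from s by a.
    """
    return {s: max(succ, key=lambda a: v[succ[a]])
            for s, succ in next_state.items()}


q = {"s1": {"left": 0.2, "right": 0.7},
     "s2": {"left": 1.0, "right": -1.0}}
print(greedy_from_q(q))

v = {"s1": 0.0, "s2": 1.0, "s3": 5.0}
next_state = {"s1": {"left": "s2", "right": "s3"}}
print(greedy_from_v(v, next_state))
```

Note that the action-value version needs no model of the environment, while the state-value version needs to know where each action leads.
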

\[\text{Reinforcement Learning} \doteq \pi_* \\ \quad \updownarrow \\ \pi_* \doteq \{ \pi(s) \}, \ s \in \mathcal{S} \\ \quad \updownarrow \\ \begin{cases} \pi(s) = \underset{a}{\operatorname{argmax}} \ v_{\pi}(s' | s, a), \ s' \in S(s), \quad \text{or} \\ \pi(s) = \underset{a}{\operatorname{argmax}} \ q_{\pi}(s, a) \\ \end{cases} \\ \quad \updownarrow \\ \begin{cases} v_*(s), \quad \text{or} \\ q_*(s, a) \\ \end{cases} \\ \quad \updownarrow \\ \text{approximation cases:} \\ \begin{cases} \hat{v}(s, \theta) \doteq \theta^T \phi(s), \quad \text{state value function} \\ \hat{q}(s, a, \theta) \doteq \theta^T \phi(s, a), \quad \text{action value function} \\ \end{cases} \\ \text{where} \\ \theta \text{ - value function's weight vector} \\ \]

Goal 3 of reinforcement learning: find the optimal value function \(v_*(s)\) or \(q_*(s,a)\).

Approximation

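Goal 4 of reinforcement learning: find the optimal approximate value function \(\hat{v}(S_t, \theta_t)\) or \(\hat{q}(S_t, A_t, \theta_t)\).
Goal 5 of reinforcement learning: find a method for solving for \(\theta\).
\(\rho_t^k\) - the importance sampling ratio from time t to time k - 1.
\(\mathcal{J}(s)\) - the set of step indices at which state \(s\) is visited.
\(\theta\) - the weight vector of the approximate value function.
\(\phi(s)\) - the feature function of the approximate value function. It maps state \(s\) to a feature vector, which together with \(\theta\) forms the approximate value function.
\(\hat{v}(S_t, \theta_t)\) - the approximate state value function.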

\[\hat{v} \doteq \theta^T \phi(s) \]

\(\hat{q}(S_t, A_t, \theta_t)\) - the approximate action value function.

\[\hat{q} \doteq \theta^T \phi(s,a) \]
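
A minimal sketch of the linear form \(\theta^T \phi(s)\). The polynomial feature function below is purely hypothetical; the notes do not prescribe any particular \(\phi\).

```python
def features(state):
    """Hypothetical feature function phi(s) for a scalar state: [1, s, s^2]."""
    return [1.0, state, state ** 2]


def v_hat(state, theta):
    """Approximate state value: the dot product theta^T phi(s)."""
    return sum(w * x for w, x in zip(theta, features(state)))


theta = [0.5, 2.0, -1.0]
print(v_hat(3.0, theta))  # 0.5*1 + 2.0*3 + (-1.0)*9
```

An approximate action value function follows the same pattern with a feature function that takes both the state and the action.
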

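\(e_t\) - the eligibility trace vector at step t. It can be understood as a refined (accumulated and decayed) version of the gradient of the approximate value function.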

\[e_0 \doteq 0 \\ e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma \lambda e_{t-1} \\ \theta_{t+1} \doteq \theta_t + \alpha \delta_t e_t \]
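
One update step of this scheme can be sketched for a linear \(\hat{v}\), where the gradient \(\nabla \hat{v}(S_t, \theta_t)\) is simply the feature vector \(\phi(S_t)\). The notes do not define \(\delta_t\); the sketch assumes the usual TD error \(\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t)\). The function name and the feature vectors are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_lambda_step(theta, e, phi_s, phi_next, reward, alpha, gamma, lam):
    """One eligibility-trace update for a linear value function.

    phi_s / phi_next are the feature vectors of S_t and S_{t+1}
    (use an all-zero phi_next for a terminal state).
    """
    # Assumed TD error: delta = R + gamma * v(S') - v(S).
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi_s)
    # e_t = grad v(S_t) + gamma * lambda * e_{t-1}; the gradient is phi(S_t).
    e = [x + gamma * lam * e_i for x, e_i in zip(phi_s, e)]
    # New weights = old weights + alpha * delta * e_t.
    theta = [w + alpha * delta * e_i for w, e_i in zip(theta, e)]
    return theta, e, delta


theta, e = [0.0, 0.0], [0.0, 0.0]  # e_0 = 0
theta, e, delta = td_lambda_step(theta, e, [1.0, 0.0], [0.0, 1.0],
                                 reward=1.0, alpha=0.5, gamma=0.9, lam=0.8)
print(theta, e, delta)
theta, e, delta = td_lambda_step(theta, e, [0.0, 1.0], [0.0, 0.0],
                                 reward=2.0, alpha=0.5, gamma=0.9, lam=0.8)
print(theta, e, delta)
```

In the second step the weight of the first feature still changes, because its trace has decayed but not vanished.
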

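\(\alpha\) - the learning step size. \(\alpha \in (0, 1)\)
\(\gamma\) - the discount rate applied to future rewards. \(\gamma \in [0, 1]\)
\(\lambda\) - the weighting parameter in the \(\lambda\)-return. \(\lambda \in [0, 1]\)
h (horizon) - the horizon h is the step up to which data is available (can be simulated) at the current moment of an on-line computation. \(t < h \le T\)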

Multi-armed Bandit Problem

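\(q_*(a)\) - the true value (true expected reward) of action a. It is unknown in practice; the estimates are expected to converge to it.
\(N_t(a)\) - the number of times action a has been selected before step t.
\(Q_t(a)\) - the observed average reward of action a before step t (not including step t).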

\[Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \times 1_{A_i=a}}{N_t(a)} \]
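
The sample-average estimate can be computed directly from the history. This is a minimal sketch; returning 0.0 for an action that has never been selected is an assumed default, since the formula is undefined when \(N_t(a) = 0\).

```python
def sample_average(actions, rewards, a, default=0.0):
    """Q_t(a) from the history A_1..A_{t-1} and R_1..R_{t-1}.

    actions[i] and rewards[i] are the action taken and reward received
    at the same step.
    """
    # N_t(a): how many times action a was selected so far.
    n = sum(1 for x in actions if x == a)
    if n == 0:
        return default  # assumed default; the formula divides by zero here
    # Sum only the rewards of the steps where a was selected.
    total = sum(r for x, r in zip(actions, rewards) if x == a)
    return total / n


actions = [0, 1, 0, 1, 0]
rewards = [1.0, 0.0, 3.0, 5.0, 2.0]
print(sample_average(actions, rewards, 0))
print(sample_average(actions, rewards, 1))
print(sample_average(actions, rewards, 2))  # never selected
```
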

\(H_t(a)\) - the learned preference for action a.
\(\epsilon\) - in the ε-greedy policy, the probability of taking a random action, \(\epsilon \in [0, 1)\).
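
A minimal sketch of ε-greedy action selection over a list of action-value estimates. The function name and the optional `rng` parameter are assumptions made for this example; ties in the greedy choice go to the lowest index.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.9, 0.3]
print(epsilon_greedy(q_values, 0.0))  # always the greedy action
rng = random.Random(0)                # seeded generator for repeatability
picks = [epsilon_greedy(q_values, 0.5, rng) for _ in range(10)]
print(picks)
```

Because the random branch can also land on the greedy action, the greedy action is chosen with probability \(1 - \epsilon + \epsilon / |\mathcal{A}|\).
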

General Mathematical Notation

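\(\doteq\) - equality that holds by definition.
\(\mathbb{E}[X]\) - the expected value of \(X\).
\(\Pr\{X = x\}\) - the probability that the variable \(X\) takes the value \(x\).
\(v \mapsto g\) - v approaches g asymptotically.
\(v \approx g\) - v is approximately equal to g.
\(\mathbb{R}\) - the set of real numbers.
\(\mathbb{R}^n\) - the set of real-valued vectors with n elements.
\(\underset{a \in \mathcal{A}}{\max} \ F(a)\) - the maximum value of \(F(a)\) over all actions.
\(\underset{c}{\operatorname{argmax}} \ F(c)\) - the value of the argument \(c\) at which \(F(c)\) is maximal.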

Terminology

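episodic tasks - tasks (reinforcement learning problems) that end after a finite number of steps.
continuing tasks - tasks (reinforcement learning problems) that go on for an unlimited number of steps.
episode - all the steps from the start state (or the current state) to the end.
tabular method - a method that uses an array or table to store information (for example, the value) for every state or state-action pair.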

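planning method - requires a model from which state values can be computed, e.g. dynamic programming.
learning method - needs no model; state values are computed from simulated or actual experience, e.g. Monte Carlo methods and temporal-difference methods.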

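on-policy method - the policy being evaluated and the policy being improved are the same.
off-policy method - the policy being evaluated and the policy being improved are different, which means the policy is improved using sample data generated by another source.
target policy - the policy to be improved in an off-policy method.
behavior policy - the policy \(\mu\) that generates the sample data in an off-policy method.
importance sampling - a technique for estimating expected values under the target policy \(\pi\) from sample data generated by the behavior policy \(\mu\).
importance sampling ratio - the weight applied to the sample data to account for the difference between the target policy \(\pi\) and the behavior policy \(\mu\).
ordinary importance sampling - an unbiased method for estimating a policy's value.
weighted importance sampling - a biased method for estimating a policy's value.
MSE - mean squared error.
MDP - Markov decision process.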
The forward view - We decide how to update each state by looking forward to future rewards and states.
For example:

\[G_t^{(n)} \doteq R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n \hat{v}(S_{t+n}, \theta_{t+n-1}) , \ 0 \le t \le T-n \\ \]

The backward or mechanistic view - Each update depends on the current TD error combined with eligibility traces of past events.
For example:

\[e_0 \doteq 0 \\ e_t \doteq \nabla \hat{v}(S_t, \theta_t) + \gamma \lambda e_{t-1} \\ \]
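
The importance sampling terms listed above (importance sampling ratio, ordinary and weighted importance sampling) can be illustrated with a short sketch. The policy tables, trajectory layout and function names are assumptions made for this example: the ratio is taken as the product of \(\pi(a|s) / \mu(a|s)\) along a trajectory, the ordinary estimator divides by the number of returns, and the weighted estimator divides by the sum of the ratios.

```python
def importance_ratio(trajectory, pi, mu):
    """Product of pi(a|s) / mu(a|s) over the (state, action) pairs."""
    rho = 1.0
    for s, a in trajectory:
        rho *= pi[s][a] / mu[s][a]
    return rho


def ordinary_is(returns, ratios):
    """Ordinary importance sampling: simple average of rho * G."""
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)


def weighted_is(returns, ratios):
    """Weighted importance sampling: average weighted by the ratios."""
    denom = sum(ratios)
    if denom == 0:
        return 0.0  # assumed default when every ratio is zero
    return sum(rho * g for rho, g in zip(ratios, returns)) / denom


pi = {"s": {"a": 0.9, "b": 0.1}}   # target policy
mu = {"s": {"a": 0.5, "b": 0.5}}   # behavior policy
print(importance_ratio([("s", "a"), ("s", "b")], pi, mu))

returns = [1.0, 3.0]
ratios = [2.0, 0.5]
print(ordinary_is(returns, ratios))
print(weighted_is(returns, ratios))
```
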
