Reinforcement Learning with Deep Energy-Based Policies
Paper link
soft Q-learning
Notes
The optimal policy under the standard reinforcement learning objective:
\[\begin{equation}\pi^*_{std} = \underset{\pi}{argmax}\sum_tE_{(S_t,A_t)\sim \rho_\pi}[r(S_t,A_t)]\end{equation} \]
The optimal policy under the maximum-entropy reinforcement learning objective:
\[\begin{equation}\pi^*_{MaxEnt} = \underset{\pi}{argmax}\sum_tE_{(S_t,A_t)\sim \rho_\pi}[r(S_t,A_t) + \alpha H(\pi(\cdot | S_t))]\end{equation} \]
\(\alpha\) is a temperature parameter that trades off the relative importance of the reward and the entropy term.
Most earlier approaches only greedily maximize the entropy of the policy at the current state; in this paper the entropy is put on an equal footing with the reward, i.e. we want to maximize the cumulative entropy over the future as well.
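A tiny numpy sketch of this objective on one made-up rollout (discounted, as in the Q-function defined below); the rewards and per-step action distributions are invented for illustration only:

```python
# Estimate sum_t gamma^t * (r_t + alpha * H(pi(.|s_t))) from one sampled trajectory.
import numpy as np

gamma, alpha = 0.99, 0.1
rewards = np.array([1.0, 0.0, 0.5, 2.0])            # r_t along one rollout (made up)
pi = np.array([[0.7, 0.2, 0.1],                     # pi(.|s_t) over 3 discrete actions
               [0.4, 0.4, 0.2],
               [0.1, 0.8, 0.1],
               [1/3, 1/3, 1/3]])

entropy = -(pi * np.log(pi)).sum(axis=1)            # H(pi(.|s_t)) at each step
discounts = gamma ** np.arange(len(rewards))
print(np.sum(discounts * (rewards + alpha * entropy)))  # discounted max-ent return
```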
The stochastic policy \(\pi\): the paper wants this distribution to be general (expressive), so it takes the energy-based form
\[\begin{equation}\pi(a_t|s_t)\propto exp(-\varepsilon (s_t,a_t))\end{equation} \]
\(\varepsilon\) is the energy function, which can be represented by a neural network.
Define the following Q-function:
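For a finite action set, \(\pi(a|s)\propto exp(-\varepsilon(s,a))\) is just a softmax over negative energies; a minimal sketch (the energies are invented placeholders for a network's output):

```python
import numpy as np

def energy_based_policy(energies, alpha=1.0):
    """pi(a|s) ∝ exp(-E(s,a)/alpha): a softmax over negative energies."""
    logits = -energies / alpha
    logits -= logits.max()              # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

energies = np.array([2.0, 0.5, 1.0])    # made-up E(s, a) for 3 actions
print(energy_based_policy(energies))    # low energy -> high probability
```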
\[\begin{equation}Q^*_{soft}(s_t,a_t) = r_t +E_{(s_{t+1},\cdots)\sim \rho_\pi}[\sum_{l=1}^{\infty}\gamma^l(r_{t+l}+\alpha H(\pi^*_{MaxEnt}(\cdot|s_{t+l})))]\end{equation} \]
The value function is then derived as
\[\begin{equation}V^*_{soft}(s_t) = \alpha log \int_A exp(\frac{1}{\alpha}Q^*_{soft}(s_t,a'))da'\end{equation} \]
The value function has the \(LogSumExp\) (soft maximum) form.
Solving for the optimal policy gives
\[\begin{equation}\pi^*_{MaxEnt} (a_t|s_t)= exp(\frac{1}{\alpha}(Q^*_{soft}(s_t,a_t)-V^*_{soft}(s_t)))\end{equation} \]
Equation (5) plays the role of (\(\alpha\) times the log of) the partition function: an action-independent constant that normalizes the policy.
Soft Bellman equation (the Q-function defined in equation (4) satisfies this form):
\[\begin{equation}Q^*_{soft}(s_t,a_t) = r_t+ \gamma E_{s_{t+1}\sim \rho_s}[V^*_{soft}(s_{t+1})]\end{equation} \]
The soft Bellman equation can be seen as a generalization of the ordinary (hard) Bellman equation, with \(\alpha\) interpolating between soft and hard: as \(\alpha\to 0\) it recovers the hard maximum.
To solve the soft Bellman equation, the paper derives Soft Q-iteration, analogous to (Q-)value iteration.
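A quick discrete-action check of equations (5)-(6) and of the \(\alpha\to 0\) limit (the Q-values are made up; `scipy.special.logsumexp` replaces the integral with a sum):

```python
import numpy as np
from scipy.special import logsumexp

Q = np.array([1.0, 2.0, 0.0])                  # made-up Q_soft(s, a) for 3 actions

alpha = 0.5
V = alpha * logsumexp(Q / alpha)               # eq. (5), sum in place of the integral
pi = np.exp((Q - V) / alpha)                   # eq. (6)
print(V, pi, pi.sum())                         # pi sums to 1: V is the normalizer

for a in [2.0, 0.5, 0.1, 0.01]:
    print(a, a * logsumexp(Q / a))             # approaches max(Q) = 2.0 as alpha -> 0
```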
fixed-point iteration
\[\begin{equation}Q_{soft}(s_t,a_t) \leftarrow r_t+ \gamma E_{s_{t+1}\sim \rho_s}[V_{soft}(s_{t+1})] ,\forall s_t,a_t \end{equation} \]
\[\begin{equation}V_{soft}(s_t) \leftarrow \alpha log \int_A exp(\frac{1}{\alpha}Q_{soft}(s_t,a'))da',\forall s_t \end{equation} \]
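A tabular sketch of this fixed-point iteration on a tiny invented MDP (2 states, 2 actions); it only illustrates the backups of equations (8)-(9) converging, not the paper's experiments:

```python
import numpy as np
from scipy.special import logsumexp

gamma, alpha = 0.9, 0.2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],        # P[s, a, s']: made-up transition probabilities
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],                      # R[s, a]: made-up rewards
              [0.0, 2.0]])

Q = np.zeros((2, 2))
for _ in range(200):
    V = alpha * logsumexp(Q / alpha, axis=1)   # soft value backup, eq. (9)
    Q = R + gamma * P @ V                      # soft Bellman backup, eq. (8)
print(Q)
print(alpha * logsumexp(Q / alpha, axis=1))
```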
To turn Soft Q-iteration into a stochastic optimization problem:
Compute \(V\) via importance sampling,
\[\begin{equation}V^\theta _{soft}(s_t) = \alpha log E_{q_{a'}}[\frac{ exp(\frac{1}{\alpha}Q^\theta_{soft}(s_t,a'))}{q_{a'}(a')}]\end{equation} \]
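A Monte-Carlo sketch of this importance-sampling estimate for a 1-D action, with a standard Gaussian proposal \(q_{a'}\) and a made-up quadratic stand-in for \(Q^\theta_{soft}(s_t,\cdot)\):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

alpha = 0.5
Q = lambda a: -(a - 0.3) ** 2                        # stand-in for Q^theta_soft(s_t, .)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=10_000)                # a' ~ q_{a'} = N(0, 1)
log_w = Q(a) / alpha - norm.logpdf(a, 0.0, 1.0)      # log[ exp(Q/alpha) / q_{a'}(a') ]
V_hat = alpha * (logsumexp(log_w) - np.log(a.size))  # alpha * log of the sample mean
print(V_hat)
```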
Soft Q-iteration is then equivalent to minimizing the following objective:
\[\begin{equation}J_Q(\theta) = E_{s_t\sim q_{s_t},a_t\sim q_{a_t}}[\frac{1}{2}(\hat{Q}^{\bar{\theta}}_{soft}(s_t,a_t)-Q^{\theta}_{soft}(s_t,a_t))^2]\end{equation} \]
\(\hat{Q}^{\bar{\theta}}_{soft}(s_t,a_t) = r_t+ \gamma E_{s_{t+1}\sim \rho_s}[V_{soft}^{\bar{\theta}}(s_{t+1})]\) is the target Q-value.
\(q_{s_t},q_{a_t}\) are obtained by sampling from a replay buffer of experience collected with \(\pi(a_t|s_t)\propto exp(-\varepsilon (s_t,a_t))\).
\(q_{a'}\) is sampled from the current policy, but sampling from this energy-based distribution is difficult (this motivates the amortized sampler below).
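Putting equations (10)-(11) together, a minimal PyTorch-style sketch of \(J_Q(\theta)\) with a target network; the `q_net(s, a)` interface, the uniform proposal over \([-1,1]\) actions, and the batch layout are assumptions, not the paper's exact setup:

```python
import torch

def soft_q_loss(q_net, target_q_net, batch, alpha=0.2, gamma=0.99, n_action_samples=16):
    """J_Q(theta) of eq. (11); q_net(s, a) -> (..., 1) soft Q-values (assumed interface)."""
    s, a, r, s_next = batch                           # tensors sampled from the replay buffer
    B, act_dim = a.shape

    # a' ~ q_{a'} = Uniform([-1, 1]^act_dim), so q_{a'}(a') = (1/2)^act_dim
    a_prime = 2 * torch.rand(B, n_action_samples, act_dim) - 1
    s_rep = s_next.unsqueeze(1).expand(-1, n_action_samples, -1)
    with torch.no_grad():
        q_next = target_q_net(s_rep, a_prime).squeeze(-1)            # (B, n_action_samples)
        # eq. (10): V = alpha * log mean[ exp(Q/alpha) / q_{a'}(a') ]
        log_w = q_next / alpha + act_dim * torch.log(torch.tensor(2.0))
        v_next = alpha * (torch.logsumexp(log_w, dim=1)
                          - torch.log(torch.tensor(float(n_action_samples))))
        q_hat = r + gamma * v_next                                   # target \hat{Q}^{\bar\theta}
    q_pred = q_net(s, a).squeeze(-1)
    return 0.5 * ((q_hat - q_pred) ** 2).mean()
```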
\(a_t =f^\phi(\xi;s_t)\), where \(\xi\) is noise drawn from a standard Gaussian and \(f^\phi\) is a neural network (the sampling network).
We want to minimize the distance between the two distributions:
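A minimal sketch of such a sampling network (the layer sizes, Tanh squashing, and noise dimension are assumptions):

```python
import torch
import torch.nn as nn

class SamplerNetwork(nn.Module):
    """a_t = f^phi(xi; s_t): map the state plus standard Gaussian noise to an action."""

    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.noise_dim = action_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + self.noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # keep sampled actions bounded
        )

    def forward(self, state, n_samples=1):
        # state: (B, state_dim) -> actions: (B, n_samples, action_dim)
        B = state.shape[0]
        xi = torch.randn(B, n_samples, self.noise_dim)              # xi ~ N(0, I)
        s = state.unsqueeze(1).expand(-1, n_samples, -1)
        return self.net(torch.cat([s, xi], dim=-1))
```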
\[\begin{equation} J_\pi (\phi,s_t) = D_{KL} (\pi^\phi(\cdot|s_t)\| exp(\frac{1}{\alpha} (Q^\theta_{soft}(s_t,\cdot)-V^\theta_{soft}(s_t))))\end{equation} \]
If we "perturb" a set of independent samples \(a^{(i)}_t =f^\phi(\xi^{(i)};s_t)\) in appropriate directions \(\Delta f^\phi(\xi^{(i)};s_t)\), the induced KL divergence can be reduced. The greediest such direction, given by Stein variational gradient descent (SVGD), is
\[\begin{equation}\Delta f^{\phi}(\cdot;s_t) = E_{a_t\sim \pi^\phi} [\kappa (a_t,f^{\phi}(\cdot;s_t))\nabla_{a'}Q^\theta_{soft}(s_t,a')|_{a'=a_t}+ \alpha \nabla_{a'}\kappa(a',f^\phi(\cdot;s_t))|_{a'=a_t}] \end{equation} \]
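A sketch of this SVGD direction for one state, using an RBF kernel with the median-heuristic bandwidth (a common SVGD choice; it is only a stand-in for the exact kernel used in the paper's implementation):

```python
import torch

def rbf_kernel(particles, queries):
    """RBF kernel kappa and its gradient w.r.t. the first argument (the particles)."""
    diff = particles.unsqueeze(1) - queries.unsqueeze(0)      # (n, m, d)
    dist_sq = (diff ** 2).sum(-1)                             # (n, m)
    # median heuristic for the bandwidth
    h = dist_sq.median() / torch.log(torch.tensor(float(particles.shape[0]) + 1.0))
    k = torch.exp(-dist_sq / (h + 1e-8))                      # kappa(a_t^(i), f^phi(.; s_t))
    grad_k = -2.0 / (h + 1e-8) * diff * k.unsqueeze(-1)       # d kappa / d a_t
    return k, grad_k

def svgd_direction(actions, grad_q, alpha=0.2):
    """Delta f^phi of eq. (13), evaluated at each sampled action.

    actions: (n, d) samples a_t^(i); grad_q: (n, d) gradients dQ/da' at those samples.
    """
    k, grad_k = rbf_kernel(actions, actions)
    kernel_term = (k.unsqueeze(-1) * grad_q.unsqueeze(1)).mean(0)   # E[kappa * grad Q]
    repulsive_term = grad_k.mean(0)                                 # E[grad_a kappa]
    return kernel_term + alpha * repulsive_term                     # (n, d)
```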
However, \(\Delta f^\phi\) is not the true gradient \(\nabla_\phi J_\pi\); it can be propagated to the policy parameters via the chain rule:
\[\begin{equation}\frac{\partial J_\pi(\phi;s_t)}{\partial \phi} \propto E_{\xi}[\Delta f^\phi(\xi;s_t)\frac{\partial f^\phi(\xi;s_t)}{\partial\phi}]\end{equation} \]
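One common way to implement this chain rule is a stop-gradient trick: treat \(\Delta f^\phi\) as a fixed upstream gradient and back-propagate it through \(f^\phi\) via a surrogate objective. A sketch for a single state, reusing the `SamplerNetwork`, `q_net`, and `svgd_direction` sketches above (the step follows the usual SVGD convention of moving \(f^\phi\) along \(\Delta f^\phi\) to reduce the KL in eq. (12)):

```python
import torch

def sampler_update(sampler, q_net, state, optimizer, alpha=0.2, n_particles=16):
    # state: (1, state_dim); actions: (n_particles, act_dim) = f^phi(xi^(i); s_t)
    actions = sampler(state, n_samples=n_particles).squeeze(0)

    # grad_a Q^theta_soft(s_t, a') at the sampled actions (no gradient to phi here)
    a = actions.detach().requires_grad_(True)
    s_rep = state.expand(n_particles, -1)
    grad_q = torch.autograd.grad(q_net(s_rep, a).sum(), a)[0]

    delta_f = svgd_direction(actions.detach(), grad_q, alpha=alpha)   # eq. (13)
    # Minimizing -sum(stop_grad(Delta f) * f) makes the optimizer step move f^phi
    # along Delta f, i.e. it applies E[Delta f * df/dphi] as in eq. (14).
    surrogate = -(delta_f.detach() * actions).sum() / n_particles
    optimizer.zero_grad()
    surrogate.backward()
    optimizer.step()
```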
