Reinforcement Learning with Deep Energy-Based Policies
Paper link
soft Q-learning
Notes
The optimal policy under the standard reinforcement learning objective:
\[\begin{equation}\pi^*_{std} = \underset{\pi}{argmax}\sum_tE_{(S_t,A_t)\sim \rho_\pi}[r(S_t,A_t)]\end{equation} \]
The optimal policy under the maximum-entropy reinforcement learning objective:
\[\begin{equation}\pi^*_{MaxEnt} = \underset{\pi}{argmax}\sum_tE_{(S_t,A_t)\sim \rho_\pi}[r(S_t,A_t) + \alpha H(\pi(\cdot | S_t))]\end{equation} \]
\(\alpha\) is a temperature parameter that trades off the relative importance of the reward and the entropy term.
Most earlier approaches only greedily maximize the entropy of the policy at the current state; in this paper the entropy is put on an equal footing with the reward, i.e. we want to maximize the cumulative entropy over the future as well.
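A tiny numpy sketch of this objective on one made-up rollout (discounted, as in the Q-function defined below); the rewards and per-step action distributions are invented for illustration only:

```python
# Estimate sum_t gamma^t * (r_t + alpha * H(pi(.|s_t))) from one sampled trajectory.
import numpy as np

gamma, alpha = 0.99, 0.1
rewards = np.array([1.0, 0.0, 0.5, 2.0])            # r_t along one rollout (made up)
pi = np.array([[0.7, 0.2, 0.1],                     # pi(.|s_t) over 3 discrete actions
               [0.4, 0.4, 0.2],
               [0.1, 0.8, 0.1],
               [1/3, 1/3, 1/3]])

entropy = -(pi * np.log(pi)).sum(axis=1)            # H(pi(.|s_t)) at each step
discounts = gamma ** np.arange(len(rewards))
print(np.sum(discounts * (rewards + alpha * entropy)))  # discounted max-ent return
```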
The stochastic policy \(\pi\): the paper wants this distribution to be general (expressive), so it takes the energy-based form
\[\begin{equation}\pi(a_t|s_t)\propto exp(-\varepsilon (s_t,a_t))\end{equation} \]
\(\varepsilon\) is the energy function, which can be represented by a neural network.
Define the following Q-function:
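For a finite action set, \(\pi(a|s)\propto exp(-\varepsilon(s,a))\) is just a softmax over negative energies; a minimal sketch (the energies are invented placeholders for a network's output):

```python
import numpy as np

def energy_based_policy(energies, alpha=1.0):
    """pi(a|s) ∝ exp(-E(s,a)/alpha): a softmax over negative energies."""
    logits = -energies / alpha
    logits -= logits.max()              # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

energies = np.array([2.0, 0.5, 1.0])    # made-up E(s, a) for 3 actions
print(energy_based_policy(energies))    # low energy -> high probability
```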
\[\begin{equation}Q^*_{soft}(s_t,a_t) = r_t +E_{(s_{t+1},\cdots)\sim \rho_\pi}[\sum_{l=1}^{\infty}\gamma^l(r_{t+l}+\alpha H(\pi^*_{MaxEnt}(\cdot|s_{t+l})))]\end{equation} \]
The value function is then derived as
\[\begin{equation}V^*_{soft}(s_t) = \alpha log \int_A exp(\frac{1}{\alpha}Q^*_{soft}(s_t,a'))da'\end{equation} \]
The value function has the \(LogSumExp\) (soft maximum) form.
Solving for the optimal policy gives
\[\begin{equation}\pi^*_{MaxEnt} (a_t|s_t)= exp(\frac{1}{\alpha}(Q^*_{soft}(s_t,a_t)-V^*_{soft}(s_t)))\end{equation} \]
Equation (5) plays the role of (\(\alpha\) times the log of) the partition function: an action-independent constant that normalizes the policy.
Soft Bellman equation (the Q-function defined in equation (4) satisfies this form):
\[\begin{equation}Q^*_{soft}(s_t,a_t) = r_t+ \gamma E_{s_{t+1}\sim \rho_s}[V^*_{soft}(s_{t+1})]\end{equation} \]
The soft Bellman equation can be seen as a generalization of the ordinary (hard) Bellman equation, with \(\alpha\) interpolating between soft and hard: as \(\alpha\to 0\) it recovers the hard maximum.
To solve the soft Bellman equation, the paper derives Soft Q-iteration, analogous to (Q-)value iteration.
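A quick discrete-action check of equations (5)-(6) and of the \(\alpha\to 0\) limit (the Q-values are made up; `scipy.special.logsumexp` replaces the integral with a sum):

```python
import numpy as np
from scipy.special import logsumexp

Q = np.array([1.0, 2.0, 0.0])                  # made-up Q_soft(s, a) for 3 actions

alpha = 0.5
V = alpha * logsumexp(Q / alpha)               # eq. (5), sum in place of the integral
pi = np.exp((Q - V) / alpha)                   # eq. (6)
print(V, pi, pi.sum())                         # pi sums to 1: V is the normalizer

for a in [2.0, 0.5, 0.1, 0.01]:
    print(a, a * logsumexp(Q / a))             # approaches max(Q) = 2.0 as alpha -> 0
```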
fixed-point iteration
\[\begin{equation}Q_{soft}(s_t,a_t) \leftarrow r_t+ \gamma E_{s_{t+1}\sim \rho_s}[V_{soft}(s_{t+1})] ,\forall s_t,a_t \end{equation} \]
\[\begin{equation}V_{soft}(s_t) \leftarrow \alpha log \int_A exp(\frac{1}{\alpha}Q_{soft}(s_t,a'))da',\forall s_t \end{equation} \]
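A tabular sketch of this fixed-point iteration on a tiny invented MDP (2 states, 2 actions); it only illustrates the backups of equations (8)-(9) converging, not the paper's experiments:

```python
import numpy as np
from scipy.special import logsumexp

gamma, alpha = 0.9, 0.2
P = np.array([[[0.8, 0.2], [0.1, 0.9]],        # P[s, a, s']: made-up transition probabilities
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],                      # R[s, a]: made-up rewards
              [0.0, 2.0]])

Q = np.zeros((2, 2))
for _ in range(200):
    V = alpha * logsumexp(Q / alpha, axis=1)   # soft value backup, eq. (9)
    Q = R + gamma * P @ V                      # soft Bellman backup, eq. (8)
print(Q)
print(alpha * logsumexp(Q / alpha, axis=1))
```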
To turn Soft Q-iteration into a stochastic optimization problem:
Compute \(V\) via importance sampling,
\[\begin{equation}V^\theta _{soft}(s_t) = \alpha log E_{q_{a'}}[\frac{ exp(\frac{1}{\alpha}Q^\theta_{soft}(s_t,a'))}{q_{a'}(a')}]\end{equation} \]
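A Monte-Carlo sketch of this importance-sampling estimate for a 1-D action, with a standard Gaussian proposal \(q_{a'}\) and a made-up quadratic stand-in for \(Q^\theta_{soft}(s_t,\cdot)\):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

alpha = 0.5
Q = lambda a: -(a - 0.3) ** 2                        # stand-in for Q^theta_soft(s_t, .)

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=10_000)                # a' ~ q_{a'} = N(0, 1)
log_w = Q(a) / alpha - norm.logpdf(a, 0.0, 1.0)      # log[ exp(Q/alpha) / q_{a'}(a') ]
V_hat = alpha * (logsumexp(log_w) - np.log(a.size))  # alpha * log of the sample mean
print(V_hat)
```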
Soft Q-iteration is then equivalent to minimizing the following objective:
\[\begin{equation}J_Q(\theta) = E_{s_t\sim q_{s_t},a_t\sim q_{a_t}}[\frac{1}{2}(\hat{Q}^{\bar{\theta}}_{soft}(s_t,a_t)-Q^{\theta}_{soft}(s_t,a_t))^2]\end{equation} \]
\(\hat{Q}^{\bar{\theta}}_{soft}(s_t,a_t) = r_t+ \gamma E_{s_{t+1}\sim \rho_s}[V_{soft}^{\bar{\theta}}(s_{t+1})]\) is the target Q-value.
\(q_{s_t},q_{a_t}\) are obtained by sampling from a replay buffer of experience collected with \(\pi(a_t|s_t)\propto exp(-\varepsilon (s_t,a_t))\).
\(q_{a'}\) is sampled from the current policy, but sampling from this energy-based distribution is difficult (this motivates the amortized sampler below).
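Putting equations (10)-(11) together, a minimal PyTorch-style sketch of \(J_Q(\theta)\) with a target network; the `q_net(s, a)` interface, the uniform proposal over \([-1,1]\) actions, and the batch layout are assumptions, not the paper's exact setup:

```python
import torch

def soft_q_loss(q_net, target_q_net, batch, alpha=0.2, gamma=0.99, n_action_samples=16):
    """J_Q(theta) of eq. (11); q_net(s, a) -> (..., 1) soft Q-values (assumed interface)."""
    s, a, r, s_next = batch                           # tensors sampled from the replay buffer
    B, act_dim = a.shape

    # a' ~ q_{a'} = Uniform([-1, 1]^act_dim), so q_{a'}(a') = (1/2)^act_dim
    a_prime = 2 * torch.rand(B, n_action_samples, act_dim) - 1
    s_rep = s_next.unsqueeze(1).expand(-1, n_action_samples, -1)
    with torch.no_grad():
        q_next = target_q_net(s_rep, a_prime).squeeze(-1)            # (B, n_action_samples)
        # eq. (10): V = alpha * log mean[ exp(Q/alpha) / q_{a'}(a') ]
        log_w = q_next / alpha + act_dim * torch.log(torch.tensor(2.0))
        v_next = alpha * (torch.logsumexp(log_w, dim=1)
                          - torch.log(torch.tensor(float(n_action_samples))))
        q_hat = r + gamma * v_next                                   # target \hat{Q}^{\bar\theta}
    q_pred = q_net(s, a).squeeze(-1)
    return 0.5 * ((q_hat - q_pred) ** 2).mean()
```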
\(a_t =f^\phi(\xi;s_t)\), where \(\xi\) is noise drawn from a standard Gaussian and \(f^\phi\) is a neural network (the sampling network).
We want to minimize the distance between the two distributions:
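A minimal sketch of such a sampling network (the layer sizes, Tanh squashing, and noise dimension are assumptions):

```python
import torch
import torch.nn as nn

class SamplerNetwork(nn.Module):
    """a_t = f^phi(xi; s_t): map the state plus standard Gaussian noise to an action."""

    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.noise_dim = action_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + self.noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # keep sampled actions bounded
        )

    def forward(self, state, n_samples=1):
        # state: (B, state_dim) -> actions: (B, n_samples, action_dim)
        B = state.shape[0]
        xi = torch.randn(B, n_samples, self.noise_dim)              # xi ~ N(0, I)
        s = state.unsqueeze(1).expand(-1, n_samples, -1)
        return self.net(torch.cat([s, xi], dim=-1))
```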
\[\begin{equation} J_\pi (\phi,s_t) = D_{KL} (\pi^\phi(\cdot|s_t)\| exp(\frac{1}{\alpha} (Q^\theta_{soft}(s_t,\cdot)-V^\theta_{soft}(s_t))))\end{equation} \]
If we "perturb" a set of independent samples \(a^{(i)}_t =f^\phi(\xi^{(i)};s_t)\) in appropriate directions \(\Delta f^\phi(\xi^{(i)};s_t)\), the induced KL divergence can be reduced. The greediest such direction, given by Stein variational gradient descent (SVGD), is
\[\begin{equation}\Delta f^{\phi}(\cdot;s_t) = E_{a_t\sim \pi^\phi} [\kappa (a_t,f^{\phi}(\cdot;s_t))\nabla_{a'}Q^\theta_{soft}(s_t,a')|_{a'=a_t}+ \alpha \nabla_{a'}\kappa(a',f^\phi(\cdot;s_t))|_{a'=a_t}] \end{equation} \]
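A sketch of this SVGD direction for one state, using an RBF kernel with the median-heuristic bandwidth (a common SVGD choice; it is only a stand-in for the exact kernel used in the paper's implementation):

```python
import torch

def rbf_kernel(particles, queries):
    """RBF kernel kappa and its gradient w.r.t. the first argument (the particles)."""
    diff = particles.unsqueeze(1) - queries.unsqueeze(0)      # (n, m, d)
    dist_sq = (diff ** 2).sum(-1)                             # (n, m)
    # median heuristic for the bandwidth
    h = dist_sq.median() / torch.log(torch.tensor(float(particles.shape[0]) + 1.0))
    k = torch.exp(-dist_sq / (h + 1e-8))                      # kappa(a_t^(i), f^phi(.; s_t))
    grad_k = -2.0 / (h + 1e-8) * diff * k.unsqueeze(-1)       # d kappa / d a_t
    return k, grad_k

def svgd_direction(actions, grad_q, alpha=0.2):
    """Delta f^phi of eq. (13), evaluated at each sampled action.

    actions: (n, d) samples a_t^(i); grad_q: (n, d) gradients dQ/da' at those samples.
    """
    k, grad_k = rbf_kernel(actions, actions)
    kernel_term = (k.unsqueeze(-1) * grad_q.unsqueeze(1)).mean(0)   # E[kappa * grad Q]
    repulsive_term = grad_k.mean(0)                                 # E[grad_a kappa]
    return kernel_term + alpha * repulsive_term                     # (n, d)
```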
However, \(\Delta f^\phi\) is not the true gradient \(\nabla_\phi J_\pi\); it can be propagated to the policy parameters via the chain rule:
\[\begin{equation}\frac{\partial J_\pi(\phi;s_t)}{\partial \phi} \propto E_{\xi}[\Delta f^\phi(\xi;s_t)\frac{\partial f^\phi(\xi;s_t)}{\partial\phi}]\end{equation} \]
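One common way to implement this chain rule is a stop-gradient trick: treat \(\Delta f^\phi\) as a fixed upstream gradient and back-propagate it through \(f^\phi\) via a surrogate objective. A sketch for a single state, reusing the `SamplerNetwork`, `q_net`, and `svgd_direction` sketches above (the step follows the usual SVGD convention of moving \(f^\phi\) along \(\Delta f^\phi\) to reduce the KL in eq. (12)):

```python
import torch

def sampler_update(sampler, q_net, state, optimizer, alpha=0.2, n_particles=16):
    # state: (1, state_dim); actions: (n_particles, act_dim) = f^phi(xi^(i); s_t)
    actions = sampler(state, n_samples=n_particles).squeeze(0)

    # grad_a Q^theta_soft(s_t, a') at the sampled actions (no gradient to phi here)
    a = actions.detach().requires_grad_(True)
    s_rep = state.expand(n_particles, -1)
    grad_q = torch.autograd.grad(q_net(s_rep, a).sum(), a)[0]

    delta_f = svgd_direction(actions.detach(), grad_q, alpha=alpha)   # eq. (13)
    # Minimizing -sum(stop_grad(Delta f) * f) makes the optimizer step move f^phi
    # along Delta f, i.e. it applies E[Delta f * df/dphi] as in eq. (14).
    surrogate = -(delta_f.detach() * actions).sum() / n_particles
    optimizer.zero_grad()
    surrogate.backward()
    optimizer.step()
```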
