Asynchronous Methods for Deep Reinforcement Learning
ICML 2016
Deep reinforcement learning has been observed to be unstable, and many of the proposed fixes share a common idea: the sequence of observations encountered by an online agent is non-stationary, and online RL updates are strongly correlated. By storing the agent's data in an experience replay buffer, the data can be batched or randomly sampled across different time-steps. This reduces non-stationarity and decorrelates the updates, but at the same time it restricts these methods to off-policy RL algorithms (a minimal replay-buffer sketch follows the drawbacks list below).
Experience replay has several drawbacks:
1. it uses more memory and computation per real interaction;
2. it requires off-policy learning algorithms that can update from data generated by an older policy.
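For concreteness, here is a minimal replay-buffer sketch (the `ReplayBuffer` class, its capacity, and the batch size are illustrative, not from the paper): transitions are stored as they occur and later sampled out of temporal order, which is also why the sampled data can come from an older policy and therefore requires an off-policy learner.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions so they can be sampled out of order,
    breaking the temporal correlation of online updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Sampled transitions may have been generated by a much older policy,
        # hence the need for an off-policy learner such as Q-learning.
        return random.sample(list(self.buffer), batch_size)
```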
In this paper we propose a very different paradigm for deep RL. Instead of experience replay, we asynchronously execute multiple agents in parallel, on multiple instances of the environment. This parallelism decorrelates the agents' data into a more stationary process, because at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as Sarsa, n-step methods, and actor-critic methods, as well as off-policy algorithms such as Q-learning, to be applied robustly and effectively with deep neural networks such as CNNs.
The parallel RL framework also offers practical benefits. Previous approaches relied on specialized hardware such as GPUs or massively distributed architectures, whereas these algorithms run on a single machine with a multi-core CPU and achieve better results than previous GPU-based algorithms. The asynchronous advantage actor-critic (A3C) algorithm in particular performs remarkably well, as the rest of these notes show.
Section 3 of the paper covers the background on DQN, and in doing so highlights a few problems with Q-learning:
One-step Q-learning updates the action value Q(s, a) toward the one-step return. A drawback of one-step methods is that obtaining a reward r only directly affects the value of the state-action pair (s, a) that led to the reward; the values of other state-action pairs are affected only indirectly through the updated value Q(s, a). This can make learning slow, since many updates are required to propagate a reward to the relevant preceding states and actions.
One way of propagating rewards faster is to use n-step returns. In n-step Q-learning, Q(s, a) is updated toward the n-step return, so a single reward r directly affects the values of n preceding state-action pairs. This makes the process of propagating rewards to relevant state-action pairs much more efficient.
The paper also uses the advantage function A, i.e., the Q value minus the state value, defined as:
$A(a_t, s_t) = Q(a_t, s_t) - V(s_t)$;
This approach can be viewed as an actor-critic architecture, where the policy $\pi$ is the actor and the baseline $b_t$ is the critic.
3. Reinforcement Learning Background
We consider the standard reinforcement learning setting in which an agent interacts with an environment over a number of discrete time-steps. At each time-step t, the agent receives a state $s_t$ and, according to its policy $\pi$, selects an action $a_t$ from the set of possible actions, where the policy is a mapping from states to actions. In return, the agent receives the next state $s_{t+1}$ and a scalar reward $r_t$. The process continues until the agent reaches a terminal state. The return $R_t$ is the total accumulated (discounted) reward from time-step t onwards. The goal of the agent is to maximize the expected return from each state. The action value $Q^{\pi}(s, a)$ is the expected return for selecting action a in state s and then following policy $\pi$. The optimal value function $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$ gives the maximum action value achievable by any policy. Similarly, the value of state s under policy $\pi$, defined as $V^{\pi}(s) = E[R_t | s_t = s]$, is simply the expected return for following policy $\pi$ from state s.
In value-based model-free RL methods, the action value function is represented with a function approximator, such as a neural network. Let $Q(s, a; \theta)$ be an approximate action-value function with parameters $\theta$. The updates to the parameters can be derived from a variety of RL algorithms. One example is Q-learning, which aims to directly approximate the optimal action value function: $Q^*(s, a) \approx Q(s, a; \theta)$. In one-step Q-learning, the parameters $\theta$ of the action value function $Q(s, a; \theta)$ are learned by iteratively minimizing a sequence of loss functions, with the i-th loss function defined as $L_i(\theta_i) = E\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right)^2\right]$, where $s'$ is the state encountered after state $s$.
We refer to the above method as one-step Q-learning because it updates the action value Q(s, a) toward the one-step return $r + \gamma \max_{a'} Q(s', a'; \theta)$. One drawback of this approach is that obtaining a reward r only directly affects the value of the state-action pair (s, a) that led to the reward. This can make learning slow, since many updates are required to propagate a reward to the relevant preceding states and actions. One way of propagating rewards faster is to use n-step returns. In n-step Q-learning, Q(s, a) is updated toward the n-step return, defined as $r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a Q(s_{t+n}, a)$. As a result, a single reward r directly affects the values of n preceding state-action pairs, which makes the process of propagating rewards much more efficient.
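To make the formula concrete, here is a small sketch (function and variable names are mine, not the paper's) that computes the n-step bootstrapped target from a list of rewards; with a single reward it reduces to the one-step target.

```python
import numpy as np

def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """n-step Q-learning target:
    r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * max_a Q(s_{t+n}, a).
    `rewards` holds r_t ... r_{t+n-1}; `bootstrap_q` is max_a Q(s_{t+n}, a)."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma ** n * bootstrap_q)

# Toy example: a 3-step target with a bootstrap value of 10.
rewards = [1.0, 0.0, 2.0]
print(n_step_target(rewards, bootstrap_q=10.0))     # 1 + 0.99^2 * 2 + 0.99^3 * 10
print(n_step_target(rewards[:1], bootstrap_q=5.0))  # one-step target: 1 + 0.99 * 5
```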
In contrast to value-based methods, policy-based model-free methods directly parameterize the policy $\pi(a|s; \theta)$ and update the parameters $\theta$ by performing, typically approximate, gradient ascent on $E[R_t]$.
A learned estimate of the value function is commonly used as the baseline $b_t(s_t) \approx V^{\pi}(s_t)$, which yields a much lower-variance estimate of the policy gradient. When an estimated value function is used as the baseline, the quantity $R_t - b_t$ used to scale the policy gradient can be seen as an estimate of the advantage of action $a_t$ in state $s_t$, i.e., $A(a_t, s_t) = Q(a_t, s_t) - V(s_t)$, because $R_t$ is an estimate of $Q^{\pi}(a_t, s_t)$ and $b_t$ is an estimate of $V^{\pi}(s_t)$. This approach can be viewed as an actor-critic architecture, where the policy $\pi$ is the actor and the baseline $b_t$ is the critic.
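To show where the baseline enters, here is a minimal numpy sketch of an advantage-scaled policy-gradient step for a simple linear softmax policy; the linear parameterization, names, and learning rate are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(theta, state, action, R_t, baseline, lr=0.01):
    """One actor update: grad log pi(a|s; theta) scaled by the advantage
    estimate (R_t - baseline). `theta` is a (num_actions x state_dim) matrix
    of a linear softmax policy, used purely for illustration."""
    logits = theta @ state
    probs = softmax(logits)
    # Gradient of log pi(a|s) w.r.t. theta for a linear softmax policy.
    grad_log_pi = -np.outer(probs, state)
    grad_log_pi[action] += state
    advantage = R_t - baseline                    # R_t estimates Q, baseline estimates V
    return theta + lr * advantage * grad_log_pi   # gradient *ascent* on the expected return

theta = np.zeros((4, 8))                          # 4 actions, 8 state features
state = np.random.randn(8)
theta = policy_gradient_step(theta, state, action=2, R_t=1.5, baseline=0.7)
```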
4. Asynchronous RL Framework
The paper presents multi-threaded asynchronous variants of four algorithms: one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic. The aim in designing these algorithms was to find RL algorithms that can train deep neural network policies without requiring large computational resources. The underlying RL methods are quite different (actor-critic is an on-policy policy-search method, while Q-learning is an off-policy value-based method), but two main ideas allow all four algorithms to meet this goal.
First, we use asynchronous actor-learners running as multiple CPU threads on a single machine. Keeping the learners on one machine removes the cost of communicating gradients and parameters between separate learners, and allows training with Hogwild!-style (lock-free) updates to the shared parameters.
Second, we observe that multiple actor-learners running in parallel are likely to be exploring different parts of the environment.
Moreover, we can explicitly give each actor-learner a different exploration policy to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than those of a single agent applying online updates. Hence, we do not use a replay memory and instead rely on parallel actors employing different exploration policies to perform the stabilizing role played by experience replay in DQN.
In addition to stabilizing learning, using multiple parallel actor-learners has several benefits (a minimal threading sketch follows this list):
1. Training time is reduced dramatically, roughly linearly in the number of parallel actor-learners.
2. Since we no longer rely on experience replay to stabilize learning, we can use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks.
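The overall structure might look like the following minimal sketch, assuming a toy gradient function in place of a real RL update: all threads write to one shared parameter array without locks (the Hogwild! style of update), and each thread uses its own exploration rate.

```python
import threading
import numpy as np

# Shared parameters, updated lock-free in the Hogwild! style: threads may
# occasionally overwrite each other's writes, which the approach tolerates.
shared_theta = np.zeros(8)

def toy_gradient(theta, epsilon, rng):
    """Stand-in for an RL gradient: a noisy pull of theta toward a fixed target,
    with per-thread noise controlled by epsilon (purely illustrative)."""
    target = np.ones_like(theta)
    return (theta - target) + epsilon * rng.standard_normal(theta.shape)

def worker(thread_id, num_steps=5000):
    rng = np.random.default_rng(thread_id)
    epsilon = rng.choice([0.5, 0.1, 0.01])          # each thread uses a different exploration rate
    for _ in range(num_steps):
        grad = toy_gradient(shared_theta, epsilon, rng)
        shared_theta[:] = shared_theta - 1e-3 * grad  # lock-free write to the shared array

threads = [threading.Thread(target=worker, args=(i,)) for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final parameters:", np.round(shared_theta, 2))
```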
Asynchronous one-step Q-learning:
Each thread interacts with its own copy of the environment and at each step computes a gradient of the Q-learning loss. We use a shared and slowly changing target network in computing the Q-learning loss, as was proposed in the DQN training method. We also accumulate gradients over multiple time-steps before they are applied, which is similar to using minibatches. This reduces the chances of multiple actor-learners overwriting each other's updates. Accumulating updates over several steps also provides some ability to trade off computational efficiency for data efficiency. Finally, we found that giving each thread a different exploration policy helps improve robustness. While many exploration policies are possible, we use $\epsilon$-greedy exploration.
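Putting those pieces together, here is a rough single-worker sketch of asynchronous one-step Q-learning on a toy tabular problem: a per-thread $\epsilon$, a shared slowly changing target copy, and gradients accumulated for several steps before being applied. The environment, the table standing in for a network, and the period constants are illustrative assumptions; in the asynchronous setting several such workers would run as threads sharing `theta` and `theta_target`.

```python
import numpy as np

GAMMA, LR = 0.99, 0.1
N_STATES, N_ACTIONS = 5, 2
theta = np.zeros((N_STATES, N_ACTIONS))      # shared Q "network" (a table for illustration)
theta_target = theta.copy()                  # shared, slowly changing target copy
I_target, I_async = 100, 5                   # target-sync period and gradient-apply period

def toy_env_step(s, a):
    """Tiny chain environment used only for illustration: move left/right, reward at the right end."""
    s2 = min(N_STATES - 1, max(0, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def one_step_q_worker(seed, T_max=20_000):
    global theta, theta_target
    rng = np.random.default_rng(seed)
    epsilon = rng.choice([0.5, 0.4, 0.3])    # each thread would sample a different exploration rate
    grad_acc = np.zeros_like(theta)          # gradients accumulated before being applied
    s, t = 0, 0
    while t < T_max:
        a = int(rng.integers(N_ACTIONS)) if rng.random() < epsilon else int(np.argmax(theta[s]))
        s2, r, done = toy_env_step(s, a)
        y = r if done else r + GAMMA * np.max(theta_target[s2])  # target computed with theta_target
        grad_acc[s, a] += y - theta[s, a]    # accumulate the TD error (negative gradient of the loss)
        s, t = (0 if done else s2), t + 1
        if t % I_async == 0:                 # apply accumulated gradients, like a small minibatch
            theta += LR * grad_acc
            grad_acc[:] = 0.0
        if t % I_target == 0:                # periodically refresh the slowly changing target copy
            theta_target = theta.copy()

one_step_q_worker(seed=0)
print(np.round(theta, 2))
```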
Asynchronous n-step Q-learning:
This algorithm is somewhat unusual because it operates in the forward view, by explicitly computing n-step returns, as opposed to the more common backward view used by techniques such as eligibility traces. We found the forward view easier to use when training neural networks with momentum-based methods and backpropagation. In order to compute a single update, the algorithm first selects actions using its exploration policy for up to $t_{max}$ steps or until a terminal state is reached. This process results in the agent receiving up to $t_{max}$ rewards from the environment since its last update. The algorithm then computes gradients for n-step Q-learning updates for each of the state-action pairs encountered since the last update, with each update using the longest possible n-step return: a one-step update for the last state-action pair, a two-step update for the second last, and so on.
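A small sketch of that forward-view bookkeeping (names are mine): after rolling out up to $t_{max}$ steps, walk backwards from the bootstrap value so that each visited step receives the longest n-step return available to it.

```python
import numpy as np

def forward_view_targets(rewards, bootstrap_value, gamma=0.99):
    """Given rewards r_t ... r_{t+k-1} from a rollout of k <= t_max steps and a
    bootstrap value (0 if the rollout ended in a terminal state, otherwise
    max_a Q(s_{t+k}, a)), return the n-step target for each visited step:
    the last step gets a 1-step return, the one before it a 2-step return, etc."""
    R = bootstrap_value
    targets = np.empty(len(rewards))
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R
        targets[i] = R
    return targets

# Rollout of 4 steps that did not terminate; bootstrap with max_a Q = 3.0.
print(forward_view_targets([0.0, 0.0, 1.0, 0.0], bootstrap_value=3.0))
```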
See also this blog post: https://blog.acolyer.org/2016/10/10/asynchronous-methods-for-deep-reinforcement-learning/
Asynchronous methods for deep reinforcement learning Mnih et al. ICML 2016
You know something interesting is going on when you see a scalability plot that looks like this:

That’s a superlinear speedup as we increase the number of threads, giving a 24x performance improvement with 16 threads as compared to a single thread. The result comes from the Google DeepMind team’s research on asynchronous methods for deep reinforcement learning. In fact, of the four asynchronous algorithms that Mnih et al. experimented with, the “asynchronous 1-step Q-learning” algorithm whose scalability results are plotted above is not the best overall. That honour goes to “A3C”, the Asynchronous Advantage Actor-Critic, which exhibits regular slightly sub-linear scaling as you add threads. How come it’s the best then? Because its absolute performance, as measured by how long it takes to achieve a given reference score when learning to play Atari games, is the best.
DeepMind’s DQN system is a Deep Q-Network reinforcement learning system that learned to play Atari games. DQN relied heavily on GPUs. A3C beats DQN easily, using just CPUs:
When applied to a variety of Atari 2600 domains, on many games asynchronous reinforcement learning achieves better results, in far less time than previous GPU-based algorithms, using far less resource than massively distributed approaches.
Here you can see a comparison of the learning speed of the asynchronous algorithms vs DQN, with DQN trained on a single GPU, and the asynchronous algorithms trained using 16 CPU cores on a single machine.
And when it comes to overall performance levels achieved, look how well A3C does compared to many other state of the art systems, despite significantly reduced training times.

Let’s take a step back and explore what’s going on here.
Asynchronous learning
We’re talking about reinforcement learning systems, and in particular, for the experiments conducted in this paper, reinforcement learning systems used to learn how to play Atari games (57 of them), drive a car in the TORCS car racing simulator, solve continuous motor control problems in the Mujoco physics simulator, and explore random 3D mazes in Labyrinth.
Earlier attempts to parallelize deep RL, such as massively distributed architectures, are variations on a theme of ‘do the same thing (or a very close approximation to the same thing), but in parallel’, whereas the asynchronous methods here exploit the parallel nature of multiple threads to enable a different approach altogether. DQN and other deep reinforcement learning algorithms use experience replay, capturing an agent’s data which can subsequently be batched and/or sampled over different time-steps.
Deep RL algorithms based on experience replay have achieved unprecedented success in challenging domains such as Atari 2600. However, experience replay has several drawbacks: it uses more memory and computation per real interaction; and it requires off-policy learning algorithms that can update from data generated by an older policy.
Instead of experience replay, one of the key insights in this paper is that you can achieve many of the same objectives of experience replay by playing many instances of the game in parallel.
… we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.
This explains the superlinear speed-up in training time required to reach a given level of skill: the more games are being explored in parallel, the better the training input to the network.
I really like this idea that the very nature of doing things in parallel opens up the possibility to use a fundamentally different approach. I don’t think that insight would naturally occur to me, and it makes me wonder if there are other scenarios where it might also apply.
The algorithms
In reinforcement learning an agent interacts with an environment by taking actions and receiving a reward. At each time step the agent receives the state of the world and a reward score from the previous time step, and selects an action from some universe of possible actions. An action value function, typically represented as Q, determines the expected reward for choosing a given action in a given state when following some policy π. There are two broad approaches to learning: value-based and policy-based.
In value-based model-free reinforcement learning methods the action value function is represented using a function approximation, such as a neural network…. In contrast to value-based methods, policy-based model-free methods directly parameterize the policy π(a|s;θ) and update the parameters θ by performing, typically approximate, gradient ascent on E[R_t].
(a represents an action, s the state).
Because the parallel approach no longer relies on experience replay, it becomes possible to use ‘on-policy’ reinforcement learning methods such as Sarsa and actor-critic. The authors create asynchronous variants of one-step Q-learning, one-step Sarsa, n-step Q-learning, and advantage actor-critic. Since the asynchronous advantage actor-critic (A3C) algorithm appears to dominate all the others, I’ll just concentrate on that one.
A3C uses a ‘forward-view’ and n-step updates. Forward view means that the algorithm selects actions using its exploration policy for up to t_max steps in the future. The agent will then receive up to t_max rewards from the environment since its last update. The policy and value functions are then updated for each state-action pair and associated reward over the t_max steps. For each update, the algorithm uses “the longest possible n-step return.” In other words, each update includes all steps from the state-action pair being updated through to the end of the rollout: a one-step update for the last state-action, reward pair, a 2-step update for the second last, a 3-step update for the third last, and so on.
Here’s the pseudo-code for the algorithm, taken from the supplementary materials:

(V is a function that determines the value of some state s under policy π.)
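The original listing from the supplementary material is not reproduced here; the following is a rough Python paraphrase of a single A3C worker on a toy problem, showing the forward-view rollout, the longest-possible n-step returns, and the advantage-weighted actor and critic updates. The tabular policy and value, the toy environment, and all constants are illustrative stand-ins for the shared network parameters θ and θv, not the paper's code.

```python
import numpy as np

GAMMA, LR, T_MAX = 0.99, 0.05, 5
N_STATES, N_ACTIONS = 5, 2
# Shared parameters: a softmax policy table and a state-value table
# (tables stand in for the shared network weights theta and theta_v).
policy_logits = np.zeros((N_STATES, N_ACTIONS))
values = np.zeros(N_STATES)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def env_step(s, a):
    """Toy chain environment for illustration: move left/right, reward at the right end."""
    s2 = min(N_STATES - 1, max(0, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def a3c_worker(seed, total_steps=20_000):
    rng = np.random.default_rng(seed)
    s, t = 0, 0
    while t < total_steps:
        # Forward view: roll out up to T_MAX steps with the current policy.
        rollout, done = [], False
        for _ in range(T_MAX):
            probs = softmax(policy_logits[s])
            a = int(rng.choice(N_ACTIONS, p=probs))
            s2, r, done = env_step(s, a)
            rollout.append((s, a, r))
            s, t = s2, t + 1
            if done:
                break
        # Bootstrap, then walk backwards so each step gets its longest n-step return.
        R = 0.0 if done else values[s]
        for (si, ai, ri) in reversed(rollout):
            R = ri + GAMMA * R
            advantage = R - values[si]
            grad_logits = -softmax(policy_logits[si])
            grad_logits[ai] += 1.0
            policy_logits[si] += LR * advantage * grad_logits  # actor: ascend advantage-weighted log-prob
            values[si] += LR * advantage                       # critic: move V toward the n-step return
        if done:
            s = 0

a3c_worker(seed=0)
print("learned state values:", np.round(values, 2))
```

In the full asynchronous setting, the per-step gradients would be accumulated and applied asynchronously to the shared parameters by many such workers; here the accumulate-and-apply step is collapsed into an immediate update for brevity.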
We typically use a convolutional neural network that has one softmax output for the policy π(a_t|s_t; θ) and one linear output for the value function V(s_t; θ_v), with all non-output layers shared.
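As a sketch of that architecture, here is a PyTorch-style module with a shared convolutional trunk, a softmax policy head, and a linear value head; the layer sizes follow a common DQN-style configuration for 84x84 inputs and should be read as assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACNet(nn.Module):
    """Shared convolutional trunk with two heads:
    a softmax policy head pi(a|s) and a linear value head V(s)."""
    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        # Conv stack sizes follow a common DQN-style configuration (an assumption here).
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc = nn.Linear(32 * 9 * 9, 256)             # for 84x84 inputs
        self.policy_head = nn.Linear(256, num_actions)   # softmax output for pi
        self.value_head = nn.Linear(256, 1)              # single linear output for V

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        return F.softmax(self.policy_head(x), dim=-1), self.value_head(x)

# Example: a batch of two 4-frame 84x84 observations.
probs, value = ACNet(num_actions=6)(torch.zeros(2, 4, 84, 84))
print(probs.shape, value.shape)   # torch.Size([2, 6]) torch.Size([2, 1])
```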
Experiments
If you watch some of the videos I linked earlier you can see how well A3C learns to perform a variety of tasks. Playing Atari games has been well covered before. The TORCS car racing simulator is more challenging:
TORCS not only has more realistic graphics than Atari 2600 games, but also requires the agent to learn the dynamics of the car it is controlling…. A3C reached between roughly 75% and 90% of the score obtained by a human tester on all four game configurations in about 12 hours of training.
The Mujoco physics engine simulations required a reinforcement learning approach adapted to continuous actions, which A3C was able to do. It was tested on a number of manipulation and locomotion tasks, and found good solutions in less than 24 hours, and often just a few hours.
The final experiments used A3C on a new 3D maze environment called Labyrinth:
This task is much more challenging than the TORCS driving domain because the agent is faced with a new maze in each episode and must learn a general strategy for exploring random mazes… The final average score indicates that the agent learned a reasonable strategy for exploring random 3D mazes using only a visual input.
Closing thoughts
We’ve seen a number of papers showing how various machine learning tasks can be made more efficient in terms of elapsed training time by exploiting asynchronous parallel workers, as well as more efficient algorithms. There’s another kind of efficiency that’s equally important though: data efficiency, a concept that was much discussed at the recent London Deep Learning Summit. Data efficiency refers to the amount of data that an algorithm needs to achieve a given level of performance. Breakthroughs in data efficiency could have an even bigger impact than breakthroughs in computational efficiency.
And on the topic of computers learning to play games, since Go has now fallen, when will we see a reinforcement learning system beat the (human) champions in esports games too? That would make a great theatre for a battle.