Paper Notes: Progressive Neural Networks (Google DeepMind)


 

Progressive Neural Networks 

Google DeepMind

 

  Abstract: Learning to solve complex sequences of tasks, combining transfer while avoiding catastrophic forgetting, remains a key obstacle on the path to human-level intelligence. The progressive networks approach proposed in this paper is a big step in that direction: these networks are immune to forgetting and can leverage prior knowledge via lateral connections to previously learned features. The authors run a series of experiments on reinforcement learning tasks and outperform the usual pre-training and fine-tuning baselines. Using a novel sensitivity measure, they show that transfer occurs at both the low-level sensory and the high-level control layers.

 

  Introduction

  Fine-tuning a convolutional neural network (CNN) is one form of transfer learning. Hinton first tried this kind of transfer in 2006, from a generative model to a discriminative one; in 2012 Bengio's group applied it with great success (see: Unsupervised and transfer learning challenge: a deep learning approach). The fatal flaw of this method, however, is that it does not scale to multi-task transfer: if we wish to leverage knowledge acquired over a sequence of experiences, which model should we use to initialize subsequent models? What is needed, then, is a learning method that supports transfer learning without catastrophic forgetting, and that does not require the earlier knowledge to be closely related to the new task. Moreover, although fine-tuning lets us recover expert performance in the target domain, it is a destructive process that discards the previously learned function. We could explicitly remember all previous tasks by copying each model before fine-tuning, but the problem of choosing a suitable initialization would remain.

 

  On the basis of this motivation, the paper proposes progressive networks, a novel model architecture with explicit support for transfer across sequences of tasks. Whereas fine-tuning incorporates prior knowledge only at initialization, progressive networks retain a pool of pre-trained models throughout training and learn lateral connections from them to extract useful features for the new task. By combining previously learned features in this way, progressive networks achieve a richer compositionality: prior knowledge is no longer transient, but can be integrated at every layer to form a feature hierarchy. In addition, the new capacity added alongside the pre-trained networks gives these models the flexibility both to reuse old computations and to learn new ones. As the authors show, progressive networks naturally accumulate experience and are immune to catastrophic forgetting by design, making them an ideal springboard for tackling long-standing problems of continual or lifelong learning. 

  The contributions of this paper are fourfold:

    1. Every component of a progressive network already exists in the literature, but combining them and applying them to solving complex sequences of tasks is novel;

    2. The model is evaluated extensively in complex RL domains; in the process, alternative approaches to transfer in RL (such as fine-tuning) are evaluated as well;

    3. In particular, the authors show that progressive networks achieve transfer comparable to (if not slightly better than) traditional fine-tuning, but without the destructive consequences;

    4. Finally, a novel analysis based on Fisher Information and perturbation allows a detailed look at how and where transfer occurs between tasks.

 

  2   Progressive Networks

  Continual learning is a long-standing goal of machine learning, in which agents not only learn (and remember) a series of tasks experienced in sequence, but also have the ability to transfer useful knowledge from earlier tasks to improve the speed of convergence on new ones.

  Progressive networks integrate these desiderata directly into the model architecture:     

    1. catastrophic forgetting is prevented by instantiating a new neural network (a column) for each task being solved;

    2. transfer is enabled via lateral connections to features of previously learned columns;

    3. the scalability of the approach is addressed at the end of this section. 

  

  A progressive network starts with a single column: a deep neural network having $L$ layers with hidden activations $h_i^{(1)} \in \mathbb{R}^{n_i}$, where $n_i$ is the number of units at layer $i \le L$, and with parameters $\Theta^{(1)}$ trained to convergence.

  When switching to a second task, the parameters $\Theta^{(1)}$ are frozen, and a new column with parameters $\Theta^{(2)}$ is instantiated with random initialization; layer $h_i^{(2)}$ receives input from both $h_{i-1}^{(2)}$ and $h_{i-1}^{(1)}$ via lateral connections. Extended to $K$ tasks, this connectivity can be written as:

$$h_i^{(k)} = f\Big( W_i^{(k)} h_{i-1}^{(k)} + \sum_{j<k} U_i^{(k:j)} h_{i-1}^{(j)} \Big)$$

  where $W_i^{(k)} \in \mathbb{R}^{n_i \times n_{i-1}}$ is the weight matrix of layer $i$ of column $k$, $U_i^{(k:j)}$ are the lateral connections from layer $i-1$ of column $j$ to layer $i$ of column $k$, and $h_0$ is the network input.

  $f$ is an element-wise non-linearity: $f(x) = \max(0, x)$ is used for all intermediate layers. (The original paper includes a schematic of a progressive network at this point.)
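  To make the equation above concrete, here is a minimal sketch of a progressive column in PyTorch. This is an assumption on my part: the paper describes no reference code, and names such as `Column` are purely illustrative.

```python
import torch
import torch.nn as nn

class Column(nn.Module):
    """One column of a progressive net: L fully connected layers, plus lateral
    connections U_i^{(k:j)} that read layer i-1 of every earlier, frozen column."""

    def __init__(self, sizes, n_prev_columns):
        super().__init__()
        # W_i^{(k)}: the column's own weights.
        self.w = nn.ModuleList(nn.Linear(sizes[i], sizes[i + 1])
                               for i in range(len(sizes) - 1))
        # U_i^{(k:j)}: one lateral map per (layer i >= 2, earlier column j).
        self.u = nn.ModuleList(
            nn.ModuleList(nn.Linear(sizes[i], sizes[i + 1], bias=False)
                          for _ in range(n_prev_columns))
            for i in range(1, len(sizes) - 1))

    def forward(self, x, prev_acts):
        # prev_acts[j] holds the hidden activations h_1..h_L of frozen column j.
        acts = [torch.relu(self.w[0](x))]              # h_1^{(k)} = f(W_1 h_0)
        for i in range(1, len(self.w)):
            z = self.w[i](acts[-1])                    # W_i^{(k)} h_{i-1}^{(k)}
            for j, col in enumerate(prev_acts):
                z = z + self.u[i - 1][j](col[i - 1])   # + U_i^{(k:j)} h_{i-1}^{(j)}
            # f(x) = max(0, x) on intermediate layers; output layer left linear.
            acts.append(torch.relu(z) if i < len(self.w) - 1 else z)
        return acts

# Task 1: train column 1 alone.  Task 2: freeze it and add column 2.
c1 = Column([8, 64, 64, 4], n_prev_columns=0)
c2 = Column([8, 64, 64, 4], n_prev_columns=1)
x = torch.randn(32, 8)
with torch.no_grad():                  # frozen column: no gradients flow back
    acts1 = c1(x, prev_acts=[])
out = c2(x, prev_acts=[acts1])[-1]     # only c2's parameters receive gradients
```

  Only the newest column's parameters receive gradients; the earlier columns are evaluated under `torch.no_grad()`, so their features are read but never changed.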

  

  These models serve three purposes: 

    (1) to solve $K$ independent tasks at the end of training; 

    (2) to accelerate learning via transfer when possible;

    (3) to avoid catastrophic forgetting.

 

  The standard pre-training and fine-tuning paradigm makes an implicit assumption of "overlap" between the tasks. 

  Fine-tuning is particularly effective in that setting, because the parameters only need to be adjusted slightly to adapt to the target domain, and usually only the top layer is retrained. By contrast, the authors make no assumption about the relationship between tasks, which in practice may be orthogonal or even adversarial. While the fine-tuning stage could in principle unlearn such features, this may prove difficult. Progressive networks side-step the problem by allocating a new column, with randomly initialized weights, for each new task. In contrast to the task-relevant initialization of pre-training, the columns of a progressive network are free to reuse, modify, or ignore previously learned features via the lateral connections. Since the lateral connections $U_i^{(k:j)}$ run only from columns $j < k$ into column $k$, the earlier columns are unaffected by the newly learned features in the forward pass. And because the parameters $\Theta^{(j)}$, $j < k$, are kept frozen while $\Theta^{(k)}$ is trained, the tasks do not interfere with one another, so there is no catastrophic forgetting.
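  In code, this no-interference guarantee is plain parameter freezing. A small sketch, continuing the assumed PyTorch example above (`freeze` is an illustrative helper, not an API from the paper):

```python
import torch

def freeze(column: torch.nn.Module) -> None:
    # A frozen column still computes features in the forward pass,
    # but backprop can never update (i.e. never forget) what it learned.
    for p in column.parameters():
        p.requires_grad_(False)

# Freeze column 1, then give the optimizer only column 2's parameters.
freeze(c1)
optimizer = torch.optim.Adam(c2.parameters(), lr=1e-4)
```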

 

  

 


  

 For another take on the paper, see this blog post (excerpted below): https://blog.acolyer.org/2016/10/11/progressive-neural-networks/?utm_source=tuicool&utm_medium=referral

  

Progressive neural networks 

Rusu et al, 2016

 

If you’ve seen one Atari game you’ve seen them all, or at least once you’ve seen enough of them anyway. When we (humans) learn, we don’t start from scratch with every new task or experience, instead we’re able to build on what we already know. And not just for one new task, but the accumulated knowledge across a whole series of experiences is applied to each new task. Nor do we suddenly forget everything we knew before – just because you learn to drive (for example), that doesn’t mean you suddenly become worse at playing chess. But neural networks don’t work like we do. There seem to be three basic scenarios:

 

  • Training starts with a blank slate
  • Training starts from a model that has been pre-trained in a similar domain, and the model is then specialised for the target domain (this can be a good tactic when there is lots of data in the pre-training source domain, and not so much in the target domain). In this scenario, the resulting model becomes specialised for the new target domain, but in the process may forget much of what it knew about the source domain (“catastrophic forgetting”).  This scenario is called ‘fine tuning’ by the authors.
  • Use pre-trained feature representations (e.g. word vectors) as richer features in some model.

 

The last case gets closest to knowledge transfer across domains, but can have limited applicability.

 

This paper introduces progressive networks, a novel model architecture with explicit support for transfer across sequences of tasks. While fine tuning incorporates prior knowledge only at initialization, progressive networks retain a pool of pretrained models throughout training, and learn lateral connections from these to extract useful features for the new task.

 

The progressive networks idea is actually very easy to understand (somewhat of a relief for someone like myself who is just following along as an interested outsider observing developments in the field!). Some of the key benefits include:

 

  • The ability to incorporate prior knowledge at each layer of the feature hierarchy
  • The ability to reuse old computations and learn new ones
  • Immunity to catastrophic forgetting

 

Thus they are a stepping stone towards continual / life-long learning systems.

 

Here’s how progressive networks work. Start out by training a neural network with some number L of layers to perform the initial task. Call this neural network the initial column of our progressive network:

 

 

When it comes time to learn the second task, we add an additional column and freeze the weights in the first column (thus catastrophic forgetting, or indeed any kind of forgetting, is impossible by design). The outputs of layer l in the original network become additional inputs to layer l+1 in the new column.

 

 

The new column is initialized with random weights.

 

We make no assumptions about the relationship between tasks, which may in practice be orthogonal or even adversarial. While the fine tuning stage could potentially unlearn these features, this may prove difficult. Progressive networks side-step this issue by allocating a new column for each task, whose weights are initialized randomly.

 

Suppose we now want to learn a third task. Just add a third column, and connect the outputs of layer l in all previous columns to the inputs of layer l+1 in the new column:
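With the hypothetical `Column` sketch from earlier, a third task is simply one more column whose lateral connections fan in from both predecessors:

```python
c3 = Column([8, 64, 64, 4], n_prev_columns=2)
with torch.no_grad():                       # both earlier columns stay frozen
    acts1 = c1(x, prev_acts=[])
    acts2 = c2(x, prev_acts=[acts1])
out3 = c3(x, prev_acts=[acts1, acts2])[-1]  # sums U-projections of both columns
```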

 

 

This input connection is made through an adapter which helps to improve initial conditioning and also deals with the dimensionality explosion that would happen as more and more columns are added:

 

…we replace the linear lateral connection with a single hidden layer MLP (multi-layer perceptron). Before feeding the lateral activations into the MLP, we multiply them by a learned scalar, initialized by a random small value. Its role is to adjust for the different scales of the different inputs. The hidden layer of the non-linear adapter is a projection onto an $n_l$-dimensional subspace ($n_l$ is the number of units at layer $l$).
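Written out (a reconstruction consistent with the paper's adapter equation; $\sigma$ is the non-linearity, $\alpha_{i-1}^{(<k)}$ the learned scalar, $V_i^{(k:j)}$ the projection, and $h_{i-1}^{(<k)}$ the concatenation of the activations at layer $i-1$ of all earlier columns):

$$h_i^{(k)} = \sigma\left( W_i^{(k)} h_{i-1}^{(k)} + U_i^{(k:j)} \, \sigma\left( V_i^{(k:j)} \, \alpha_{i-1}^{(<k)} h_{i-1}^{(<k)} \right) \right)$$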

 

As more tasks are added, this ensures that the number of parameters coming from the lateral connections remains in the same order.
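A minimal sketch of such an adapter (again assuming PyTorch; `Adapter` and its member names are illustrative, not from the paper):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Non-linear lateral adapter: scale the concatenated activations of all
    earlier columns by a learned scalar, project down to n_l units, apply a
    non-linearity, then map into the new column's layer."""

    def __init__(self, total_prev_units, n_l, n_out):
        super().__init__()
        self.alpha = nn.Parameter(1e-2 * torch.randn(()))  # learned scalar, small random init
        self.v = nn.Linear(total_prev_units, n_l)          # projection V onto n_l dims
        self.u = nn.Linear(n_l, n_out, bias=False)         # lateral weights U

    def forward(self, prev_acts):
        h = torch.cat(prev_acts, dim=-1)                   # h_{i-1}^{(<k)}
        return self.u(torch.relu(self.v(self.alpha * h)))

# Three earlier columns of 64 units each feed one 64-unit layer.
adapter = Adapter(total_prev_units=3 * 64, n_l=64, n_out=64)
z = adapter([torch.randn(32, 64) for _ in range(3)])
```

The lateral parameter count is dominated by the projection `v`, so it stays of the same order per layer no matter how many columns accumulate.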

 

Progressive networks in practice

 

The evaluation uses the A3C framework that we looked at yesterday. Its superior convergence speed and ability to train on CPUs made it a natural fit for the large number of sequential experiments required for the evaluation.  To see how well progressive networks performed, the authors compared both two and three-column progressive networks against four different baselines:

 

 

  • (i) A single column trained on the target task (traditional network learning from scratch)
  • (ii) A single column, using a model pre-trained on a source task, and then allowing just the final layer to be fine tuned to fit the target task
  • (iii) A single column, using a model pre-trained on a source task, and then allowing the whole model to be fine tuned to fit the target task
  • (iv) A two-column progressive network, but where the first column is simply initialized with random weights and then frozen.

 

The experiments include:

 

  • Learning to play the Atari pong game as the initial task, and then trying to learn to play a variety of synthetic variants (extra noise added to the inputs, the background colour changed, the input scaled and translated, or flipped horizontally or vertically).
  • Learning three source games (three columns, one each for Pong, RiverRaid, and Seaquest) and then seeing how easy it is to learn a new game – for a variety of randomly selected target games.
  • Playing the Labyrinth 3D maze game – each column is a level (track) in the game, and we see how the network learns new mazes using information from prior mazes.

 

For the Pong challenge, baseline 3 (fine tuning a network pre-trained on Pong prior to the synthetic change) performed the best of the baselines, with high positive transfer. The progressive network outperformed even this baseline though, with better mean and median scores.

 

As the mean is more sensitive to outliers, this suggests that progressive networks are better able to exploit transfer when transfer is possible (i.e. when source and target domains are compatible).

 

For the game transfer challenge the target games experimented with include Alien, Asterix, Boxing, Centipede, Gopher, Hero, James Bond, Krull, Robotank, Road Runner, Star Gunner, and Wizard of Wor.

 

 

Across all games, we observe that progressive nets result in positive transfer in 8 out of 12 target tasks, with only two cases of negative transfer. This compares favourably to baseline 3, which yields positive transfer in only 5 out of 12 games.

 

The more columns (the more prior games the progressive network has seen), the more progressive networks outperform baseline 3.

 

Seaquest -> Gopher (two quite different games) is an example of negative transfer:

 

 

Seaquest -> RiverRaid -> Pong -> Boxing is an example where the progressive networks yield a significant increase in transfer.

 

With the Labyrinth tests, the progressive networks once again yield more positive transfer than any of the baselines.

 

Limitations and future directions

 

Progressive networks are a stepping stone towards a full continual learning agent: they contain the necessary ingredients to learn multiple tasks, in sequence, while enabling transfer and being immune to catastrophic forgetting. A downside of the approach is the growth in number of parameters with the number of tasks. The analysis of Appendix 2 reveals that only a fraction of the new capacity is actually utilized, and that this trend increases with more columns. This suggests that growth can be addressed, e.g. by adding fewer layers or less capacity, by pruning [9], or by online compression [17] during learning. Furthermore, while progressive networks retain the ability to solve all K tasks at test time, choosing which column to use for inference requires knowledge of the task label. These issues are left as future work.

 

The other observation I would make is that freezing the prior columns certainly prevents catastrophic forgetting, but it also prevents any 'skills' a network learns on subsequent tasks from being used to improve performance on previous tasks. It would be interesting to see backwards transfer as well, and what could be done there without catastrophic forgetting.

 

