[ENAS] 2018-ICML-Efficient Neural Architecture Search via Parameter Sharing - Paper Reading Notes


ENAS

2018-ICML-Efficient Neural Architecture Search via Parameter Sharing

Source: ChenBong's blog on cnblogs (博客園)


Introduction

We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design

ENAS is proposed: a fast and computationally cheap method for automatic model design.


In ENAS, a controller discovers neural network architectures by searching for an optimal subgraph within a large computational graph.

In ENAS, a controller searches a large computational graph for an optimal subgraph.


On the CIFAR-10 dataset, ENAS finds a novel architecture that achieves 2.89% test error, which is on par with the 2.65% test error of NASNet.

On CIFAR-10: 2.89% test error, compared with 2.65% for NASNet.


On Penn Treebank, our method achieves a test perplexity of 55.8, which significantly outperforms NAS’s test perplexity of 62.4 (Zoph & Le, 2017) and which is a new state-of-the-art among Penn Treebank’s approaches that do not utilize post-training processing.

On Penn Treebank: test perplexity of 55.8, far better than NAS's 62.4, and a new state of the art among methods that do not use post-training processing.


Importantly, in all of our experiments, for which we use a single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours.

Importantly, all experiments use a single Nvidia GTX 1080Ti, and the architecture search takes less than 16 hours.


Due to its efficiency, we name our method Efficient Neural Architecture Search (ENAS).

The method is named ENAS because of this efficiency.


Motivation

In NAS, an RNN controller is trained in a loop: the controller first samples a candidate architecture, i.e. a child model, and then trains it to convergence to measure its performance on the task of desire.

The controller then uses the performance as a guiding signal to find more promising architectures.

In NAS, the RNN controller is trained in a loop: 1) sample a child network; 2) train the child to convergence and measure its performance on the target task (its held-out accuracy); 3) the controller uses this performance as a signal to generate better architectures.


We observe that the computational bottleneck of NAS is the training of each child model to convergence, only to measure its accuracy whilst throwing away all the trained weights.

The computational bottleneck of NAS is that every child network must be trained to convergence even though only its accuracy is needed, after which all the trained weights are thrown away.


Contribution

The main contribution of this work is to improve the efficiency of NAS by forcing all child models to share weights to eschew training each child model from scratch to convergence.

The main contribution of this paper is to improve the efficiency of NAS by forcing all child networks to share weights, which avoids training every child from scratch to convergence.


Sharing parameters among child models allows ENAS to deliver strong empirical performances, while using much fewer GPU hours than existing automatic model design approaches, and notably, 1000x less expensive than standard Neural Architecture Search.

Sharing parameters among child models lets ENAS reach strong performance with far fewer GPU hours, about 1000x fewer than standard NAS.


Method

Central to the idea of ENAS is the observation that all of the graphs which NAS ends up iterating over can be viewed as sub-graphs of a larger graph.

The core of ENAS is the observation that every graph NAS ends up iterating over can be viewed as a subgraph of one larger graph.


In other words, we can represent NAS’s search space using a single directed acyclic graph (DAG).

In other words, NAS's search space can be represented by a single directed acyclic graph (DAG).


Figure 2 illustrates a generic example DAG, where an architecture can be realized by taking a subgraph of the DAG.

Figure 2 shows an example DAG; different architectures are realized by taking subgraphs of it.

[Figure 2]


Intuitively, ENAS’s DAG is the superposition of all possible child models in a search space of NAS, where the nodes represent the local computations and the edges represent the flow of information.

ENAS's DAG is the superposition of all possible child models in the search space; nodes represent local computations and edges represent the flow of information.


The local computations at each node have their own parameters, which are used only when the particular computation is activated.

The computation at each node has its own parameters, which are used only when that node is activated.


Therefore, ENAS’s design allows parameters to be shared among all child models,

Therefore, ENAS allows parameters to be shared among all the child models.
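
To make the weight-sharing idea concrete, here is a minimal PyTorch sketch (my own illustration with assumed names such as `SharedDAG`, not the authors' released code): every (node, operation) pair in the large DAG owns one set of parameters, and a sampled child model only activates a subset of them.

```python
import torch
import torch.nn as nn

OPS = ["conv_3x3", "conv_5x5", "max_pool_3x3"]

def make_op(op: str, channels: int) -> nn.Module:
    if op == "conv_3x3":
        return nn.Conv2d(channels, channels, kernel_size=3, padding=1)
    if op == "conv_5x5":
        return nn.Conv2d(channels, channels, kernel_size=5, padding=2)
    return nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

class SharedDAG(nn.Module):
    """One parameterized module per (node, op); all child models reuse them."""
    def __init__(self, num_nodes: int, channels: int):
        super().__init__()
        self.ops = nn.ModuleDict({
            f"{node}_{op}": make_op(op, channels)
            for node in range(num_nodes) for op in OPS
        })

    def forward(self, x: torch.Tensor, child) -> torch.Tensor:
        # `child` is the controller's sampled subgraph, simplified here to a
        # linear chain of (node, op) pairs; only their shared weights are used.
        for node, op in child:
            x = self.ops[f"{node}_{op}"](x)
        return x

dag = SharedDAG(num_nodes=4, channels=16)
x = torch.randn(2, 16, 32, 32)
# Two different child models activate different subsets of the same parameters.
y1 = dag(x, [(0, "conv_3x3"), (1, "max_pool_3x3"), (2, "conv_5x5")])
y2 = dag(x, [(0, "conv_3x3"), (2, "conv_5x5"), (3, "conv_3x3")])
```

Because both child models are built from the same `nn.ModuleDict`, gradients from training either one update the shared weights, which is exactly what removes the need to retrain every child from scratch.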


In the following, we facilitate the discussion of ENAS with an example that illustrates how to design a cell for recurrent neural networks from a specified DAG and a controller (Section 2.1).

An example below shows how a recurrent cell is designed from a specified DAG and a controller (Section 2.1).


We will then explain how to train ENAS and how to derive architectures from ENAS’s controller (Section 2.2).

Then: how ENAS is trained and how architectures are derived from the controller RNN (Section 2.2).


Finally, we will explain our search space for designing convolutional architectures (Sections 2.3 and 2.4).

Finally: the search space for designing convolutional architectures (Sections 2.3 and 2.4).


Sec 2.3. Designing Convolutional Networks

Recall that in the search space of the recurrent cell, the controller RNN samples two decisions at each decision block:

  1. what previous node to connect to and

  2. what activation function to use.

In the recurrent-cell search space, the controller RNN makes two decisions at each block:

1) which previous node to connect to;

2) which activation function to use.


In the search space for convolutional models, the controller RNN also samples two sets of decisions at each decision block:

  1. what previous nodes to connect to and

  2. what computation operation to use.

In the convolutional search space, the controller RNN likewise makes two decisions at each block:

1) which previous layers to connect to;

2) which operation to use.


These decisions construct a layer in the convolutional model.

Together, these two decisions define one layer of the convolutional model.


The decision of what previous nodes to connect to allows the model to form skip connections.

The connection decision is what allows the model to form skip connections.


Specifically, at layer k, up to k−1 mutually distinct previous indices are sampled, leading to \(2^{k-1}\) possible decisions at layer k.

At layer k there are k−1 earlier layers to choose from, so there are \(2^{k-1}\) possible connection patterns.

\(2^{k-1}\): each of the preceding k−1 layers is independently either connected or not.
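
As a concrete enumeration (my own example, not from the paper), at layer \(k = 4\) the \(2^{3} = 8\) possible skip-connection patterns are exactly the subsets of \(\{1, 2, 3\}\):

\[
\varnothing,\ \{1\},\ \{2\},\ \{3\},\ \{1,2\},\ \{1,3\},\ \{2,3\},\ \{1,2,3\}.
\]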


We provide an illustrative example of sampling a convolutional network in Figure 3.

Figure 3 gives an example.

[Figure 3]

In this example, at layer k = 4, the controller samples previous indices {1, 3}, so the outputs of layers 1 and 3 are concatenated along their depth dimension and sent to layer 4.

At layer k = 4, the controller samples the previous indices {1, 3}, so the outputs of layers 1 and 3 are concatenated along the depth dimension and fed to layer 4.
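
In code, merging the sampled skip connections amounts to a channel-wise concatenation; a tiny PyTorch sketch with made-up shapes:

```python
import torch

# Hypothetical output shapes of layers 1 and 3: (batch, channels, H, W)
out1 = torch.randn(8, 32, 16, 16)
out3 = torch.randn(8, 64, 16, 16)

# Concatenate along the depth (channel) dimension and feed the result to layer 4.
input4 = torch.cat([out1, out3], dim=1)  # shape: (8, 96, 16, 16)
```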


Meanwhile, the decision of what computation operation to use sets a particular layer into convolution or average pooling or max pooling.

Meanwhile, the controller RNN also samples the operation to use (convolution, average pooling, or max pooling).


The 6 operations available for the controller are:

convolutions with filter sizes 3 × 3 and 5 × 5,

depthwise-separable convolutions with filter sizes 3×3 and 5×5 (Chollet, 2017), and

max pooling and average pooling of kernel size 3 × 3.

The 6 candidate operations:

3 × 3 and 5 × 5 convolutions,

3 × 3 and 5 × 5 depthwise-separable convolutions,

3 × 3 max pooling and 3 × 3 average pooling.


Making the described set of decisions for a total of L times, we can sample a network of L layers.

Repeating these decisions L times samples a network with L layers.


Since all decisions are independent, there are \(6^L × 2^{L(L−1)/2}\) networks in the search space.

\(6^L\): each of the L layers independently picks one of the 6 operations.

\(2^{L(L−1)/2}\): the skip-connection choices at the L layers multiply, \(\prod_{k=1}^{L} 2^{k-1} = 2^{\sum_{k=1}^{L}(k-1)} = 2^{L(L-1)/2}\).


In our experiments, L = 12, resulting in \(1.6 × 10^{29}\) possible networks.

In the experiments, L = 12, giving about \(1.6 × 10^{29}\) possible networks.
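
The whole macro sampling procedure fits in a few lines; the sketch below is my own simplification (hypothetical function names, uniform random choices in place of the learned controller), and it also checks the search-space size quoted above.

```python
import random

OPS = ["conv_3x3", "conv_5x5", "sep_conv_3x3", "sep_conv_5x5",
       "max_pool_3x3", "avg_pool_3x3"]

def sample_macro_architecture(num_layers: int = 12, seed: int = 0):
    """For each layer: one of 6 operations plus any subset of earlier layers
    as skip connections (2^(k-1) patterns at layer k)."""
    rng = random.Random(seed)
    arch = []
    for k in range(1, num_layers + 1):
        op = rng.choice(OPS)
        skips = [j for j in range(1, k) if rng.random() < 0.5]
        arch.append({"layer": k, "op": op, "skip_from": skips})
    return arch

def macro_search_space_size(num_layers: int = 12) -> int:
    # 6 operations per layer and 2^(k-1) skip patterns at layer k:
    # 6^L * 2^(L(L-1)/2)
    return 6 ** num_layers * 2 ** (num_layers * (num_layers - 1) // 2)

print(sample_macro_architecture())
print(f"{macro_search_space_size(12):.2e}")  # ~1.6e+29
```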


Sec 2.4. Designing Convolutional Cells

Rather than designing the entire convolutional network, one can design smaller modules and then connect them together to form a network.

Instead of searching the entire network directly, ENAS can search for a cell and then stack copies of it.


Figure 4 illustrates this design, where the convolutional cell and reduction cell architectures are to be designed.

Figure 4 illustrates how the convolutional cell and the reduction cell are arranged.

[Figure 4]

We utilize the ENAS computational DAG with B nodes to represent the computations that happen locally in a cell.

The ENAS computational DAG with B nodes represents the computation inside one cell.


In this DAG, node 1 and node 2 are treated as the cell’s inputs, which are the outputs of the two previous cells in the final network

In this DAG, node 1 and node 2 of the current cell are its inputs, namely the outputs of the two previous cells.


For each of the remaining B − 2 nodes, we ask the controller RNN to make two sets of decisions:

  1. two previous nodes to be used as inputs to the current node

  2. two operations to apply to the two sampled nodes.

For each of the remaining B−2 nodes (each node has 2 inputs and 1 output), the controller RNN makes two decisions:

1) which two previous nodes are used as inputs to the current node;

2) which operation is applied to each of the two sampled inputs.


The 5 available operations are:

identity,

separable convolution with kernel size 3×3 and 5×5, and

average pooling and max pooling with kernel size 3×3.

The 5 candidate operations:

identity,

3×3 and 5×5 separable convolutions,

3×3 average pooling and 3×3 max pooling.


At each node, after the previous nodes and their corresponding operations are sampled, the operations are applied on the previous nodes, and their results are added.

At each node, the sampled operations are applied to its two inputs and the results are added.


As before, we illustrate the mechanism of our search space with an example, here with B = 4 nodes (refer to Figure 5).

Figure 5 illustrates the cell-sampling process with B = 4 nodes per cell.

[Figure 5]

Details are as follows.

  1. Nodes 1, 2 are input nodes, so no decisions are needed for them. Let h1, h2 be the outputs of these nodes.
  2. At node 3: the controller samples two previous nodes and two operations. In Figure 5 Top Left, it samples node 2, node 2, separable conv 5x5, and identity. This means that h3 = sep conv 5x5(h2) + id(h2).
  3. At node 4: the controller samples node 3, node 1, avg pool 3x3, and sep conv 3x3. This means that h4 = avg pool 3x3(h3) + sep conv 3x3(h1).
  4. Since all nodes but h4 were used as inputs to at least another node, the only loose end, h4, is treated as the cell’s output. If there are multiple loose ends, they will be concatenated along the depth dimension to form the cell’s output.
  1. Nodes 1 and 2 are input nodes, so no decisions are needed for them; let h1 and h2 denote their outputs.
  2. Node 3: the controller picks two of the earlier nodes as inputs, here node 2 twice, and two operations, sep conv 5x5 and identity, so h3 = sep conv 5x5(h2) + id(h2).
  3. Node 4: the controller picks nodes 3 and 1 as inputs and the operations avg pool 3x3 and sep conv 3x3, so h4 = avg pool 3x3(h3) + sep conv 3x3(h1).
  4. Every node except h4 is used as an input to at least one other node, so h4, the only loose end, becomes the cell's output; if there were several loose ends, their outputs would be concatenated along the depth dimension to form the cell's output (see the code sketch below).
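
The micro (cell) sampling described in the list above can be sketched the same way; the snippet below is my own simplification (hypothetical names, uniform random choices in place of the learned controller), and it also reports which nodes end up as loose ends.

```python
import random

CELL_OPS = ["identity", "sep_conv_3x3", "sep_conv_5x5",
            "avg_pool_3x3", "max_pool_3x3"]

def sample_cell(num_nodes: int = 4, seed: int = 0):
    """Nodes 1 and 2 are the cell inputs; every later node picks two previous
    nodes (possibly the same one twice) and one operation per picked node."""
    rng = random.Random(seed)
    decisions = {}
    for i in range(3, num_nodes + 1):
        inputs = [rng.randint(1, i - 1), rng.randint(1, i - 1)]
        ops = [rng.choice(CELL_OPS), rng.choice(CELL_OPS)]
        decisions[i] = list(zip(inputs, ops))
    # Loose ends: nodes never consumed by a later node; their outputs are
    # concatenated along the depth dimension to form the cell output.
    used = {src for pairs in decisions.values() for src, _ in pairs}
    loose_ends = [i for i in range(1, num_nodes + 1) if i not in used]
    return decisions, loose_ends

decisions, outputs = sample_cell(num_nodes=4)
for node, pairs in decisions.items():
    # Each node's output is the sum of its two transformed inputs.
    print(f"h{node} = " + " + ".join(f"{op}(h{src})" for src, op in pairs))
print("cell output = concat of", [f"h{i}" for i in outputs])
```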

A reduction cell can also be realized from the search space we discussed,

  1. sampling a computational graph from the search space, and

  2. applying all operations with a stride of 2.

A reduction cell is obtained from the same search space by:

1) sampling a computational graph from the search space, and

2) applying every operation with stride 2.


A reduction cell thus reduces the spatial dimensions of its input by a factor of 2.

A reduction cell therefore halves the spatial dimensions (height and width) of its input feature map.


Following Zoph et al. (2018), we sample the reduction cell conditioned on the convolutional cell, hence making the controller RNN run for a total of 2(B − 2) blocks.

The controller RNN therefore runs for 2(B−2) blocks in total.

2(B−2): two kinds of cells, each with B nodes of which only B−2 require decisions; each such node needs 4 sampled values (2 inputs and 2 operations).


Finally, we estimate the complexity of this search space.

Finally, the complexity of this search space is estimated.


At node i (3 ≤ i ≤ B), the controller can select any two nodes from the i − 1 previous nodes, and any two operations from 5 operations.

At node i (3 ≤ i ≤ B), the controller RNN samples two of the outputs of the previous i−1 nodes as inputs, and samples one of the 5 operations for each of the two inputs.


As all decisions are independent, there are \((5 × (B − 2)!)^2\) possible cells.

In total there are \((5 × (B − 2)!)^2\) possible cells.


Since we independently sample for a convolutional cell and a reduction cell, the final size of the search space is \((5 × (B − 2)!)^4\) .

Since the convolutional cell and the reduction cell are sampled independently, the final search space contains \((5 × (B − 2)!)^4\) possible cell pairs.


With B = 7 as in our experiments, the search space can realize \(1.3 × 10^{11}\) final networks, making it significantly smaller than the search space for entire convolutional networks (Section 2.3).

With B = 7, i.e. 7 nodes per cell, there are about \(1.3 × 10^{11}\) possible networks, far fewer than in the macro search space of Section 2.3.
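
As a quick arithmetic check of the two search-space sizes (my own calculation, consistent with the numbers quoted from the paper):

\[
\big(5 \times (B-2)!\big)^{4}\Big|_{B=7} = (5 \times 5!)^{4} = 600^{4} \approx 1.3 \times 10^{11},
\qquad
6^{L} \cdot 2^{L(L-1)/2}\Big|_{L=12} = 6^{12} \cdot 2^{66} \approx 1.6 \times 10^{29}.
\]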


Experiments

Sec 3.2. Image Classification on CIFAR-10

We apply ENAS to two search spaces:

  1. the macro search space over entire convolutional models (Section 2.3); and

  2. the micro search space over convolutional cells (Section 2.4).

ENAS is applied to two search spaces:

1) the macro search space over entire convolutional networks (Section 2.3);

2) the micro search space over convolutional cells (Section 2.4).


Table 2 summarizes the test errors of ENAS and other approaches.

Table 2 compares ENAS with other approaches:

[Table 2]


the first block presents the results of DenseNet (Huang et al., 2016), one of the highest-performing architectures that are designed by human experts.

The first block of Table 2 shows DenseNet, one of the best-performing hand-designed architectures.


The second block of Table 2 presents the performances of approaches that attempt to design an entire convolutional network, along with the number of GPUs and the time these methods take to discover their final models.

The second block of Table 2 compares ENAS with other methods that search for an entire network, together with the number of GPUs and the search time each method needs.


As shown, ENAS finds a network architecture, which we visualize in Figure 7, and which achieves 4.23% test error.

[Figure 7]

The entire network found by ENAS is shown in Figure 7 and reaches 4.23% test error.


If we keep the architecture, but increase the number of filters in the network’s highest layer to 512, then the test error decreases to 3.87%, which is not far away from NAS’s best model, whose test error is 3.65%.

Keeping the architecture but increasing the number of filters in the network's highest layer to 512 lowers the test error to 3.87%, not far from the 3.65% of NAS's best model.


Impressively, ENAS takes about 7 hours to find this architecture, reducing the number of GPU-hours by more than 50,000x compared to NAS.

Importantly, ENAS took only about 7 hours on one GPU to find this architecture, more than 50,000x fewer GPU hours than NAS.


The third block of Table 2 presents the performances of approaches that attempt to design one or more modules and then connect them together to form the final networks.

The third block of Table 2 shows the results of methods that search for cells and then stack them into the final network.


ENAS takes 11.5 hours to discover the convolution cell and the reduction cell, which are visualized in Figure 8.

ENAS took 11.5 hours to find the convolutional cell and the reduction cell shown in Figure 8.

[Figure 8]

With the convolutional cell replicated for N = 6 times (c.f. Figure 4), ENAS achieves 3.54% test error, on par with the 3.41% error of NASNet-A (Zoph et al., 2018). With CutOut (DeVries & Taylor, 2017), ENAS’s error decreases to 2.89%, compared to 2.65% by NASNet-A.

With the convolutional cell repeated N = 6 times (cf. Figure 4), ENAS reaches 3.54% test error, on par with NASNet-A's 3.41%; with CutOut the error drops to 2.89%, versus 2.65% for NASNet-A.


In addition to ENAS’s strong performance, we also find that the models found by ENAS are, in a sense, the local minimums in their search spaces.

The architectures found by ENAS are, in a sense, local minima of their search spaces.


In particular, in the model that ENAS finds from the macro search space, if we replace all separable convolutions with normal convolutions, and then adjust the model size so that the number of parameters stays the same, then the test error increases by 1.7%.

In the network found in the macro search space, replacing every separable convolution with a normal convolution (and adjusting the model size so the parameter count stays the same) increases the test error by 1.7%.


Similarly, if we randomly change several connections in the cells that ENAS finds in the micro search space, the test error increases by 2.1%.

Similarly, randomly changing several connections in the cells found in the micro search space increases the test error by 2.1%.


We thus believe that the controller RNN learned by ENAS is as good as the controller RNN learned by NAS, and that the performance gap between NAS and ENAS is due to the fact that we do not sample multiple architectures from our trained controller, train them, and then select the best architecture on the validation data. This extra step benefits NAS’s performance.


Sec 3.3. The Importance of ENAS

A question regarding ENAS’s importance is whether ENAS is actually capable of finding good architectures, or if it is the design of the search spaces that leads to ENAS’s strong empirical performance.

A natural question is whether the good results come from ENAS itself or from the design of the search spaces.


Comparing to Guided Random Search

Our random convolutional network reaches 5.86% test error, and our two random cells reach 6.77% on CIFAR-10, while ENAS achieves 4.23% and 3.54%, respectively.

Guided random search reaches 5.86% test error for an entire network and 6.77% for the two random cells, while ENAS reaches 4.23% and 3.54%, respectively.


Conclusion

However, NAS’s computational expense prevents it from being widely adopted.

NAS's computational expense prevents it from being widely adopted.


In this paper, we presented ENAS, a novel method that speeds up NAS by more than 1000x, in terms of GPU hours.

This paper presents ENAS, a new method that speeds up NAS by more than 1000x in terms of GPU hours.


ENAS’s key contribution is the sharing of parameters across child models during the search for architectures.

ENAS's key contribution is sharing parameters across child models during the architecture search.


This insight is implemented by searching for a subgraph within a larger graph that incorporates architectures in a search space.

This is implemented by searching for a subgraph within a larger graph that encodes all architectures in the search space.


We showed that ENAS works well on both CIFAR-10 and Penn Treebank datasets.



