Paper notes: Hybrid computing using a neural network with dynamic external memory

Posted 2016-10-14 13:15. Tags: paper reading / Nature paper / deep learning / Dynamic Memory Network

Hybrid computing using a neural network with dynamic external memory

Nature 2016

updated on 2018-07-21 15:30:31

Paper: http://www.nature.com/nature/journal/vaop/ncurrent/pdf/nature20101.pdf

Code: https://github.com/deepmind/dnc

Slides: http://people.idsia.ch/~rupesh/rnnsymposium2016/slides/graves.pdf

Blog:

1. Official blog: https://deepmind.com/research/dnc/

2. others:

Applications on CV tasks (for example, 3 CVPR-2018 papers):

1. VQA: http://openaccess.thecvf.com/content_cvpr_2018/papers/Ma_Visual_Question_Answering_CVPR_2018_paper.pdf

2. One-Shot Image Recognition: http://openaccess.thecvf.com/content_cvpr_2018/papers/Cai_Memory_Matching_Networks_CVPR_2018_paper.pdf

3. Video Caption: http://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_M3_Multimodal_Memory_CVPR_2018_paper.pdf

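Abstract: Artificial neural networks have been remarkably successful at sensory processing, sequence learning, and reinforcement learning, but they are limited in their ability to represent variables and data structures and to store knowledge over long timescales, because they lack an external memory. Here we introduce a machine learning model called a differentiable neural computer (DNC), which consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Like a conventional computer, it can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data. When trained with supervised learning, a DNC can successfully answer synthetic questions designed to emulate reasoning and inference problems in natural language. It can also learn tasks such as finding the shortest path between specified points and inferring the missing links in randomly generated graphs, and then generalize them to specific graphs such as transport networks and family trees. When trained with reinforcement learning, a DNC can solve a block-moving puzzle. Taken together, the results show that DNCs can solve complex, structured tasks that neural networks without external read-write memory cannot.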

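Introduction:

Recent breakthroughs show that neural networks are remarkably adaptable in signal processing, sequence learning, and reinforcement learning. However, cognitive scientists and neuroscientists have argued that neural networks are limited in their ability to represent variables and data structures, and to store data over long timescales without interference. We aim to combine the advantages of neural and computational processing by providing a neural network with read-write access to external memory. The whole system is differentiable, so it can be trained end to end with gradient descent, allowing the network to learn how to operate and organize the memory in a goal-directed manner.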

System Overview:

A DNC is a neural network coupled to an external memory matrix.

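If the memory can be thought of as the DNC's RAM, then the network, referred to as the controller, is a differentiable CPU whose operations are learned with gradient descent. The DNC architecture differs from recent neural memory frameworks in that the memory can be selectively written to as well as read, allowing iterative modification of memory content.

Whereas a conventional computer uses unique addresses to access memory content, a DNC uses differentiable attention mechanisms to define distributions over the N rows, or "locations", in the N × W memory matrix M. These distributions, which we call weightings, represent the degree to which each location is involved in a read or write operation.

The functional units that determine and apply the weightings are called read and write heads. The operation of the heads is illustrated in Figure 1 of the paper.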

Interaction between the heads and the memory

The heads use three distinct forms of differentiable attention.

The first is content lookup, in which a key vector emitted by the controller is compared to the content of each location in memory according to a similarity measure (here, cosine similarity).

The second attention mechanism records transitions between consecutively written locations in an N × N temporal link matrix L.

The third form of attention allocates memory for writing.

The design of the attention mechanisms was motivated largely by computational considerations:

Content lookup enables the formation of associative data structures;

temporal links enable sequential retrieval of input sequences;

allocation provides the write head with unused locations.

METHODS

Controller Networks.

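At every time step t the controller network N receives an input vector x_t from the dataset or environment and emits an output vector y_t that parameterizes either a predictive distribution for a target vector z (supervised learning) or an action distribution (reinforcement learning). Additionally, the controller receives a set of R read vectors from the memory matrix M_{t-1} at the previous time step, via the read heads. It then emits an interface vector that defines its interactions with the memory at the current time step. For notational convenience, we concatenate the input and read vectors into a single controller input vector $X_t = [x_t; r_{t-1}^1; \dots; r_{t-1}^R]$. Any neural network can be used for the controller, but here we use a variant of the deep LSTM architecture (as given in the paper, with l the layer index):

$$i_t^l = \sigma(W_i^l [X_t; h_{t-1}^l; h_t^{l-1}] + b_i^l)$$

$$f_t^l = \sigma(W_f^l [X_t; h_{t-1}^l; h_t^{l-1}] + b_f^l)$$

$$s_t^l = f_t^l s_{t-1}^l + i_t^l \tanh(W_s^l [X_t; h_{t-1}^l; h_t^{l-1}] + b_s^l)$$

$$o_t^l = \sigma(W_o^l [X_t; h_{t-1}^l; h_t^{l-1}] + b_o^l)$$

$$h_t^l = o_t^l \tanh(s_t^l)$$

where i, f, s, o, and h denote the input gate, forget gate, state (the usual LSTM cell), output gate, and hidden state, respectively; $\sigma$ is the logistic sigmoid, $h_t^0 = 0$ for all t, and $h_0^l = s_0^l = 0$ for all l.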

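At every time step the controller emits an output vector $v_t$ and an interface vector $\xi_t$, defined as:

$$v_t = W_y [h_t^1; \dots; h_t^L], \qquad \xi_t = W_\xi [h_t^1; \dots; h_t^L]$$

Assuming the controller network is recurrent, its outputs are functions of the complete history $(X_1, X_2, \dots, X_t)$, so the operation of the controller can be encapsulated as:

$$(v_t, \xi_t) = \mathcal{N}([X_1; \dots; X_t]; \theta)$$

where $\theta$ is the set of trainable network weights. It is possible to use a feedforward controller, in which case N is a function of X_t only; however, we use only recurrent controllers in this paper.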

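Finally, the output vector y_t is defined by adding v_t to a vector obtained by passing the concatenation of the current read vectors through the RW × Y weight matrix W_r:

$$y_t = v_t + W_r [r_t^1; \dots; r_t^R]$$

This arrangement allows the DNC to condition its output decisions on memory that has just been read. It would not be possible to pass this information back to the controller, and use it to determine v_t, without creating a cycle in the computation graph.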
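As a rough illustration of this output step, here is a minimal pure-Python sketch of y_t = v_t + W_r [r_t^1; ...; r_t^R]. The function names `matvec` and `dnc_output` are made up for this example, and W_r is stored as Y rows of length R·W so that a plain matrix-vector product applies; that storage layout is an assumption of the sketch, not something the notes specify.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dnc_output(v_t, W_r, read_vectors):
    """Compute y_t = v_t + W_r [r^1; ...; r^R].

    v_t:          controller output, length Y
    W_r:          Y rows, each of length R*W (assumed layout)
    read_vectors: R read vectors, each of length W
    """
    # Concatenate the R read vectors into one vector of length R*W.
    concat = [x for r in read_vectors for x in r]
    return [v + p for v, p in zip(v_t, matvec(W_r, concat))]


# Tiny example with Y = 2, R = 2, W = 2.
y = dnc_output(
    v_t=[1.0, 2.0],
    W_r=[[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 1.0]],
    read_vectors=[[0.5, 1.0], [2.0, 3.0]],
)
print(y)
```

The read vectors enter only at this last step, which is why the controller itself never sees the memory it has just read until the next time step.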

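Interface parameters:

Before being used to parameterize the memory interactions, the interface vector $\xi_t$ is subdivided as follows:

$$\xi_t = [k_t^{r,1}; \dots; k_t^{r,R}; \hat\beta_t^{r,1}; \dots; \hat\beta_t^{r,R}; k_t^w; \hat\beta_t^w; \hat e_t; v_t^w; \hat f_t^1; \dots; \hat f_t^R; \hat g_t^a; \hat g_t^w; \hat\pi_t^1; \dots; \hat\pi_t^R]$$

Here $v_t^w$ denotes the write vector; the superscript is added in these notes to keep it distinct from the controller output $v_t$.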

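The individual components are then processed with different functions to ensure that they lie in the correct domain. For example:

1. the logistic sigmoid function is used to constrain values to [0, 1].

2. the "oneplus" function is used to constrain values to [1, ∞), where:

　　　　oneplus(x) = 1 + log(1+e^x)

3. the softmax function is used to constrain vectors to S_N, the (N-1)-dimensional unit simplex:

$$S_N = \{\alpha \in \mathbb{R}^N \mid \alpha_i \in [0,1], \sum_{i=1}^N \alpha_i = 1\}$$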
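The three squashing functions listed above can be sketched in plain Python as follows. The numerically stable rewrites (branching on the sign of x, subtracting the maximum before exponentiating) are choices made for this sketch; they compute the same values as the textbook formulas but avoid overflow for large inputs.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, maps any real number into [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x < 0, so exp(x) cannot overflow
    return e / (1.0 + e)


def oneplus(x):
    """oneplus(x) = 1 + log(1 + e^x), maps any real number into [1, inf)."""
    # log(1 + e^x) = max(x, 0) + log1p(e^{-|x|}), which never overflows.
    return 1.0 + max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(xs):
    """Map a vector onto the unit simplex (non-negative, sums to 1)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


print(sigmoid(0.0), oneplus(0.0), softmax([1.0, 2.0, 3.0]))
```

In the DNC these would be applied to the raw interface components: sigmoid to gates and the erase vector, oneplus to key strengths, and softmax to the read modes.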

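After processing, we have the following scalars and vectors (as listed in the paper; $v_t^w$ is the write vector, with the superscript added in these notes to keep it distinct from the controller output $v_t$):

1. R read keys $k_t^{r,i} \in \mathbb{R}^W$

2. R read strengths $\beta_t^{r,i} = \mathrm{oneplus}(\hat\beta_t^{r,i}) \in [1, \infty)$

3. the write key $k_t^w \in \mathbb{R}^W$

4. the write strength $\beta_t^w = \mathrm{oneplus}(\hat\beta_t^w) \in [1, \infty)$

5. the erase vector $e_t = \sigma(\hat e_t) \in [0,1]^W$

6. the write vector $v_t^w \in \mathbb{R}^W$

7. R free gates $f_t^i = \sigma(\hat f_t^i) \in [0,1]$

8. the allocation gate $g_t^a = \sigma(\hat g_t^a) \in [0,1]$

9. the write gate $g_t^w = \sigma(\hat g_t^w) \in [0,1]$

10. R read modes $\pi_t^i = \mathrm{softmax}(\hat\pi_t^i) \in S_3$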

Reading and Writing to memory

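Locations for reading and writing are selected by weightings: vectors of non-negative numbers whose elements sum to at most 1. The complete set of allowed weightings over N locations is the non-negative orthant of R^N with the unit simplex as a boundary (known as the "corner of the cube"):

$$\Delta_N = \{\alpha \in \mathbb{R}^N \mid \alpha_i \in [0,1], \sum_{i=1}^N \alpha_i \le 1\}$$

For the read operation, R read weightings $w_t^{r,1}, \dots, w_t^{r,R} \in \Delta_N$ are used to compute weighted averages of the memory contents, defining the read vectors as:

$$r_t^i = M_t^\top w_t^{r,i}$$

The read vectors are appended to the controller input at the next time step, giving it access to the memory content.

The write operation is mediated by a single write weighting $w_t^w \in \Delta_N$, which is used together with an erase vector $e_t \in [0,1]^W$ and a write vector $v_t^w \in \mathbb{R}^W$ to modify the memory:

$$M_t = M_{t-1} \circ (E - w_t^w e_t^\top) + w_t^w (v_t^w)^\top$$

where $\circ$ denotes element-wise multiplication and E is an N × W matrix of ones.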
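To make the read and write operations concrete, here is a small pure-Python sketch, assuming the standard DNC forms r = Mᵀ w_r for reading and M ← M ∘ (E − w_w eᵀ) + w_w vᵀ for writing. The memory is represented as a list of N rows of length W, and the function names are illustrative only.

```python
def read_memory(M, w_r):
    """Read vector r = M^T w_r: a weighted average of the memory rows."""
    n_rows, width = len(M), len(M[0])
    return [sum(w_r[i] * M[i][j] for i in range(n_rows)) for j in range(width)]


def write_memory(M, w_w, erase, write_vec):
    """Return the new memory M * (E - w_w e^T) + w_w v^T, element-wise.

    Row i is first scaled down by w_w[i] * erase[j] in each column j,
    then w_w[i] * write_vec[j] is added.
    """
    return [
        [M[i][j] * (1.0 - w_w[i] * erase[j]) + w_w[i] * write_vec[j]
         for j in range(len(M[0]))]
        for i in range(len(M))
    ]


# Overwrite row 0 completely (full erase, full write weight on row 0).
M = [[1.0, 2.0], [3.0, 4.0]]
M_new = write_memory(M, w_w=[1.0, 0.0], erase=[1.0, 1.0], write_vec=[5.0, 6.0])
print(M_new)
# Read with equal weight on both rows.
print(read_memory(M_new, [0.5, 0.5]))
```

With a write weighting of 0 on a row, that row is left untouched regardless of the erase and write vectors, which is what makes selective writing possible.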

Memory Addressing:

The system uses a combination of content-based addressing and dynamic memory allocation to determine where to write in memory, and a combination of content-based addressing and temporal memory linkage to determine where to read.

These mechanisms are described in turn below.

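Content-based addressing. All content lookup operations on the memory M use the following function:

$$C(M, k, \beta)[i] = \frac{\exp\{D(k, M[i,\cdot])\,\beta\}}{\sum_j \exp\{D(k, M[j,\cdot])\,\beta\}}$$

where k is a lookup key, $\beta \in [1, \infty)$ is a scalar key strength, and D is the cosine similarity $D(u, v) = \frac{u \cdot v}{|u||v|}$. The weighting $C(M, k, \beta)$ defines a normalized probability distribution over the memory locations.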
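A minimal sketch of content-based addressing, assuming the usual form: a softmax over memory rows of the cosine similarity to the key, scaled by the strength β. The small `eps` in the denominator is an addition of this sketch to avoid dividing by zero for all-zero rows; it is not part of the formula in the notes.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two vectors (eps guards zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def content_weighting(M, key, beta):
    """C(M, k, beta): softmax over rows of beta * cosine(key, row)."""
    scores = [beta * cosine_similarity(key, row) for row in M]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = content_weighting(M, key=[1.0, 0.0], beta=10.0)
print(w)
```

A larger β sharpens the distribution toward the best-matching row, while β close to 1 spreads the weighting more evenly across locations.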

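Dynamic memory allocation. To allow the controller to free and allocate memory as needed, we developed a differentiable analogue of the "free list" memory allocation scheme, in which a list of available memory locations is maintained by adding addresses to and removing addresses from a linked list. The memory usage vector at time t is $u_t \in [0,1]^N$, with $u_0 = 0$. Before writing to memory, the controller emits a set of free gates $f_t^i$, one per read head, which determine whether the most recently read locations can be freed. The memory retention vector $\psi_t \in [0,1]^N$ represents how much each location will not be freed by the free gates, and is defined as:

$$\psi_t = \prod_{i=1}^R (1 - f_t^i w_{t-1}^{r,i})$$

The usage vector can then be defined as:

$$u_t = (u_{t-1} + w_{t-1}^w - u_{t-1} \circ w_{t-1}^w) \circ \psi_t$$

Intuitively, locations are used if they have been retained by the free gates, and were either already in use or have just been written to. Every write to a location increases its usage, up to a maximum of 1, and usage can be decreased gradually through the free gates; the elements of u_t are therefore bounded in [0, 1]. Once u_t has been determined, the free list $\phi_t$ is defined by sorting the indices of the memory locations in ascending order of usage, so $\phi_t[1]$ is the index of the least used location. The allocation weighting a_t, which provides new locations for writing, is then:

$$a_t[\phi_t[j]] = (1 - u_t[\phi_t[j]]) \prod_{i=1}^{j-1} u_t[\phi_t[i]] \qquad (1)$$

If all usages are 1, then a_t = 0 and the controller can no longer allocate memory until it first frees locations that are already in use.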
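The allocation mechanism can be sketched in a few lines of pure Python. The formulas assumed here are the ones from the DNC paper: retention ψ = Π_i (1 − f^i w_r^i), usage u = (u_prev + w_w − u_prev ∘ w_w) ∘ ψ, and an allocation weighting built by walking the locations in ascending order of usage. Function names are illustrative.

```python
def retention(free_gates, prev_read_weightings):
    """psi[n] = prod_i (1 - f_i * w_r_i[n]): how much each slot is kept."""
    psi = [1.0] * len(prev_read_weightings[0])
    for f, w in zip(free_gates, prev_read_weightings):
        psi = [p * (1.0 - f * wn) for p, wn in zip(psi, w)]
    return psi


def update_usage(u_prev, w_w_prev, psi):
    """u = (u_prev + w_w - u_prev * w_w) * psi, element-wise."""
    return [(u + w - u * w) * p for u, w, p in zip(u_prev, w_w_prev, psi)]


def allocation_weighting(u):
    """Allocation weighting over locations, favouring the least used ones."""
    # Free list: indices sorted by ascending usage.
    free_list = sorted(range(len(u)), key=lambda n: u[n])
    a = [0.0] * len(u)
    running_product = 1.0  # product of usages of all less-used locations
    for idx in free_list:
        a[idx] = (1.0 - u[idx]) * running_product
        running_product *= u[idx]
    return a


print(retention([1.0], [[0.5, 0.0]]))
print(update_usage([0.0, 0.5], [1.0, 0.5], [1.0, 0.5]))
print(allocation_weighting([1.0, 0.0, 0.5]))
```

When one location is completely unused, it receives the whole allocation weighting; when every location has usage 1, the result is all zeros, matching the remark that nothing can be allocated until something is freed.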

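Write weightings: The controller can write to newly allocated locations, or to locations addressed by content, or it can choose not to write at all. First, a write content weighting $c_t^w$ is constructed using the write key $k_t^w$ and write strength $\beta_t^w$:

$$c_t^w = C(M_{t-1}, k_t^w, \beta_t^w)$$

Then $c_t^w$ is interpolated with the allocation weighting a_t defined in equation (1) to determine a write weighting:

$$w_t^w = g_t^w [g_t^a a_t + (1 - g_t^a) c_t^w]$$

where g_t^a is the allocation gate governing the interpolation and g_t^w is the write gate. If the write gate is 0, nothing is written, regardless of the other write parameters; it can therefore be used to protect the memory from unnecessary modifications.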
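The gating described here is a simple interpolation. A minimal sketch, assuming the form w_w = g_w · [g_a · a + (1 − g_a) · c_w] from the DNC paper, with an illustrative function name:

```python
def write_weighting(a_t, c_w, g_a, g_w):
    """w_w = g_w * (g_a * a_t + (1 - g_a) * c_w), element-wise.

    a_t: allocation weighting, c_w: write content weighting,
    g_a: allocation gate in [0, 1], g_w: write gate in [0, 1].
    """
    return [g_w * (g_a * a + (1.0 - g_a) * c) for a, c in zip(a_t, c_w)]


a_t = [1.0, 0.0]   # allocation points at location 0
c_w = [0.0, 1.0]   # content lookup points at location 1
print(write_weighting(a_t, c_w, g_a=1.0, g_w=1.0))  # pure allocation
print(write_weighting(a_t, c_w, g_a=0.5, g_w=1.0))  # even mix
print(write_weighting(a_t, c_w, g_a=0.5, g_w=0.0))  # write gate closed
```

Closing the write gate zeroes the weighting no matter what the allocation gate says, which is the memory-protection behaviour the notes describe.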

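Temporal memory linkage: The memory allocation system stores no information about the order in which memory locations are written to. However, such order information is often useful, for example when a sequence of instructions must be recorded and retrieved in order. We therefore use a temporal link matrix $L_t \in [0,1]^{N \times N}$ to keep track of consecutively modified memory locations.

L_t[i, j] represents the degree to which location i was the location written to after location j, and each row and column of L_t defines a weighting over locations:

$$L_t[i,\cdot] \in \Delta_N, \qquad L_t[\cdot,j] \in \Delta_N \qquad \text{for all } i, j, t$$

To define L_t, we need a precedence weighting p_t, where element p_t[i] represents the degree to which location i was the last one written to. p_t is defined by the recurrence relation:

$$p_0 = 0, \qquad p_t = \Big(1 - \sum_i w_t^w[i]\Big) p_{t-1} + w_t^w$$

Every time a location is modified, the link matrix is updated to remove old links to and from that location, and new links from the last-written location are added. We use the following recurrence relation to implement this logic:

$$L_0[i,j] = 0 \;\; \forall i,j, \qquad L_t[i,i] = 0 \;\; \forall i, \qquad L_t[i,j] = (1 - w_t^w[i] - w_t^w[j])\, L_{t-1}[i,j] + w_t^w[i]\, p_{t-1}[j]$$

Self-links are excluded (the diagonal of the link matrix is always 0), because it is unclear how to follow a transition from a location to itself. The rows and columns of L_t represent the weights of the temporal links going into and out from particular memory slots, respectively. Given L_t, the backward weighting $b_t^i$ and forward weighting $f_t^i$ for read head i (not to be confused with the free gates) are defined as:

$$b_t^i = L_t^\top w_{t-1}^{r,i}, \qquad f_t^i = L_t w_{t-1}^{r,i}$$

where $w_{t-1}^{r,i}$ is the i-th read weighting from the previous time step.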
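The temporal linkage bookkeeping can be sketched as below, assuming the recurrences from the DNC paper: p_t = (1 − Σ_i w_w[i]) p_{t−1} + w_w for the precedence weighting, L_t[i][j] = (1 − w_w[i] − w_w[j]) L_{t−1}[i][j] + w_w[i] p_{t−1}[j] with a zero diagonal for the link matrix, and f = L w_r, b = Lᵀ w_r for the forward and backward weightings. This is the dense O(N²) version, and all function names are illustrative.

```python
def update_precedence(p_prev, w_w):
    """p_t = (1 - sum(w_w)) * p_prev + w_w."""
    total = sum(w_w)
    return [(1.0 - total) * p + w for p, w in zip(p_prev, w_w)]


def update_link(L_prev, w_w, p_prev):
    """Dense temporal link update; the diagonal is kept at zero."""
    n = len(w_w)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                L[i][j] = ((1.0 - w_w[i] - w_w[j]) * L_prev[i][j]
                           + w_w[i] * p_prev[j])
    return L


def forward_backward(L, w_r_prev):
    """Return (f, b) with f = L w_r and b = L^T w_r."""
    n = len(L)
    f = [sum(L[i][j] * w_r_prev[j] for j in range(n)) for i in range(n)]
    b = [sum(L[j][i] * w_r_prev[j] for j in range(n)) for i in range(n)]
    return f, b


# Write to locations 0, 1, 2 in that order with one-hot write weightings.
n = 3
p = [0.0] * n
L = [[0.0] * n for _ in range(n)]
for loc in range(n):
    w_w = [1.0 if k == loc else 0.0 for k in range(n)]
    L = update_link(L, w_w, p)   # uses the precedence from the previous step
    p = update_precedence(p, w_w)

f, _ = forward_backward(L, [1.0, 0.0, 0.0])  # last read was at location 0
_, b = forward_backward(L, [0.0, 0.0, 1.0])  # last read was at location 2
print(f, b)
```

In this toy run, following the forward weighting from location 0 points at location 1, and following the backward weighting from location 2 also points at location 1, which is the sequential retrieval behaviour temporal links are meant to provide.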

Sparse link matrix. The link matrix is N × N and therefore requires O(N²) resources in both memory and computation to calculate exactly.

Read weighting.
