Paper Reading 3: QANet

Reposted article, 2018-05-05 09:33. Tags: self-attention / SQuAD / NLP / Google / reading comprehension / QANet / natural language processing / Deep Learning

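Reading Comprehension (RC)

Reading comprehension is a very hard task for machines. Google proposed QANet, which at the time of writing (2018-05-05) holds the No. 1 spot on the SQuAD leaderboard. Here is a brief walkthrough.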

SQuAD

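The Stanford Question Answering Dataset (SQuAD) [1] is a reading comprehension dataset with 100,000+ samples. It was built by crowdsourcing over 500+ Wikipedia articles, yielding (context, question, answer) triples. Each answer is a short span of text taken from the context.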

In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers”.

# What causes precipitation to fall?

gravity

# What is another main form of precipitation besides drizzle, rain, snow, sleet and hail?

graupel

# Where do water droplets collide with ice crystals to form precipitation?

within a cloud

[Figure: SQuAD Leaderboard]

QANet

Contribution:

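Removes the recurrent mechanism and relies on convolution and self-attention instead, which speeds up the model (training: 3x to 13x, inference: 4x to 9x).
Proposes a data augmentation technique: backtranslation with a neural machine translation (NMT) model.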

Model Structure:

Embedding layer
Embedding encoder layer
Context-query attention layer
Model encoder layer
Output layer

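Embedding Layer:

Turns natural language into vectors a computer can process, while keeping as much of the semantic information in the words as possible.

The final word representation is the concatenation of a word vector and a character vector:

Word embedding: pretrained GloVe vectors.
Character embedding: trainable.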

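Processing steps:

For the character-level part:
- Each word is truncated or padded to a fixed length of 16 characters;
- Max pooling along the rows: char_embedding = reduce_max(char_embedding, axis=row);
- Convolution.
Concatenating the character and word vectors: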

\[x_e = [x_w; x_c] \]
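
The character-level steps above and the concatenation \(x_e = [x_w; x_c]\) can be sketched in NumPy as follows. This is only a minimal illustration: the vocabulary, the toy dimensions, and the helper names (`char_ids`, `char_embedding`, `embed`) are assumptions made for the example, the random tables stand in for trained parameters and GloVe vectors, and the convolution step is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

MAX_WORD_LEN = 16   # characters per word after truncation/padding (from the text)
CHAR_DIM = 8        # toy size, assumed for illustration
WORD_DIM = 12       # toy size, assumed for illustration

# Hypothetical character vocabulary; index 0 doubles as padding and "unknown".
char_vocab = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
# Random values stand in for a table that would be trained in a real model.
char_table = rng.normal(size=(len(char_vocab) + 1, CHAR_DIM))


def char_ids(word):
    """Map a word to exactly MAX_WORD_LEN character ids (truncate or pad with 0)."""
    ids = [char_vocab.get(ch, 0) for ch in word.lower()[:MAX_WORD_LEN]]
    return ids + [0] * (MAX_WORD_LEN - len(ids))


def char_embedding(word):
    """Embed each character, then take the maximum over the character axis."""
    vectors = char_table[char_ids(word)]   # (MAX_WORD_LEN, CHAR_DIM)
    return vectors.max(axis=0)             # (CHAR_DIM,)


def embed(word, word_vector):
    """x_e = [x_w; x_c]: concatenate the word vector and the character vector."""
    return np.concatenate([word_vector, char_embedding(word)])


x_w = rng.normal(size=WORD_DIM)            # stands in for a pretrained GloVe vector
x_e = embed("precipitation", x_w)
print(x_e.shape)
```

Whatever the length of the word, the result has a fixed size of `WORD_DIM + CHAR_DIM`, which is what the later layers need.
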

Processing with Highway Networks [2]:

\[outputs = H(x,W_H) \cdot T(x, W_T) + x \cdot C(x,W_C) \]
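where H() is an affine transformation followed by a nonlinearity, i.e. the ordinary layer that would be applied to x, and T() and C() are the nonlinear transform gate and carry gate that make up the highway network. For simplicity one usually sets \(C = 1 - T\):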

\[outputs = H(x,W_H) \cdot T(x, W_T) + x \cdot (1 - T(x,W_T)) \]
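
A minimal NumPy sketch of one highway layer with \(C = 1 - T\) is given below. The ReLU inside H and the sigmoid gate are common choices assumed here for illustration, and the function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def highway_layer(x, W_H, b_H, W_T, b_T):
    """outputs = H(x) * T(x) + x * (1 - T(x))."""
    h = np.maximum(0.0, x @ W_H + b_H)   # H: affine map + ReLU (ReLU is an assumption)
    t = sigmoid(x @ W_T + b_T)           # T: transform gate, values in (0, 1)
    return h * t + x * (1.0 - t)


rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=(3, d))              # 3 positions, d features each
W_H = rng.normal(size=(d, d))
W_T = rng.normal(size=(d, d))

y = highway_layer(x, W_H, np.zeros(d), W_T, np.zeros(d))

# With a strongly negative gate bias, T is close to 0 and x is carried through.
y_carry = highway_layer(x, W_H, np.zeros(d), np.zeros((d, d)), np.full(d, -50.0))
print(np.allclose(y_carry, x))
```

The second call shows why the gate matters: when T is near 0 the layer behaves like an identity connection, which is what makes deep stacks of such layers easy to train.
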

Embedding Encoder layer:

Extracts the semantic information in the context and the question.

It uses convolution and self-attention, combined into an encoder block:

[(pos-encoding)+conv x # + self-attention + feed-forward]

Position encoding [3]:

Captures position information.

\[PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{model}})\\ PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{model}}) \]

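where pos is the position of the word, i indexes the embedding dimensions, and \(d_{model}\) is the embedding dimensionality.

The position encoding is added to the input, and the sum is used as the input to the next step.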
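
The sinusoidal encoding can be computed directly from the two formulas. The sketch below assumes an even \(d_{model}\); the function name `position_encoding` is chosen for the example.

```python
import numpy as np


def position_encoding(length, d_model):
    """Return a (length, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(length)[:, None]                 # (length, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions 2i
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions 2i + 1
    return pe


pe = position_encoding(length=5, d_model=8)

# The encoding is simply added to the input of the block.
x = np.ones((5, 8))
x_with_position = x + pe
```
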
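Depthwise separable convolutions [4]:

In a standard convolution, each kernel convolves over all input channels and combines them into a single result (for example by summing). A depthwise separable convolution splits this into two steps. In the first step, a kernel processes each input channel separately, without combining the channels. In the second step, a 1x1 (W x H) kernel is applied to the result of the first step to produce the final output. This improves generalization and efficiency and avoids redundant parameters.

Comparison with standard convolution:

[Standard convolution]

[Depthwise separable convolution]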
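
To make the two steps concrete, here is a small 1D sketch over a sequence, together with a parameter count for both variants. It assumes zero "same" padding, an odd kernel size, and no bias terms; the kernel size 7 and 128 channels in the count are used only as an example configuration.

```python
import numpy as np


def depthwise_separable_conv1d(x, depthwise, pointwise):
    """x: (length, c_in); depthwise: (k, c_in); pointwise: (c_in, c_out)."""
    k, _ = depthwise.shape
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))         # zero padding along the sequence
    length = x.shape[0]
    # Step 1: every channel is filtered on its own, with no mixing across channels.
    per_channel = np.stack(
        [(padded[t:t + k] * depthwise).sum(axis=0) for t in range(length)]
    )                                                # (length, c_in)
    # Step 2: a 1x1 (pointwise) convolution mixes the channels.
    return per_channel @ pointwise                   # (length, c_out)


def param_counts(k, c_in, c_out):
    """Weights needed by a standard vs. a depthwise separable 1D convolution."""
    standard = k * c_in * c_out
    separable = k * c_in + c_in * c_out
    return standard, separable


rng = np.random.default_rng(0)
out = depthwise_separable_conv1d(
    rng.normal(size=(10, 4)), rng.normal(size=(3, 4)), rng.normal(size=(4, 6))
)
print(out.shape)
print(param_counts(7, 128, 128))
```

For a kernel of size 7 with 128 input and 128 output channels, the standard convolution needs 7 x 128 x 128 weights while the separable one needs 7 x 128 + 128 x 128, which is where the saving in parameters comes from.
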

Self-attention [3]:

A sequence representation that captures global information.

\[Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d_k}})V \]

where Q: query, K: key, V: value.

\[MultiHead(Q,K,V) = Concat(head_1,\ldots,head_h)W^O\\ \text{where}\ head_i = Attention(QW_i^Q,KW_i^K,VW_i^V) \]
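
Both formulas translate almost line by line into NumPy. The sketch below handles a single sequence without masking or dropout; the per-head weight lists and the toy sizes are assumptions for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)          # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (n_q, n_k)
    return weights @ V


def multi_head(X, W_Q, W_K, W_V, W_O):
    """Self-attention (Q = K = V = X) with one projection triple per head."""
    heads = [
        attention(X @ wq, X @ wk, X @ wv)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O


rng = np.random.default_rng(0)
n, d_model, h = 5, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head(X, W_Q, W_K, W_V, W_O)
print(out.shape)
```

Every output position is a weighted mix of all positions in the sequence, which is how the layer picks up global information in a single step.
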

Inside the encoder block, every sublayer is wrapped in a residual block:

\[Output = f(layernorm(x)) + x \]
where f is a sublayer of the encoder block, such as the depthwise convolution, self-attention, or feed-forward layer, and layernorm() denotes layer normalization [5].
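
The wrapper \(f(layernorm(x)) + x\) is short enough to write out in full. In this sketch the learned gain and bias of layer normalization are omitted, and `residual` is a name chosen for the example.

```python
import numpy as np


def layernorm(x, eps=1e-6):
    """Normalize each position over its feature dimension (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def residual(f, x):
    """Output = f(layernorm(x)) + x."""
    return f(layernorm(x)) + x


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W = rng.normal(size=(8, 8))

# Any shape-preserving sublayer can be plugged in as f; a linear map + ReLU here.
y = residual(lambda h: np.maximum(0.0, h @ W), x)
print(y.shape)
```

Because x is added back unchanged, a sublayer that outputs zeros leaves the input intact, so stacking many such blocks does not block the flow of information.
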

Context-Query Attention Layer

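Finds the connections between the context and the query and, at the word level, picks out the key words in both.

At the word level, the relationship between context and query is captured by a similarity matrix S (n x m) [6]: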

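\[S_{i,j} = f(q_j, c_i) = W_0[q_j, c_i, q_j\odot c_i] \]
where \(\odot\) denotes element-wise multiplication, \(c_i\) is the i-th context word, \(q_j\) is the j-th query word, n is the context length, and m is the query length.
Context-to-query attention A: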

\[A = softmax(S, axis=row) \cdot Q^T \quad \in R^{n\times d} \]
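where d is the embedding dimension.
Query-to-context attention B:

\[B = softmax(S, axis=row) \cdot softmax(S,axis=column)^T \cdot C^T \quad \in R^{n\times d} \]
Here softmax(S, axis=row) normalizes each row of S and softmax(S, axis=column) normalizes each column.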
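
The whole layer fits in a few lines of NumPy. In this sketch C and Q are stored row-wise, as (n, d) and (m, d) arrays, so they play the role of \(C^T\) and \(Q^T\) in the formulas. B is computed as the row-normalized S times the transposed column-normalized S times C, which is the form given in the QANet paper and the only one whose shapes line up. The function names are assumptions for the example.

```python
import numpy as np


def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def similarity(C, Q, w0):
    """S[i, j] = w0 . [q_j, c_i, q_j * c_i]; C: (n, d), Q: (m, d), w0: (3d,)."""
    n, d = C.shape
    m = Q.shape[0]
    q = np.broadcast_to(Q[None, :, :], (n, m, d))
    c = np.broadcast_to(C[:, None, :], (n, m, d))
    return np.concatenate([q, c, q * c], axis=-1) @ w0    # (n, m)


def context_query_attention(C, Q, w0):
    S = similarity(C, Q, w0)
    S_row = softmax(S, axis=1)     # each row sums to 1 (over query words)
    S_col = softmax(S, axis=0)     # each column sums to 1 (over context words)
    A = S_row @ Q                  # context-to-query attention, (n, d)
    B = S_row @ S_col.T @ C        # query-to-context attention, (n, d)
    return A, B


rng = np.random.default_rng(0)
n, m, d = 6, 3, 4
C = rng.normal(size=(n, d))
Q = rng.normal(size=(m, d))
w0 = rng.normal(size=3 * d)

A, B = context_query_attention(C, Q, w0)
print(A.shape, B.shape)
```

Each row of A is a weighted average of query vectors, and each row of B is a weighted average of context vectors, so both line up with the context positions.
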

Model Encoder layer

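Looks at the relationship between context and query from a global perspective.

It uses 3 stacked blocks (with shared weights), where each stacked block consists of 7 encoder blocks. Input: \([c,a,c\odot a, c\odot b]\), where a and b are rows of A and B. The three stacked blocks output \(M_0, M_1,M_2\) respectively.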

Output layer

Predicts the position of the answer in the context (start position, end position):

\[pos^{start} = softmax(W_{start} [M_0; M_1]),\quad pos^{end} = softmax(W_{end}[M_0; M_2]) \]
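
As a rough sketch, the two distributions can be computed as below. Each \(M_k\) is taken to be an (n, d) array with one row per context position, and \(W_{start}\), \(W_{end}\) are modelled as weight vectors of length 2d; these shapes and the function name are assumptions for the example.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def span_probabilities(M0, M1, M2, w_start, w_end):
    """Probability of each context position being the start / end of the answer."""
    p_start = softmax(np.concatenate([M0, M1], axis=-1) @ w_start)   # (n,)
    p_end = softmax(np.concatenate([M0, M2], axis=-1) @ w_end)       # (n,)
    return p_start, p_end


rng = np.random.default_rng(0)
n, d = 6, 4
M0, M1, M2 = (rng.normal(size=(n, d)) for _ in range(3))
w_start = rng.normal(size=2 * d)
w_end = rng.normal(size=2 * d)

p_start, p_end = span_probabilities(M0, M1, M2, w_start, w_end)
print(int(p_start.argmax()), int(p_end.argmax()))
```
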

Loss function

\[L(\theta) = - \frac{1}{N}\sum_{i}^N\left[\log(p_{y_i^{start}}^{start}) + \log(p_{y_i^{end}}^{end})\right] \]

where \(y_{i}^{start}\) and \(y_i^{end}\) are the ground-truth start and end positions of the answer in the context for the i-th example.
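
The loss is the average negative log-probability that the model assigns to the true start and end positions. A plain-Python sketch, with hypothetical argument names, looks like this:

```python
import math


def span_loss(p_start, p_end, y_start, y_end):
    """Average negative log-likelihood of the true start and end positions.

    p_start, p_end: one probability list per example (over context positions).
    y_start, y_end: one gold index per example.
    """
    n = len(y_start)
    total = 0.0
    for ps, pe, ys, ye in zip(p_start, p_end, y_start, y_end):
        total += math.log(ps[ys]) + math.log(pe[ye])
    return -total / n


# One example with 4 context positions and a uniform prediction.
uniform = [0.25, 0.25, 0.25, 0.25]
loss_uniform = span_loss([uniform], [uniform], [1], [2])

# A confident, correct prediction gives a loss of zero.
one_hot_start = [0.0, 1.0, 0.0, 0.0]
one_hot_end = [0.0, 0.0, 1.0, 0.0]
loss_perfect = span_loss([one_hot_start], [one_hot_end], [1], [2])
print(loss_uniform, loss_perfect)
```

With a uniform guess over 4 positions the loss is \(2\log 4\), and it shrinks toward 0 as the model puts more probability on the correct positions.
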

Other experiment details

Optimization & Regularization:

L2 weight decay (\(\lambda = 3 \times 10^{-7}\))
stochastic depth [8] (layer dropout), applied inside every encoder (survival rate of layer l: \(p_l = 1 - \frac{l}{L}(1-p_L)\), where L is the last layer and \(p_L = 0.9\))
layer normalization
dropout (0.1 almost everywhere, 0.05 on the character embedding)
ADAM [7] (\(\beta_1 = 0.8, \beta_2 = 0.999, \epsilon = 10^{-7}\))
depth-wise separable convolutions
self-attention, multi-head, position encoding (compared with RNN)
position encoding
exponential moving average (EMA, decay rate 0.9999)
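
Two of the items above are easy to show numerically. The sketch below computes the stochastic-depth survival rate \(p_l = 1 - \frac{l}{L}(1-p_L)\), the linearly decaying form used in the stochastic depth paper [8], and applies one exponential-moving-average update. The function names and the choice L = 7 are assumptions for illustration.

```python
def survival_probability(l, L, p_L=0.9):
    """Survival rate of layer l out of L layers: p_l = 1 - (l / L) * (1 - p_L)."""
    return 1.0 - (l / L) * (1.0 - p_L)


def ema_update(shadow, value, decay=0.9999):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * value."""
    return decay * shadow + (1.0 - decay) * value


L = 7
rates = [survival_probability(l, L) for l in range(1, L + 1)]
print(rates)

# The averaged copy of a weight moves only slightly toward the new value per step.
shadow = ema_update(1.0, 0.0)
print(shadow)
```

Early layers are almost always kept and the last layer survives with probability \(p_L\). With a decay of 0.9999 the averaged weights change very slowly, which smooths out noise from individual training steps.
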

PS: Code will be attached in a few days.

References

[1] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pp. 2383–2392, 2016.
[2] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. CoRR, abs/1505.00387, 2015. URL http://arxiv.org/abs/1505.00387
[3] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017.
[4] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
[5] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[6] Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603, 2016. URL http://arxiv.org/abs/1611.01603
[7] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[8] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision (ECCV), 2016.
