Table of Contents
LSTM Network Architecture
The Core Idea Behind LSTMs
Forget Gate
Input Gate
Output Gate
How Do LSTMs Solve the Long-Range Dependency Problem?
What Is a Peephole?
Multi-Layer LSTM
References
Long Short-Term Memory networks, usually just called LSTMs, are a special kind of RNN capable of learning long-term dependencies. They were introduced by Hochreiter & Schmidhuber in 1997 and were refined and popularized by many people in later work. They work remarkably well on a wide variety of problems and are now widely used. LSTMs are explicitly designed to avoid the long-term dependency problem: remembering information for long periods of time is practically their default behavior, not something they struggle to learn.
LSTM Network Architecture
The Core Idea Behind LSTMs
The key to LSTMs is the cell state, the horizontal line running across the top of the diagram. The cell state is a bit like a conveyor belt: it runs straight down the entire chain with only a few minor linear interactions, so it is very easy for information to flow along it unchanged.
The LSTM does have the ability to remove information from or add information to the cell state, carefully regulated by structures called gates. Gates are a way of letting information through selectively. An LSTM has three gates to protect and control the cell state.
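Written out (following the standard formulation, e.g. the Colah post listed in the references), the cell-state update combines the forget gate $f_t$ and the input gate $i_t$ introduced in the next two sections:

$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$

where $\odot$ is element-wise multiplication and $\tilde{C}_t$ is the candidate content proposed at the current step.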
Forget Gate
The forget gate outputs a vector whose entries lie between 0 and 1, which is multiplied pointwise with the memory cell C; this can be read as the model forgetting part of what it has stored.
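In symbols (standard formulation; $W_f$ and $b_f$ are the forget gate's weights and bias, $x_t$ the current input, $h_{t-1}$ the previous hidden state):

$$ f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right), \qquad C_{t-1} \mapsto f_t \odot C_{t-1} $$

An entry of $f_t$ near 0 means "drop this component of the cell state"; an entry near 1 means "keep it".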
Input Gate
Some sources also call this the update gate.
The input gate has two branches: the left branch outputs a vector of values between 0 and 1, indicating what fraction of the current step's information should be written into the memory cell C; the right branch produces the candidate information extracted at the current step.
After passing through the forget gate and the input gate, the memory cell has been updated accordingly.
Note that in an LSTM the memory cell passes only through the forget gate and the input gate; it does not pass directly through the output gate.
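In the standard equations, the left branch is the input gate $i_t$ and the right branch is the candidate content $\tilde{C}_t$; together with the forget gate they produce the new cell state:

$$ i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) $$

$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$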
Output Gate
The output gate receives information from three directions and produces output in two directions.
The three inputs are: the information at the current time step, the output of the previous time step, and the current contents of the memory cell.
The two outputs are: the prediction for the current step and the input passed to the next time step.
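In the standard formulation:

$$ o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t) $$

The same $h_t$ serves both as the basis of the current step's prediction and as the recurrent input to the next time step, which matches the two output directions described above.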
How Do LSTMs Solve the Long-Range Dependency Problem?
Compared with a simple RNN, an LSTM does not rely only on the rapidly changing hidden state to make its prediction; it also consults the information stored in the memory cell C.
For example, consider an input with a long-range dependency to be predicted:
I grew up in France… I speak fluent ().
When the LSTM reads "France", it stores the information about France at a particular position in the memory cell. As later time steps go by, this information is diluted by the forget gate's multiplication, but note that the dilution is very weak; if the memory were washed out too aggressively, the model would behave much like a simple RNN. (One might ask: how strong should this washing-out be? Answer: the LSTM learns it on its own.) When the LSTM reads "fluent", it combines this with the France information in the memory cell and predicts "French".
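A rough way to see why this dilution can stay weak: if we ignore the gates' indirect dependence on $C_{t-1}$ through $h_{t-1}$, the cell-state path contributes

$$ \frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t) $$

so information (and gradients) flowing along this path are scaled only by the learned forget-gate values, instead of being repeatedly pushed through a weight matrix and a nonlinearity as in a simple RNN. This is only an approximation, offered here for intuition.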
What Is a Peephole?
In 2000, Gers & Schmidhuber proposed some variants of the LSTM. The peephole connection, shown in the figure, lets the three gates also make use of the information in the memory cell, which makes the model more powerful.
The figure below shows the same structure as drawn by Prof. Hung-yi Lee; it is exactly the same.
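Concretely, in the peephole variant the gates also take the cell state as an input (this follows the version described in the Colah post; papers differ on which gates actually get peepholes):

$$ f_t = \sigma\left(W_f \cdot [C_{t-1}, h_{t-1}, x_t] + b_f\right) $$
$$ i_t = \sigma\left(W_i \cdot [C_{t-1}, h_{t-1}, x_t] + b_i\right) $$
$$ o_t = \sigma\left(W_o \cdot [C_t, h_{t-1}, x_t] + b_o\right) $$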
PyTorch demo

# https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html?highlight=lstm
# tensorboard --logdir=runs/lstm --host=127.0.0.1
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)


class LSTMTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores


def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('../runs/lstm')
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)
    writer.add_graph(model, inputs)
    writer.close()
    print(tag_scores)

for epoch in range(300):  # again, normally you would NOT do 300 epochs, it is toy data
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)

        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()

# See what the scores are after training
with torch.no_grad():
    inputs = prepare_sequence(training_data[0][0], word_to_ix)
    tag_scores = model(inputs)

    # The sentence is "the dog ate the apple". i,j corresponds to score for tag j
    # for word i. The predicted tag is the maximum scoring tag.
    # Here, we can see the predicted sequence below is 0 1 2 0 1
    # since 0 is index of the maximum value of row 1,
    # 1 is the index of maximum value of row 2, etc.
    # Which is DET NOUN VERB DET NOUN, the correct sequence!
    print(tag_scores)
Multi-Layer LSTM
Like a simple RNN, LSTMs can be stacked into multiple layers and can also run bidirectionally, as shown in the sketch below.
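A minimal sketch with PyTorch's nn.LSTM (the sizes below are arbitrary and only for illustration): stacking is controlled by num_layers, and bidirectional=True makes the LSTM bidirectional.

import torch
import torch.nn as nn

# 2-layer bidirectional LSTM: 10 input features, hidden size 20 per direction.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2, bidirectional=True)

x = torch.randn(5, 3, 10)  # (seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([5, 3, 40]) -> forward and backward states concatenated
print(h_n.shape)     # torch.Size([4, 3, 20]) -> num_layers * num_directions

The last dimension of output concatenates the forward and backward hidden states at every time step.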
References
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
https://www.bilibili.com/video/BV1JE411g7XF?p=20