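Translation with a seq2seq attention model in PyTorch

Reposted article, 2018-12-15 14:40
The following are my notes on French-to-English translation with a seq2seq model plus attention in PyTorch 1.0 (the code also runs on PyTorch 0.4):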
# -*- coding: utf-8 -*-
"""
Translation with a Sequence to Sequence Network and Attention
*************************************************************
**Author**: `Sean Robertson <https://github.com/spro/practical-pytorch>`_

In this project we will be teaching a neural network to translate from
French to English.

::

    [KEY: > input, = target, < output]

    > il est en train de peindre un tableau .
    = he is painting a picture .
    < he is painting a picture .

    > pourquoi ne pas essayer ce vin delicieux ?
    = why not try that delicious wine ?
    < why not try that delicious wine ?

    > elle n est pas poete mais romanciere .
    = she is not a poet but a novelist .
    < she not not a poet but a novelist .

    > vous etes trop maigre .
    = you re too skinny .
    < you re all alone .

... to varying degrees of success.

This is made possible by the simple but powerful idea of the `sequence
to sequence network <http://arxiv.org/abs/1409.3215>`__, in which two
recurrent neural networks work together to transform one sequence to
another. An encoder network condenses an input sequence into a vector,
and a decoder network unfolds that vector into a new sequence.

.. figure:: /_static/img/seq-seq-images/seq2seq.png
   :alt:

To improve upon this model we'll use an `attention
mechanism <https://arxiv.org/abs/1409.0473>`__, which lets the decoder
learn to focus over a specific range of the input sequence.

**Recommended Reading:**

I assume you have at least installed PyTorch, know Python, and
understand Tensors:

-  https://pytorch.org/ For installation instructions
-  :doc:`/beginner/deep_learning_60min_blitz` to get started with PyTorch in general
-  :doc:`/beginner/pytorch_with_examples` for a wide and deep overview
-  :doc:`/beginner/former_torchies_tutorial` if you are a former Lua Torch user


It would also be useful to know about Sequence to Sequence networks and
how they work:

-  `Learning Phrase Representations using RNN Encoder-Decoder for
   Statistical Machine Translation <http://arxiv.org/abs/1406.1078>`__
-  `Sequence to Sequence Learning with Neural
   Networks <http://arxiv.org/abs/1409.3215>`__
-  `Neural Machine Translation by Jointly Learning to Align and
   Translate <https://arxiv.org/abs/1409.0473>`__
-  `A Neural Conversational Model <http://arxiv.org/abs/1506.05869>`__

You will also find the previous tutorials on
:doc:`/intermediate/char_rnn_classification_tutorial`
and :doc:`/intermediate/char_rnn_generation_tutorial`
helpful as those concepts are very similar to the Encoder and Decoder
models, respectively.

**Requirements**
"""
from __future__ import unicode_literals, print_function, division
from io import open
import unicodedata
import string
import re
import random

import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

######################################################################
# Loading data files
# ==================
#
# The data for this project is a set of many thousands of English to
# French translation pairs.
#
# `This question on Open Data Stack
# Exchange <http://opendata.stackexchange.com/questions/3888/dataset-of-sentences-translated-into-many-languages>`__
# pointed me to the open translation site http://tatoeba.org/ which has
# downloads available at http://tatoeba.org/eng/downloads - and better
# yet, someone did the extra work of splitting language pairs into
# individual text files here: http://www.manythings.org/anki/
#
# The English to French pairs are too big to include in the repo, so
# download to ``data/eng-fra.txt`` before continuing. The file is a tab
# separated list of translation pairs:
#
# ::
#
#     I am cold.    J'ai froid.
#
# .. Note::
#    Download the data from
#    `here <https://download.pytorch.org/tutorial/data.zip>`_
#    and extract it to the current directory.

######################################################################
# Similar to the character encoding used in the character-level RNN
# tutorials, we will be representing each word in a language as a one-hot
# vector, or giant vector of zeros except for a single one (at the index
# of the word). Compared to the dozens of characters that might exist in a
# language, there are many many more words, so the encoding vector is much
# larger. We will however cheat a bit and trim the data to only use a few
# thousand words per language.
#
# .. figure:: /_static/img/seq-seq-images/word-encoding.png
#    :alt:
#
#


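To make the one-hot idea above concrete, here is a minimal sketch in plain Python. The helper `one_hot` and the four-entry toy vocabulary are made up for illustration and are not part of the tutorial. In the models below, `nn.Embedding` takes the word index directly, so the one-hot vector never has to be built explicitly.

```python
# Hypothetical helper: build a one-hot list for a given word index.
def one_hot(index, vocab_size):
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


# Toy vocabulary (illustrative only); indices 0 and 1 are reserved
# for SOS and EOS, as in the tutorial.
toy_word2index = {"SOS": 0, "EOS": 1, "chat": 2, "noir": 3}

chat_vec = one_hot(toy_word2index["chat"], len(toy_word2index))
print(chat_vec)  # [0, 0, 1, 0]
```

With a vocabulary of a few thousand words, each such vector would have a few thousand entries, which is why the index is the more practical representation.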
######################################################################
# We'll need a unique index per word to use as the inputs and targets of
# the networks later. To keep track of all this we will use a helper class
# called ``Lang`` which has word → index (``word2index``) and index → word
# (``index2word``) dictionaries, as well as a count of each word
# ``word2count`` to use to later replace rare words.
#

SOS_token = 0
EOS_token = 1


# One Lang object is created for the source language and one for the
# target language. word2count holds word frequencies and can be used to
# filter out low-frequency words, for example by mapping them to an
# "unknown" token.

class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)  # register every word of the sentence

    def addWord(self, word):
        if word not in self.word2index:  # first time we see this word
            # assign it the next free index
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1  # advance to the next free index
        else:
            self.word2count[word] += 1  # known word: increase its count


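The code in this section only fills `word2count`; it does not use it afterwards. As a sketch of how word counts could be used to replace rare words, the snippet below maps every word seen fewer than `min_count` times to a placeholder. The `UNK` token, the `min_count` threshold and the function `replace_rare` are assumptions made for illustration, not part of the original code.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder token, not defined in the tutorial


def replace_rare(sentences, min_count=2):
    """Replace words that occur fewer than min_count times with UNK."""
    counts = Counter(w for s in sentences for w in s.split(' '))
    return [' '.join(w if counts[w] >= min_count else UNK
                     for w in s.split(' '))
            for s in sentences]


toy = ["i am cold .", "i am happy .", "i am cold too ."]
filtered = replace_rare(toy)
print(filtered)  # ['i am cold .', 'i am <unk> .', 'i am cold <unk> .']
```

If something like this were applied before building the `Lang` objects, the vocabulary (and therefore the embedding and output layers) would shrink accordingly.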
######################################################################
# The files are all in Unicode, to simplify we will turn Unicode
# characters to ASCII, make everything lowercase, and trim most
# punctuation.
#

# Turn a Unicode string to plain ASCII, thanks to
# http://stackoverflow.com/a/518232/2809427
# NFD normalization splits an accented letter into a base letter plus a
# combining mark (category 'Mn'); dropping the marks leaves the base letter.
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


# Lowercase, trim, and remove non-letter characters
def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)  # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # collapse everything else to one space
    return s


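A quick way to see what the normalization does is to run it on a sentence with accents, an apostrophe and a comma. The two helpers are repeated here only so that the snippet runs on its own; the sample sentence is chosen for illustration.

```python
import re
import unicodedata


# Same logic as the helpers above, repeated so the snippet is standalone.
def unicodeToAscii(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')


def normalizeString(s):
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s


example = normalizeString("Elle n'est pas poète, mais romancière.")
print(example)  # elle n est pas poete mais romanciere .
```

The apostrophe and the comma become spaces and the final period is split off as its own token, which is why the sample pairs in the introduction read "n est" and "you re".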
######################################################################
# To read the data file we will split the file into lines, and then split
# lines into pairs. The files are all English → Other Language, so if we
# want to translate from Other Language → English I added the ``reverse``
# flag to reverse the pairs.
#


def readLangs(lang1, lang2, reverse=False):
    print("Reading lines...")

    # Read the file and split into lines
    with open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split every line into pairs and normalize
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs


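The splitting and reversing in `readLangs` can be tried on a single line without touching the data file. This is a small sketch; `split_pair` is a hypothetical helper and the sample line is the one quoted in the data description above.

```python
line = "I am cold.\tJ'ai froid."


def split_pair(line, reverse=False):
    """Split one tab-separated line into [source, target]."""
    pair = line.split('\t')
    return list(reversed(pair)) if reverse else pair


eng_first = split_pair(line)
fra_first = split_pair(line, reverse=True)
print(eng_first)  # ['I am cold.', "J'ai froid."]
print(fra_first)  # ["J'ai froid.", 'I am cold.']
```

With `reverse=True` the French sentence ends up at index 0 (the input) and the English sentence at index 1 (the target), which is the order the later filtering step relies on.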
######################################################################
# Since there are a *lot* of example sentences and we want to train
# something quickly, we'll trim the data set to only relatively short and
# simple sentences. Here the maximum length is 10 words (that includes
# ending punctuation) and we're filtering to sentences that translate to
# the form "I am" or "He is" etc. (accounting for apostrophes replaced
# earlier).
#

MAX_LENGTH = 10

eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s ",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


# Keep a pair only if both sentences are shorter than MAX_LENGTH words and
# the English sentence (p[1] after reversing) starts with one of the prefixes.
def filterPair(p):
    return len(p[0].split(' ')) < MAX_LENGTH and \
           len(p[1].split(' ')) < MAX_LENGTH and \
           p[1].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]


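The effect of the filter is easiest to see on a few hand-written pairs. The sketch below uses a copy of the filter with made-up toy data; the pairs are in `[french, english]` order, as they are after reversing.

```python
MAX_LENGTH = 10
eng_prefixes = ("i am ", "i m ", "he is", "he s ", "she is", "she s ",
                "you are", "you re ", "we are", "we re ",
                "they are", "they re ")


def filterPair(p):
    return (len(p[0].split(' ')) < MAX_LENGTH and
            len(p[1].split(' ')) < MAX_LENGTH and
            p[1].startswith(eng_prefixes))


toy_pairs = [
    # kept: short, and the English side starts with "i am "
    ["j ai froid .", "i am cold ."],
    # dropped: the English side does not start with an allowed prefix
    ["le chat est noir .", "the cat is black ."],
    # dropped: the French side has 13 tokens
    ["je suis " + "tres " * 9 + "fatigue .", "i am very tired ."],
]
kept = [p for p in toy_pairs if filterPair(p)]
print(kept)  # [['j ai froid .', 'i am cold .']]
```

Note that the length test counts tokens produced by `split(' ')`, so the trailing punctuation mark counts as one token.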
######################################################################
# The full process for preparing the data is:
#
# -  Read text file and split into lines, split lines into pairs
# -  Normalize text, filter by length and content
# -  Make word lists from sentences in pairs
#

def prepareData(lang1, lang2, reverse=False):
    # Read the lang1/lang2 file, reversing the pairs if requested
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    # Total number of pairs read
    print("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    # Number of pairs that pass the length and prefix filter
    print("Trimmed to %s sentence pairs" % len(pairs))
    print("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    print("Counted words:")
    print(input_lang.name, input_lang.n_words)
    print(output_lang.name, output_lang.n_words)
    return input_lang, output_lang, pairs


# Preprocess the data: reverse=True gives French → English pairs
input_lang, output_lang, pairs = prepareData('eng', 'fra', True)
print(random.choice(pairs))  # show one random pair


######################################################################
# The Seq2Seq Model
# =================
#
# A Recurrent Neural Network, or RNN, is a network that operates on a
# sequence and uses its own output as input for subsequent steps.
#
# A `Sequence to Sequence network <http://arxiv.org/abs/1409.3215>`__, or
# seq2seq network, or `Encoder Decoder
# network <https://arxiv.org/pdf/1406.1078v3.pdf>`__, is a model
# consisting of two RNNs called the encoder and decoder. The encoder reads
# an input sequence and outputs a single vector, and the decoder reads
# that vector to produce an output sequence.
#
# .. figure:: /_static/img/seq-seq-images/seq2seq.png
#    :alt:
#
# Unlike sequence prediction with a single RNN, where every input
# corresponds to an output, the seq2seq model frees us from sequence
# length and order, which makes it ideal for translation between two
# languages.
#
# Consider the sentence "Je ne suis pas le chat noir" → "I am not the
# black cat". Most of the words in the input sentence have a direct
# translation in the output sentence, but are in slightly different
# orders, e.g. "chat noir" and "black cat". Because of the "ne/pas"
# construction there is also one more word in the input sentence. It would
# be difficult to produce a correct translation directly from the sequence
# of input words.
#
# With a seq2seq model the encoder creates a single vector which, in the
# ideal case, encodes the "meaning" of the input sequence: a single point
# in some N dimensional space of sentences.
#


######################################################################
# The Encoder
# -----------
#
# The encoder of a seq2seq network is an RNN that outputs some value for
# every word from the input sentence. For every input word the encoder
# outputs a vector and a hidden state, and uses the hidden state for the
# next input word.
#
# .. figure:: /_static/img/seq-seq-images/encoder-network.png
#    :alt:
#
#

class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(EncoderRNN, self).__init__()
        # Size of the embedding vectors and of the GRU hidden state
        self.hidden_size = hidden_size
        # nn.Embedding(num_words, dim) is a lookup table with one row per
        # word. For example, nn.Embedding(2, 4) holds 2 words with 4
        # dimensions each (a 2x4 matrix), and 100 words with 10 dimensions
        # each would be nn.Embedding(100, 10). The rows start out as
        # untrained initial values; they are learned together with the rest
        # of the network so that each row comes to represent its word.
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)  # single-layer GRU

    def forward(self, input, hidden):
        # view works like reshape; -1 lets it infer the last dimension,
        # giving shape (seq_len=1, batch=1, hidden_size)
        embedded = self.embedding(input).view(1, 1, -1)
        output = embedded
        output, hidden = self.gru(output, hidden)
        return output, hidden

    def initHidden(self):  # all-zero initial hidden state
        return torch.zeros(1, 1, self.hidden_size, device=device)


######################################################################
# The Decoder
# -----------
#
# The decoder is another RNN that takes the encoder output vector(s) and
# outputs a sequence of words to create the translation.
#


######################################################################
# Simple Decoder
# ^^^^^^^^^^^^^^
#
# In the simplest seq2seq decoder we use only the last output of the
# encoder. This last output is sometimes called the *context vector* as it
# encodes context from the entire sequence. This context vector is used as
# the initial hidden state of the decoder.
#
# At every step of decoding, the decoder is given an input token and
# hidden state. The initial input token is the start-of-string ``<SOS>``
# token, and the first hidden state is the context vector (the encoder's
# last hidden state).
#
# .. figure:: /_static/img/seq-seq-images/decoder-network.png
#    :alt:
#
#

class DecoderRNN(nn.Module):
    # DecoderRNN is structured much like EncoderRNN; the figure above shows
    # the data flow.
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        # view works like reshape; -1 lets it infer the last dimension
        output = self.embedding(input).view(1, 1, -1)
        output = F.relu(output)
        output, hidden = self.gru(output, hidden)  # one GRU step
        # Project to the vocabulary size and apply log-softmax
        # (the softmax box in the figure above)
        output = self.softmax(self.out(output[0]))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


######################################################################
# I encourage you to train and observe the results of this model, but to
# save space we'll be going straight for the gold and introducing the
# Attention Mechanism.
#


######################################################################
# Attention Decoder
# ^^^^^^^^^^^^^^^^^
#
# If only the context vector is passed between the encoder and decoder,
# that single vector carries the burden of encoding the entire sentence.
#
# Attention allows the decoder network to "focus" on a different part of
# the encoder's outputs for every step of the decoder's own outputs. First
# we calculate a set of *attention weights*. These will be multiplied by
# the encoder output vectors to create a weighted combination. The result
# (called ``attn_applied`` in the code) should contain information about
# that specific part of the input sequence, and thus help the decoder
# choose the right output words.
#
# .. figure:: https://i.imgur.com/1152PYf.png
#    :alt:
#
# Calculating the attention weights is done with another feed-forward
# layer ``attn``, using the decoder's input and hidden state as inputs.
# Because there are sentences of all sizes in the training data, to
# actually create and train this layer we have to choose a maximum
# sentence length (input length, for encoder outputs) that it can apply
# to. Sentences of the maximum length will use all the attention weights,
# while shorter sentences will only use the first few.
#
# .. figure:: /_static/img/seq-seq-images/attention-decoder-network.png
#    :alt:
#
#

class AttnDecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1, max_length=MAX_LENGTH):
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_outputs):
        # Embed the input token and apply dropout, which randomly zeroes
        # some elements during training
        embedded = self.embedding(input).view(1, 1, -1)
        embedded = self.dropout(embedded)

        # Learn the attention weights from the embedded input and the
        # current hidden state. torch.cat concatenates along an existing
        # dimension, whereas torch.stack creates a new dimension and joins
        # along it.
        attn_weights = F.softmax(
            self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

        # Apply the attention weights to encoder_outputs.
        # torch.bmm is a batched matrix multiply: for batch1 of shape
        # (b, n, m) and batch2 of shape (b, m, p) the result has shape
        # (b, n, p). Here (1, 1, max_length) x (1, max_length, hidden_size)
        # gives (1, 1, hidden_size).
        attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                 encoder_outputs.unsqueeze(0))
        # Concatenate embedded and attn_applied
        output = torch.cat((embedded[0], attn_applied[0]), 1)
        # unsqueeze(0) inserts a dimension of size 1 at position 0, giving
        # the (1, 1, hidden_size) shape the GRU expects
        output = self.attn_combine(output).unsqueeze(0)

        output = F.relu(output)
        output, hidden = self.gru(output, hidden)

        output = F.log_softmax(self.out(output[0]), dim=1)
        return output, hidden, attn_weights

    def initHidden(self):
        return torch.zeros(1, 1, self.hidden_size, device=device)


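The weighting step inside `AttnDecoderRNN.forward` can be reproduced with plain Python lists, which may help when reasoning about shapes. This is only a sketch: the scores and the 2-dimensional encoder outputs are made up, and `softmax` and `apply_attention` are illustrative helpers rather than tutorial functions. They play the roles of `F.softmax` and `torch.bmm` respectively.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_attention(weights, encoder_outputs):
    """Weighted sum of encoder output vectors (one row per input position)."""
    dim = len(encoder_outputs[0])
    return [sum(w * row[d] for w, row in zip(weights, encoder_outputs))
            for d in range(dim)]


# Toy values: 3 input positions, hidden size 2 (illustrative only).
scores = [2.0, 0.0, 0.0]
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

weights = softmax(scores)                            # sums to 1
context = apply_attention(weights, encoder_outputs)  # length-2 vector
print(weights, context)
```

Because the first position has the highest score, it receives the largest weight, and the resulting context vector is dominated by the first encoder output.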
######################################################################
# .. note:: There are other forms of attention that work around the length
#   limitation by using a relative position approach. Read about "local
#   attention" in `Effective Approaches to Attention-based Neural Machine
#   Translation <https://arxiv.org/abs/1508.04025>`__.
#
# Training
# ========
#
# Preparing Training Data
# -----------------------
#
# To train, for each pair we will need an input tensor (indexes of the
# words in the input sentence) and target tensor (indexes of the words in
# the target sentence). While creating these vectors we will append the
# EOS token to both sequences.
#

def indexesFromSentence(lang, sentence):
    return [lang.word2index[word] for word in sentence.split(' ')]


def tensorFromSentence(lang, sentence):
    # Look up the index of every word
    indexes = indexesFromSentence(lang, sentence)
    # Append the EOS token
    indexes.append(EOS_token)
    # Column tensor of shape (length, 1)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)


def tensorsFromPair(pair):
    # Input tensor: word indexes of the input sentence.
    # Target tensor: word indexes of the target sentence.
    input_tensor = tensorFromSentence(input_lang, pair[0])
    target_tensor = tensorFromSentence(output_lang, pair[1])
    return (input_tensor, target_tensor)


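The index lookup and the appended EOS token can be checked without PyTorch. The sketch below uses a hand-written `toy_word2index` and a hypothetical helper `indexes_with_eos`; the nested list at the end only imitates the `(length, 1)` layout produced by `view(-1, 1)`.

```python
EOS_token = 1

# Illustrative ids; real ids come from Lang.word2index.
toy_word2index = {"je": 2, "suis": 3, "content": 4, ".": 5}


def indexes_with_eos(word2index, sentence):
    """Map each word to its index and append EOS."""
    indexes = [word2index[w] for w in sentence.split(' ')]
    indexes.append(EOS_token)
    return indexes


ids = indexes_with_eos(toy_word2index, "je suis content .")
column = [[i] for i in ids]  # one index per row, like tensor.view(-1, 1)
print(ids)     # [2, 3, 4, 5, 1]
print(column)  # [[2], [3], [4], [5], [1]]
```

A word that is missing from the dictionary raises `KeyError`, so sentences passed to these helpers need to be normalized the same way as the training data and contain only known words.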
######################################################################
# Training the Model
# ------------------
#
# To train we run the input sentence through the encoder, and keep track
# of every output and the latest hidden state. Then the decoder is given
# the ``<SOS>`` token as its first input, and the last hidden state of the
# encoder as its first hidden state.
#
# "Teacher forcing" is the concept of using the real target outputs as
# each next input, instead of using the decoder's guess as the next input.
# Using teacher forcing causes it to converge faster but `when the trained
# network is exploited, it may exhibit
# instability <http://minds.jacobs-university.de/sites/default/files/uploads/papers/ESNTutorialRev.pdf>`__.
#
# You can observe outputs of teacher-forced networks that read with
# coherent grammar but wander far from the correct translation -
# intuitively it has learned to represent the output grammar and can "pick
# up" the meaning once the teacher tells it the first few words, but it
# has not properly learned how to create the sentence from the translation
# in the first place.
#
# Because of the freedom PyTorch's autograd gives us, we can randomly
# choose to use teacher forcing or not with a simple if statement. Turn
# ``teacher_forcing_ratio`` up to use more of it.
#

# Probability of using teacher forcing for a given training example
teacher_forcing_ratio = 0.5


# Run one training step on a single sentence pair
def train(input_tensor, target_tensor, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion,
          max_length=MAX_LENGTH):
    # encoder is an EncoderRNN(input_lang.n_words, hidden_size) and decoder
    # an AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1),
    # with hidden_size = 256.
    encoder_hidden = encoder.initHidden()

    # The optimizers are optim.SGD(encoder.parameters(), lr=learning_rate)
    # and optim.SGD(decoder.parameters(), lr=learning_rate).
    # nn.Parameter is a Tensor subclass (a Variable subclass before
    # PyTorch 0.4). Assigning a Parameter as a Module attribute registers
    # it automatically, so it shows up in the parameters() iterator;
    # assigning a plain tensor does not. This makes it possible to cache
    # temporary state on a module, such as the last hidden state of an RNN,
    # without it being registered as a model parameter.
    # Clear the gradients left over from the previous step.
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    # Sequence lengths in tokens, including EOS
    input_length = input_tensor.size(0)
    target_length = target_tensor.size(0)

    # Buffer for the encoder outputs; rows past input_length stay zero
    encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

    loss = 0

    # Run the encoder over the input, one token at a time
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_tensor[ei], encoder_hidden)
        # encoder_output has shape (seq_len=1, batch=1, hidden_size), so
        # [0, 0] selects the single hidden_size vector
        encoder_outputs[ei] = encoder_output[0, 0]

    # The first decoder input is the SOS token
    decoder_input = torch.tensor([[SOS_token]], device=device)

    decoder_hidden = encoder_hidden

    use_teacher_forcing = random.random() < teacher_forcing_ratio

    if use_teacher_forcing:
        # Teacher forcing: Feed the target as the next input
        for di in range(target_length):
            # One decoder step
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)
            # Accumulate the loss for this step
            loss += criterion(decoder_output, target_tensor[di])
            decoder_input = target_tensor[di]  # Teacher forcing

    else:
        # Without teacher forcing: use its own predictions as the next input
        for di in range(target_length):
            # One decoder step
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_outputs)

            # topk(1) returns the largest value along the last dimension
            # and its index as a (values, indices) tuple. For k > 1 the
            # results are sorted from largest to smallest by default, and
            # largest=False returns the k smallest values instead.
            topv, topi = decoder_output.topk(1)
            decoder_input = topi.squeeze().detach()  # detach from history as input

            loss += criterion(decoder_output, target_tensor[di])
            if decoder_input.item() == EOS_token:
                break
    # Backpropagate
    loss.backward()

    # Update the parameters
    encoder_optimizer.step()
    decoder_optimizer.step()

    return loss.item() / target_length


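The only difference between the two branches of `train` is which token is fed to the decoder at the next step. The sketch below isolates that choice so it can be run without a model: `decoder_inputs` is a hypothetical helper, and `predictions` is a fixed list standing in for the decoder's guesses. As in `train`, the decision is made once per example rather than once per step; the early stop on a predicted EOS is left out for brevity.

```python
import random


def decoder_inputs(targets, predictions, teacher_forcing_ratio, rng, sos=0):
    """Return the sequence of inputs the decoder would see for one example.

    targets:     ground-truth token ids
    predictions: the decoder's guesses at each step (fixed here so the
                 sketch needs no real model)
    """
    use_teacher_forcing = rng.random() < teacher_forcing_ratio
    inputs = [sos]  # the first input is always SOS
    for target, guess in zip(targets[:-1], predictions[:-1]):
        inputs.append(target if use_teacher_forcing else guess)
    return inputs


rng = random.Random(0)
targets = [5, 6, 1]      # illustrative ids, ending in EOS (1)
predictions = [7, 8, 1]  # illustrative wrong guesses

forced = decoder_inputs(targets, predictions, 1.0, rng)  # always teacher-forced
free = decoder_inputs(targets, predictions, 0.0, rng)    # never teacher-forced
print(forced)  # [0, 5, 6]
print(free)    # [0, 7, 8]
```

With a ratio of 0.5, as in the tutorial, roughly half of the training examples take each path.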
######################################################################
# This is a helper function to print time elapsed and estimated time
# remaining given the current time and progress %.
#

import time
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    # percent is the fraction of work done so far (between 0 and 1)
    now = time.time()
    s = now - since     # elapsed seconds
    es = s / (percent)  # estimated total seconds
    rs = es - s         # estimated remaining seconds
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))


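The estimate is simple proportional arithmetic: if a fraction `percent` of the work took `s` seconds, the whole job should take `s / percent`. The following sketch checks this with fixed numbers instead of the wall clock; `estimate` is a hypothetical helper, and `asMinutes` is repeated so the snippet runs on its own.

```python
import math


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def estimate(elapsed, percent):
    """Elapsed and remaining time, given the fraction of work done."""
    total = elapsed / percent
    return asMinutes(elapsed), asMinutes(total - elapsed)


print(estimate(90, 0.25))  # ('1m 30s', '4m 30s')
```

Despite its name, `percent` is a fraction such as `0.25`, not `25`; `trainIters` passes `iter / n_iters`.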
######################################################################
# The whole training process looks like this:
#
# -  Start a timer
# -  Initialize optimizers and criterion
# -  Create set of training pairs
# -  Start empty losses array for plotting
#
# Then we call ``train`` many times and occasionally print the progress (%
# of examples, time so far, estimated time) and average loss.
#

def trainIters(encoder, decoder, n_iters, print_every=1000, plot_every=100, learning_rate=0.01):
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)

    # Sample n_iters random training pairs (with replacement) as tensors
    training_pairs = [tensorsFromPair(random.choice(pairs))
                      for i in range(n_iters)]
    # Negative log-likelihood loss, matching the decoder's log-softmax output
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        target_tensor = training_pair[1]

        # One training step; returns the average loss per target token
        loss = train(input_tensor, target_tensor, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            # Print progress (time so far, estimated time remaining,
            # iteration count, % of examples) and the average loss
            print('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                         iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0
    # Plot the averaged losses
    showPlot(plot_losses)


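The `print_every` and `plot_every` bookkeeping in `trainIters` is a windowed average: losses are summed and, every `every` iterations, the sum is divided by the window size and reset. A minimal sketch of that pattern, with a hypothetical helper `averaged_losses` and made-up loss values:

```python
def averaged_losses(losses, every):
    """Average consecutive windows of `every` losses."""
    averages, running = [], 0.0
    for i, loss in enumerate(losses, start=1):
        running += loss
        if i % every == 0:
            averages.append(running / every)
            running = 0.0  # reset for the next window
    return averages


print(averaged_losses([4.0, 2.0, 3.0, 1.0, 5.0], every=2))  # [3.0, 2.0]
```

As in `trainIters`, a trailing incomplete window (the final `5.0` here) is never reported, so choosing `n_iters` as a multiple of `plot_every` avoids losing the last few values.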
######################################################################
# Plotting results
# ----------------
#
# Plotting is done with matplotlib, using the array of loss values
# ``plot_losses`` saved while training.
#

import matplotlib.pyplot as plt

plt.switch_backend('agg')
import matplotlib.ticker as ticker
import numpy as np


def showPlot(points):
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)


######################################################################
# Evaluation
# ==========
#
# Evaluation is mostly the same as training, but there are no targets so
# we simply feed the decoder's predictions back to itself for each step.
# Every time it predicts a word we add it to the output string, and if it
# predicts the EOS token we stop there. We also store the decoder's
# attention outputs for display later.
#

def evaluate(encoder, decoder, sentence, max_length=MAX_LENGTH):
    with torch.no_grad():
        # Convert the sentence into a tensor of word indices
        input_tensor = tensorFromSentence(input_lang, sentence)
        # Number of tokens in the input
        input_length = input_tensor.size()[0]

        # Here ``encoder`` is EncoderRNN(input_lang.n_words, hidden_size)
        # and ``decoder`` is AttnDecoderRNN(hidden_size,
        # output_lang.n_words, dropout_p=0.1), with hidden_size = 256
        encoder_hidden = encoder.initHidden()

        # Buffer for the encoder outputs, zero-padded up to max_length
        encoder_outputs = torch.zeros(max_length, encoder.hidden_size, device=device)

        # Run the encoder over the input one token at a time
        # (inference only: no gradients are tracked inside torch.no_grad())
        for ei in range(input_length):
            encoder_output, encoder_hidden = encoder(input_tensor[ei],
                                                     encoder_hidden)
            encoder_outputs[ei] += encoder_output[0, 0]

        # The first decoder input is the start-of-sentence token
        decoder_input = torch.tensor([[SOS_token]], device=device)  # SOS

        # The decoder starts from the encoder's final hidden state
        decoder_hidden = encoder_hidden

        decoded_words = []
        decoder_attentions = torch.zeros(max_length, max_length)

        for di in range(max_length):
            # One decoding step
            decoder_output, decoder_hidden, decoder_attention = decoder(decoder_input, decoder_hidden, encoder_outputs)

            # Keep this step's attention weights for visualization
            decoder_attentions[di] = decoder_attention.data
            # Greedy choice: take the single highest-scoring output token
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == EOS_token:
                decoded_words.append('<EOS>')
                break
            else:
                # Look up the predicted word and append it to the result
                decoded_words.append(output_lang.index2word[topi.item()])

            # Feed the prediction back in as the next decoder input
            decoder_input = topi.squeeze().detach()

        return decoded_words, decoder_attentions[:di + 1]


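The control flow of `evaluate` (feed each prediction back in, stop at EOS or at `max_length`) can be sketched without PyTorch. This is a toy illustration only: `greedy_decode`, `toy_step` and the token ids used here are hypothetical stand-ins, not the tutorial's decoder.

```python
SOS, EOS = 0, 1  # hypothetical token ids for this toy example


def greedy_decode(step, max_length=10):
    """Feed each prediction back in as the next input until EOS or max_length.

    `step(token, state)` stands in for the decoder: it returns
    (scores, new_state), where scores[i] is the score of token id i.
    """
    token, state = SOS, None
    decoded = []
    for _ in range(max_length):
        scores, state = step(token, state)
        # Greedy choice: the single highest-scoring token, like topk(1).
        token = max(range(len(scores)), key=scores.__getitem__)
        decoded.append(token)
        if token == EOS:
            break
    return decoded


def toy_step(token, state):
    """Toy 'decoder' that emits token 2, then token 3, then EOS."""
    count = 0 if state is None else state
    target = [2, 3, EOS][min(count, 2)]
    scores = [0.0] * 4
    scores[target] = 1.0
    return scores, count + 1


print(greedy_decode(toy_step))  # prints [2, 3, 1]
```

Greedy decoding keeps only the best token at each step, which is simple and fast but can commit to an early mistake.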
######################################################################
# We can evaluate random sentences from the training set and print out the
# input, target, and output to make some subjective quality judgements:
#

def evaluateRandomly(encoder, decoder, n=10):
    for i in range(n):
        pair = random.choice(pairs)
        print('>', pair[0])
        print('=', pair[1])
        output_words, attentions = evaluate(encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        print('<', output_sentence)
        print('')


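Eyeballing the printed triples is subjective. As a rough numeric complement, one could count how many target words are reproduced at the same position. The helper `word_match_rate` below is a hypothetical sketch and not a standard translation metric; it assumes the output is a list of words ending in `'<EOS>'`, as returned by `evaluate`.

```python
def word_match_rate(target_sentence, output_words):
    """Fraction of target positions whose word is reproduced at the same position."""
    target_words = target_sentence.split()
    # Ignore the '<EOS>' marker that evaluate() appends to its output.
    predicted = [w for w in output_words if w != '<EOS>']
    if not target_words:
        return 0.0
    matches = sum(1 for t, p in zip(target_words, predicted) if t == p)
    return matches / len(target_words)


# Target/output pair taken from the example translations at the top of the file.
target = "she is not a poet but a novelist ."
output = ["she", "not", "not", "a", "poet", "but", "a", "novelist", ".", "<EOS>"]
print(word_match_rate(target, output))
```

A position-based score like this penalizes a single inserted or dropped word heavily, so it should only be read alongside the printed sentences.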
######################################################################
# Training and Evaluating
# =======================
#
# With all these helper functions in place (it looks like extra work, but
# it makes it easier to run multiple experiments) we can actually
# initialize a network and start training.
#
# Remember that the input sentences were heavily filtered. For this small
# dataset we can use relatively small networks of 256 hidden nodes and a
# single GRU layer. After about 40 minutes on a MacBook CPU we'll get some
# reasonable results.
#
# .. Note::
#    If you run this notebook you can train, interrupt the kernel,
#    evaluate, and continue training later. Comment out the lines where the
#    encoder and decoder are initialized and run ``trainIters`` again.
#

hidden_size = 256
# Encoder
encoder1 = EncoderRNN(input_lang.n_words, hidden_size).to(device)
# Decoder with the attention mechanism
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words, dropout_p=0.1).to(device)
# Train for 75000 iterations, printing progress every 5000
trainIters(encoder1, attn_decoder1, 75000, print_every=5000)

######################################################################
# Evaluate a random sample of training pairs
evaluateRandomly(encoder1, attn_decoder1)

######################################################################
# Visualizing Attention
# ---------------------
#
# A useful property of the attention mechanism is its highly interpretable
# outputs. Because the attention weights are applied to specific encoder
# outputs of the input sequence, we can see where the network is focused
# most at each time step.
#
# You could simply run ``plt.matshow(attentions)`` to see attention output
# displayed as a matrix, with the columns being input steps and rows being
# output steps:
#

output_words, attentions = evaluate(encoder1, attn_decoder1, "je suis trop froid .")
plt.matshow(attentions.numpy())


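The rows/columns layout of the attention matrix can also be read without a plot. The sketch below picks, for each output word, the input word with the largest weight. The function `most_attended` and the weights in `toy_attentions` are made up for illustration; with real results one could presumably pass `attentions.tolist()` from `evaluate`.

```python
def most_attended(attentions, input_words, output_words):
    """For each output word, return the input word with the largest weight.

    `attentions` is a list of rows, one per output step; each row holds one
    weight per input step (rows = output steps, columns = input steps).
    """
    pairs = []
    for out_word, row in zip(output_words, attentions):
        # Only the first len(input_words) columns belong to real input tokens;
        # any remaining columns are padding up to the maximum length.
        weights = row[:len(input_words)]
        best = max(range(len(weights)), key=weights.__getitem__)
        pairs.append((out_word, input_words[best]))
    return pairs


# Made-up weights for illustration only.
toy_attentions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
print(most_attended(toy_attentions, ["je", "suis", "froid"], ["i", "am", "cold"]))
```

Taking the argmax discards how diffuse each row is, so the full matrix view remains the more informative picture.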
######################################################################
# For a better viewing experience we will do the extra work of adding axes
# and labels:

def showAttention(input_sentence, output_words, attentions):
    # Set up figure with colorbar
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(attentions.numpy(), cmap='bone')
    fig.colorbar(cax)

    # Set up axes
    ax.set_xticklabels([''] + input_sentence.split(' ') +
                       ['<EOS>'], rotation=90)
    ax.set_yticklabels([''] + output_words)

    # Show label at every tick
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

    plt.show()


def evaluateAndShowAttention(input_sentence):
    output_words, attentions = evaluate(
        encoder1, attn_decoder1, input_sentence)
    print('input =', input_sentence)
    print('output =', ' '.join(output_words))
    showAttention(input_sentence, output_words, attentions)


evaluateAndShowAttention("elle a cinq ans de moins que moi .")
evaluateAndShowAttention("elle est trop petit .")
evaluateAndShowAttention("je ne crains pas de mourir .")
evaluateAndShowAttention("c est un jeune directeur plein de talent .")

######################################################################
# Exercises
# =========
#
# -  Try with a different dataset
#
#    -  Another language pair
#    -  Human → Machine (e.g. IoT commands)
#    -  Chat → Response
#    -  Question → Answer
#
# -  Replace the embeddings with pre-trained word embeddings such as word2vec or
#    GloVe
# -  Try with more layers, more hidden units, and more sentences. Compare
#    the training time and results.
# -  If you use a translation file where pairs have two of the same phrase
#    (``I am test \t I am test``), you can use this as an autoencoder. Try
#    this:
#
#    -  Train as an autoencoder
#    -  Save only the Encoder network
#    -  Train a new Decoder for translation from there
#
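For the autoencoder exercise, a translation file with identical phrases on both sides of the tab could be produced along these lines. This is a minimal sketch: the helper name `to_autoencoder_lines` is hypothetical, and writing the lines to a file in whatever layout the data-loading code expects is left out.

```python
def to_autoencoder_lines(sentences):
    """Turn each non-empty sentence into a 'phrase<TAB>phrase' line (input == target)."""
    return ["{0}\t{0}".format(s.strip()) for s in sentences if s.strip()]


lines = to_autoencoder_lines(["I am test", "  he is painting a picture .  ", ""])
for line in lines:
    print(repr(line))
```

Because input and target are the same sentence, the encoder is pushed to keep enough information to reconstruct its input, which is what makes it reusable when a new decoder is trained afterwards.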