Implementing the experience pool in the DQN reinforcement learning algorithm: the experience_replay_buffer component

Reposted article, 2020-11-23 13:21. Category: reinforcement learning

Related link for this post:

Setting up and running the DQN code from GitHub (Human-Level Control through Deep Reinforcement Learning), with conda configuration

------------------------------------------------------------------

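The experience pool is one of the key contributions of the DQN algorithm, and the experience replay buffer is itself a core part of the algorithm. It is also fairly hard to implement, especially if the implementation is to be well designed and not too slow. This post therefore describes how it can be implemented, presents three variants, and tests and analyzes the runtime performance of each.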

The original implementation of the experience replay buffer discussed here is:

Code from https://github.com/tambetm/simple_dqn/blob/master/src/replay_memory.py

A few small adjustments to the original code give the first variant, shown below:

# encoding:UTF-8
"""Code from https://github.com/tambetm/simple_dqn/blob/master/src/replay_memory.py"""

import random
import numpy as np

class ReplayBuffer(object):
    def __init__(self, config):
        self.cnn_format = config.cnn_format   # layout of the returned batches: 'NCHW' or 'NHWC'
        self.buffer_size = config.replay_buffer_size  # maximum capacity of the buffer
        self.history_length = config.history_length   # number of frames that make up one state
        self.dims = (config.screen_height, config.screen_width)  # height and width of one frame
        self.batch_size = config.batch_size    # mini-batch size
        self.count = 0     # number of entries currently stored in the buffer
        self.current = 0   # write pointer: index where the next frame will be stored

        # experience replay buffer storage: pre_state -> a, r, s, terminal
        self.actions = np.empty(self.buffer_size, dtype=np.uint8)
        self.rewards = np.empty(self.buffer_size, dtype=np.int8)  # rewards are limited to 0, +1, -1
        self.screens = np.empty((self.buffer_size, config.screen_height, config.screen_width),
                                dtype=np.float32)  # all stored screen frames
        # terminal flag for the screen at the same index
        # (np.bool_ is used because the np.bool alias is not available in every NumPy release)
        self.terminals = np.empty(self.buffer_size, dtype=np.bool_)

        # pre-allocate prestates and poststates for the minibatch
        # prestates: the state s before the action in (s, a, s+1), i.e. the current state
        self.prestates = np.empty((self.batch_size, self.history_length) + self.dims,
                                  dtype=np.float32)
        # poststates: the state s+1 after the action in (s, a, s+1), i.e. the next state
        self.poststates = np.empty((self.batch_size, self.history_length) + self.dims,
                                   dtype=np.float32)

        # sanity check on the configuration
        assert self.history_length >= 1  # a state consists of at least one frame

    def add(self, action, reward, screen, terminal):
        """Add a new (action, reward, screen, terminal) entry to the experience buffer."""
        assert screen.shape == self.dims  # the incoming screen must match the configured size
        # screen is the post-state: the frame observed after the previous state
        # took `action` and received `reward`
        # self.current is the position where the new entry is written
        self.actions[self.current] = action
        self.rewards[self.current] = reward
        self.screens[self.current, ...] = screen
        self.terminals[self.current] = terminal
        # while the buffer is not full, count follows current + 1;
        # once it is full, count stays at buffer_size and only current keeps moving
        self.count = max(self.count, self.current + 1)
        # advance the write pointer, wrapping around at buffer_size
        self.current = (self.current + 1) % self.buffer_size

    def getState(self, index):
        return self.screens[(index - (self.history_length - 1)):(index + 1), ...]

    def sample(self):
        # memory must include poststate, prestate and history
        # history_length is at least 1, and both a pre- and a post-state are needed,
        # so count must be at least 2
        assert self.count > self.history_length
        # sample random indexes
        indexes = []
        while len(indexes) < self.batch_size:
            # find random index
            while True:
                # sample one index (ignore states wrapping over the buffer start)
                index = random.randint(self.history_length, self.count - 1)
                # if wraps over current pointer, then get new one
                if index - self.history_length < self.current <= index:
                    continue
                # if wraps over episode end, then get new one
                # poststate (last screen) can be terminal state!
                if self.terminals[(index - self.history_length):index].any():
                    continue
                # otherwise use this index
                break

            # having index first is fastest in C-order matrices
            self.prestates[len(indexes), ...] = self.getState(index - 1)
            self.poststates[len(indexes), ...] = self.getState(index)
            indexes.append(index)

        actions = self.actions[indexes]
        rewards = self.rewards[indexes]
        terminals = self.terminals[indexes]

        # return s, a, r, s+1, terminal
        if self.cnn_format == 'NHWC':
            return np.transpose(self.prestates, (0, 2, 3, 1)), actions, \
                   rewards, np.transpose(self.poststates, (0, 2, 3, 1)), terminals
        else:  #format is 'NCHW', faster than 'NHWC'
            return self.prestates, actions, rewards, self.poststates, terminals


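The principle is as follows. The add method takes the action executed by the agent, the reward obtained from the environment, the new state s+1 that the game screen moves to, and the terminal flag saying whether that new state is a terminal state, and stores these four items (action, reward, new screen s+1, terminal) in the experience replay buffer. Because the buffer has a capacity limit, each time a new screen frame is stored we also have to check whether the capacity is exceeded; if it is, the oldest frame in the buffer is discarded.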

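The sample method draws a preState and a postState from the buffer, each history_length frames long. preState is the state before the action is executed, and postState is the new state obtained after the action is executed. Executing the action in preState moves the agent to postState; each sample also carries the reward obtained and the flag saying whether postState is terminal.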
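The listing above returns each batch either channel-first ('NCHW') or channel-last ('NHWC'). The following is a minimal sketch of that layout change only; the 84x84 frame size and the marked pixel are illustrative assumptions, not values from the original code.

```python
import numpy as np

batch_size, history_length, height, width = 32, 4, 84, 84  # assumed sizes

# Batch stored channel-first: (N, C, H, W), where C is the number of stacked frames.
prestates_nchw = np.zeros((batch_size, history_length, height, width), dtype=np.float32)
prestates_nchw[0, 2, 5, 7] = 1.0  # mark one pixel of frame 2 in sample 0

# Move the frame axis to the end to obtain (N, H, W, C).
prestates_nhwc = np.transpose(prestates_nchw, (0, 2, 3, 1))

print(prestates_nchw.shape)  # (32, 4, 84, 84)
print(prestates_nhwc.shape)  # (32, 84, 84, 4)
# np.transpose returns a view, so both arrays share the same memory.
print(np.shares_memory(prestates_nchw, prestates_nhwc))  # True
```

Since sample returns the pre-allocated prestates and poststates arrays (or views of them), a caller that wants to keep a batch across calls would probably need to copy it first.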

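Both preState and postState consist of history_length game frames, and it is the individual frames that are stored in the buffer. For example, with history_length = 4 and five frames s0, s1, s2, s3, s4 in the buffer, (s0, s1, s2, s3) can be seen as preState and (s1, s2, s3, s4) as the new state postState. The action executed in preState (the one chosen after observing s3) is stored in the same buffer slot as s4, because add records each frame together with the action that led to it; the reward and terminal flag of the transition are likewise the ones stored with s4.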
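The frame bookkeeping in this example can be checked with a small sketch. It reuses the slicing from getState in the listing above; the tiny 2x2 frames, each filled with its own frame number, are an assumption made only so the result is easy to read.

```python
import numpy as np

history_length = 4
# Five stored frames s0..s4; each frame is a 2x2 array filled with its own number.
screens = np.stack([np.full((2, 2), i, dtype=np.float32) for i in range(5)])

def get_state(screens, index, history_length):
    """Return the history_length frames ending at `index` (inclusive)."""
    return screens[index - (history_length - 1): index + 1]

index = 4  # position of s4, the frame observed after the action
pre_state = get_state(screens, index - 1, history_length)   # (s0, s1, s2, s3)
post_state = get_state(screens, index, history_length)      # (s1, s2, s3, s4)

print(pre_state[:, 0, 0])   # [0. 1. 2. 3.]
print(post_state[:, 0, 0])  # [1. 2. 3. 4.]
```

The action, reward and terminal flag that belong to this transition are read from position index, the slot of s4.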

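Not every sample drawn is usable, so unusable ones must be rejected. Let the terminal flags of frames s0, s1, s2, s3, s4 be t0, t1, t2, t3, t4. Only t4 may be terminal; t0, t1, t2, t3 must all be False, otherwise we resample. If any of s0, s1, s2, s3 is marked terminal, those frames cannot be used to build a preState and a postState.

preState and postState are reinforcement learning concepts. In the game we cannot directly read the agent's velocity or the velocities of other characters, so each state is built from several game frames; a preState or postState built this way implicitly contains that velocity information. In the cartpole balancing problem, by contrast, the position and velocity of the cart and the pole are directly available, so there is no need to stack frames: each non-image observation (position, velocity) can serve directly as the state. One of DQN's innovations is learning from perceptual data such as images, from which velocities cannot be read directly, and this is why it represents a state as several game frames. That representation carries an implicit requirement: the frames must be consecutive, so no frame other than the last may be a terminal (episode-ending) frame.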
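The terminal rule can be written as a small predicate. This is a sketch that mirrors the terminals check in the sample method above; the function name window_is_valid and the flag values are hypothetical and chosen only for illustration.

```python
import numpy as np

def window_is_valid(terminals, index, history_length):
    """True if no frame before `index` inside the window ends an episode.

    The window covers positions index - history_length .. index; only the
    last position (index) is allowed to carry terminal == True.
    """
    return not terminals[index - history_length:index].any()

# Assumed flags: episodes end at positions 2 and 7.
terminals = np.array([False, False, True, False, False, False, False, True])
history_length = 4

print(window_is_valid(terminals, 4, history_length))  # False: position 2 is inside the window
print(window_is_valid(terminals, 7, history_length))  # True: only the last frame is terminal
```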

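The most intuitive design for a replay buffer is first in, first out, i.e. a queue: add appends a new frame and its associated data at the back of the queue, and when the queue is full the oldest frame and its data are popped from the front. However, the buffer capacity used in DQN is very large (100*10000 = 1,000,000 entries), so implementing the queue with a Python list or a NumPy array makes insertion and deletion expensive. Languages such as C, C++ or Java handle this well, but the discussion here is limited to solutions that do not mix in another language. The first variant therefore implements a circular queue on top of a NumPy array: a pointer-like variable holds the index where the next frame will be written, and once the buffer is full the pointer wraps around to the start, overwriting from position 0 onward, and so on. Because of this circular pointer, variant 1 must check whether a run of consecutive frames s0, s1, s2, s3, s4 contains the frame that current points to. Let index be the position of s4; then s0 is at index-history_length, so the frames that make up preState and postState span the positions from index-history_length to index.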
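A minimal sketch of the circular write described above, using a buffer of only three slots so that the wrap-around is visible. The names count and current mirror the listing; the buffer size and the written values are illustrative assumptions.

```python
import numpy as np

buffer_size = 3
frames = np.zeros(buffer_size, dtype=np.int64)  # stands in for the screens array
count = 0     # number of valid entries
current = 0   # next write position

for value in range(1, 6):  # write the values 1, 2, 3, 4, 5
    frames[current] = value
    count = max(count, current + 1)
    current = (current + 1) % buffer_size  # wrap around when the end is reached

print(frames.tolist())  # [4, 5, 3]: values 1 and 2 were overwritten in place
print(count, current)   # 3 2
```

Nothing is shifted or reallocated on a write; the oldest entry is simply the one that current points to.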

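If we guarantee index-history_length>=0, then whenever current is greater than index-history_length and less than or equal to index, the window straddles the write position and mixes newly overwritten frames with old ones. In that case s0, s1, s2, s3, s4 cannot be guaranteed to be consecutive frames and the index is rejected. Guaranteeing index-history_length>=0 keeps this check simple. It also means that, despite the circular design, the selected s0, s1, s2, s3, s4 are contiguous in physical storage, which makes them easy to slice out.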
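The write-pointer check from variant 1 can be tried on a small full buffer. This sketch copies the comparison used in sample; the helper name overlaps_write_pointer and the sizes (12 slots, pointer at 6) are assumptions for illustration.

```python
def overlaps_write_pointer(index, current, history_length):
    """True if the write pointer lies in (index - history_length, index]."""
    return index - history_length < current <= index

buffer_size = 12
history_length = 4
current = 6  # slots 0..5 hold the newest frames, slots 6..11 the oldest ones

# Candidate indexes are drawn from history_length .. buffer_size - 1, as in variant 1.
valid = [index for index in range(history_length, buffer_size)
         if not overlaps_write_pointer(index, current, history_length)]
print(valid)  # [4, 5, 10, 11]
```

Indexes 6 to 9 are rejected because their windows would contain both a freshly written frame (slot 5 or earlier) and the oldest frame at slot 6.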

-----------------------------------------------------------------------

Second variant:

# encoding:UTF-8
"""Code from https://github.com/tambetm/simple_dqn/blob/master/src/replay_memory.py"""

import random
import numpy as np

class ReplayBuffer(object):
    def __init__(self, config):
        self.cnn_format = config.cnn_format   # layout of the returned batches: 'NCHW' or 'NHWC'
        self.buffer_size = config.replay_buffer_size  # maximum capacity of the buffer
        self.history_length = config.history_length   # number of frames that make up one state
        self.dims = (config.screen_height, config.screen_width)  # height and width of one frame
        self.batch_size = config.batch_size    # mini-batch size
        self.count = 0     # number of entries currently stored in the buffer
        self.current = 0   # write pointer: index where the next frame will be stored

        # experience replay buffer storage: pre_state -> a, r, s, terminal
        self.actions = np.empty(self.buffer_size, dtype=np.uint8)
        self.rewards = np.empty(self.buffer_size, dtype=np.int8)  # rewards are limited to 0, +1, -1
        self.screens = np.empty((self.buffer_size, config.screen_height, config.screen_width),
                                dtype=np.float32)  # all stored screen frames
        # terminal flag for the screen at the same index
        # (np.bool_ is used because the np.bool alias is not available in every NumPy release)
        self.terminals = np.empty(self.buffer_size, dtype=np.bool_)

        # pre-allocate prestates and poststates for the minibatch
        # prestates: the state s before the action in (s, a, s+1), i.e. the current state
        self.prestates = np.empty((self.batch_size, self.history_length) + self.dims,
                                  dtype=np.float32)
        # poststates: the state s+1 after the action in (s, a, s+1), i.e. the next state
        self.poststates = np.empty((self.batch_size, self.history_length) + self.dims,
                                   dtype=np.float32)

        assert self.history_length >= 1  # a state consists of at least one frame

    def add(self, action, reward, screen, terminal):
        """Add a new (action, reward, screen, terminal) entry to the experience buffer."""
        assert screen.shape == self.dims  # the incoming screen must match the configured size
        # screen is the post-state: the frame observed after the previous state
        # took `action` and received `reward`
        # self.current is the position where the new entry is written
        self.actions[self.current] = action
        self.rewards[self.current] = reward
        self.screens[self.current, ...] = screen
        self.terminals[self.current] = terminal
        # while the buffer is not full, count follows current + 1;
        # once it is full, count stays at buffer_size and only current keeps moving
        self.count = max(self.count, self.current + 1)
        # advance the write pointer, wrapping around at buffer_size
        self.current = (self.current + 1) % self.buffer_size

    def getState_order(self, index):  # frames index - (history_length - 1) .. index, stored in order (no wrap-around)
        return self.screens[(index - (self.history_length - 1)):(index + 1), ...]

    def getState_ring(self, index):
        if index - (self.history_length - 1) >= 0:
            # use faster slicing
            return self.screens[(index - (self.history_length - 1)):(index + 1), ...]
        else:
            # otherwise normalize indexes and use slower list based access
            _indexes = [(index - i) % self.count for i in reversed(range(self.history_length))]
            return self.screens[_indexes, ...]

    def sample(self):
        # memory must include poststate, prestate and history
        # history_length is at least 1, and both a pre- and a post-state are needed,
        # so count must be at least 2
        assert self.count > self.history_length
        # sample random indexes
        indexes = []
        while len(indexes) < self.batch_size:
            # find random index
            if self.count == self.buffer_size:  # buffer is full
                while True:
                    # sample one index (the window may wrap around the buffer start)
                    index = random.randint(0, self.buffer_size - 1)
                    low_index_exclude = index - self.history_length
                    upper_index_include = index

                    if low_index_exclude >= 0:
                        # if wraps over current pointer, then get new one
                        if low_index_exclude < self.current <= upper_index_include:
                            continue
                        # if wraps over episode end, then get new one
                        # poststate (last screen) can be terminal state!
                        if self.terminals[(index - self.history_length):index].any():
                            continue
                        # having index first is fastest in C-order matrices
                        self.prestates[len(indexes), ...] = self.getState_order(index - 1)
                        self.poststates[len(indexes), ...] = self.getState_order(index)
                        indexes.append(index)
                        # otherwise use this index
                        break
                    else:  #low_index_exclude < 0
                        if self.current > low_index_exclude + self.buffer_size or self.current <= upper_index_include:
                            continue
                        # poststate (last screen) can be terminal state!
                        if self.terminals[low_index_exclude:].any() or self.terminals[:upper_index_include].any():
                            continue
                        # having index first is fastest in C-order matrices
                        self.prestates[len(indexes), ...] = self.getState_ring(index - 1)
                        self.poststates[len(indexes), ...] = self.getState_ring(index)
                        indexes.append(index)
                        # otherwise use this index
                        break
            else:  # self.count != self.buffer_size: buffer not yet full, so current == count
                while True:
                    # sample one index (ignore states wrapping over the buffer start)
                    index = random.randint(self.history_length, self.count - 1)
                    # if wraps over episode end, then get new one
                    # poststate (last screen) can be terminal state!
                    if self.terminals[(index - self.history_length):index].any():
                        continue
                    # having index first is fastest in C-order matrices
                    self.prestates[len(indexes), ...] = self.getState_order(index - 1)
                    self.poststates[len(indexes), ...] = self.getState_order(index)
                    indexes.append(index)
                    # otherwise use this index
                    break

        actions = self.actions[indexes]
        rewards = self.rewards[indexes]
        terminals = self.terminals[indexes]

        # return s, a, r, s+1, terminal
        if self.cnn_format == 'NHWC':
            return np.transpose(self.prestates, (0, 2, 3, 1)), actions, \
                   rewards, np.transpose(self.poststates, (0, 2, 3, 1)), terminals
        else:  #format is 'NCHW', faster than 'NHWC'
            return self.prestates, actions, rewards, self.poststates, terminals


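The second variant improves on the first, using the same ideas and handling. The first variant requires index-history_length>=0; the second drops this requirement and instead distinguishes three cases, adding a condition check and a handling method for each.

Case 1: the buffer is full, index is drawn from 0 to buffer_size-1, and the lower bound of the window satisfies index-history_length>=0. This is the situation considered in the first variant and is handled the same way.

Case 2: the buffer is full, index is drawn from 0 to buffer_size-1, and index-history_length<0. Here some of the earlier frames sit at larger positions than the later frames, so the part of the window whose offsets fall below 0 (it wraps to the end of the buffer) and the part at 0 and above must be handled separately; each part has to satisfy both the current condition and the terminal condition.

Case 3: the buffer is not yet full. This matches the situation in the third variant: current does not need to be considered, because it is always at the end of the queue and nothing is being overwritten. Only the terminal condition has to hold, and index-history_length>=0 must be satisfied, so fewer checks are needed.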
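For the wrapped case, getState_ring in the listing above normalizes the positions with a modulo before indexing. A minimal sketch of that step follows; ring_indexes is a hypothetical helper name, and the ten-slot buffer whose entries are ten times their slot number is an illustrative assumption.

```python
import numpy as np

def ring_indexes(index, history_length, buffer_size):
    """Positions of the history_length frames ending at `index`, wrapping below 0."""
    return [(index - i) % buffer_size for i in reversed(range(history_length))]

buffer_size = 10
screens = np.arange(buffer_size) * 10  # slot k holds the value 10 * k

wrapped = ring_indexes(1, 4, buffer_size)
print(wrapped)                    # [8, 9, 0, 1]
print(screens[wrapped].tolist())  # [80, 90, 0, 10]
print(ring_indexes(5, 4, buffer_size))  # [2, 3, 4, 5]: no wrap, a plain slice would do
```

Indexing with a list of positions makes NumPy copy the selected rows, which is why the listing prefers plain slicing whenever the window does not wrap.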

----------------------------------------------------------------

Third variant:

# encoding:UTF-8
"""Code from https://github.com/tambetm/simple_dqn/blob/master/src/replay_memory.py"""

import random
import numpy as np

class ReplayBuffer(object):
    def __init__(self, config):
        self.cnn_format = config.cnn_format   # layout of the returned batches: 'NCHW' or 'NHWC'
        self.buffer_size = config.replay_buffer_size  # maximum capacity of the buffer
        self.history_length = config.history_length   # number of frames that make up one state
        self.dims = (config.screen_height, config.screen_width)  # height and width of one frame
        self.batch_size = config.batch_size    # mini-batch size
        self.count = 0     # number of entries currently stored in the buffer
        # experience replay buffer storage: pre_state -> a, r, s, terminal
        # rewards are limited to 0, +1, -1
        # screens holds all stored screen frames
        # each terminal flag belongs to the screen at the same index
        self.actions = []
        self.rewards = []
        self.screens = []
        self.terminals = []

        # pre-allocate prestates and poststates for the minibatch
        # prestates: the state s before the action in (s, a, s+1), i.e. the current state
        self.prestates = np.empty((self.batch_size, self.history_length) + self.dims,
                                  dtype=np.float32)
        # poststates: the state s+1 after the action in (s, a, s+1), i.e. the next state
        self.poststates = np.empty((self.batch_size, self.history_length) + self.dims,
                                   dtype=np.float32)

        # sanity check on the configuration
        assert self.history_length >= 1  # a state consists of at least one frame

    def add(self, action, reward, screen, terminal):
        """Add a new (action, reward, screen, terminal) entry to the experience buffer."""
        assert screen.shape == self.dims  # the incoming screen must match the configured size
        # screen is the post-state: the frame observed after the previous state
        # took `action` and received `reward`
        # new entries are always appended at the end of the lists
        self.actions.append(action)
        self.rewards.append(reward)
        self.screens.append(screen)
        self.terminals.append(terminal)

        if self.count<self.buffer_size:
            self.count+=1
        else:
            self.actions.pop(0)
            self.rewards.pop(0)
            self.screens.pop(0)
            self.terminals.pop(0)

    def getState(self, index):
        return self.screens[(index - (self.history_length - 1)):(index + 1)]

    def sample(self):
        # memory must include poststate, prestate and history
        # history_length is at least 1, and both a pre- and a post-state are needed,
        # so count must be at least 2
        assert self.count > self.history_length
        # sample random indexes
        indexes = []
        while len(indexes) < self.batch_size:
            # find random index
            while True:
                # sample one index (ignore states wrapping over the buffer start)
                index = random.randint(self.history_length, self.count - 1)
                # if wraps over episode end, then get new one
                # poststate (last screen) can be terminal state!
                if sum(self.terminals[(index - self.history_length):index])!=0:
                    continue
                # otherwise use this index
                break
            # having index first is fastest in C-order matrices
            self.prestates[len(indexes), ...] = self.getState(index - 1)
            self.poststates[len(indexes), ...] = self.getState(index)
            indexes.append(index)

        actions = []
        rewards = []
        terminals = []
        for index in indexes:
            actions.append(self.actions[index])
            rewards.append(self.rewards[index])
            terminals.append(self.terminals[index])
        actions = np.array(actions)
        rewards = np.array(rewards)
        terminals = np.array(terminals)

        # return s, a, r, s+1, terminal
        if self.cnn_format == 'NHWC':
            return np.transpose(self.prestates, (0, 2, 3, 1)), actions, \
                   rewards, np.transpose(self.poststates, (0, 2, 3, 1)), terminals
        else:  #format is 'NCHW', faster than 'NHWC'
            return self.prestates, actions, rewards, self.poststates, terminals


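Variant 3 follows essentially the same design as variant 1, with two differences. 1. Variant 3 does not implement a circular queue with a NumPy array; it uses a Python list as an ordinary queue, calling append to enqueue and pop(0) to dequeue. 2. Without a circular queue there is no current variable, so as long as sample guarantees index-history_length>=0 it does not have to worry about discontinuities caused by current, because new states always sit at the end of the queue. The add method has to check whether the capacity has been reached and, if so, dequeue with pop(0).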

-----------------------------------------------------

Performance analysis:

Test file:

# encoding:UTF-8
import numpy as np
import time

from replay_buffer import ReplayBuffer as ReplayBuffer_1
from replay_buffer_2 import ReplayBuffer as ReplayBuffer_2
from replay_buffer_3 import ReplayBuffer as ReplayBuffer_3

class Config(object):
    def __init__(self):
        self.cnn_format = "NCHW"
        self.replay_buffer_size = 5*10000#100*10000
        self.history_length= 4
        self.screen_height = 100
        self.screen_width = 100
        self.batch_size = 32

config = Config()

rf_1 = ReplayBuffer_1(config)
rf_2 = ReplayBuffer_2(config)
rf_3 = ReplayBuffer_3(config)
state = np.random.random([config.screen_height, config.screen_width])
action = np.uint8(0)
reward = np.int8(1)

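for i in range(5000*10000):  # total number of steps
    # p must be passed by keyword: the third positional argument of
    # np.random.choice is `replace`, so a positional [0.1, 0.9] would be
    # ignored as probabilities and the draw would be uniform
    terminal = np.random.choice([True, False], p=[0.1, 0.9])
    rf_1.add(action, reward, state, terminal)
    if rf_1.count >= 5*10000:    # start sampling once the buffer holds this many entries
        rf_1.sample()
    if i%10000 == 0:
        print(i)
    if i == 5*10000:   # start of the timed interval
        a = time.time()
    if i ==55*10000:   # end of the timed interval, 500,000 steps later
        b = time.time()
        break
print(b-a)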

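Resource usage of the first variant: (screenshot not preserved)

Resource usage of the second variant: (screenshot not preserved)

Resource usage of the third variant: (screenshot not preserved)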

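All three variants can drive CPU usage to 100%, and the first and second use the same amount of memory. The third should actually use more memory than the first two. It appears to use less here only because the test file passes the same unchanging input to the buffer: the third variant builds its buffer from lists, and because of how Python variable references work, a large number of entries point to the same block of memory.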
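The reference effect can be reproduced in a few lines. This is a sketch under the same assumption as the test file, namely that one fixed array is passed in on every call; the list and array sizes are illustrative.

```python
import numpy as np

state = np.random.random((100, 100))

# List-based storage (as in variant 3): appending the same object stores
# only references, so all 1000 entries are one and the same array.
screens_list = []
for _ in range(1000):
    screens_list.append(state)
print(len({id(s) for s in screens_list}))  # 1

# Array-based storage (as in variants 1 and 2): assignment copies the pixels
# into the pre-allocated block, so the memory is really used.
screens_array = np.empty((3, 100, 100), dtype=np.float32)
screens_array[0, ...] = state
print(np.shares_memory(screens_array, state))  # False
```

With real game frames, where every step produces a new array, the list would hold a separate object per entry and this saving would disappear.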

Running time (seconds, as printed by the test file; three runs per variant):

First variant:

820.264048815
801.370502949
802.797419071

Second variant:

781.069205999
795.399427891
789.664218903

Third variant:

1906.51530409
1825.43317413
1893.87990212

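The first and second variants, whose designs are more complex, have similar running times; the third variant takes more than twice as long as the first two.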

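We only tested 500,000 steps, while the original DQN is designed for 50 million steps. Taking the fastest rate of roughly 800 seconds per 500,000 steps, the time needed for 50 million steps is 800 s × (50,000,000 / 500,000) = 80,000 s, about 22.2 hours.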

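The DQN implementation associated with this post (using the first variant of the experience replay buffer) takes about 99 hours to run in total, which means that buffer handling accounts for more than 20% of the whole run. The CPU-side cost in DQN is therefore substantial. Interaction with the real game environment probably adds further CPU cost; by a rough, incomplete estimate, CPU work may account for about 50% of the total, which would make it comparable to the GPU cost. Effectively speeding up the CPU-side computation could therefore improve the performance of the whole DQN algorithm considerably.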
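The estimate can be checked with a few lines of arithmetic. The inputs are the figures quoted in the text (about 800 seconds per 500,000 steps, 50 million steps in total, about 99 hours for the full run); the variable names are only illustrative.

```python
seconds_per_test = 800.0        # approximate time measured for the timed interval
steps_per_test = 50 * 10000     # 500,000 steps in the timed interval
total_steps = 5000 * 10000      # 50 million steps in the full DQN run
total_run_hours = 99.0          # reported duration of the full DQN run

buffer_hours = seconds_per_test * (total_steps / steps_per_test) / 3600
buffer_share = buffer_hours / total_run_hours

print(round(buffer_hours, 1))   # 22.2
print(round(buffer_share, 3))   # 0.224
```

This is a linear extrapolation, so it assumes the per-step cost of the buffer stays the same over the whole run.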

-------------------------------------------------------------

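Time passes quickly. I first came across reinforcement learning around 2014, while I was a master's student. It felt very abstract: I could not work out what it was for, what its goal was, where the key points lay, or what the concrete approach was, so I gave up on it not long after.

I started my PhD in the second half of 2017. For various reasons I cannot go into, I only received a research topic, or rather a research direction, almost a year after enrolling, and somehow it turned out to be reinforcement learning. Having gone through a lot just to get a direction, I had no standing, and certainly no nerve, to be picky, so I gritted my teeth and accepted it. The only equipment I had was a machine with an i5 CPU and an NVIDIA 1060 graphics card. On top of those earlier experiences, I had also been introduced to this field by a friend half a year before (Alibaba was working on reinforcement learning based recommendation algorithms at the time). I felt partly relieved (I finally had a direction) and partly dejected: nobody I knew worked in this area, the hardware was weak, there was no legacy code or other resources to build on, and there was no one nearby to ask. So I worked on it while also looking for a way out. I interviewed at companies, state-owned enterprises and research institutes, but nothing felt right. Before I knew it, it was the second half of 2018, with no research progress at all (perhaps I really was half-hearted) and no luck finding a way out, and I went home for the New Year feeling helpless.

I returned to campus in 2019. Perhaps all the earlier setbacks and embarrassments finally came to a head: in the first half of 2019 I fell ill with no warning, under all kinds of pressure and discomfort and with constant trips to the hospital. It took three or four months to more or less recover, by which time the first half of 2019 was nearly over. I had planned to settle down and study at the university if no outside option worked out, and then this happened; when it rains, it pours, and I got nothing serious done. I had already joined the lab a year late for those reasons I cannot go into, and with this episode I was effectively two years late in starting real work.

In the second half of 2019 I came to terms with things. A by-product of the PhD is that my mindset improved and I became more resilient. After term started it was time to prepare the thesis proposal, which, again for reasons I cannot go into, had been postponed from 2018 to the end of 2019. Although I had now formally started preparing, and however much I tried to accept things, it still hurt; I could not fully let go of what I had been through, and pushing forward felt forced. Still, I got through the proposal in the end; it is not a very demanding hurdle, after all.

By then it was almost New Year's Day 2020. I had barely arrived home when the pandemic broke out, and I stayed at my family's home in the countryside until returning to campus in September this year. It was graduation season, but for me it felt more like enrollment season: I am several years behind my cohort, having started late and then had such a long break. Back on campus, many things had to start over. I dug out some material from 2019, and that is how the content of this post came about.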
