DQN - Deep Q-Network


A deep Q-network uses deep learning to tackle the Q-learning problem in reinforcement learning. It helps to first understand what the Q-learning process actually looks like: it is essentially repeated trial and error, searching for the optimal policy in the experience accumulated from those trials.

Regarding Q-learning, I came across a very good worked example, and there is also a related discussion on Zhihu.
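For reference, a minimal tabular Q-learning update looks roughly like the sketch below (illustrative only, not the example mentioned above; n_states, n_actions and the env_step callback are hypothetical placeholders):

import random
import numpy as np

# Minimal tabular Q-learning sketch; the environment is abstracted away
# behind a hypothetical env_step(state, action) -> (next_state, reward, done).
n_states, n_actions = 16, 4
alpha, gamma, epsilon = 0.1, 0.99, 0.1
Q = np.zeros((n_states, n_actions))

def q_learning_step(state, env_step):
    # epsilon-greedy: mostly exploit the current Q-table, occasionally explore
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward, done = env_step(state, action)
    # TD update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])
    return next_state, done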

Actually, as early as 2013 DeepMind published the first paper on using deep learning to solve the Q-learning problem, back when DeepMind was not yet famous. The difference from ordinary Q-learning is that, just as Alex Krizhevsky had pioneered using a CNN to extract high-level semantics from images in 2012, DeepMind likewise used a CNN to extract features directly from the raw frames instead of relying on hand-crafted features.

I want to look at how DQN is implemented from the point of view of the code.

The PyTorch code is available on the official website; below is the version with my own comments added, along with my understanding of the code.

# -*-coding:utf-8-*-
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from collections import namedtuple
from itertools import count
from PIL import Image

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T


env = gym.make('CartPole-v0').unwrapped

# set up matplotlib
is_ipython = 'inline' in matplotlib.get_backend()
if is_ipython:
    from IPython import display

plt.ion()

# if gpu is to be used
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Transition = namedtuple('Transition',
                        ('state', 'action', 'next_state', 'reward'))  # a namedtuple called Transition; its fields hold one step of experience and are accessed by name, much like a dict


class ReplayMemory(object):

    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0

    def push(self, *args):
        """Saves a transition."""
        if len(self.memory) < self.capacity:
            self.memory.append(None)
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):  # define __len__ so the built-in len() works on the memory
        return len(self.memory)


class DQN(nn.Module):

    def __init__(self):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)
        self.head = nn.Linear(448, 2)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0), -1))


resize = T.Compose([T.ToPILImage(),
                    T.Resize(40, interpolation=Image.CUBIC),
                    T.ToTensor()])

# This is based on the code from gym.
screen_width = 600


def get_cart_location():
    world_width = env.x_threshold * 2
    scale = screen_width / world_width
    return int(env.state[0] * scale + screen_width / 2.0)  # MIDDLE OF CART


def get_screen():
    screen = env.render(mode='rgb_array').transpose(
        (2, 0, 1))  # transpose into torch order (CHW)
    # Strip off the top and bottom of the screen
    screen = screen[:, 160:320]
    view_width = 320
    cart_location = get_cart_location()
    if cart_location < view_width // 2:
        slice_range = slice(view_width)
    elif cart_location > (screen_width - view_width // 2):
        slice_range = slice(-view_width, None)
    else:
        slice_range = slice(cart_location - view_width // 2,
                            cart_location + view_width // 2)
    # Strip off the edges, so that we have a square image centered on a cart
    screen = screen[:, :, slice_range]
    # Convert to float, rescale, convert to torch tensor
    # (this doesn't require a copy)
    screen = np.ascontiguousarray(screen, dtype=np.float32) / 255
    screen = torch.from_numpy(screen)
    # Resize, and add a batch dimension (BCHW)
    return resize(screen).unsqueeze(0).cuda()


env.reset()
# plt.figure()
# plt.imshow(get_screen().cpu().squeeze(0).permute(1, 2, 0).numpy(),
#            interpolation='none')
# plt.title('Example extracted screen')
# plt.show()
BATCH_SIZE = 128
GAMMA = 0.999
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200
TARGET_UPDATE = 10

policy_net = DQN().cuda()
target_net = DQN().cuda()
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

optimizer = optim.RMSprop(policy_net.parameters())
memory = ReplayMemory(10000)


steps_done = 0


def select_action(state):
    global steps_done
    sample = random.random()
    eps_threshold = EPS_END + (EPS_START - EPS_END) * \
        math.exp(-1. * steps_done / EPS_DECAY)
    steps_done += 1
    if sample > eps_threshold:
        with torch.no_grad():
            return policy_net(state).max(1)[1].view(1, 1)  # greedy action: index of the largest Q-value output by the policy network
    else:
        return torch.tensor([[random.randrange(2)]], dtype=torch.long).cuda()  # otherwise pick one of the two actions at random


episode_durations = []


def plot_durations():
    plt.figure(2)
    plt.clf()
    durations_t = torch.tensor(episode_durations, dtype=torch.float)
    plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Duration')
    plt.plot(durations_t.numpy())
    # Take 100 episode averages and plot them too
    if len(durations_t) >= 100:
        means = durations_t.unfold(0, 100, 1).mean(1).view(-1)
        means = torch.cat((torch.zeros(99), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated
    if is_ipython:
        display.clear_output(wait=True)
        display.display(plt.gcf())


def optimize_model():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)  # random sampling breaks the temporal correlation between consecutive steps
    # print(transitions)
    # Transpose the batch (see http://stackoverflow.com/a/19343/3343043 for
    # detailed explanation).
    batch = Transition(*zip(*transitions))
    # print("current")
    # print(batch.state[0])
    # print("next")
    # print(batch.next_state[0])
    # print(torch.sum(batch.state[0]))
    # print(torch.sum(batch.next_state[0]))
    # print(torch.sum(batch.state[1]))
    # # print(type(batch))
    # print("@#$%^&*")

    # Compute a mask of non-final states and concatenate the batch elements
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.next_state)), dtype=torch.uint8).cuda()  # binary mask: 1 where next_state is not None (newer PyTorch versions expect torch.bool here)
    non_final_next_states = torch.cat([s for s in batch.next_state if s is not None])  # terminal (None) states are not concatenated, so this can be shorter than BATCH_SIZE
    # print("the non_final_mask is")
    # print(non_final_mask)
    # none_total = 0
    # total = 0
    # for s in batch.next_state:
    #     if s is None:
    #         none_total = none_total + 1
    #     else:
    #         total = total + 1
    # print(none_total, total)
    state_batch = torch.cat(batch.state)
    action_batch = torch.cat(batch.action)
    reward_batch = torch.cat(batch.reward)
    # print(action_batch)  # each action is either 0 or 1
    # print(reward_batch)
    # print(len(non_final_mask))
    # Compute Q(s_t, a) - the model computes Q(s_t), then we select the
    # columns of actions taken
    state_action_values = policy_net(state_batch).gather(1, action_batch)  # gather picks, along dim 1, the Q-value at the index of the action actually taken
    # in terms of the update formula, these are the Q(s_t, a_t) values
    # print((policy_net(state_batch)))
    # print(state_action_values)
    # Compute V(s_{t+1}) for all next states.
    next_state_values = torch.zeros(BATCH_SIZE).cuda()
    # print(next_state_values)
    # print("no final mask")
    # print(non_final_mask)
    # print("@#$%^&*")
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()  # assign only where non_final_mask is 1; terminal states keep the value 0
    # print(target_net(non_final_next_states).max(1)[0].detach())
    # print("12345")
    # print(next_state_values)
    # Compute the expected Q values
    expected_state_action_values = (next_state_values * GAMMA) + reward_batch

    # Compute Huber loss
    loss = F.smooth_l1_loss(state_action_values, expected_state_action_values.unsqueeze(1))

    # compare the parameters of the 2 networks
    print(policy_net.state_dict()['head.bias'])
    print("!@#$%^&*")
    print(target_net.state_dict()['head.bias'])

    # Optimize the model
    optimizer.zero_grad()
    loss.backward()
    for param in policy_net.parameters():
        param.grad.data.clamp_(-1, 1)
    optimizer.step()


num_episodes = 50
for i_episode in range(num_episodes):
    # print("the episode is %f" % i_episode)
    # Initialize the environment and state
    env.reset()
    last_screen = get_screen()
    # print(last_screen)
    # print("#QW&*!$")
    current_screen = get_screen()  # a single frame (with batch dimension 1), not a whole batch
    # print(current_screen)
    state = current_screen - last_screen  # the difference between two frames is used as the state fed to the network, similar to how frame differences are used in RNN-based pose estimation
    for t in count():  # count() is an infinite iterator; t keeps increasing until the episode ends
        # Select and perform an action
        action = select_action(state)
        _, reward, done, _ = env.step(action.item())  # done indicates whether the episode is over; the reward is decided inside gym; given the action, gym moves to the next state
        reward = torch.tensor([reward]).cuda()

        # Observe new state
        last_screen = current_screen
        current_screen = get_screen()
        if not done:
            next_state = current_screen - last_screen
        else:
            next_state = None

        # Store the transition in memory
        memory.push(state, action, next_state, reward)  # memory stores state, action, next_state and the corresponding reward
        # print("the length of the memory is %d" % len(memory))
        # Move to the next state
        state = next_state

        # Perform one step of the optimization (on the policy network)
        optimize_model()
        if done:
            episode_durations.append(t + 1)
            plot_durations()
            break
    # Update the target network
    if i_episode % TARGET_UPDATE == 0:  # the target network is only updated at this frequency
        target_net.load_state_dict(policy_net.state_dict())

print('Complete')
env.render()
env.close()
plt.ioff()
plt.show()

The author relies on the gym library, which provides environments for generating reinforcement-learning training samples. Annoyingly, gym keeps throwing errors when I step through the code in PyCharm's debugger, while running it directly works fine; my guess is that gym simply does not cope well with being debugged.

Anyway, the overall flow of the code is: gym is used to create an environment (the thing interacting with it is what reinforcement learning calls the agent). The environment exposes the current state, receives an action, and returns the next state together with the reward earned by that action. How the reward for an action is computed, and what the next state after the action is, are already implemented inside gym, as in the sketch below.
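A minimal sketch of that interaction loop, using CartPole-v0 and the same (old) gym API as the code above, with random actions standing in for the agent's choices:

import gym

env = gym.make('CartPole-v0')
state = env.reset()
done = False
while not done:
    action = env.action_space.sample()              # the agent's decision would go here
    next_state, reward, done, _ = env.step(action)  # gym returns the next state and the reward
    state = next_state
env.close()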

 

Before the network is even defined, the author stores the trial-and-error experience: the current state, the action taken, the next state, and the reward for that action. The state is not the raw game screenshot but the difference between two consecutive frames, and the reward is simply whatever gym returns.
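Concretely, one stored transition looks roughly like this (a sketch with made-up tensors; the shape (1, 3, 40, 80) is only what the preprocessing above appears to produce and is meant as an illustration, not a guarantee):

import torch

memory = ReplayMemory(10000)
state = torch.randn(1, 3, 40, 80)        # frame difference: current_screen - last_screen
action = torch.tensor([[1]])             # one of the two discrete actions (0 or 1)
next_state = torch.randn(1, 3, 40, 80)   # would be None for a terminal transition
reward = torch.tensor([1.0])             # whatever gym returned for this step
memory.push(state, action, next_state, reward)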

Why do it this way? It is a bit like using an RNN to tackle SLAM-style problems, where the network is fed the difference between two video frames rather than the frames themselves. I will leave this as a question to come back to.

Once these transitions are stored, the network can be trained. The main network structure is as follows:

class DQN(nn.Module):

    def __init__(self):
        super(DQN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=5, stride=2)
        self.bn1 = nn.BatchNorm2d(16)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=5, stride=2)
        self.bn2 = nn.BatchNorm2d(32)
        self.conv3 = nn.Conv2d(32, 32, kernel_size=5, stride=2)
        self.bn3 = nn.BatchNorm2d(32)
        self.head = nn.Linear(448, 2)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        return self.head(x.view(x.size(0), -1))

The network outputs two values, one per action, which is not hard to understand: what the trained network ultimately produces is the decision itself. Still, this constant trial and error, with the trial data saved and then trained on, is that strictly speaking unsupervised learning?
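To make the two outputs concrete, a quick shape check (a sketch: the dummy input assumes the preprocessed frame difference is roughly 40x80, and that policy_net sits on the GPU as in the code above):

dummy = torch.randn(1, 3, 40, 80).cuda()   # a fake frame difference
q_values = policy_net(dummy)               # shape (1, 2): one Q-value per action
action = q_values.max(1)[1].view(1, 1)     # index of the larger Q-value = the greedy action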

Anyway, the author trains on this trial-and-error data.

But how is the network's loss designed?

The loss is essentially the difference between two estimates of the Q function: the first term, Q(s, a), depends on the current state s and the action a that was taken; the second, r + γ·max Q, is the current reward plus the best Q-value attainable by acting optimally in the next state.
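Written out, the target and loss that the code above computes are:

y_t = \begin{cases} r_t, & s_{t+1}\ \text{terminal} \\ r_t + \gamma \max_{a'} Q_{\text{target}}(s_{t+1}, a'), & \text{otherwise} \end{cases}
\qquad
L = \text{Huber}\big(Q_{\text{policy}}(s_t, a_t) - y_t\big)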

How do these two terms show up in the code? The author in fact defines two networks, one called the policy network and the other the target network.

The optimization target is the policy net, while the target net is a periodic copy of the policy net, as follows:

    # Update the target network
    if i_episode % TARGET_UPDATE == 0:  # the target network is only updated at this frequency
        target_net.load_state_dict(policy_net.state_dict())

The policy net takes the state batch and, from its output, picks the column corresponding to the action that was actually taken; since the action is either 0 or 1, the gathered policy_net output is a column vector of length batch_size.

In this code, that network output is the value of the Q function.

The target_net takes next_state as input; since we do not know which action will actually be taken there, we take the maximum over actions, multiply the output by GAMMA, and add the reward of the current step.
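A tiny standalone example of the two tensor operations involved (gather for the policy-net side, max for the target-net side):

import torch

q_values = torch.tensor([[0.2, 0.7],
                         [0.9, 0.1]])    # pretend network output for a batch of 2 states
actions = torch.tensor([[1], [0]])       # the actions that were actually taken
q_taken = q_values.gather(1, actions)    # tensor([[0.7], [0.9]]): Q(s_t, a_t), as in the policy-net branch
q_max = q_values.max(1)[0]               # tensor([0.7000, 0.9000]): max over actions, as in the target-net branch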

Note that policy_net is always the one updated first, and the update pushes the two networks' outputs toward each other (not exactly, of course, since the reward term sits in between). But why must the target_net update always lag behind? An even more extreme case: what would happen if next_state were fed into the policy network itself?
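For comparison, that "extreme case" would amount to dropping the frozen copy and bootstrapping from policy_net itself inside optimize_model, roughly as sketched below (a hypothetical variant, not what the tutorial does; the usual argument for the lag is that a frozen target keeps the regression target from shifting after every gradient step):

# hypothetical variant of optimize_model: no target network
next_state_values = torch.zeros(BATCH_SIZE).cuda()
next_state_values[non_final_mask] = policy_net(non_final_next_states).max(1)[0].detach()
expected_state_action_values = (next_state_values * GAMMA) + reward_batch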

 

