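Author: hhh5460
Original post: https://www.cnblogs.com/hhh5460/p/10134855.html
Problem setting
A 2*2 maze with one entrance, one exit, and one trap, as shown in the figure.
(Image source: https://jizhi.im/blog/post/intro_q_learning)
This is a two-dimensional problem, but we can reduce it to a one-dimensional one by numbering the four cells 0 to 3, row by row.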
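The dimension reduction can be sketched as follows: number the cells row by row, so that cell (row, col) becomes state row * 2 + col. The helper names `to_state` and `to_cell` are my own and do not appear in the original post.

```python
N_COLS = 2  # width of the 2*2 maze

def to_state(row, col):
    """Map a 2-D cell (row, col) to a 1-D state index, numbering row by row."""
    return row * N_COLS + col

def to_cell(state):
    """Inverse mapping: 1-D state index back to (row, col)."""
    return divmod(state, N_COLS)

# entrance (0,0) -> 0, empty cell (0,1) -> 1, trap (1,0) -> 2, exit (1,1) -> 3
for cell in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(cell, '->', to_state(*cell))
```

This numbering is why the later code tests `state % 2` for the column and `state // 2` for the row.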
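Thanks to https://jizhi.im/blog/post/intro_q_learning. I had read countless articles and countless code listings online without getting the point. Only after seeing the three matrices in that post (reward, transition_matrix, valid_actions) did I really understand how the q-learning algorithm operates and how to implement it.
Here is a first look at that post's code, which should make q-learning click right away. I have polished parts of it: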

    import random

    import numpy as np

    '''
    A 2*2 maze
    -------------------
    | entrance |      |
    -------------------
    | trap     | exit |
    -------------------
    # Source: https://jizhi.im/blog/post/intro_q_learning
    Each cell is a state; each has 5 actions: up, down, left, right, stay.
    Task: learn a path through the maze.
    '''

    gamma = 0.7

    #                     u,   d,   l,   r,   n
    reward = np.array([( 0, -10,   0,  -1,  -1),   # state 0
                       ( 0,  10,  -1,   0,  -1),   # state 1
                       (-1,   0,   0,  10,  -1),   # state 2
                       (-1,   0, -10,   0,  10)],  # state 3
                      dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])

    q_matrix = np.zeros((4, ),
                        dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])

    transition_matrix = np.array([(-1,  2, -1,  1,  0),   # e.g. state 0, action 'd' --> next_state 2
                                  (-1,  3,  0, -1,  1),
                                  ( 0, -1, -1,  3,  2),
                                  ( 1, -1,  2, -1,  3)],
                                 dtype=[('u',int),('d',int),('l',int),('r',int),('n',int)])

    valid_actions = np.array([['d', 'r', 'n'],   # state 0
                              ['d', 'l', 'n'],   # state 1
                              ['u', 'r', 'n'],   # state 2
                              ['u', 'l', 'n']])  # state 3

    for i in range(1000):
        current_state = 0
        while current_state != 3:
            # Exploration only, no exploitation
            current_action = random.choice(valid_actions[current_state])
            next_state = transition_matrix[current_state][current_action]
            next_reward = reward[current_state][current_action]
            # Q values of the next state; the maximum is taken below
            next_q_values = [q_matrix[next_state][next_action]
                             for next_action in valid_actions[next_state]]
            # Bellman equation (incomplete: no learning rate)
            q_matrix[current_state][current_action] = next_reward + gamma * max(next_q_values)
            current_state = next_state

    print('Final Q-table:')
    print(q_matrix)
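0. Parameters

    epsilon = 0.9   # greediness: probability of exploiting rather than exploring
    alpha = 0.1     # learning rate
    gamma = 0.8     # discount factor for future rewards

1. State set
The explorer's states are the positions it can reach, and there are 4 of them. So we define

    states = range(4)  # state set, 0 to 3

How do we determine the next state after taking a given action in a given state?

    def get_next_state(state, action):
        '''Return the next state after taking `action` in `state`.'''
        # u, d, l, r, n = -2, +2, -1, +1, 0
        if state % 2 != 1 and action == 'r':     # not in the last column: can move right (+1)
            next_state = state + 1
        elif state % 2 != 0 and action == 'l':   # not in the first column: can move left (-1)
            next_state = state - 1
        elif state // 2 != 1 and action == 'd':  # not in the last row: can move down (+2)
            next_state = state + 2
        elif state // 2 != 0 and action == 'u':  # not in the first row: can move up (-2)
            next_state = state - 2
        else:                                    # blocked by a wall (or 'n'): stay put
            next_state = state
        return next_state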
2. Action set
In each state the explorer has at most four possible actions: up, down, left, right. So we define

    actions = ['u', 'd', 'l', 'r']  # action set: up, down, left, right; 'n' (stay) could be added too

How do we determine all the valid actions in a given state (position)?

    def get_valid_actions(state):
        '''Return the valid actions in the given state. Independent of reward!'''
        valid_actions = set(actions)  # ['u','d','l','r'], plus 'n' if it was added
        if state % 2 == 1:                  # last column:
            valid_actions -= set(['r'])     #   cannot move right
        if state % 2 == 0:                  # first column:
            valid_actions -= set(['l'])     #   cannot move left
        if state // 2 == 1:                 # last row:
            valid_actions -= set(['d'])     #   cannot move down
        if state // 2 == 0:                 # first row:
            valid_actions -= set(['u'])     #   cannot move up
        return list(valid_actions)
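3. Reward set
The explorer receives a reward on reaching each state (position). So we define

    rewards = [0, 0, -10, 10]  # reward set: +10 for reaching state 3 (exit), -10 for state 2 (trap), 0 otherwise

Getting the reward for a state is then trivial: rewards[state]. It is a plain lookup by state, so no extra function is needed.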
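4. Q table
This is the most important part. A Q table records the state-action values (Q values). Most Q tables are two-dimensional, with one row per state and one column per action, which is the layout of the DataFrame defined below.
(Note: three-dimensional Q tables also exist.)
So we define

    import pandas as pd

    # Initialised with floats (0.0) so that later fractional updates keep the column dtype
    q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states],
                           index=states, columns=actions)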
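5. Q-learning algorithm
Pseudocode of the Q-learning algorithm: (figure omitted)
The Q value is updated according to the Bellman equation, where $\alpha$ is the learning rate and $\gamma$ is the discount factor (alpha and gamma in the code):
$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[r_{t+1} + \gamma \max _{a} Q(s_{t+1}, a) - Q(s_t,a_t)] \tag {1}$$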
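To make update rule (1) concrete, here is a minimal sketch of a single update, using the parameter values from section 0 (learning rate 0.1, discount factor 0.8). The function name `q_update` is hypothetical and is not part of the original code.

```python
def q_update(q_sa, reward, max_next_q, alpha=0.1, gamma=0.8):
    """One application of update rule (1) for a single state-action pair."""
    td_target = reward + gamma * max_next_q      # r + discounted best next value
    return q_sa + alpha * (td_target - q_sa)     # move Q a fraction alpha toward it

# First step into the exit: Q = 0, reward = 10, next-state values all 0.
print(q_update(0.0, 10, 0.0))    # 1.0
# First step into the trap: reward = -10.
print(q_update(0.0, -10, 0.0))   # -1.0
```

Each update moves the stored value only a tenth of the way toward the target, so the value of stepping into the exit approaches 10 gradually over many visits.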
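All right, time to implement it:

    import random

    # Note: .ix was removed in pandas 1.0; .loc is the label-based replacement.
    # 300 episodes in total
    for i in range(300):
        # 0. Start from the leftmost position, state 0 (not required)
        current_state = 0
        #current_state = random.choice(states)
        while current_state != states[-1]:
            # 1. Among the valid actions in the current state, pick one
            #    randomly (explore) or greedily (exploit)
            if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
                current_action = random.choice(get_valid_actions(current_state))
            else:
                current_action = q_table.loc[current_state].idxmax()  # exploit (greedy)
            # 2. Take the action and get the next state (position)
            next_state = get_next_state(current_state, current_action)
            # 3. Get the Q values of the next state; the maximum is taken below
            next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
            # 4. Update the Q value of the current state-action pair with the Bellman equation
            q_table.loc[current_state, current_action] += alpha * (
                rewards[next_state] + gamma * next_state_q_values.max()
                - q_table.loc[current_state, current_action])
            # 5. Move to the next state (position)
            current_state = next_state

    print('\nq_table:')
    print(q_table)

As you can see, this is exactly the same as the code in Example 1, word for word!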
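6. Environment and its update
The environment here seems to need a GUI, which is a bit of a hassle, and I do not know how to implement it on the command line. So I am leaving it out for now.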
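For what it is worth, a command-line view of the environment does not have to be elaborate. The sketch below is one possible approach, not the author's: it returns the maze as text and marks the explorer's cell. The function name `render` and the symbols S, X, E and o are my own choices.

```python
def render(state, n_cols=2):
    """Return the 2*2 maze as text, marking the explorer's cell with 'o'."""
    labels = ['S', ' ', 'X', 'E']  # S: start, X: trap, E: exit (illustrative symbols)
    cells = ['o' if s == state else labels[s] for s in range(4)]
    rows = [cells[i:i + n_cols] for i in range(0, 4, n_cols)]
    border = '-' * (4 * n_cols + 1)
    lines = [border]
    for row in rows:
        lines.append('| ' + ' | '.join(row) + ' |')
        lines.append(border)
    return '\n'.join(lines)

print(render(1))  # explorer in the top-right cell
```

Calling `print(render(current_state))` at the end of each step of the training loop, perhaps with a short `time.sleep`, would give a crude animation.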
7. Complete code

    '''
    The simplest maze: four cells
    ---------------
    | start |     |
    ---------------
    | die   | end |
    ---------------
    Each cell is a state; each has 4 actions: up, down, left, right.

    Author: hhh5460
    Date: 20181217
    '''
    import random

    import pandas as pd

    epsilon = 0.9   # greediness
    alpha = 0.1     # learning rate
    gamma = 0.8     # discount factor for future rewards

    states = range(4)            # four states: 0, 1, 2, 3
    actions = list('udlr')       # four actions: up, down, left, right; 'n' (stay) could be added too
    rewards = [0, 0, -10, 10]    # +10 for reaching state 3 (exit), -10 for state 2 (trap), 0 otherwise

    # Initialised with floats (0.0) so that later fractional updates keep the column dtype
    q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states],
                           index=states, columns=actions)

    def get_next_state(state, action):
        '''Return the next state after taking `action` in `state`.'''
        # u, d, l, r, n = -2, +2, -1, +1, 0
        if state % 2 != 1 and action == 'r':     # not in the last column: can move right (+1)
            next_state = state + 1
        elif state % 2 != 0 and action == 'l':   # not in the first column: can move left (-1)
            next_state = state - 1
        elif state // 2 != 1 and action == 'd':  # not in the last row: can move down (+2)
            next_state = state + 2
        elif state // 2 != 0 and action == 'u':  # not in the first row: can move up (-2)
            next_state = state - 2
        else:
            next_state = state
        return next_state

    def get_valid_actions(state):
        '''Return the valid actions in the given state.

        Earlier, reward-based variant (no longer used):
            global reward
            valid_actions = reward.ix[state, reward.ix[state]!=0].index
            return valid_actions
        '''
        # Independent of reward!
        valid_actions = set(actions)
        if state % 2 == 1:                  # last column:
            valid_actions -= set(['r'])     #   cannot move right
        if state % 2 == 0:                  # first column:
            valid_actions -= set(['l'])     #   cannot move left
        if state // 2 == 1:                 # last row:
            valid_actions -= set(['d'])     #   cannot move down
        if state // 2 == 0:                 # first row:
            valid_actions -= set(['u'])     #   cannot move up
        return list(valid_actions)

    # Note: .ix was removed in pandas 1.0; .loc is the label-based replacement.
    # 300 episodes in total
    for i in range(300):
        # 0. Start from the leftmost position, state 0 (not required)
        current_state = 0
        #current_state = random.choice(states)
        while current_state != states[-1]:
            # 1. Among the valid actions in the current state, pick one
            #    randomly (explore) or greedily (exploit)
            if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
                current_action = random.choice(get_valid_actions(current_state))
            else:
                current_action = q_table.loc[current_state].idxmax()  # exploit (greedy)
            # 2. Take the action and get the next state (position)
            next_state = get_next_state(current_state, current_action)
            # 3. Get the Q values of the next state; the maximum is taken below
            next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
            # 4. Update the Q value of the current state-action pair with the Bellman equation
            q_table.loc[current_state, current_action] += alpha * (
                rewards[next_state] + gamma * next_state_q_values.max()
                - q_table.loc[current_state, current_action])
            # 5. Move to the next state (position)
            current_state = next_state

    print('\nq_table:')
    print(q_table)
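Once a Q table has been learned, the path it encodes can be read off by always following the highest-valued action. The sketch below illustrates this on hand-made numbers; `greedy_path` and `example_q` are hypothetical names, and the values are invented for illustration rather than taken from a training run.

```python
def get_next_state(state, action):
    """Same transition rule as in the post: 2*2 grid, states numbered row by row."""
    if action == 'r' and state % 2 == 0:
        return state + 1
    if action == 'l' and state % 2 == 1:
        return state - 1
    if action == 'd' and state // 2 == 0:
        return state + 2
    if action == 'u' and state // 2 == 1:
        return state - 2
    return state

def greedy_path(q, start=0, goal=3, max_steps=10):
    """Follow the highest-valued action from `start` until `goal` (or give up)."""
    path, state = [start], start
    for _ in range(max_steps):
        if state == goal:
            break
        action = max(q[state], key=q[state].get)  # greedy choice in this state
        state = get_next_state(state, action)
        path.append(state)
    return path

# Hand-made values for illustration only, NOT the output of a real training run.
example_q = {
    0: {'d': -10.0, 'r': 8.0},
    1: {'d': 10.0, 'l': 5.1},
    2: {'u': 5.1, 'r': 10.0},
    3: {'u': 0.0, 'l': 0.0},
}
print(greedy_path(example_q))  # [0, 1, 3]
```

With the pandas version, `q_table.to_dict('index')` should give a dictionary of the same {state: {action: value}} shape, although it also contains the never-updated entries for invalid actions.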
8. Result
9. Addendum
I also wrote a numpy version, which is an order of magnitude faster than the pandas version! The code is as follows:

    '''
    The simplest maze: four cells
    ---------------
    | start |     |
    ---------------
    | die   | end |
    ---------------
    Each cell is a state; each has 5 actions: up, down, left, right, stay.
    '''
    # Author: hhh5460
    # Date: 20181218

    import numpy as np

    epsilon = 0.9   # greediness
    alpha = 0.1     # learning rate (not used by the update in this version)
    gamma = 0.8     # discount factor for future rewards

    states = range(4)            # four states: 0, 1, 2, 3
    actions = list('udlrn')      # five actions: up, down, left, right, stay
    rewards = [0, 0, -10, 10]    # +10 for reaching state 3 (exit), -10 for state 2 (trap), 0 otherwise

    # Labelling the columns of a numpy array, see https://cloud.tencent.com/developer/ask/72790
    q_table = np.zeros(shape=(4, ),  # Pitfall 2: the shape must not be (4, 5) here!
                       dtype=list(zip(actions, ['float']*5)))
                       #dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])
                       #dtype={'names':actions, 'formats':[float]*5})

    def get_next_state(state, action):
        '''Return the next state after taking `action` in `state`.'''
        # u, d, l, r, n = -2, +2, -1, +1, 0
        if state % 2 != 1 and action == 'r':     # not in the last column: can move right (+1)
            next_state = state + 1
        elif state % 2 != 0 and action == 'l':   # not in the first column: can move left (-1)
            next_state = state - 1
        elif state // 2 != 1 and action == 'd':  # not in the last row: can move down (+2)
            next_state = state + 2
        elif state // 2 != 0 and action == 'u':  # not in the first row: can move up (-2)
            next_state = state - 2
        else:
            next_state = state
        return next_state

    def get_valid_actions(state):
        '''Return the valid actions in the given state. Independent of reward!'''
        valid_actions = set(actions)  # ['u','d','l','r','n']
        if state % 2 == 1:                  # last column:
            valid_actions -= set(['r'])     #   cannot move right
        if state % 2 == 0:                  # first column:
            valid_actions -= set(['l'])     #   cannot move left
        if state // 2 == 1:                 # last row:
            valid_actions -= set(['d'])     #   cannot move down
        if state // 2 == 0:                 # first row:
            valid_actions -= set(['u'])     #   cannot move up
        return list(valid_actions)

    for i in range(1000):
        #current_state = states[0]  # fixed start
        current_state = np.random.choice(states, 1)[0]
        while current_state != 3:
            # q_table[current_state] is a numpy.void, so it has to be converted like this!
            if (np.random.uniform() > epsilon) or ((np.array(list(q_table[current_state])) == 0).all()):
                current_action = np.random.choice(get_valid_actions(current_state), 1)[0]
            else:
                current_action = actions[np.array(list(q_table[current_state])).argmax()]
            next_state = get_next_state(current_state, current_action)
            next_state_q_values = [q_table[next_state][action]
                                   for action in get_valid_actions(next_state)]
            q_table[current_state][current_action] = rewards[next_state] + gamma * max(next_state_q_values)
            current_state = next_state

    print('Final Q-table:')
    print(q_table)
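The structured dtype above is what forces the `np.array(list(...))` conversion on every row access. One alternative, which the original post does not use, is a plain (4, 5) float array plus a dictionary that maps action letters to column indices. The sketch below only illustrates the indexing; `action_index` is a name I made up.

```python
import numpy as np

actions = list('udlrn')
action_index = {a: i for i, a in enumerate(actions)}  # 'u'->0, ..., 'n'->4

q_table = np.zeros((4, len(actions)))  # rows: states, columns: actions

# Write two values by action letter (numbers are arbitrary, for illustration).
q_table[0, action_index['r']] = 8.0
q_table[0, action_index['d']] = -10.0

best = actions[q_table[0].argmax()]   # greedy action in state 0
all_zero = (q_table[1] == 0).all()    # row 1 is still untouched
print(best, all_zero)
```

Rows of a plain array support `argmax()` and elementwise comparison directly, at the cost of one dictionary lookup per access by action letter.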
10. Addendum 2: a three-dimensional Q table!
After a lot of trial and error, I finally got a three-dimensional Q table working. The code is as follows:

    '''
    The simplest maze: four cells
    ---------------
    | start |     |
    ---------------
    | die   | end |
    ---------------
    Each cell is a state; each has 4 actions: up, down, left, right.
    '''
    '''Three-dimensional Q table version!'''
    # Author: hhh5460
    # Date: 20181218

    import random  # np.random.choice cannot pick from a list of tuples!

    import numpy as np

    epsilon = 0.9   # greediness
    alpha = 0.1     # learning rate
    gamma = 0.8     # discount factor for future rewards

    states = [(0,0),(0,1),(1,0),(1,1)]        # state set: four positions
    actions = [(-1,0),(1,0),(0,-1),(0,1)]     # action set: up, down, left, right
    rewards = [[  0.,  0.],                   # reward set
               [-10., 10.]]

    # q_table is three-dimensional; note that the action is the third dimension!
    # The innermost [0.,0.,0.,0.] holds the Q values of the four actions
    # (up, down, left, right) for one state (cell).
    q_table = np.array([[[0.,0.,0.,0.],[0.,0.,0.,0.]],
                        [[0.,0.,0.,0.],[0.,0.,0.,0.]]])

    def get_next_state(state, action):
        '''Return the next state after taking `action` in `state`.'''
        if ((state[1] == 1 and action == (0,1)) or    # last column, moving right
            (state[1] == 0 and action == (0,-1)) or   # first column, moving left
            (state[0] == 1 and action == (1,0)) or    # last row, moving down
            (state[0] == 0 and action == (-1,0))):    # first row, moving up
            next_state = state
        else:
            next_state = (state[0] + action[0], state[1] + action[1])
        return next_state

    def get_valid_actions(state):
        '''Return the valid actions in the given state.'''
        valid_actions = []
        if state[1] < 1:   # not in the last column: can move right
            valid_actions.append((0,1))
        if state[1] > 0:   # not in the first column: can move left
            valid_actions.append((0,-1))
        if state[0] < 1:   # not in the last row: can move down
            valid_actions.append((1,0))
        if state[0] > 0:   # not in the first row: can move up
            valid_actions.append((-1,0))
        return valid_actions

    # 1000 episodes in total
    for i in range(1000):
        # 0. Start from the top-left position (not required)
        current_state = (0,0)
        #current_state = random.choice(states)
        #current_state = tuple(np.random.randint(2, size=2))
        while current_state != states[-1]:
            # 1. Among the valid actions in the current state, pick one
            #    randomly (explore) or greedily (exploit)
            if (np.random.uniform() > epsilon) or ((q_table[current_state[0],current_state[1]] == 0).all()):  # explore
                current_action = random.choice(get_valid_actions(current_state))
            else:
                current_action = actions[q_table[current_state[0],current_state[1]].argmax()]  # exploit (greedy)
            # 2. Take the action and get the next state (position)
            next_state = get_next_state(current_state, current_action)
            # 3. Get the Q values of the next state; the maximum is taken below
            next_state_q_values = [q_table[next_state[0],next_state[1],actions.index(action)]
                                   for action in get_valid_actions(next_state)]
            # 4. Update the Q value of the current state-action pair with the Bellman equation
            q_table[current_state[0],current_state[1],actions.index(current_action)] += alpha * (
                rewards[next_state[0]][next_state[1]] + gamma * max(next_state_q_values)
                - q_table[current_state[0],current_state[1],actions.index(current_action)])
            # 5. Move to the next state (position)
            current_state = next_state

    print('\nq_table:')
    print(q_table)
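11. Homework
If you happened to come across this post, please try implementing a larger maze and hand in your homework in the comments. The maze is as follows:
(Image source: https://jizhi.im/blog/post/intro_q_learning)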
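As a starting point for the larger maze, the two helper functions can be generalised from the hard-coded 2*2 case to any rectangular grid. This is only a sketch under assumptions: the extra parameters `n_rows` and `n_cols` are my own additions, the 3*4 size in the example is arbitrary because the picture is not reproduced here, and any inner walls or extra traps in the real maze are not handled.

```python
def get_next_state(state, action, n_rows, n_cols):
    """Next state on an n_rows * n_cols grid; a move into a wall leaves the state unchanged."""
    row, col = divmod(state, n_cols)
    if action == 'r' and col < n_cols - 1:
        return state + 1
    if action == 'l' and col > 0:
        return state - 1
    if action == 'd' and row < n_rows - 1:
        return state + n_cols
    if action == 'u' and row > 0:
        return state - n_cols
    return state

def get_valid_actions(state, n_rows, n_cols):
    """Actions that actually move the explorer on an n_rows * n_cols grid."""
    return [a for a in 'udlr' if get_next_state(state, a, n_rows, n_cols) != state]

# On a 3*4 grid, state 5 sits at row 1, col 1 and can move in all four directions.
print(get_valid_actions(5, 3, 4))   # ['u', 'd', 'l', 'r']
# State 0 is the top-left corner.
print(get_valid_actions(0, 3, 4))   # ['d', 'r']
```

With `n_rows = n_cols = 2` these reduce to the behaviour of the functions used throughout the post, so the training loop itself should need little change beyond the size of the Q table and the reward list.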