[Reinforcement Learning] Implementing Q-learning in Python: Example 2

Reposted on 2018-12-18 00:43. Tags: python, reinforcement learning, q-learning

Author: hhh5460

Original post: https://www.cnblogs.com/hhh5460/p/10134855.html

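Problem setting

A 2*2 maze with one entrance, one exit, and one trap, as shown in the figure.

(Image source: https://jizhi.im/blog/post/intro_q_learning)

This is a two-dimensional problem, but we can flatten it into a one-dimensional one by numbering the cells 0 to 3 row by row, so that the entrance is state 0, the trap is state 2, and the exit is state 3.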
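
The flattening can be sketched as follows, assuming row-major numbering and a grid that is `N_COLS = 2` cells wide. The helper names `to_index` and `to_cell` are illustrative and do not appear in the original code.

```python
N_COLS = 2  # assumed grid width for the 2*2 maze

def to_index(row, col, n_cols=N_COLS):
    """Map a (row, col) cell to a single state number, row by row."""
    return row * n_cols + col

def to_cell(state, n_cols=N_COLS):
    """Inverse mapping: state number back to (row, col)."""
    return divmod(state, n_cols)

for state in range(4):
    print(state, to_cell(state))
```

With this numbering, moving left or right changes the state by 1 and moving up or down changes it by 2, which is what the code later in this post relies on.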

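Thanks to https://jizhi.im/blog/post/intro_q_learning. I had read countless articles and code samples online without grasping the key point. Only after seeing the three matrices in that post (reward, transition_matrix, and valid_actions) did I really understand how the Q-learning algorithm operates and how to implement it.

Here is Kaiser's code first, since it makes the Q-learning algorithm easy to grasp. I have polished it slightly: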

import numpy as np
import random

'''
A 2*2 maze
-------------------
| entrance |      |
-------------------
|   trap   | exit |
-------------------
# Source: https://jizhi.im/blog/post/intro_q_learning

Each cell is a state with five actions: up, down, left, right, and stay

Task: learn to find a path through the maze
'''

gamma = 0.7

# A reward of 0 marks an invalid move (compare valid_actions below)
#                    u,   d,   l,  r,  n
reward = np.array([( 0, -10,   0, -1, -1), # state 0
                   ( 0,  10,  -1,  0, -1), # state 1
                   (-1,   0,   0, 10, -1), # state 2
                   (-1,   0, -10,  0, 10)],# state 3
                   dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])

q_matrix = np.zeros((4, ),
                    dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])

# A next state of -1 marks an invalid move
transition_matrix = np.array([(-1,  2, -1,  1, 0), # e.g. state 0, action 'd' --> next_state 2
                              (-1,  3,  0, -1, 1),
                              ( 0, -1, -1,  3, 2),
                              ( 1, -1,  2, -1, 3)],
                              dtype=[('u',int),('d',int),('l',int),('r',int),('n',int)])

valid_actions = np.array([['d', 'r', 'n'], # state 0
                          ['d', 'l', 'n'], # state 1
                          ['u', 'r', 'n'], # state 2
                          ['u', 'l', 'n']])# state 3


for i in range(1000):
    current_state = 0
    while current_state != 3:
        current_action = random.choice(valid_actions[current_state]) # exploration only, no exploitation
        
        next_state = transition_matrix[current_state][current_action]
        next_reward = reward[current_state][current_action]
        next_q_values = [q_matrix[next_state][next_action] for next_action in valid_actions[next_state]] # the maximum is taken below
        
        q_matrix[current_state][current_action] = next_reward + gamma * max(next_q_values) # Bellman update (simplified: no learning rate)
        current_state = next_state
        current_state = next_state

print('Final Q-table:')
print(q_matrix)


0. Parameters

epsilon = 0.9   # greediness: probability of exploiting rather than exploring
alpha = 0.1     # learning rate
gamma = 0.8     # discount factor
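
To make the role of `epsilon` concrete, here is a minimal sketch of the action choice, assuming a Q-table row stored as a plain dict. The function name `choose_action` and the dict layout are illustrative assumptions. Note that in this post `epsilon` is the probability of acting greedily, so exploration happens with probability `1 - epsilon`.

```python
import random

def choose_action(q_row, valid_actions, epsilon=0.9, rng=random):
    """Explore with probability 1 - epsilon, otherwise exploit.

    q_row maps action -> Q value (a hypothetical plain-dict Q-table row).
    """
    untried = all(q_row[a] == 0 for a in valid_actions)
    if rng.random() > epsilon or untried:
        return rng.choice(valid_actions)                 # explore
    return max(valid_actions, key=lambda a: q_row[a])    # exploit

q_row = {'u': 0.0, 'd': -1.0, 'l': 0.0, 'r': 0.8}
print(choose_action(q_row, ['d', 'r'], epsilon=1.0))     # always greedy: prints r
```

The extra "untried" test forces a random choice while every candidate value is still zero, so the greedy branch never has to break a tie on an untouched row.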

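1. State set

The explorer's states are the positions it can reach, and there are four of them. So we define

states = range(4) # state set, 0 through 3

So how do we determine the next state after taking an action in a given state?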

def get_next_state(state, action):
    '''Return the next state after taking an action in a state'''
    #u,d,l,r,n = -2,+2,-1,+1,0
    if state % 2 != 1 and action == 'r':    # not in the last column: can move right (+1)
        next_state = state + 1
    elif state % 2 != 0 and action == 'l':  # not in the first column: can move left (-1)
        next_state = state -1
    elif state // 2 != 1 and action == 'd': # not in the last row: can move down (+2)
        next_state = state + 2
    elif state // 2 != 0 and action == 'u': # not in the first row: can move up (-2)
        next_state = state - 2
    else:
        next_state = state
    return next_state
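
The same transition rule can also be written in (row, col) terms, which may be easier to check by eye. This is an alternative sketch, not the author's function: the names `next_state`, `MOVES`, `N_ROWS`, and `N_COLS` are assumptions, and it treats any move off the grid as "stay in place", as the function above does.

```python
N_ROWS, N_COLS = 2, 2
MOVES = {'u': (-1, 0), 'd': (1, 0), 'l': (0, -1), 'r': (0, 1), 'n': (0, 0)}

def next_state(state, action):
    """Return the state reached from `state`; moves off the grid leave it unchanged."""
    row, col = divmod(state, N_COLS)
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N_ROWS and 0 <= new_col < N_COLS:
        return new_row * N_COLS + new_col
    return state

# Print the full transition table for the four movement actions.
for s in range(4):
    print(s, {a: next_state(s, a) for a in 'udlr'})
```

For valid moves the printed table agrees with Kaiser's transition_matrix, for example state 0 with action 'd' leads to state 2.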

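2. Action set

In every state the explorer can only move up, down, left, or right, four actions in all. So we define

actions = ['u', 'd', 'l', 'r'] # action set: up, down, left, right; an action 'n' (stay) could also be added

So how do we determine all the valid actions in a given state (position)?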

def get_valid_actions(state):
    '''Return the valid actions in the current state (independent of reward!)'''
    global actions # ['u','d','l','r'], plus 'n' if it was added
    
    valid_actions = set(actions)
    if state % 2 == 1:                              # last column:
        valid_actions = valid_actions - set(['r'])  # remove "right"
    if state % 2 == 0:                              # first column:
        valid_actions = valid_actions - set(['l'])  # remove "left"
    if state // 2 == 1:                             # last row:
        valid_actions = valid_actions - set(['d'])  # remove "down"
    if state // 2 == 0:                             # first row:
        valid_actions = valid_actions - set(['u'])  # remove "up"
    return list(valid_actions)
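
One detail worth knowing: converting a set to a list gives no guaranteed order, so the list returned above may be ordered differently from run to run. That does not affect correctness here, but a variant that keeps the order of the action list can make runs easier to compare. The sketch below is such a variant under the same row-major 2*2 layout; `valid_actions_ordered` and `ACTIONS` are hypothetical names.

```python
ACTIONS = ['u', 'd', 'l', 'r']

def valid_actions_ordered(state, n_rows=2, n_cols=2):
    """Valid moves for `state`, listed in the same order as ACTIONS."""
    row, col = divmod(state, n_cols)
    blocked = set()
    if row == 0:
        blocked.add('u')          # first row: cannot move up
    if row == n_rows - 1:
        blocked.add('d')          # last row: cannot move down
    if col == 0:
        blocked.add('l')          # first column: cannot move left
    if col == n_cols - 1:
        blocked.add('r')          # last column: cannot move right
    return [a for a in ACTIONS if a not in blocked]

for s in range(4):
    print(s, valid_actions_ordered(s))
```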

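3. Reward set

The explorer receives a reward when it arrives at a state (position). So we define

rewards = [0,0,-10,10] # reward set: arriving at position 3 (exit) gives 10, position 2 (trap) gives -10, all others 0

Getting the reward for a given state is then a direct lookup, rewards[state], so no extra function is needed.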

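4. Q table

This is the most important part. A Q table records state-action values (Q values). Most Q tables are two-dimensional, with one row per state and one column per action.

(Note that three-dimensional Q tables also exist.)

So we define

q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states],
                       index=states, columns=actions)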
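
As a small illustration of how such a table is read and written, the sketch below builds the same 4*4 table and touches one cell with `.loc`. The value 1.0 is an arbitrary illustrative number, not a training result.

```python
import pandas as pd

states = range(4)
actions = ['u', 'd', 'l', 'r']

# Float zeros, so that later fractional updates fit the column dtype.
q_table = pd.DataFrame(0.0, index=states, columns=actions)

q_table.loc[1, 'd'] = 1.0                  # write one state-action value
print(q_table.loc[1, 'd'])                 # read it back
print(q_table.loc[1, ['d', 'l']].max())    # best value among selected actions
print(q_table.loc[1].idxmax())             # label of the best action in state 1
```

Row selection with a list of column labels is what lets the training loop look only at the valid actions of a state.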

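5. The Q-learning algorithm

(Figure: pseudocode of the Q-learning algorithm)

Q values are updated according to the Bellman equation, where $\alpha$ is the learning rate and $\gamma$ is the discount factor (alpha and gamma in section 0):

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t,a_t)\right] \tag{1}$$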

All right, time to implement it:

# Run 300 episodes in total
for i in range(300):
    # 0. Start from the top-left position (not required)
    current_state = 0
    #current_state = random.choice(states)
    while current_state != states[-1]:
        # 1. Among the valid actions in the current state, pick one randomly (or greedily) as the current action
        #    (.loc replaces .ix, which was removed in pandas 1.0)
        if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
            current_action = random.choice(get_valid_actions(current_state))
        else:
            current_action = q_table.loc[current_state, get_valid_actions(current_state)].idxmax() # exploit (greedy), valid actions only
        # 2. Take the current action and get the next state (position)
        next_state = get_next_state(current_state, current_action)
        # 3. Get the Q values of the next state; their maximum is used below
        next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
        # 4. Update the Q value of the current state-action pair according to the Bellman equation
        q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
        # 5. Move to the next state (position)
        current_state = next_state

print('\nq_table:')
print(q_table)

As you can see, the training loop is essentially the same as the one in Example 1.
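
Once training finishes, the learned path can be read off the table by taking the best valid action in each state. The sketch below is a self-contained variant that swaps the DataFrame for a nested dict and fixes the random seed so the run is repeatable; those two choices, and the names `valid`, `moves`, and `policy`, are assumptions made for this illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable (an assumption for this sketch)
epsilon, alpha, gamma = 0.9, 0.1, 0.8
actions = ['u', 'd', 'l', 'r']
rewards = [0, 0, -10, 10]
moves = {'u': -2, 'd': 2, 'l': -1, 'r': 1}

def valid(state):
    """Moves that stay inside the 2*2 grid, in the order of `actions`."""
    blocked = {'r' if state % 2 == 1 else 'l', 'd' if state // 2 == 1 else 'u'}
    return [a for a in actions if a not in blocked]

# Plain nested dict standing in for the DataFrame Q table.
q = {s: {a: 0.0 for a in actions} for s in range(4)}

for _ in range(300):
    state = 0
    while state != 3:
        options = valid(state)
        if random.random() > epsilon or all(q[state][a] == 0 for a in options):
            action = random.choice(options)                   # explore
        else:
            action = max(options, key=lambda a: q[state][a])  # exploit
        nxt = state + moves[action]
        best_next = max(q[nxt][a] for a in valid(nxt))
        q[state][action] += alpha * (rewards[nxt] + gamma * best_next - q[state][action])
        state = nxt

# Greedy policy: the best valid action in each non-terminal state.
policy = {s: max(valid(s), key=lambda a: q[s][a]) for s in range(3)}
print(policy)
```

From the entrance the greedy policy should go right and then down, which avoids the trap in state 2.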

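6. The environment and its updates

The environment here seems to require a GUI, which is a bit of a hassle, and I do not know how to implement it on the command line. So I am setting it aside for now.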
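
One low-tech possibility on the command line is to print the grid as text after each step. The sketch below assumes the row-major state numbering used in this post; the function `render` and the cell symbols are hypothetical choices, not part of the original code.

```python
SYMBOLS = {0: 'S', 2: 'X', 3: 'E'}  # start, trap, exit (assumed symbols)

def render(agent_state, n_rows=2, n_cols=2):
    """Return the maze as text, marking the explorer's cell with 'o'."""
    lines = []
    for row in range(n_rows):
        cells = []
        for col in range(n_cols):
            state = row * n_cols + col
            cells.append('o' if state == agent_state else SYMBOLS.get(state, '.'))
        lines.append(' '.join(cells))
    return '\n'.join(lines)

for state in [0, 1, 3]:   # an example path: right, then down
    print(render(state))
    print()
```

Calling `render(current_state)` inside the training loop, perhaps with a short `time.sleep`, would give a rough animation without any GUI toolkit.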

7. Complete code

'''
The simplest maze: four cells
---------------
| start |     |
---------------
|  die  | end |
---------------

Each cell is a state with four actions: up, down, left, right

Author: hhh5460
Date: 20181217
'''

import pandas as pd
import random

epsilon = 0.9   # greediness: probability of exploiting rather than exploring
alpha = 0.1     # learning rate
gamma = 0.8     # discount factor

states = range(4)       # four states: 0, 1, 2, 3
actions = list('udlr')  # four actions: up, down, left, right; 'n' (stay) could also be added
rewards = [0,0,-10,10]  # reward set: arriving at position 3 (exit) gives 10, position 2 (trap) gives -10, all others 0


# Float zeros, so that fractional updates fit the column dtype
q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states],
                       index=states, columns=actions)

def get_next_state(state, action):
    '''Return the next state after taking an action in a state'''
    #u,d,l,r,n = -2,+2,-1,+1,0
    if state % 2 != 1 and action == 'r':    # not in the last column: can move right (+1)
        next_state = state + 1
    elif state % 2 != 0 and action == 'l':  # not in the first column: can move left (-1)
        next_state = state -1
    elif state // 2 != 1 and action == 'd': # not in the last row: can move down (+2)
        next_state = state + 2
    elif state // 2 != 0 and action == 'u': # not in the first row: can move up (-2)
        next_state = state - 2
    else:
        next_state = state
    return next_state
        

def get_valid_actions(state):
    '''Return the valid actions in the current state (independent of reward!)'''
    global actions
    valid_actions = set(actions)
    if state % 2 == 1:                              # last column:
        valid_actions = valid_actions - set(['r'])  # no "right"
    if state % 2 == 0:                              # first column:
        valid_actions = valid_actions - set(['l'])  # no "left"
    if state // 2 == 1:                             # last row:
        valid_actions = valid_actions - set(['d'])  # no "down"
    if state // 2 == 0:                             # first row:
        valid_actions = valid_actions - set(['u'])  # no "up"
    return list(valid_actions)
    
    
# Run 300 episodes in total
for i in range(300):
    # 0. Start from the top-left position (not required)
    current_state = 0
    #current_state = random.choice(states)
    while current_state != states[-1]:
        # 1. Among the valid actions in the current state, pick one randomly (or greedily) as the current action
        #    (.loc replaces .ix, which was removed in pandas 1.0)
        if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
            current_action = random.choice(get_valid_actions(current_state))
        else:
            current_action = q_table.loc[current_state, get_valid_actions(current_state)].idxmax() # exploit (greedy), valid actions only
        # 2. Take the current action and get the next state (position)
        next_state = get_next_state(current_state, current_action)
        # 3. Get the Q values of the next state; their maximum is used below
        next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
        # 4. Update the Q value of the current state-action pair according to the Bellman equation
        q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
        # 5. Move to the next state (position)
        current_state = next_state

print('\nq_table:')
print(q_table)

8. Results

9. Addendum

I also wrote a numpy version, which is an order of magnitude faster than the pandas version! The code follows.

'''
The simplest maze: four cells
---------------
| start |     |
---------------
|  die  | end |
---------------

Each cell is a state with five actions: up, down, left, right, and stay
'''

# Author: hhh5460
# Date: 20181218

import numpy as np


epsilon = 0.9   # greediness: probability of exploiting rather than exploring
alpha = 0.1     # learning rate (not used by the simplified update below)
gamma = 0.8     # discount factor

states = range(4)       # four states: 0, 1, 2, 3
actions = list('udlrn') # five actions: up, down, left, right, stay
rewards = [0,0,-10,10]  # reward set: arriving at position 3 (exit) gives 10, position 2 (trap) gives -10, all others 0


# Labeling the columns of a numpy array; see https://cloud.tencent.com/developer/ask/72790
q_table = np.zeros(shape=(4, ), # pitfall 2: this must not be (4,5)! Each record already holds 5 fields
                   dtype=list(zip(actions, ['float']*5)))
                   #dtype=[('u',float),('d',float),('l',float),('r',float),('n',float)])
                   #dtype={'names':actions, 'formats':[float]*5})

def get_next_state(state, action):
    '''Return the next state after taking an action in a state'''
    #u,d,l,r,n = -2,+2,-1,+1,0
    if state % 2 != 1 and action == 'r':    # not in the last column: can move right (+1)
        next_state = state + 1
    elif state % 2 != 0 and action == 'l':  # not in the first column: can move left (-1)
        next_state = state -1
    elif state // 2 != 1 and action == 'd': # not in the last row: can move down (+2)
        next_state = state + 2
    elif state // 2 != 0 and action == 'u': # not in the first row: can move up (-2)
        next_state = state - 2
    else:
        next_state = state
    return next_state
        

def get_valid_actions(state):
    '''Return the valid actions in the current state (independent of reward!)'''
    global actions # ['u','d','l','r','n']
    
    valid_actions = set(actions)
    if state % 2 == 1:                              # last column:
        valid_actions = valid_actions - set(['r'])  # remove "right"
    if state % 2 == 0:                              # first column:
        valid_actions = valid_actions - set(['l'])  # remove "left"
    if state // 2 == 1:                             # last row:
        valid_actions = valid_actions - set(['d'])  # remove "down"
    if state // 2 == 0:                             # first row:
        valid_actions = valid_actions - set(['u'])  # remove "up"
    return list(valid_actions)
    
    
for i in range(1000):
    #current_state = states[0] # fixed start
    current_state = np.random.choice(states,1)[0] # random start
    while current_state != 3:
        # q_table[current_state] is a numpy.void record, so it is converted to an array before comparing
        if (np.random.uniform() > epsilon) or ((np.array(list(q_table[current_state])) == 0).all()):  # explore
            current_action = np.random.choice(get_valid_actions(current_state), 1)[0]
        else:
            # exploit (greedy): best Q value among the valid actions
            current_action = max(get_valid_actions(current_state), key=lambda a: q_table[current_state][a])
        next_state = get_next_state(current_state, current_action)
        next_state_q_values = [q_table[next_state][action] for action in get_valid_actions(next_state)]
        # simplified update without the learning rate (equivalent to alpha = 1)
        q_table[current_state][current_action] = rewards[next_state] + gamma * max(next_state_q_values)
        current_state = next_state
        
        
print('Final Q-table:')
print(q_table)
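
The shape pitfall mentioned in the code comes from how numpy structured arrays work: the five action values are fields inside each record, so the array itself only needs one record per state. The sketch below illustrates this on a fresh table; the value 10.0 is an arbitrary illustrative number.

```python
import numpy as np

actions = list('udlrn')
# One record per state; each record has five named float fields.
q_table = np.zeros(4, dtype=[(a, float) for a in actions])

print(q_table.shape)               # (4,): four records, not a 4*5 grid
q_table[1]['d'] = 10.0             # a record acts as a view, so this writes into q_table
print(q_table['d'])                # the whole 'd' field across all states
row = np.array(list(q_table[1]))   # unpack one record into a plain float array
print(actions[row.argmax()])       # best action label in state 1

# With shape (4, 5) every one of the 20 elements would itself hold five fields.
too_big = np.zeros((4, 5), dtype=q_table.dtype)
print(too_big.size)
```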


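10. Addendum 2: a three-dimensional Q table!

After repeated experiments, I finally worked out a three-dimensional version of the Q table. The code follows.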

'''
The simplest maze: four cells
---------------
| start |     |
---------------
|  die  | end |
---------------

Each cell is a state with four actions: up, down, left, right
'''

'''Three-dimensional Q table version!'''

# Author: hhh5460
# Date: 20181218

import numpy as np
import random # np.random.choice cannot pick from a list of tuples!

epsilon = 0.9   # greediness: probability of exploiting rather than exploring
alpha = 0.1     # learning rate
gamma = 0.8     # discount factor

states = [(0,0),(0,1),(1,0),(1,1)] # state set: four positions as (row, col)
actions = [(-1,0),(1,0),(0,-1),(0,1)] # action set: up, down, left, right
rewards = [[  0., 0.],    # reward set, indexed by [row][col]
           [-10.,10.]]

# q_table is three-dimensional; note that the actions are on the third axis!
# The innermost [0.,0.,0.,0.] holds the Q values of the four actions (up, down, left, right) for one state (cell)
q_table = np.array([[[0.,0.,0.,0.],[0.,0.,0.,0.]],
                    [[0.,0.,0.,0.],[0.,0.,0.,0.]]])

def get_next_state(state, action):
    '''Return the next state after taking an action in a state'''
    if ((state[1] == 1 and action == (0,1))  or # last column, moving right
        (state[1] == 0 and action == (0,-1)) or # first column, moving left
        (state[0] == 1 and action == (1,0))  or # last row, moving down
        (state[0] == 0 and action == (-1,0))):  # first row, moving up
        next_state = state
    else:
        next_state = (state[0] + action[0], state[1] + action[1])
    return next_state
    
def get_valid_actions(state):
    '''Return the valid actions in the current state'''
    valid_actions = []
    if state[1] < 1:  # not in the last column: can move right
        valid_actions.append((0,1))
    if state[1] > 0:  # not in the first column: can move left
        valid_actions.append((0,-1))
    if state[0] < 1:  # not in the last row: can move down
        valid_actions.append((1,0))
    if state[0] > 0:  # not in the first row: can move up
        valid_actions.append((-1,0))
    return valid_actions

# Run 1000 episodes in total
for i in range(1000):
    # 0. Start from the top-left position (not required)
    current_state = (0,0)
    #current_state = random.choice(states)
    #current_state = tuple(np.random.randint(2, size=2))
    while current_state != states[-1]:
        # 1. Among the valid actions in the current state, pick one randomly (or greedily) as the current action
        if (np.random.uniform() > epsilon) or ((q_table[current_state[0],current_state[1]] == 0).all()):  # explore
            current_action = random.choice(get_valid_actions(current_state))
        else:
            # exploit (greedy): best Q value among the valid actions
            current_action = max(get_valid_actions(current_state), key=lambda a: q_table[current_state[0],current_state[1],actions.index(a)])
        # 2. Take the current action and get the next state (position)
        next_state = get_next_state(current_state, current_action)
        # 3. Get the Q values of the next state; their maximum is used below
        next_state_q_values = [q_table[next_state[0],next_state[1],actions.index(action)] for action in get_valid_actions(next_state)]
        # 4. Update the Q value of the current state-action pair according to the Bellman equation
        q_table[current_state[0],current_state[1],actions.index(current_action)] += alpha * (rewards[next_state[0]][next_state[1]] + gamma * max(next_state_q_values) - q_table[current_state[0],current_state[1],actions.index(current_action)])
        # 5. Move to the next state (position)
        current_state = next_state


print('\nq_table:')
print(q_table)
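
With the actions on the third axis, a single `argmax` over that axis gives the best action index for every cell at once. The sketch below shows the indexing on a small table filled with hypothetical values chosen for illustration; they are not output from the code above, and the list `names` is an added helper.

```python
import numpy as np

actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
names = ['u', 'd', 'l', 'r']                   # labels in the same order (illustrative)

# Hypothetical values for illustration only.
q_table = np.zeros((2, 2, 4))
q_table[0, 0] = [0.0, -2.0, 0.0, 8.0]
q_table[0, 1] = [0.0, 10.0, 6.4, 0.0]

print(q_table[0, 0, actions.index((0, 1))])    # Q value of "right" in cell (0, 0)
best = q_table.argmax(axis=2)                  # best action index for every cell
print(names[best[0, 0]], names[best[0, 1]])    # prints: r d
```

Keep in mind that a plain `argmax` considers all four actions, including moves into a wall, so for a policy it is safer to restrict the comparison to the valid actions of each cell.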


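11. Exercise

If you happen to read this post, try implementing a larger maze and share your solution in the comments. The maze is as follows:

(Image source: https://jizhi.im/blog/post/intro_q_learning)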
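
As a possible starting point for a larger maze, the state set and reward list can be derived from a text layout instead of being typed by hand. The 3*3 layout below is hypothetical and is not the maze from the figure; the names `LAYOUT`, `cell_reward`, `start_state`, and `exit_state` are likewise assumptions.

```python
LAYOUT = [
    "S..",
    ".X.",
    "..E",
]  # hypothetical 3*3 maze: S start, X trap, E exit, . empty

n_rows, n_cols = len(LAYOUT), len(LAYOUT[0])
cell_reward = {'S': 0, '.': 0, 'X': -10, 'E': 10}

cells = ''.join(LAYOUT)                    # row-major flattening, as in the 2*2 case
states = range(n_rows * n_cols)
rewards = [cell_reward[c] for c in cells]
start_state = cells.index('S')
exit_state = cells.index('E')

print(list(states))
print(rewards)
print(start_state, exit_state)
```

With a layout like this, the column tests in get_next_state and get_valid_actions would compare `state % n_cols` against 0 and `n_cols - 1`, the row tests would compare `state // n_cols` against 0 and `n_rows - 1`, and vertical moves would change the state by `n_cols` instead of 2.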
