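[Reinforcement Learning] Implementing Q-learning in Python: Example 1

Reposted article. Originally published 2018-12-17 21:23. Tags: python / reinforcement learning / q-learning

Author: hhh5460

Original URL: https://www.cnblogs.com/hhh5460/p/10134018.html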

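Problem setting

-o---T
# T is the treasure, o is the explorer's position

This time we use Q-learning to implement a small example. The environment is a one-dimensional world with treasure at its right end. Once the explorer reaches the treasure and gets a taste of the reward, it remembers how to get there; that remembered behavior is what it learns through reinforcement learning.

Q-learning is a method that records action values (Q values). Every action taken in a given state has a value Q(s, a); that is, the value of action a in state s is Q(s, a). In the explorer game above, s is the position of o. At every position the explorer can take two actions, left or right, and these are all of the explorer's available actions a.

Acknowledgement: the three passages above come from: https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-1-general-rl/

To solve this problem, a few things need to be worked out first: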

0. Parameters

epsilon = 0.9   # greediness: probability of taking the greedy action
alpha = 0.1     # learning rate
gamma = 0.8     # discount factor applied to future rewards
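
Note that epsilon here is the probability of acting greedily, so exploration happens with probability 1 - epsilon (many texts use the opposite convention). A minimal sketch of that selection rule follows; the helper name choose_action and the plain-dict Q row are assumptions for illustration, and unlike the article's loop this version restricts the greedy choice to the valid actions.

```python
import random

def choose_action(q_row, valid_actions, epsilon=0.9, rng=random):
    """Pick an action: greedy with probability epsilon, random otherwise.

    q_row is a hypothetical dict mapping action name -> Q value.
    """
    # With no information yet (all zeros), a greedy pick is meaningless: explore.
    all_zero = all(q_row[a] == 0 for a in valid_actions)
    if rng.random() > epsilon or all_zero:
        return rng.choice(valid_actions)                # explore
    return max(valid_actions, key=lambda a: q_row[a])   # exploit

# epsilon=1.0 never explores once the row holds a non-zero value
print(choose_action({'left': 0.1, 'right': 0.5}, ['left', 'right'], epsilon=1.0))
```

With epsilon = 0.9, roughly one step in ten is random once the Q values for a state are non-zero.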

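1. State set

The explorer's states are the positions it can occupy, and there are 6 of them. So we define

states = range(6) # state set, 0 through 5

How, then, do we determine the next state reached after taking an action in a given state?

def get_next_state(state, action):
    '''Return the next state after applying an action to a state'''
    # left, right = -1,+1 # true in general, but the first and last positions need care
    if action == 'right' and state != states[-1]: # every state except the last can move right (+1)
        next_state = state + 1
    elif action == 'left' and state != states[0]: # every state except the first can move left (-1)
        next_state = state - 1
    else:
        next_state = state
    return next_state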
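
The same transition rule can also be written as "move, then clamp to the track". The sketch below is an alternative formulation, not the author's code; MOVES, N_STATES and next_state_clamped are hypothetical names introduced for illustration.

```python
N_STATES = 6                       # positions 0..5, matching states = range(6)
MOVES = {'left': -1, 'right': +1}  # hypothetical lookup table of step sizes

def next_state_clamped(state, action, n_states=N_STATES):
    """Apply the move, then clamp the result into [0, n_states - 1]."""
    return min(max(state + MOVES.get(action, 0), 0), n_states - 1)

for s, a in [(0, 'left'), (0, 'right'), (5, 'right'), (3, 'left')]:
    print(s, a, '->', next_state_clamped(s, a))
```

An unknown action (for example 'none') maps to a step of 0, which matches the else branch of get_next_state.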

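2. Action set

In every state the explorer has only two possible actions, "left" or "right". So we define

actions = ['left', 'right'] # action set; an action 'none' (stay put) could also be added

How, then, do we determine all the valid actions in a given state (position)?

def get_valid_actions(state):
    '''Return the set of valid actions in the current state; independent of rewards!'''
    valid_actions = set(actions)        # start from ['left', 'right']
    if state == states[-1]:             # in the last state (position)
        valid_actions -= set(['right']) # remove the move to the right
    if state == states[0]:              # in the first state (position)
        valid_actions -= set(['left'])  # remove the move to the left
    return list(valid_actions)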
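
One detail worth knowing: converting a set of strings to a list does not guarantee a stable order between interpreter runs, because string hashing is randomized by default. That does not affect correctness here, but it can make a seeded random.choice irreproducible. A minimal order-preserving sketch, with valid_actions_ordered, ACTIONS and N_STATES as hypothetical names:

```python
ACTIONS = ['left', 'right']
N_STATES = 6

def valid_actions_ordered(state, n_states=N_STATES):
    """Return the legal actions, always in the fixed order of ACTIONS."""
    blocked = set()
    if state == 0:
        blocked.add('left')            # cannot move left from the first cell
    if state == n_states - 1:
        blocked.add('right')           # cannot move right from the last cell
    return [a for a in ACTIONS if a not in blocked]

print(valid_actions_ordered(0), valid_actions_ordered(2), valid_actions_ordered(5))
```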

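3. Reward set

The explorer receives a reward on arriving at each state (position). So we define

rewards = [0,0,0,0,0,1] # reward set: only the last position, where the treasure is, gives reward 1; all others give 0

Getting the reward for a state is then trivial: rewards[state]. It is a direct lookup by state, so no extra function is needed.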

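4. Q table

This is the most important part. A Q table records state-action values (Q values). Q tables are usually two-dimensional, with one row per state and one column per action.

(Note that 3-dimensional Q tables also exist.)

So we define

# initialize with floats (0.0) so that later float updates do not change the column dtype
q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states],
                       index=states, columns=actions)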
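
To make the layout concrete without pandas, here is a rough stand-in that stores the same table as a nested dict and prints it as a grid. The dict representation and the helper format_q_table are assumptions for illustration; the article itself uses the DataFrame above.

```python
states = range(6)
actions = ['left', 'right']

# Hypothetical plain-dict stand-in for the DataFrame: q[state][action]
q = {s: {a: 0.0 for a in actions} for s in states}

def format_q_table(q, actions):
    """Render the table as text: one row per state, one column per action."""
    lines = ['state ' + ' '.join(f'{a:>6}' for a in actions)]
    for s, row in q.items():
        lines.append(f'{s:>5} ' + ' '.join(f'{row[a]:6.2f}' for a in actions))
    return '\n'.join(lines)

print(format_q_table(q, actions))
```

Each cell q[s][a] plays the role of Q(s, a); all of them start at zero and are filled in during training.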

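5. The environment and its update

The only purpose of the environment is to let us watch the explorer's progress on screen, nothing more.

The environment is very simple: just the string '-----T'. When the explorer arrives at a state (position), the character at that position is replaced with 'o', and the whole string is printed again. So

def update_env(state):
    '''Update the environment and print it'''
    env = list('-----T')
    if state != states[-1]:   # the explorer is not drawn once it reaches the treasure
        env[state] = 'o'
    print('\r{}'.format(''.join(env)), end='')
    time.sleep(0.1)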
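
Separating the string construction from the printing and sleeping makes the display logic easy to check. A small sketch, where render is a hypothetical helper name:

```python
def render(state, track='-----T'):
    """Return the track with 'o' at the explorer's position.

    As in update_env, nothing is drawn once the explorer reaches the last cell.
    """
    cells = list(track)
    if state != len(cells) - 1:
        cells[state] = 'o'
    return ''.join(cells)

print(render(1))  # -o---T
print(render(5))  # -----T
```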

6. Finally, the Q-learning algorithm

Pseudocode of the Q-learning algorithm

Chinese version of the pseudocode:

Image source: https://www.hhyz.me/2018/08/05/2018-08-05-RL/

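The Q value is updated according to the Bellman equation, where $\alpha$ is the learning rate (alpha) and $\gamma$ is the discount factor (gamma):

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[r_{t+1} + \gamma \max _{a} Q(s_{t+1}, a) - Q(s_t,a_t)] \tag {1}$$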
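
As a worked example of equation (1) with alpha = 0.1 and gamma = 0.8, consider the step that reaches the treasure (reward 1, and the next state's Q values are still 0). The helper name q_update is an assumption for illustration.

```python
ALPHA = 0.1  # learning rate, as in the parameter section
GAMMA = 0.8  # discount factor

def q_update(q_sa, reward, max_next_q, alpha=ALPHA, gamma=GAMMA):
    """Return the new Q(s, a) after one update of equation (1)."""
    td_target = reward + gamma * max_next_q
    return q_sa + alpha * (td_target - q_sa)

q = 0.0
q = q_update(q, reward=1, max_next_q=0.0)   # first time the treasure is reached
print(round(q, 4))  # 0.1
q = q_update(q, reward=1, max_next_q=0.0)   # second time
print(round(q, 4))  # 0.19
```

Each update moves Q(s, a) a fraction alpha of the way toward the target, which is why the value creeps up rather than jumping to 1.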

Alright, time to implement it:

# 13 episodes of exploration in total
for i in range(13):
    # 0. Start from the leftmost position (not required)
    current_state = 0
    #current_state = random.choice(states) # a random start also works
    while current_state != states[-1]:
        # 1. From the valid actions in the current state, pick one at random (explore) or greedily (exploit)
        if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
            current_action = random.choice(get_valid_actions(current_state))
        else:
            current_action = q_table.loc[current_state].idxmax() # exploit (greedy)
        # 2. Take the action and get the next state (position)
        next_state = get_next_state(current_state, current_action)
        # 3. Get all Q values of the next state; their maximum is used below
        next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
        # 4. Update the Q value of the current state-action pair using the Bellman equation
        #    (.loc replaces the .ix indexer, which was removed from pandas)
        td_target = rewards[next_state] + gamma * next_state_q_values.max()
        q_table.loc[current_state, current_action] += alpha * (td_target - q_table.loc[current_state, current_action])
        # 5. Move to the next state (position)
        current_state = next_state

print('\nq_table:')
print(q_table)

And that is the famous Q-learning algorithm!

Note that in the Bellman update the reward is looked up with rewards[next_state]. To stress it once more: next_state.
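
Because the reward is indexed by next_state, the only transition that earns reward 1 is the step from state 4 into state 5. In the loop above that step happens exactly once per episode, and the Q values of state 5 are never updated, so Q(4, 'right') evolves independently of the random choices. The sketch below simply iterates that one update; q4_right_after is a hypothetical helper name.

```python
alpha = 0.1
rewards = [0, 0, 0, 0, 0, 1]

def q4_right_after(episodes):
    """Iterate the update for (state 4, 'right') once per episode."""
    q = 0.0
    for _ in range(episodes):
        # the reward comes from the state being entered (5); Q at state 5 stays 0
        q += alpha * (rewards[5] + 0.0 - q)
    return q

print(round(q4_right_after(13), 4))  # 0.7458
```

This is the recurrence q ← 0.9 q + 0.1, whose closed form after n episodes is 1 - 0.9^n.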

Of course, we would like to watch the explorer as it explores, so we simply update (print) the environment at every step:

for i in range(13):
    #current_state = random.choice(states)
    current_state = 0

    update_env(current_state) # display only
    total_steps = 0           # display only

    while current_state != states[-1]:
        if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
            current_action = random.choice(get_valid_actions(current_state))
        else:
            current_action = q_table.loc[current_state].idxmax() # exploit (greedy)

        next_state = get_next_state(current_state, current_action)
        next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
        # fixed: the list is named rewards, not reward
        q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
        current_state = next_state

        update_env(current_state) # display only
        total_steps += 1          # display only

    print('\rEpisode {}: total_steps = {}'.format(i, total_steps), end='') # display only
    time.sleep(1)                                                          # display only
    print('\r                                ', end='')                    # display only

print('\nq_table:')
print(q_table)
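
Once training has finished, the learned behavior can be read off the table by taking the highest-valued action in each non-terminal state. A minimal sketch on a plain-dict table follows; greedy_policy is a hypothetical helper, and the numbers in example_q are illustrative values, not output from a real run.

```python
def greedy_policy(q, terminal):
    """Map each non-terminal state to its highest-valued action."""
    return {s: max(row, key=row.get) for s, row in q.items() if s != terminal}

# Illustrative values only (not output from a real run)
example_q = {
    0: {'left': 0.00, 'right': 0.01},
    1: {'left': 0.00, 'right': 0.05},
    2: {'left': 0.01, 'right': 0.15},
    3: {'left': 0.02, 'right': 0.40},
    4: {'left': 0.05, 'right': 0.75},
    5: {'left': 0.00, 'right': 0.00},
}
print(greedy_policy(example_q, terminal=5))
```

With a pandas table the equivalent per-state lookup is the idxmax call already used in the training loop.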

7. Complete code

'''
-o---T
# T is the treasure, o is the explorer's position
'''

# Author: hhh5460
# Date: 20181217

import pandas as pd
import random
import time


epsilon = 0.9   # greediness: probability of taking the greedy action
alpha = 0.1     # learning rate
gamma = 0.8     # discount factor applied to future rewards

states = range(6)           # state set, 0 through 5
actions = ['left', 'right'] # action set; an action 'none' (stay put) could also be added
rewards = [0,0,0,0,0,1]     # reward set: only the treasure position gives reward 1; all others give 0

# initialize with floats (0.0) so that later float updates do not change the column dtype
q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states],
                       index=states, columns=actions)


def update_env(state):
    '''Update the environment and print it'''
    env = list('-----T') # the environment is nothing more than this string (as a list)
    if state != states[-1]:
        env[state] = 'o'
    print('\r{}'.format(''.join(env)), end='')
    time.sleep(0.1)

def get_next_state(state, action):
    '''Return the next state after applying an action to a state'''
    # l,r,n = -1,+1,0
    if action == 'right' and state != states[-1]: # unless in the last state (position), moving right adds 1
        next_state = state + 1
    elif action == 'left' and state != states[0]: # unless in the first state (position), moving left subtracts 1
        next_state = state - 1
    else:
        next_state = state
    return next_state

def get_valid_actions(state):
    '''Return the set of valid actions in the current state; independent of rewards!'''
    valid_actions = set(actions)        # start from ['left', 'right']
    if state == states[-1]:             # in the last state (position)
        valid_actions -= set(['right']) # cannot move right
    if state == states[0]:              # in the first state (position)
        valid_actions -= set(['left'])  # cannot move left
    return list(valid_actions)

for i in range(13):
    #current_state = random.choice(states)
    current_state = 0

    update_env(current_state) # display only
    total_steps = 0           # display only

    while current_state != states[-1]:
        if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
            current_action = random.choice(get_valid_actions(current_state))
        else:
            current_action = q_table.loc[current_state].idxmax() # exploit (greedy)

        next_state = get_next_state(current_state, current_action)
        next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
        # .loc replaces the .ix indexer, which was removed from pandas
        q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
        current_state = next_state

        update_env(current_state) # display only
        total_steps += 1          # display only

    print('\rEpisode {}: total_steps = {}'.format(i, total_steps), end='') # display only
    time.sleep(2)                                                          # display only
    print('\r                                ', end='')                    # display only

print('\nq_table:')
print(q_table)

8. Really the last step: a screenshot of the result
