Reinforcement Learning Explained, with a Code Implementation

Reposted article (see the original), 2019-04-29 22:35. Topic: Reinforcement Learning (RL)

This article is the author's original work. When reposting, please cite the source: https://www.cnblogs.com/further-further-further/p/10789375.html

1. Introduction

2. Principles of Reinforcement Learning

2.1 Definition of Reinforcement Learning (RL)

2.2 Markov Decision Process (MDP)

2.3 Bellman Equation

2.4 Q-Learning

3. Code Implementation and Notes (Python 3.5)

4. Results

5. References

1. Introduction

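Most readers will still remember how AlphaGo, the program developed by Google's DeepMind, defeated the Korean Go master Lee Sedol in 2016. At the time I was both shaken and frightened by the outcome of that man-versus-machine match. Shaken, because a machine out-thinking humans was something I had only seen in science-fiction films, and yet here it was happening in real life. Frightened, because fear is a natural reaction to the unknown; but it is precisely this fear that gives us the instinct to explore the unknown and to lift the veil on the technology behind AlphaGo. You have probably already guessed a core technique that AlphaGo relies on: reinforcement learning.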

2. Principles of Reinforcement Learning

2.1 Definition of Reinforcement Learning

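Reinforcement learning is a goal-directed method of learning through interaction; its aim is to find the optimal policy for a sequence of decisions over time. This definition is rather abstract (to be honest, abstract statements are concise and precise, but also very hard to grasp). Here is an easier example:

There are two roads in front of you, leading in two different directions. Only one road, and hence one direction, reaches the destination, and the catch is that you do not know which one.

It feels hopeless, doesn't it? In that situation there is indeed nothing we can do. But if you are given the chance to try both directions, you will find out which one is correct.

A core idea of reinforcement learning is trial: only by trying can the agent discover which actions maximize the reward. The current action may affect not only the immediate reward but also the next reward and all rewards after it, because a goal is reached through a chain of actions taken one step at a time.

The scenario above involves the main elements of reinforcement learning: agent, environment, state, action, reward and policy.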

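Agent: the subject of reinforcement learning, acting as the learner or decision maker. In the scenario above, this is ourselves.

Environment: everything outside the agent, made up mainly of the set of states.

State: data that describes the environment; the state set is the set of all possible states of the environment. For example, taking one step brings you to a new state.

Action: something the agent can do; the action set is the set of all actions available to the agent. For example, taking one step is an action.

Reward: the positive or negative feedback signal the agent receives after performing an action; the reward set is the set of all feedback the agent can receive. A correct move is rewarded, a wrong move is penalized.

Policy: reinforcement learning learns a mapping from environment states to actions, and this mapping is called the policy. Informally, the policy is the agent's reasoning about how to choose an action.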

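Step 1: the agent tries an action and the environment moves to a new state. For this new state, the environment gives a reward or a penalty.

Step 2: based on the new state and the reward or penalty fed back by the environment, the agent takes a new action, and so on until it reaches the goal.

Step 3: the agent uses the maximum reward to find the best policy for reaching the goal, and then follows that policy to the goal.

Note that the agent has to try all possible actions on its way to the goal. In the end this yields a table that holds an entry for every possible action in every possible state (the Q-table).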
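The trial-and-feedback loop in the steps above can be sketched for the two-road example. This is only an illustration under stated assumptions: the names `step`, `learn_by_trying` and `GOAL_ROAD`, and the reward values +1 and -1, are hypothetical and do not come from the article.

```python
# Hypothetical two-road environment: road 1 reaches the destination, road 0 does not.
GOAL_ROAD = 1  # assumption for illustration; the agent never reads this value directly


def step(action):
    """Environment: return the reward for an action (+1 correct road, -1 wrong road)."""
    return 1.0 if action == GOAL_ROAD else -1.0


def learn_by_trying(n_actions=2, rounds=3):
    """Agent: try every action several times and remember its average reward."""
    totals = [0.0] * n_actions
    for _ in range(rounds):
        for action in range(n_actions):     # exploration: try each road
            totals[action] += step(action)  # feedback from the environment
    averages = [total / rounds for total in totals]
    # Policy: choose the action with the highest observed average reward.
    best_action = max(range(n_actions), key=lambda a: averages[a])
    return best_action, averages


best_action, averages = learn_by_trying()
print(best_action, averages)
```

Only after both roads have been tried does the agent have the information it needs to choose, which is the point the example in the text makes.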

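Here we borrow a figure from Zhihu that shows how the elements of reinforcement learning relate to each other (https://www.zhihu.com/topic/20039099/intro).

Now that the principle is clear, let us see how the experts abstracted it and expressed it in mathematical formulas.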

2.2 Markov Decision Process (MDP)

A Markov decision process consists of five elements:

S: the state set (states)

A: the action set (actions)

P: the state transition probability

R: the immediate reward (reward)

$γ$: the discount factor, $γ \in [0, 1]$, which weights future rewards relative to the immediate reward
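As a rough illustration of the five elements, a small MDP can be written down as plain Python data. The layout `P[s][a] = [(probability, next_state, reward, done), ...]` follows the convention used by the gridworld code later in this article; the two-state MDP itself and the value of `GAMMA` are made up for this sketch.

```python
# A tiny hypothetical MDP with two states and two actions (not from the article).
# P[s][a] is a list of (probability, next_state, reward, done) tuples, so it
# carries both the transition probabilities P and the immediate rewards R.
GAMMA = 0.9  # discount factor, assumed value

P = {
    0: {
        0: [(1.0, 0, -1.0, False)],                       # action 0: stay in state 0
        1: [(0.8, 1, 0.0, True), (0.2, 0, -1.0, False)],  # action 1: usually reach state 1
    },
    1: {
        0: [(1.0, 1, 0.0, True)],  # state 1 is terminal
        1: [(1.0, 1, 0.0, True)],
    },
}

S = sorted(P)     # state set
A = sorted(P[0])  # action set


def check_probabilities(P):
    """Outgoing probabilities of every (state, action) pair must sum to 1."""
    return all(
        abs(sum(prob for prob, _, _, _ in outcomes) - 1.0) < 1e-9
        for actions in P.values()
        for outcomes in actions.values()
    )


print(S, A, check_probabilities(P))
```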

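State-value function (the formula that evaluates the reward obtainable from a state):

$V_π(s) = \mathbb{E}_π[G_t \mid S_t = s]$, where $G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + \dots$

$V_π(s)$ is the expected reward obtainable from state $s$ at time $t$ when following policy $π$.

Optimal value function (the largest expected reward over all policies):

$V^*(s) = \max_π V_π(s)$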
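A state value is an expectation over discounted sums of rewards, so it may help to see one such sum computed. The sketch below assumes a finite reward list and a discount factor `gamma`; the function name `discounted_return` and the sample rewards are illustrative only.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list."""
    g = 0.0
    for reward in reversed(rewards):  # accumulate from the last reward backwards
        g = reward + gamma * g
    return g


# Hypothetical episode: three steps that each cost -1 before the goal is reached.
undiscounted = discounted_return([-1.0, -1.0, -1.0], gamma=1.0)
discounted = discounted_return([-1.0, -1.0, -1.0], gamma=0.9)
print(undiscounted, discounted)
```

With `gamma=1.0` every step counts equally (the sum is -3.0); with `gamma=0.9` later penalties weigh less (about -2.71).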

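2.3 Bellman Equation

The Bellman equation is a more general expression of the state-value function: the value of the current state is made up of the current reward and the value of the next state. A figure borrowed from another author illustrates this: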

Here each action is assumed to be chosen with probability 0.5. The state with value 7.4 has two possible next states: the one drawn as a red circle and the one with value 0.

The next states of the red-circle state are the three states with values -1.3, 2.7 and 7.4, reached with probabilities 0.2, 0.4 and 0.4.

The state with value 0 has no next state.

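Value of the state = current reward + value of the next state

The way the figure presents the value calculation is slightly off: it does not separate the immediate reward from the value of the next state, which easily causes confusion.

So the value of the state = 0.5 * (10 + 1) + 0.5 * (0.2 * (-1.3) + 0.4 * 2.7 + 0.4 * 7.4) = 7.39 ≈ 7.4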
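The arithmetic of this backup can be checked in a few lines. The numbers are the ones quoted in the text; the variable names are illustrative, and the discount factor is taken to be 1, which is what the worked formula implies.

```python
# Numbers taken from the worked example in the text (discount factor assumed to be 1).
action_prob = 0.5                # each of the two actions is chosen with probability 0.5
immediate_rewards = [10.0, 1.0]  # immediate reward of each action
# (transition probability, value of next state) for the action that has successor states
successors = [(0.2, -1.3), (0.4, 2.7), (0.4, 7.4)]

expected_reward = action_prob * sum(immediate_rewards)
expected_next_value = action_prob * sum(p * v for p, v in successors)
value = expected_reward + expected_next_value
print(round(value, 2))
```

The immediate-reward part contributes 5.5 and the next-state part 1.89, which together give the 7.4 shown for the state after rounding.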

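Bellman optimality equation:

$V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + γ V^*(s') \right]$

$V^*(s)$ is the value of the optimal value function in state $s$, that is, the maximum cumulative reward the agent can obtain from that state.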

2.4 Q-Learning

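Q-Learning learns the return obtained by taking a particular action in a given state, and then goes through all possible actions to obtain the returns for all states, which form the Q-Table.

In fact, each == . The algorithm for generating the Q-Table is as follows: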

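1. Initialize the Q-Table so that the return for every state (s) is 0;

2. Pick a random state (s) as the starting point;

3. Among all possible actions (A) of the current state (s), go through each action (a) in order;

4. Move to the next state;

5. In the new state, select the action (a1) with the largest Q value;

6. Use the Bellman equation to update the value $Q(s, a)$ of the corresponding state-action pair in the Q-Table, then go through the remaining actions from step 3 in order, repeating steps 3 to 6;

7. Set the new state as the current state and repeat steps 2 to 6 until the goal state is reached.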

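Note that there are two nested loops:

Outer loop: iterate over all states;

Inner loop: iterate over all possible actions of each state.

As an example, the left part of the figure shows the Q-Learning process (R: reward), the middle part shows the Q-Table produced for state = 1, action = 5, and the right part shows the final Q-Table.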
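The two nested loops and the Bellman update can be sketched as follows. This is a simplified, deterministic variant under stated assumptions, not necessarily the exact procedure behind the figure: it sweeps all states in order instead of starting from random states, it overwrites each entry directly (no learning rate), and the reward matrix `R` and `GAMMA = 0.8` are assumed values for a six-state example in which an action means "move to that state" and state 5 is the goal.

```python
import numpy as np

GAMMA = 0.8  # discount factor, assumed

# Assumed reward matrix R[state, action]; an action means "move to that state".
# -1 marks a move that is not possible, 100 marks a move into the goal state 5.
R = np.array([
    [-1, -1, -1, -1,  0,  -1],
    [-1, -1, -1,  0, -1, 100],
    [-1, -1, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0,  -1],
    [ 0, -1, -1,  0, -1, 100],
    [-1,  0, -1, -1,  0, 100],
], dtype=float)


def q_update(Q, R, state, action, gamma=GAMMA):
    """Bellman update: Q(s, a) = R(s, a) + gamma * max over a' of Q(s', a')."""
    next_state = action  # in this example the action index is the next state
    Q[state, action] = R[state, action] + gamma * Q[next_state].max()
    return Q[state, action]


def build_q_table(R, gamma=GAMMA, sweeps=500):
    """Fill the Q-table with repeated sweeps over all states and valid actions."""
    Q = np.zeros_like(R)
    for _ in range(sweeps):
        for state in range(R.shape[0]):                   # outer loop: states
            for action in np.flatnonzero(R[state] >= 0):  # inner loop: valid actions
                q_update(Q, R, state, action, gamma)
    return Q


# First update from an all-zero table, for state = 1, action = 5.
first = q_update(np.zeros_like(R), R, state=1, action=5)
Q = build_q_table(R)
print(first)
print(np.round(Q))
```

Starting from an all-zero table, the first update for state = 1, action = 5 gives 100 + 0.8 * 0 = 100. After enough sweeps the entries stop changing, and following the largest Q value from any state leads to the goal.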

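3. Code Implementation and Notes (Python 3.5)

Here is an example to deepen the understanding of reinforcement learning. The rules of the game: the two gray cells are the goal exits; the grid has 16 states (including the goal exits); each state has 4 directions (up, down, left, right); the task is to find the best direction from any state to a goal exit (see the figure below).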

gridworld.py

import io
import sys

import numpy as np
# Note: this import needs an older gym release that still ships
# gym.envs.toy_text.discrete; newer releases no longer include it.
from gym.envs.toy_text import discrete

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3


class GridworldEnv(discrete.DiscreteEnv):
    """
    Grid World environment from Sutton's Reinforcement Learning book chapter 4.
    You are an agent on an MxN grid and your goal is to reach the terminal
    state at the top left or the bottom right corner.

    For example, a 4x4 grid looks as follows:

    T o o o
    o x o o
    o o o o
    o o o T

    x is your position and T are the two terminal states.

    You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
    Actions going off the edge leave you in your current state.
    You receive a reward of -1 at each step until you reach a terminal state.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, shape=[4, 4]):
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')

        self.shape = shape

        nS = np.prod(shape)
        nA = 4

        MAX_Y = shape[0]
        MAX_X = shape[1]

        P = {}
        grid = np.arange(nS).reshape(shape)
        it = np.nditer(grid, flags=['multi_index'])

        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            # P[s][a] = (prob, next_state, reward, is_done)
            P[s] = {a: [] for a in range(nA)}

            is_done = lambda s: s == 0 or s == (nS - 1)
            reward = 0.0 if is_done(s) else -1.0

            # We're stuck in a terminal state
            if is_done(s):
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            # Not a terminal state
            else:
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

            it.iternext()

        # Initial state distribution is uniform
        isd = np.ones(nS) / nS

        # We expose the model of the environment for educational purposes
        # This should not be used in any model-free learning algorithm
        self.P = P

        super(GridworldEnv, self).__init__(nS, nA, P, isd)

    def _render(self, mode='human', close=False):
        """ Renders the current gridworld layout

        For example, a 4x4 grid with the mode="human" looks like:
            T o o o
            o x o o
            o o o o
            o o o T
        where x is your position and T are the two terminal states.
        """
        if close:
            return

        outfile = io.StringIO() if mode == 'ansi' else sys.stdout

        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            if self.s == s:
                output = " x "
            elif s == 0 or s == self.nS - 1:
                output = " T "
            else:
                output = " o "

            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()

            outfile.write(output)

            if x == self.shape[1] - 1:
                outfile.write("\n")

            it.iternext()

ValueIteration.py

import numpy as np
import gridworld as gw


env = gw.GridworldEnv()


# The V table holds, for each state, the maximum return of taking one more step
# (first compute the return of each direction, then take the maximum).
# The Q table holds the values of all possible moves in all states.
def value_iteration(env, theta=0.0001, discount_factor=1.0):
    """
    Value Iteration Algorithm.

    Args:
        env: OpenAI environment. env.P represents the transition probabilities of the environment.
        theta: Stopping threshold. If the value of all states changes less than theta
            in one iteration we are done.
        discount_factor: gamma, the time discount factor.

    Returns:
        A tuple (policy, V) of the optimal policy and the optimal value function.
    """

    def one_step_lookahead(state, V):
        """
        Helper function to calculate the value for all action in a given state.

        Args:
            state: The state to consider (int)
            V: The value to use as an estimator, Vector of length env.nS

        Returns:
            A vector of length env.nA containing the expected value of each action.
        """
        # Game rules: states 0 and 15 are the exits; for each of the 16 states,
        # find the move that reaches an exit fastest.
        #   0 o o o
        #   o x o o
        #   o o o o
        #   o o o 15
        # env.P is the initial transition table: the terminal states 0 and 15
        # give reward 0, every other state gives a penalty of -1.
        # Return of each action (up, right, down, left) =
        #   probability * (immediate reward + discount factor * return of the
        #   next state reached by this action)
        A = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[state][a]:
                A[a] += prob * (reward + discount_factor * V[next_state])
        return A

    # Each sweep recomputes the return V of all 16 states.
    # The loop ends once the largest change of any state's value within a
    # sweep is smaller than the hyperparameter theta.
    # At that point V holds, for each state, the maximum return of its next action.
    V = np.zeros(env.nS)
    while True:
        # Stopping condition
        delta = 0
        # Update each state...
        # There are 16 states and all of them are visited in every sweep.
        for s in range(env.nS):
            # Do a one-step lookahead to find the best action
            A = one_step_lookahead(s, V)
            best_action_value = np.max(A)
            # Calculate delta across all states seen so far
            delta = max(delta, np.abs(best_action_value - V[s]))
            # Update the value function
            V[s] = best_action_value
        # Check if we can stop
        if delta < theta:
            break

    # Create a deterministic policy using the optimal value function.
    # The policy has shape (16, 4).
    policy = np.zeros([env.nS, env.nA])
    for s in range(env.nS):
        # One step lookahead to find the best action for this state
        A = one_step_lookahead(s, V)
        best_action = np.argmax(A)  # index of the maximum value
        # Always take the best action
        policy[s, best_action] = 1.0

    return policy, V


policy, v = value_iteration(env)

print("Policy Probability Distribution:")
print(policy)
print("")

print("Reshaped Grid Policy (0=up, 1=right, 2=down, 3=left):")
# argmax along axis=1 returns, for each state (row), the index of the best action
print(np.reshape(np.argmax(policy, axis=1), env.shape))
print("")

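4. Results

The game has 16 states, each with 4 action directions, so the Q-Table has size (16, 4). The generated table is shown in the figure below:

[Figure: generated (16, 4) table]

The final best action direction for each state, reshaped to (4, 4), is:

[Figure: (4, 4) grid of best actions]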
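The result can also be reproduced without gym. The following is a minimal self-contained sketch, assuming the same rules as the gridworld above (4x4 grid, terminal states 0 and 15, reward -1 per step, deterministic moves, discount factor 1.0). The helper names `next_state`, `action_values` and this standalone `value_iteration` are written for this sketch and are not the article's code.

```python
import numpy as np

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3
ROWS, COLS = 4, 4
N_STATES, N_ACTIONS = ROWS * COLS, 4
TERMINALS = (0, N_STATES - 1)


def next_state(s, a):
    """Deterministic move on the grid; moving off the edge leaves the state unchanged."""
    y, x = divmod(s, COLS)
    if a == UP:
        y = max(y - 1, 0)
    elif a == RIGHT:
        x = min(x + 1, COLS - 1)
    elif a == DOWN:
        y = min(y + 1, ROWS - 1)
    else:  # LEFT
        x = max(x - 1, 0)
    return y * COLS + x


def action_values(s, V, gamma):
    """Value of each action in state s: reward + gamma * V[next state]."""
    if s in TERMINALS:
        return np.zeros(N_ACTIONS)  # terminal states: reward 0, agent stays put
    return np.array([-1.0 + gamma * V[next_state(s, a)] for a in range(N_ACTIONS)])


def value_iteration(theta=1e-4, gamma=1.0):
    """Sweep all states until the largest value change in a sweep is below theta."""
    V = np.zeros(N_STATES)
    while True:
        delta = 0.0
        for s in range(N_STATES):
            best = action_values(s, V, gamma).max()
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy: index of the best action in each state.
    policy = np.array([np.argmax(action_values(s, V, gamma)) for s in range(N_STATES)])
    return policy.reshape(ROWS, COLS), V.reshape(ROWS, COLS)


policy_grid, value_grid = value_iteration()
print(value_grid)
print(policy_grid)
```

Under these assumptions the value of each state is minus the number of steps to the nearest exit, and the policy grid (0=up, 1=right, 2=down, 3=left) comes out as [[0 3 3 2], [0 0 0 2], [0 0 1 2], [0 1 1 0]]. Where several directions are equally good, `np.argmax` picks the lowest action index.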

5. References

1. https://www.zhihu.com/topic/20039099/intro

2. http://baijiahao.baidu.com/s?id=1597978859962737001&wfr=spider&for=pc

3. https://applenob.github.io/gridworld.html
