Preface:

This article is based on Jeff Bradberry's article Introduction to Monte Carlo Tree Search.
Jeff Bradberry also provides a complete set of examples, written in Python:
board game server
board game client
Tic Tac Toe board
AI implementation of Tic Tac Toe

Ah Yuan's First Day at Work - Monte Carlo Tree Search - The Generic Game Interfaces: board and player

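Ah Yuan noticed that Ah Jing had recently been studying Monte Carlo Tree Search, and hurried over to ask: "What is Monte Carlo Tree Search used for?"
"Monte Carlo Tree Search is a method (or rather a framework) for solving perfect information games. Right now I'm studying one variant of it, the UCT algorithm, which provides a general-purpose algorithm for game playing."

Note: perfect information games are games in which no information is hidden from any player.

"A general-purpose game-playing algorithm works for any game, is that right?"
"Put simply, yes. The important point is that the algorithm does not need to know the game's domain knowledge."
"Domain knowledge? That's hard to grasp. Can it really win without even knowing the rules of the game?"
"Domain knowledge of the game. Take chess: the value of each piece, such as 10 for the queen, 5 for the rook, and so on. That is domain knowledge. In the general setting, a rule such as how the knight moves also counts as domain knowledge."
"Now I'm confused! How is the AI algorithm supposed to make a move?"
"In object-oriented terms, we can define a generic interface (board) for games. A concrete game only implements this interface and provides no other information."
"For a programmer, that's much easier to understand. Let's first look at which properties and methods this interface (board) should define."
"First, there is a num_players property, which returns the number of players in the game."
"Hmm, let me think. When the game begins, we need a start method to start a game."
"Very good. This method needs to return a state object that records the current state of the game. The contents of the state object are opaque to the outside; only the board itself can interpret them."
"Then we need to show the state of the board. So the board needs to provide a display method that returns the current state, or the board layout, in displayable form."
"Right. There should also be a method that returns which player is to move: current_player."
"If the current player is an AI player (that is, the user of the game-playing algorithm), how does it know which moves it can make? Surely that requires a lot of domain knowledge?"
"One trick is to let the board return all currently allowed moves, based on the list of past states: legal_actions."
"Plus an is_legal(action) to check whether a move is allowed."
"Next, given the current action, the board should return the next game state: next_state."
"To decide who wins, we need a winner method."
"Once there is a winner, the board needs to return a winner_message to tell the players who won."
"Looks good! Let's summarize the contents of the board interface."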

class Board(object):
    '''
    Defines the general rules of a game.
    State: an object that is only interpreted inside the board class.
        Normally, a state includes the game board information (e.g. piece positions, action index, current action, current player, etc.)
    Action: an object that describes a move.
    '''

    # num_players: the number of players in the game.
    num_players = 2

    def start(self):
        ''' 
        Start the game
        Return: the initial state
        '''
        return None

    def display(self, state, action, _unicode=True):
        '''
        Display the board
        state: current state
        action: current action
        Return: display information
        '''
        return None

    def parse(self, action):
        '''
        Parse player input text into an action.
        If the input action is invalid, return None.
        The method is used by a human player to parse human input.
        action: player input action text.
        Return: action if input is a valid action, otherwise None.
        '''
        return None

    def next_state(self, state, action):
        '''
        Calculate the next state based on the current state and action.
        state: the current state
        action: the current action
        Return: the next state
        '''
        return tuple(state)

    def is_legal(self, history, action):
        '''
        Check if an action is legal.
        The method is used by a human player to validate human input.
        history: an array of history states.
        action: the action to check.
        Return: True if the action is legal, otherwise False.
        '''
        return action in self.legal_actions(history)

    def legal_actions(self, history):
        '''
        Calculate the legal actions from the history states.
        The method is mainly used by AI players.
        history: an array of history states.
        Return: an array of legal actions.
        '''
        return []

    def current_player(self, state):
        '''
        Gets the current player.
        state: the current state.
        Return: the current player number.
        '''
        return None

    def winner(self, history):
        '''
        Gets the winning player.
        history: an array of history states.
        Return: the winning player number. 0: no winner yet (the game has not ended); num_players + 1: draw.
        '''
        return 0

    def winner_message(self, winner):
        '''
        Gets game result.
        winner: the winning player number
        Return: winner message, the game result.
        '''
        return ""
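
As a minimal sketch of how a concrete game could implement this interface, the block below defines a hypothetical NimBoard (it is not part of Jeff Bradberry's examples). Players alternately take 1 to 3 stones from a pile, and whoever takes the last stone wins. The class name, the `stones` parameter, and the `(stones_left, player_to_move)` state layout are all assumptions made for illustration; display and parse are left out to keep it short.

```python
class NimBoard(object):
    '''
    Hypothetical example game: players alternately take 1-3 stones from a
    pile; whoever takes the last stone wins.
    State (assumed layout): a tuple (stones_left, player_to_move).
    Action: the number of stones to take (1, 2 or 3).
    '''
    num_players = 2

    def __init__(self, stones=10):
        self.stones = stones

    def start(self):
        # Player 1 moves first.
        return (self.stones, 1)

    def current_player(self, state):
        return state[1]

    def next_state(self, state, action):
        stones, player = state
        return (stones - action, 3 - player)  # 3 - player swaps 1 and 2

    def legal_actions(self, history):
        stones = history[-1][0]
        return [n for n in (1, 2, 3) if n <= stones]

    def is_legal(self, history, action):
        return action in self.legal_actions(history)

    def winner(self, history):
        stones, player = history[-1]
        if stones > 0:
            return 0  # game still in progress
        return 3 - player  # the player who just moved took the last stone

    def winner_message(self, winner):
        return "Winner: Player {0}.".format(winner)


board = NimBoard(stones=4)
history = [board.start()]
print(board.legal_actions(history))      # [1, 2, 3]
history.append(board.next_state(history[-1], 3))
history.append(board.next_state(history[-1], 1))
print(board.winner(history))             # 2
```

Nothing outside the class needs to know what the tuple means: callers only pass states back into the board, which is the point of keeping the state opaque.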
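
"In addition, we need to define a player interface. A player's main job is to make moves, so it needs a get_action method."
"After a player has moved, an update method is needed to notify all players that the state has changed."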

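class Player(object):
    def __init__(self, board):
        # The board is the player's only source of game rules.
        self.board = board
        self.states = []

    def update(self, state):
        '''
        Append the current state to the list of states seen so far.
        state: the current state.
        '''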
        self.states.append(state)

    def display(self, state, action):
        '''
        Display board.
        state: the current state.
        action: the current action.
        Return: display information.
        '''
        return self.board.display(state, action)

    def winner_message(self, msg):
        '''
        Display winner message.
        msg: winner information
        Return: winner message
        '''
        return self.board.winner_message(msg)

    def get_action(self):
        '''
        Get player next action.
        Return: the next action.
        '''
        raise NotImplementedError  # each concrete player chooses its own action

Note: the display and winner_message methods are used to provide board information to the game client. This keeps the client isolated from the board.
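
To show how the two interfaces might fit together, here is a rough sketch of a game loop. The `play_game` function, the `RandomPlayer` class, and the tiny `CountdownBoard` are hypothetical names introduced only for this example; the original project runs its games through a server and client instead.

```python
import random


class CountdownBoard(object):
    '''Hypothetical toy game: take 1 or 2 from a counter; taking the last one wins.'''
    num_players = 2

    def start(self):
        return (5, 1)  # (counter, player to move)

    def current_player(self, state):
        return state[1]

    def legal_actions(self, history):
        return [n for n in (1, 2) if n <= history[-1][0]]

    def next_state(self, state, action):
        return (state[0] - action, 3 - state[1])

    def winner(self, history):
        counter, player = history[-1]
        return 0 if counter > 0 else 3 - player


class RandomPlayer(object):
    '''Hypothetical player that picks a uniformly random legal action.'''
    def __init__(self, board, rng):
        self.board = board
        self.rng = rng
        self.states = []

    def update(self, state):
        self.states.append(state)

    def get_action(self):
        return self.rng.choice(self.board.legal_actions(self.states))


def play_game(board, players):
    '''Run one game; players maps player number -> player object.'''
    state = board.start()
    history = [state]
    for p in players.values():
        p.update(state)
    while board.winner(history) == 0:
        mover = players[board.current_player(state)]
        action = mover.get_action()
        state = board.next_state(state, action)
        history.append(state)
        # Every player is told about the new state, not just the mover.
        for p in players.values():
            p.update(state)
    return board.winner(history), history


rng = random.Random(0)
board = CountdownBoard()
players = {1: RandomPlayer(board, rng), 2: RandomPlayer(board, rng)}
result, history = play_game(board, players)
print(result, history[-1])
```

The loop never inspects a state itself. It only asks the board who moves, what the next state is, and whether someone has won.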

Ah Yuan's Second Day at Work - Monte Carlo Tree Search - MonteCarlo Player

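Ah Yuan and Ah Jing continued their discussion of Monte Carlo Tree Search.
Ah Jing said, "An AI game-playing application needs at least two concrete players: a human player and a MonteCarlo player."
"The human player provides an interactive interface for a human."
"Right. The MonteCarlo player is an AI player, and it is the focus of our discussion. In its implementation of get_action, the MonteCarlo player uses the board to simulate the moves that could follow, and picks the best move based on the simulation results."
"Let's start with a simple problem: the number of possible move sequences in a game can be huge. How do we keep the simulation within a given time limit?"
"There are several ways to handle that. Here, we use a parameter, calculation_time, to control the time. Each simulation follows one path, and after each one we check whether time is up."
"A path is the full sequence of moves from the current state to the end of the game. What if that sequence is too long?"
"Although the number of move combinations in a game can be huge, the number of moves in a single normal game is not. We can also cap it with another parameter, max_actions."
"Got it. The code looks roughly like this."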

import time


class MonteCarlo(object):

    def __init__(self, board, **kwargs):
        # ...

        self.calculation_time = float(kwargs.get('time', 30))
        self.max_actions = int(kwargs.get('max_actions', 1000))

        # ...

    def get_action(self):
        # ...

        # Control period of simulation
        moves = 0
        begin = time.time()
        while time.time() - begin < self.calculation_time:
            self.run_simulation()
            moves += 1

        # ...

    def run_simulation(self):
        # ...
        
        # Control number of simulation actions
        for t in range(1, self.max_actions + 1):
            # ...
            pass
        
        # ...

Note: to make it easier to follow, I lightly refactored the source code, mainly by renaming some variables.
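
The two limits can be tried out in isolation with a small runnable sketch. The toy game, `random_playout`, and `simulate_for` below are hypothetical stand-ins for run_simulation and get_action; only the calculation_time and max_actions parameters come from the code above.

```python
import random
import time


def random_playout(rng, max_actions):
    '''
    Play one random game of a hypothetical toy game: a counter starts at 20
    and each action subtracts 1 or 2; the game ends when it reaches 0.
    Returns the number of actions taken (never more than max_actions).
    '''
    counter = 20
    for t in range(1, max_actions + 1):
        counter -= rng.choice((1, 2)) if counter > 1 else 1
        if counter == 0:
            return t
    return max_actions  # cut off: the path was longer than allowed


def simulate_for(calculation_time, max_actions, rng):
    '''Run playouts until the time budget is used up; return how many ran.'''
    games = 0
    begin = time.time()
    while time.time() - begin < calculation_time:
        random_playout(rng, max_actions)
        games += 1
    return games


rng = random.Random(0)
games = simulate_for(calculation_time=0.05, max_actions=1000, rng=rng)
print("simulations run in 0.05s:", games)
print(random_playout(rng, max_actions=5))  # cut off after 5 actions
```

Because the clock is only checked between playouts, the time limit can be overshot by at most one playout, and max_actions bounds how long that one playout can be.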

"We're a bit short on time today. Tomorrow we'll discuss the steps of Monte Carlo Tree Search."

Ah Yuan's Third Day at Work - Monte Carlo Tree Search - The Steps of Monte Carlo Tree Search

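Ah Yuan had also studied Monte Carlo Tree Search the night before. Today he spoke first.
"Monte Carlo Tree Search is a method, presumably derived from the Monte Carlo method. It defines several steps for finding the best move."
"Strictly speaking, Monte Carlo Tree Search is not a single algorithm."
"Right. That's why it has many variants. The one we are studying is Upper Confidence bound applied to Trees (UCT). We'll look into it tomorrow."
"OK, today is mainly about understanding the steps of Monte Carlo Tree Search."
"According to the article, there are four steps."
"Yes: selection, expansion, simulation, and back-propagation."
"Let's look at this diagram. The green parts are the four steps of Monte Carlo Tree Search."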

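"**Selection** chooses the best child move based on the statistics gathered so far for all child moves."
"**Expansion** picks a child move at random when the statistics gathered so far are not enough to work out the next move."
"**Simulation** plays the game forward, move by move."
"**Back-propagation** updates the statistics recorded along the path according to the result at the end of the game."
"As the diagram above shows, the selection rule is very important: it is what evaluates the value of each move."
"Good. Today we learned the steps of Monte Carlo Tree Search."
"Tomorrow we can study Upper Confidence bound applied to Trees (UCT)."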
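
The four steps can be sketched as a single simulation pass. This is only an illustration under stated assumptions: the toy game, the function names, and the `plays`/`wins` dictionaries keyed by (player, state) are choices made for this example, and selection uses a plain win rate as a placeholder rule rather than UCT.

```python
import random

# Hypothetical toy game: state = (counter, player to move); an action takes
# 1 or 2 from the counter; whoever takes the last one wins.
def legal_actions(state):
    return [n for n in (1, 2) if n <= state[0]]

def next_state(state, action):
    return (state[0] - action, 3 - state[1])

def winner(state):
    return 0 if state[0] > 0 else 3 - state[1]


def run_simulation(root, plays, wins, rng, max_actions=100):
    '''One pass of selection, expansion, simulation and back-propagation.'''
    state, visited, expand, result = root, set(), True, 0
    for _ in range(max_actions):
        player = state[1]
        children = [next_state(state, a) for a in legal_actions(state)]
        if all(plays.get((player, c)) for c in children):
            # Selection: every child has statistics, so take the best one.
            # Plain win rate is used here only as a placeholder rule.
            state = max(children,
                        key=lambda c: wins[(player, c)] / plays[(player, c)])
        else:
            # Not every child has statistics yet: pick a child at random.
            state = rng.choice(children)
        if expand and (player, state) not in plays:
            # Expansion: start tracking one new node per simulation.
            expand = False
            plays[(player, state)] = 0
            wins[(player, state)] = 0
        # Simulation: keep stepping through the game until it ends.
        visited.add((player, state))
        result = winner(state)
        if result:
            break
    # Back-propagation: update every tracked node on the path.
    for player, s in visited:
        if (player, s) in plays:
            plays[(player, s)] += 1
            if player == result:
                wins[(player, s)] += 1


plays, wins, rng = {}, {}, random.Random(0)
root = (4, 1)
for _ in range(200):
    run_simulation(root, plays, wins, rng)
for action in legal_actions(root):
    key = (1, next_state(root, action))
    print(action, wins.get(key, 0), "/", plays.get(key, 0))
```

Each key records the player who moved into that state, so a win is credited to the move's owner during back-propagation.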

Ah Yuan's Fourth Day at Work - Monte Carlo Tree Search - Upper Confidence bound applied to Trees (UCT)

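Ah Jing started talking right away.
"UCT uses the confidence interval formula from statistics to compute the value of a move. The method is fairly simple: all it needs is the number of visits and the number of wins for each move."
"What is the confidence interval formula?"
Ah Jing wrote down the confidence interval formula.
Confidence intervals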

\[
\bar{x}_i \pm \sqrt{\frac{z \ln n}{n_i}}
\]

where:

\(\bar{x}_i\): the mean result of choice \(i\).
\(n_i\): the number of plays of choice \(i\).
\(n\): the total number of plays.
\(z\): 1.96 for a 95\% confidence level.

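Ah Jing went on to explain.
"A confidence interval is a range computed from the statistics. With z = 1.96, the confidence level of the interval is 95%. In other words, we can be 95% confident that the true mean lies within the interval."
"The UCT algorithm uses the upper bound of the confidence interval as the value of each move."
"One benefit of using the upper bound is that if the currently best child move loses many simulations, its value drops, so a sibling move may become the better one."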
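
A small numeric sketch makes this effect visible. The `upper_bound` function simply evaluates the upper end of the formula written above, and the win and play counts are made-up numbers for illustration.

```python
from math import log, sqrt


def upper_bound(wins, plays, total_plays, z=1.96):
    '''Upper end of the interval: mean + sqrt(z * ln(n) / n_i).'''
    return wins / plays + sqrt(z * log(total_plays) / plays)


# Hypothetical counts: move A has won 3 of 4 simulations, move B 2 of 4.
a = upper_bound(wins=3, plays=4, total_plays=8)
b = upper_bound(wins=2, plays=4, total_plays=8)
print(a > b)   # True: A currently looks better

# A is then chosen 6 more times and loses every one of them.
a_after = upper_bound(wins=3, plays=10, total_plays=14)
b_after = upper_bound(wins=2, plays=4, total_plays=14)
print(a_after < b_after)   # True: B has overtaken A
```

B's own counts never changed. Its value rose only because the total number of plays grew while its visit count stayed the same, which is how neglected moves get another chance.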
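"Another key point is the condition for selection. In the article, selection happens only when all child moves have statistics (that is, each has been visited at least once and so has a visit count)."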

Ah Yuan's Fifth Day at Work - Monte Carlo Tree Search - A Graphical Walkthrough of Upper Confidence bound applied to Trees (UCT)

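Ah Yuan spent the day doing his homework and drew some diagrams to illustrate how the UCT algorithm proceeds.

First, in the initial state, none of the child moves has any statistics.

So we start with expansion: pick a child move at random, then keep simulating until the game ends. Then back-propagate, recording the statistics for the expanded move.

After several expansions, the condition for selection is met, so selection begins and picks the best child move.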
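
The switch from expansion to selection can be sketched as one decision function. The `choose_action` name, the `stats` layout (action mapped to a wins/plays pair), and the example counts are assumptions for illustration; the value used in selection is the upper bound from the confidence interval formula.

```python
import random
from math import log, sqrt


def choose_action(stats, legal, rng, z=1.96):
    '''
    stats maps action -> (wins, plays) for the children tried so far.
    Returns (action, phase), where phase is "selection" or "expansion".
    '''
    if all(stats.get(a, (0, 0))[1] > 0 for a in legal):
        # Selection: every child has been visited at least once.
        total = sum(stats[a][1] for a in legal)

        def value(a):
            wins, plays = stats[a]
            return wins / plays + sqrt(z * log(total) / plays)

        return max(legal, key=value), "selection"
    # Expansion: some child has no statistics yet, so choose at random.
    return rng.choice(legal), "expansion"


rng = random.Random(0)
legal = ["a", "b", "c"]
print(choose_action({"a": (1, 1)}, legal, rng)[1])   # expansion
# ('c', 'selection'): a rarely tried move gets a large bonus
print(choose_action({"a": (3, 5), "b": (1, 4), "c": (0, 1)}, legal, rng))
```

In the second call, move c has no wins at all, yet it is selected because it has been tried only once and its interval is still wide.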

Expansion, simulation, and back-propagation then continue.
The figure below shows that the previously best child move may change after further expansions.

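Ah Yuan's Diary

Saturday, October X, 2016
This week I studied Monte Carlo Tree Search with Ah Jing. I now have a basic understanding of its steps and how to use it.
I found that there is a lot of room for optimization when using Monte Carlo Tree Search. For example:

Computing the value of a move
- Can a value be computed when no win has been recorded?
- Can we determine that a move has no value, so that it can be pruned early?

There are also many open questions:

Can an AI program understand the rules? For example, understand how the knight moves.
Can an AI program work out some domain rules, such as opening strategies and piece values?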
