[Recommendation Algorithm Engineer Tech Stack Series] Machine Learning and Deep Learning: Reinforcement Learning


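Basic Elements of Reinforcement Learning

  • Agent: the entity that interacts with the environment and executes actions;

  • Environment: either a fully observable environment or a partially observable environment.
    1) In a fully observable environment the agent knows the entire environment, which is clearly an idealized case.
    2) In a partially observable environment the agent knows only part of the environment (the part it can exploit in the exploration-and-exploitation, E&E, trade-off); the rest it has to explore.

  • Action space (A): the set of all legal actions the agent can take;

  • State space (S): the set of all states, where a state is the information the agent obtains from the environment;

  • Reward and return (R): the immediate reward \(r_{t+1}\) is what the agent receives for executing an action; the discounted future reward (the return) also accounts for the rewards that the action may bring later: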

\[R_t = \sum^T_{i=t} \gamma^{(i-t)} r(s_i,a_i) \]

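That is, the return at the current step equals the current immediate reward plus γ times the return at the next step: \(R_t = r(s_t,a_t) + \gamma R_{t+1}\).
If γ equals 0, only the immediate reward counts;
if γ equals 1, future rewards are weighted the same as immediate ones. This is usually reserved for episodic tasks or deterministic environments where the same action always yields the same reward, because in cyclic Markov processes the undiscounted sum can be infinite.
In practice γ is therefore usually set to a value such as 0.9.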
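
As a minimal sketch of the formula above, the discounted return of a finite reward sequence can be computed by folding backward with the recursion \(R_t = r_t + \gamma R_{t+1}\). The function name `discounted_return` and the sample rewards are illustrative assumptions, not something from the original article.

```python
def discounted_return(rewards, gamma):
    """Return sum_i gamma**i * rewards[i] for a finite reward sequence."""
    ret = 0.0
    # Fold from the last reward backward: R_t = r_t + gamma * R_{t+1}
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


rewards = [1.0, 0.0, 2.0]  # hypothetical rewards at steps t, t+1, t+2
print(discounted_return(rewards, 0.0))  # only the first reward counts
print(discounted_return(rewards, 0.5))  # 1 + 0.5*0 + 0.25*2
```

With γ = 0 the later rewards drop out entirely, and larger γ values let them contribute more.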

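  • State transition probability matrix (Transition): given the agent's current state and action, it gives every possible next state (for example, the next board position in a board game) together with its probability;

  • Policy: the policy is what the algorithm is trying to learn. It can be viewed as a function that takes a state as input and returns either the action to execute or a probability distribution over actions.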

\[\pi(a \mid s) = P[A_t = a \mid S_t = s] \]

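1.1) Deterministic policy: always executes one specific action in a given state, i.e. \(\pi(s) = a\); the greedy policy is an example.
1.2) Stochastic policy: executes actions according to probabilities, i.e. \(\pi(s,a_i) = p_i\); the ε-greedy policy is an example.
2.1) Behavior policy: the policy used to interact with the environment and generate data, i.e. the one that makes decisions during training.
2.2) Target policy: the policy that is learned and then deployed once training is finished.
2.3) Off-policy: the target policy and the behavior policy are separate. Data generated by the behavior policy is used to estimate the target policy, often with importance sampling.
2.4) On-policy: the target policy and the behavior policy are the same policy, i.e. the target policy is learned directly.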

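  • Value function: the expected discounted future reward given a state (and, for the action-value form, an action) as input.
    There are generally two kinds of value function.
    1) State-value function: \(V_{\pi}(s) = E_{\pi} [R_t \mid S_t = s]\)
    2) Action-value function: \(Q_{\pi}(s, a) = E_{\pi} [R_t \mid S_t = s, A_t = a]\)

  • Bellman equation: the value of the current state depends on the value of the next step and on the current reward (written \(R_{t+1}\) in the equations below). The Bellman equation therefore implies that value functions can be computed iteratively.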

\[V(s) = \mathbb E[R_{t+1} + \gamma V(S_{t+1})|S_t = s] \]

\[Q(s,a) = \mathbb E[R_{t+1} + \gamma Q(S_{t+1},A_{t+1})|S_t = s,A_t = a] \]
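
To illustrate the remark that value functions can be computed iteratively, here is a minimal sketch of iterative policy evaluation for a fixed policy. The two-state chain, the dictionaries `P` and `R`, and the function name `evaluate_policy` are hypothetical and chosen only so the result is easy to check by hand.

```python
def evaluate_policy(P, R, gamma, sweeps=100):
    """Iterate V(s) <- R(s) + gamma * sum_s' P(s'|s) * V(s') for a fixed policy."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        # Each sweep applies the Bellman expectation backup to every state
        V = {s: R[s] + gamma * sum(p * V[s2] for s2, p in P[s].items())
             for s in P}
    return V


# Hypothetical chain under a fixed policy: A -> B gives reward 1, B -> A gives reward 0.
P = {"A": {"B": 1.0}, "B": {"A": 1.0}}
R = {"A": 1.0, "B": 0.0}
V = evaluate_policy(P, R, gamma=0.5)
print(V)  # approaches V(A) = 1 / (1 - 0.25) and V(B) = 0.5 * V(A)
```

Solving the two Bellman equations by hand gives V(A) = 4/3 and V(B) = 2/3, which the iteration approaches because each sweep shrinks the error by a factor of γ.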

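  • Categories of reinforcement learning
    Value-based RL: explicitly builds a model of the value function Q. Once the Q function of the optimal policy is found, the optimal policy follows by acting greedily with respect to it.
    Policy-based RL: explicitly builds a model of the policy function and searches for the policy that maximizes the discounted future reward.
    Model-based RL: first learns a model of the environment transitions and then uses that model to find the best policy.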

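Markov Decision Process

A Markov decision process (MDP) is a formal description of the environment in reinforcement learning, or in other words a model of the environment the agent lives in. Almost every reinforcement learning problem can be formalized as an MDP.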

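MDP element | Symbol | Description
State / state space | S | A state describes the environment. After the agent acts, the state changes, and its evolution has the Markov property. The set of all states of the MDP is the state space, which can be discrete or continuous.
Action / action space | A | An action describes the agent's behavior and is the result of its decision. The set of all possible actions of the MDP is the action space, which can be discrete or continuous.
Policy | \(\pi(a \mid s)\) | The policy of an MDP is a conditional probability distribution over actions given the state; in reinforcement learning terms it is a stochastic policy.
Immediate reward | R | The environment's feedback after the agent acts. It is a scalar function of the current state, the action, and the next state.
Cumulative return | G | The return accumulates rewards over time steps; once trajectories are introduced, the return is the sum of all rewards along a trajectory.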

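An MDP defined over discrete time is called a "discrete-time Markov decision process (discrete-time MDP)"; otherwise it is a "continuous-time Markov decision process (continuous-time MDP)" [1]. MDPs also have several variants, including partially observable MDPs, constrained MDPs, and fuzzy MDPs.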

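Policy Learning

Policy learning can be understood as a very detailed set of instructions that tells the agent which action to take at every step. The policy can also be viewed as a function with a single input: the agent's current state.
Policy search parameterizes the policy as \(\pi_\theta(s)\), represented by a linear or nonlinear function (such as a neural network), and looks for the parameters that maximize the reinforcement learning objective, the expected cumulative return \(E[\sum^H_{t=0} R(s_t) \mid \pi_\theta]\). Policy search methods iterate on the policy directly: they repeatedly update the parameters until the expected cumulative return is maximized, and the policy given by those parameters is the optimal policy.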
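
The idea of iterating on the parameters until the expected return stops improving can be sketched on the smallest possible case: a single state with two actions and a softmax policy. This is a simplification under stated assumptions. It uses the exact expected reward instead of sampled trajectories, and the names `softmax` and `policy_search` and the reward values are hypothetical.

```python
import math


def softmax(theta):
    m = max(theta)  # subtract the max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def policy_search(rewards, lr=0.5, steps=200):
    """Gradient ascent on J(theta) = sum_a pi_theta(a) * rewards[a]."""
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        pi = softmax(theta)
        J = sum(p * r for p, r in zip(pi, rewards))
        # For a softmax policy, dJ/dtheta_a = pi(a) * (r(a) - J)
        theta = [t + lr * p * (r - J) for t, p, r in zip(theta, pi, rewards)]
    return theta


rewards = [1.0, 2.0]  # hypothetical expected reward of each action
pi = softmax(policy_search(rewards))
print(pi)  # probability mass shifts toward the higher-reward action
```

In a real task the expectation is not available in closed form, so the gradient is estimated from sampled returns instead.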

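Temporal-Difference Methods (TD methods)

Temporal-difference methods combine the sampling of Monte Carlo methods (running trials) with the bootstrapping of dynamic programming (using the value estimate of the successor state to update the current value estimate). Mathematically:

\[V(S_t) \gets V(S_t) + \alpha \delta_t \]

\[\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \]

Here \(\delta_t\) is the TD error, \(\alpha\) is the learning step size, and \(\gamma\) is the reward discount rate.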
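
A single TD(0) update on a value table can be sketched as follows. The `done` flag, which drops the bootstrap term at a terminal state, and the example values are assumptions added for illustration.

```python
def td0_update(V, s, r, s_next, alpha, gamma, done=False):
    """Apply one TD(0) update to the value table V in place and return the TD error."""
    bootstrap = 0.0 if done else V[s_next]
    delta = r + gamma * bootstrap - V[s]  # TD error
    V[s] += alpha * delta                 # move V(s) a step toward the TD target
    return delta


V = {"s0": 0.0, "s1": 1.0}  # hypothetical value estimates
delta = td0_update(V, "s0", r=0.5, s_next="s1", alpha=0.1, gamma=0.9)
print(delta, V["s0"])  # delta = 0.5 + 0.9 * 1.0 - 0.0
```

The update only needs one observed transition, which is what lets TD methods learn before an episode ends.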

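Q-Learning

Another way to guide the agent is to give it a framework and let it choose actions on its own based on the current environment, instead of telling it explicitly which action to take in every state. Unlike policy learning, Q-Learning takes two inputs, the state and the action, and returns a value for each state-action pair. When facing a choice, the algorithm computes the expected value of each action the agent could take.
The key idea of Q-Learning is that it estimates not only the short-term value of taking an action in the current state but also the potential future value that follows from it. Because future rewards are worth less than immediate ones, Q-Learning uses a discount factor to model this.
[Figure: Q-Learning algorithm flowchart]
[Figure: Q-Learning pseudocode]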
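
A minimal tabular sketch of Q-Learning, assuming a hypothetical five-state chain in which the agent starts at the left end and receives reward 1 on reaching the right end. The environment, the uniformly random behavior policy, and the hyperparameters are illustrative choices, not taken from the article.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a chain; actions: 0 = left, 1 = right.

    Reaching the last state yields reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.randrange(2)  # uniformly random behavior policy (off-policy)
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # The target bootstraps from the greedy action in the next state
            target = r if s_next == goal else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


Q = q_learning_chain()
greedy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy)  # greedy action per non-terminal state
```

Because the update target uses the maximum over next actions, the learned Q values describe the greedy policy even though the data comes from random actions.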

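Actor-Critic Methods

Actor-Critic is an important family of reinforcement learning algorithms. It is a temporal-difference (TD) method that combines value-based and policy-based approaches. The policy function is the actor, which outputs actions; the value function is the critic, which judges how good the actor's actions are and produces a TD signal that drives the updates of both the value function and the policy function.

The Actor-Critic structure is shown in the figure below:
[Figure: Actor-Critic structure]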
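
The interplay between the two parts can be sketched as a single one-step actor-critic update on tables. This assumes a softmax actor over tabular action preferences and a tabular critic; the function names, learning rates, and example values are hypothetical.

```python
import math


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, V, s, a, r, s_next, done,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.9):
    """One-step actor-critic update on preferences theta[s][a] and values V[s]."""
    # Critic: the TD error says whether the action did better than expected
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
    V[s] += alpha_critic * delta
    # Actor: move along grad log pi(a|s) = onehot(a) - pi(.|s), scaled by the TD error
    pi = softmax(theta[s])
    for b in range(len(theta[s])):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_actor * delta * grad_log
    return delta


theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # hypothetical action preferences
V = {0: 0.0, 1: 0.0}                    # hypothetical state values
delta = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
print(delta, V[0], theta[0])
```

A positive TD error raises the preference for the action that was taken and lowers the others; a negative one does the opposite.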

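DQN

DQN builds on Q-Learning and fits the Q values with a deep neural network. It trains the network end to end, exploiting the ability of deep networks to handle high-dimensional input. To do so it has to resolve two mismatches between deep learning and reinforcement learning:

  • 1. Deep learning needs large amounts of labeled samples, whereas in reinforcement learning the agent collects samples itself, and the feedback is sparse and delayed.
  • 2. Deep learning assumes samples are independent and identically distributed, whereas consecutive samples collected in reinforcement learning are correlated rather than independent.

[Figure: structure of the 2015 version of DQN]

It relies on two key techniques:
1. Experience replay (replay buffer): collected samples are first stored in a buffer, and training samples are then drawn at random from the buffer. This breaks the correlation between consecutive samples, so the training samples are closer to independent.
2. Fixed Q-target (target network): computing the target value requires current Q values, so a separate, more slowly updated network is used to provide them. This improves training stability and convergence.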
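
How the fixed target network enters the training target can be sketched as below. The batch layout `(state, action, reward, next_state, done)` and the lookup table standing in for the target network are assumptions made for this example.

```python
def dqn_targets(batch, q_target, gamma):
    """Compute y = r + gamma * max_a' Q_target(s', a') for each transition.

    batch: list of (state, action, reward, next_state, done) tuples.
    q_target: callable mapping a state to a list of Q values from the frozen target network.
    """
    targets = []
    for _state, _action, reward, next_state, done in batch:
        # Terminal transitions have no future value to bootstrap from
        bootstrap = 0.0 if done else gamma * max(q_target(next_state))
        targets.append(reward + bootstrap)
    return targets


# Hypothetical stand-in for a target network: a lookup table of Q values per state.
frozen_q = {"s1": [0.2, 1.0], "s2": [0.0, 0.0]}
batch = [("s0", 0, 1.0, "s1", False), ("s1", 1, 5.0, "s2", True)]
targets = dqn_targets(batch, frozen_q.__getitem__, gamma=0.9)
print(targets)
```

The online network is then regressed toward these targets, while `q_target` is only refreshed from the online network occasionally.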

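DDPG

DDPG handles high-dimensional input, supports end-to-end control, and outputs continuous actions, so deep reinforcement learning can be applied to more complex settings with large or continuous action spaces. DDPG is based on the Actor-Critic method: a network fits the policy function and outputs the action directly, which is what lets it cope with continuous actions and large action spaces.

[Figure: DDPG algorithm]

The structure contains two networks, a policy network (actor) and a value network (critic). The policy network outputs actions and the value network evaluates them. Each has its own update signal: the policy network is updated with the policy gradient formula, while the value network is updated toward a target value.
DDPG reuses what worked for DQN, namely experience replay and a fixed target network. In other words, each of the two networks has a slowly changing copy, and those copies provide the values needed in the update targets. The overall structure of DDPG is shown below:
[Figure: overall structure of DDPG]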
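
The "slowly changing copy" is commonly maintained with a soft update, where each target parameter moves a fraction `tau` toward its online counterpart. A minimal sketch on plain lists, with hypothetical weights:

```python
def soft_update(target_params, source_params, tau):
    """Return tau * source + (1 - tau) * target for each parameter."""
    return [tau * s + (1.0 - tau) * t
            for s, t in zip(source_params, target_params)]


target = [0.0, 1.0]  # hypothetical target-network weights
source = [1.0, 1.0]  # hypothetical online-network weights
for _ in range(3):
    target = soft_update(target, source, tau=0.1)
print(target)  # the first weight has moved 1 - 0.9**3 of the way toward the source
```

A small `tau` keeps the targets nearly stationary between updates, which is the same stabilizing role the fixed Q-target plays in DQN.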

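DDPG Implementation in TensorFlow

The code below uses the TensorFlow 1.x graph API (tf.placeholder, tf.Session, tf.contrib).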

import numpy as np
from collections import deque
import random
import tensorflow as tf
from math import sqrt


class Agent(object):
    def __init__(self, model, replay_buffer, exploration_noise, discout_factor, verbose=False):
        self.model = model
        self.replay_buffer = replay_buffer
        self.exploration_noise = exploration_noise
        self.discout_factor = discout_factor
        self.verbose = verbose

    def predict_action(self, observation):
        return self.model.predict_action(observation)

    def select_action(self, observation, p=None):
        pred_action = self.predict_action(observation)
        noise = self.exploration_noise.return_noise()
        if p is not None:
            return pred_action * p + noise * (1 - p)
        else:
            return pred_action + noise

    def store_transition(self, transition):
        self.replay_buffer.store_transition(transition)

    def init_process(self):
        self.exploration_noise.init_process()

    def get_transition_batch(self):
        batch = self.replay_buffer.get_batch()
        transpose_batch = list(zip(*batch))
        s_batch = np.vstack(transpose_batch[0])
        a_batch = np.vstack(transpose_batch[1])
        r_batch = np.vstack(transpose_batch[2])
        next_s_batch = np.vstack(transpose_batch[3])
        done_batch = np.vstack(transpose_batch[4])
        return s_batch, a_batch, r_batch, next_s_batch, done_batch

    def preprocess_batch(self, s_batch, a_batch, r_batch, next_s_batch, done_batch):
        # Bellman target from the target networks: y = r + gamma * Q'(s', mu'(s')),
        # with the bootstrap term zeroed for terminal transitions (done == 1).
        target_actor_net_pred_action = self.model.actor.predict_action_target_net(next_s_batch)
        target_critic_net_pred_q = self.model.critic.predict_q_target_net(next_s_batch, target_actor_net_pred_action)
        y_batch = r_batch + self.discout_factor * target_critic_net_pred_q * (1 - done_batch)
        return s_batch, a_batch, y_batch

    def train_model(self):
        s_batch, a_batch, r_batch, next_s_batch, done_batch = self.get_transition_batch()
        self.model.update(*self.preprocess_batch(s_batch, a_batch, r_batch, next_s_batch, done_batch))


class Replay_Buffer(object):
    def __init__(self, buffer_size=10e6, batch_size=1):
        # deque requires an integer maxlen, so cast the float default (10e6)
        self.buffer_size = int(buffer_size)
        self.batch_size = batch_size
        self.memory = deque(maxlen=self.buffer_size)

    def __call__(self):
        return self.memory

    def store_transition(self, transition):
        self.memory.append(transition)

    def store_transitions(self, transitions):
        self.memory.extend(transitions)

    def get_batch(self, batch_size=None):
        b_s = batch_size or self.batch_size
        # Sample without replacement; return everything if fewer than b_s items are stored
        cur_mem_size = len(self.memory)
        return random.sample(list(self.memory), min(b_s, cur_mem_size))

    def memory_state(self):
        return {"buffer_size": self.buffer_size,
                "current_size": len(self.memory),
                "full": len(self.memory) == self.buffer_size}

    def empty_transition(self):
        self.memory.clear()


class DDPG_Actor(object):
    def __init__(self, state_dim, action_dim, optimizer=None, learning_rate=0.001, tau=0.001, scope="", sess=None):
        self.scope = scope
        self.sess = sess
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.learning_rate = learning_rate
        self.l2_reg = 0.01
        self.optimizer = optimizer or tf.train.AdamOptimizer(self.learning_rate)
        self.tau = tau
        self.h1_dim = 400
        self.h2_dim = 300
        # self.h3_dim = 200
        self.activation = tf.nn.relu
        self.kernel_initializer = tf.contrib.layers.variance_scaling_initializer()
        # fan-out uniform initializer which is different from original paper
        self.kernel_initializer_1 = tf.random_uniform_initializer(minval=-1 / sqrt(self.h1_dim),
                                                                  maxval=1 / sqrt(self.h1_dim))
        self.kernel_initializer_2 = tf.random_uniform_initializer(minval=-1 / sqrt(self.h2_dim),
                                                                  maxval=1 / sqrt(self.h2_dim))
        self.kernel_initializer_3 = tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3)
        self.kernel_regularizer = tf.contrib.layers.l2_regularizer(self.l2_reg)

        with tf.name_scope("actor_input"):
            self.input_state = tf.placeholder(tf.float32, shape=[None, self.state_dim], name="states")

        with tf.name_scope("actor_label"):
            self.actions_grad = tf.placeholder(tf.float32, shape=[None, self.action_dim], name="actions_grad")

        self.source_var_scope = "ddpg/" + "actor_net"
        with tf.variable_scope(self.source_var_scope):
            self.action_output = self.__create_actor_network()

        self.target_var_scope = "ddpg/" + "actor_target_net"
        with tf.variable_scope(self.target_var_scope):
            self.target_net_actions_output = self.__create_target_network()

        with tf.name_scope("compute_policy_gradients"):
            self.__create_loss()

        self.train_op_scope = "actor_train_op"
        with tf.variable_scope(self.train_op_scope):
            self.__create_train_op()

        with tf.name_scope("actor_target_update_train_op"):
            self.__create_update_target_net_op()

        self.__create_get_layer_weight_op_source()
        self.__create_get_layer_weight_op_target()

    def __create_actor_network(self):
        h1 = tf.layers.dense(self.input_state,
                             units=self.h1_dim,
                             activation=self.activation,
                             kernel_initializer=self.kernel_initializer_1,
                             # kernel_initializer=self.kernel_initializer,
                             kernel_regularizer=self.kernel_regularizer,
                             name="hidden_1")

        h2 = tf.layers.dense(h1,
                             units=self.h2_dim,
                             activation=self.activation,
                             kernel_initializer=self.kernel_initializer_2,
                             # kernel_initializer=self.kernel_initializer,
                             kernel_regularizer=self.kernel_regularizer,
                             name="hidden_2")

        # h3 = tf.layers.dense(h2,
        # units=self.h3_dim,
        # activation=self.activation,
        # kernel_initializer=self.kernel_initializer,
        # kernel_regularizer=self.kernel_regularizer,
        # name="hidden_3")

        action_output = tf.layers.dense(h2,
                                        units=self.action_dim,
                                        activation=tf.nn.tanh,
                                        # activation=tf.nn.tanh,
                                        kernel_initializer=self.kernel_initializer_3,
                                        # kernel_initializer=self.kernel_initializer,
                                        kernel_regularizer=self.kernel_regularizer,
                                        use_bias=False,
                                        name="action_outputs")

        return action_output

    def __create_target_network(self):
        # get source variables and initialize
        source_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.source_var_scope)
        self.sess.run(tf.variables_initializer(source_vars))

        # create target network and initialize it by source network
        action_output = self.__create_actor_network()
        target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.target_var_scope)

        target_init_op_list = [target_vars[i].assign(source_vars[i]) for i in range(len(source_vars))]
        self.sess.run(target_init_op_list)

        return action_output

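    def __create_loss(self):
        source_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.source_var_scope)
        # Deterministic policy gradient: chain dQ/da (fed in as actions_grad) through da/dtheta.
        # The sign is flipped because the optimizer minimizes, while the actor should increase Q.
        self.policy_gradient = tf.gradients(self.action_output, source_vars, -self.actions_grad)
        self.grads_and_vars = list(zip(self.policy_gradient, source_vars))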

    def __create_train_op(self):
        self.train_policy_op = self.optimizer.apply_gradients(self.grads_and_vars,
                                                              global_step=tf.contrib.framework.get_global_step())
        train_op_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                                          scope=self.scope + "/" + self.train_op_scope)  # to do: remove prefix
        train_op_vars.extend(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.train_op_scope))
        self.sess.run(tf.variables_initializer(train_op_vars))

    def __create_update_target_net_op(self):
        source_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.source_var_scope)
        target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.target_var_scope)
        update_target_net_op_list = [target_vars[i].assign(self.tau * source_vars[i] + (1 - self.tau) * target_vars[i])
                                     for i in range(len(source_vars))]

        # source_net_dict = {var.name[len(self.source_var_scope):]: var for var in source_vars}
        # target_net_dict = {var.name[len(self.target_var_scope):]: var for var in target_vars}
        # keys = source_net_dict.keys()
        # update_target_net_op_list = [target_net_dict[key].assign((1-self.tau)*target_net_dict[key]+self.tau*source_net_dict[key]) \
        # for key in keys]

        # for s_v, t_v in zip(source_vars, target_vars):
        # update_target_net_op_list.append(t_v.assign(self.tau*s_v - (1-self.tau)*t_v))

        self.update_target_net_op = tf.group(*update_target_net_op_list)

    def predict_action_source_net(self, feed_state, sess=None):
        sess = sess or self.sess
        return sess.run(self.action_output, {self.input_state: feed_state})

    def predict_action_target_net(self, feed_state, sess=None):
        sess = sess or self.sess
        return sess.run(self.target_net_actions_output, {self.input_state: feed_state})

    def update_source_actor_net(self, feed_state, actions_grad, sess=None):
        sess = sess or self.sess
        batch_size = len(actions_grad)
        return sess.run([self.train_policy_op],
                        {self.input_state: feed_state,
                         self.actions_grad: actions_grad / batch_size})

    def update_target_actor_net(self, sess=None):
        sess = sess or self.sess
        return sess.run(self.update_target_net_op)

    def __create_get_layer_weight_op_source(self):
        with tf.variable_scope(self.source_var_scope, reuse=True):
            self.h1_weight_source = tf.get_variable("hidden_1/kernel")
            self.h1_bias_source = tf.get_variable("hidden_1/bias")

    def run_layer_weight_source(self, sess=None):
        sess = sess or self.sess
        return sess.run([self.h1_weight_source, self.h1_bias_source])

    def __create_get_layer_weight_op_target(self):
        with tf.variable_scope(self.target_var_scope, reuse=True):
            self.h1_weight_target = tf.get_variable("hidden_1/kernel")
            self.h1_bias_target = tf.get_variable("hidden_1/bias")

    def run_layer_weight_target(self, sess=None):
        sess = sess or self.sess
        return sess.run([self.h1_weight_target, self.h1_bias_target])


class DDPG_Critic(object):
    def __init__(self, state_dim, action_dim, optimizer=None, learning_rate=0.001, tau=0.001, scope="", sess=None):
        self.scope = scope
        self.sess = sess
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.learning_rate = learning_rate
        self.l2_reg = 0.01
        self.optimizer = optimizer or tf.train.AdamOptimizer(self.learning_rate)
        self.tau = tau
        self.h1_dim = 400
        self.h2_dim = 100
        self.h3_dim = 300
        self.activation = tf.nn.relu
        self.kernel_initializer = tf.contrib.layers.variance_scaling_initializer()
        # fan-out uniform initializer which is different from original paper
        self.kernel_initializer_1 = tf.random_uniform_initializer(minval=-1 / sqrt(self.h1_dim),
                                                                  maxval=1 / sqrt(self.h1_dim))
        self.kernel_initializer_2 = tf.random_uniform_initializer(minval=-1 / sqrt(self.h2_dim),
                                                                  maxval=1 / sqrt(self.h2_dim))
        self.kernel_initializer_3 = tf.random_uniform_initializer(minval=-1 / sqrt(self.h3_dim),
                                                                  maxval=1 / sqrt(self.h3_dim))
        self.kernel_initializer_4 = tf.random_uniform_initializer(minval=-3e-3, maxval=3e-3)
        self.kernel_regularizer = tf.contrib.layers.l2_regularizer(self.l2_reg)

        with tf.name_scope("critic_input"):
            self.input_state = tf.placeholder(tf.float32, shape=[None, self.state_dim], name="states")
            self.input_action = tf.placeholder(tf.float32, shape=[None, self.action_dim], name="actions")

        with tf.name_scope("critic_label"):
            self.y = tf.placeholder(tf.float32, shape=[None, 1], name="y")

        self.source_var_scope = "ddpg/" + "critic_net"
        with tf.variable_scope(self.source_var_scope):
            self.q_output = self.__create_critic_network()

        self.target_var_scope = "ddpg/" + "critic_target_net"
        with tf.variable_scope(self.target_var_scope):
            self.target_net_q_output = self.__create_target_network()

        with tf.name_scope("compute_critic_loss"):
            self.__create_loss()

        self.train_op_scope = "critic_train_op"
        with tf.variable_scope(self.train_op_scope):
            self.__create_train_op()

        with tf.name_scope("critic_target_update_train_op"):
            self.__create_update_target_net_op()

        with tf.name_scope("get_action_grad_op"):
            self.__create_get_action_grad_op()

        self.__create_get_layer_weight_op_source()
        self.__create_get_layer_weight_op_target()

    def __create_critic_network(self):
        h1 = tf.layers.dense(self.input_state,
                             units=self.h1_dim,
                             activation=self.activation,
                             kernel_initializer=self.kernel_initializer_1,
                             # kernel_initializer=self.kernel_initializer,
                             kernel_regularizer=self.kernel_regularizer,
                             name="hidden_1")

        # h1_with_action = tf.concat([h1, self.input_action], 1, name="hidden_1_with_action")

        h2 = tf.layers.dense(self.input_action,
                             units=self.h2_dim,
                             activation=self.activation,
                             kernel_initializer=self.kernel_initializer_2,
                             # kernel_initializer=self.kernel_initializer,
                             kernel_regularizer=self.kernel_regularizer,
                             name="hidden_2")

        h_concat = tf.concat([h1, h2], 1, name="h_concat")

        h3 = tf.layers.dense(h_concat,
                             units=self.h3_dim,
                             activation=self.activation,
                             kernel_initializer=self.kernel_initializer_3,
                             # kernel_initializer=self.kernel_initializer,
                             kernel_regularizer=self.kernel_regularizer,
                             name="hidden_3")

        # h2_with_action = tf.concat([h2, self.input_action], 1, name="hidden_3_with_action")

        q_output = tf.layers.dense(h3,
                                   units=1,
                                   # activation=tf.nn.sigmoid,
                                   activation=None,
                                   kernel_initializer=self.kernel_initializer_4,
                                   # kernel_initializer=self.kernel_initializer,
                                   kernel_regularizer=self.kernel_regularizer,
                                   name="q_output")

        return q_output

    def __create_target_network(self):
        # get source variables and initialize
        source_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.source_var_scope)
        self.sess.run(tf.variables_initializer(source_vars))

        # create target network and initialize it by source network
        q_output = self.__create_critic_network()
        target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.target_var_scope)

        target_init_op_list = [target_vars[i].assign(source_vars[i]) for i in range(len(source_vars))]
        self.sess.run(target_init_op_list)

        return q_output

    def __create_loss(self):
        self.loss = tf.losses.mean_squared_error(self.y, self.q_output)

    def __create_train_op(self):
        self.train_q_op = self.optimizer.minimize(self.loss)
        train_op_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                                          scope=self.scope + "/" + self.train_op_scope)  # to do: remove prefix
        train_op_vars.extend(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self.train_op_scope))
        self.sess.run(tf.variables_initializer(train_op_vars))

    def __create_update_target_net_op(self):
        source_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.source_var_scope)
        target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.target_var_scope)
        update_target_net_op_list = [target_vars[i].assign(self.tau * source_vars[i] + (1 - self.tau) * target_vars[i])
                                     for i in range(len(source_vars))]
        # source_net_dict = {var.name[len(self.source_var_scope):]: var for var in source_vars}
        # target_net_dict = {var.name[len(self.target_var_scope):]: var for var in target_vars}
        # keys = source_net_dict.keys()
        # update_target_net_op_list = [target_net_dict[key].assign((1-self.tau)*target_net_dict[key]+self.tau*source_net_dict[key]) \
        # for key in keys]

        # for s_v, t_v in zip(source_vars, target_vars):
        # update_target_net_op_list.append(t_v.assign(self.tau*s_v - (1-self.tau)*t_v))

        self.update_target_net_op = tf.group(*update_target_net_op_list)

    def __create_get_action_grad_op(self):
        self.get_action_grad_op = tf.gradients(self.q_output, self.input_action)

    def predict_q_source_net(self, feed_state, feed_action, sess=None):
        sess = sess or self.sess
        return sess.run(self.q_output, {self.input_state: feed_state,
                                        self.input_action: feed_action})

    def predict_q_target_net(self, feed_state, feed_action, sess=None):
        sess = sess or self.sess
        return sess.run(self.target_net_q_output, {self.input_state: feed_state,
                                                   self.input_action: feed_action})

    def update_source_critic_net(self, feed_state, feed_action, feed_y, sess=None):
        sess = sess or self.sess
        return sess.run([self.train_q_op],
                        {self.input_state: feed_state,
                         self.input_action: feed_action,
                         self.y: feed_y})

    def update_target_critic_net(self, sess=None):
        sess = sess or self.sess
        return sess.run(self.update_target_net_op)

    def get_action_grads(self, feed_state, feed_action, sess=None):
        sess = sess or self.sess
        return (sess.run(self.get_action_grad_op, {self.input_state: feed_state,
                                                   self.input_action: feed_action}))[0]

    def __create_get_layer_weight_op_source(self):
        with tf.variable_scope(self.source_var_scope, reuse=True):
            self.h1_weight_source = tf.get_variable("hidden_1/kernel")
            self.h1_bias_source = tf.get_variable("hidden_1/bias")

    def run_layer_weight_source(self, sess=None):
        sess = sess or self.sess
        return sess.run([self.h1_weight_source, self.h1_bias_source])

    def __create_get_layer_weight_op_target(self):
        with tf.variable_scope(self.target_var_scope, reuse=True):
            self.h1_weight_target = tf.get_variable("hidden_1/kernel")
            self.h1_bias_target = tf.get_variable("hidden_1/bias")

    def run_layer_weight_target(self, sess=None):
        sess = sess or self.sess
        return sess.run([self.h1_weight_target, self.h1_bias_target])


class Model(object):
    def __init__(self,
                 state_dim,
                 action_dim,
                 optimizer=None,
                 actor_learning_rate=1e-4,
                 critic_learning_rate=1e-3,
                 tau=0.001,
                 sess=None):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.actor_learning_rate = actor_learning_rate
        self.critic_learning_rate = critic_learning_rate
        self.tau = tau

        # tf.reset_default_graph()
        self.sess = sess or tf.Session()

        self.global_step = tf.Variable(0, name="global_step", trainable=False)
        global_step_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="global_step")
        self.sess.run(tf.variables_initializer(global_step_vars))

        self.actor_scope = "actor_net"
        with tf.name_scope(self.actor_scope):
            self.actor = DDPG_Actor(self.state_dim,
                                    self.action_dim,
                                    learning_rate=self.actor_learning_rate,
                                    tau=self.tau,
                                    scope=self.actor_scope,
                                    sess=self.sess)

        self.critic_scope = "critic_net"
        with tf.name_scope(self.critic_scope):
            self.critic = DDPG_Critic(self.state_dim,
                                      self.action_dim,
                                      learning_rate=self.critic_learning_rate,
                                      tau=self.tau,
                                      scope=self.critic_scope,
                                      sess=self.sess)

    def update(self, state_batch, action_batch, y_batch, sess=None):
        sess = sess or self.sess
        # 1) Critic step: regress Q(s, a) toward the precomputed targets y
        self.critic.update_source_critic_net(state_batch, action_batch, y_batch, sess)
        # 2) Actor step: evaluate dQ/da at the actor's current actions and feed it back through the actor
        action_batch_for_grad = self.actor.predict_action_source_net(state_batch, sess)
        action_grad_batch = self.critic.get_action_grads(state_batch, action_batch_for_grad, sess)
        self.actor.update_source_actor_net(state_batch, action_grad_batch, sess)

        # 3) Soft-update both target networks toward the online networks
        self.critic.update_target_critic_net(sess)
        self.actor.update_target_actor_net(sess)

    def predict_action(self, observation, sess=None):
        sess = sess or self.sess
        return self.actor.predict_action_source_net(observation, sess)
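
The `Agent` class above calls `exploration_noise.init_process()` and `exploration_noise.return_noise()`, but the listing does not define that object. The original DDPG paper used an Ornstein-Uhlenbeck process for exploration; whether this implementation did is not stated, so the class below is only a hypothetical stand-in with the same two method names. The class name `OUNoise` and all default parameters are assumptions.

```python
import math
import random


class OUNoise(object):
    """Ornstein-Uhlenbeck process: dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.action_dim = action_dim
        self.mu = mu
        self.theta = theta
        self.sigma = sigma
        self.dt = dt
        self.rng = random.Random(seed)
        self.init_process()

    def init_process(self):
        # Reset the internal state to the mean, e.g. at the start of each episode
        self.x = [self.mu] * self.action_dim

    def return_noise(self):
        # Advance the process one step and return a copy of the new state as the noise sample
        self.x = [x + self.theta * (self.mu - x) * self.dt
                  + self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
                  for x in self.x]
        return list(self.x)


noise = OUNoise(action_dim=2, seed=0)
print(noise.return_noise())
```

The sample is returned as a plain list with one entry per action dimension; it can be wrapped in `np.array` before being added to the predicted action.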

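Modeling Recommender Systems with Reinforcement Learning

Reinforcement learning (MDP) concept | Counterpart in a recommender system
Agent | The recommender system
Environment | The user
State | The state comes from the agent's observation of the environment; in recommendation this is the user's intent and context. Concretely, dense and embedding features can represent the time, place, and scenario the user is in, along with behavior patterns mined over a longer period.
Action | It is advisable to model the reward first and the action second. The action depends on the business problem being solved; common choices are the fusion weights of the models in multi-objective ranking, or the share of traffic allocated to each recall channel.
Reward | The reward given to the agent based on user feedback; it is directly accountable for the business goal. Common choices are click-through rate, conversion rate, or dwell time.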
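
As a sketch of what an action and a reward could look like in this mapping, the example below treats the action as a set of fusion weights for multi-objective ranking and builds a scalar reward from user feedback. The item names, model scores, objective names, and reward weights are all hypothetical.

```python
def fuse_scores(candidates, weights):
    """Rank items by a weighted sum of per-objective model scores.

    candidates: dict item_id -> dict objective -> score.
    weights: dict objective -> fusion weight (the RL action).
    """
    fused = {item: sum(weights[obj] * score for obj, score in scores.items())
             for item, scores in candidates.items()}
    return sorted(fused, key=fused.get, reverse=True)


def reward_from_feedback(clicked, converted, dwell_seconds,
                         w_click=1.0, w_convert=5.0, w_dwell=0.01):
    """Scalar reward from user feedback; the weights are hypothetical."""
    return w_click * clicked + w_convert * converted + w_dwell * dwell_seconds


# Hypothetical model scores for two items on two objectives.
candidates = {"item_a": {"ctr": 0.30, "cvr": 0.01},
              "item_b": {"ctr": 0.10, "cvr": 0.05}}
print(fuse_scores(candidates, {"ctr": 1.0, "cvr": 0.0}))   # click-focused action
print(fuse_scores(candidates, {"ctr": 1.0, "cvr": 10.0}))  # conversion-heavy action
```

Changing the fusion weights reorders the same candidates, which is why the weights are a natural continuous action for a method such as DDPG.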

Design of a Real-Time Reinforcement Learning Framework
[Figure: framework structure]

Appendix

