RLlib Reinforcement Learning Framework Tutorial 004: Using the Training APIs (Part 3), Advanced Python API

Reposted article, published 2020-10-06 17:29. Category: 065. Introduction to the RLlib reinforcement learning framework

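Sometimes we need to coordinate code that runs in different processes, for example to maintain a global variable or the hyperparameters used by policies. Ray provides a general way to do this: named actors. Each of these actors is assigned a global name, and a handle to it can be retrieved through that name. For example, suppose we want to maintain a shared global counter that is incremented by the environments and read from time to time by the driver program: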

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '3'
import ray
import numpy as np
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
ray.init()
import gym
# Get a reference to the policy
from ray.rllib.agents.ppo import PPOTrainer

@ray.remote
class Counter:
   def __init__(self):
      self.count = 0
   def inc(self, n):
      self.count += n
   def get(self):
      return self.count

# on the driver
counter = Counter.options(name="global_counter").remote()
print(ray.get(counter.get.remote()))  # get the latest count

# in your envs
counter = ray.get_actor("global_counter")
counter.inc.remote(1)  # async call to increment the global count
print(ray.get(counter.get.remote()))  # get the latest count


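Ray actors provide a high level of performance, so in more complex cases they can be used to implement communication patterns such as parameter servers and allreduce.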


Callbacks and Custom Metrics

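You can provide callbacks that are called during policy evaluation. These callbacks have access to the state of the current episode. Certain callbacks, such as on_postprocess_trajectory, on_sample_end, and on_train_result, are also places where custom postprocessing can be applied to intermediate data or results.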

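User-defined state is stored in the episode.user_data dict, and custom metric values are saved in the episode.custom_metrics dict. These custom metrics are aggregated and reported as part of the training results.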
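The pattern described above can be sketched without RLlib. In the minimal example below, `EpisodeStub` and the three callback functions are hypothetical stand-ins (not RLlib classes); they only mimic the two dicts named in the text, user_data and custom_metrics. How callbacks are registered with a Trainer depends on the RLlib version and is not shown here.

```python
from statistics import mean


class EpisodeStub:
    """Hypothetical stand-in exposing the two dicts described above."""

    def __init__(self):
        self.user_data = {}        # scratch space for per-episode state
        self.custom_metrics = {}   # scalar values to be reported


def on_episode_start(episode):
    # Prepare a list that will collect one value per step.
    episode.user_data["pole_angles"] = []


def on_episode_step(episode, observation):
    # Assumption: index 2 of the observation holds the pole angle.
    episode.user_data["pole_angles"].append(abs(observation[2]))


def on_episode_end(episode):
    # Reduce the collected values to one scalar metric for this episode.
    episode.custom_metrics["pole_angle"] = mean(episode.user_data["pole_angles"])


episode = EpisodeStub()
on_episode_start(episode)
for obs in ([0.0, 0.0, 0.5, 0.0], [0.0, 0.0, -1.5, 0.0]):
    on_episode_step(episode, obs)
on_episode_end(episode)
print(episode.custom_metrics)
```

The per-step values live only in user_data; just the final scalar goes into custom_metrics, which keeps the reported result small.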


Visualizing Custom Metrics

Custom metrics can be accessed and visualized like any other training result.
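As a rough illustration of what "aggregated" means for custom metrics: assuming each metric is summarized across episodes by its mean, minimum, and maximum, the reported values could be computed as below. The `_mean`/`_min`/`_max` key suffixes and the helper name `aggregate_custom_metrics` are assumptions of this sketch, not a documented API.

```python
from collections import defaultdict


def aggregate_custom_metrics(per_episode_metrics):
    """Combine a list of per-episode metric dicts into summary statistics."""
    values = defaultdict(list)
    for metrics in per_episode_metrics:
        for name, value in metrics.items():
            values[name].append(value)

    summary = {}
    for name, collected in values.items():
        summary[name + "_mean"] = sum(collected) / len(collected)
        summary[name + "_min"] = min(collected)
        summary[name + "_max"] = max(collected)
    return summary


# Two episodes, each reporting one custom metric.
episodes = [{"pole_angle": 1.0}, {"pole_angle": 3.0}]
result = {"custom_metrics": aggregate_custom_metrics(episodes)}
print(result["custom_metrics"]["pole_angle_mean"])
```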


Customizing Exploration Behavior

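RLlib provides a unified top-level API to configure and customize an agent's exploration behavior, including how and whether actions are sampled from the action distribution (stochastically or deterministically). This is set up through the built-in Exploration classes, which are configured via Trainer.config["exploration_config"]. Besides using the built-in classes, you can also subclass one of them and use the subclass in the config.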

Every policy has an Exploration (or subclass) object. This Exploration object is created from the Trainer's config["exploration_config"] dict:

# in Trainer.config:
"exploration_config": {
    "type": "StochasticSampling",  # <- Special `type` key provides class information
    "[c'tor arg]" : "[value]",  # <- Add any needed constructor args here.
    # etc
}
# ...

The following table lists the built-in Exploration subclasses and the agents that use them by default:

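An Exploration class implements the get_exploration_action method, in which the exact exploratory behavior is defined. It takes the model's output, the action distribution class, the model itself, a timestep (the global env-sampling step count), and an explore switch, and it returns an action together with its log-likelihood: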

def get_exploration_action(self,
                           distribution_inputs,
                           action_dist_class,
                           model=None,
                           explore=True,
                           timestep=None):
    """Returns a (possibly) exploratory action and its log-likelihood.

    Given the Model's logits outputs and action distribution, returns an
    exploratory action.

    Args:
        distribution_inputs (any): The output coming from the model,
            ready for parameterizing a distribution
            (e.g. q-values or PG-logits).
        action_dist_class (class): The action distribution class
            to use.
        model (ModelV2): The Model object.
        explore (bool): True: "Normal" exploration behavior.
            False: Suppress all exploratory behavior and return
            a deterministic action.
        timestep (int): The current sampling time step. If None, the
            component should try to use an internal counter, which it
            then increments by 1. If provided, will set the internal
            counter to the given value.

    Returns:
        Tuple:
        - The chosen exploration action or a tf-op to fetch the exploration
          action from the graph.
        - The log-likelihood of the exploration action.
    """
    pass


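At the highest level, the Trainer.compute_action and Policy.compute_action(s) methods have an explore switch, which is passed on to Exploration.get_exploration_action. If it is None, the value of Trainer.config["explore"] is used. config["explore"] therefore describes the default behavior of the policy, and it can be used to switch exploration off entirely (for example, during evaluation).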
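The fallback rule for the explore switch is small enough to write down. The helper `resolve_explore` below is a hypothetical name used only for illustration; it assumes that the per-call argument and the config are available as plain Python values.

```python
def resolve_explore(explore, config):
    """Return the effective explore flag for one compute_action call."""
    # A per-call value of None means "use the configured default".
    if explore is None:
        return config["explore"]
    return explore


config = {"explore": False}
print(resolve_explore(None, config))   # falls back to the config default
print(resolve_explore(True, config))   # per-call value overrides the default
```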

The following examples show how different Trainer configs set up different exploration behaviors:

# All of the following configs go into Trainer.config.

# 1) Switching *off* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its `explore`
# param will result in no exploration.
# However, explicitly calling `compute_action(s)` with `explore=True` will
# still(!) result in exploration (per-call overrides default).
"explore": False,

# 2) Switching *on* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its
# explore param will result in exploration.
# However, explicitly calling `compute_action(s)` with `explore=False`
# will result in no(!) exploration (per-call overrides default).
"explore": True,

# 3) Example exploration_config usages:
# a) DQN: see rllib/agents/dqn/dqn.py
"explore": True,
"exploration_config": {
   # Exploration sub-class by name or full path to module+class
   # (e.g. "ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy")
   "type": "EpsilonGreedy",
   # Parameters for the Exploration class' constructor:
   "initial_epsilon": 1.0,
   "final_epsilon": 0.02,
   "epsilon_timesteps": 10000,  # Timesteps over which to anneal epsilon.
},

# b) DQN Soft-Q: In order to switch to Soft-Q exploration, do instead:
"explore": True,
"exploration_config": {
   "type": "SoftQ",
   # Parameters for the Exploration class' constructor:
   "temperature": 1.0,
},

# c) PPO: see rllib/agents/ppo/ppo.py
# Behavior: The algo samples stochastically by default from the
# model-parameterized distribution. This is the global Trainer default
# setting defined in trainer.py and used by all PG-type algos.
"explore": True,
"exploration_config": {
   "type": "StochasticSampling",
},

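To make the three EpsilonGreedy parameters concrete, here is a minimal sketch. It assumes that epsilon is annealed linearly from initial_epsilon to final_epsilon over epsilon_timesteps and then held constant. The linear schedule and the helper names `epsilon_at` and `epsilon_greedy_action` are assumptions of this sketch, not RLlib's implementation.

```python
import random


def epsilon_at(timestep, initial_epsilon=1.0, final_epsilon=0.02,
               epsilon_timesteps=10000):
    """Linearly interpolate epsilon, then hold it at final_epsilon."""
    fraction = min(timestep / epsilon_timesteps, 1.0)
    return initial_epsilon + fraction * (final_epsilon - initial_epsilon)


def epsilon_greedy_action(q_values, timestep, explore=True, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    greedy = max(range(len(q_values)), key=lambda a: q_values[a])
    if explore and rng.random() < epsilon_at(timestep):
        return rng.randrange(len(q_values))
    return greedy


print(epsilon_at(0))
print(epsilon_at(5000))
print(epsilon_greedy_action([0.1, 0.9, 0.3], timestep=0, explore=False))
```

With explore=False the random branch is never taken, which mirrors the "suppress all exploratory behavior" case in the docstring of get_exploration_action.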


Customized Evaluation During Training

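RLlib reports the rewards obtained during online training. In some cases, however, you may want to compute rewards under different settings (for example, with exploration turned off, or with a specific environment configuration).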

You can evaluate policies during training by setting the evaluation_interval config, optionally together with evaluation_num_episodes, evaluation_config, evaluation_num_workers, and custom_eval_function.

By default, exploration is left unchanged in evaluation_config. However, you can switch off all exploration for the evaluation workers with:

# Switching off exploration behavior for evaluation workers
# (see rllib/agents/trainer.py)
"evaluation_config": {
   "explore": False
}
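Putting the evaluation options named in this section together, a config fragment might look like the following. The numeric values are placeholders chosen for illustration, and custom_eval_function is left out.

```python
# Illustrative values only; adjust them for your own setup.
config = {
    "evaluation_interval": 5,        # evaluate every 5 training iterations
    "evaluation_num_episodes": 10,   # episodes to run per evaluation
    "evaluation_num_workers": 1,     # workers used for evaluation
    "evaluation_config": {
        "explore": False,            # suppress exploration while evaluating
    },
}
```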

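There is an end-to-end example of how to set up custom online evaluation in custom_eval.py. Note: if you only want to evaluate your policy at the end of training, you can set evaluation_interval: N, where N is the number of training iterations before stopping.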


Rewriting Trajectories

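Note that in the on_postprocess_traj callback you have full access to the trajectory batch and to other training state. This can be used to rewrite the trajectory, which may be useful in cases such as: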

1. Backdating rewards to earlier time steps (based on values in info).

2. Adding a model-based curiosity bonus to the reward (you can train the model with your own supervised method).
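The first use case can be sketched on plain Python lists. The batch layout (a dict with "rewards" and "infos" lists of equal length), the "delayed_reward_for" info key, and the function name are all hypothetical. A real trajectory batch would need the same idea applied to its own fields.

```python
def backdate_rewards(batch):
    """Move rewards to the earlier step named in each step's info dict."""
    rewards = list(batch["rewards"])  # copy so the input stays untouched
    for t, info in enumerate(batch["infos"]):
        target = info.get("delayed_reward_for")
        if target is not None:
            # Credit the earlier step and clear the reward where it arrived.
            rewards[target] += rewards[t]
            rewards[t] = 0.0
    return {**batch, "rewards": rewards}


batch = {
    "rewards": [0.0, 0.0, 1.0],
    "infos": [{}, {}, {"delayed_reward_for": 0}],
}
print(backdate_rewards(batch)["rewards"])
```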


Curriculum Learning

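There are two ways to implement curriculum learning. In curriculum learning, the agent's task is adjusted over time to help the training process. Suppose we have an environment class with a set_phase() method that we can call to adjust the task difficulty over time.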
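For illustration, a hypothetical environment with such a method might look like the sketch below. `PhasedEnv`, its phase-to-episode-length mapping, and the `phase_for_reward` helper are made up here; the reward thresholds 100 and 200 are the ones used in the examples in this section.

```python
class PhasedEnv:
    """Toy stand-in for an environment whose difficulty can be changed."""

    def __init__(self):
        self.phase = 0
        self.max_steps = 50

    def set_phase(self, phase):
        # Assumption: each phase only changes the episode-length limit.
        self.phase = phase
        self.max_steps = 50 * (phase + 1)


def phase_for_reward(episode_reward_mean):
    """Map the mean episode reward to a curriculum phase."""
    if episode_reward_mean > 200:
        return 2
    if episode_reward_mean > 100:
        return 1
    return 0


env = PhasedEnv()
env.set_phase(phase_for_reward(150))
print(env.phase, env.max_steps)
```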

Method 1:

Use the Trainer API and update the environment between calls to train(). This example shows the trainer being run inside a Tune function:

import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

def train(config, reporter):
    trainer = PPOTrainer(config=config, env=YourEnv)
    while True:
        result = trainer.train()
        reporter(**result)
        if result["episode_reward_mean"] > 200:
            phase = 2
        elif result["episode_reward_mean"] > 100:
            phase = 1
        else:
            phase = 0
        trainer.workers.foreach_worker(
            lambda ev: ev.foreach_env(
                lambda env: env.set_phase(phase)))

ray.init()
tune.run(
    train,
    config={
        "num_gpus": 0,
        "num_workers": 2,
    },
    resources_per_trial={
        "cpu": 1,
        "gpu": lambda spec: spec.config.num_gpus,
        "extra_cpu": lambda spec: spec.config.num_workers,
    },
)


Method 2:

Use the callbacks API to update the environment whenever a new training result arrives:

import ray
from ray import tune

def on_train_result(info):
    result = info["result"]
    if result["episode_reward_mean"] > 200:
        phase = 2
    elif result["episode_reward_mean"] > 100:
        phase = 1
    else:
        phase = 0
    trainer = info["trainer"]
    trainer.workers.foreach_worker(
        lambda ev: ev.foreach_env(
            lambda env: env.set_phase(phase)))

ray.init()
tune.run(
    "PPO",
    config={
        "env": YourEnv,
        "callbacks": {
            "on_train_result": on_train_result,
        },
    },
)