Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments環境代碼詳解

本文轉載自查看原文 2021-06-08 23:15 1115 技術儲備

Github：https://github.com/openai/multiagent-particle-envs

論文Blog：Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments - 窮酸秀才大艹包 - 博客園 (cnblogs.com)

創造新環境

您可以通過實現上面的前4個函數來創建新的場景 (`make_world()`, `reset_world()`, `reward()`, and `observation()`).

環境列表

| 代碼中的環境名稱 (文中的名稱) | 是否溝通? | 是否競爭? | 筆記 |

| simple.py | N | N | 單個智能體看到地標位置，根據它離地標的距離給予獎勵。不是用於調試策略的多智能體環境。|

獎勵：智能體與目標地標的距離的平方和的相反數。

| simple_adversary.py (物理欺騙) | N | Y | 一個對手 (紅色)，N個協作智能體 (綠色)，N個地標 (通常N=2)。所有智能體觀察地標和其他智能體的位置。一個地標是"目標地標" (綠色)。協作智能體會根據它們中的一個離目標地標有多近而得到獎勵，但如果對手離目標地標很近，則會得到負獎勵。敵方會根據離目標有多近而得到獎勵，但它不知道哪個地標是目標地標。因此，協作智能體必須學會"分頭行動"，覆蓋所有的地標來欺騙對手。|

協作智能體的獎勵：對手與目標地標的距離的L2范式減去協作智能體與目標地標的距離的L2范式。
對手的獎勵：對手與目標地標的距離的L2范式的相反數。
在基於距離的獎勵之外，還有基於鄰近的獎勵，會根據閾值函數調整(增/減)原本的獎勵。

| simple_crypto.py (隱蔽通信) | Y | Y | 兩個協作智能體(Alice和Bob)，一個對手(Eve)。Allien必須通過公共渠道給Bob發私人信息。Alice和Bob會根據Bob重建信息的程度得到獎勵，但是如果Eve能重建信息，則會得到負獎勵。Alice和Bob有一個私鑰(在每個回合開始時隨機生成)，他們必須學會用它來加密消息。|

| simple_push.py (遠離) | N |Y | 一個智能體，一個對手，一個地標。智能體根據到地標的距離獲得獎勵。如果對手離地標很近，如果智能體離地標很遠，它就會得到獎勵。所以對手學會了把智能體推離地標。|

智能體的獎勵：智能體與目標地標的距離的L2范式的相反數。
對手的獎勵：智能體與目標地標的距離的L2范式減去對手與目標地標的距離的L2范式。

| simple_reference.py | Y | N | 兩個智能體，三個不同顏色的地標。每個智能體都希望到達它們的目標地標，而這個地標只有其他智能體知道。獎勵是集體的。因此，智能體必須學會傳達另一個智能體的目標，並導航到它們的地標。這與簡單的說話者-聆聽者場景相同，其中兩個智能體同時是說話者和聆聽者。|

獎勵：智能體與目標地標的距離的平方和的相反數。

| simple_speaker_listener.py (協作溝通) | Y | N | 與simple_reference相同，只是一個智能體是不移動的"說話者" (灰色)(觀察另一個智能體的目標)，另一個智能體是聆聽者(不能說話，但必須導航到正確的地標)。|

獎勵：智能體與目標地標的距離的平方和的相反數。

| simple_spread.py (協作導航) | N | N | N個智能體，N個地標。根據每個智能體距每個地標的距離來獎勵智能體。如果智能體與其他智能體發生碰撞，它們將受到懲罰。所以，智能體們必須學會在避免碰撞的同時覆蓋所有的地標。|

獎勵：智能體與目標地標的距離的L2范式的相反數。

| simple_tag.py (捕食者-被捕食者) | N | Y | 捕食環境。協作智能體(綠色)速度更快，希望避免被對手(紅色)擊中。對手速度較慢，想要擊中協作智能體。障礙物(大黑圈)擋住了路。|

協作智能體的獎勵：協作智能體與任一對手的距離小於一定閾值，則減去相應正值；同時，應當減去邊界懲罰。
對手的獎勵：對手與協作智能體的距離小於一定閾值，則加上相應正值。

PS：代碼實現與論文以及圖中描述的不一致，其中協作智能體與對手描述相反。

import numpy as np from multiagent.core import World, Agent, Landmark from multiagent.scenario import BaseScenario class Scenario(BaseScenario): def make_world(self): world = World() # set any world properties first
        world.dim_c = 2 # 溝通通道維度 num_good_agents = 1 # 協作智能體數目 num_adversaries = 3 # 對手數目 num_agents = num_adversaries + num_good_agents # 智能體數目 num_landmarks = 2 # 地標數目
        # add agents
        world.agents = [Agent() for i in range(num_agents)] for i, agent in enumerate(world.agents): agent.name = 'agent %d' % i agent.collide = True # 實體與其他實體碰撞 agent.silent = True # 無法發送通信信號 agent.adversary = True if i < num_adversaries else False # 對手/協作智能體 agent.size = 0.075 if agent.adversary else 0.05 # 大小 agent.accel = 3.0 if agent.adversary else 4.0 # 加速度 agent.max_speed = 1.0 if agent.adversary else 1.3 # 最大速度
        # add landmarks
        world.landmarks = [Landmark() for i in range(num_landmarks)] for i, landmark in enumerate(world.landmarks): landmark.name = 'landmark %d' % i landmark.collide = True # 實體與其他實體碰撞 landmark.movable = False # 實體不可以移動/被推 landmark.size = 0.2 # 大小 landmark.boundary = False # make initial conditions
 self.reset_world(world) return world def reset_world(self, world): # random properties for agents
        for i, agent in enumerate(world.agents): agent.color = np.array([0.35, 0.85, 0.35]) if not agent.adversary else np.array([0.85, 0.35, 0.35]) # random properties for landmarks
        for i, landmark in enumerate(world.landmarks): landmark.color = np.array([0.25, 0.25, 0.25]) # set random initial states
        for agent in world.agents: agent.state.p_pos = np.random.uniform(-1, +1, world.dim_p) # 位置初始化 agent.state.p_vel = np.zeros(world.dim_p) # 速度初始化 agent.state.c = np.zeros(world.dim_c) # 通信信號初始化 for i, landmark in enumerate(world.landmarks): if not landmark.boundary: landmark.state.p_pos = np.random.uniform(-0.9, +0.9, world.dim_p) # 位置初始化 landmark.state.p_vel = np.zeros(world.dim_p) # 速度初始化 def benchmark_data(self, agent, world): # returns data for benchmarking purposes
        if agent.adversary: collisions = 0 for a in self.good_agents(world): if self.is_collision(a, agent): collisions += 1
            return collisions else: return 0 
 # 判斷是否發生碰撞 def is_collision(self, agent1, agent2): delta_pos = agent1.state.p_pos - agent2.state.p_pos dist = np.sqrt(np.sum(np.square(delta_pos))) dist_min = agent1.size + agent2.size return True if dist < dist_min else False # return all agents that are not adversaries
    def good_agents(self, world): return [agent for agent in world.agents if not agent.adversary] # return all adversarial agents
    def adversaries(self, world): return [agent for agent in world.agents if agent.adversary] def reward(self, agent, world): # Agents are rewarded based on minimum agent distance to each landmark
        main_reward = self.adversary_reward(agent, world) if agent.adversary else self.agent_reward(agent, world) return main_reward def agent_reward(self, agent, world): # Agents are negatively rewarded if caught by adversaries
        rew = 0 shape = False adversaries = self.adversaries(world) if shape:  # reward can optionally be shaped (increased reward for increased distance from adversary)
            for adv in adversaries: rew += 0.1 * np.sqrt(np.sum(np.square(agent.state.p_pos - adv.state.p_pos))) if agent.collide: for a in adversaries: if self.is_collision(a, agent): rew -= 10

        # agents are penalized for exiting the screen, so that they can be caught by the adversaries
        # 智能體離開屏幕會受到懲罰，這樣它就可以被對手抓住
        def bound(x): if x < 0.9: return 0 if x < 1.0: return (x - 0.9) * 10
            return min(np.exp(2 * x - 2), 10)
 for p in range(world.dim_p): x = abs(agent.state.p_pos[p]) rew -= bound(x) return rew def adversary_reward(self, agent, world): # Adversaries are rewarded for collisions with agents
        rew = 0 shape = False agents = self.good_agents(world) adversaries = self.adversaries(world) if shape:  # reward can optionally be shaped (decreased reward for increased distance from agents)
            for adv in adversaries: rew -= 0.1 * min([np.sqrt(np.sum(np.square(a.state.p_pos - adv.state.p_pos))) for a in agents]) if agent.collide: for ag in agents: for adv in adversaries: if self.is_collision(ag, adv): rew += 10
        return rew def observation(self, agent, world): # get positions of all entities in this agent's reference frame
        entity_pos = [] for entity in world.landmarks: if not entity.boundary: entity_pos.append(entity.state.p_pos - agent.state.p_pos) # communication of all other agents
        comm = [] other_pos = [] other_vel = [] for other in world.agents: if other is agent: continue comm.append(other.state.c) other_pos.append(other.state.p_pos - agent.state.p_pos) if not other.adversary: other_vel.append(other.state.p_vel)
 # 智能體的速度，智能體的位置，智能體與每個地標的距離，智能體與其他智能體的位置，其他智能體的速度 return np.concatenate([agent.state.p_vel] + [agent.state.p_pos] + entity_pos + other_pos + other_vel)

| simple_world_comm.py | Y | Y | 從論文附帶的視頻中看到的環境。和簡單的標簽一樣，除了(1) 有食物(藍色小球)，協作智能體在附近會得到獎勵，(2) 我們現在有了"森林"，把智能體藏在里面，不讓外界看到；(3) 有一個"頭目對手"，可以隨時看到智能體，並可以與其他對手溝通，幫助協調追捕。|

相關代碼

make_env.py：

"""
Code for creating a multiagent environment with one of the scenarios listed
in ./scenarios/.
Can be called by using, for example:
    env = make_env('simple_speaker_listener')
After producing the env object, can be used similarly to an OpenAI gym
environment.

A policy using this environment must output actions in the form of a list
for all agents. Each element of the list should be a numpy array,
of size (env.world.dim_p + env.world.dim_c, 1). Physical actions precede
communication actions in this array. See environment.py for more details.
"""

def make_env(scenario_name, benchmark=False):
    '''
    Creates a MultiAgentEnv object as env. This can be used similar to a gym
    environment by calling env.reset() and env.step().
    Use env.render() to view the environment on the screen.

    Input:
        scenario_name   :   name of the scenario from ./scenarios/ to be Returns
                            (without the .py extension)
        benchmark       :   whether you want to produce benchmarking data 是否要生成基准數據
                            (usually only done during evaluation)（通常僅在評估期間進行）

    Some useful env properties (see environment.py):
        .observation_space  :   Returns the observation space for each agent
        .action_space       :   Returns the action space for each agent
        .n                  :   Returns the number of Agents
    '''
    from multiagent.environment import MultiAgentEnv
    import multiagent.scenarios as scenarios

    # load scenario from script
    scenario = scenarios.load(scenario_name + ".py").Scenario()
    # create world
    world = scenario.make_world()
    # create multiagent environment
    if benchmark:        
        env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation, scenario.benchmark_data)
    else:
        env = MultiAgentEnv(world, scenario.reset_world, scenario.reward, scenario.observation)
    return env

multiagent/scenarios/__init__.py：

import imp
import os.path as osp


def load(name):
    pathname = osp.join(osp.dirname(__file__), name)
    return imp.load_source('', pathname)

multiagent/environment.py：

import gym
from gym import spaces
from gym.envs.registration import EnvSpec
import numpy as np
from multiagent.multi_discrete import MultiDiscrete

# environment for all agents in the multiagent world
# currently code assumes that no agents will be created/destroyed at runtime!
class MultiAgentEnv(gym.Env):
    metadata = {
        'render.modes' : ['human', 'rgb_array']
    }

    def __init__(self, world, reset_callback=None, reward_callback=None,
                 observation_callback=None, info_callback=None,
                 done_callback=None, shared_viewer=True):

        self.world = world
        self.agents = self.world.policy_agents
        # set required vectorized gym env property
        self.n = len(world.policy_agents)
        # scenario callbacks
        self.reset_callback = reset_callback
        self.reward_callback = reward_callback
        self.observation_callback = observation_callback
        self.info_callback = info_callback
        self.done_callback = done_callback
        # environment parameters
        self.discrete_action_space = True
        # if true, action is a number 0...N, otherwise action is a one-hot N-dimensional vector
        self.discrete_action_input = False
        # if true, even the action is continuous, action will be performed discretely
        self.force_discrete_action = world.discrete_action if hasattr(world, 'discrete_action') else False
        # if true, every agent has the same reward
        self.shared_reward = world.collaborative if hasattr(world, 'collaborative') else False
        self.time = 0

        # configure spaces
        self.action_space = []
        self.observation_space = []
        for agent in self.agents:
            total_action_space = []
            # physical action space
            if self.discrete_action_space:
                u_action_space = spaces.Discrete(world.dim_p * 2 + 1)
            else:
                u_action_space = spaces.Box(low=-agent.u_range, high=+agent.u_range, shape=(world.dim_p,), dtype=np.float32)
            if agent.movable:
                total_action_space.append(u_action_space)
            # communication action space
            if self.discrete_action_space:
                c_action_space = spaces.Discrete(world.dim_c)
            else:
                c_action_space = spaces.Box(low=0.0, high=1.0, shape=(world.dim_c,), dtype=np.float32)
            if not agent.silent:
                total_action_space.append(c_action_space)
            # total action space
            if len(total_action_space) > 1:
                # all action spaces are discrete, so simplify to MultiDiscrete action space
                if all([isinstance(act_space, spaces.Discrete) for act_space in total_action_space]):
                    act_space = MultiDiscrete([[0, act_space.n - 1] for act_space in total_action_space])
                else:
                    act_space = spaces.Tuple(total_action_space)
                self.action_space.append(act_space)
            else:
                self.action_space.append(total_action_space[0])
            # observation space
            obs_dim = len(observation_callback(agent, self.world))
            self.observation_space.append(spaces.Box(low=-np.inf, high=+np.inf, shape=(obs_dim,), dtype=np.float32))
            agent.action.c = np.zeros(self.world.dim_c)

        # rendering
        self.shared_viewer = shared_viewer
        if self.shared_viewer:
            self.viewers = [None]
        else:
            self.viewers = [None] * self.n
        self._reset_render()

    def step(self, action_n):
        obs_n = []
        reward_n = []
        done_n = []
        info_n = {'n': []}
        self.agents = self.world.policy_agents
        # set action for each agent
        for i, agent in enumerate(self.agents):
            self._set_action(action_n[i], agent, self.action_space[i])
        # advance world state
        self.world.step()
        # record observation for each agent
        for agent in self.agents:
            obs_n.append(self._get_obs(agent))
            reward_n.append(self._get_reward(agent))
            done_n.append(self._get_done(agent))

            info_n['n'].append(self._get_info(agent))

        # all agents get total reward in cooperative case
        reward = np.sum(reward_n)
        if self.shared_reward:
            reward_n = [reward] * self.n

        return obs_n, reward_n, done_n, info_n

    def reset(self):
        # reset world
        self.reset_callback(self.world)
        # reset renderer
        self._reset_render()
        # record observations for each agent
        obs_n = []
        self.agents = self.world.policy_agents
        for agent in self.agents:
            obs_n.append(self._get_obs(agent))
        return obs_n

    # get info used for benchmarking
    def _get_info(self, agent):
        if self.info_callback is None:
            return {}
        return self.info_callback(agent, self.world)

    # get observation for a particular agent
    def _get_obs(self, agent):
        if self.observation_callback is None:
            return np.zeros(0)
        return self.observation_callback(agent, self.world)

    # get dones for a particular agent
    # unused right now -- agents are allowed to go beyond the viewing screen
    def _get_done(self, agent):
        if self.done_callback is None:
            return False
        return self.done_callback(agent, self.world)

    # get reward for a particular agent
    def _get_reward(self, agent):
        if self.reward_callback is None:
            return 0.0
        return self.reward_callback(agent, self.world)

    # set env action for a particular agent
    def _set_action(self, action, agent, action_space, time=None):
        agent.action.u = np.zeros(self.world.dim_p)
        agent.action.c = np.zeros(self.world.dim_c)
        # process action
        if isinstance(action_space, MultiDiscrete):
            act = []
            size = action_space.high - action_space.low + 1
            index = 0
            for s in size:
                act.append(action[index:(index+s)])
                index += s
            action = act
        else:
            action = [action]

        if agent.movable:
            # physical action
            if self.discrete_action_input:
                agent.action.u = np.zeros(self.world.dim_p)
                # process discrete action
                if action[0] == 1: agent.action.u[0] = -1.0
                if action[0] == 2: agent.action.u[0] = +1.0
                if action[0] == 3: agent.action.u[1] = -1.0
                if action[0] == 4: agent.action.u[1] = +1.0
            else:
                if self.force_discrete_action:
                    d = np.argmax(action[0])
                    action[0][:] = 0.0
                    action[0][d] = 1.0
                if self.discrete_action_space:
                    agent.action.u[0] += action[0][1] - action[0][2]
                    agent.action.u[1] += action[0][3] - action[0][4]
                else:
                    agent.action.u = action[0]
            sensitivity = 5.0
            if agent.accel is not None:
                sensitivity = agent.accel
            agent.action.u *= sensitivity
            action = action[1:]
        if not agent.silent:
            # communication action
            if self.discrete_action_input:
                agent.action.c = np.zeros(self.world.dim_c)
                agent.action.c[action[0]] = 1.0
            else:
                agent.action.c = action[0]
            action = action[1:]
        # make sure we used all elements of action
        assert len(action) == 0

    # reset rendering assets
    def _reset_render(self):
        self.render_geoms = None
        self.render_geoms_xform = None

    # render environment
    def render(self, mode='human'):
        if mode == 'human':
            alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
            message = ''
            for agent in self.world.agents:
                comm = []
                for other in self.world.agents:
                    if other is agent: continue
                    if np.all(other.state.c == 0):
                        word = '_'
                    else:
                        word = alphabet[np.argmax(other.state.c)]
                    message += (other.name + ' to ' + agent.name + ': ' + word + '   ')
            print(message)

        for i in range(len(self.viewers)):
            # create viewers (if necessary)
            if self.viewers[i] is None:
                # import rendering only if we need it (and don't import for headless machines)
                #from gym.envs.classic_control import rendering
                from multiagent import rendering
                self.viewers[i] = rendering.Viewer(700,700)

        # create rendering geometry
        if self.render_geoms is None:
            # import rendering only if we need it (and don't import for headless machines)
            #from gym.envs.classic_control import rendering
            from multiagent import rendering
            self.render_geoms = []
            self.render_geoms_xform = []
            for entity in self.world.entities:
                geom = rendering.make_circle(entity.size)
                xform = rendering.Transform()
                if 'agent' in entity.name:
                    geom.set_color(*entity.color, alpha=0.5)
                else:
                    geom.set_color(*entity.color)
                geom.add_attr(xform)
                self.render_geoms.append(geom)
                self.render_geoms_xform.append(xform)

            # add geoms to viewer
            for viewer in self.viewers:
                viewer.geoms = []
                for geom in self.render_geoms:
                    viewer.add_geom(geom)

        results = []
        for i in range(len(self.viewers)):
            from multiagent import rendering
            # update bounds to center around agent
            cam_range = 1
            if self.shared_viewer:
                pos = np.zeros(self.world.dim_p)
            else:
                pos = self.agents[i].state.p_pos
            self.viewers[i].set_bounds(pos[0]-cam_range,pos[0]+cam_range,pos[1]-cam_range,pos[1]+cam_range)
            # update geometry positions
            for e, entity in enumerate(self.world.entities):
                self.render_geoms_xform[e].set_translation(*entity.state.p_pos)
            # render to display or array
            results.append(self.viewers[i].render(return_rgb_array = mode=='rgb_array'))

        return results

    # create receptor field locations in local coordinate frame
    def _make_receptor_locations(self, agent):
        receptor_type = 'polar'
        range_min = 0.05 * 2.0
        range_max = 1.00
        dx = []
        # circular receptive field
        if receptor_type == 'polar':
            for angle in np.linspace(-np.pi, +np.pi, 8, endpoint=False):
                for distance in np.linspace(range_min, range_max, 3):
                    dx.append(distance * np.array([np.cos(angle), np.sin(angle)]))
            # add origin
            dx.append(np.array([0.0, 0.0]))
        # grid receptive field
        if receptor_type == 'grid':
            for x in np.linspace(-range_max, +range_max, 5):
                for y in np.linspace(-range_max, +range_max, 5):
                    dx.append(np.array([x,y]))
        return dx


# vectorized wrapper for a batch of multi-agent environments
# assumes all environments have the same observation and action space
class BatchMultiAgentEnv(gym.Env):
    metadata = {
        'runtime.vectorized': True,
        'render.modes' : ['human', 'rgb_array']
    }

    def __init__(self, env_batch):
        self.env_batch = env_batch

    @property
    def n(self):
        return np.sum([env.n for env in self.env_batch])

    @property
    def action_space(self):
        return self.env_batch[0].action_space

    @property
    def observation_space(self):
        return self.env_batch[0].observation_space

    def step(self, action_n, time):
        obs_n = []
        reward_n = []
        done_n = []
        info_n = {'n': []}
        i = 0
        for env in self.env_batch:
            obs, reward, done, _ = env.step(action_n[i:(i+env.n)], time)
            i += env.n
            obs_n += obs
            # reward = [r / len(self.env_batch) for r in reward]
            reward_n += reward
            done_n += done
        return obs_n, reward_n, done_n, info_n

    def reset(self):
        obs_n = []
        for env in self.env_batch:
            obs_n += env.reset()
        return obs_n

    # render environment
    def render(self, mode='human', close=True):
        results_n = []
        for env in self.env_batch:
            results_n += env.render(mode, close)
        return results_n

multiagent/core.py：

import numpy as np

# physical/external base state of all entites
class EntityState(object):
    def __init__(self):
        # physical position
        self.p_pos = None
        # physical velocity
        self.p_vel = None

# state of agents (including communication and internal/mental state)
class AgentState(EntityState):
    def __init__(self):
        super(AgentState, self).__init__()
        # communication utterance
        self.c = None

# action of the agent
class Action(object):
    def __init__(self):
        # physical action
        self.u = None
        # communication action
        self.c = None

# properties and state of physical world entity
class Entity(object):
    def __init__(self):
        # name 
        self.name = ''
        # properties:
        self.size = 0.050
        # entity can move / be pushed
        self.movable = False
        # entity collides with others
        self.collide = True
        # material density (affects mass)
        self.density = 25.0
        # color
        self.color = None
        # max speed and accel
        self.max_speed = None
        self.accel = None
        # state
        self.state = EntityState()
        # mass
        self.initial_mass = 1.0

    @property
    def mass(self):
        return self.initial_mass

# properties of landmark entities
class Landmark(Entity):
     def __init__(self):
        super(Landmark, self).__init__()

# properties of agent entities
class Agent(Entity):
    def __init__(self):
        super(Agent, self).__init__()
        # agents are movable by default
        self.movable = True
        # cannot send communication signals
        self.silent = False
        # cannot observe the world
        self.blind = False
        # physical motor noise amount
        self.u_noise = None
        # communication noise amount
        self.c_noise = None
        # control range
        self.u_range = 1.0
        # state
        self.state = AgentState()
        # action
        self.action = Action()
        # script behavior to execute
        self.action_callback = None

# multi-agent world
class World(object):
    def __init__(self):
        # list of agents and entities (can change at execution-time!)
        self.agents = []
        self.landmarks = []
        # communication channel dimensionality
        self.dim_c = 0
        # position dimensionality
        self.dim_p = 2
        # color dimensionality
        self.dim_color = 3
        # simulation timestep
        self.dt = 0.1
        # physical damping
        self.damping = 0.25
        # contact response parameters
        self.contact_force = 1e+2
        self.contact_margin = 1e-3

    # return all entities in the world
    @property
    def entities(self):
        return self.agents + self.landmarks

    # return all agents controllable by external policies
    @property
    def policy_agents(self):
        return [agent for agent in self.agents if agent.action_callback is None]

    # return all agents controlled by world scripts
    @property
    def scripted_agents(self):
        return [agent for agent in self.agents if agent.action_callback is not None]

    # update state of the world
    def step(self):
        # set actions for scripted agents 
        for agent in self.scripted_agents:
            agent.action = agent.action_callback(agent, self)
        # gather forces applied to entities
        p_force = [None] * len(self.entities)
        # apply agent physical controls
        p_force = self.apply_action_force(p_force)
        # apply environment forces
        p_force = self.apply_environment_force(p_force)
        # integrate physical state
        self.integrate_state(p_force)
        # update agent state
        for agent in self.agents:
            self.update_agent_state(agent)

    # gather agent action forces
    def apply_action_force(self, p_force):
        # set applied forces
        for i,agent in enumerate(self.agents):
            if agent.movable:
                noise = np.random.randn(*agent.action.u.shape) * agent.u_noise if agent.u_noise else 0.0
                p_force[i] = agent.action.u + noise                
        return p_force

    # gather physical forces acting on entities
    def apply_environment_force(self, p_force):
        # simple (but inefficient) collision response
        for a,entity_a in enumerate(self.entities):
            for b,entity_b in enumerate(self.entities):
                if(b <= a): continue
                [f_a, f_b] = self.get_collision_force(entity_a, entity_b)
                if(f_a is not None):
                    if(p_force[a] is None): p_force[a] = 0.0
                    p_force[a] = f_a + p_force[a] 
                if(f_b is not None):
                    if(p_force[b] is None): p_force[b] = 0.0
                    p_force[b] = f_b + p_force[b]        
        return p_force

    # integrate physical state
    def integrate_state(self, p_force):
        for i,entity in enumerate(self.entities):
            if not entity.movable: continue
            entity.state.p_vel = entity.state.p_vel * (1 - self.damping)
            if (p_force[i] is not None):
                entity.state.p_vel += (p_force[i] / entity.mass) * self.dt
            if entity.max_speed is not None:
                speed = np.sqrt(np.square(entity.state.p_vel[0]) + np.square(entity.state.p_vel[1]))
                if speed > entity.max_speed:
                    entity.state.p_vel = entity.state.p_vel / np.sqrt(np.square(entity.state.p_vel[0]) +
                                                                  np.square(entity.state.p_vel[1])) * entity.max_speed
            entity.state.p_pos += entity.state.p_vel * self.dt

    def update_agent_state(self, agent):
        # set communication state (directly for now)
        if agent.silent:
            agent.state.c = np.zeros(self.dim_c)
        else:
            noise = np.random.randn(*agent.action.c.shape) * agent.c_noise if agent.c_noise else 0.0
            agent.state.c = agent.action.c + noise      

    # get collision forces for any contact between two entities
    def get_collision_force(self, entity_a, entity_b):
        if (not entity_a.collide) or (not entity_b.collide):
            return [None, None] # not a collider
        if (entity_a is entity_b):
            return [None, None] # don't collide against itself
        # compute actual distance between entities
        delta_pos = entity_a.state.p_pos - entity_b.state.p_pos
        dist = np.sqrt(np.sum(np.square(delta_pos)))
        # minimum allowable distance
        dist_min = entity_a.size + entity_b.size
        # softmax penetration
        k = self.contact_margin
        penetration = np.logaddexp(0, -(dist - dist_min)/k)*k
        force = self.contact_force * delta_pos / dist * penetration
        force_a = +force if entity_a.movable else None
        force_b = -force if entity_b.movable else None
        return [force_a, force_b]

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 強化學習(十四) Actor-Critic Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor 雙層優化問題：統一GAN，演員-評論員與元學習方法(Bilevel Optimization Problem unifies GAN, Actor-Critic, and Meta-Learning Methods) MyBatis配置文件(七)－－environments運行環境 actix rust actor 框架學習二 ping actor demo 代碼使用 Postman 的 Environments 和 Tests 簡化在不同環境中的切換步驟 User-Agent詳解論文筆記：MSDNet動態部分代碼詳解《Multi-scale Dense Networks for Resource Efficient Image Classification》（10）ASP.NET Core2.2 中的環境（Environments：dev, stage, prod） MyBatis配置項--配置環境（environments）--事務管理器（transactionManager）