
1. Problem Overview
Problem: MountainCarContinuous-v0
Source code: https://github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py
Details: An underpowered car must climb a one-dimensional hill to reach a target. Unlike MountainCar-v0, the action (the engine force applied) in MountainCarContinuous-v0 is allowed to take continuous values.
The target is on top of the hill on the car's right-hand side. If the car reaches it or goes beyond, the episode terminates.
On the left-hand side there is another hill. Climbing that hill can be used to gain potential energy and accelerate towards the target. On top of this second hill the car cannot go beyond a position equal to -1, as if there were a wall. Hitting this limit incurs no penalty (it might in a more challenging version) [1].
Type: continuous control
2. Environment
2.1 Observation & state
Num | Observation | Min | Max
---|---|---|---
0 | Position | -1.2 | 0.6
1 | Velocity | -0.07 | 0.07
Note that the velocity is limited to facilitate exploration, but this constraint might be relaxed in a more challenging version.
Note: the observation is a function of the state; the two are sometimes identical and sometimes not. In this environment they are the same, whereas in Pendulum-v0 the observation [cos θ, sin θ, θ̇] is derived from the state (θ, θ̇).
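As a quick illustration of this distinction, the snippet below (a minimal sketch, assuming an older gym release in which Pendulum-v0 is still registered) compares the two observation spaces:
import gym

mc = gym.make('MountainCarContinuous-v0')
print(mc.observation_space)    # Box(2,): [position, velocity] -- identical to the state

pend = gym.make('Pendulum-v0')
print(pend.observation_space)  # Box(3,): [cos(theta), sin(theta), theta_dot] -- a function of the state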
2.2 Actions
Num | Action
---|---
0 | Push the car to the left (negative value) or to the right (positive value)
2.3 Reward
The reward is 100 for reaching the target on the hill on the right-hand side, minus the squared sum of actions from start to goal (in the implementation below the per-step penalty is 0.1 * action**2). This reward function poses an exploration challenge: if the agent does not reach the target soon enough, it will figure out that it is better not to move at all, and will never find the target again.
Note that this reward is unusual with respect to most published work, where the goal is to reach the target as quickly as possible, hence favouring a bang-bang strategy.
For other forms of the reward function, see the Leaderboard.
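To make the rule concrete, here is a minimal sketch (not part of the environment code) of how an episode's return is assembled under this reward: +100 if the goal is reached, minus 0.1 times the squared action summed over every step taken.
def episode_return(actions, reached_goal):
    # actions: the sequence of scalar engine forces applied during the episode
    ret = -0.1 * sum(a ** 2 for a in actions)
    if reached_goal:
        ret += 100.0
    return ret

# e.g. reaching the goal after 80 full-throttle steps (a = 1.0): 100 - 0.1 * 80 = 92.0
print(episode_return([1.0] * 80, reached_goal=True))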
2.4 Initial State
A position between -0.6 and -0.4, with zero velocity.
2.5 Episode Termination
The position reaches 0.5 (this value may be adjusted; the implementation below uses 0.45). A constraint on the velocity could be added in a more challenging version.
Adding a maximum number of steps may be a good idea (see the sketch below).
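One way to do that, sketched below, is gym's TimeLimit wrapper (an assumption on my part: the registered MountainCarContinuous-v0 already comes with its own step limit, but the wrapper lets you impose a different one).
import gym
from gym.wrappers import TimeLimit

env = gym.make('MountainCarContinuous-v0')
env = TimeLimit(env.unwrapped, max_episode_steps=200)  # force done=True after 200 steps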
2.6 Solved Requirements
Get a reward above 90. This value may be adjusted. A sketch of a typical evaluation loop follows.
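A hedged sketch of how this criterion is usually checked: average the return of a trained policy over a batch of evaluation episodes and compare the mean against 90. Here policy(obs) is a placeholder for whatever controller you have trained, not something provided by gym.
import gym
import numpy as np

def evaluate(policy, episodes=100):
    env = gym.make('MountainCarContinuous-v0')
    returns = []
    for _ in range(episodes):
        obs, done, ep_ret = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ep_ret += reward
        returns.append(ep_ret)
    return np.mean(returns)

# considered solved if evaluate(policy) > 90 (this threshold may be adjusted, as noted above)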
3. Code
3.1 Importing libraries
import math
import gym
from gym import spaces
from gym.utils import seeding
import numpy as np
3.2 Defining the Continuous_MountainCarEnv class
class Continuous_MountainCarEnv(gym.Env):
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 30
    }
3.2.1 Defining the __init__(self) function
    def __init__(self):
        self.min_action = -1.0        # minimum action value
        self.max_action = 1.0         # maximum action value
        self.min_position = -1.2      # lowest position
        self.max_position = 0.6       # highest position
        self.max_speed = 0.07         # maximum speed
        self.goal_position = 0.45     # was 0.5 in gym, 0.45 in Arnaud de Broissia's version
        self.power = 0.0015           # engine power coefficient
        self.low_state = np.array([self.min_position, -self.max_speed])    # [-1.2, -0.07]
        self.high_state = np.array([self.max_position, self.max_speed])    # [0.6, 0.07]
        self.viewer = None
        # declare the bounds of the action space and the observation space
        self.action_space = spaces.Box(low=self.min_action, high=self.max_action, shape=(1,))
        # low = -1.0, high = 1.0
        self.observation_space = spaces.Box(low=self.low_state, high=self.high_state)
        # low = [-1.2, -0.07], high = [0.6, 0.07]
        self.seed()
        self.reset()
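For intuition, here is a minimal sketch (values copied from the constants above) of what these two Box spaces look like and what sampling from them returns:
import numpy as np
from gym import spaces

action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
observation_space = spaces.Box(low=np.array([-1.2, -0.07]),
                               high=np.array([0.6, 0.07]), dtype=np.float32)

print(action_space.sample())       # e.g. [0.37]: a single engine force in [-1, 1]
print(observation_space.sample())  # e.g. [-0.81  0.02]: a (position, velocity) pair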
3.2.2 Defining the random-seed function seed(self, seed=None)
    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
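What seeding buys us is reproducibility. A small check (a sketch, assuming the old gym API with an explicit env.seed() method): the same seed reproduces the same random initial state.
import gym

env = gym.make('MountainCarContinuous-v0')
env.seed(42)
first = env.reset()
env.seed(42)
second = env.reset()
print(first, second)  # identical [position, velocity] pairs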
3.2.3 Defining the step(self, action) function
The step() function plays the role of the physics engine in the simulator. Its input is an action; its outputs are the next state, the immediate reward, a termination flag, and a debug dict. This function describes all of the information involved in the interaction between the agent and the environment, making it the most important function in the environment file. Inside it, the agent's kinematic and dynamic models are typically used to compute the next state and the immediate reward, and to decide whether a terminal state has been reached.
    def step(self, action):
        position = self.state[0]                                            # (1)
        velocity = self.state[1]                                            # (2)
        # equivalently: position, velocity = self.state
        force = min(max(action[0], -1.0), 1.0)                              # (3)
        velocity += force * self.power - 0.0025 * math.cos(3 * position)    # (4)
        if velocity > self.max_speed: velocity = self.max_speed             # (5)
        if velocity < -self.max_speed: velocity = -self.max_speed           # (6)
        position += velocity                                                # (7)
        if position > self.max_position: position = self.max_position      # (8)
        if position < self.min_position: position = self.min_position      # (9)
        if position == self.min_position and velocity < 0: velocity = 0    # (10)
        done = bool(position >= self.goal_position)                         # (11)
        reward = 0                                                          # (12)
        if done:                                                            # (13)
            reward = 100.0                                                  # (14)
        reward -= math.pow(action[0], 2) * 0.1                              # (15)
        self.state = np.array([position, velocity])                         # (16)
        return self.state, reward, done, {}                                 # (17)
1. Read the current position from the state.
2. Read the current velocity from the state.
3. Clip the engine force to the action bounds: the inner max(action[0], -1.0) keeps the action at or above the lower bound -1.0, and the outer min(..., 1.0) keeps it at or below the upper bound 1.0.
4. Update the velocity. Note that the velocity is accumulated: this is the idea of a differential, discretising a continuous process into very small steps to approximate it. The increment is the engine force times self.power minus the gravity term 0.0025 * cos(3 * position).
5. If the current velocity is above the maximum speed, clamp it to the maximum speed.
6. If the current velocity is below the negative maximum speed, clamp it to that lower bound.
7. Update the position by adding the velocity.
8. If the current position is above the highest position, clamp it to the highest position.
9. If the current position is below the lowest position, clamp it to the lowest position.
10. If the car is at the lowest position and the velocity is negative, set the velocity to 0 (the "wall" on the left hill).
11. Compute done as a Python bool: True if the position has reached or passed the goal position, otherwise False.
12. Initialise reward = 0.
13. If the goal has been reached (done is True) ...
14. ... give the agent a reward of 100.
15. Subtract the energy cost of this step, 0.1 * action[0]**2, from the reward.
16. Store the new [position, velocity] as the state reached after executing the action.
17. step() returns the next observation, the reward, the termination flag, and an (empty) debug dict.
Lines (11)-(15) mean the following: at every step the environment checks whether the car has crossed the hill on the right and sets done accordingly. If the car has not crossed the hill (done=False), the reward for that step is -0.1 * action[0]**2, i.e. the amount of energy (fuel) spent on that step, which we naturally want to keep small. If the car has crossed the hill (done=True), that step immediately receives the reward of 100 (minus the same energy cost).
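To see step() in action, here is a minimal rollout sketch (an illustration, not part of the original environment code) driven by a hand-written "push in the direction you are already moving" policy, which pumps energy into the car exactly as the description in section 1 suggests.
import gym
import numpy as np

env = gym.make('MountainCarContinuous-v0')
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    position, velocity = obs
    # full throttle in the direction of the current velocity (a bang-bang controller)
    action = np.array([1.0 if velocity >= 0 else -1.0])
    obs, reward, done, info = env.step(action)
    total_reward += reward
print('episode return:', total_reward)
# the return is 100 minus 0.1 per step taken, so the faster the climb, the higher the score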
3.2.4 Defining the reset() function
In reinforcement learning, the agent has to try again and again, accumulate experience, and learn good actions from that experience. One attempt is called a trajectory or an episode, and each attempt ends when a terminal state is reached. After an attempt ends the agent must start over, so it needs a way to re-initialise itself. That is the purpose of reset(): it is called before the agent starts interacting with the environment, to determine the agent's initial state and perform any other required initialisation. In this example, at the start of each episode the position is initialised to a random value in [-0.6, -0.4] and the velocity is initialised to 0.
    def reset(self):
        self.state = np.array([self.np_random.uniform(low=-0.6, high=-0.4), 0])
        return np.array(self.state)
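A quick check (a sketch) that repeated resets draw the starting position uniformly from [-0.6, -0.4] with zero starting velocity:
import gym

env = gym.make('MountainCarContinuous-v0')
for _ in range(3):
    print(env.reset())  # e.g. [-0.52  0.], [-0.43  0.], [-0.58  0.]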
3.2.5 Defining the _height(self, xs) function
This helper is used by the render() function below to build the graphics engine.
    def _height(self, xs):
        return np.sin(3 * xs) * .45 + .55
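_height() maps a position x to the drawn track height sin(3x) * 0.45 + 0.55, so the curve oscillates between 0.1 and 1.0. A quick check at a few landmark positions (a sketch, values rounded):
import numpy as np

def height(xs):
    return np.sin(3 * xs) * 0.45 + 0.55

print(height(np.array([-1.2, -0.5236, 0.45])))
# left boundary ~0.75, valley bottom ~0.10, goal position ~0.99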
3.2.6 Defining the render(self, mode='human') function
render() is the graphics engine, i.e. the human-facing visualisation that animates the environment. A simulation environment has two indispensable parts: a physics engine and a graphics engine. The physics engine simulates the laws of motion of the objects in the environment; the graphics engine displays those objects on screen.
    def render(self, mode='human'):
        screen_width = 600
        screen_height = 400
        world_width = self.max_position - self.min_position
        scale = screen_width / world_width
        carwidth = 40
        carheight = 20
        if self.viewer is None:
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(screen_width, screen_height)
            xs = np.linspace(self.min_position, self.max_position, 100)
            ys = self._height(xs)
            xys = list(zip((xs - self.min_position) * scale, ys * scale))
            self.track = rendering.make_polyline(xys)
            self.track.set_linewidth(4)
            self.viewer.add_geom(self.track)
            clearance = 10
            l, r, t, b = -carwidth / 2, carwidth / 2, carheight, 0
            car = rendering.FilledPolygon([(l, b), (l, t), (r, t), (r, b)])
            car.add_attr(rendering.Transform(translation=(0, clearance)))
            self.cartrans = rendering.Transform()
            car.add_attr(self.cartrans)
            self.viewer.add_geom(car)
            frontwheel = rendering.make_circle(carheight / 2.5)
            frontwheel.set_color(.5, .5, .5)
            frontwheel.add_attr(rendering.Transform(translation=(carwidth / 4, clearance)))
            frontwheel.add_attr(self.cartrans)
            self.viewer.add_geom(frontwheel)
            backwheel = rendering.make_circle(carheight / 2.5)
            backwheel.add_attr(rendering.Transform(translation=(-carwidth / 4, clearance)))
            backwheel.add_attr(self.cartrans)
            backwheel.set_color(.5, .5, .5)
            self.viewer.add_geom(backwheel)
            flagx = (self.goal_position - self.min_position) * scale
            flagy1 = self._height(self.goal_position) * scale
            flagy2 = flagy1 + 50
            flagpole = rendering.Line((flagx, flagy1), (flagx, flagy2))
            self.viewer.add_geom(flagpole)
            flag = rendering.FilledPolygon([(flagx, flagy2), (flagx, flagy2 - 10),
                                            (flagx + 25, flagy2 - 5)])
            flag.set_color(.8, .8, 0)
            self.viewer.add_geom(flag)
        pos = self.state[0]
        self.cartrans.set_translation((pos - self.min_position) * scale, self._height(pos) * scale)
        self.cartrans.set_rotation(math.cos(3 * pos))
        return self.viewer.render(return_rgb_array=(mode == 'rgb_array'))
A reinforcement learning algorithm can run without the graphics engine, so we will not explain it in more detail here.
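If no display window is wanted, mode='rgb_array' returns the frame as a numpy array instead of opening a viewer (a sketch; it still requires the pyglet-based classic_control rendering backend to be importable):
import gym

env = gym.make('MountainCarContinuous-v0')
env.reset()
frame = env.render(mode='rgb_array')
print(frame.shape)  # (400, 600, 3): screen_height x screen_width x RGB channels
env.close()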
3.2.7 Defining the close(self) function
    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None
4. Running
4.1 Full code: continuous_mountain_car.py
"""
MountainCarContinuous-v1
@author: Olivier Sigaud
A merge between two sources:
* Adaptation of the MountainCar Environment from the "FAReinforcement" library
of Jose Antonio Martin H. (version 1.0), adapted by 'Tom Schaul, tom@idsia.ch'
and then modified by Arnaud de Broissia
* the OpenAI/gym MountainCar environment
itself from
http://incompleteideas.net/sutton/MountainCar/MountainCar1.cp
permalink: https://perma.cc/6Z2N-PFWC
"""
import math
import numpy as np
import gym
from gym import spaces
from gym.utils import seeding
class ContinuousMountainCarEnv(gym.Env):
    """
    Description:
        The agent (a car) is started at the bottom of a valley. For any given
        state the agent may choose to accelerate to the left, right or cease
        any acceleration.
    Observation:
        Type: Box(2)
        Num    Observation               Min            Max
        0      Car Position              -1.2           0.6
        1      Car Velocity              -0.07          0.07
    Actions:
        Type: Box(1)
        Num    Action                    Min            Max
        0      the power coef            -1.0           1.0
        Note: actual driving force is calculated by multiplying the power coef by power (0.0015)
    Reward:
        Reward of 100 is awarded if the agent reached the flag (position = 0.45) on top of the mountain.
        Reward is decreased based on amount of energy consumed each step.
    Starting State:
        The position of the car is assigned a uniform random value in
        [-0.6 , -0.4].
        The starting velocity of the car is always assigned to 0.
    Episode Termination:
        The car position is more than 0.45
        Episode length is greater than 200
    """

    metadata = {"render.modes": ["human", "rgb_array"], "video.frames_per_second": 30}
    def __init__(self, goal_velocity=0):
        self.min_action = -1.0
        self.max_action = 1.0
        self.min_position = -1.2
        self.max_position = 0.6
        self.max_speed = 0.07
        self.goal_position = (
            0.45  # was 0.5 in gym, 0.45 in Arnaud de Broissia's version
        )
        self.goal_velocity = goal_velocity
        self.power = 0.0015

        self.low_state = np.array(
            [self.min_position, -self.max_speed], dtype=np.float32
        )
        self.high_state = np.array(
            [self.max_position, self.max_speed], dtype=np.float32
        )

        self.viewer = None

        self.action_space = spaces.Box(
            low=self.min_action, high=self.max_action, shape=(1,), dtype=np.float32
        )
        self.observation_space = spaces.Box(
            low=self.low_state, high=self.high_state, dtype=np.float32
        )

        self.seed()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
    def step(self, action):
        position = self.state[0]
        velocity = self.state[1]
        force = min(max(action[0], self.min_action), self.max_action)

        velocity += force * self.power - 0.0025 * math.cos(3 * position)
        if velocity > self.max_speed:
            velocity = self.max_speed
        if velocity < -self.max_speed:
            velocity = -self.max_speed
        position += velocity
        if position > self.max_position:
            position = self.max_position
        if position < self.min_position:
            position = self.min_position
        if position == self.min_position and velocity < 0:
            velocity = 0

        # Convert a possible numpy bool to a Python bool.
        done = bool(position >= self.goal_position and velocity >= self.goal_velocity)

        reward = 0
        if done:
            reward = 100.0
        reward -= math.pow(action[0], 2) * 0.1

        self.state = np.array([position, velocity], dtype=np.float32)
        return self.state, reward, done, {}

    def reset(self):
        self.state = np.array([self.np_random.uniform(low=-0.6, high=-0.4), 0])
        return np.array(self.state, dtype=np.float32)

    def _height(self, xs):
        return np.sin(3 * xs) * 0.45 + 0.55
    def render(self, mode="human"):
        screen_width = 600
        screen_height = 400

        world_width = self.max_position - self.min_position
        scale = screen_width / world_width
        carwidth = 40
        carheight = 20

        if self.viewer is None:
            from gym.envs.classic_control import rendering

            self.viewer = rendering.Viewer(screen_width, screen_height)
            xs = np.linspace(self.min_position, self.max_position, 100)
            ys = self._height(xs)
            xys = list(zip((xs - self.min_position) * scale, ys * scale))

            self.track = rendering.make_polyline(xys)
            self.track.set_linewidth(4)
            self.viewer.add_geom(self.track)

            clearance = 10

            l, r, t, b = -carwidth / 2, carwidth / 2, carheight, 0
            car = rendering.FilledPolygon([(l, b), (l, t), (r, t), (r, b)])
            car.add_attr(rendering.Transform(translation=(0, clearance)))
            self.cartrans = rendering.Transform()
            car.add_attr(self.cartrans)
            self.viewer.add_geom(car)
            frontwheel = rendering.make_circle(carheight / 2.5)
            frontwheel.set_color(0.5, 0.5, 0.5)
            frontwheel.add_attr(
                rendering.Transform(translation=(carwidth / 4, clearance))
            )
            frontwheel.add_attr(self.cartrans)
            self.viewer.add_geom(frontwheel)
            backwheel = rendering.make_circle(carheight / 2.5)
            backwheel.add_attr(
                rendering.Transform(translation=(-carwidth / 4, clearance))
            )
            backwheel.add_attr(self.cartrans)
            backwheel.set_color(0.5, 0.5, 0.5)
            self.viewer.add_geom(backwheel)
            flagx = (self.goal_position - self.min_position) * scale
            flagy1 = self._height(self.goal_position) * scale
            flagy2 = flagy1 + 50
            flagpole = rendering.Line((flagx, flagy1), (flagx, flagy2))
            self.viewer.add_geom(flagpole)
            flag = rendering.FilledPolygon(
                [(flagx, flagy2), (flagx, flagy2 - 10), (flagx + 25, flagy2 - 5)]
            )
            flag.set_color(0.8, 0.8, 0)
            self.viewer.add_geom(flag)

        pos = self.state[0]
        self.cartrans.set_translation(
            (pos - self.min_position) * scale, self._height(pos) * scale
        )
        self.cartrans.set_rotation(math.cos(3 * pos))

        return self.viewer.render(return_rgb_array=mode == "rgb_array")

    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None
4.2 Registering the environment
Step 1: Copy our environment file (I named mine continuous_mountain_car.py) into the ./gym/gym/envs/classic_control folder of your gym installation directory. (It goes into this folder because it uses the rendering module; this is not the only possible approach.)
Step 2: Open the __init__.py file in that folder (the folder from step 1) and add the following line at the end:
from gym.envs.classic_control.continuous_mountain_car import ContinuousMountainCarEnv
Step 3: Go to ./gym/gym/envs under your gym installation directory, open the __init__.py file there, and add:
register(
    id='MountainCarContinuous-v1',
    entry_point='gym.envs.classic_control:ContinuousMountainCarEnv'
)
"""
第一個參數id就是你調用gym.make(‘id’)時的id, 這個id你可以隨便選取,我取的,名字是MountainCarContinuous-v1。
第二個參數就是函數路口。
"""
經過以上三步,就完成了注冊。
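A quick sanity check after these three steps (a sketch; the id MountainCarContinuous-v1 and the class name ContinuousMountainCarEnv are simply the ones chosen in this article):
import gym

env = gym.make('MountainCarContinuous-v1')
print(env.observation_space)  # Box(2,)
print(env.action_space)       # Box(1,)
print(env.reset())            # [a position in [-0.6, -0.4], 0.]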
4.3 Writing the driver script: MountainCarContinuous.py
Once this script is created, just run it directly.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Tool: PyCharm
import gym

env = gym.make('MountainCarContinuous-v1')
env = env.unwrapped

total_steps = 0
for i_episode in range(10):
    observation = env.reset()
    ep_r = 0
    while True:
        env.render()
        # random actions, just to exercise the environment; without a step limit an
        # episode may take a long time to terminate
        action = env.action_space.sample()
        observation_, reward, done, info = env.step(action)
        position, velocity = observation_
        # shaped reward: the farther the car gets from the valley bottom (-0.5), the larger the reward
        reward = abs(position - (-0.5))
        ep_r += reward
        if done:
            get = '| Get' if observation_[0] >= env.unwrapped.goal_position else '| ----'
            print('Epi: ', i_episode,
                  get,
                  '| Ep_r: ', round(ep_r, 4))
            break
        observation = observation_
        total_steps += 1