
1. Problem Overview
Problem: MountainCarContinuous-v0
Source code: https://github.com/openai/gym/blob/master/gym/envs/classic_control/continuous_mountain_car.py
Details: An underpowered car must climb a one-dimensional hill to reach a target. Unlike MountainCar-v0, the action (the engine force applied) in MountainCarContinuous-v0 is allowed to take continuous values.
The target is on top of the hill on the car's right-hand side. If the car reaches it or goes beyond, the episode terminates.
On the left-hand side there is another hill. Climbing that hill can be used to gain potential energy and accelerate towards the target. On top of this second hill the car cannot go beyond a position equal to -1, as if there were a wall. Hitting this limit incurs no penalty (it might in a more challenging version) [1].
Type: continuous control
2. Environment
2.1 Observation & state
Num | Observation | Min | Max
---|---|---|---
0 | Position | -1.2 | 0.6
1 | Velocity | -0.07 | 0.07
Note that the velocity is limited to facilitate exploration, but this constraint might be relaxed in a more challenging version.
Note: the observation is a function of the state; the two are sometimes identical and sometimes not. In this environment they are the same, whereas in Pendulum-v0 the observation [cos θ, sin θ, θ̇] is derived from the state (θ, θ̇).
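As a quick illustration of this distinction, the snippet below (a minimal sketch, assuming an older gym release in which Pendulum-v0 is still registered) compares the two observation spaces:
import gym

mc = gym.make('MountainCarContinuous-v0')
print(mc.observation_space)    # Box(2,): [position, velocity] -- identical to the state

pend = gym.make('Pendulum-v0')
print(pend.observation_space)  # Box(3,): [cos(theta), sin(theta), theta_dot] -- a function of the state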
2.2 Actions
Num | Action
---|---
0 | Push the car to the left (negative value) or to the right (positive value)
2.3 Reward
The reward is 100 for reaching the target on the hill on the right-hand side, minus the squared sum of actions from start to goal (in the implementation below the per-step penalty is 0.1 * action**2). This reward function poses an exploration challenge: if the agent does not reach the target soon enough, it will figure out that it is better not to move at all, and will never find the target again.
Note that this reward is unusual with respect to most published work, where the goal is to reach the target as quickly as possible, hence favouring a bang-bang strategy.
For other forms of the reward function, see the Leaderboard.
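To make the rule concrete, here is a minimal sketch (not part of the environment code) of how an episode's return is assembled under this reward: +100 if the goal is reached, minus 0.1 times the squared action summed over every step taken.
def episode_return(actions, reached_goal):
    # actions: the sequence of scalar engine forces applied during the episode
    ret = -0.1 * sum(a ** 2 for a in actions)
    if reached_goal:
        ret += 100.0
    return ret

# e.g. reaching the goal after 80 full-throttle steps (a = 1.0): 100 - 0.1 * 80 = 92.0
print(episode_return([1.0] * 80, reached_goal=True))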
2.4 Initial State
A position between -0.6 and -0.4, with zero velocity.
2.5 Episode Termination
The position reaches 0.5 (this value may be adjusted; the implementation below uses 0.45). A constraint on the velocity could be added in a more challenging version.
Adding a maximum number of steps may be a good idea (see the sketch below).
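One way to do that, sketched below, is gym's TimeLimit wrapper (an assumption on my part: the registered MountainCarContinuous-v0 already comes with its own step limit, but the wrapper lets you impose a different one).
import gym
from gym.wrappers import TimeLimit

env = gym.make('MountainCarContinuous-v0')
env = TimeLimit(env.unwrapped, max_episode_steps=200)  # force done=True after 200 steps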
2.6 Solved Requirements
Get a reward above 90. This value may be adjusted. A sketch of a typical evaluation loop follows.
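A hedged sketch of how this criterion is usually checked: average the return of a trained policy over a batch of evaluation episodes and compare the mean against 90. Here policy(obs) is a placeholder for whatever controller you have trained, not something provided by gym.
import gym
import numpy as np

def evaluate(policy, episodes=100):
    env = gym.make('MountainCarContinuous-v0')
    returns = []
    for _ in range(episodes):
        obs, done, ep_ret = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ep_ret += reward
        returns.append(ep_ret)
    return np.mean(returns)

# considered solved if evaluate(policy) > 90 (this threshold may be adjusted, as noted above)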
3. Code
3.1 Importing libraries
import math
import gym
from gym import spaces
from gym.utils import seeding
import numpy as np
3.2 Defining the Continuous_MountainCarEnv class
class Continuous_MountainCarEnv(gym.Env):
    metadata = {
        'render.modes': ['human', 'rgb_array'],
        'video.frames_per_second': 30
    }
3.2.1 Defining the __init__(self) function
    def __init__(self):
        self.min_action = -1.0        # minimum action value
        self.max_action = 1.0         # maximum action value
        self.min_position = -1.2      # lowest position
        self.max_position = 0.6       # highest position
        self.max_speed = 0.07         # maximum speed
        self.goal_position = 0.45     # was 0.5 in gym, 0.45 in Arnaud de Broissia's version
        self.power = 0.0015           # engine power coefficient
        self.low_state = np.array([self.min_position, -self.max_speed])    # [-1.2, -0.07]
        self.high_state = np.array([self.max_position, self.max_speed])    # [0.6, 0.07]
        self.viewer = None
        # declare the bounds of the action space and the observation space
        self.action_space = spaces.Box(low=self.min_action, high=self.max_action, shape=(1,))
        # low = -1.0, high = 1.0
        self.observation_space = spaces.Box(low=self.low_state, high=self.high_state)
        # low = [-1.2, -0.07], high = [0.6, 0.07]
        self.seed()
        self.reset()
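For intuition, here is a minimal sketch (values copied from the constants above) of what these two Box spaces look like and what sampling from them returns:
import numpy as np
from gym import spaces

action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
observation_space = spaces.Box(low=np.array([-1.2, -0.07]),
                               high=np.array([0.6, 0.07]), dtype=np.float32)

print(action_space.sample())       # e.g. [0.37]: a single engine force in [-1, 1]
print(observation_space.sample())  # e.g. [-0.81  0.02]: a (position, velocity) pair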
3.2.2 Defining the random-seed function seed(self, seed=None)
    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
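What seeding buys us is reproducibility. A small check (a sketch, assuming the old gym API with an explicit env.seed() method): the same seed reproduces the same random initial state.
import gym

env = gym.make('MountainCarContinuous-v0')
env.seed(42)
first = env.reset()
env.seed(42)
second = env.reset()
print(first, second)  # identical [position, velocity] pairs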
3.2.3 Defining the step(self, action) function
The step() function plays the role of the physics engine in the simulator. Its input is an action; its outputs are the next state, the immediate reward, a termination flag, and a debug dict. This function describes all of the information involved in the interaction between the agent and the environment, making it the most important function in the environment file. Inside it, the agent's kinematic and dynamic models are typically used to compute the next state and the immediate reward, and to decide whether a terminal state has been reached.
    def step(self, action):
        position = self.state[0]                                            # (1)
        velocity = self.state[1]                                            # (2)
        # equivalently: position, velocity = self.state
        force = min(max(action[0], -1.0), 1.0)                              # (3)
        velocity += force * self.power - 0.0025 * math.cos(3 * position)    # (4)
        if velocity > self.max_speed: velocity = self.max_speed             # (5)
        if velocity < -self.max_speed: velocity = -self.max_speed           # (6)
        position += velocity                                                # (7)
        if position > self.max_position: position = self.max_position      # (8)
        if position < self.min_position: position = self.min_position      # (9)
        if position == self.min_position and velocity < 0: velocity = 0    # (10)
        done = bool(position >= self.goal_position)                         # (11)
        reward = 0                                                          # (12)
        if done:                                                            # (13)
            reward = 100.0                                                  # (14)
        reward -= math.pow(action[0], 2) * 0.1                              # (15)
        self.state = np.array([position, velocity])                         # (16)
        return self.state, reward, done, {}                                 # (17)
1. Read the current position from the state.
2. Read the current velocity from the state.
3. Clip the engine force to the action bounds: the inner max(action[0], -1.0) keeps the action at or above the lower bound -1.0, and the outer min(..., 1.0) keeps it at or below the upper bound 1.0.
4. Update the velocity. Note that the velocity is accumulated: this is the idea of a differential, discretising a continuous process into very small steps to approximate it. The increment is the engine force times self.power minus the gravity term 0.0025 * cos(3 * position).
5. If the current velocity is above the maximum speed, clamp it to the maximum speed.
6. If the current velocity is below the negative maximum speed, clamp it to that lower bound.
7. Update the position by adding the velocity.
8. If the current position is above the highest position, clamp it to the highest position.
9. If the current position is below the lowest position, clamp it to the lowest position.
10. If the car is at the lowest position and the velocity is negative, set the velocity to 0 (the "wall" on the left hill).
11. Compute done as a Python bool: True if the position has reached or passed the goal position, otherwise False.
12. Initialise reward = 0.
13. If the goal has been reached (done is True) ...
14. ... give the agent a reward of 100.
15. Subtract the energy cost of this step, 0.1 * action[0]**2, from the reward.
16. Store the new [position, velocity] as the state reached after executing the action.
17. step() returns the next observation, the reward, the termination flag, and an (empty) debug dict.
Lines (11)-(15) mean the following: at every step the environment checks whether the car has crossed the hill on the right and sets done accordingly. If the car has not crossed the hill (done=False), the reward for that step is -0.1 * action[0]**2, i.e. the amount of energy (fuel) spent on that step, which we naturally want to keep small. If the car has crossed the hill (done=True), that step immediately receives the reward of 100 (minus the same energy cost).
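To see step() in action, here is a minimal rollout sketch (an illustration, not part of the original environment code) driven by a hand-written "push in the direction you are already moving" policy, which pumps energy into the car exactly as the description in section 1 suggests.
import gym
import numpy as np

env = gym.make('MountainCarContinuous-v0')
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    position, velocity = obs
    # full throttle in the direction of the current velocity (a bang-bang controller)
    action = np.array([1.0 if velocity >= 0 else -1.0])
    obs, reward, done, info = env.step(action)
    total_reward += reward
print('episode return:', total_reward)
# the return is 100 minus 0.1 per step taken, so the faster the climb, the higher the score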
3.2.4 Defining the reset() function
In reinforcement learning, the agent has to try again and again, accumulate experience, and learn good actions from that experience. One attempt is called a trajectory or an episode, and each attempt ends when a terminal state is reached. After an attempt ends the agent must start over, so it needs a way to re-initialise itself. That is the purpose of reset(): it is called before the agent starts interacting with the environment, to determine the agent's initial state and perform any other required initialisation. In this example, at the start of each episode the position is initialised to a random value in [-0.6, -0.4] and the velocity is initialised to 0.
    def reset(self):
        self.state = np.array([self.np_random.uniform(low=-0.6, high=-0.4), 0])
        return np.array(self.state)
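A quick check (a sketch) that repeated resets draw the starting position uniformly from [-0.6, -0.4] with zero starting velocity:
import gym

env = gym.make('MountainCarContinuous-v0')
for _ in range(3):
    print(env.reset())  # e.g. [-0.52  0.], [-0.43  0.], [-0.58  0.]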
3.2.5 Defining the _height(self, xs) function
This helper is used by the render() function below to build the graphics engine.
    def _height(self, xs):
        return np.sin(3 * xs) * .45 + .55
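_height() maps a position x to the drawn track height sin(3x) * 0.45 + 0.55, so the curve oscillates between 0.1 and 1.0. A quick check at a few landmark positions (a sketch, values rounded):
import numpy as np

def height(xs):
    return np.sin(3 * xs) * 0.45 + 0.55

print(height(np.array([-1.2, -0.5236, 0.45])))
# left boundary ~0.75, valley bottom ~0.10, goal position ~0.99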
3.2.6 Defining the render(self, mode='human') function
render() is the graphics engine, i.e. the human-facing visualisation that animates the environment. A simulation environment has two indispensable parts: a physics engine and a graphics engine. The physics engine simulates the laws of motion of the objects in the environment; the graphics engine displays those objects on screen.
    def render(self, mode='human'):
        screen_width = 600
        screen_height = 400
        world_width = self.max_position - self.min_position
        scale = screen_width / world_width
        carwidth = 40
        carheight = 20
        if self.viewer is None:
            from gym.envs.classic_control import rendering
            self.viewer = rendering.Viewer(screen_width, screen_height)
            xs = np.linspace(self.min_position, self.max_position, 100)
            ys = self._height(xs)
            xys = list(zip((xs - self.min_position) * scale, ys * scale))
            self.track = rendering.make_polyline(xys)
            self.track.set_linewidth(4)
            self.viewer.add_geom(self.track)
            clearance = 10
            l, r, t, b = -carwidth / 2, carwidth / 2, carheight, 0
            car = rendering.FilledPolygon([(l, b), (l, t), (r, t), (r, b)])
            car.add_attr(rendering.Transform(translation=(0, clearance)))
            self.cartrans = rendering.Transform()
            car.add_attr(self.cartrans)
            self.viewer.add_geom(car)
            frontwheel = rendering.make_circle(carheight / 2.5)
            frontwheel.set_color(.5, .5, .5)
            frontwheel.add_attr(rendering.Transform(translation=(carwidth / 4, clearance)))
            frontwheel.add_attr(self.cartrans)
            self.viewer.add_geom(frontwheel)
            backwheel = rendering.make_circle(carheight / 2.5)
            backwheel.add_attr(rendering.Transform(translation=(-carwidth / 4, clearance)))
            backwheel.add_attr(self.cartrans)
            backwheel.set_color(.5, .5, .5)
            self.viewer.add_geom(backwheel)
            flagx = (self.goal_position - self.min_position) * scale
            flagy1 = self._height(self.goal_position) * scale
            flagy2 = flagy1 + 50
            flagpole = rendering.Line((flagx, flagy1), (flagx, flagy2))
            self.viewer.add_geom(flagpole)
            flag = rendering.FilledPolygon([(flagx, flagy2), (flagx, flagy2 - 10),
                                            (flagx + 25, flagy2 - 5)])
            flag.set_color(.8, .8, 0)
            self.viewer.add_geom(flag)
        pos = self.state[0]
        self.cartrans.set_translation((pos - self.min_position) * scale, self._height(pos) * scale)
        self.cartrans.set_rotation(math.cos(3 * pos))
        return self.viewer.render(return_rgb_array=(mode == 'rgb_array'))
A reinforcement learning algorithm can run without the graphics engine, so we will not explain it in more detail here.
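If no display window is wanted, mode='rgb_array' returns the frame as a numpy array instead of opening a viewer (a sketch; it still requires the pyglet-based classic_control rendering backend to be importable):
import gym

env = gym.make('MountainCarContinuous-v0')
env.reset()
frame = env.render(mode='rgb_array')
print(frame.shape)  # (400, 600, 3): screen_height x screen_width x RGB channels
env.close()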
3.2.7 Defining the close(self) function
    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None
4. Running
4.1 Full code: continuous_mountain_car.py
"""
MountainCarContinuous-v1
@author: Olivier Sigaud
A merge between two sources:
* Adaptation of the MountainCar Environment from the "FAReinforcement" library
of Jose Antonio Martin H. (version 1.0), adapted by 'Tom Schaul, tom@idsia.ch'
and then modified by Arnaud de Broissia
* the OpenAI/gym MountainCar environment
itself from
http://incompleteideas.net/sutton/MountainCar/MountainCar1.cp
permalink: https://perma.cc/6Z2N-PFWC
"""
import math
import numpy as np
import gym
from gym import spaces
from gym.utils import seeding
class ContinuousMountainCarEnv(gym.Env):
    """
    Description:
        The agent (a car) is started at the bottom of a valley. For any given
        state the agent may choose to accelerate to the left, right or cease
        any acceleration.
    Observation:
        Type: Box(2)
        Num    Observation               Min            Max
        0      Car Position              -1.2           0.6
        1      Car Velocity              -0.07          0.07
    Actions:
        Type: Box(1)
        Num    Action                    Min            Max
        0      the power coef            -1.0           1.0
        Note: actual driving force is calculated by multiplying the power coef by power (0.0015)
    Reward:
        Reward of 100 is awarded if the agent reached the flag (position = 0.45) on top of the mountain.
        Reward is decreased based on amount of energy consumed each step.
    Starting State:
        The position of the car is assigned a uniform random value in
        [-0.6 , -0.4].
        The starting velocity of the car is always assigned to 0.
    Episode Termination:
        The car position is more than 0.45
        Episode length is greater than 200
    """

    metadata = {"render.modes": ["human", "rgb_array"], "video.frames_per_second": 30}
    def __init__(self, goal_velocity=0):
        self.min_action = -1.0
        self.max_action = 1.0
        self.min_position = -1.2
        self.max_position = 0.6
        self.max_speed = 0.07
        self.goal_position = (
            0.45  # was 0.5 in gym, 0.45 in Arnaud de Broissia's version
        )
        self.goal_velocity = goal_velocity
        self.power = 0.0015

        self.low_state = np.array(
            [self.min_position, -self.max_speed], dtype=np.float32
        )
        self.high_state = np.array(
            [self.max_position, self.max_speed], dtype=np.float32
        )

        self.viewer = None

        self.action_space = spaces.Box(
            low=self.min_action, high=self.max_action, shape=(1,), dtype=np.float32
        )
        self.observation_space = spaces.Box(
            low=self.low_state, high=self.high_state, dtype=np.float32
        )

        self.seed()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]
    def step(self, action):
        position = self.state[0]
        velocity = self.state[1]
        force = min(max(action[0], self.min_action), self.max_action)

        velocity += force * self.power - 0.0025 * math.cos(3 * position)
        if velocity > self.max_speed:
            velocity = self.max_speed
        if velocity < -self.max_speed:
            velocity = -self.max_speed
        position += velocity
        if position > self.max_position:
            position = self.max_position
        if position < self.min_position:
            position = self.min_position
        if position == self.min_position and velocity < 0:
            velocity = 0

        # Convert a possible numpy bool to a Python bool.
        done = bool(position >= self.goal_position and velocity >= self.goal_velocity)

        reward = 0
        if done:
            reward = 100.0
        reward -= math.pow(action[0], 2) * 0.1

        self.state = np.array([position, velocity], dtype=np.float32)
        return self.state, reward, done, {}

    def reset(self):
        self.state = np.array([self.np_random.uniform(low=-0.6, high=-0.4), 0])
        return np.array(self.state, dtype=np.float32)

    def _height(self, xs):
        return np.sin(3 * xs) * 0.45 + 0.55
    def render(self, mode="human"):
        screen_width = 600
        screen_height = 400

        world_width = self.max_position - self.min_position
        scale = screen_width / world_width
        carwidth = 40
        carheight = 20

        if self.viewer is None:
            from gym.envs.classic_control import rendering

            self.viewer = rendering.Viewer(screen_width, screen_height)
            xs = np.linspace(self.min_position, self.max_position, 100)
            ys = self._height(xs)
            xys = list(zip((xs - self.min_position) * scale, ys * scale))

            self.track = rendering.make_polyline(xys)
            self.track.set_linewidth(4)
            self.viewer.add_geom(self.track)

            clearance = 10

            l, r, t, b = -carwidth / 2, carwidth / 2, carheight, 0
            car = rendering.FilledPolygon([(l, b), (l, t), (r, t), (r, b)])
            car.add_attr(rendering.Transform(translation=(0, clearance)))
            self.cartrans = rendering.Transform()
            car.add_attr(self.cartrans)
            self.viewer.add_geom(car)
            frontwheel = rendering.make_circle(carheight / 2.5)
            frontwheel.set_color(0.5, 0.5, 0.5)
            frontwheel.add_attr(
                rendering.Transform(translation=(carwidth / 4, clearance))
            )
            frontwheel.add_attr(self.cartrans)
            self.viewer.add_geom(frontwheel)
            backwheel = rendering.make_circle(carheight / 2.5)
            backwheel.add_attr(
                rendering.Transform(translation=(-carwidth / 4, clearance))
            )
            backwheel.add_attr(self.cartrans)
            backwheel.set_color(0.5, 0.5, 0.5)
            self.viewer.add_geom(backwheel)
            flagx = (self.goal_position - self.min_position) * scale
            flagy1 = self._height(self.goal_position) * scale
            flagy2 = flagy1 + 50
            flagpole = rendering.Line((flagx, flagy1), (flagx, flagy2))
            self.viewer.add_geom(flagpole)
            flag = rendering.FilledPolygon(
                [(flagx, flagy2), (flagx, flagy2 - 10), (flagx + 25, flagy2 - 5)]
            )
            flag.set_color(0.8, 0.8, 0)
            self.viewer.add_geom(flag)

        pos = self.state[0]
        self.cartrans.set_translation(
            (pos - self.min_position) * scale, self._height(pos) * scale
        )
        self.cartrans.set_rotation(math.cos(3 * pos))

        return self.viewer.render(return_rgb_array=mode == "rgb_array")

    def close(self):
        if self.viewer:
            self.viewer.close()
            self.viewer = None
4.2 Registering the environment
Step 1: Copy our environment file (I named mine continuous_mountain_car.py) into the ./gym/gym/envs/classic_control folder of your gym installation directory. (It goes into this folder because it uses the rendering module; this is not the only possible approach.)
Step 2: Open the __init__.py file in that folder (the folder from step 1) and add the following line at the end:
from gym.envs.classic_control.continuous_mountain_car import ContinuousMountainCarEnv
Step 3: Go to ./gym/gym/envs under your gym installation directory, open the __init__.py file there, and add:
register(
    id='MountainCarContinuous-v1',
    entry_point='gym.envs.classic_control:ContinuousMountainCarEnv'
)
"""
第一個參數id就是你調用gym.make(‘id’)時的id, 這個id你可以隨便選取,我取的,名字是MountainCarContinuous-v1。
第二個參數就是函數路口。
"""
經過以上三步,就完成了注冊。
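A quick sanity check after these three steps (a sketch; the id MountainCarContinuous-v1 and the class name ContinuousMountainCarEnv are simply the ones chosen in this article):
import gym

env = gym.make('MountainCarContinuous-v1')
print(env.observation_space)  # Box(2,)
print(env.action_space)       # Box(1,)
print(env.reset())            # [a position in [-0.6, -0.4], 0.]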
4.3 Writing the driver script: MountainCarContinuous.py
Once this script is created, just run it directly.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Tool: PyCharm
import gym

env = gym.make('MountainCarContinuous-v1')
env = env.unwrapped

total_steps = 0
for i_episode in range(10):
    observation = env.reset()
    ep_r = 0
    while True:
        env.render()
        # random actions, just to exercise the environment; without a step limit an
        # episode may take a long time to terminate
        action = env.action_space.sample()
        observation_, reward, done, info = env.step(action)
        position, velocity = observation_
        # shaped reward: the farther the car gets from the valley bottom (-0.5), the larger the reward
        reward = abs(position - (-0.5))
        ep_r += reward
        if done:
            get = '| Get' if observation_[0] >= env.unwrapped.goal_position else '| ----'
            print('Epi: ', i_episode,
                  get,
                  '| Ep_r: ', round(ep_r, 4))
            break
        observation = observation_
        total_steps += 1