A Survey of Open-Source Code for (Meta-)Reinforcement Learning
Local code: https://github.com/lucifer2859/meta-RL
Introduction to meta-reinforcement learning: https://www.cnblogs.com/lucifer1997/p/13603979.html
I. Meta-RL
1. Learning to Reinforcement Learn: CogSci 2017
- https://github.com/awjuliani/Meta-RL
- Environment: TensorFlow, CPU;
- Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit, Contextual Bandit, GridWorld;
- A3C-Meta-Bandit - the set of bandit tasks described in the paper, including Independent, Dependent, and Restless bandits.
- A3C-Meta-Context - Rainbow bandit task using randomized colors to indicate the reward-giving arm in each episode.
- A3C-Meta-Grid - Rainbow Gridworld task; a variation of gridworld in which goal colors are randomized each episode and must be learned "on the fly."
- Model: one-layer LSTM A3C [Figure 1(a), without the encoder layer];
- Experiments: runs successfully without bugs; training converges; results roughly match the paper, but performance falls short of the reported results under the current hyperparameters; the local code contains minor modifications, see https://github.com/lucifer2859/meta-RL/tree/master/Meta-RL;
- https://github.com/achao2013/Learning-To-Reinforcement-Learn
- Environment: MXNet, CPU;
- Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit;
- Model: multi-layer LSTM A3C [without the encoder layer];
- Experiments: not run;
- https://github.com/lucifer2859/meta-RL/tree/master/L2RL-pytorch
- Environment: PyTorch, CPU;
- Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit;
- Model: one-layer LSTM A3C [Figure 1(a), with GAE, without the encoder layer];
- Experiments: runs successfully without bugs; training converges; results roughly match the paper, but performance falls short of the reported results under the current hyperparameters; a sketch of the Figure 1(a) recurrent input used by these implementations follows this entry;
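For reference, the following is a minimal PyTorch sketch of the Figure 1(a) agent core used by the repositories above: the current observation is concatenated with the one-hot previous action, the previous reward, and the timestep, and fed directly into a single-layer LSTM A3C head with no encoder layer. Class and variable names are illustrative assumptions, not identifiers from those repositories.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaRLCore(nn.Module):
    """Sketch of the Figure 1(a) core: a single-layer LSTM A3C head fed with
    [observation, one-hot previous action, previous reward, timestep] and no
    encoder layer. Names are illustrative, not taken from the repos above."""

    def __init__(self, obs_dim, n_actions, hidden_size=48):
        super().__init__()
        self.n_actions = n_actions
        input_size = obs_dim + n_actions + 2   # obs + one-hot a_{t-1} + r_{t-1} + t
        self.lstm = nn.LSTM(input_size, hidden_size)            # single layer
        self.policy_head = nn.Linear(hidden_size, n_actions)    # actor logits
        self.value_head = nn.Linear(hidden_size, 1)              # critic value

    def forward(self, obs, prev_action, prev_reward, timestep, hidden=None):
        # obs: (T, B, obs_dim); prev_action: (T, B) long; prev_reward, timestep: (T, B)
        a_onehot = F.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, a_onehot,
                       prev_reward.unsqueeze(-1), timestep.unsqueeze(-1)], dim=-1)
        h, hidden = self.lstm(x, hidden)
        return self.policy_head(h), self.value_head(h), hidden
```

For the bandit tasks the observation can be empty (obs_dim = 0), so the LSTM effectively conditions only on the previous action, the previous reward, and the timestep.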
2. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (RL2): ICLR 2017
- https://github.com/mwufi/meta-rl-bandits
- Environment: PyTorch, CPU;
- Task: Independent Bandit;
- Model: two-layer LSTM REINFORCE;
- Experiments: runs successfully without bugs; the model does not match the paper, whose recurrent cell is a GRU; training does not converge under the current hyperparameters;
- https://github.com/VashishtMadhavan/rl2
- Environment: TensorFlow, CPU;
- Task: Dependent Bandit;
- Model: one-layer LSTM A3C [without the encoder layer];
- Experiments: fails to run: gym.error.UnregisteredEnv: No registered env with id: MediumBandit-v0; a sketch of how such an environment is normally registered follows this entry;
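The UnregisteredEnv error above simply means that no environment with that id was registered with gym before gym.make was called. Below is a minimal sketch of what such a registration typically looks like; the entry_point module and class names are hypothetical placeholders, not the repository's actual ones.

```python
from gym.envs.registration import register

# Hypothetical registration: 'bandit_envs:MediumBandit' is a placeholder
# entry point; the real module/class would live inside the repository.
register(
    id='MediumBandit-v0',
    entry_point='bandit_envs:MediumBandit',
    max_episode_steps=100,
)

# Once registered, the environment can be created as usual:
# import gym
# env = gym.make('MediumBandit-v0')
```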
3. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML): ICML 2017
- https://github.com/tristandeleu/pytorch-maml-rl
- Environment: PyTorch, GPU;
- Tasks: Multi-armed Bandit, Tabular MDP, Continuous Control with MuJoCo, 2D Navigation Task;
- Model: MAML TRPO;
- Experiments: the initial run fails with terminate called after throwing an instance of 'c10::Error', which is resolved by following https://github.com/tristandeleu/pytorch-maml-rl/issues/40#issuecomment-632598191; a new problem then appears (AttributeError: Can't pickle local object 'make_env.<locals>._make_env'), which is resolved by following https://github.com/tristandeleu/pytorch-maml-rl/issues/51; train.py then runs successfully, but test.py fails; bandit-k5-n10 does not converge under the current hyperparameters; a sketch of the usual fix for the pickling error follows this entry;
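The "Can't pickle local object" error above is the usual symptom of multiprocessing (with the spawn start method) trying to pickle a function defined inside another function. The linked issue describes the repository's actual fix; the sketch below only illustrates the general pattern, with hypothetical names and assuming the classic 4-tuple gym step API.

```python
import multiprocessing as mp
import gym

# Not picklable: a closure defined inside a function.
# def make_env(env_name):
#     def _make_env():              # <-- becomes 'make_env.<locals>._make_env'
#         return gym.make(env_name)
#     return _make_env

# Picklable alternative: module-level functions only.
def make_env(env_name):
    return gym.make(env_name)

def episode_length(env_name):
    """Toy worker: run one random-policy episode and return its length."""
    env = make_env(env_name)
    env.reset()
    steps, done = 0, False
    while not done:
        _, _, done, _ = env.step(env.action_space.sample())
        steps += 1
    return steps

if __name__ == '__main__':
    # 'spawn' requires every callable sent to workers to be picklable.
    with mp.get_context('spawn').Pool(2) as pool:
        print(pool.map(episode_length, ['CartPole-v1'] * 4))
```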
- https://github.com/cbfinn/maml_rl
- Environment: the TensorFlow rllab version, CPU;
- Task: MuJoCo;
- Model: MAML TRPO;
- Experiments: not run;
4. Evolved Policy Gradients (EPG): NeurIPS 2018
- https://github.com/openai/EPG
- Environment: Chainer, CPU;
- Task: MuJoCo;
- Model: EPG PPO;
- Experiments: not run;
5. A Simple Neural Attentive Meta-Learner: ICLR 2018
- https://github.com/chanb/metalearning_RL
- Environment: PyTorch, GPU;
- Tasks: Multi-armed Bandit, Tabular MDP;
- Models: SNAIL, RL2 (GRU) + PPO;
- Experiments: runs successfully without bugs;
6. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (PEARL): ICML 2019
- https://github.com/katerakelly/oyster
- Environment: PyTorch, GPU;
- Task: MuJoCo;
- Model: PEARL (SAC-based);
- Experiments: docker build . -t pearl fails during Docker setup; after abandoning Docker and installing the required packages locally, the code runs successfully; before installing the local packages, run conda config --set restore_free_channel true, otherwise most of the pinned package versions cannot be found and environment creation fails; for related questions see Chains朱朱的主頁 - 博客園 (cnblogs.com);
7. Improving Generalization in Meta Reinforcement Learning using Learned Objectives (MetaGenRL): ICLR 2020
- http://louiskirsch.com/code/metagenrl
- Environment: TensorFlow, GPU;
- Task: MuJoCo;
- Model: MetaGenRL;
- Experiments: running python ray_experiments.py train fails with bugs under both tensorflow-gpu==1.14.0 and tensorflow==1.13.2;
II. RL-Adventure
1. Deep Q-Learning:
- See the earlier blog post
- https://github.com/Kaixhin/Rainbow
- Environment: PyTorch, GPU;
- Task: Atari;
- Model: Rainbow;
- Experiments: runs successfully;
- https://github.com/TianhongDai/hindsight-experience-replay
- Environment: PyTorch, GPU (not recommended; CPU works better);
- Task: MuJoCo;
- Model: HER;
- Experiments: not run;
2. Policy Gradients:
- https://github.com/higgsfield/RL-Adventure-2
- Environment: PyTorch, GPU;
- Task: Gym;
- Models: A2C, GAE, PPO, ACER, DDPG, TD3, SAC, GAIL, HER;
- Experiments: runs successfully; the local code modifies it to fix bugs and issues and to improve performance, see https://github.com/lucifer2859/Policy-Gradients; in the local code, every model except HER converges and reaches good performance; for the HER problem see https://github.com/higgsfield/RL-Adventure-2/issues/14; the SAC implementation appears to deviate from the original paper (see https://github.com/higgsfield/RL-Adventure-2/issues/11); the A2C experiment only converges on CartPole-v0; a sketch of the GAE computation used here follows this entry;
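For reference, the following is a minimal sketch of generalized advantage estimation as it is commonly computed in rollout-based implementations like the one above; function and variable names are illustrative rather than the repository's own.

```python
def compute_gae(rewards, values, masks, next_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    rewards, values, masks: sequences of length T; masks[t] is 0.0 if the
    episode terminated at step t, else 1.0. next_value is the critic's
    estimate for the state following the last step. Returns per-step
    returns (advantage + value), usable as critic targets.
    """
    values = list(values) + [next_value]
    gae, returns = 0.0, []
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        gae = delta + gamma * lam * masks[t] * gae
        returns.insert(0, gae + values[t])
    return returns
```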
- https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail
- Environment: PyTorch/TensorFlow, GPU;
- Tasks: Atari, MuJoCo, PyBullet (including Racecar, Minitaur and Kuka), DeepMind Control Suite;
- Models: A2C, PPO, ACKTR, GAIL;
- Experiments: not run;
- https://github.com/ikostrikov/pytorch-a3c
- Environment: PyTorch, CPU;
- Task: Atari;
- Model: A3C;
- Experiments: the initial run fails with NotImplementedError; resolved by modifying envs.py as described in https://github.com/ikostrikov/pytorch-a3c/issues/66#issuecomment-559785590; runs successfully after the fix;
- https://github.com/haarnoja/sac
- Environment: TensorFlow, GPU;
- Task: continuous control tasks (MuJoCo);
- Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
- Experiments: not run;
- https://github.com/denisyarats/pytorch_sac
- Environment: PyTorch, GPU;
- Task: continuous control tasks (MuJoCo);
- Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
- Experiments: not run;
- http://github.com/rail-berkeley/softlearning/
- Environment: TensorFlow, GPU;
- Task: continuous control tasks (MuJoCo);
- Model: Soft Actor-Critic (SAC, second version, which drops the state-value function V; the difference is sketched after this entry);
- Experiments: not run;
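To make the first-version/second-version distinction above concrete: the second version of SAC removes the separate state-value network and instead bootstraps the critic target from the minimum of two target Q-networks plus the entropy term. The sketch below illustrates that target; the function and argument names are assumptions, not the API of any of the repositories listed here.

```python
import torch

def sac_v2_q_target(reward, done, next_obs, policy, target_q1, target_q2,
                    alpha, gamma=0.99):
    """Critic target for the second version of SAC (no separate V network).

    policy(next_obs) is assumed to return a sampled next action and its
    log-probability; target_q1/target_q2 are the two target critics;
    alpha is the entropy temperature.
    """
    with torch.no_grad():
        next_action, next_log_prob = policy(next_obs)
        min_q = torch.min(target_q1(next_obs, next_action),
                          target_q2(next_obs, next_action))
        # Soft value of the next state: min of the target Qs minus the entropy penalty.
        soft_value = min_q - alpha * next_log_prob
        return reward + gamma * (1.0 - done) * soft_value
```

In the first version, soft_value would instead come from a separately trained state-value network evaluated at next_obs.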
- https://github.com/ku2482/sac-discrete.pytorch
- Environment: PyTorch, GPU;
- Task: Atari;
- Model: SAC-Discrete (a discrete-action variant built on the newer continuous-control SAC);
- Experiments: runs successfully; the local code contains minor modifications, see https://github.com/lucifer2859/sac-discrete-pytorch; training converges, but performance differs from what the paper reports; a sketch of the discrete-action policy loss follows this entry;
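For the SAC-Discrete entry above, the essential change from continuous SAC is that with a finite action set the expectations over actions can be computed exactly from the categorical policy probabilities instead of by sampling. A minimal sketch of the resulting policy loss, with illustrative names rather than the repository's API:

```python
import torch
import torch.nn.functional as F

def sac_discrete_policy_loss(logits, q1, q2, alpha):
    """Policy loss for discrete-action SAC.

    logits: (B, n_actions) from the policy network;
    q1, q2: (B, n_actions) Q-values for every action from the two critics;
    alpha: entropy temperature.
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    min_q = torch.min(q1, q2)
    # E_{a ~ pi}[alpha * log pi(a|s) - min Q(s, a)], computed in closed form.
    return (probs * (alpha * log_probs - min_q)).sum(dim=-1).mean()
```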
3. Both value-based and policy-gradient methods:
- https://github.com/ShangtongZhang/DeepRL
- Environment: PyTorch, GPU;
- Tasks: Atari, MuJoCo;
- Models: (Double/Dueling/Prioritized) DQN, C51, QR-DQN, (Continuous/Discrete) Synchronous Advantage A2C, N-Step DQN, DDPG, PPO, OC, TD3, COF-PAC, GradientDICE, Bi-Res-DDPG, DAC, Geoff-PAC, QUOTA, ACE;
- Experiments: runs successfully;
- https://github.com/astooke/rlpyt
- Environment: PyTorch, GPU;
- Task: Atari;
- Models: modular, optimized implementations of common deep RL algorithms in PyTorch, with unified infrastructure supporting all three major families of model-free algorithms: policy gradient, deep Q-learning, and Q-function policy gradient.
- Policy gradient: A2C, PPO.
- Replay buffers (supporting both DQN and QPG): non-sequence and sequence (for recurrent) replay, n-step returns (sketched after this entry), uniform or prioritized replay, full-observation or frame-based buffers (e.g. for Atari, storing only unique frames to save memory and reconstructing multi-frame observations).
- Deep Q-learning: DQN and variants: Double, Dueling, Categorical (up to Rainbow minus Noisy Nets), Recurrent (R2D2-style).
- Q-function policy gradient: DDPG, TD3, SAC.
- Experiments: runs successfully without bugs;
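As context for the n-step return support mentioned above, the following is a minimal sketch of how an n-step return is typically formed from a short replay sequence, truncated at episode boundaries; it is illustrative only, not rlpyt's actual API.

```python
def n_step_return(rewards, dones, bootstrap_value, gamma=0.99):
    """n-step return: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
    + gamma^n * V(s_{t+n}), cut off at episode boundaries.

    rewards, dones: sequences of length n drawn from the replay buffer;
    bootstrap_value: value estimate for the state n steps ahead.
    """
    g = bootstrap_value
    for r, d in zip(reversed(rewards), reversed(dones)):
        g = r + gamma * (1.0 - d) * g
    return g
```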
- https://github.com/vitchyr/rlkit
- Environment: PyTorch, GPU;
- Task: gym[all];
- Models: Skew-Fit, RIG, TDM, HER, DQN, SAC (newer version), TD3, AWAC;
- Experiments: not run;
- https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch (PyTorch implementations of deep reinforcement learning algorithms and environments)
- Environment: PyTorch;
- Tasks: CartPole, MountainCar, Bit Flipping, Four Rooms, Long Corridor, Ant-[Maze, Push, Fall];
- Models: DQN, DQN with Fixed Q-Targets, DDQN, DDQN with Prioritised Experience Replay, Dueling DDQN, REINFORCE, DDPG, TD3, SAC, SAC-Discrete, A3C, A2C, PPO, DQN-HER, DDPG-HER, h-DQN, Stochastic NN-HRL, DIAYN;
- Experiments: some models run successfully on some tasks (for example, SAC-Discrete does not run successfully on Atari);
- https://github.com/hill-a/stable-baselines
- Environment: TensorFlow;
- https://github.com/openai/baselines
- Environment: TensorFlow;
- https://github.com/openai/spinningup
- Environment: TensorFlow/PyTorch;
- Description: an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL). For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning. This module contains a variety of helpful resources, including:
- a short introduction to RL terminology, kinds of algorithms, and basic theory,
- an essay about how to grow into an RL research role,
- a curated list of important papers organized by topic,
- a well-documented code repo of short, standalone implementations of key algorithms,
- and a few exercises to serve as warm-ups.
- Experiments: TD3 runs successfully on MuJoCo tasks;
- https://github.com/quantumiracle/Popular-RL-Algorithms (PyTorch implementations of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet, etc.)
- Environment: PyTorch / TensorFlow 2.0 + TensorLayer 2.0;
- Description: state-of-the-art model-free reinforcement learning algorithms implemented in PyTorch and TensorFlow 2.0 on OpenAI Gym environments and a self-implemented Reacher environment. Algorithms include SAC, DDPG, TD3, AC/A2C, PPO, QT-Opt (including the cross-entropy method), PointNet, Transporter, Recurrent Policy Gradient, Soft Decision Tree, Probabilistic Mixture-of-Experts, and more. Note that the repo is more a personal collection of algorithms the author implemented and tested during research and study than an official open-source library/package; each algorithm may come in several implementation versions, deliberately kept for reference and comparison, and the code has not been heavily cleaned or restructured. The repo itself contains only PyTorch implementations; for official RL libraries the author points to two TensorFlow 2.0 + TensorLayer 2.0 options:
- RL Tutorial (status: released) contains RL algorithm implementations as tutorials with simple structures.
- RLzoo (status: released) is a baseline implementation with a high-level API supporting a variety of popular environments, with more hierarchical structures for simple usage.
- Since TensorFlow 2.0 supports dynamic graph construction instead of static graphs, porting RL code between TensorFlow and PyTorch has become straightforward.
- Experiments: PPO does not converge on Atari tasks;
III. Meta Learning (Learning to Learn)
1. Platform: