A Survey of Open-Source Code Related to (Meta-)Reinforcement Learning

Local code: https://github.com/lucifer2859/meta-RL

Introduction to meta-reinforcement learning: https://www.cnblogs.com/lucifer1997/p/13603979.html

 

I. Meta-RL

1. Learning to Reinforcement Learn: CogSci 2017

  • https://github.com/awjuliani/Meta-RL
    • Environment: TensorFlow, CPU;
    • Tasks: Dependent (Easy, Medium, Hard, Uniform) / Independent / Restless Bandit, Contextual Bandit, GridWorld
      • A3C-Meta-Bandit - Set of bandit tasks described in the paper, including Independent, Dependent, and Restless bandits.
      • A3C-Meta-Context - Rainbow bandit task using randomized colors to indicate the reward-giving arm in each episode.
      • A3C-Meta-Grid - Rainbow Gridworld task; a variation of gridworld in which goal colors are randomized each episode and must be learned "on the fly."
    • Model: one-layer LSTM A3C [Figure 1(a), without the encoder layer]; a minimal sketch of the recurrent input scheme follows this entry;
    • Experiments: runs successfully with no bugs; training converges; results roughly match the paper; performance does not reach the paper's reported level under the current hyperparameters; the local code modifies the repo slightly, see https://github.com/lucifer2859/meta-RL/tree/master/Meta-RL
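
The core trick in this paper is that the previous action and reward are fed back into a recurrent actor-critic, so fast adaptation happens in the LSTM state rather than in the weights. Below is a minimal PyTorch sketch of that input/recurrence scheme only; the repo above is written in TensorFlow, and the layer sizes and names here are illustrative, not the repo's actual values.

```python
import torch
import torch.nn as nn

class MetaRLBanditAgent(nn.Module):
    """Illustrative one-layer LSTM actor-critic for bandit meta-RL:
    each step's input is (one-hot previous action, previous reward, timestep)."""

    def __init__(self, n_arms: int, hidden_size: int = 48):
        super().__init__()
        self.lstm = nn.LSTM(n_arms + 2, hidden_size, num_layers=1, batch_first=True)
        self.policy_head = nn.Linear(hidden_size, n_arms)  # actor logits
        self.value_head = nn.Linear(hidden_size, 1)         # critic baseline

    def forward(self, prev_action_onehot, prev_reward, timestep, hidden):
        x = torch.cat([prev_action_onehot, prev_reward, timestep], dim=-1)
        out, hidden = self.lstm(x.unsqueeze(1), hidden)      # one recurrent step
        out = out.squeeze(1)
        return self.policy_head(out), self.value_head(out), hidden

# The hidden state is carried across steps within an episode and reset between
# episodes, so adaptation to a new bandit lives in the recurrent activations.
agent = MetaRLBanditAgent(n_arms=2)
hidden = None
logits, value, hidden = agent(torch.zeros(1, 2), torch.zeros(1, 1), torch.zeros(1, 1), hidden)
```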

2. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (RL2): ICLR 2017

  • https://github.com/mwufi/meta-rl-bandits
    • Environment: PyTorch, CPU;
    • Task: Independent Bandit;
    • Model: two-layer LSTM REINFORCE;
    • Experiments: runs successfully with no bugs; the model does not match the paper, whose recurrent cell is a GRU; training does not converge under the current hyperparameters (see the sketch after this entry for the RL^2 trial structure);
  • https://github.com/VashishtMadhavan/rl2
    • Environment: TensorFlow, CPU;
    • Task: Dependent Bandit;
    • Model: one-layer LSTM A3C [without the encoder layer];
    • Experiments: fails to run with gym.error.UnregisteredEnv: No registered env with id: MediumBandit-v0;
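
For reference, the RL^2 setup keeps the recurrent hidden state across the several episodes that make up one trial of the same task and resets it only between trials; slow gradient updates of the weights then yield a fast learner encoded in the recurrent dynamics. The following self-contained PyTorch sketch illustrates only that trial structure; the toy bandit, the GRU sizes, and all names are made up and match neither repo above (no training loop is shown).

```python
import torch
import torch.nn as nn

class TwoArmedBandit:
    """Toy task whose arm probabilities are resampled once per trial."""
    def __init__(self):
        self.p = torch.rand(2)
    def pull(self, arm: int) -> float:
        return float(torch.bernoulli(self.p[arm]))

act_dim, hidden_size = 2, 64
gru = nn.GRU(act_dim + 1, hidden_size, batch_first=True)  # input = (prev action one-hot, prev reward)
policy = nn.Linear(hidden_size, act_dim)

def run_trial(n_episodes: int = 2, horizon: int = 10):
    task = TwoArmedBandit()
    h = None                                 # hidden state is reset only between trials
    prev_a = torch.zeros(1, act_dim)
    prev_r = torch.zeros(1, 1)
    for _ in range(n_episodes):              # episodes of the same task share the hidden state
        for _ in range(horizon):
            x = torch.cat([prev_a, prev_r], dim=-1).unsqueeze(1)
            out, h = gru(x, h)
            a = torch.distributions.Categorical(logits=policy(out.squeeze(1))).sample()
            r = task.pull(a.item())
            prev_a = nn.functional.one_hot(a, act_dim).float()
            prev_r = torch.full((1, 1), r)

run_trial()
```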

3. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML): ICML 2017

4. Evolved Policy Gradients (EPG): NeurIPS 2018

5. A Simple Neural Attentive Meta-Learner: ICLR 2018

6. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (PEARL): ICML 2019

  • https://github.com/katerakelly/oyster
    • Environment: PyTorch, GPU;
    • Task: MuJoCo;
    • Model: PEARL (SAC-based); a minimal sketch of the probabilistic context encoder follows this entry;
    • Experiments: docker build . -t pearl failed during Docker setup; after abandoning Docker and installing the required packages locally, the code runs successfully; when installing the packages locally, run conda config --set restore_free_channel true first, otherwise most of the pinned package versions cannot be found and creating the environment fails; related questions can be directed to Chains朱朱的主页 - 博客园 (cnblogs.com)
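
For context, the heart of PEARL is an inference network that maps a batch of context transitions (s, a, r, s') to a Gaussian posterior over a latent task variable z, on which the SAC actor and critics are conditioned. The sketch below only illustrates that product-of-Gaussians encoder; the dimensions, layer sizes, and names are made up, and it is not the oyster repo's actual code.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Illustrative PEARL-style context encoder: each context transition is mapped to
    a Gaussian factor over the latent task variable z, and the posterior is their product."""

    def __init__(self, transition_dim: int, latent_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),      # per-factor mean and log-variance
        )

    def forward(self, context):                     # context: (N, transition_dim)
        mu, logvar = self.net(context).chunk(2, dim=-1)
        var = logvar.exp().clamp(min=1e-7)
        post_var = 1.0 / (1.0 / var).sum(dim=0)     # product of N Gaussian factors
        post_mu = post_var * (mu / var).sum(dim=0)
        return torch.distributions.Normal(post_mu, post_var.sqrt())

# Usage: sample z with the reparameterization trick and condition the SAC networks on it.
encoder = ContextEncoder(transition_dim=10, latent_dim=5)
posterior = encoder(torch.randn(32, 10))            # 32 context transitions
z = posterior.rsample()                             # shape: (latent_dim,)
```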

7. Improving Generalization in Meta Reinforcement Learning using Learned Objectives (MetaGenRL): ICLR 2020

  • http://louiskirsch.com/code/metagenrl
    • Environment: TensorFlow, GPU;
    • Task: MuJoCo;
    • Model: MetaGenRL;
    • Experiments: running python ray_experiments.py train fails with bugs under both tensorflow-gpu==1.14.0 and tensorflow==1.13.2;

 

II. RL-Adventure

1. Deep Q-Learning:

2. Policy Gradients:

  • https://github.com/haarnoja/sac
    • Environment: TensorFlow, GPU;
    • Tasks: Continuous Control Tasks (MuJoCo);
    • Model: Soft Actor-Critic (SAC, first version, with a state-value function V);
    • Experiments: not run;
  • https://github.com/denisyarats/pytorch_sac
    • Environment: PyTorch, GPU;
    • Tasks: Continuous Control Tasks (MuJoCo);
    • Model: Soft Actor-Critic (SAC, first version, with a state-value function V);
    • Experiments: not run;
  • http://github.com/rail-berkeley/softlearning/
    • Environment: TensorFlow, GPU;
    • Tasks: Continuous Control Tasks (MuJoCo);
    • Model: Soft Actor-Critic (SAC, second version, with the state-value function V removed; see the sketch after this list for the difference in critic targets);
    • Experiments: not run;
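
The "first version vs. second version" distinction noted above refers to how the critic target is computed: the original SAC trains a separate state-value network V and bootstraps the Q target from it, while the later version drops V and bootstraps directly from the twin target Q networks and the entropy term. A minimal sketch of the two targets (tensor names here are illustrative, not taken from any of the repos):

```python
import torch

def sac_v1_q_target(reward, done, next_v, gamma=0.99):
    # First version: the Q target bootstraps from the (target) state-value network V(s').
    return reward + gamma * (1.0 - done) * next_v

def sac_v2_q_target(reward, done, next_q1, next_q2, next_logp, alpha=0.2, gamma=0.99):
    # Second version: no V network; bootstrap from the min of the target Qs at an action
    # freshly sampled from the current policy at s', minus the entropy term.
    soft_value = torch.min(next_q1, next_q2) - alpha * next_logp
    return reward + gamma * (1.0 - done) * soft_value
```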

3. Both of the above:

  • https://github.com/ShangtongZhang/DeepRL
    • Environment: PyTorch, GPU;
    • Tasks: Atari, MuJoCo;
    • Models: (Double/Dueling/Prioritized) DQN, C51, QR-DQN, (Continuous/Discrete) Synchronous Advantage A2C, N-Step DQN, DDPG, PPO, OC, TD3, COF-PAC, GradientDICE, Bi-Res-DDPG, DAC, Geoff-PAC, QUOTA, ACE;
    • Experiments: runs successfully;
  • https://github.com/astooke/rlpyt
    • Environment: PyTorch, GPU;
    • Task: Atari;
    • Model: Modular, optimized implementations of common deep RL algorithms in PyTorch, with unified infrastructure supporting all three major families of model-free algorithms: policy gradient, deep Q-learning, and Q-function policy gradient.
      • Policy Gradient: A2C, PPO.
      • Replay Buffers: (supporting both DQN + QPG) non-sequence and sequence (for recurrent) replay, n-step returns, uniform or prioritized replay, full-observation or frame-based buffer (e.g. for Atari, stores only unique frames to save memory and reconstructs multi-frame observations; a minimal sketch of the frame-based idea appears at the end of this list).
      • Deep Q-Learning: DQN + variants: Double, Dueling, Categorical (up to Rainbow minus Noisy Nets), Recurrent (R2D2-style).
      • Q-Function Policy Gradient: DDPG, TD3, SAC.
    • Experiments: runs successfully with no bugs;
  • https://github.com/vitchyr/rlkit
    • Environment: PyTorch, GPU;
    • Tasks: gym[all];
    • Models: Skew-Fit, RIG, TDM, HER, DQN, SAC (new version), TD3, AWAC;
    • Experiments: not run;
  • https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch
    • Environment: PyTorch;
    • Tasks: CartPole, MountainCar, Bit Flipping, Four Rooms, Long Corridor, Ant-[Maze, Push, Fall];
    • Models: DQN, DQN with Fixed Q-Targets, DDQN, DDQN with Prioritised Experience Replay, Dueling DDQN, REINFORCE, DDPG, TD3, SAC, SAC-Discrete, A3C, A2C, PPO, DQN-HER, DDPG-HER, h-DQN, Stochastic NN-HRL, DIAYN;
    • Experiments: some models run successfully on some tasks (e.g. SAC-Discrete fails to run on Atari);
  • https://github.com/hill-a/stable-baselines
    • Environment: TensorFlow;
  • https://github.com/openai/baselines
    • Environment: TensorFlow;
  • https://github.com/openai/spinningup
    • Environment: TensorFlow / PyTorch;
    • Description: This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL). For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning. This module contains a variety of helpful resources, including:
      • a short introduction to RL terminology, kinds of algorithms, and basic theory,
      • an essay about how to grow into an RL research role,
      • a curated list of important papers organized by topic,
      • a well-documented code repo of short, standalone implementations of key algorithms,
      • and a few exercises to serve as warm-ups.
    • Experiments: TD3 runs successfully on MuJoCo tasks;
  • https://github.com/quantumiracle/Popular-RL-Algorithms
    • Environment: PyTorch / TensorFlow 2.0 + TensorLayer 2.0;
    • Description: state-of-the-art model-free reinforcement learning algorithms implemented in PyTorch and TensorFlow 2.0 on OpenAI Gym environments and a self-implemented Reacher environment. Algorithms include SAC, DDPG, TD3, AC/A2C, PPO, QT-Opt (including the cross-entropy method), PointNet, Transporter, Recurrent Policy Gradient, Soft Decision Tree, Probabilistic Mixture-of-Experts, and more. Note that this repo is more a personal collection of algorithms the author implemented and tested during research and study than an official open-source library/package intended for general use. Still, the author considers it worth sharing and welcomes discussion of the implementations, but has not spent much time cleaning up or structuring the code. As you may notice, there can be several implementation versions of the same algorithm, shown deliberately for reference and comparison. Also, this repo contains only the PyTorch implementations; for official libraries of RL algorithms, the author provides the following two options built on TensorFlow 2.0 + TensorLayer 2.0:
      • RL Tutorial (Status: Released) contains RL algorithms implementation as tutorials with simple structures.

      • RLzoo (Status: Released) is a baseline implementation with high-level API supporting a variety of popular environments, with more hierarchical structures for simple usage.

      Since TensorFlow 2.0 uses dynamic graph construction rather than static graph construction, porting the RL code between TensorFlow and PyTorch is straightforward.

    • Experiments: PPO runs on Atari tasks but performance fails to converge;
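
As a footnote to the rlpyt entry above: its frame-based replay buffer for Atari stores each frame only once and rebuilds the stacked observation at sampling time. The sketch below illustrates that idea only; it is not rlpyt's actual API, and episode-boundary handling is omitted.

```python
import numpy as np

capacity, frame_stack = 1_000, 4
frames = np.zeros((capacity, 84, 84), dtype=np.uint8)     # one 84x84 frame per slot

def store(t: int, frame: np.ndarray) -> None:
    frames[t % capacity] = frame                          # each frame stored exactly once

def sample_observation(t: int) -> np.ndarray:
    # Rebuild the 4-frame observation ending at time t instead of storing it 4x over.
    idx = [(t - k) % capacity for k in reversed(range(frame_stack))]
    return frames[idx]                                    # shape: (frame_stack, 84, 84)
```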

 

III. Meta-Learning (Learning to Learn)

1. Platform:

