A Survey of Open-Source Code for (Meta-)Reinforcement Learning
Local code: https://github.com/lucifer2859/meta-RL
Introduction to meta-reinforcement learning: https://www.cnblogs.com/lucifer1997/p/13603979.html
I. Meta-RL
1. Learning to Reinforcement Learn: CogSci 2017
- https://github.com/awjuliani/Meta-RL
- Environment: TensorFlow, CPU;
- Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit, Contextual Bandit, GridWorld;
- A3C-Meta-Bandit - the set of bandit tasks described in the paper, including Independent, Dependent, and Restless bandits.
- A3C-Meta-Context - Rainbow bandit task using randomized colors to indicate the reward-giving arm in each episode.
- A3C-Meta-Grid - Rainbow Gridworld task; a variation of gridworld in which goal colors are randomized each episode and must be learned "on the fly."
- Model: one-layer LSTM A3C [Figure 1(a), without the encoder layer];
- Experiments: runs successfully without bugs; training converges; results roughly match the paper, but performance falls short of the reported results under the current hyperparameters; the local code contains minor modifications, see https://github.com/lucifer2859/meta-RL/tree/master/Meta-RL;
- https://github.com/achao2013/Learning-To-Reinforcement-Learn
- Environment: MXNet, CPU;
- Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit;
- Model: multi-layer LSTM A3C [without the encoder layer];
- Experiments: not run;
- https://github.com/lucifer2859/meta-RL/tree/master/L2RL-pytorch
- Environment: PyTorch, CPU;
- Tasks: Dependent (Easy, Medium, Hard, Uniform)/Independent/Restless Bandit;
- Model: one-layer LSTM A3C [Figure 1(a), with GAE, without the encoder layer];
- Experiments: runs successfully without bugs; training converges; results roughly match the paper, but performance falls short of the reported results under the current hyperparameters; a sketch of the Figure 1(a) recurrent input used by these implementations follows this entry;
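For reference, the following is a minimal PyTorch sketch of the Figure 1(a) agent core used by the repositories above: the current observation is concatenated with the one-hot previous action, the previous reward, and the timestep, and fed directly into a single-layer LSTM A3C head with no encoder layer. Class and variable names are illustrative assumptions, not identifiers from those repositories.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaRLCore(nn.Module):
    """Sketch of the Figure 1(a) core: a single-layer LSTM A3C head fed with
    [observation, one-hot previous action, previous reward, timestep] and no
    encoder layer. Names are illustrative, not taken from the repos above."""

    def __init__(self, obs_dim, n_actions, hidden_size=48):
        super().__init__()
        self.n_actions = n_actions
        input_size = obs_dim + n_actions + 2   # obs + one-hot a_{t-1} + r_{t-1} + t
        self.lstm = nn.LSTM(input_size, hidden_size)            # single layer
        self.policy_head = nn.Linear(hidden_size, n_actions)    # actor logits
        self.value_head = nn.Linear(hidden_size, 1)              # critic value

    def forward(self, obs, prev_action, prev_reward, timestep, hidden=None):
        # obs: (T, B, obs_dim); prev_action: (T, B) long; prev_reward, timestep: (T, B)
        a_onehot = F.one_hot(prev_action, self.n_actions).float()
        x = torch.cat([obs, a_onehot,
                       prev_reward.unsqueeze(-1), timestep.unsqueeze(-1)], dim=-1)
        h, hidden = self.lstm(x, hidden)
        return self.policy_head(h), self.value_head(h), hidden
```

For the bandit tasks the observation can be empty (obs_dim = 0), so the LSTM effectively conditions only on the previous action, the previous reward, and the timestep.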
2. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (RL2): ICLR 2017
- https://github.com/mwufi/meta-rl-bandits
- Environment: PyTorch, CPU;
- Task: Independent Bandit;
- Model: two-layer LSTM REINFORCE;
- Experiments: runs successfully without bugs; the model does not match the paper, whose recurrent cell is a GRU; training does not converge under the current hyperparameters;
- https://github.com/VashishtMadhavan/rl2
- Environment: TensorFlow, CPU;
- Task: Dependent Bandit;
- Model: one-layer LSTM A3C [without the encoder layer];
- Experiments: fails to run: gym.error.UnregisteredEnv: No registered env with id: MediumBandit-v0; a sketch of how such an environment is normally registered follows this entry;
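The UnregisteredEnv error above simply means that no environment with that id was registered with gym before gym.make was called. Below is a minimal sketch of what such a registration typically looks like; the entry_point module and class names are hypothetical placeholders, not the repository's actual ones.

```python
from gym.envs.registration import register

# Hypothetical registration: 'bandit_envs:MediumBandit' is a placeholder
# entry point; the real module/class would live inside the repository.
register(
    id='MediumBandit-v0',
    entry_point='bandit_envs:MediumBandit',
    max_episode_steps=100,
)

# Once registered, the environment can be created as usual:
# import gym
# env = gym.make('MediumBandit-v0')
```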
3. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML): ICML 2017
- https://github.com/tristandeleu/pytorch-maml-rl
- Environment: PyTorch, GPU;
- Tasks: Multi-armed Bandit, Tabular MDP, Continuous Control with MuJoCo, 2D Navigation Task;
- Model: MAML TRPO;
- Experiments: the initial run fails with terminate called after throwing an instance of 'c10::Error', which is resolved by following https://github.com/tristandeleu/pytorch-maml-rl/issues/40#issuecomment-632598191; a new problem then appears (AttributeError: Can't pickle local object 'make_env.<locals>._make_env'), which is resolved by following https://github.com/tristandeleu/pytorch-maml-rl/issues/51; train.py then runs successfully, but test.py fails; bandit-k5-n10 does not converge under the current hyperparameters; a sketch of the usual fix for the pickling error follows this entry;
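The "Can't pickle local object" error above is the usual symptom of multiprocessing (with the spawn start method) trying to pickle a function defined inside another function. The linked issue describes the repository's actual fix; the sketch below only illustrates the general pattern, with hypothetical names and assuming the classic 4-tuple gym step API.

```python
import multiprocessing as mp
import gym

# Not picklable: a closure defined inside a function.
# def make_env(env_name):
#     def _make_env():              # <-- becomes 'make_env.<locals>._make_env'
#         return gym.make(env_name)
#     return _make_env

# Picklable alternative: module-level functions only.
def make_env(env_name):
    return gym.make(env_name)

def episode_length(env_name):
    """Toy worker: run one random-policy episode and return its length."""
    env = make_env(env_name)
    env.reset()
    steps, done = 0, False
    while not done:
        _, _, done, _ = env.step(env.action_space.sample())
        steps += 1
    return steps

if __name__ == '__main__':
    # 'spawn' requires every callable sent to workers to be picklable.
    with mp.get_context('spawn').Pool(2) as pool:
        print(pool.map(episode_length, ['CartPole-v1'] * 4))
```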
- https://github.com/cbfinn/maml_rl
- Environment: the TensorFlow rllab version, CPU;
- Task: MuJoCo;
- Model: MAML TRPO;
- Experiments: not run;
4. Evolved Policy Gradients (EPG): NeurIPS 2018
- https://github.com/openai/EPG
- Environment: Chainer, CPU;
- Task: MuJoCo;
- Model: EPG PPO;
- Experiments: not run;
5. A Simple Neural Attentive Meta-Learner: ICLR 2018
- https://github.com/chanb/metalearning_RL
- Environment: PyTorch, GPU;
- Tasks: Multi-armed Bandit, Tabular MDP;
- Models: SNAIL, RL2 (GRU) + PPO;
- Experiments: runs successfully without bugs;
6. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (PEARL): ICML 2019
- https://github.com/katerakelly/oyster
- Environment: PyTorch, GPU;
- Task: MuJoCo;
- Model: PEARL (SAC-based);
- Experiments: docker build . -t pearl fails during Docker setup; after abandoning Docker and installing the required packages locally, the code runs successfully; before installing the local packages, run conda config --set restore_free_channel true, otherwise most of the pinned package versions cannot be found and environment creation fails; for related questions see Chains朱朱的主頁 - 博客園 (cnblogs.com);
7. Improving Generalization in Meta Reinforcement Learning using Learned Objectives (MetaGenRL): ICLR 2020
- http://louiskirsch.com/code/metagenrl
- Environment: TensorFlow, GPU;
- Task: MuJoCo;
- Model: MetaGenRL;
- Experiments: running python ray_experiments.py train fails with bugs under both tensorflow-gpu==1.14.0 and tensorflow==1.13.2;
II. RL-Adventure
1. Deep Q-Learning:
- See the earlier blog post
- https://github.com/Kaixhin/Rainbow
- Environment: PyTorch, GPU;
- Task: Atari;
- Model: Rainbow;
- Experiments: runs successfully;
- https://github.com/TianhongDai/hindsight-experience-replay
- Environment: PyTorch, GPU (not recommended; CPU works better);
- Task: MuJoCo;
- Model: HER;
- Experiments: not run;
2. Policy Gradients:
- https://github.com/higgsfield/RL-Adventure-2
- Environment: PyTorch, GPU;
- Task: Gym;
- Models: A2C, GAE, PPO, ACER, DDPG, TD3, SAC, GAIL, HER;
- Experiments: runs successfully; the local code modifies it to fix bugs and issues and to improve performance, see https://github.com/lucifer2859/Policy-Gradients; in the local code, every model except HER converges and reaches good performance; for the HER problem see https://github.com/higgsfield/RL-Adventure-2/issues/14; the SAC implementation appears to deviate from the original paper (see https://github.com/higgsfield/RL-Adventure-2/issues/11); the A2C experiment only converges on CartPole-v0; a sketch of the GAE computation used here follows this entry;
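For reference, the following is a minimal sketch of generalized advantage estimation as it is commonly computed in rollout-based implementations like the one above; function and variable names are illustrative rather than the repository's own.

```python
def compute_gae(rewards, values, masks, next_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    rewards, values, masks: sequences of length T; masks[t] is 0.0 if the
    episode terminated at step t, else 1.0. next_value is the critic's
    estimate for the state following the last step. Returns per-step
    returns (advantage + value), usable as critic targets.
    """
    values = list(values) + [next_value]
    gae, returns = 0.0, []
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        gae = delta + gamma * lam * masks[t] * gae
        returns.insert(0, gae + values[t])
    return returns
```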
- https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail
- Environment: PyTorch/TensorFlow, GPU;
- Tasks: Atari, MuJoCo, PyBullet (including Racecar, Minitaur and Kuka), DeepMind Control Suite;
- Models: A2C, PPO, ACKTR, GAIL;
- Experiments: not run;
- https://github.com/ikostrikov/pytorch-a3c
- Environment: PyTorch, CPU;
- Task: Atari;
- Model: A3C;
- Experiments: the initial run fails with NotImplementedError; resolved by modifying envs.py as described in https://github.com/ikostrikov/pytorch-a3c/issues/66#issuecomment-559785590; runs successfully after the fix;
- https://github.com/haarnoja/sac
- Environment: TensorFlow, GPU;
- Task: continuous control tasks (MuJoCo);
- Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
- Experiments: not run;
- https://github.com/denisyarats/pytorch_sac
- Environment: PyTorch, GPU;
- Task: continuous control tasks (MuJoCo);
- Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
- Experiments: not run;
- http://github.com/rail-berkeley/softlearning/
- Environment: TensorFlow, GPU;
- Task: continuous control tasks (MuJoCo);
- Model: Soft Actor-Critic (SAC, second version, which drops the state-value function V; the difference is sketched after this entry);
- Experiments: not run;
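To make the first-version/second-version distinction above concrete: the second version of SAC removes the separate state-value network and instead bootstraps the critic target from the minimum of two target Q-networks plus the entropy term. The sketch below illustrates that target; the function and argument names are assumptions, not the API of any of the repositories listed here.

```python
import torch

def sac_v2_q_target(reward, done, next_obs, policy, target_q1, target_q2,
                    alpha, gamma=0.99):
    """Critic target for the second version of SAC (no separate V network).

    policy(next_obs) is assumed to return a sampled next action and its
    log-probability; target_q1/target_q2 are the two target critics;
    alpha is the entropy temperature.
    """
    with torch.no_grad():
        next_action, next_log_prob = policy(next_obs)
        min_q = torch.min(target_q1(next_obs, next_action),
                          target_q2(next_obs, next_action))
        # Soft value of the next state: min of the target Qs minus the entropy penalty.
        soft_value = min_q - alpha * next_log_prob
        return reward + gamma * (1.0 - done) * soft_value
```

In the first version, soft_value would instead come from a separately trained state-value network evaluated at next_obs.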
- https://github.com/ku2482/sac-discrete.pytorch
- Environment: PyTorch, GPU;
- Task: Atari;
- Model: SAC-Discrete (a discrete-action variant built on the newer continuous-control SAC);
- Experiments: runs successfully; the local code contains minor modifications, see https://github.com/lucifer2859/sac-discrete-pytorch; training converges, but performance differs from what the paper reports; a sketch of the discrete-action policy loss follows this entry;
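For the SAC-Discrete entry above, the essential change from continuous SAC is that with a finite action set the expectations over actions can be computed exactly from the categorical policy probabilities instead of by sampling. A minimal sketch of the resulting policy loss, with illustrative names rather than the repository's API:

```python
import torch
import torch.nn.functional as F

def sac_discrete_policy_loss(logits, q1, q2, alpha):
    """Policy loss for discrete-action SAC.

    logits: (B, n_actions) from the policy network;
    q1, q2: (B, n_actions) Q-values for every action from the two critics;
    alpha: entropy temperature.
    """
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    min_q = torch.min(q1, q2)
    # E_{a ~ pi}[alpha * log pi(a|s) - min Q(s, a)], computed in closed form.
    return (probs * (alpha * log_probs - min_q)).sum(dim=-1).mean()
```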
3. Both value-based and policy-gradient methods:
- https://github.com/ShangtongZhang/DeepRL
- Environment: PyTorch, GPU;
- Tasks: Atari, MuJoCo;
- Models: (Double/Dueling/Prioritized) DQN, C51, QR-DQN, (Continuous/Discrete) Synchronous Advantage A2C, N-Step DQN, DDPG, PPO, OC, TD3, COF-PAC, GradientDICE, Bi-Res-DDPG, DAC, Geoff-PAC, QUOTA, ACE;
- Experiments: runs successfully;
- https://github.com/astooke/rlpyt
- Environment: PyTorch, GPU;
- Task: Atari;
- Models: modular, optimized implementations of common deep RL algorithms in PyTorch, with unified infrastructure supporting all three major families of model-free algorithms: policy gradient, deep Q-learning, and Q-function policy gradient.
- Policy gradient: A2C, PPO.
- Replay buffers (supporting both DQN and QPG): non-sequence and sequence (for recurrent) replay, n-step returns (sketched after this entry), uniform or prioritized replay, full-observation or frame-based buffers (e.g. for Atari, storing only unique frames to save memory and reconstructing multi-frame observations).
- Deep Q-learning: DQN and variants: Double, Dueling, Categorical (up to Rainbow minus Noisy Nets), Recurrent (R2D2-style).
- Q-function policy gradient: DDPG, TD3, SAC.
- Experiments: runs successfully without bugs;
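As context for the n-step return support mentioned above, the following is a minimal sketch of how an n-step return is typically formed from a short replay sequence, truncated at episode boundaries; it is illustrative only, not rlpyt's actual API.

```python
def n_step_return(rewards, dones, bootstrap_value, gamma=0.99):
    """n-step return: r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1}
    + gamma^n * V(s_{t+n}), cut off at episode boundaries.

    rewards, dones: sequences of length n drawn from the replay buffer;
    bootstrap_value: value estimate for the state n steps ahead.
    """
    g = bootstrap_value
    for r, d in zip(reversed(rewards), reversed(dones)):
        g = r + gamma * (1.0 - d) * g
    return g
```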
- https://github.com/vitchyr/rlkit
- Environment: PyTorch, GPU;
- Task: gym[all];
- Models: Skew-Fit, RIG, TDM, HER, DQN, SAC (newer version), TD3, AWAC;
- Experiments: not run;
- https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch (PyTorch implementations of deep reinforcement learning algorithms and environments)
- Environment: PyTorch;
- Tasks: CartPole, MountainCar, Bit Flipping, Four Rooms, Long Corridor, Ant-[Maze, Push, Fall];
- Models: DQN, DQN with Fixed Q-Targets, DDQN, DDQN with Prioritised Experience Replay, Dueling DDQN, REINFORCE, DDPG, TD3, SAC, SAC-Discrete, A3C, A2C, PPO, DQN-HER, DDPG-HER, h-DQN, Stochastic NN-HRL, DIAYN;
- Experiments: some models run successfully on some tasks (for example, SAC-Discrete does not run successfully on Atari);
- https://github.com/hill-a/stable-baselines
- Environment: TensorFlow;
- https://github.com/openai/baselines
- Environment: TensorFlow;
- https://github.com/openai/spinningup
- Environment: TensorFlow/PyTorch;
- Description: an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL). For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning. This module contains a variety of helpful resources, including:
- a short introduction to RL terminology, kinds of algorithms, and basic theory,
- an essay about how to grow into an RL research role,
- a curated list of important papers organized by topic,
- a well-documented code repo of short, standalone implementations of key algorithms,
- and a few exercises to serve as warm-ups.
- Experiments: TD3 runs successfully on MuJoCo tasks;
- https://github.com/quantumiracle/Popular-RL-Algorithms (PyTorch implementations of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet, etc.)
- Environment: PyTorch / TensorFlow 2.0 + TensorLayer 2.0;
- Description: state-of-the-art model-free reinforcement learning algorithms implemented in PyTorch and TensorFlow 2.0 on OpenAI Gym environments and a self-implemented Reacher environment. Algorithms include SAC, DDPG, TD3, AC/A2C, PPO, QT-Opt (including the cross-entropy method), PointNet, Transporter, Recurrent Policy Gradient, Soft Decision Tree, Probabilistic Mixture-of-Experts, and more. Note that the repo is more a personal collection of algorithms the author implemented and tested during research and study than an official open-source library/package; each algorithm may come in several implementation versions, deliberately kept for reference and comparison, and the code has not been heavily cleaned or restructured. The repo itself contains only PyTorch implementations; for official RL libraries the author points to two TensorFlow 2.0 + TensorLayer 2.0 options:
- RL Tutorial (status: released) contains RL algorithm implementations as tutorials with simple structures.
- RLzoo (status: released) is a baseline implementation with a high-level API supporting a variety of popular environments, with more hierarchical structures for simple usage.
- Since TensorFlow 2.0 supports dynamic graph construction instead of static graphs, porting RL code between TensorFlow and PyTorch has become straightforward.
- Experiments: PPO does not converge on Atari tasks;
III. Meta Learning (Learning to Learn)
1. Platform: