A Survey of Open-Source Code Related to (Meta-)Reinforcement Learning

Local code: https://github.com/lucifer2859/meta-RL

Introduction to meta-reinforcement learning: https://www.cnblogs.com/lucifer1997/p/13603979.html

 

I. Meta-RL

1. Learning to Reinforcement Learn: CogSci 2017

  • https://github.com/awjuliani/Meta-RL
    • Environment: TensorFlow, CPU;
    • Tasks: Dependent (Easy, Medium, Hard, Uniform) / Independent / Restless Bandits, Contextual Bandit, GridWorld;
      • A3C-Meta-Bandit - Set of bandit tasks described in paper. Including: Independent, Dependent, and Restless bandits.
      • A3C-Meta-Context - Rainbow bandit task using randomized colors to indicate reward-giving arm in each episode.
      • A3C-Meta-Grid - Rainbow Gridworld task; a variation of gridworld in which goal colors are randomized each episode and must be learned "on the fly."
    • Model: one-layer LSTM A3C [Figure 1(a), without the Enc layer];
    • Experiments: runs successfully with no bugs; training converges; results roughly match the paper; performance does not reach the paper's reported level (with the current hyperparameters); the local code modifies it slightly, see https://github.com/lucifer2859/meta-RL/tree/master/Meta-RL; a minimal sketch of the recurrent policy input is given below.
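The core of the model listed above is a recurrent actor-critic whose input carries the learning signal itself. Below is a minimal sketch (an assumption, written in PyTorch rather than the repo's TensorFlow, with made-up class and argument names) of that idea: the LSTM receives the observation together with the previous action, previous reward, and elapsed time, so its hidden state can implement the fast "inner" learning loop.

```python
# Minimal sketch (assumed; not the repo's code) of the recurrent A3C policy from
# "Learning to Reinforcement Learn": the LSTM input concatenates the observation
# with the one-hot previous action, previous reward, and timestep, so the hidden
# state can act as a fast, within-episode learner.
import torch
import torch.nn as nn

class MetaRLPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=48):
        super().__init__()
        # input = observation + one-hot previous action + previous reward + timestep
        self.lstm = nn.LSTM(obs_dim + n_actions + 2, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)  # actor logits
        self.value_head = nn.Linear(hidden, 1)           # critic value

    def forward(self, obs, prev_action_onehot, prev_reward, timestep, state=None):
        # all inputs are (batch, seq, feature) tensors
        x = torch.cat([obs, prev_action_onehot, prev_reward, timestep], dim=-1)
        h, state = self.lstm(x, state)   # hidden state carries the "fast" learning
        return self.policy_head(h), self.value_head(h), state
```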

2. RL^2: Fast Reinforcement Learning via Slow Reinforcement Learning (RL2): ICLR 2017

  • https://github.com/mwufi/meta-rl-bandits
    • Environment: PyTorch, CPU;
    • Task: Independent Bandit;
    • Model: two-layer LSTM REINFORCE;
    • Experiments: runs successfully with no bugs; the model does not match the paper, whose RNN is a GRU; training does not converge (with the current hyperparameters); a sketch of the trial structure described in the paper follows after this list;
  • https://github.com/VashishtMadhavan/rl2
    • Environment: TensorFlow, CPU;
    • Task: Dependent Bandit;
    • Model: one-layer LSTM A3C [without the Enc layer];
    • Experiments: fails to run with gym.error.UnregisteredEnv: No registered env with id: MediumBandit-v0;
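For reference, here is a minimal sketch (an assumption; not taken from either repository, and the helper names env_sampler and policy.act are made up) of the RL^2 trial structure the paper describes: several episodes of one sampled task are run back to back while the recurrent hidden state is kept, so the policy can adapt within the trial.

```python
# Minimal sketch (assumed) of an RL^2 trial: the recurrent state is reset only
# at the start of a trial, not between the episodes inside it, so the policy
# can use earlier episodes to adapt to the sampled task (e.g. bandit arm probs).
def run_trial(env_sampler, policy, episodes_per_trial=2):
    env = env_sampler()                 # sample a new task/MDP for this trial
    hidden = None                       # recurrent state persists across episodes below
    trajectory = []
    for _ in range(episodes_per_trial):
        obs, done = env.reset(), False
        prev_action, prev_reward = 0, 0.0
        while not done:
            action, hidden = policy.act(obs, prev_action, prev_reward, done, hidden)
            next_obs, reward, done, _ = env.step(action)
            trajectory.append((obs, action, reward, done))
            obs, prev_action, prev_reward = next_obs, action, reward
    return trajectory                   # the whole trial is one training sequence
```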

3. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (MAML): ICML 2017

4. Evolved Policy Gradients (EPG): NeurIPS 2018

5. A Simple Neural Attentive Meta-Learner: ICLR 2018

6. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (PEARL): ICML 2019

  • https://github.com/katerakelly/oyster
    • Environment: PyTorch, GPU;
    • Task: MuJoCo;
    • Model: PEARL (SAC-based);
    • Experiments: docker build . -t pearl fails during Docker setup; after abandoning Docker and installing the required packages locally, the code runs successfully; when installing the packages locally, first run conda config --set restore_free_channel true, otherwise most of the pinned package versions cannot be found and environment creation fails; for related issues see Chains朱朱的主頁 - 博客園 (cnblogs.com); a sketch of PEARL's context encoder is given below.
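To clarify what "PEARL (SAC-based)" refers to, here is a minimal sketch (an assumption; not the oyster repo's API, and the class name is made up) of PEARL's probabilistic context encoder: each context transition is mapped to a Gaussian factor, the factors are combined as a product of Gaussians into a posterior over the latent task variable z, and a sampled z conditions the SAC actor and critics.

```python
# Minimal sketch (assumed) of PEARL's context encoder q(z | context).
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, transition_dim, latent_dim, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),   # mean and log-variance per factor
        )

    def forward(self, context):                  # context: (num_transitions, transition_dim)
        mu, logvar = self.net(context).chunk(2, dim=-1)
        var = torch.exp(logvar).clamp(min=1e-7)
        # product of independent Gaussian factors -> posterior over the task variable z
        post_var = 1.0 / (1.0 / var).sum(dim=0)
        post_mu = post_var * (mu / var).sum(dim=0)
        return torch.distributions.Normal(post_mu, post_var.sqrt())

# usage: z = ContextEncoder(d, k)(context_batch).rsample()  # feed z to the SAC networks
```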

7. Improving Generalization in Meta Reinforcement Learning using Learned Objectives (MetaGenRL): ICLR 2020

  • http://louiskirsch.com/code/metagenrl
    • Environment: TensorFlow, GPU;
    • Task: MuJoCo;
    • Model: MetaGenRL;
    • Experiments: running python ray_experiments.py train hits bugs under both tensorflow-gpu==1.14.0 and tensorflow==1.13.2;

 

II. RL-Adventure

1. Deep Q-Learning:

2. Policy Gradients:

  • https://github.com/haarnoja/sac
    • Environment: TensorFlow, GPU;
    • Task: Continuous Control Tasks (MuJoCo);
    • Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
    • Experiments: not run;
  • https://github.com/denisyarats/pytorch_sac
    • Environment: PyTorch, GPU;
    • Task: Continuous Control Tasks (MuJoCo);
    • Model: Soft Actor-Critic (SAC, first version, with a separate state-value function V);
    • Experiments: not run;
  • http://github.com/rail-berkeley/softlearning/
    • Environment: TensorFlow, GPU;
    • Task: Continuous Control Tasks (MuJoCo);
    • Model: Soft Actor-Critic (SAC, second version, with the state-value function V removed); see the sketch after this list for how the two versions differ;
    • Experiments: not run;
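The v1/v2 distinction noted above shows up most clearly in how the Q targets are formed. A minimal sketch (an assumption; not taken from any of the repositories above, and v_target_net, q1_target, q2_target, and policy.sample are made-up placeholders):

```python
# Minimal sketch (assumed) of the SAC v1 vs. v2 difference noted in the list.
import torch

def sac_v1_q_target(reward, done, next_obs, v_target_net, gamma=0.99):
    # v1: the Q target bootstraps through a separate (target) state-value network V
    with torch.no_grad():
        return reward + gamma * (1.0 - done) * v_target_net(next_obs)

def sac_v2_q_target(reward, done, next_obs, policy, q1_target, q2_target,
                    alpha=0.2, gamma=0.99):
    # v2: V is dropped; the target uses the min of the target critics plus the entropy bonus
    with torch.no_grad():
        next_action, logp = policy.sample(next_obs)   # reparameterized action and log-prob
        q_min = torch.min(q1_target(next_obs, next_action),
                          q2_target(next_obs, next_action))
        return reward + gamma * (1.0 - done) * (q_min - alpha * logp)
```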

3. Both of the above:

  • https://github.com/ShangtongZhang/DeepRL
    • Environment: PyTorch, GPU;
    • Tasks: Atari, MuJoCo;
    • Models: (Double/Dueling/Prioritized) DQN, C51, QR-DQN, (Continuous/Discrete) Synchronous Advantage A2C, N-Step DQN, DDPG, PPO, OC, TD3, COF-PAC, GradientDICE, Bi-Res-DDPG, DAC, Geoff-PAC, QUOTA, ACE;
    • Experiments: runs successfully;
  • https://github.com/astooke/rlpyt
    • Environment: PyTorch, GPU;
    • Task: Atari;
    • Models: modular, optimized implementations of common deep RL algorithms in PyTorch, with unified infrastructure supporting all three major families of model-free algorithms: policy gradient, deep Q-learning, and Q-function policy gradient.
      • Policy Gradient: A2C, PPO.
      • Replay Buffers (supporting both DQN + QPG): non-sequence and sequence (for recurrent) replay, n-step returns, uniform or prioritized replay, full-observation or frame-based buffer (e.g. for Atari, stores only unique frames to save memory and reconstructs multi-frame observations); the n-step return idea is sketched after this list.
      • Deep Q-Learning: DQN + variants: Double, Dueling, Categorical (up to Rainbow minus Noisy Nets), Recurrent (R2D2-style).
      • Q-Function Policy Gradient: DDPG, TD3, SAC.
    • Experiments: runs successfully with no bugs;
  • https://github.com/vitchyr/rlkit
    • Environment: PyTorch, GPU;
    • Tasks: gym[all];
    • Models: Skew-Fit, RIG, TDM, HER, DQN, SAC (new version), TD3, AWAC;
    • Experiments: not run;
  • https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch
    • Environment: PyTorch;
    • Tasks: CartPole, MountainCar, Bit Flipping, Four Rooms, Long Corridor, Ant-[Maze, Push, Fall];
    • Models: DQN, DQN with Fixed Q Target, DDQN, DDQN with Prioritised Experience Replay, Dueling DDQN, REINFORCE, DDPG, TD3, SAC, SAC-Discrete, A3C, A2C, PPO, DQN-HER, DDPG-HER, h-DQN, Stochastic NN-HRL, DIAYN;
    • Experiments: some models run successfully on some tasks (e.g. SAC-Discrete fails to run on Atari); a sketch of the SAC-Discrete actor loss follows after this list;
  • https://github.com/hill-a/stable-baselines
    • Environment: TensorFlow;
  • https://github.com/openai/baselines
    • Environment: TensorFlow;
  • https://github.com/openai/spinningup
    • Environment: TensorFlow/PyTorch;
    • Introduction: This is an educational resource produced by OpenAI that makes it easier to learn about deep reinforcement learning (deep RL). For the unfamiliar: reinforcement learning (RL) is a machine learning approach for teaching agents how to solve tasks by trial and error. Deep RL refers to the combination of RL with deep learning. This module contains a variety of helpful resources, including:
      • a short introduction to RL terminology, kinds of algorithms, and basic theory,
      • an essay about how to grow into an RL research role,
      • a curated list of important papers organized by topic,
      • a well-documented code repo of short, standalone implementations of key algorithms,
      • and a few exercises to serve as warm-ups.
    • Experiments: TD3 runs successfully on MuJoCo tasks;
  • https://github.com/quantumiracle/Popular-RL-Algorithms
    • Environment: PyTorch / TensorFlow 2.0 + TensorLayer 2.0;
    • Introduction: state-of-the-art model-free reinforcement learning algorithms implemented in PyTorch and TensorFlow 2.0, on OpenAI gym environments and a self-implemented Reacher environment. Algorithms include SAC, DDPG, TD3, AC/A2C, PPO, QT-Opt (including the cross-entropy method), PointNet, Transporter, Recurrent Policy Gradient, Soft Decision Tree, Probabilistic Mixture-of-Experts, and more. Note that this repo is more a personal collection of algorithms implemented and tested by the author during research and study than an official open-source library/package for use; the author nevertheless shares it in the hope that it is helpful and welcomes discussion of the implementations, but has not spent much time cleaning or structuring the code. As you may notice, each algorithm may have several implementation versions, shown deliberately for reference and comparison. In addition, this repo contains only the PyTorch implementations; for official RL algorithm libraries, the author provides the following two options with TensorFlow 2.0 + TensorLayer 2.0:
      • RL Tutorial (Status: Released) contains RL algorithm implementations as tutorials with simple structures.

      • RLzoo (Status: Released) is a baseline implementation with high-level API supporting a variety of popular environments, with more hierarchical structures for simple usage.

      Since TensorFlow 2.0 already provides dynamic graph construction instead of static graph construction, porting RL code between TensorFlow and PyTorch becomes straightforward.

    • Experiments: PPO runs on Atari tasks but performance fails to converge;
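The "n-step returns" feature in the rlpyt replay-buffer list above refers to folding the next n rewards into one discounted return and bootstrapping from the state n steps ahead. A generic sketch (an assumption; not rlpyt's actual classes or function names):

```python
# Generic sketch (assumed) of an n-step return, as used by replay buffers that
# support n-step targets: accumulate up to n discounted rewards and return the
# discount to apply to the bootstrap value V(s_{t+n}); stop at episode ends.
def n_step_return(rewards, dones, n=3, gamma=0.99):
    g, discount = 0.0, 1.0
    for r, d in zip(rewards[:n], dones[:n]):
        g += discount * r
        discount *= gamma
        if d:                      # episode ended: nothing to bootstrap from
            return g, 0.0
    return g, discount             # bootstrap term is discount * V(s_{t+n})
```

The SAC-Discrete variant mentioned for the p-christ repository differs from continuous SAC in that a categorical policy lets the expectation over actions be computed exactly rather than sampled. A minimal sketch (an assumption; not that repo's code) of the resulting actor loss:

```python
# Minimal sketch (assumed) of the SAC-Discrete actor loss: with a categorical
# policy, E_{a~pi}[ alpha * log pi(a|s) - Q(s,a) ] is computed exactly over all actions.
import torch
import torch.nn.functional as F

def sac_discrete_actor_loss(logits, q1, q2, alpha=0.2):
    # logits, q1, q2: (batch, n_actions)
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    q_min = torch.min(q1, q2)
    return (probs * (alpha * log_probs - q_min)).sum(dim=-1).mean()
```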

 

III. Meta Learning (Learn to Learn)

1. Platform:

