Non-stationary RL
Usually we consider optimizing an objective under a stationary MDP with fixed transition and reward functions, where the optimal policy can be learned by alternating policy evaluation and policy improvement steps. However, in a constantly changing environment, where the transition kernel and the reward function may be unknown, it is crucial for learners to adapt to the environment through interaction and sampling.
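To make the stationary baseline concrete, here is a minimal sketch of tabular policy iteration, assuming a known transition kernel `P[s, a, s']` and reward table `R[s, a]` (both arrays are hypothetical placeholders, not from any specific paper):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration for a stationary MDP.

    P: (n_states, n_actions, n_states) transition kernel (assumed known)
    R: (n_states, n_actions) reward table (assumed known)
    """
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)           # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(n_states), pi]        # (n_states, n_states)
        R_pi = R[np.arange(n_states), pi]        # (n_states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. one-step lookahead.
        Q = R + gamma * P @ V                    # (n_states, n_actions)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):           # no change => optimal policy
            return pi, V
        pi = new_pi
```

In the non-stationary setting this exact-solve step is no longer available, since `P` and `R` drift over time and must be tracked from samples instead.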
Existing Literature On Continual Learning
- Towards Continual Reinforcement Learning: A Review and Perspectives
- A Comprehensive Survey of Continual Learning: Theory, Method and Application
- Reinforcement learning algorithm for non-stationary environments
Now I’m trying to understand the big picture of the non-stationary RL field. Here is the collection of papers I’m planning to read:
- Black Box Multi-Agent System
- Memory-Based Meta-Learning
- Debiased Offline Representation Learning
- Adaptive Deep RL for Piecewise Context
- Factored Adaptation
- Goal Oriented Shortest Path
- Counterfactual Off-Policy
- Inverse Online Learning
- RestartQ-UCB
- Sliding Window Upper-Confidence
- Optimizing for the Future
- Safe Policy Improvement
- Dynamic Regret
Typically, the numerical experiments are run on Grid World and MuJoCo environments to demonstrate the efficiency of the algorithms.
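As one concrete illustration of such a testbed, below is a minimal, hypothetical sketch of a piecewise-stationary Grid World: the goal cell (and hence the reward function) relocates every `switch_every` steps, so a learner must keep re-adapting. All names and parameters here are my own assumptions, not taken from any of the papers above.

```python
import numpy as np

class NonStationaryGridWorld:
    """Grid World whose goal cell jumps to a random location
    every `switch_every` steps (piecewise-stationary rewards)."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=5, switch_every=500, seed=0):
        self.size = size
        self.switch_every = switch_every
        self.rng = np.random.default_rng(seed)
        self.t = 0
        self.goal = tuple(self.rng.integers(0, size, 2))
        self.state = (0, 0)

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.state[0] + dr, 0), self.size - 1)
        c = min(max(self.state[1] + dc, 0), self.size - 1)
        self.state = (r, c)
        reward = 1.0 if self.state == self.goal else 0.0
        self.t += 1
        if self.t % self.switch_every == 0:        # non-stationarity kicks in
            self.goal = tuple(self.rng.integers(0, self.size, 2))
        return self.state, reward
```

An agent interacting with this environment (e.g., tabular Q-learning with a sliding window or periodic restarts) will see its learned values go stale at every switch, which is exactly the regime the papers above study.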