Non-stationary RL
Usually we consider optimizing an objective under a stationary MDP with a fixed transition and reward function. In that setting, we can learn the optimal policy by alternating policy evaluation and policy improvement steps. However, in a constantly changing environment, where the transition kernel and the reward function may be unknown, it’s crucial for our learners to adapt to the environment through interaction and sampling.
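For reference, here is a minimal policy iteration sketch on a toy stationary MDP; the transition tensor, reward matrix, and discount factor are made-up values for illustration only.

```python
# Minimal policy iteration on a small, fully known stationary MDP (toy values).
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]

policy = np.zeros(n_states, dtype=int)
for _ in range(100):
    # Policy evaluation: solve V = R_pi + gamma * P_pi V for the current policy.
    P_pi = P[np.arange(n_states), policy]          # (n_states, n_states)
    R_pi = R[np.arange(n_states), policy]          # (n_states,)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Policy improvement: act greedily with respect to the evaluated values.
    Q = R + gamma * P @ V                          # Q[s, a]
    new_policy = Q.argmax(axis=1)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
print("greedy policy:", policy, "state values:", V)
```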
Existing Literature On Continual Learning
- Towards Continual Reinforcement Learning: A Review and Perspectives
- A Comprehensive Survey of Continual Learning: Theory, Method and Application
- Reinforcement learning algorithm for non-stationary environments
Now I’m trying to understand the big picture of the non-stationary RL field. Here is the collection of papers I’m planning to read.
- Black Box Multi-Agent System
- Memory-Based Meta-Learning
- Debiased Offline Representation Learning
- Adaptive Deep RL for Piecewise Context
- Factored Adaptation
- Goal Oriented Shortest Path
- Counterfactual Off-Policy
- Inverse Online Learning
- RestartQ-UCB
- Sliding Window Upper-Confidence
- Optimizing for the Future
- Safe Policy Improvement
- Dynamic Regret
Usually, the numerical experiments are run on Grid World and MuJoCo environments to demonstrate the efficiency of the algorithms.
In general, I think there are four ways to deal with non-stationary environments.
Learn task representations through encoder-decoder architectures. By feeding a latent variable as an extra input to the reward function and the transition kernel, we can account for the non-stationarity, and then optimize the policy and the probabilistic model jointly.
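Below is a minimal sketch of this idea, assuming a PyTorch setup: a GRU encoder maps a window of recent transitions to a latent z, and a decoder predicts the next state and reward from (s, a, z). Module sizes and names are illustrative, not taken from any specific paper; in practice the policy would also condition on z.

```python
import torch
import torch.nn as nn

state_dim, action_dim, latent_dim, window = 4, 2, 8, 10
transition_dim = state_dim + action_dim + state_dim + 1   # (s, a, s', r)

encoder = nn.GRU(input_size=transition_dim, hidden_size=latent_dim, batch_first=True)
decoder = nn.Sequential(
    nn.Linear(state_dim + action_dim + latent_dim, 64), nn.ReLU(),
    nn.Linear(64, state_dim + 1),               # predicts (s', r)
)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=3e-4)

# Fake batch of transition windows: (batch, window, transition_dim).
batch = torch.randn(32, window, transition_dim)
context = batch[:, :-1]                          # past transitions used as context
s = batch[:, -1, :state_dim]                     # current state
a = batch[:, -1, state_dim:state_dim + action_dim]
target = batch[:, -1, state_dim + action_dim:]   # true (s', r) for the last step

_, h = encoder(context)                          # h: (1, batch, latent_dim)
z = h.squeeze(0)                                 # latent task variable
pred = decoder(torch.cat([s, a, z], dim=-1))
loss = nn.functional.mse_loss(pred, target)      # reconstruction objective
opt.zero_grad(); loss.backward(); opt.step()
```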
Learn a good initialization for a new episode, which is essentially the idea behind meta-learning, where we learn a meta-policy across different tasks.
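Here is a minimal first-order (Reptile-style) sketch of learning such an initialization, where each "task" is reduced to a toy quadratic loss whose minimum shifts; the task family and step sizes are illustrative assumptions, not a specific published method.

```python
import numpy as np

rng = np.random.default_rng(0)
meta_theta = np.zeros(2)          # meta-initialization shared across tasks
inner_lr, outer_lr, inner_steps = 0.1, 0.05, 5

def task_grad(theta, task_center):
    # Gradient of the toy per-task loss ||theta - task_center||^2.
    return 2.0 * (theta - task_center)

for meta_iter in range(1000):
    task_center = rng.normal(loc=1.0, scale=0.5, size=2)    # sample a task
    theta = meta_theta.copy()
    for _ in range(inner_steps):                             # fast adaptation
        theta -= inner_lr * task_grad(theta, task_center)
    # Move the meta-initialization toward the adapted parameters.
    meta_theta += outer_lr * (theta - meta_theta)

print("meta-initialization:", meta_theta)  # ends up near the mean task center (~1.0)
```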
Context detection. Since the context length may be stochastic, we need to detect context switches and collect only the data or trajectories related to the current context.
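One simple way to sketch this is a CUSUM-style detector over the model's one-step prediction errors; the synthetic error stream, baseline, and threshold below are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Prediction errors: small while the context holds, larger after a switch at t=200.
errors = np.concatenate([rng.normal(0.1, 0.05, 200), rng.normal(0.6, 0.05, 100)])

baseline, threshold, cusum = 0.2, 3.0, 0.0
buffer = []                                      # data for the current context only
for t, err in enumerate(errors):
    cusum = max(0.0, cusum + (err - baseline))   # accumulate excess error
    buffer.append(err)
    if cusum > threshold:
        print(f"context change detected at step {t}; resetting buffer and model")
        buffer, cusum = [], 0.0                  # keep only data from the new context
```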
Knowledge distillation or policy consolidation. To prevent catastrophic forgetting, some works freeze the past policy and use it as an input or distillation target while learning in the new environment.
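A minimal consolidation sketch, assuming PyTorch: the past policy is frozen and the new policy pays a KL penalty for drifting away from it on states from past experience. The network sizes, the dummy new-task loss, and the penalty weight are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, n_actions, beta = 4, 3, 0.5
old_policy = nn.Linear(state_dim, n_actions)     # stand-in for the frozen past policy
new_policy = nn.Linear(state_dim, n_actions)
for p in old_policy.parameters():
    p.requires_grad_(False)                      # freeze the consolidated policy
opt = torch.optim.Adam(new_policy.parameters(), lr=1e-3)

states = torch.randn(64, state_dim)              # states drawn from past experience
dummy_actions = torch.randint(0, n_actions, (64,))                  # stand-in targets
new_task_loss = F.cross_entropy(new_policy(states), dummy_actions)  # placeholder RL loss

# Distillation term: KL(old || new) keeps the new policy close to the frozen one.
old_log_probs = F.log_softmax(old_policy(states), dim=-1)
new_log_probs = F.log_softmax(new_policy(states), dim=-1)
kl = F.kl_div(new_log_probs, old_log_probs, log_target=True, reduction="batchmean")

loss = new_task_loss + beta * kl
opt.zero_grad(); loss.backward(); opt.step()
```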