Non-stationary RL


Usually we consider optimizing an objective under a stationary MDP with a fixed transition and reward function. In that setting we can learn the optimal policy through alternating policy evaluation and policy improvement steps. However, in a constantly-changing environment, where the transition kernel and the reward function may be unknown and may drift over time, it’s crucial for the learner to adapt to the environment through interaction and sampling.
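
As a refresher on the stationary case, below is a minimal tabular policy-iteration sketch. It assumes the MDP is given as NumPy arrays P (an S×A×S transition tensor) and R (an S×A reward matrix); these names and the dense-matrix formulation are illustrative conventions, not tied to any particular implementation.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99):
    """Tabular policy iteration for a stationary MDP.

    P: transition tensor of shape (S, A, S); R: reward matrix of shape (S, A).
    """
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)               # start from an arbitrary deterministic policy
    while True:
        # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
        P_pi = P[np.arange(S), pi]            # (S, S) transitions under pi
        R_pi = R[np.arange(S), pi]            # (S,) rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. one-step lookahead Q-values
        Q = R + gamma * P @ V                 # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):        # no change: policy is optimal
            return pi, V
        pi = new_pi
```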

Existing Literature on Continual Learning

  1. Towards Continual Reinforcement Learning: A Review and Perspectives
  2. A Comprehensive Survey of Continual Learning: Theory, Method and Application
  3. Reinforcement Learning Algorithm for Non-Stationary Environments

Now I’m trying to understand the big picture of the non-stationary RL field. The list above is the collection of papers I’m planning to read.

In general, I think there are four ways to deal with non-stationary environments:

  • Learn task representations through encoder-decoder architectures. By feeding a latent task variable as an additional input to the reward function and the transition kernel, we can account for the non-stationarity, and then train by jointly optimizing the policy and the probabilistic model (a latent-conditioned policy sketch follows this list).

  • Learn a good initialization for a new episode, which is essentially the idea behind meta-learning: learn a meta-policy across different tasks that can be adapted quickly to each new one (a MAML-style sketch follows this list).

  • Context detection. Since the context length may be stochastic, we need to detect when the context switches and collect only the data or trajectories related to the current context (a change-point detection sketch follows this list).

  • Knowledge distillation or policy consolidation. To prevent catastrophic forgetting, some works freeze the past policy and use it as a distillation target (or as an extra input) while learning in the new environment (a distillation-loss sketch follows this list).
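
For the first approach (task representations), here is a minimal PyTorch sketch, loosely in the spirit of latent-variable methods such as PEARL or VariBAD. The module names, dimensions, and the mean-pooling aggregation are illustrative assumptions, not a specific paper’s architecture.

```python
import torch
import torch.nn as nn

class TaskEncoder(nn.Module):
    """Maps a window of flattened (s, a, r, s') transitions to a latent task variable z."""
    def __init__(self, transition_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(transition_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, latent_dim))
    def forward(self, context):                 # context: (batch, window, transition_dim)
        return self.net(context).mean(dim=1)    # aggregate over the window -> (batch, latent_dim)

class LatentConditionedPolicy(nn.Module):
    """Policy that takes both the state and the inferred latent z as input."""
    def __init__(self, state_dim, latent_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, action_dim))
    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

class RewardDecoder(nn.Module):
    """Probabilistic-model part: predicts the reward from (s, a, z)."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim + latent_dim, hidden),
                                 nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, state, action, z):
        return self.net(torch.cat([state, action, z], dim=-1))
```

In practice the encoder and decoder would be trained with a reconstruction (or variational) objective on collected transitions, while the policy is trained with the usual RL loss on (state, z) inputs.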
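
For the second approach (learning a good initialization), here is a first-order MAML-style sketch of the inner/outer loop. The `task.support_batches()` and `task.query_batch()` calls are hypothetical data interfaces, and `loss_fn` stands in for whatever policy loss is used; this illustrates the loop structure under those assumptions rather than reproducing a specific algorithm.

```python
import copy
import torch

def maml_outer_step(meta_policy, tasks, loss_fn, inner_lr=0.01, meta_lr=0.001):
    """One (first-order) MAML-style meta-update over a batch of tasks."""
    meta_opt = torch.optim.SGD(meta_policy.parameters(), lr=meta_lr)
    meta_opt.zero_grad()
    for task in tasks:
        # Inner loop: adapt a copy of the meta-policy to this task
        adapted = copy.deepcopy(meta_policy)
        inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
        for batch in task.support_batches():      # hypothetical data interface
            inner_opt.zero_grad()
            loss_fn(adapted, batch).backward()
            inner_opt.step()
        # Outer loop (first-order approximation): gradients of the adapted
        # policy's held-out loss are accumulated into the meta-parameters.
        outer_loss = loss_fn(adapted, task.query_batch())
        grads = torch.autograd.grad(outer_loss, list(adapted.parameters()))
        for p, g in zip(meta_policy.parameters(), grads):
            p.grad = g.detach() if p.grad is None else p.grad + g.detach()
    meta_opt.step()                               # move the meta-initialization
    return meta_policy
```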
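
For context detection, one simple heuristic is to monitor the prediction error of the current model and flag a switch when it drifts well above its running baseline. The sketch below is a generic threshold test of my own, not the statistical change-point detector used in any particular paper.

```python
import numpy as np

class ContextChangeDetector:
    """Flags a context switch when recent model prediction error rises well
    above its running baseline (a simple heuristic threshold test)."""
    def __init__(self, window=50, threshold=3.0):
        self.window = window
        self.threshold = threshold
        self.errors = []

    def update(self, prediction_error):
        self.errors.append(prediction_error)
        if len(self.errors) < 2 * self.window:
            return False                          # not enough history to compare yet
        recent = np.mean(self.errors[-self.window:])
        baseline = np.mean(self.errors[:-self.window])
        spread = np.std(self.errors[:-self.window]) + 1e-8
        return (recent - baseline) / spread > self.threshold
```

When a switch is flagged, the replay buffer can be cleared (or a new buffer started) so that only trajectories from the current context are used for learning.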
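
For knowledge distillation / policy consolidation, the usual ingredient is a KL term that keeps the new policy close to a frozen copy of the old one on previously seen states. The sketch below assumes discrete actions and logit outputs; the weight `beta` and the direction of the KL are modeling choices, not prescribed by any of the papers above.

```python
import torch
import torch.nn.functional as F

def consolidation_loss(new_logits, frozen_logits, beta=1.0):
    """KL(frozen || new) over action distributions.

    Penalizes the new policy for drifting away from the frozen past policy on
    old states, mitigating catastrophic forgetting while the RL loss adapts
    the policy to the new environment.
    """
    old_log_probs = F.log_softmax(frozen_logits.detach(), dim=-1)
    new_log_probs = F.log_softmax(new_logits, dim=-1)
    kl = (old_log_probs.exp() * (old_log_probs - new_log_probs)).sum(dim=-1)
    return beta * kl.mean()
```

The total objective is then the RL loss on the new environment plus this term evaluated on states sampled from past experience.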