Problem
GAIL and AIRL can be slow to train because they run RL in the inner loop, which is computationally expensive and can require many environment interactions. Behavioral cloning is supervised learning, so it is fast and needs no environment interactions; however, its peak performance in complex environments is often weaker than that of GAIL and AIRL.
Solution
Add an option to train a policy from the demonstrations with BC, and then fine-tune that policy using GAIL/AIRL on the same set of demonstrations. The Stable Baselines v2 implementation of GAIL supported this option, for example.
This may not always help. If BC learns a bad policy, we could get stuck in a local minimum. For AIRL, the resulting reward function might also be more fragile, as the transition distribution seen during training will be more limited (for GAIL the discriminator could be more fragile as well, but it was never intended to be reused).
We already have a BC implementation, so I think this should probably be a feature added to the train_imitation script rather than to the algorithm itself. That said, if it ends up being an involved implementation, adding a helper method could be appropriate so that people using the Python API directly can also benefit; a rough sketch of that usage is below.
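For reference, here is a minimal, hypothetical sketch of how the two phases could be chained through the Python API. It assumes recent versions of `imitation` and stable-baselines3; the exact constructor arguments (e.g. `rng` and `policy=` on `bc.BC`) vary between releases, and loading of the demonstrations is elided.

```python
"""Sketch: warm-start with BC, then fine-tune the same policy with GAIL.

Not exact code for any particular imitation release; treat the constructor
arguments as assumptions to check against the installed version.
"""
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from imitation.algorithms import bc
from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm

rng = np.random.default_rng(0)
venv = make_vec_env("CartPole-v1", n_envs=8)
demonstrations = ...  # sequence of expert trajectories, loaded elsewhere

# Generator policy shared by both phases.
learner = PPO("MlpPolicy", venv, verbose=0)

# Phase 1: behavioral cloning, training the PPO policy in place.
bc_trainer = bc.BC(
    observation_space=venv.observation_space,
    action_space=venv.action_space,
    policy=learner.policy,
    demonstrations=demonstrations,
    rng=rng,
)
bc_trainer.train(n_epochs=10)

# Phase 2: fine-tune the warm-started policy with GAIL on the same data.
reward_net = BasicRewardNet(
    venv.observation_space,
    venv.action_space,
    normalize_input_layer=RunningNorm,
)
gail_trainer = GAIL(
    demonstrations=demonstrations,
    demo_batch_size=32,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
)
gail_trainer.train(total_timesteps=100_000)
```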
Possible alternative solutions
We could potentially support this more broadly, even for algorithms that don't learn from demonstrations such as the preference_comparisons module, in case users have both demonstrations and preference comparisons and want to learn from both (a setting studied in e.g. Ibarz et al., 2018). I don't think it's worth going out of our way to support this, but if it factors out nicely (e.g. as some extra Sacred ingredient to warm-start the policy, sketched below), it could be worth adding.
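A hypothetical shape for such an ingredient is below; the names (`bc_warmstart`, `warm_start_policy`, the config keys) are invented for illustration and are not part of imitation's existing scripts.

```python
"""Hypothetical Sacred ingredient for warm-starting a policy with BC."""
from sacred import Ingredient

bc_warmstart_ingredient = Ingredient("bc_warmstart")


@bc_warmstart_ingredient.config
def config():
    enabled = False  # whether to pretrain the policy with BC
    n_epochs = 10  # BC epochs before handing off to GAIL/AIRL/etc.


@bc_warmstart_ingredient.capture
def warm_start_policy(policy, demonstrations, venv, rng, enabled, n_epochs):
    """Optionally pretrain `policy` on `demonstrations` with BC, in place."""
    if not enabled:
        return policy
    from imitation.algorithms import bc

    bc_trainer = bc.BC(
        observation_space=venv.observation_space,
        action_space=venv.action_space,
        policy=policy,
        demonstrations=demonstrations,
        rng=rng,
    )
    bc_trainer.train(n_epochs=n_epochs)
    return bc_trainer.policy
```

Each training script would then call `warm_start_policy` on its generator policy before starting its main loop, with the `enabled` flag defaulting to off so existing configs are unaffected.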