Trust Region Policy Optimisation, or TRPO, is a policy gradient method in reinforcement learning that avoids parameter updates that change the policy too much, by placing a KL-divergence constraint on the size of the policy update at each iteration.
Take the case of off-policy reinforcement learning, where the behaviour policy $\beta$ for collecting trajectories on rollout workers is different from the policy $\pi_\theta$ to optimise for. The objective function in an off-policy model measures the total advantage over the state visitation distribution and actions, while the mismatch between the training data distribution and the true policy state distribution is compensated for with an importance sampling estimator:

$$
J(\theta) = \sum_{s \in S} \rho^{\pi_{\theta_\text{old}}} \sum_{a \in A} \Big( \pi_\theta(a \mid s) \hat{A}_{\theta_\text{old}}(s, a) \Big) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}},\, a \sim \beta} \bigg[ \frac{\pi_\theta(a \mid s)}{\beta(a \mid s)} \hat{A}_{\theta_\text{old}}(s, a) \bigg]
$$

where $\theta_\text{old}$ denotes the policy parameters before the update, $\rho^{\pi_{\theta_\text{old}}}$ is the state visitation distribution under the old policy, $\beta(a \mid s)$ is the behaviour policy used to collect trajectories, and $\hat{A}(\cdot)$ is an estimated advantage, since the true advantage function is usually unknown.
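As a rough illustration (not taken from the TRPO paper), the sketch below estimates this importance-sampled objective on toy data: a fixed behaviour policy `beta_probs`, a target policy `pi_probs`, and random advantage estimates are all made-up placeholders for quantities a real agent would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): 4 discrete actions and a batch of sampled states,
# each summarised only by the action probabilities of both policies and an
# advantage estimate for the action the behaviour policy actually took.
n_samples, n_actions = 1000, 4

# Behaviour policy beta(a|s) used on the rollout workers (here: uniform).
beta_probs = np.full((n_samples, n_actions), 1.0 / n_actions)

# Target policy pi_theta(a|s) we want to optimise (here: arbitrary but valid).
logits = rng.normal(size=(n_samples, n_actions))
pi_probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Actions sampled from the behaviour policy and estimated advantages A_hat(s, a).
actions = np.array([rng.choice(n_actions, p=p) for p in beta_probs])
advantages = rng.normal(size=n_samples)

# Importance-sampling estimate of the objective:
#   J(theta) ~= mean over samples of pi_theta(a|s) / beta(a|s) * A_hat(s, a)
idx = np.arange(n_samples)
ratio = pi_probs[idx, actions] / beta_probs[idx, actions]
j_estimate = np.mean(ratio * advantages)
print("importance-sampled objective estimate:", j_estimate)
```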
When training on policy, the policy for collecting data is in theory the same as the policy we want to optimise. However, when rollout workers and optimisers are running in parallel asynchronously, the behaviour policy can get stale. TRPO accounts for this subtle difference: it labels the behaviour policy as $\pi_{\theta_\text{old}}(a \mid s)$ and thus the objective function becomes:

$$
J(\theta) = \mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}},\, a \sim \pi_{\theta_\text{old}}} \bigg[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_\text{old}}(a \mid s)} \hat{A}_{\theta_\text{old}}(s, a) \bigg]
$$
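A minimal sketch of this surrogate objective, assuming a discrete action space and policies represented by logits (the tensor shapes and dummy data below are assumptions, and none of TRPO's update machinery is included):

```python
import torch
from torch.distributions import Categorical

def surrogate_objective(new_logits, old_logits, actions, advantages):
    """Mean of pi_theta(a|s) / pi_theta_old(a|s) * A_hat(s, a), to be maximised."""
    new_log_prob = Categorical(logits=new_logits).log_prob(actions)
    # The old (behaviour) policy's log-probs are treated as constants: no gradient.
    old_log_prob = Categorical(logits=old_logits.detach()).log_prob(actions)
    ratio = torch.exp(new_log_prob - old_log_prob)
    return (ratio * advantages).mean()

# Usage with dummy data standing in for a rollout batch.
batch, n_actions = 64, 4
old_logits = torch.randn(batch, n_actions)
new_logits = old_logits + 0.05 * torch.randn(batch, n_actions)
actions = Categorical(logits=old_logits).sample()
advantages = torch.randn(batch)
print(surrogate_objective(new_logits, old_logits, actions, advantages))
```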
TRPO aims to maximise the objective function $J(\theta)$ subject to a trust region constraint, which enforces that the distance between the old and new policies, measured by KL-divergence, stays small enough, within a parameter $\delta$:

$$
\mathbb{E}_{s \sim \rho^{\pi_{\theta_\text{old}}}} \big[ D_\text{KL}\big(\pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \big] \le \delta
$$
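As a sketch of how the constraint can be checked in practice (the value of `delta` here is an arbitrary assumption, and the TRPO paper actually solves the constrained problem with a conjugate-gradient step followed by a backtracking line search, which is not shown):

```python
import torch
from torch.distributions import Categorical, kl_divergence

delta = 0.01  # hypothetical trust-region size

def within_trust_region(old_logits, new_logits, delta):
    """Mean KL(pi_theta_old || pi_theta) over sampled states, compared to delta."""
    old_dist = Categorical(logits=old_logits)
    new_dist = Categorical(logits=new_logits)
    mean_kl = kl_divergence(old_dist, new_dist).mean()
    return mean_kl.item() <= delta, mean_kl.item()

# A candidate parameter update would only be accepted if the constraint holds
# (and the surrogate objective improves).
batch, n_actions = 64, 4
old_logits = torch.randn(batch, n_actions)
new_logits = old_logits + 0.05 * torch.randn(batch, n_actions)
print(within_trust_region(old_logits, new_logits, delta))
```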