Neil Walton

Durham University

Markov Policy Gradient Algorithms

Probability Seminar

12th May 2023, 3:00 pm – 4:00 pm
Fry Building, 2.04

We consider a policy gradient algorithm applied to a finite-arm bandit problem with Bernoulli rewards. We allow learning rates to depend on the current state of the algorithm, rather than using a deterministic, time-decreasing learning rate. The state of the algorithm then forms a Markov chain on the probability simplex. We apply Foster-Lyapunov techniques to analyse the stability of this Markov chain. We prove that, if learning rates are well chosen, the policy gradient algorithm is a transient Markov chain and the state of the chain converges to the optimal arm with logarithmic or poly-logarithmic regret.
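For readers who would like a concrete picture of the setting, the following is a minimal sketch, not the speaker's algorithm. It assumes a simple reward-weighted update on the simplex (the policy moves towards the played arm when a reward is received) as a stand-in for the policy gradient step, and a hypothetical learning-rate rule `example_rate` that depends only on the current policy. The function names, the constant `c` and the arm means are illustrative assumptions.

```python
import random


def example_rate(p, c=0.5):
    """Hypothetical state-dependent learning rate (an assumption, not from the talk).

    It depends only on the current policy p and shrinks as p concentrates on one arm.
    """
    return c * (1.0 - max(p))


def policy_step(p, means, rate_fn, rng):
    """One transition of the chain on the probability simplex.

    Samples an arm from p, draws a Bernoulli reward, and moves p towards the
    played arm's vertex by rate * reward. The next state depends only on the
    current p, so the sequence of policies is a Markov chain.
    """
    arm = rng.choices(range(len(p)), weights=p)[0]
    reward = 1 if rng.random() < means[arm] else 0
    step = rate_fn(p) * reward  # lies in [0, 1], so p stays on the simplex
    return [(1.0 - step) * q + (step if i == arm else 0.0) for i, q in enumerate(p)]


def run(means, steps, seed=0):
    """Run the chain from the uniform policy and return the final policy."""
    rng = random.Random(seed)
    p = [1.0 / len(means)] * len(means)
    for _ in range(steps):
        p = policy_step(p, means, example_rate, rng)
    return p


if __name__ == "__main__":
    # Illustrative Bernoulli means for a three-arm bandit.
    print(run([0.3, 0.5, 0.7], steps=5000))
```

The learning rate here is a function of the state rather than of time, which is the feature the abstract highlights. Whether this particular rate gives the regret behaviour described in the talk is not claimed.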

Organisers: Edward Crane, Luke Turvey
