Decentralized Cooperative Multi-armed Bandits
Statistics Seminar
10th December 2021, 4:00 pm – 5:00 pm
Virtual Seminar, Zoom link: TBA
We study a decentralized cooperative multi-agent multi-armed bandit problem with $K$ arms and $N$ agents connected over a network. In our model, each arm's reward distribution is the same for every agent, and rewards are drawn independently across agents and over time steps. In each round, agents choose an arm to play and subsequently send a message to their neighbors. Existing lower bound on the average per-agent regret over the network implies that a maximum of $O(\frac{1}{N})$ reduction can be achieved in the regret when cooperating with other agents in the network over when operating in isolation. We study an information assimilation algorithm that can be combined with existing Bayesian algorithms, and using this, we propose a decentralized Thompson Sampling (TS) algorithm and decentralized Bayes-UCB algorithm. We show that decentralized TS, which achieves a per-agent regret close to the one incurred by the optimal algorithm while exchanging at most poly(K) messages with their neighbors per iteration. Our analysis establishes a problem-dependent upper bound on the average per-agent regret that scales logarithmically over the time horizon for the case of bounded rewards. For the special case of Bernoulli rewards, our upper bound asymptotically matches the lower bound. Our analysis also characterizes the average per-agent regret in terms of the network structure. We empirically show that the proposed decentralized TS algorithm provides significantly lesser per-agent regret than previously proposed algorithms. Furthermore, we show that the proposed decentralized TS can be efficiently extended to general bandit problems, where posterior distribution cannot be computed in closed form, by approximating the posterior to a computationally tractable family of distributions such as Gaussian mixture distribution. We implement our proposed decentralized TS under gossip protocol, and over time-varying networks, where each communication link has a fixed probability of failure.
Comments are closed.