Revisiting Simple Regret: Fast Rates for Returning a Good Arm

Yao Zhao, Connor James Stephens, Csaba Szepesvári, Kwang Sung Jun

Research output: Contribution to journal › Conference article › peer-review



Simple regret is a natural and parameter-free performance criterion for pure exploration in multi-armed bandits. However, possibly because it is hard to characterize, the theory community has largely overlooked it in favor of studying the probability of selecting the best arm, or an ϵ-good arm. In this paper, we make significant progress on minimizing simple regret in both the data-rich (T ≥ n) and data-poor (T ≤ n) regimes, where n is the number of arms and T is the number of samples. At its heart is our improved instance-dependent analysis of the well-known Sequential Halving (SH) algorithm, where we bound the probability of returning an arm whose mean reward is not within ϵ of the best (i.e., not ϵ-good) for any choice of ϵ > 0, even though ϵ is not an input to SH. Our bound not only leads to an optimal worst-case simple regret bound of n/T up to logarithmic factors but also essentially matches the instance-dependent lower bound for returning an ϵ-good arm reported by Katz-Samuels and Jamieson (2020). For the more challenging data-poor regime, we propose Bracketing SH (BSH), which enjoys the same improvement even without sampling each arm at least once. Our empirical study shows that BSH outperforms existing methods on real-world tasks.
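For context, the standard Sequential Halving scheme the abstract builds on can be sketched as follows. This is a minimal illustration of the generic algorithm (split the budget T across ⌈log₂ n⌉ elimination rounds, sample each surviving arm equally, keep the better half by empirical mean), not the paper's refined analysis or its BSH variant; the `pull` sampling interface is a hypothetical stand-in for the bandit environment.

```python
import math

def sequential_halving(pull, n, T):
    """Return the index of the arm Sequential Halving selects.

    pull(i) -- hypothetical interface: draws one stochastic reward for arm i.
    n       -- number of arms.
    T       -- total sampling budget.
    """
    arms = list(range(n))
    rounds = math.ceil(math.log2(n))
    for _ in range(rounds):
        if len(arms) == 1:
            break
        # Split this round's share of the budget equally among surviving arms.
        pulls_per_arm = max(1, T // (len(arms) * rounds))
        means = {
            i: sum(pull(i) for _ in range(pulls_per_arm)) / pulls_per_arm
            for i in arms
        }
        # Keep the better half by empirical mean.
        arms.sort(key=lambda i: means[i], reverse=True)
        arms = arms[: math.ceil(len(arms) / 2)]
    return arms[0]
```

With noiseless rewards (e.g., `pull = lambda i: i / 10.0` and n = 8), the empirical means are exact, so each round keeps the top half of the true means and the best arm (index 7) survives every elimination.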

Original language: English (US)
Pages (from-to): 42110-42158
Number of pages: 49
Journal: Proceedings of Machine Learning Research
State: Published - 2023
Event: 40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States
Duration: Jul 23, 2023 - Jul 29, 2023

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability


