Room 333 (fishbowl)
Multi-armed bandits capture the essence of the problem of optimally
exploring an unknown system. In control theory the tradeoff between
control and learning is referred to as “dual control”; in computer
science it is known as the problem of exploration versus exploitation.
I will cover the fundamental results of both Gittins and Jones in a Bayesian
context, and those of Lai and Robbins in a non-Bayesian context.
In the former context, a major breakthrough was the discovery that each
arm has an “index,” and that the optimal choice is simply to play the
arm with the largest index.
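To make the index concrete, the following is a minimal sketch (not from the talk) of how the Gittins index of a single Bernoulli arm with a Beta(a, b) posterior can be computed: calibrate the arm against a hypothetical “standard” arm paying a known reward lambda per round, and find the lambda at which one is indifferent between the two. The discount factor, truncation horizon, and function name are all illustrative assumptions.

```python
# Illustrative sketch: Gittins index of a Bernoulli arm with a
# Beta(a, b) posterior, discount factor beta, computed by calibrating
# against a standard arm paying lambda per round and bisecting on
# lambda. The horizon truncation is an approximation (beta**horizon
# is negligible for the defaults below).

def gittins_index(a, b, beta=0.9, horizon=100, tol=1e-4):
    def value_of_continuing(lam):
        # Retiring to the standard arm is worth lam forever, discounted.
        retire = lam / (1.0 - beta)
        # V[s][f]: optimal value after s extra successes, f extra
        # failures; at the truncation horizon we assume retirement.
        V = [[retire] * (horizon + 1) for _ in range(horizon + 1)]
        for n in range(horizon - 1, -1, -1):     # backward induction
            for s in range(n + 1):
                f = n - s
                p = (a + s) / (a + s + b + f)    # posterior mean
                cont = (p * (1.0 + beta * V[s + 1][f])
                        + (1 - p) * beta * V[s][f + 1])
                V[s][f] = max(retire, cont)
        # Positive iff playing the risky arm beats retiring at once.
        return V[0][0] - retire

    lo, hi = 0.0, 1.0    # Bernoulli rewards, so the index lies in [0, 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if value_of_continuing(mid) > 0:
            lo = mid     # standard arm too stingy: index exceeds mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Note that the index of an uncertain arm exceeds its posterior mean: the surplus is the value of the information gained by playing it.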
In the latter context, the goal is to minimize the “regret”: the
expected shortfall in reward relative to always playing the best arm,
which grows with the number of times suboptimal arms are played. Not
only has the optimal growth rate of the regret been characterized, but
simple policies that attain it have been constructed.
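As an illustration of such a simple policy (not the specific construction of Lai and Robbins), the later UCB1 rule of Auer, Cesa-Bianchi, and Fischer is an index policy whose regret grows logarithmically in the horizon, matching the order of the Lai–Robbins lower bound. The sketch below assumes Bernoulli arms and a fixed random seed for reproducibility.

```python
import math
import random

# Illustrative sketch: the UCB1 policy on Bernoulli arms. Each round,
# play the arm maximizing (empirical mean + confidence bonus); the
# bonus shrinks as an arm accumulates plays, so suboptimal arms are
# sampled only O(log t) times.

def ucb1(arm_means, horizon, seed=0):
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # plays of each arm
    sums = [0.0] * k      # total reward of each arm
    for t in range(horizon):
        if t < k:
            i = t         # play each arm once to initialize
        else:
            i = max(range(k),
                    key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2.0 * math.log(t + 1) / counts[j]))
        reward = 1.0 if rng.random() < arm_means[i] else 0.0
        counts[i] += 1
        sums[i] += reward
    return counts
```

Running, say, `ucb1([0.2, 0.8], 3000)` concentrates nearly all plays on the better arm, with the count on the worse arm growing only logarithmically in the horizon.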