A post by Conor Newton, PhD student on the Compass programme.
(Image credit: Microsoft Research)
Many real-world optimisation problems involve repeated rather than one-off decisions. A decision maker (whom we refer to as an agent) must repeatedly choose an action from a set of available options. After taking an action, the agent receives a reward that depends on the action performed. The agent can then use this feedback to inform later decisions. Some examples of such problems are:
- Choosing advertisements to display on a website each time a page is loaded to maximise click-through rate.
- Calibrating the temperature to maximise the yield from a chemical reaction.
- Distributing a budget between university departments to maximise research output.
- Choosing the best route to commute to work.
In each case there is a fundamental trade-off between exploitation and exploration. On the one hand, the agent should act in ways that exploit the knowledge they have accumulated to promote their short-term reward, whether that’s the yield of a chemical process or the click-through rate on advertisements. On the other hand, the agent should explore new actions in order to increase their understanding of their environment in ways that may translate into future rewards.
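To make the trade-off concrete, here is a minimal sketch of the advertisement example using an epsilon-greedy rule. This is just one simple strategy chosen for illustration, not necessarily the approach the post goes on to discuss. The function name `epsilon_greedy`, the click-through rates, and the values of `epsilon` and `rounds` are all assumptions made up for the example. With probability `epsilon` the agent explores by picking a random advert; otherwise it exploits by picking the advert with the highest estimated click-through rate so far.

```python
import random


def epsilon_greedy(click_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Simulate an agent choosing adverts whose true click rates are unknown to it.

    click_rates: hypothetical true click-through rate of each advert.
    """
    rng = random.Random(seed)
    n_actions = len(click_rates)
    counts = [0] * n_actions       # how often each advert has been shown
    estimates = [0.0] * n_actions  # running mean reward for each advert
    total_reward = 0

    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: show a uniformly random advert.
            action = rng.randrange(n_actions)
        else:
            # Exploit: show the advert that currently looks best.
            action = max(range(n_actions), key=lambda a: estimates[a])

        # The environment returns a reward of 1 (click) or 0 (no click).
        reward = 1 if rng.random() < click_rates[action] else 0

        # Use the feedback to update the running mean for this advert.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward

    return counts, estimates, total_reward


# Three hypothetical adverts with made-up click-through rates.
counts, estimates, total_reward = epsilon_greedy([0.02, 0.05, 0.10])
print("times shown:", counts)
print("estimated click rates:", [round(e, 3) for e in estimates])
print("total clicks:", total_reward)
```

Setting `epsilon` to 0 gives an agent that only exploits and can lock onto a poor advert early, while setting it to 1 gives an agent that only explores and never benefits from what it has learned.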