Nathan's Notes


The Explore vs. Exploit Dilemma

Multi-armed bandits and a prolonged analogy

Oct 11, 2024

I often analogize real-world problems to their ML counterparts. One of them is the exploration-exploitation problem, but the comparison is usually met with minimal recognition. I wrote this blog post as something I can refer to the next time I get a "what do you mean?"

Introduction: The Multi-Armed Bandit Problem

Suppose we have a series of decisions to make, each with the potential to yield a reward. In our multi-armed bandit problem, we aim to develop a strategy to maximize this reward over time. We envision each “arm” as a slot machine, each one hiding a different reward distribution. Our task is to identify which arm to pull at each step in time to accumulate the most reward.

If we consider $t=0$ as our starting state (knowing nothing about the reward distributions) and $t=1$ as the ideal state (complete knowledge of the best arm), then we can define a function that carries us from ignorance to optimal selection. In this framework, we can imagine a vector field guiding us from exploring new arms to exploiting the most rewarding ones.

Our state of knowledge at time $t$ can be denoted $\phi_t(x)$, representing our expected reward for each arm and updated after each trial. We can define this expected reward flow $\phi_t(x)$ as:

$$\phi_t(x) = (1 - \epsilon_t)\, Q(x) + \epsilon_t\, U(x)$$

where $\epsilon_t$ is an exploration parameter that decreases over time, $Q(x)$ is the action-value function for each arm (capturing the expected reward given our current understanding), and $U(x)$ is a measure of the uncertainty or unexplored potential of each arm.

In early steps, around $t=0$, our strategy should be primarily exploratory: $\epsilon_0 = 1$ and thus $\phi_0(x) = U(x)$. Conversely, as $t$ approaches 1, $\epsilon_t$ should approach 0, directing $\phi_t(x)$ toward the maximum-expected-value action, i.e., an exploitative policy. To represent this transition, we can set $\epsilon_t = 1 - \beta t$, where $\beta$ is a decay constant that controls how quickly we shift from exploration to exploitation.
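
As a minimal sketch of this blend (the helper names and toy numbers are purely illustrative, not from any library):

```python
import numpy as np

def epsilon(t, beta):
    """Linear exploration decay: epsilon_t = 1 - beta * t, clipped to [0, 1]."""
    return float(np.clip(1.0 - beta * t, 0.0, 1.0))

def phi(Q, U, t, beta):
    """Blend of exploitation (Q) and exploration (U) scores for each arm."""
    eps = epsilon(t, beta)
    return (1.0 - eps) * Q + eps * U

# Toy example: three arms, normalized time t in [0, 1].
Q = np.array([0.2, 0.5, 0.4])   # current action-value estimates
U = np.array([0.9, 0.1, 0.6])   # uncertainty / unexplored potential
for t in (0.0, 0.5, 1.0):
    scores = phi(Q, U, t, beta=1.0)
    print(t, scores, "-> pull arm", scores.argmax())
```

At $t=0$ the scores equal $U(x)$ and the most uncertain arm wins; by $t=1$ they equal $Q(x)$ and the highest-value arm wins.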

The derivative of this function $\phi$ defines a policy gradient, guiding our choice of action. Instead of learning the policy directly, we train a forward dynamics model $f_\theta$ to predict rewards given our current knowledge, aiming to maximize:

$$J(\theta) = \mathbb{E}\Big[\sum_{t=0}^{T} \gamma^t R_t\Big]$$

where $\gamma$ is the discount factor controlling the impact of future rewards, and $R_t$ is the observed reward at each time step $t$. This function captures our goal to maximize cumulative rewards while balancing exploration and exploitation over time.
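
For concreteness, the discounted sum inside the expectation can be computed like this (a small illustrative helper, assuming a finite sequence of observed rewards):

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * R_t for an observed reward sequence."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Three pulls rewarding 1, 0, 2 with gamma = 0.9:
# 1 + 0.9 * 0 + 0.81 * 2 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```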

The model's predictions thus guide us to iteratively pull the arms that should yield the highest cumulative reward, refining our understanding of each arm's distribution and homing in on optimal actions.

The Forward Dynamics Model

In the multi-armed bandit problem, a forward dynamics model $f_\theta$ is an auxiliary model that predicts the expected reward for each arm based on past actions and observed rewards. By approximating the environment's response to each action, $f_\theta$ helps us make informed choices, directing our strategy toward higher rewards.

1. Defining the Forward Dynamics Model

The forward dynamics model $f_\theta$ can be structured as a parameterized function with weights $\theta$, taking in a feature vector $x$ (representing each arm's current state) and outputting a predicted reward $\hat{R}$. The model aims to approximate the action-value function $Q(x)$, which maps each action (arm) to an expected reward. In essence, $f_\theta$ estimates:

$$f_\theta(x) \approx Q(x) = \mathbb{E}[R \mid x]$$

where $R$ is the actual reward received from pulling the arm represented by $x$. Training $f_\theta$ involves refining this approximation over multiple trials, improving its ability to generalize from limited samples.
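
In the simplest tabular case, such a model can be as plain as a running mean of observed rewards per arm. The sketch below is a hedged illustration (the class and method names are my own, not from any library):

```python
import numpy as np

class TabularRewardModel:
    """Simplest possible f_theta: a running mean of observed rewards per arm."""

    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms)
        self.means = np.zeros(n_arms)   # the per-arm means play the role of theta

    def predict(self, arm):
        """Predicted reward R_hat for a single arm."""
        return self.means[arm]

    def update(self, arm, reward):
        """Incremental mean update after observing a reward for this arm."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```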

2. Training Objective

To train $f_\theta$, we minimize the error between predicted rewards $\hat{R}$ and observed rewards $R$. The model is typically trained by minimizing the mean squared error (MSE) over all arms and trials:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( R_i - f_\theta(x_i) \right)^2$$

where $N$ is the number of training samples, each associated with an action $x_i$ (one of the arms) and an observed reward $R_i$. Minimizing this loss function $L(\theta)$ encourages $f_\theta$ to make accurate reward predictions, allowing the model to distinguish more promising arms from less rewarding ones.
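
As a sketch of that training loop, assuming a linear $f_\theta(x) = x^\top \theta$ (the function name, learning rate, and epoch count are arbitrary assumptions):

```python
import numpy as np

def train_reward_model(X, R, lr=0.01, epochs=500):
    """Fit a linear f_theta by gradient descent on the MSE loss L(theta).

    X: (N, d) array of arm feature vectors; R: (N,) observed rewards.
    """
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        preds = X @ theta                       # R_hat for each sample
        grad = (-2.0 / N) * X.T @ (R - preds)   # gradient of the MSE loss
        theta -= lr * grad
    return theta
```

In practice $f_\theta$ could just as well be a neural network or any other regressor; the objective stays the same.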

3. Data Collection Strategy

Training the forward dynamics model requires collecting data on the rewards observed from pulling each arm. However, the exploration-exploitation dilemma shapes how we gather this data: while $\epsilon_t$ is high, we spread pulls across arms (uniformly at random, or weighted toward high-uncertainty arms via $U(x)$), and as $\epsilon_t$ decays, we concentrate pulls on the arms $f_\theta$ currently predicts to be most rewarding, as in the sketch below.
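
A hedged sketch of such a collection loop, reusing the tabular model above (the function names and the `bandit_pull` callback are illustrative assumptions):

```python
import numpy as np

def collect_data(bandit_pull, model, n_arms, T, beta, seed=0):
    """Epsilon-greedy data collection: explore early, exploit the model later.

    bandit_pull(arm) returns an observed reward; model exposes predict()/update()
    as in the TabularRewardModel sketch above.
    """
    rng = np.random.default_rng(seed)
    history = []
    for step in range(T):
        eps = max(0.0, 1.0 - beta * (step / T))      # epsilon_t = 1 - beta * t
        if rng.random() < eps:
            arm = int(rng.integers(n_arms))          # explore: random arm
        else:
            preds = [model.predict(a) for a in range(n_arms)]
            arm = int(np.argmax(preds))              # exploit: best predicted arm
        reward = bandit_pull(arm)
        model.update(arm, reward)
        history.append((arm, reward))
    return history
```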

4. Incorporating Reward Predictions into Policy Gradients

Once trained, $f_\theta$'s reward predictions are used to inform the policy and guide our decision-making. The policy gradient objective, which maximizes cumulative rewards, can be modified to include predicted rewards:

$$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t f_\theta(x_t)\right]$$

where $\gamma$ is the discount factor for future rewards, $x_t$ is the action-state at time $t$, and $f_\theta(x_t)$ provides the reward prediction. This objective encourages the policy to pull arms with higher predicted rewards, continually adjusting based on new data from $f_\theta$.
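
One hedged way to turn these predictions into a concrete action rule (the post does not fix one, so the softmax temperature below is my own assumption) is to sample arms in proportion to a softmax of the predicted rewards:

```python
import numpy as np

def softmax_policy(predicted_rewards, temperature=1.0):
    """Turn f_theta's reward predictions into arm-pull probabilities.

    Lower temperature -> closer to greedy argmax; higher -> more exploration.
    """
    z = np.asarray(predicted_rewards, dtype=float) / temperature
    z -= z.max()                      # subtract max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

# Predictions [0.2, 0.5, 0.4] at temperature 0.1 put most of the mass on arm 1.
print(softmax_policy([0.2, 0.5, 0.4], temperature=0.1))
```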

Ultimately, the forward dynamics model $f_\theta$ enables a structured approach to decision-making by predicting rewards for each arm, balancing exploration and exploitation based on estimated reward distributions. With ongoing training, $f_\theta$ adapts to new information, ensuring the model's predictions remain accurate as we refine our understanding of each arm. This iterative learning process helps maximize cumulative rewards by aligning actions with $f_\theta$'s growing knowledge of the environment.

The Exploration-Exploitation Dilemma

The exploration-exploitation dilemma is central to multi-armed bandit problems: we need to explore enough to understand each arm's potential while also exploiting known information to maximize immediate rewards. The parameter $\beta$, which governs the rate at which we shift from exploration to exploitation, plays a key role in this balance.

Choosing $\beta$ is a careful process. If $\beta$ is too large, $\epsilon_t$ decreases rapidly, leading the model to quickly favor exploitation. In this case, we may prematurely commit to arms that appear optimal based on limited early data, potentially missing out on higher-reward options that remain unexplored. On the other hand, if $\beta$ is too small, $\epsilon_t$ remains high, and we may spend too much time exploring, failing to capitalize on accumulated knowledge to maximize rewards.

To choose an optimal $\beta$, consider the following factors:

  1. Expected Variance Across Arms: If the rewards from each arm vary significantly, a smaller $\beta$ is generally favorable because it allows more exploration, ensuring our policy gathers enough information to accurately assess the best arm. By contrast, if we expect rewards to be fairly consistent across arms, we may prefer a larger $\beta$ and shift quickly to exploitation, since exploration is less likely to yield drastically different results.

  2. Risk Tolerance: If our strategy prioritizes high-risk, high-reward outcomes, a smaller $\beta$ allows more exploration, potentially discovering arms with rare but substantial rewards. Conversely, if the strategy is risk-averse, a larger $\beta$ may be preferable, reducing the time spent on uncertain options and focusing on the most reliable rewards found early on.

  3. Adaptive $\beta$ Adjustments: In complex environments, an adaptive approach to $\beta$ can be beneficial. Here, $\beta$ might dynamically adjust based on observed reward distributions or changes in the variance of $Q(x)$. For instance, if new arms produce high variance in $Q(x)$, $\beta$ could decrease temporarily to encourage exploration; once the variance stabilizes, $\beta$ can increase to favor exploitation (see the sketch after this list).
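
A hedged sketch of one such adaptive rule, referenced in item 3 above (the variance threshold and step size are arbitrary assumptions, not something the post prescribes):

```python
import numpy as np

def adapt_beta(beta, recent_Q_estimates, var_threshold=0.05, step=0.1):
    """Nudge beta down while Q(x) estimates are noisy, up once they stabilize.

    recent_Q_estimates: recent Q(x) estimates across arms or trials.
    Returns the adjusted beta, kept within (0, 1].
    """
    variance = float(np.var(recent_Q_estimates))
    if variance > var_threshold:
        return max(beta - step, 0.01)   # high variance -> slow the decay, explore more
    return min(beta + step, 1.0)        # stable estimates -> speed up exploitation
```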

There are also added considerations, such as the total number of trials $T$ and the decay function used to set $\epsilon_t$. Depending on how critical initial exploration is, or how smoothly we want to transition between exploration and exploitation, we may choose a different decay function, such as exponential or reciprocal decay. If $T$ is large, we can afford to spend more time exploring, and thus a smaller $\beta$ is suitable. However, if $T$ is limited, we should increase $\beta$ to favor quicker convergence toward exploitation.
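
A few candidate schedules for $\epsilon_t$, sketched side by side (which one is "right" depends entirely on the problem; the linear one is the schedule used throughout this post):

```python
import numpy as np

def epsilon_linear(t, beta):
    """epsilon_t = 1 - beta * t, the schedule used above (t normalized to [0, 1])."""
    return max(0.0, 1.0 - beta * t)

def epsilon_exponential(t, beta):
    """Smooth exponential decay: epsilon_t = exp(-beta * t)."""
    return float(np.exp(-beta * t))

def epsilon_reciprocal(t, beta):
    """Slow, heavy-tailed decay: epsilon_t = 1 / (1 + beta * t)."""
    return 1.0 / (1.0 + beta * t)
```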

Overall, the goal is to maximize cumulative reward over time by balancing exploration and exploitation. The parameter $\beta$ determines this balance by controlling the rate of decay in the exploration factor $\epsilon_t$. Experimenting with different values of $\beta$ and decay functions is often necessary to find a balance that aligns with the specific requirements of the problem.

By fine-tuning $\beta$ and $\epsilon_t$, we guide the model toward the most rewarding actions while avoiding premature convergence on suboptimal options.

The Prolonged Analogy

I find that many of the people I know go through the exploration-exploitation dilemma. No one wants to be Esther Greenwood, but neither does anyone want to dive completely into some repetitive FANG lifestyle, living somebody else's dream before they've even discovered their own. How do you choose when to explore and when to exploit?

Surely, we should have an adaptive $\beta$ that depends on our surroundings. How does your $\beta$ change as all the people around you found B2B SaaS startups (no shade, simply not my kind of intellectual stimulation), go into finance, and marry their college lovers? Can I bet on this arm and go all in, or should I hedge my bets and let my forward dynamics model better model the environmental variability?

Uncertainty can be modeled as

$$U(x) = \frac{1}{K} \sum_{k=1}^{K} \left( \hat{R}_k(x) - \frac{1}{K} \sum_{j=1}^{K} \hat{R}_j(x) \right)^2$$

where $K$ is the number of ensemble models and $\hat{R}_k(x)$ is the reward prediction from forward dynamics model $k$. We unfortunately do not have the benefit of an ensemble, but we can utilize people who are close to us (and hopefully similar in thought process) to provide a proxy for uncertainty.
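
In code, that ensemble-variance estimate of uncertainty is just the population variance of the $K$ predictions (a toy sketch, with the three numbers standing in for the friends-as-proxy idea):

```python
import numpy as np

def ensemble_uncertainty(predictions):
    """Variance of K reward predictions for one arm, used as U(x).

    predictions: array-like of shape (K,), one R_hat_k(x) per ensemble member.
    """
    predictions = np.asarray(predictions, dtype=float)
    return float(np.var(predictions))   # mean squared deviation from the ensemble mean

# Three "models" (or friends) predicting the reward of the same arm:
print(ensemble_uncertainty([0.7, 0.4, 0.9]))
```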

I have done a lot of exploring throughout high school: studying many different scientific subjects for various olympiad competitions, working in various research groups, and getting a general sense of what the final $t=1$ exploitative policy would look like for many possible arms, whether SWE, quant, or academia. Additionally, I feel that exploration is part of my reward function. An entropy parameter.

So, culminating everything, my dream would be industry research or founding a successful research-heavy startup. Not only do you get a sizeable salary and the prestige to move into any other field you're interested in, but you also keep growing your personal capital through learning and exploration. I also have less uncertainty than most, given that most people I know believe I can do industry research or tackle difficult deeptech problems within a startup. Above all, I am building toward a relatively high risk tolerance: I am okay if a startup fails (it won't). Regardless, I have a pretty long-range $T$.

I was going to make up a decay function for my $\epsilon_t$, but the truth is, it's probably something like

$$\epsilon_t = \text{Intuition}(t, \beta)$$

Thanks for reading!