## Nathan's Notes

# The Explore vs. Exploit Dilemma

#### Multi-armed bandits and a prolonged analogy

*Oct 30, 2024*

I often analogize real-world problems to their ML counterparts. One of my favorites is the exploration-exploitation problem, but it’s often been met with minimal recognition. I wrote this post as something I can refer to the next time I get a “what do you mean?”

## Introduction: The Multi-Armed Bandit Problem

Suppose we have a series of decisions to make, each with the potential to yield a reward. In our multi-armed bandit problem, we aim to develop a strategy to maximize this reward over time. We envision each “arm” as a slot machine, each one hiding a different reward distribution. Our task is to identify which arm to pull at each step in time to accumulate the most reward.

If we consider $t=0$ as our starting state—knowing nothing about the reward distributions—and $t=1$ as the ideal state—where we have complete knowledge of the best arm—then we can define a trajectory from ignorance to optimal selection. In this framework, we can imagine a vector field guiding us from exploring new arms to exploiting the most rewarding ones.

Our state of knowledge at time $t$, representing our expected reward for each arm and updated after each trial, can be captured by an expected reward flow $\phi_t(x)$, defined as:

$\phi_t(x) = (1 - \epsilon_t) \, Q(x) + \epsilon_t \, U(x)$

where $\epsilon_t$ is an exploration parameter that decreases over time, $Q(x)$ is the action-value function for each arm (capturing the expected reward given our current understanding), and $U(x)$ is a measure of the uncertainty or unexplored potential for each arm.

In the early steps ($t = 0$), our strategy should be primarily **exploratory**: $\epsilon_0 = 1$ and thus $\phi_0(x) \approx U(x)$. Conversely, as $t$ approaches 1, $\epsilon_t$ should approach 0, directing $\phi_t(x)$ toward the maximum-expected-value action, or **exploitative** policy. To represent this transition, we can set $\epsilon_t = 1 - \beta t$, where $\beta$ is a decay constant that controls how quickly we shift from exploration to exploitation.
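A minimal sketch of this $\epsilon$-greedy mixing with linear decay, assuming `q_values` holds our current per-arm estimates of $Q(x)$ (the function and variable names here are my own, purely illustrative):

```python
import random

def epsilon(t, beta):
    """Linearly decaying exploration rate eps_t = 1 - beta * t, clipped to [0, 1]."""
    return max(0.0, min(1.0, 1.0 - beta * t))

def select_arm(q_values, t, beta, rng=random.Random(0)):
    """Epsilon-greedy: with probability eps_t pull a random arm (explore),
    otherwise pull the arm with the highest estimated value (exploit)."""
    if rng.random() < epsilon(t, beta):
        return rng.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit
```

Early on ($\epsilon_t \approx 1$) almost every pull is random; as $t$ grows, pulls concentrate on the arm with the highest estimate.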

The derivative of this function $\phi$ defines a *policy gradient*, guiding our choice of action. Instead of learning the policy directly, we train **a forward dynamics model** $f_\theta$ to predict rewards given our current knowledge, aiming to maximize:

$J(\theta) = \mathbb{E} \Big[ \sum_{t=0}^{T} \gamma^t R_t \Big]$

where $\gamma$ is the discount factor controlling the impact of future rewards, and $R_t$ is the observed reward at each time step $t$. This function captures our goal to maximize cumulative rewards while balancing exploration and exploitation over time.
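As a quick worked example, the inner discounted sum $\sum_{t=0}^{T} \gamma^t R_t$ is easy to compute directly (the helper name is my own):

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * R_t over an observed reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For rewards `[1, 1, 1]` and $\gamma = 0.5$, this gives $1 + 0.5 + 0.25 = 1.75$: later rewards count for less.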

The model’s predictions thus guide us to iteratively pull arms that will yield the highest cumulative rewards, refining our understanding of each arm’s distribution and homing in on optimal actions.

## The Forward Dynamics Model

In the multi-armed bandit problem, a forward dynamics model $f_\theta$ is an auxiliary model that predicts the expected reward for each arm based on past actions and observed rewards. By approximating the environment’s response to each action, $f_\theta$ helps us make informed choices, directing our strategy toward higher rewards.

#### 1. **Defining the Forward Dynamics Model**

The forward dynamics model $f_\theta$ can be structured as a parameterized function with weights $\theta$, taking in a feature vector $x$ (representing each arm’s current state) and outputting a predicted reward $\hat{R}$. The model aims to approximate the action-value function $Q(x)$, which maps each action (arm) to an expected reward. In essence, $f_\theta$ estimates:

$f_\theta(x) \approx Q(x) = \mathbb{E}[R | x]$

where $R$ is the actual reward received from pulling the arm represented by $x$. Training $f_\theta$ involves refining this approximation over multiple trials, improving its ability to generalize from limited samples.

#### 2. **Training Objective**

To train $f_\theta$, we minimize the error between predicted rewards $\hat{R}$ and observed rewards $R$. The model is typically trained by minimizing the mean squared error (MSE) over all arms and trials:

$L(\theta) = \frac{1}{N} \sum_{i=1}^N \left( R_i - f_\theta(x_i) \right)^2$

where $N$ is the number of training samples, each associated with an action $x_i$ (one of the arms) and an observed reward $R_i$. Minimizing this loss function $L(\theta)$ encourages $f_\theta$ to make accurate reward predictions, allowing the model to distinguish more promising arms from less rewarding ones.
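If we simplify $f_\theta$ down to one parameter per arm (a deliberate toy parameterization on my part, not the only way to structure the model), SGD on this MSE loss reduces to a simple error-correction update:

```python
def train_f_theta(samples, n_arms, lr=0.1, epochs=50):
    """Fit per-arm predictions theta[a] by SGD on L = mean((R_i - theta[a_i])**2).
    samples is a list of (arm_index, observed_reward) pairs."""
    theta = [0.0] * n_arms
    for _ in range(epochs):
        for arm, reward in samples:
            error = reward - theta[arm]   # prediction error on this sample
            theta[arm] += lr * error      # gradient step (factor of 2 folded into lr)
    return theta
```

With enough updates, each $\theta_a$ converges to the empirical mean reward of arm $a$, which is exactly the $Q(x)$ estimate this section describes.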

#### 3. **Data Collection Strategy**

Training the forward dynamics model requires collecting data on observed rewards from pulling each arm. However, the exploration-exploitation dilemma influences how we gather this data:

- **Exploration Phase**: During exploration, the model pulls different arms at random to collect a diverse dataset of reward outcomes. This phase ensures that $f_\theta$ samples from a wide range of potential rewards, even from suboptimal arms, capturing enough data to model the environment’s variability.
- **Exploitation Phase**: As the model shifts toward exploitation, it begins to pull the arms predicted to yield higher rewards based on $f_\theta$’s predictions. During this phase, $f_\theta$ refines its understanding of the best arms and focuses on modeling the nuances in expected rewards for these higher-value actions.
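The two phases might be sketched as follows, assuming Gaussian rewards around hypothetical true means (the reward model and all names are illustrative assumptions, not part of the bandit formalism itself):

```python
import random

def collect_data(true_means, n_explore, n_exploit, rng=None):
    """Two-phase data collection for the bandit.
    Phase 1 pulls arms uniformly at random; phase 2 greedily pulls the arm
    with the best empirical mean reward so far."""
    rng = rng or random.Random(0)
    data = []  # (arm, reward) pairs

    def pull(arm):
        reward = rng.gauss(true_means[arm], 0.1)  # assumed Gaussian reward
        data.append((arm, reward))

    for _ in range(n_explore):                 # exploration phase: random arms
        pull(rng.randrange(len(true_means)))

    for _ in range(n_exploit):                 # exploitation phase: greedy arm
        sums, counts = {}, {}
        for arm, reward in data:
            sums[arm] = sums.get(arm, 0.0) + reward
            counts[arm] = counts.get(arm, 0) + 1
        best = max(sums, key=lambda a: sums[a] / counts[a])
        pull(best)

    return data
```

In practice the phases blend continuously via $\epsilon_t$ rather than switching abruptly; the hard split here just makes the two regimes easy to see.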

#### 4. **Incorporating Reward Predictions into Policy Gradients**

Once trained, $f_\theta$’s reward predictions are used to inform the policy and guide our decision-making. The policy gradient objective, which maximizes cumulative rewards, can be modified to include predicted rewards:

$J(\theta) = \mathbb{E} \left[ \sum_{t=0}^{T} \gamma^t f_\theta(x_t) \right]$

where $\gamma$ is the discount factor for future rewards, $x_t$ is the action-state at time $t$, and $f_\theta(x_t)$ provides the reward prediction. This objective encourages the model to pull arms with higher predicted rewards, continually adjusting based on new data from $f_\theta$.

Ultimately, the forward dynamics model $f_\theta$ enables a structured approach to decision-making by predicting rewards for each arm, balancing exploration and exploitation based on estimated reward distributions. With ongoing training, $f_\theta$ adapts to new information, ensuring the model’s predictions remain accurate as we refine our understanding of each arm. This iterative learning process helps maximize cumulative rewards by aligning actions with $f_\theta$’s growing knowledge of the environment.

## The Exploration-Exploitation Dilemma

The exploration-exploitation dilemma is central to multi-armed bandit problems: we need to *explore* enough to understand each arm’s potential while also *exploiting* known information to maximize immediate rewards. The parameter $\beta$, which governs the rate at which we shift from exploration to exploitation, plays a key role in this balance.

Choosing $\beta$ is a careful process. If $\beta$ is too large, $\epsilon_t$ decreases rapidly, leading the model to quickly favor exploitation. In this case, we may prematurely commit to arms that appear optimal based on limited early data, potentially missing out on higher-reward options that remain unexplored. On the other hand, if $\beta$ is too small, $\epsilon_t$ remains high, and we may spend too much time exploring, failing to capitalize on accumulated knowledge to maximize rewards.

To choose an optimal $\beta$, consider the following factors:

- **Expected Variance Across Arms**: If the rewards from each arm vary significantly, a smaller $\beta$ is generally favorable because it allows more exploration, ensuring that our policy gathers enough information to accurately assess the best arm. Conversely, if we expect rewards to be fairly consistent across arms, we may prefer a larger $\beta$, quickly shifting to exploitation since exploration is less likely to yield drastically different results.
- **Risk Tolerance**: If our strategy prioritizes high-risk, high-reward outcomes, a smaller $\beta$ allows more exploration, potentially discovering arms with rare but substantial rewards. Conversely, if the strategy is risk-averse, a larger $\beta$ may be preferable, reducing the time spent on uncertain options and focusing on the most reliable rewards found early on.
- **Adaptive $\beta$ Adjustments**: In complex environments, an adaptive approach to $\beta$ can be beneficial. Here, $\beta$ might dynamically adjust based on observed reward distributions or changes in the variance of $Q(x)$. For instance, if new arms produce high variance in $Q(x)$, $\beta$ could decrease temporarily to encourage exploration; once variance stabilizes, $\beta$ can increase to favor exploitation.

There are also added considerations, such as the total number of trials $T$ and the decay function used to set $\epsilon_t$. Depending on how critical initial exploration is, or how smoothly one wants to transition between exploration and exploitation, we may choose a different decay function, such as exponential or reciprocal. If $T$ is large, we can afford to spend more time exploring, and thus a smaller $\beta$ is suitable. However, if $T$ is limited, we should increase $\beta$ to favor quicker convergence toward exploitation.
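The linear, exponential, and reciprocal decay schedules might look like this (the exact parameterizations are my own choices; all three start at $\epsilon_0 = 1$ but decay differently):

```python
import math

def eps_linear(t, beta):
    """Linear decay: hits zero at t = 1 / beta."""
    return max(0.0, 1.0 - beta * t)

def eps_exponential(t, beta):
    """Exponential decay: smooth, never exactly zero."""
    return math.exp(-beta * t)

def eps_reciprocal(t, beta):
    """Reciprocal decay: fast early drop, long slow tail."""
    return 1.0 / (1.0 + beta * t)
```

Linear decay forces a hard cutover to pure exploitation, while the exponential and reciprocal schedules keep a small amount of exploration alive forever.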

Overall, the goal is to maximize cumulative reward over time by balancing exploration and exploitation. The parameter $\beta$ determines this balance by controlling the rate of decay in the exploration factor $\epsilon_t$. Experimenting with different values of $\beta$ and decay functions is often necessary to find a balance that aligns with the specific requirements of the problem.

By fine-tuning $\beta$ and $\epsilon_t$, we guide the model toward the most rewarding actions while avoiding premature convergence on suboptimal options.

## The Prolonged Analogy

I find that many people I know go through the exploration-exploitation dilemma. No one wants to be Esther Greenwood, but neither does anyone want to dive completely into some repetitive FANG lifestyle, living somebody else’s dream before they’ve even discovered their own. How do you choose when to explore and when to exploit?

Surely, we should have an adaptive $\beta$ that depends on our surroundings. How does your $\beta$ change as the people around you found B2B SaaS startups (no shade, simply not my kind of intellectual stimulation), go into finance, and marry their college lovers? Can I bet on this arm and go all in, or should I hedge my bets and let my forward dynamics model better model the environmental variability?

Uncertainty can be modeled as

$\text{U}(x) = \frac{1}{K} \sum_{k=1}^K \left( \hat{R}_k(x) - \frac{1}{K} \sum_{j=1}^K \hat{R}_j(x) \right)^2$

where $K$ is the number of ensemble models and $\hat{R}_k(x)$ is the reward prediction from forward dynamics model $k$. We unfortunately do not have the benefit of an ensemble, but we can utilize people who are close to us (and hopefully similar in thought process) to provide a proxy for uncertainty.
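For completeness, the ensemble variance $U(x)$ above reduces to a few lines (the function name is my own, and it assumes we actually had $K$ models):

```python
def uncertainty(predictions):
    """U(x): population variance of the K ensemble predictions R_hat_k(x)."""
    k = len(predictions)
    mean = sum(predictions) / k
    return sum((p - mean) ** 2 for p in predictions) / k
```

High variance means the ensemble disagrees, flagging that arm as worth exploring; zero variance means everyone agrees and there is little left to learn.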

I have done a lot of exploring all throughout high school, studying many different scientific subjects for various olympiad competitions, working in various research groups, and having a general exposure to what the final $t=1$ exploitative policy would look like for many possible arms, whether SWE, quant, or academia. Additionally, I feel that exploration is a part of my reward function. An entropy parameter.

So, bringing everything together, my dream would be industry research or founding a successful research-heavy startup. Not only does that path come with a sizeable salary and the prestige to move into any other field I’m interested in, but it also continually grows my personal capital through exploration. I also have less uncertainty than most, given that most people I know believe I can do industry research or tackle difficult deeptech problems within a startup. Above all, I am building towards a relatively high risk tolerance — I am okay if a startup fails (it won’t). Regardless, I have a pretty long-range $T$, and life offers a relatively high expected variance across its arms. So I won’t pigeonhole myself into one future just yet — life offers a lot to learn, and to some extent, I suppose I am exploiting its possibilities to explore.

I was going to make up an extremely parameterized decay function for my $\epsilon_t$, but the truth is, it’s probably something like

$\epsilon_t = \text{Intuition}(t, \beta)$

Thanks for reading!