Realigning Benchmark Incentives: Sub-Instance Averaging and New Difficulty Ranges

Introduction

Typically, the quality of a solution is measured by its deviation from the optimal solution. In situations where the optimal solution is unknown, the baseline algorithm establishes a reference value, ideally set at a fixed percentage (e.g., 20%) below the optimal. By measuring the improvement of a Benchmarker’s solution relative to this baseline, we can gauge its proximity to the optimal solution and, hence, its overall quality. It is therefore crucial that our baseline algorithms remain stable, exhibiting minimal variance in their deviation from the optimal across all instances. Achieving such consistency is challenging due to the inherent variability among instances and how well each instance happens to suit the baseline algorithm.

When the baseline’s gap to the optimum varies widely across instances, picking a high better_than_baseline can be profitable even for greedy algorithms that simply get lucky and find ‘easy’ instances on which the baseline algorithm performed poorly. The goal of this note is to realign incentives by

  1. Sub‑instance averaging: bundle M i.i.d. sub‑instances into each instance and judge the root‑mean‑square (RMS) improvement.
  2. Updated difficulty ranges: choose the range of better_than_baseline, i.e., the maximum and minimum difficulty, so that a large fraction of instances remains solvable.

Theory

In this section, we consider Benchmarkers solving a TIG challenge with difficulty [N, \beta], where

\beta = \frac{\texttt{better_than_baseline}}{1000}.

In the protocol, Benchmarkers may increment the better_than_baseline parameter in integer steps.

Consider a random instance i of a challenge. Let

  • B_i be the objective value of the solution found by the baseline algorithm,
  • J_i be the optimal objective value, obtained via an exact algorithm, and
  • Y_i be the objective value of the solution provided by the Benchmarker.

Recall that for the Knapsack challenge (QKP), a maximization problem, a Benchmarker’s solution qualifies if

Y_i \geq B_i\left(1+\beta\right),~~~\text{i.e.,}~~ \frac{Y_i}{B_i}\geq 1+\beta.

Thus, for a given instance i and a specified \beta, the probability that a solution exists is

\mathbb{P}(\text{Solution Exists}_\beta)= \mathbb{P}\left(\beta\leq \frac{J_i}{B_i}-1\right).

Similarly, for the Vehicle Routing problem, which is a minimization problem, a Benchmarker’s solution qualifies if

\frac{Y_i}{B_i} \leq 1 - \beta,~~~\text{i.e.,}~~ \beta \leq 1-\frac{Y_i}{B_i}.

Hence, the probability that a solution exists for a given instance i is

\mathbb{P}(\text{Solution Exists}_\beta)= \mathbb{P}\left(\beta \leq 1-\frac{J_i}{B_i}\right).
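For concreteness, the two qualification checks can be written as in the following minimal Python sketch; the function names and the use of raw objective values are illustrative assumptions, not protocol code.

```python
def qualifies_knapsack(y_i: float, b_i: float, beta: float) -> bool:
    """Maximization (QKP): the Benchmarker's value must beat the
    baseline value by a factor of at least (1 + beta)."""
    return y_i >= b_i * (1.0 + beta)


def qualifies_vehicle_routing(y_i: float, b_i: float, beta: float) -> bool:
    """Minimization (VRP): the Benchmarker's cost must undercut the
    baseline cost by a factor of at least (1 - beta)."""
    return y_i <= b_i * (1.0 - beta)


# A solution exists for instance i exactly when the optimum J_i itself
# qualifies, e.g. qualifies_vehicle_routing(j_i, b_i, beta) for VRP.
```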

In an ideal scenario, we would employ a baseline solver with zero variance, ensuring that the distance to the optimal solution remains fixed (e.g., 20%) for every instance (solid black in Figure 1). In such a case, if a Benchmarker selects a \beta greater than 0.2, no instances would be solvable. In practice, however, the ratio of our baseline solver’s value to the optimum exhibits some variance. For example, if a Benchmarker increases \beta to 0.25, approximately 3.3% of instances remain solvable (grey dotted line in Figure 1), and currently the larger rewards tied to the higher \beta more than offset the reduced reliability that comes from the lower chance of encountering a solvable instance.
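Given a sample of baseline-to-optimum ratios \frac{B_i}{J_i} from exactly solved instances, this solvable fraction can be estimated as in the sketch below (an illustration of the formula above, not the code behind Figure 1).

```python
import numpy as np


def solvable_fraction(b_over_j: np.ndarray, beta: float) -> float:
    """Estimated fraction of (minimization) instances for which a
    qualifying solution exists, i.e. beta <= 1 - J_i/B_i."""
    return float(np.mean(beta <= 1.0 - 1.0 / b_over_j))


# e.g., with sampled ratios B_i/J_i for VRP at N=200:
# solvable_fraction(ratios, 0.25)  # ~0.033, per Figure 1
```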


Figure 1: Probability of an instance being solvable with current variance in the baseline solver (vehicle routing, N=200)

Variance Reduction via Sub-Instance Averaging

To address this variability, we propose the introduction of sub-instances. Under this approach, each benchmark instance comprises M i.i.d. sub-instances, each generated with the same parameters as the original instance, thereby redefining an instance as a collection of M sub-instances. For an instance to count as solved, the aggregate improvement (better_than_baseline) across its sub-instances, measured by the RMS of the solution ratios, must meet the chosen difficulty.

Old Approach without Averaging
Consider a minimization problem (such as vehicle routing) with a fixed challenge size. Suppose a Benchmarker selects a difficulty parameter, \beta, and for each challenge instance, i, the condition

\beta\leq \frac{B_i - Y_i}{B_i} \implies \frac{Y_i}{B_i}\leq 1-\beta

must be satisfied. If we consider the ratio \frac{Y_i}{J_i} - the indication of the performance of the benchmarker’s solution - this implies that

\frac{Y_i}{J_i}=\frac{Y_i}{B_i}\frac{B_i}{J_i} \leq (1-\beta) \frac{B_i}{J_i} \quad (1)

Thus, the ratio \frac{Y_i}{J_i} remains small when 1-\beta is small, but the reliability of this bound hinges on the variance of \frac{B_i}{J_i}, i.e., \text{Var}\Big[\frac{B_i}{J_i}\Big].

New Approach with Averaging
First, consider the following definitions:

\mathbf{\frac{Y}{B}} = \begin{pmatrix} \frac{Y_1}{B_1}\\ \vdots \\ \frac{Y_M}{B_M} \end{pmatrix},\quad \left(\mathbf{\frac{Y}{B}}\right)_{RMS} = \sqrt{ \frac{1}{M}\sum_{i=1}^M \Big(\frac{Y_i}{B_i}\Big)^2}

where \mathbf{\frac{Y}{B}}, representing an instance, is a vector of length M with one entry per sub-instance.

Now, let us require that for an instance composed of M sub-instances, the following condition holds:

\left(\mathbf{\frac{Y}{B}}\right)_{RMS} = \sqrt{ \frac{1}{M}\sum_{i=1}^M \Big(\frac{Y_i}{B_i}\Big)^2} \leq 1-\beta. \quad (2)
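Concretely, the per-instance check under the new scheme might look like the following sketch (assuming the M sub-instance values are available as arrays; a minimization problem as above).

```python
import numpy as np


def instance_qualifies_rms(y: np.ndarray, b: np.ndarray, beta: float) -> bool:
    """New scheme: an instance of M sub-instances qualifies when the RMS
    of the ratios Y_i/B_i is at most 1 - beta (equation (2))."""
    ratios = y / b
    rms = np.sqrt(np.mean(ratios ** 2))
    return bool(rms <= 1.0 - beta)
```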

This allows us to bound the average performance of the Benchmarker’s solutions relative to the optimum, \frac{1}{M}\sum_{i=1}^M \frac{Y_i}{J_i}:

\begin{aligned}
\frac{1}{M}\sum_{i=1}^M \frac{Y_i}{J_i} &= \frac{1}{M}\sum_{i=1}^M \frac{Y_i}{B_i}\frac{B_i}{J_i}\\
&\leq \frac{1}{M}\sqrt{\sum_{i=1}^M\left(\frac{Y_i}{B_i}\right)^2}\sqrt{\sum_{i=1}^M\left(\frac{B_i}{J_i}\right)^2} &&\text{(by Cauchy–Schwarz)}\\
&\leq (1-\beta) \sqrt{\frac{1}{M}\sum_{i=1}^M\left(\frac{B_i}{J_i}\right)^2} &&\text{(by equation (2))}\\
&= (1-\beta) \left(\mathbf{\frac{B}{J}}\right)_{RMS}. \quad (3)
\end{aligned}

The right-hand side of equation (3) is considerably more stable than that of equation (1), since

\text{Var}\left[\left(\mathbf{\frac{B}{J}}\right)_{RMS}\right] \ll\text{Var}\left[\frac{B_i}{J_i}\right].
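A quick Monte Carlo experiment makes this gap explicit. The lognormal noise model below is purely illustrative; the true shape of the \frac{B_i}{J_i} distribution is determined by the baseline solver.

```python
import numpy as np

rng = np.random.default_rng(0)
M, TRIALS = 16, 100_000  # M sub-instances per instance (illustrative)

# Illustrative stand-in for per-sub-instance ratios B_i/J_i >= 1.
samples = 1.0 + rng.lognormal(mean=-2.0, sigma=0.5, size=(TRIALS, M))

single = samples[:, 0]                        # one ratio per instance (old)
rms = np.sqrt(np.mean(samples**2, axis=1))    # RMS over M sub-instances (new)

print(f"Var[B_i/J_i]   = {single.var():.2e}")
print(f"Var[(B/J)_RMS] = {rms.var():.2e}")  # roughly Var[B_i/J_i] / M
```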

This enhanced stability is illustrated in the following plot (Figure 2):


Figure 2: Probability of an instance being solvable with reduced variance by introduction of sub-instances shown in green (vehicle routing, N=200)

Introducing a Maximum Difficulty

Even with the improved variance from averaging, Benchmarkers might still select a better_than_baseline value that once again puts them in a regime with a low probability of solvable instances. To mitigate this, we propose imposing a maximum difficulty cap.

The cap is chosen so that a desired fraction of instances, e.g., 95%, remains solvable. For Vehicle Routing with N=200, this translates to a maximum difficulty of \beta_{max} = 0.188.
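In other words, \beta_{max} is an empirical quantile of the achievable improvement 1-\frac{J_i}{B_i} (or of its RMS analogue under averaging). A minimal sketch, assuming a sample of ratios is available:

```python
import numpy as np


def max_difficulty_cap(b_over_j: np.ndarray, solvable_target: float = 0.95) -> float:
    """Largest beta such that `solvable_target` of instances remain
    solvable: the (1 - solvable_target) quantile of 1 - J_i/B_i."""
    achievable = 1.0 - 1.0 / b_over_j
    return float(np.quantile(achievable, 1.0 - solvable_target))


# For VRP at N=200, our sampled ratios give max_difficulty_cap(ratios) ~ 0.188.
```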

The following plot illustrates the impact of the cap on both the original and the new averaging-based approaches. Compared to merely applying a cap within the original setup, the sub-instance approach combined with the cap also reduces the variance of each instance’s gap to the optimal solution, in addition to ensuring that a given fraction of instances is solvable. This encourages innovation closer to the optimal solution.


Figure 3: Plot of the solvability of an instance with a maximum difficulty cap set so that 95% of instances are solvable (vehicle routing, N=200)

Potential Downsides

Introducing a maximum difficulty means that some instances or sub-instances will no longer be solved optimally, since the cap limits the reward obtainable for the best achievable solution in some cases. However, as the protocol matures and the algorithms developed within TIG consistently approach or surpass current state-of-the-art (SOTA) performance, there will be opportunities to reassess and raise the cap.

New Range for Difficulty Parameters

Vehicle Routing

The new choice for the difficulty range of better_than_baseline for Vehicle Routing is therefore

\texttt{better_than_baseline} \in [15,200]

where \texttt{better_than_baseline} = \frac{B_i-Y_i}{B_i}\times1000 (Vehicle Routing is a minimization problem, so an improvement corresponds to Y_i < B_i).
The min difficulty for num_nodes is set to 100.

Knapsack

For Knapsack, the performance of the baseline solver is such that the new difficulty range of better_than_baseline would be [1,8]. Currently, each unit jump in better_than_baseline represents a 0.1% improvement over the baseline, so this range would be too coarse and lead to abrupt changes in difficulty. To provide a smoother progression, we now let one unit correspond to a 0.01% improvement, matching the granularity adopted in recent work on quadratic knapsack problems (e.g., Fennich et al. [1]).
(This change is only for the better_than_baseline difficulty parameter for the knapsack challenge.)

The revised choice for the interval of better_than_baseline for Knapsack is therefore

\texttt{better_than_baseline} \in [10,80]

with

\texttt{better_than_baseline} = \frac{Y_i-B_i}{B_i}\times10000.

The min difficulty for num_items is set to 100.
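For clarity, the two granularities map to \beta as in the following sketch (function names are illustrative).

```python
def beta_vehicle_routing(better_than_baseline: int) -> float:
    """Vehicle Routing: one unit = 0.1%, valid range [15, 200]."""
    assert 15 <= better_than_baseline <= 200
    return better_than_baseline / 1000.0


def beta_knapsack(better_than_baseline: int) -> float:
    """Knapsack: one unit = 0.01% (new granularity), valid range [10, 80]."""
    assert 10 <= better_than_baseline <= 80
    return better_than_baseline / 10000.0


# e.g. beta_knapsack(80) == 0.008, i.e. a 0.8% improvement over baseline,
# the same top end as the old [1, 8] range at 0.1% per unit.
```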


  1. Fennich, E., Djeumou Fomeni, F., & Coelho, L. (2024). A novel dynamic programming heuristic for the quadratic knapsack problem. European Journal of Operational Research, 319. https://doi.org/10.1016/j.ejor.2024.06.034
