Challenge description and formulation
To focus innovation on the optimiser component of the training algorithm, the Neural Network Gradient Descent challenge has a different overall structure from current TIG challenges in two key aspects:
- The optimiser algorithm submission does not output a ‘solution’ directly; instead, it is embedded inside a ‘parent’ algorithm, the training loop, which uses the optimiser iteratively to compute the solution.
- The optimiser algorithm is run many times for a single challenge instance, with its inputs and outputs constrained by the structure of the training loop and determined by intermediate training states.
High-level structure
A challenge instance is defined by three main components:
- A dataset \mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N that is randomly generated by adding white noise \xi_i \sim \mathcal{N}(0,\sigma_{\text{data}}^2) to a random smooth function f: \mathbb{R}^D \to \mathbb{R} drawn from a Gaussian process (GP) with some chosen kernel function k(\mathbf{x}, \mathbf{x}')
y_i = f(\mathbf{x}_i) + \xi_i \quad \text{with} \quad f(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot)),
evaluated at uniform random locations in the unit hypercube \mathbf{x}_i \in [-1, 1]^D. The dataset \mathcal{D} is furthermore split into train \mathcal{D}_{\text{train}}, validation \mathcal{D}_{\text{val}} and test \mathcal{D}_{\text{test}} sets of sizes N_{\text{train}}, N_{\text{val}} and N_{\text{test}} (a minimal sketch of this generation step follows this list).
- A standard MLP architecture \hat{f}_{\mathbf{w}}: \mathbb{R}^D \to \mathbb{R} with randomly initialized parameters \mathbf{w}, whose hidden layers all share the same specified width and use ReLU-BatchNorm activation functions.
- A mean squared error (MSE) loss function between some set of targets \{y_i\} and MLP outputs \{\hat{f}_{\mathbf{w}}(\mathbf{x}_i)\}
\mathcal{L}(\mathbf{w}; \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \| y_i - \hat{f}_\mathbf{w} (\mathbf{x}_i) \|^2,
which is used during train, validation and test evaluations.
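As a concrete illustration of the first component, the sketch below draws a toy dataset from a GP sample at uniform random inputs and adds observation noise. The RBF kernel, lengthscale, noise level and split fractions are assumptions made purely for illustration; the authoritative generation procedure is the one in Section 3 of the detailed write-up attached.

```python
# Minimal sketch of the dataset component of instance generation (illustrative
# kernel, lengthscale, noise level and split fractions; see Section 3 of the
# detailed write-up for the actual procedure).
import numpy as np

def make_dataset(seed: int, D: int = 4, N: int = 1024,
                 lengthscale: float = 0.5, sigma_data: float = 0.1):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(N, D))                     # x_i ~ U([-1, 1]^D)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise |x - x'|^2
    K = np.exp(-0.5 * sq_dists / lengthscale ** 2)              # RBF kernel k(x, x')
    L = np.linalg.cholesky(K + 1e-8 * np.eye(N))                # jitter for stability
    f = L @ rng.standard_normal(N)                              # f(x_i) with f ~ GP(0, k)
    y = f + sigma_data * rng.standard_normal(N)                 # y_i = f(x_i) + xi_i
    idx = rng.permutation(N)                                    # random train/val/test split
    n_tr, n_val = int(0.7 * N), int(0.15 * N)
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```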
Details of the random instance generation are given in Section 3 of the detailed write-up attached. The train set is then divided into B batches \mathcal{D}_{\text{train}} \to \{\mathcal{D}_{\text{batch},b}\}_{b=1}^B of size N_{\text{batch}} that are fed into the training loop, where each loop iteration (epoch) consists of the following steps (a code sketch follows the list):
- Sampling batches in random order, where for every batch:
- Compute the parameter location \tilde{\mathbf{w}} (which can differ from the current \mathbf{w}, as in Nesterov momentum) at which we evaluate the training loss gradients
- Compute the regression loss \mathcal{L} and its gradients \mathbf{g} = \nabla_\mathbf{w} \mathcal{L}(\tilde{\mathbf{w}}; \mathcal{D}_{\text{batch}}) using a forward-and-backward pass through the MLP
- Run one optimiser step to transform \mathbf{g} into parameter updates \mathbf{u}
- Apply updates \mathbf{w} \to \mathbf{w} + \mathbf{u}
Note that multiple gradient steps are applied on different subsets (batches) of \mathcal{D}_{\text{train}} per epoch, hence the term ‘stochastic’ gradient descent.
- Evaluating the validation loss with an MLP forward pass \mathcal{L}(\mathbf{w}; \mathcal{D}_{\text{val}}).
- Repeating the above steps for multiple epochs until either the maximum number of epochs has been reached or the validation loss has not improved for some chosen number of “patience” epochs, which is a standard early stopping criterion.
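For concreteness, here is a minimal sketch of this loop, assuming a PyTorch MLP and an optimiser exposed as a single step function. The names `optimiser_step`, `optimiser_state` and `make_batches` are illustrative placeholders rather than the actual submission interface, and the evaluation point \tilde{\mathbf{w}} is taken equal to \mathbf{w} for brevity; Algorithm 1 of the detailed write-up attached is the authoritative specification.

```python
# Sketch of the standard training loop: stochastic gradient steps over batches,
# a validation forward pass per epoch, and early stopping on patience.
import torch

def train(model, optimiser_step, optimiser_state, make_batches,
          X_val, y_val, max_epochs=200, patience=10):
    best_val, epochs_since_best = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for X_b, y_b in make_batches():                             # batches in random order
            loss = torch.mean((model(X_b).squeeze(-1) - y_b) ** 2)  # batch MSE loss
            model.zero_grad()
            loss.backward()                                         # g = grad_w L(w; D_batch)
            with torch.no_grad():
                for p in model.parameters():
                    if p.grad is None:                              # frozen layers: no update
                        continue
                    u = optimiser_step(p.grad, p, optimiser_state)  # transform g into u
                    p.add_(u)                                       # apply w <- w + u
        model.eval()
        with torch.no_grad():                                       # validation forward pass
            val = torch.mean((model(X_val).squeeze(-1) - y_val) ** 2).item()
        if val < best_val:
            best_val, epochs_since_best = val, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:                       # early stopping criterion
                break
    return model
```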
Furthermore, the final two layers of the MLP are frozen at initialisation to ensure asymmetry of the challenge instance for solution verification (see Section 4 of the detailed write-up attached). The overall challenge structure is depicted in the schematic figure below, and the detailed structure of this standard training loop is given in Algorithm 1 of the detailed write-up attached. The data generation, MLP construction and training loop components are deterministic conditioned on a random seed associated with the current challenge instance, which allows reproducible verification of submitted methods.
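Freezing can be expressed, for example, as disabling gradients on the last layers of a PyTorch `nn.Sequential` model; the snippet below is an assumption about how the MLP is containerised, not the actual implementation.

```python
# Illustrative freezing of the final two layers at initialisation: parameters
# with requires_grad=False receive no gradients and hence no optimiser updates.
import torch.nn as nn

def freeze_final_layers(model: nn.Sequential, n_frozen: int = 2) -> None:
    for layer in list(model.children())[-n_frozen:]:
        for p in layer.parameters():
            p.requires_grad_(False)
```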
Goal
A valid attempt at a challenge instance is an optimiser that, when run in the training loop, outputs MLP parameters \mathbf{w} for which the MLP has a test error \mathcal{L}(\mathbf{w}; \mathcal{D}_{\text{test}}) lower than a threshold set by a dataset-dependent baseline test error \epsilon_{*}, computed without the computationally expensive MLP training loop:
\epsilon_{*}^2 > \mathcal{L}_{\text{MSE}}(\mathbf{w} ; \mathcal{D}_{\text{test}}) = \frac{1}{N_{\text{test}}} \sum_{i=1}^{N_{\text{test}}} \| y_i - \hat{f}_\mathbf{w} (\mathbf{x}_i) \|^2,
which formally specifies the solution criterion for \mathbf{w}. We propose a simple empirical expression for \epsilon_{*} in Section 4 of the detailed write-up attached that maintains computational asymmetry of the solution criterion when combined with freezing the parameters of the final two MLP layers. Meeting this criterion corresponds to MLPs that successfully approximate the random function f(\cdot) using only a finite set of noisy observations \{y_i\} at \{\mathbf{x}_i\}.
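In code, the solution check amounts to comparing the trained MLP’s test MSE against \epsilon_{*}^2. In the sketch below `eps_star` is simply taken as an input; its empirical definition lives in Section 4 of the detailed write-up attached, and this is not the actual verification code.

```python
# Sketch of the solution criterion check: valid if test MSE < eps_star ** 2.
import torch

def is_valid_solution(model, X_test, y_test, eps_star: float) -> bool:
    model.eval()
    with torch.no_grad():
        mse = torch.mean((model(X_test).squeeze(-1) - y_test) ** 2).item()
    return mse < eps_star ** 2
```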
Submission constraints
This challenge requires innovators to design components of a gradient descent training loop with the structure of Algorithm 1 in the detailed write-up attached that can successfully optimise MLP parameters to perform regression on a dataset, assessed by holdout test set performance. In particular, the optimiser step also has access to the current MLP parameters, allowing for optimiser-inherent regularization techniques like weight decay. In order to isolate contributions from optimiser innovation as much as possible, all training loop aspects outside of the optimiser algorithm are preserved across all challenge instances. The backpropagation backbone for computing gradients, in particular, is identical across all challenge instances, as modifying it would amount to moving away from gradient descent.
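To make the optimiser interface concrete, the sketch below shows one possible step with the inputs described above: the batch gradient, the current parameter tensor and a mutable per-parameter state. It implements plain momentum SGD with decoupled weight decay purely as an example; the function name and signature are illustrative assumptions, not the actual submission API.

```python
# Example optimiser step: momentum SGD with decoupled weight decay. Access to
# the current parameters is what enables the weight decay term.
import torch

def optimiser_step(grad, param, state, lr=1e-2, momentum=0.9, weight_decay=1e-4):
    buf = state.setdefault(id(param), torch.zeros_like(param))  # per-parameter momentum buffer
    buf.mul_(momentum).add_(grad)                               # momentum accumulation
    return -lr * buf - lr * weight_decay * param                # update u; loop applies w <- w + u
```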
Optimiser hyperparameters
Note that innovators must specify not only the optimiser algorithm, but also hyperparameters such as the learning rate. These hyperparameters should generally depend on the challenge difficulty parameters (see Section 5 of the detailed write-up attached), since gradient descent optimisers are known to be sensitive to hyperparameter choices. Extended innovator rewards can be awarded for improvements solely in hyperparameter selection for existing optimiser algorithm proposals.
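As a hedged illustration, a submission could expose its hyperparameters as a function of the difficulty parameters, for example shrinking the learning rate on higher-dimensional instances. The keys `input_dim` and `num_train` below are placeholder names; the actual difficulty parameters are listed in Section 5 of the detailed write-up attached.

```python
# Illustrative difficulty-dependent hyperparameter selection (placeholder keys).
def select_hyperparameters(difficulty: dict) -> dict:
    d = difficulty.get("input_dim", 1)        # hypothetical difficulty parameter
    n = difficulty.get("num_train", 1024)     # hypothetical difficulty parameter
    lr = 1e-2 / (1.0 + 0.1 * d)               # heuristic: smaller steps in higher dims
    return {"lr": lr, "momentum": 0.9, "weight_decay": 1e-3 * 1024 / n}
```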
Further details
Given the increased level of complexity of this challenge, we leave the remaining technical details and specifications to the technical write-up attached to this post. We encourage the reader to go through these remaining details, as they complete the specification of the full challenge pipeline. With this first public release of the technical challenge specification, we hope to gain valuable feedback on the challenge design from the community even before live challenge runs on testnet, and we are open to modifying design details where necessary to align this challenge with the target goals set out in our introductory post on the Neural Network Gradient Descent challenge.
Detailed technical write-up (1.1 MB)