Upcoming Challenge: Neural Network Gradient Descent

We are excited to announce our upcoming challenge for TIG:

Neural Network Gradient Descent

The recent surge in Artificial Intelligence (AI) has been largely driven by deep learning, a movement resulting from the combination of large datasets, highly parallelized compute, and rich neural network architectures with automatic differentiation. Central to deep learning progress are Stochastic Gradient Descent (SGD) and related algorithms for neural network training [1] [2] [3] [4], which iteratively update the network parameters using some form of the first-order gradient. In particular, the paper introducing the Adam optimiser [4] is currently one of the most cited of the decade, having accumulated over 190 thousand citations.

To encourage the discovery of novel first-order gradient descent optimization algorithms capable of effectively optimizing deep neural networks, we introduce the Neural Network Gradient Descent challenge at TIG. This challenge aims to emulate the deep learning setting by training multilayer perceptrons (MLPs), the simplest foundational neural network architecture [5], on a nonlinear regression task, resulting in loss functions with typical deep learning characteristics such as:

  • Very high dimensionality, with even moderately sized MLPs easily reaching O(10^6) parameters.

  • Many local minima and saddle points, due to the excessive overparameterization of neural networks from the classical statistical perspective.

  • The phenomenon of overfitting, where many training-loss minima correspond to networks with poor performance on unseen data points.

Notably, the challenge has been specifically designed to encourage optimisers with an inductive bias for training neural networks that generalize well beyond the training data, as seen in SGD [6] [7]. This is nontrivial: many advanced gradient descent optimisers like Adam have been empirically observed to find minima that generalize slightly worse than those found by vanilla SGD [8], despite outperforming it in terms of training convergence [9]. No definitive theoretical explanation for these observations has yet been found; they form a unique facet of deep learning optimisation that distinguishes it from traditional optimisation problems focused solely on finding the lowest loss.

On a practical level, the exploration for more efficient optimisers derived from MLP training may contribute to:

  • Democratizing AGI, as improvements in neural network training can lower the barriers to entry posed by hardware costs and specialized training infrastructure.

  • Improving state-of-the-art models, as MLP motifs pervade larger architectures, e.g. transformer blocks [10], so findings here may inspire meaningful changes to the training of larger neural network architectures.

  • Reducing cost and energy usage, as training neural networks requires many iterations of gradient descent, with the latest state-of-the-art models costing hundreds of millions of dollars to train; even small efficiency gains in the optimisation loop will thus have a big impact.

  • Understanding SGD dynamics, as MLPs have long served as toy models for studying gradient descent [11].

The Neural Network Gradient Descent challenge asks innovators to design novel deep learning optimisers in the form of a gradient descent iteration function restricted to a particular structure set by the challenge. The detailed challenge specification and design are ready, and we will soon share a more accessible version of this technical write-up!


Challenge description and formulation

To focus innovation on the optimiser component of the training algorithm, the Neural Network Gradient Descent challenge has a different overall structure from current TIG challenges in two key aspects:

  • The optimiser algorithm submission does not itself output a ‘solution’; instead, it is embedded inside a ‘parent’ algorithm, the training loop, which uses the optimiser iteratively to compute the solution.

  • The optimiser algorithm is run many times for a single challenge instance, with inputs and outputs constrained to the structure of the training loop and determined by intermediate states.

High-level structure

A challenge instance is defined by three main components (a minimal code sketch follows the list):

  • A dataset \mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N that is randomly generated by adding white noise \xi_i \sim \mathcal{N}(0,\sigma_{\text{data}}^2) to a random smooth function f: \mathbb{R}^D \to \mathbb{R} drawn from a Gaussian process (GP) with some chosen kernel function k(\mathbf{x}, \mathbf{x}')
y_i = f(\mathbf{x}_i) + \xi_i \quad \text{with} \quad f(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot)),

evaluated at uniform random locations in the unit hypercube \mathbf{x}_i \in [-1, 1]^D, and the dataset \mathcal{D} is furthermore split into train \mathcal{D}_{\text{train}}, validation \mathcal{D}_{\text{val}} and test \mathcal{D}_{\text{test}} sets of sizes N_{\text{train}}, N_{\text{val}} and N_{\text{test}}.

  • A standard MLP architecture \hat{f}_{\mathbf{w}}: \mathbb{R}^D \to \mathbb{R} with randomly initialized parameters \mathbf{w}, whose hidden layers all share the same specified width and use ReLU-BatchNorm activations.

  • A mean squared error (MSE) loss function between some set of targets \{y_i\} and MLP outputs \{\hat{f}_{\mathbf{w}}(\mathbf{x}_i)\}

\mathcal{L}(\mathbf{w}; \mathcal{D}) = \frac{1}{N} \sum_{i=1}^{N} \| y_i - \hat{f}_\mathbf{w} (\mathbf{x}_i) \|^2,

which is used during training, validation and test evaluations.
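
To make the instance definition concrete, here is a minimal sketch of the three components in Python (NumPy/PyTorch). The RBF kernel and the depth of three hidden layers are illustrative assumptions on our part; the actual kernel, sizes, and initialisation are fixed in Section 3 of the detailed write-up.

```python
# Minimal sketch of a challenge instance (assumed RBF kernel and depth;
# the real settings come from Section 3 of the write-up).
import numpy as np
import torch
import torch.nn as nn

def generate_dataset(N, D, sigma_data, lengthscale, seed):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(N, D))            # x_i uniform in [-1, 1]^D
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * sq_dists / lengthscale**2)        # RBF kernel k(x, x')
    f = rng.multivariate_normal(np.zeros(N), K + 1e-8 * np.eye(N))  # f ~ GP(0, k)
    y = f + rng.normal(0.0, sigma_data, size=N)         # y_i = f(x_i) + xi_i
    return (torch.tensor(X, dtype=torch.float32),
            torch.tensor(y, dtype=torch.float32))

def make_mlp(D, width, n_hidden=3):
    layers, d_in = [], D
    for _ in range(n_hidden):                           # equal-width hidden layers
        layers += [nn.Linear(d_in, width), nn.BatchNorm1d(width), nn.ReLU()]
        d_in = width
    layers.append(nn.Linear(d_in, 1))                   # scalar regression output
    return nn.Sequential(*layers)                       # f_w: R^D -> R

mse = nn.MSELoss()  # L(w; D) = (1/N) sum_i (y_i - f_w(x_i))^2
```

The dataset would then be split into train/validation/test subsets and, per the specification below, the final two layers of the MLP frozen at initialisation.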

Details of the random instance generation are given in Section 3 of the detailed write-up attached. The train set is then divided into B batches \mathcal{D}_{\text{train}} \to \{\mathcal{D}_{\text{batch},b}\}_{b=1}^B of size N_{\text{batch}} that are fed into the training loop, where each loop iteration (epoch) consists of the following steps (a minimal sketch follows the list):

  • Sampling batches in random order, where for every batch:

    • Compute the parameter location \tilde{\mathbf{w}} (which can differ from the current \mathbf{w}, as in Nesterov momentum) at which we evaluate the training loss gradients
    • Compute the regression loss \mathcal{L} and its gradients \mathbf{g} = \nabla_\mathbf{w} \mathcal{L}(\tilde{\mathbf{w}}; \mathcal{D}_{\text{batch}}) using a forward-and-backward pass through the MLP
    • Run one optimiser step to transform \mathbf{g} into parameter updates \mathbf{u}
    • Apply updates \mathbf{w} \to \mathbf{w} + \mathbf{u}

    Note that multiple gradient steps are applied on different subsets (batches) of \mathcal{D}_{\text{train}} per epoch, hence the term ‘stochastic’ gradient descent.

  • Evaluating the validation loss with an MLP forward pass \mathcal{L}(\mathbf{w}; \mathcal{D}_{\text{val}}).

  • Repeating the above steps for multiple epochs until either the maximum number of epochs has been reached or the validation loss has not improved for some chosen number of “patience” epochs, which is a standard early stopping criterion.
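
Putting the loop together, here is a hedged Python sketch of one training run. The per-tensor `opt_step(g, w, state)` interface is an assumption standing in for the innovator-submitted optimiser; the actual fixed signature is given by Algorithm 1 in the detailed write-up. For brevity, gradients are evaluated at the current \mathbf{w} rather than at a lookahead \tilde{\mathbf{w}}.

```python
import random
import torch

def run_epoch(model, loss_fn, batches, opt_step, state):
    random.shuffle(batches)                            # sample batches in random order
    for X_b, y_b in batches:
        model.zero_grad()
        loss = loss_fn(model(X_b).squeeze(-1), y_b)    # forward pass on the batch
        loss.backward()                                # backward pass -> gradients g
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is None:                     # e.g. frozen final layers
                    continue
                u, state = opt_step(p.grad, p, state)  # optimiser step: g -> u
                p.add_(u)                              # apply update w <- w + u
    return state

def train(model, loss_fn, batches, X_val, y_val,
          opt_step, state, max_epochs, patience):
    best_val, stale = float("inf"), 0
    for _ in range(max_epochs):
        state = run_epoch(model, loss_fn, batches, opt_step, state)
        model.eval()                                   # use fixed BatchNorm statistics
        with torch.no_grad():
            val = loss_fn(model(X_val).squeeze(-1), y_val).item()
        model.train()
        best_val, stale = (val, 0) if val < best_val else (best_val, stale + 1)
        if stale >= patience:                          # standard early stopping
            break
    return model
```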

Furthermore, the final two layers of the MLP are frozen at initialisation to ensure asymmetry of the challenge instance for solution verification (see Section 4 of the detailed write-up attached). The overall challenge structure is depicted in the schematic figure below, and the detailed structure of this standard training loop is given in Algorithm 1 of the detailed write-up attached. The data generation, MLP construction, and training loop components are deterministic conditioned on a random seed associated with the current challenge instance, which makes verification of methods reproducible.

Goal

Valid challenge instance attempts involve optimisers that, when run in the training loop, output MLP parameters for which the test loss \mathcal{L}(\mathbf{w}; \mathcal{D}_{\text{test}}) falls below a dataset-dependent squared baseline error \epsilon_{*}^2, where \epsilon_{*} is computed without the computationally expensive MLP training loop

\epsilon_{*}^2 > \mathcal{L}_{\text{MSE}}(\mathbf{w} ; \mathcal{D}_{\text{test}}) = \frac{1}{N_{\text{test}}} \sum_{i=1}^{N_{\text{test}}} \| y_i - \hat{f}_\mathbf{w} (\mathbf{x}_i) \|^2,

which formally specifies the solution criterion for \mathbf{w}. We propose a simple empirical expression for \epsilon_{*} in Section 4 of the detailed write-up attached that, combined with freezing the parameters of the final two MLP layers, maintains the computational asymmetry of the solution criterion. A solution represents an MLP that successfully approximates the random function f(\cdot) using only a finite set of noisy observations \{y_i\} at \{\mathbf{x}_i\}.
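
The solution check itself then reduces to a single cheap forward pass. In the sketch below, `eps_star` stands for the baseline \epsilon_{*} from Section 4 of the write-up, whose expression we do not reproduce here.

```python
import torch

# `model`, `loss_fn`, `X_test`, `y_test`, `eps_star` as in the sketches above
with torch.no_grad():
    test_mse = loss_fn(model(X_test).squeeze(-1), y_test).item()
is_solution = test_mse < eps_star ** 2   # criterion: L(w; D_test) < eps_*^2
```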

Submission constraints

This challenge requires innovators to design components of a gradient descent iteration loop, following the structure of Algorithm 1 in the detailed write-up attached, that can successfully optimise MLP parameters to perform regression on a dataset, as assessed by holdout test set performance. In particular, the optimiser step also has access to the current MLP parameters, allowing for optimiser-inherent regularization techniques like weight decay. To isolate contributions from optimiser innovation as much as possible, all training loop aspects outside of the optimiser algorithm are preserved across all challenge instances. The backpropagation backbone for computing gradients, in particular, is identical across all challenge instances, as modifying it would mean moving away from gradient descent.
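
To illustrate the kind of component being asked for, here is a hypothetical per-tensor optimiser step (SGD with momentum and decoupled weight decay); the actual input/output signature is fixed by the challenge framework. The weight decay term shows why the step also receives the current parameters.

```python
import torch

def init_state(p, lr=1e-2, beta=0.9, wd=1e-4):
    # one state dict per parameter tensor: momentum buffer + hyperparameters
    return {"lr": lr, "beta": beta, "wd": wd, "m": torch.zeros_like(p)}

def opt_step(g, w, state):
    state["m"] = state["beta"] * state["m"] + g        # momentum accumulation
    u = -state["lr"] * (state["m"] + state["wd"] * w)  # weight decay needs w itself
    return u, state
```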

Optimiser hyperparameters

Note that innovators must specify not only the optimiser algorithm but also hyperparameters such as the learning rate. These hyperparameters should generally depend on the challenge difficulty parameters (see Section 5 of the detailed write-up attached), since gradient descent optimisers are known to be sensitive to hyperparameter choices. Extended innovator rewards can be granted for improvements solely in hyperparameter selection for existing optimiser algorithms.
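
As a hedged illustration, difficulty-dependent hyperparameter selection might look like the following; the difficulty field names here are hypothetical (the actual difficulty parameters are listed in Section 5 of the write-up).

```python
def select_hyperparameters(difficulty):
    # `hidden_width` is a hypothetical difficulty field used for illustration
    width = difficulty["hidden_width"]
    lr = 0.1 / max(1.0, width ** 0.5)   # e.g. shrink the learning rate with width
    return {"lr": lr, "beta": 0.9, "wd": 1e-4}
```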

Further details

Given the increased level of complexity of this challenge, we leave the remaining technical details and specifications to the technical write-up attached to this post. We encourage the reader to go through them, as they complete the specification of the full challenge pipeline. With this first public release of the technical challenge specification, we hope to gather valuable feedback on the challenge design from the community even before live challenge runs on testnet, and we are open to modifying design details as necessary to align this challenge with the target goals set out in our introductory post of the Neural Network Gradient Descent challenge.

Detailed technical write-up (1.1 MB)


This sounds like a really interesting and well thought out challenge. I do have some questions though, if I may?

  1. Are there any limits on what the optimizer is allowed to store between steps? For instance, can it keep track of previous gradients or running averages over time as part of its internal state?

  2. Is the optimizer expected to work only with the gradients and parameters it receives directly, or is it allowed to make use of other information from the model, like BatchNorm statistics?

  3. When designing optimizers, is the main focus on generalization to unseen data, or is the rate of convergence during training, for example reaching low validation loss quickly, also a key factor in how submissions are evaluated?

  4. Looking ahead, would there be interest in allowing optimizers to influence other aspects of the training process, such as how batches are sampled or when early stopping is triggered?

Really looking forward to getting involved in this challenge, and I look forward to your reply 🙂


Hi Jake,

Glad to hear of your interest; we do aim to introduce more structured challenges like this one in the near future. As for your comments/questions:

  1. Are there any limits on what the optimizer is allowed to store between steps? For instance, can it keep track of previous gradients or running averages over time as part of its internal state?

There are no limits apart from the fact that the optimiser state is constructed and updated inside the innovator-submitted functions, which have a fixed input and output signature. This means that your optimiser algorithm and state updates only have access to those variables at every iteration (e.g. previous training loss and validation loss, previous gradients and their moving averages, etc.). Indeed, keeping track of running gradient statistics is one of the key ingredients of modern adaptive optimisers like Adam.
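
For concreteness, the kind of running gradient statistics meant here looks like the Adam update [4]; this is a sketch only, not the challenge's fixed interface.

```python
import torch

def init_adam_state(p):
    return {"t": 0, "m": torch.zeros_like(p), "v": torch.zeros_like(p)}

def adam_step(g, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * g        # EMA of gradients
    state["v"] = b2 * state["v"] + (1 - b2) * g * g    # EMA of squared gradients
    m_hat = state["m"] / (1 - b1 ** state["t"])        # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return -lr * m_hat / (v_hat.sqrt() + eps), state   # parameter update u
```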

  2. Is the optimizer expected to work only with the gradients and parameters it receives directly, or is it allowed to make use of other information from the model, like BatchNorm statistics?

Related to the answer for (1): yes, you are right, the optimiser algorithm submission can only use the variables that are given to each of the functions. There is no way to access the model directly within the currently proposed framework. This is a great point to think about, though, as it means we (the challenge designers) should provide as many relevant variables as possible that may be needed for constructing novel, powerful optimisers. BatchNorm statistics have not been considered but may very well be powerful for designing optimisers; did you have any particular examples/reference papers where you have seen this before?

  3. When designing optimizers, is the main focus on generalization to unseen data, or is the rate of convergence during training, for example reaching low validation loss quickly, also a key factor in how submissions are evaluated?

Great question! So far, the rate of convergence is not explicitly considered for ranking optimiser algorithms. However, if optimisers take much longer on average, this will penalize benchmarkers in terms of compute time. We purely evaluate optimisers based on test loss, the main argument being that in many use cases of deep learning people tend to push for better performance even when the compute requirements for training grow very large (e.g. self-driving cars, LLMs, etc.). Of course, many engineering innovations are very valuable for making training less costly, but those tend to involve tangential aspects like quantization, parallelization, etc.

  4. Looking ahead, would there be interest in allowing optimizers to influence other aspects of the training process, such as how batches are sampled or when early stopping is triggered?

Very good point! We thought about this quite a bit, since it is known that aspects like batch size and early stopping patience significantly affect the final results for a given optimiser. We are definitely open to suggestions; so far we have started with the simplest approach of keeping all aspects not directly related to the optimiser fixed, to provide a somewhat fair comparison playground.


Thank you for your reply; that all makes sense, and I really appreciate the clarification on what’s available to the optimizer and why.

Regarding the BatchNorm question, I was thinking about some of the work exploring adaptive optimizers that adjust learning rates based on feature statistics rather than just gradients. I am not sure if this is exactly relevant, but this is the sort of thing I came across: https://arxiv.org/abs/1805.11604

I am glad that generalization is the main metric, as ideally I would focus on designs that try to stay robust across different seeds and datasets. It’s also good that training cost is not directly scored; however, I agree that compute time is important too.

And on the last point, totally understand the value of keeping things clean and comparable early on. I do think it would be exciting to explore things like dynamic patience or learning-based batch selection in a future version, especially since some recent LLM training setups are trending that way. I think this would align more with real-world LLM setups, even though I realise this may be a little further down the line.

Thank you for taking the time to reply.


Those are great suggestions, thanks for the ideas and the reference! Once we have some real metrics of the initial challenge runs on testnet or beyond, it would be great to discuss these ideas more in depth. In particular, we are keen to extend to training CNN/RNN/transformer architectures after the MLP case. More real-world setups like the LLM training setup you mentioned will be relevant in those cases.
