In this section we detail the task distribution used throughout this work. In addition to this text, a TensorFlow (abadi2016tensorflow) implementation is also released at github.com/google-research/google-research/tree/master/task_set.

Subword encoding: We encode the text as subwords with a vocab size of 8k (sennrich2015neural). We then take length-s random slices of each example, where s is sampled logarithmically between [10,256]. These examples are then batched into a batch size sampled logarithmically between [8,512]. With probability 0.2 we restrict the number of training examples to a number sampled logarithmically between [1000,50000]. Finally, with 10% probability we use only training data (instead of validation/test data) to test pure optimization as opposed to generalization.
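For concreteness, the log-uniform sampling used above can be written as in the following sketch. This is our own illustration; the helper name and the use of NumPy are assumptions, not the released code's API.

```python
import numpy as np

def log_uniform_int(rng, low, high):
  """Sample an integer log-uniformly from [low, high]."""
  return int(np.round(np.exp(rng.uniform(np.log(low), np.log(high)))))

rng = np.random.RandomState(0)
seq_len = log_uniform_int(rng, 10, 256)      # random slice length s
batch_size = log_uniform_int(rng, 8, 512)    # batch size
if rng.uniform() < 0.2:                      # optionally restrict the training set
  num_train = log_uniform_int(rng, 1000, 50000)
```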

Byte encoding: We take length-s random slices of each example, where s is sampled logarithmically between [10,160]. These examples are then batched into a batch size sampled logarithmically between [8,512]. With probability 0.2 we restrict the number of training examples to a number sampled logarithmically between [1000,50000]. Finally, with 10% probability we use only training data (instead of validation/test data) to test pure optimization as opposed to generalization.

For the character and word language modeling datasets we make use of the following data sources: IMDB movie reviews (maas-EtAl:2011:ACL-HLT2011), Amazon product reviews (amazonreviews) using the Books, Camera, Home, and Video subsets each as a separate dataset, LM1B (DBLP:journals/corr/ChelbaMSGBK13), and Wikipedia (wikidump) taken from the 20190301 dump using the zh, ru, ja, hab, and en language codes. We split each article on newlines and keep only the resulting examples that contain more than 5 characters. For infrastructure reasons, we use only a million articles from each language and only 200k examples to build the tokenizer.

IMDB sentiment classification: We use text from the IMDB movie reviews dataset (maas-EtAl:2011:ACL-HLT2011) and tokenize into subwords with a vocab size of 8k (sennrich2015neural). We then take a length-s random slice from each example, where s is sampled logarithmically between [8,64]. These examples are then batched into a batch size sampled logarithmically between [8,512]. We sample the number of training examples logarithmically between [1000,55000], and with 10% probability we use only training data instead of validation/test data to test pure optimization as opposed to generalization.

Imagenet32x32 / Imagenet16x16: The ImageNet 32x32 and 16x16 datasets as created by chrabaszcz2017downsampled. Batch size is sampled logarithmically between [8,256].

{food101_32x32, coil100_32x32, deep_weeds_32x32, sun397_32x32}: These datasets take the original set of images and resize them to 32x32 using OpenCV’s (opencv_library) cubic interpolation. We ignore aspect ratio for this resize. Batch size is sampled logarithmically.

Cifar100: Batch size is sampled logarithmically between [8,256]. The number of training examples is sampled logarithmically between [1000,50000] (krizhevsky2009cifar).

Cifar10: Batch size is sampled logarithmically between [8,256]. The number of training examples is sampled logarithmically between [1000,50000] (krizhevsky2009cifar).

Fashion Mnist: Batch size is sampled logarithmically between [8,512]. We sample the number of training images logarithmically between [1000,55000] (xiao2017/online).

Mnist: Batch size is sampled logarithmically between [8,512]. We sample the number of training images logarithmically between [1000,55000] (lecun1998mnist).

For all datasets, we sample a switch with low probability (10% of the time) to only use training data and thus not test generalization. This ensures that our learned optimizers are capable of optimizing a loss as opposed to a mix of optimizing and generalizing.

Image Datasets: We sample uniformly from the following image datasets. Each dataset additionally has sampled parameters. For all datasets we make use of four data splits: train, valid-inner, valid-outer, and test. Train is used to train models; valid-inner is used while training models to allow for modification of the training procedure (e.g. if validation loss doesn’t improve, drop the learning rate); valid-outer is used to select meta-parameters; test should not be used during meta-training.

RNN Cores: We define a distribution over the different types of RNN cores used by the sequential tasks. With equal probability we sample either a vanilla RNN (elman1990finding), GRU (chung2014empirical), or LSTM (hochreiter1997long). For each cell we either sample one shared initialization method or sample a different initialization method per parameter vector, with a 4:1 ratio. We sample the core hidden dimension logarithmically between [32,128].
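As an illustration, the core sampling above might look like the following sketch; the tf.keras cell classes and the helper name are our stand-ins for the actual implementation.

```python
import numpy as np
import tensorflow as tf

def sample_rnn_core(rng):
  # Vanilla RNN, GRU, or LSTM with equal probability.
  cell_cls = rng.choice([tf.keras.layers.SimpleRNNCell,
                         tf.keras.layers.GRUCell,
                         tf.keras.layers.LSTMCell])
  # Core hidden dimension sampled logarithmically in [32, 128].
  hidden_dim = int(np.exp(rng.uniform(np.log(32), np.log(128))))
  # 4:1 ratio between one shared initializer and per-parameter-vector initializers.
  share_initializer = rng.uniform() < 0.8
  return cell_cls(hidden_dim), share_initializer
```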

We sample initializers according to a weighted distribution. Each initializer sample also optionally samples hyperparameters (e.g. for random normal initializers we sample the standard deviation of the underlying distribution).

random uniform (weight 1.0): This is defined between [−s,s] where s is sampled logarithmically between [0.1,10].

orthogonal (weight 1.0): We sample the “gain”, i.e. the multiplier applied to the orthogonal matrix, logarithmically.

We define a distribution over activation functions, sampled according to a weighted listing of names (see the released source for the full listing). These are a mix of standard functions (relu, tanh) and less standard ones (cos).
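The weighted sampling can be sketched as below. The (function, weight) entries here are placeholders for the actual listing, which we do not reproduce.

```python
import numpy as np
import tensorflow as tf

# Placeholder (function, weight) pairs; the real listing lives in the released code.
ACTIVATIONS = [(tf.nn.relu, 2.0), (tf.nn.tanh, 1.0), (tf.math.cos, 1.0)]

def sample_activation(rng):
  fns, weights = zip(*ACTIVATIONS)
  probs = np.asarray(weights) / np.sum(weights)
  return fns[rng.choice(len(fns), p=probs)]
```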

As many of the sampled tasks are neural networks, we define common sampling routines that are used by all of the sampled tasks.

An example configuration is shown below. In this version of TaskSet the dataset sampling contains a bug: all data used is from the imdb_reviews/subwords8k dataset.

This task consists of using an RNN to classify tokenized text. We first trim the vocab to a logarithmically sampled size. The text is then embedded into an embedding vector whose size is also logarithmically sampled. These embeddings are fed into an RNN whose configuration is sampled. With equal probability the initial state of the RNN is either sampled or zeros. With equal probability we take either the last RNN prediction, the mean over features, or the per-feature max over the sequence. This batch of activations is then passed through a linear layer and a softmax cross entropy loss. The initialization for the linear projection is sampled.
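A minimal sketch of the model described above, written with tf.keras layers; the specific values stand in for sampled hyperparameters, and we show only the per-feature max reduction.

```python
import tensorflow as tf

# Placeholder values standing in for sampled hyperparameters.
vocab_size, embed_dim, rnn_units, num_classes = 8000, 64, 64, 2

tokens = tf.keras.Input(shape=(None,), dtype=tf.int32)
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(tokens)
x = tf.keras.layers.LSTM(rnn_units, return_sequences=True)(x)
x = tf.keras.layers.GlobalMaxPooling1D()(x)     # per-feature max over the sequence
logits = tf.keras.layers.Dense(num_classes)(x)  # sampled initializer in the real tasks
model = tf.keras.Model(tokens, logits)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
```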

This task distribution defines a synthetic problem based on a non-linear modification to a quadratic. The dimensionality of the problem is sampled logarithmically between [2, 3000].

The loss for this task is a function of X and the sampled quantities below (see the released source for its exact form), where X = param ∗ weight_rescale and param is initialized by initial_dist.sample() / weight_rescale.

We define a distribution over matrices A as a sample from one of the following: normal (we sample a mean from a normal draw with a standard deviation of 0.05 and a std from a uniform [0, 0.05]; the elements of A are drawn from the resulting distribution), uniform, linspace_eigen, or logspace_eigen.

We define a distribution over B to be either normal, with mean and std sampled from N(0, 1) and U(0, 2) respectively, or uniform, with min and range sampled from U(-5, 2.5) and U(0, 5) respectively; a sketch of this sampling appears below.

The output_fn is sampled uniformly between the identity and f(x) = log(max(0, x)). The loss scale is sampled logarithmically between [10^−5, 10^3].

With probability 50% we add noise from a distribution whose parameters are also sampled.
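As a sketch of the sampling of B and the output function above (the equal-probability choice between the two B distributions and the helper names are our assumptions):

```python
import numpy as np

def sample_b(rng, dims):
  """Draw B: normal with sampled mean/std, or uniform with sampled min/range."""
  if rng.uniform() < 0.5:
    mean, std = rng.normal(0.0, 1.0), rng.uniform(0.0, 2.0)
    return rng.normal(mean, std, size=dims)
  low, extent = rng.uniform(-5.0, 2.5), rng.uniform(0.0, 5.0)
  return rng.uniform(low, low + extent, size=dims)

def sample_output_fn(rng):
  """Identity or f(x) = log(max(0, x)), with equal probability."""
  if rng.uniform() < 0.5:
    return lambda x: x
  return lambda x: np.log(np.maximum(0.0, x))
```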

NVPs (non-volume preserving flows) are a family of tractable density generative models. See dinh2016density for more information. The NVP is defined by a sequence of bijectors. For each bijector we sample the number of layers to be either 1 or 2 with equal probability, and a number of hidden units sampled logarithmically between [16,128]. We sample the number of bijectors uniformly from [1,4] and use the same hidden sizes across all bijectors. We sample the activation function and initializer once for the whole model. In this task we model image datasets, which are also sampled.

Masked autoregressive flows (MAFs) are a family of tractable density generative models. See XX for more information. The MAF is defined by a sequence of bijectors. For each bijector we sample the number of layers to be either 1 or 2 with equal probability, and a number of hidden units sampled logarithmically between [16,128]. We sample the number of bijectors uniformly from [1,4] and use the same hidden sizes across all bijectors. We sample the activation function and initializer once for the whole model. In this task we model image datasets, which are also sampled.
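Both flow families sample their architecture in the same way; a minimal sketch (the field names are ours):

```python
import numpy as np

def sample_flow_config(rng):
  """Architecture sampling shared by the NVP and MAF task families."""
  num_bijectors = rng.randint(1, 5)          # uniform over [1, 4]
  num_layers = rng.choice([1, 2])            # 1 or 2 layers per bijector
  hidden_units = int(np.exp(rng.uniform(np.log(16), np.log(128))))
  return dict(num_bijectors=num_bijectors,
              hidden_layers=[hidden_units] * num_layers)
```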

These tasks consist of a mixture of many other tasks. We sample uniformly over the following types of problems. We briefly describe them here but refer the reader to the provided source for more information. In this work we took all the base problems from (wichrowska2017learned) but modified the sampling distributions to better cover the space, as opposed to narrowly sampling particular problem families. Future work will consist of evaluating which sets of problems or which sampling decisions are required.

quadratic: n-dimensional quadratic problems where n is sampled logarithmically between [10,1000]. Noise is optionally added with probability 0.5 and of scale s, where s is sampled logarithmically between [0.01,10].

bowl: A 2D quadratic bowl problem with a sampled condition number (sampled logarithmically between [0.01,100]). Noise is optionally added with probability 0.5 and of scale s, where s is sampled logarithmically between [0.01,10].

fully_connected: A sampled random fully connected classification neural network predicting 2 classes on synthetic data. The number of input features is sampled logarithmically between 1 and 16, with a random activation function and a number of layers sampled uniformly from 2 to 5.

norm: A problem that finds a minimum error in an arbitrary norm. Specifically: (∑(Wx−y)^p)^(1/p), where W∈R^(N×N), y∈R^(N×1). The dimensionality, N, is sampled logarithmically between 3 and 1000. The power, p, is sampled uniformly between 0.1 and 5.0. W and y are drawn from a standard normal distribution.

dependency_chain: A synthetic problem where each parameter must be brought to zero sequentially. We sample dimensionality logarithmically between 3 and 100.

outward_snake: This loss creates a winding path to infinity. Step size should remain constant across this path. We sample dimensionality logarithmically between 3 and 100.

min_max_well: A loss based on the sum of the min and max over parameters: max(x) + 1/min(x) − 2. Note that the gradient is zero for all but 2 parameters. We sample dimensionality logarithmically between 10 and 1000. Noise is optionally added with probability 0.5 and of scale s, where s is sampled logarithmically between [0.01, 10].

sum_of_quadratics: A least squares loss, of a dimensionality sampled logarithmically between 3 and 100, to a synthetic dataset.

projection_quadratic: A quadratic minimized by probing different directions. Dimensionality is sampled logarithmically from 3 to 100.

In addition to these base tasks, we also provide a variety of transformations described below. The use of these transformations is also sampled.

sparse_problems: With a probability between 0.9 and 0.99, the gradient of each parameter is set to zero. Additional noise is added with probability 0.5, sampled from a normal with std sampled logarithmically between [0.01,10.0].
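A sketch of the sparse_problems transformation applied to a base task's gradient; we read "a probability between 0.9 and 0.99" as a per-task probability sampled uniformly in that range, and the function name is ours.

```python
import numpy as np

def sparsify_gradient(rng, grad):
  """Zero most gradient entries; with probability 0.5 also add sampled noise."""
  p_zero = rng.uniform(0.9, 0.99)
  keep_mask = rng.uniform(size=grad.shape) >= p_zero
  grad = grad * keep_mask
  if rng.uniform() < 0.5:
    std = np.exp(rng.uniform(np.log(0.01), np.log(10.0)))
    grad = grad + rng.normal(0.0, std, size=grad.shape)
  return grad
```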

This task takes word-tokenized data and embeds it into a size-s embedding vector, where s is sampled logarithmically between [8,128], using a random normal initializer with std 1.0. The vocab size for this embedding table is sampled logarithmically between [1000,30000]. We then pass this embedded sequence to an RNN trained with teacher forcing; with equal probability we use either a trainable initial state or zeros. A linear projection is then applied to map back to the number of vocab tokens. Losses are computed using softmax cross entropy and averaged across the sequence.

This task takes character-level data and embeds it into a size-s embedding vector, where s is sampled logarithmically between [8,128], using a random normal initializer with std 1.0. With 80% probability we use all 256 tokens, and with 20% probability we only consider a number of tokens sampled logarithmically between [100,256]. We then pass this embedded sequence to an RNN trained with teacher forcing; with equal probability we use either a trainable initial state or zeros. A linear projection is then applied to map back to the number of vocab tokens. Losses are computed using softmax cross entropy and averaged across the sequence.
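A minimal sketch of the per-token loss used by these language modeling tasks, assuming a model mapping token ids to logits; teacher forcing is the one-step shift between inputs and targets.

```python
import tensorflow as tf

def lm_loss(model, tokens):
  """tokens: int32 [batch, time]. Inputs are shifted one step behind targets."""
  inputs, targets = tokens[:, :-1], tokens[:, 1:]
  logits = model(inputs)  # [batch, time - 1, vocab]
  ce = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=targets, logits=logits)
  return tf.reduce_mean(ce)  # mean over batch and sequence positions
```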

This task consists of small convolutional neural networks, flattened, then run through an MLP. We sample the number of conv layers uniformly between [1,5]. We sample a stride pattern to be either all stride 2, or the repeating pattern 1,2,1,2,… or 2,1,2,1,… for the total number of layers. The hidden units are logarithmically sampled for each layer between [8,64]. Padding for the convolutions is sampled per layer to be either same or valid with equal probability. The output is then flattened and run through an MLP with a number of hidden layers sampled uniformly from [0,4] and with sizes sampled logarithmically from [32,128]. The loss is then computed via softmax cross entropy. We sample one activation function and weight initializer for the entire network. For the convnet we also sample whether or not to use a bias with equal probability. These models are trained on a sampled image dataset.

This task consists of small convolutional neural networks with pooling. We sample the number of layers uniformly between [1,5]. We sample a stride pattern to be either all stride 2, or the repeating pattern 1,2,1,2,… or 2,1,2,1,… for the total number of layers. The hidden units are logarithmically sampled for each layer. We sample one activation function and weight initializer for the entire network. Padding for the convolutions is sampled per layer to be either same or valid with equal probability. For the convnet we also sample whether or not to use a bias with equal probability. At the last layer of the convnet we perform a spatial reduction using either the mean, max, or squared mean, sampled uniformly. This reduced output is fed into a linear layer and a softmax cross entropy loss. These models are trained on a sampled image dataset.

This task has an encoder with a sampled number of layers between [1,3]. For each layer we sample the number of hidden units logarithmically between [32,128]. For the decoder we sample the number of layers uniformly between [1,3]. For each layer we sample the number of hidden units logarithmically between [32,128]. We use a Gaussian prior of dimensionality sampled logarithmically between [32,128]. A single activation function and initialization is chosen for the whole network. The output of the encoder is projected to both a mean and a log standard deviation, which parameterize the variational distribution q(z|x). The decoder maps samples from the latent space to a quantized Gaussian distribution in which we compute data log likelihoods log p(x|z). The loss we optimize is the evidence lower bound (ELBO), which is computed by adding this likelihood to the KL divergence between our normal distribution prior and q(z|x). We use the reparameterization trick to compute gradients. This model is trained on sampled image datasets.
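A sketch of the reparameterized ELBO described above, with a unit Gaussian prior; the quantized-Gaussian likelihood is abstracted into a generic log_likelihood_fn, and signs are flipped so the result can be minimized.

```python
import tensorflow as tf

def negative_elbo(encoder, decoder, log_likelihood_fn, x):
  mean, log_std = encoder(x)                    # parameters of q(z|x)
  eps = tf.random.normal(tf.shape(mean))
  z = mean + tf.exp(log_std) * eps              # reparameterization trick
  log_px_z = log_likelihood_fn(decoder(z), x)   # log p(x|z), summed per example
  # Closed-form KL(q(z|x) || N(0, I)), summed over latent dimensions.
  kl = 0.5 * tf.reduce_sum(
      tf.exp(2.0 * log_std) + tf.square(mean) - 1.0 - 2.0 * log_std, axis=-1)
  return tf.reduce_mean(kl - log_px_z)
```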

This task family consists of a multi layer perceptron trained with an autoencoding loss. The number of layers is sampled uniformly from [2,7]. Layer hidden unit sizes are sampled logarithmically between [16,128], with a different number of hidden units per layer. The last layer always maps back to the input dimension. The output activation function is sampled with the following weights: tanh:2, sigmoid:1, linear_center:1, linear:1, where linear_center is an identity mapping. When using the linear_center and tanh activations we shift the ground truth image to [−1,1] before comparing to the model’s predictions. We sample the per-dimension distance function used to compute the loss with weights l2:2, l1:1, and the reduction function across dimensions to be either mean or sum with equal probability. A single activation function and initializer is sampled. We train on image datasets, which are also sampled.
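A sketch of the reconstruction loss described above; the argument names and the assumption that images lie in [0, 1] are ours.

```python
import tensorflow as tf

def reconstruction_loss(prediction, image, distance="l2", reduction="mean",
                        output_activation="tanh"):
  target = image
  if output_activation in ("tanh", "linear_center"):
    target = 2.0 * image - 1.0                  # shift ground truth to [-1, 1]
  err = prediction - target
  per_dim = tf.square(err) if distance == "l2" else tf.abs(err)
  axes = list(range(1, len(per_dim.shape)))     # all but the batch dimension
  per_example = (tf.reduce_mean(per_dim, axes) if reduction == "mean"
                 else tf.reduce_sum(per_dim, axes))
  return tf.reduce_mean(per_example)
```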

This task family consists of a multi layer perceptron trained on flattened image data. The number of layers is sampled uniformly from [1,6]. Layer hidden unit sizes are sampled logarithmically between [16,128], with a different number of hidden units per layer. One activation function is chosen for the whole network, chosen as described in G.1.1. One shared initializer strategy is also sampled. The image dataset used is also sampled.

G.3 Fixed Tasks

In addition to sampled tasks, we also define a set of hand designed and hand specified tasks. These tasks are either more typical of what a researcher would do (e.g. using default initializations) or exercise specific architectural features such as bottlenecks in autoencoders, normalization, or dropout.

In total there are 107 fixed tasks. Each task is labeled by a name containing some information about the underlying task. We list all tasks and discuss groups of tasks, but will not describe each task in detail. Please see the source for exact details.

Associative_GRU128_BS128_Pairs10_Tokens50

Associative_GRU256_BS128_Pairs20_Tokens50

Associative_LSTM128_BS128_Pairs10_Tokens50

Associative_LSTM128_BS128_Pairs20_Tokens50

Associative_LSTM128_BS128_Pairs5_Tokens20

Associative_LSTM256_BS128_Pairs20_Tokens50

Associative_LSTM256_BS128_Pairs40_Tokens100

Associative_VRNN128_BS128_Pairs10_Tokens50

Associative_VRNN256_BS128_Pairs20_Tokens50



These tasks use RNNs to perform an associative memory task. Given a vocabulary of tokens, some number of key-value pairs to store, and a query, the RNN’s goal is to produce the desired value. For example, given the input sequence A1B2C3?B_ the RNN should produce ________2.

This model embeds tokens, applies an RNN, and applies a linear layer to map back to the output space. Softmax cross entropy loss is used to compare outputs. A weight is also placed on the losses so that loss is incurred only at the positions where the RNN is supposed to predict. For RNN cells we use LSTM (hochreiter1997long), GRU (chung2014empirical), and VRNN – a vanilla RNN. The task names above specify the corresponding RNN cell, number of units, batch size, number of pairs to store, and number of possible tokens for the retrieval task.
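A sketch of generating one retrieval example of the form above; the character encoding and alphabet are purely illustrative.

```python
import random

def make_retrieval_example(num_pairs=3, keys="ABCDEFGH", values="0123456789"):
  ks = random.sample(list(keys), num_pairs)
  mapping = {k: random.choice(values) for k in ks}
  query = random.choice(ks)
  inputs = "".join(k + mapping[k] for k in ks) + "?" + query + "_"
  targets = "_" * (len(inputs) - 1) + mapping[query]
  return inputs, targets

# e.g. ("A1B2C3?B_", "________2")
```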

Copy_GRU128_BS128_Length20_Tokens10

Copy_GRU256_BS128_Length40_Tokens50

Copy_LSTM128_BS128_Length20_Tokens10

Copy_LSTM128_BS128_Length20_Tokens20

Copy_LSTM128_BS128_Length50_Tokens5

Copy_LSTM128_BS128_Length5_Tokens10

Copy_LSTM256_BS128_Length40_Tokens50

Copy_VRNN128_BS128_Length20_Tokens10

Copy_VRNN256_BS128_Length40_Tokens50



These tasks use RNNs to perform a copy task. Given a vocabulary of tokens and some number of tokens to store, the RNN’s job is to read the tokens and then produce the corresponding outputs. For example an input might be ABBC|____ and the RNN should output ____|ABBC. See the source for a complete description of the task. Each task in this set varies the RNN core as well as the dataset structure.

This model embeds tokens, applies an RNN, and applies a linear layer to map back to the output space. Softmax cross entropy loss is used to compare outputs. A weight is also placed on the losses so that loss is incurred only at the positions where the RNN is supposed to predict. For RNN cells we use LSTM (hochreiter1997long), GRU (chung2014empirical), and VRNN – a vanilla RNN. The task names above specify the corresponding RNN cell, number of units, batch size, sequence lengths, and number of possible tokens.

FixedImageConvAE_cifar10_32x32x32x32x32_bs128

FixedImageConvAE_cifar10_32x64x8x64x32_bs128

FixedImageConvAE_mnist_32x32x32x32x32_bs128

FixedImageConvAE_mnist_32x64x32x64x32_bs512

FixedImageConvAE_mnist_32x64x8x64x32_bs128



Convolutional autoencoders trained on different datasets and with different architectures (sizes of hidden units).

FixedImageConvVAE_cifar10_32x64x128x64x128x64x32_bs128

FixedImageConvVAE_cifar10_32x64x128x64x128x64x32_bs512

FixedImageConvVAE_cifar10_32x64x128x64x32_bs128

FixedImageConvVAE_cifar10_64x128x256x128x256x128x64_bs128

FixedImageConvVAE_mnist_32x32x32x32x32_bs128

FixedImageConvVAE_mnist_32x64x32x64x32_bs128

FixedImageConvVAE_mnist_64x128x128x128x64_bs128



Convolutional variational autoencoders trained on different datasets, batch sizes, and with different architectures.

FixedImageConv_cifar100_32x64x128_FC64x32_tanh_variance_scaling_bs64

FixedImageConv_cifar100_32x64x64_flatten_bs128

FixedImageConv_cifar100_bn_32x64x128x128_bs128

FixedImageConv_cifar10_32x64x128_flatten_FC64x32_tanh_he_bs8

FixedImageConv_cifar10_32x64x128_flatten_FC64x32_tanh_variance_scaling_bs64

FixedImageConv_cifar10_32x64x128_he_bs64

FixedImageConv_cifar10_32x64x128_largenormal_bs64

FixedImageConv_cifar10_32x64x128_normal_bs64

FixedImageConv_cifar10_32x64x128_smallnormal_bs64

FixedImageConv_cifar10_32x64x128x128x128_avg_he_bs64

FixedImageConv_cifar10_32x64x64_bs128

FixedImageConv_cifar10_32x64x64_fc_64_bs128

FixedImageConv_cifar10_32x64x64_flatten_bs128

FixedImageConv_cifar10_32x64x64_tanh_bs64

FixedImageConv_cifar10_batchnorm_32x32x32x64x64_bs128

FixedImageConv_cifar10_batchnorm_32x64x64_bs128

FixedImageConv_coil10032x32_bn_32x64x128x128_bs128

FixedImageConv_colorectalhistology32x32_32x64x64_flatten_bs128

FixedImageConv_food10164x64_Conv_32x64x64_flatten_bs64

FixedImageConv_food101_batchnorm_32x32x32x64x64_bs128

FixedImageConv_mnist_32x64x64_fc_64_bs128

FixedImageConv_sun39732x32_bn_32x64x128x128_bs128

Mnist_Conv_32x16x64_flatten_FC32_tanh_bs32

Convolutional neural networks doing supervised classification. These models vary in dataset, architecture, and initializations.

FixedLM_lm1b_patch128_GRU128_embed64_avg_bs128

FixedLM_lm1b_patch128_GRU256_embed64_avg_bs128

FixedLM_lm1b_patch128_GRU64_embed64_avg_bs128

FixedLM_lm1b_patch128_LSTM128_embed64_avg_bs128

FixedLM_lm1b_patch128_LSTM256_embed64_avg_bs128



Language modeling tasks on different RNN cell types and sizes.

FixedMAF_cifar10_3layer_bs64

FixedMAF_mnist_2layer_bs64

FixedMAF_mnist_3layer_thin_bs64



Masked autoregressive flow models with different architectures (numbers of layers and sizes).

FixedMLPAE_cifar10_128x32x128_bs128

FixedMLPAE_mnist_128x32x128_bs128

FixedMLPAE_mnist_32x32x32_bs128



Autoencoder models based on multi layer perceptrons with different numbers of hidden layers and datasets.

FixedMLPVAE_cifar101_128x128x32x128x128_bs128

FixedMLPVAE_cifar101_128x32x128_bs128

FixedMLPVAE_food10132x32_128x64x32x64x128_bs64

FixedMLPVAE_mnist_128x128x8x128_bs128

FixedMLPVAE_mnist_128x64x32x64x128_bs64

FixedMLPVAE_mnist_128x8x128x128_bs128

Imagenet32x30_FC_VAE_128x64x32x64x128_relu_bs256

Variational autoencoder models built from multi layer perceptrons with different datasets, batch sizes, and architectures.

FixedMLP_cifar10_BatchNorm_128x128x128_relu_bs128

FixedMLP_cifar10_BatchNorm_64x64x64x64x64_relu_bs128

FixedMLP_cifar10_Dropout02_128x128_relu_bs128

FixedMLP_cifar10_Dropout05_128x128_relu_bs128

FixedMLP_cifar10_Dropout08_128x128_relu_bs128

FixedMLP_cifar10_LayerNorm_128x128x128_relu_bs128

FixedMLP_cifar10_LayerNorm_128x128x128_tanh_bs128

FixedMLP_cifar10_ce_128x128x128_relu_bs128

FixedMLP_cifar10_mse_128x128x128_relu_bs128

FixedMLP_food10132x32_ce_128x128x128_relu_bs128

FixedMLP_food10132x32_mse_128x128x128_relu_bs128

FixedMLP_mnist_ce_128x128x128_relu_bs128

FixedMLP_mnist_mse_128x128x128_relu_bs128

Image classification based on multi layer perceptrons. We vary architecture, data, batch size, normalization techniques, dropout, and loss type across problems.

FixedNVP_mnist_2layer_bs64

FixedNVP_mnist_3layer_thin_bs64

FixedNVP_mnist_5layer_bs64

FixedNVP_mnist_5layer_thin_bs64

FixedNVP_mnist_9layer_thin_bs16



Non-volume preserving flow models with different batch sizes and architectures.

FixedTextRNNClassification_imdb_patch128_LSTM128_avg_bs64

FixedTextRNNClassification_imdb_patch128_LSTM128_bs64

FixedTextRNNClassification_imdb_patch128_LSTM128_embed128_bs64

FixedTextRNNClassification_imdb_patch32_GRU128_bs128

FixedTextRNNClassification_imdb_patch32_GRU64_avg_bs128

FixedTextRNNClassification_imdb_patch32_IRNN64_relu_avg_bs128

FixedTextRNNClassification_imdb_patch32_IRNN64_relu_last_bs128

FixedTextRNNClassification_imdb_patch32_LSTM128_E128_bs128

FixedTextRNNClassification_imdb_patch32_LSTM128_bs128

FixedTextRNNClassification_imdb_patch32_VRNN128_tanh_bs128

FixedTextRNNClassification_imdb_patch32_VRNN64_relu_avg_bs128

FixedTextRNNClassification_imdb_patch32_VRNN64_tanh_avg_bs128



RNN text classification problems with different RNN cells, sizes, embedding sizes, and batch sizes.

TwoD_Bowl1

TwoD_Bowl10

TwoD_Bowl100

TwoD_Bowl1000



2D quadratic bowls with different condition numbers.

TwoD_Rosenbrock

TwoD_StyblinskiTang

TwoD_Ackley

TwoD_Beale

