DENSER: Deep Evolutionary Network Structured Representation

Deep Evolutionary Network Structured Representation (DENSER) is a novel approach to automatically design Artificial Neural Networks (ANNs) using Evolutionary Computation. The algorithm not only searches for the best network topology, but also tunes hyper-parameters (e.g., learning or data augmentation parameters). The automatic design is achieved using a representation with two distinct levels, where the outer level encodes the general structure of the network, and the inner level encodes the parameters associated with each layer. The allowed layers and hyper-parameter value ranges are defined by means of a human-readable Context-Free Grammar.

The main contributions of this work are:

DENSER, a general framework based on evolutionary principles that automatically searches for the adequate structure and parameterisation of large scale deep networks that can have different layer types and/or goals;

An automatically generated CNN that, without any prior knowledge, is effective at the classification of the CIFAR-10 dataset, with an average accuracy of 94.27%;

The demonstration that ANNs evolved with DENSER generalise well. Concretely, an average accuracy of 78.75% on the CIFAR-100 dataset is obtained by a network whose topology was evolved for the CIFAR-10 dataset. To the best of our knowledge, this is the best result reported on the CIFAR-100 dataset by methods that automatically design CNNs.

The best trained models can be found at https://git.io/vNtYV.

Proposed Approach: DENSER

To promote the evolution of the structure and parameters of ANNs we propose DENSER: Deep Evolutionary Network Structured Representation. DENSER gathers the basic ideas of Genetic Algorithms (GAs) and Dynamic Structured Grammatical Evolution (DSGE).

Representation

Each solution encodes an ANN by means of an ordered sequence of feedforward layers and their respective parameters; the learning and any other hyper-parameters can be encoded with each individual too. The representation of the candidate solution is made at two different levels:

GA Level: encodes the macro structure of the networks and is responsible for representing the sequence of layers that later serves as an indicator of the grammatical starting symbol. It requires the definition of the allowed structure of the networks, i.e., the valid sequence of layers.



DSGE Level: encodes the parameters associated with a layer. The parameters and their allowed values or ranges are codified in the grammar, which must be defined by the user.
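As a concrete illustration, the two-level representation can be sketched as follows. The grammar fragment, symbol names, and parameter ranges are illustrative placeholders, not the exact grammar used in the paper:

```python
import random

# Hypothetical grammar fragment: each start symbol maps to a list of
# expansion choices; terminals carry (name, type, domain) for DSGE sampling.
GRAMMAR = {
    "convolution": [[("num-filters", "int", (32, 256)),
                     ("kernel-size", "int", (2, 5)),
                     ("activation", "cat", ["relu", "sigmoid"])]],
    "fully-connected": [[("num-units", "int", (64, 2048)),
                         ("activation", "cat", ["relu", "sigmoid"])]],
}

def sample_layer(start_symbol):
    """DSGE level: expand a grammar symbol into concrete layer parameters."""
    expansion = random.choice(GRAMMAR[start_symbol])
    params = {}
    for name, kind, domain in expansion:
        if kind == "int":
            params[name] = random.randint(*domain)
        else:
            params[name] = random.choice(domain)
    return {"symbol": start_symbol, "params": params}

def random_individual(structure):
    """GA level: `structure` lists, per module, the allowed start symbol
    and the (min, max) number of layers in that module."""
    modules = []
    for symbol, (lo, hi) in structure:
        n = random.randint(lo, hi)
        modules.append([sample_layer(symbol) for _ in range(n)])
    return modules

# e.g. networks with 1-10 feature-extraction layers and 1-3 dense layers
ind = random_individual([("convolution", (1, 10)), ("fully-connected", (1, 3))])
```

The outer list is the GA level (sequence of modules), and each layer dictionary is the DSGE level (grammar-derived parameters).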

Crossover

Two crossover operators are proposed, each based on a different genotypic level. In the context of the current work a module does not stand for a set of layers that can be replicated multiple times, but is rather a set of layers that belongs to the same GA structural index.

Taking into account that all individuals have the same modules (with a possibly different number of layers), the first crossover operator is a one-point crossover that exchanges layers between two individuals, within the same module.

The second crossover operator is a uniform crossover that exchanges entire modules between two individuals.
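Assuming individuals are encoded as lists of modules, each a list of layers, the two crossover operators can be sketched as follows (an illustration, not the paper's exact implementation):

```python
import random

def module_one_point_crossover(parent1, parent2):
    """Within one randomly chosen module, swap the layer tails of the
    two parents at a single cut point (the first, GA-level operator)."""
    m = random.randrange(len(parent1))
    cut = random.randint(0, min(len(parent1[m]), len(parent2[m])))
    child1 = [mod[:] for mod in parent1]
    child2 = [mod[:] for mod in parent2]
    child1[m] = parent1[m][:cut] + parent2[m][cut:]
    child2[m] = parent2[m][:cut] + parent1[m][cut:]
    return child1, child2

def module_uniform_crossover(parent1, parent2):
    """Swap entire modules between the parents with probability 0.5
    (the second operator)."""
    child1, child2 = [], []
    for m1, m2 in zip(parent1, parent2):
        if random.random() < 0.5:
            m1, m2 = m2, m1
        child1.append(m1[:])
        child2.append(m2[:])
    return child1, child2
```

Because both operators only rearrange existing layers, the total number of layers per module is conserved across the two children.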

Mutation

We develop a set of mutation operators, specifically targeted at promoting the evolution of ANNs. On the GA level, the mutations aim at manipulating the structure of the network:

Add layer: a new layer is generated based on the starting symbol possibilities of the module where the layer is to be placed;

Replicate layer: a layer is selected at random and copied into another valid position of the module. The copy is made by reference, i.e., if a parameter is changed in the layer all the copies are modified;

Remove layer: a layer is selected and deleted from a given module.

The above mutation operators only act on the general structure of the network; to change the parameters of the layers the following DSGE mutations are used:

Grammatical mutation: an expansion possibility is replaced by another one;

Integer mutation: a new integer value is generated (within the allowed range);

Float mutation: a Gaussian perturbation is applied to a given float value.
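Assuming individuals are encoded as lists of modules (each a list of layer dictionaries), the structural and float mutations can be sketched as below. Note in particular that Replicate layer inserts the same object rather than a deep copy, so a later parameter change propagates to all copies; names and ranges are illustrative:

```python
import random

def add_layer(individual, module_idx, sample_fn):
    """Add layer: insert a freshly sampled layer at a random position.
    `sample_fn` stands in for sampling from the module's start symbol."""
    mod = individual[module_idx]
    mod.insert(random.randint(0, len(mod)), sample_fn())

def replicate_layer(individual, module_idx):
    """Replicate layer: the copy is made BY REFERENCE, so changing a
    parameter in one copy changes all of them."""
    mod = individual[module_idx]
    layer = random.choice(mod)
    mod.insert(random.randint(0, len(mod)), layer)  # same object, no deep copy

def remove_layer(individual, module_idx):
    """Remove layer: delete a randomly selected layer from the module."""
    mod = individual[module_idx]
    del mod[random.randrange(len(mod))]

def float_mutation(layer, key, sigma=0.15, bounds=(0.0, 1.0)):
    """DSGE float mutation: Gaussian perturbation, clipped to the range."""
    lo, hi = bounds
    layer[key] = min(hi, max(lo, layer[key] + random.gauss(0.0, sigma)))
```

The by-reference replication is what makes the later integer/float mutations act on all replicas of a layer at once.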

Experimental Results

To test the approach we perform experiments on the generation of CNNs for the classification of the CIFAR-10 benchmark. CIFAR-10 is made of 60000 instances, each a 32 x 32 RGB colour image that belongs to one of ten possible classes. The solutions evolved by DENSER are mapped to Keras models so that their performance can be measured. The goal is the maximisation of the accuracy on the object recognition task.

To analyse the generalisation ability and scalability of the evolved topologies, we then take the best CNN topologies found and test them on the classification of the CIFAR-100 benchmark.

Evolution of CNNs for the CIFAR-10

We conduct 10 evolutionary runs on the generation of CNNs for the classification of the CIFAR-10 dataset. For the generated networks we analyse their fitness (i.e., accuracy on the classification task), and the number of hidden-layers.

Figure 1 Evolution of the fitness (top) and number of hidden-layers (bottom) of the best individuals across generations.

Figure 1 depicts the evolution of the average fitness and number of layers of the best CNNs across generations. A brief perusal of the results indicates that evolution is occurring, and solutions tend to converge around the 80th generation. Two different and contradictory behaviours are observable: from the start of evolution until approximately the 60th generation, an increase in performance is accompanied by a decrease in the number of layers; from the 60th generation until the last, an increase in performance is instead followed by an increase in the number of hidden-layers of the best networks. This apparent contradiction is explained by the fact that in the first generation the randomly generated solutions have a large number of layers, whose parameters are set at random.

Figure 2 Topology of the best performing network found by DENSER.

The fittest network found during evolution (in terms of validation accuracy) is represented in Figure 2. The most puzzling characteristic of the evolved network is the importance and number of dense layers used at the end of the topology. To the best of our knowledge, the sequential use of such a large number of dense layers is unprecedented, and it is fair to say that a human would never think of such a topology, which makes this evolutionary outcome remarkable.

Once the evolutionary process is completed, the best network found in each run is re-trained 5 times. First, we train the networks with the same learning rate that is used during evolution (lr=0.01), but for 400 epochs (instead of 10). With this setup we obtain, on average, a classification accuracy of 88.41% on the test set. To further enhance the accuracy of the networks we adopt the strategy described by Snoek et al.: for each instance of the test set we generate 100 augmented images, and the prediction is the maximum value of the average confidence values on the 100 augmented images. Following this validation approach, the average accuracy on the test set of the best evolved networks increases to 89.93%.
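The test-set augmentation strategy can be sketched as follows; `predict` and `augment` are placeholders for the trained model and the stochastic augmentation pipeline, which are not specified here:

```python
def tta_predict(predict, augment, image, n_copies=100):
    """Average the model's confidence vectors over `n_copies` augmented
    views of the image and return the class with the highest mean
    confidence (the Snoek et al. test-time augmentation strategy)."""
    n_classes = len(predict(image))
    totals = [0.0] * n_classes
    for _ in range(n_copies):
        for c, p in enumerate(predict(augment(image))):
            totals[c] += p
    return max(range(n_classes), key=lambda c: totals[c] / n_copies)
```

With a deterministic model the result reduces to a plain argmax; the benefit comes from averaging out the variance that augmentation introduces.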

To investigate whether it is possible to increase the performance of the best networks, we re-train them using the same strategy as CGP-CNN: a varying learning rate that starts at 0.01; on the 5th epoch it is increased to 0.1; by the 250th epoch it is decreased to 0.01; and finally at the 375th it is reduced to 0.001. With this training policy the average accuracy of the fittest network increases to 93.38%. When performing data augmentation on the test set, the average accuracy is 94.13%, i.e., an average error of 5.87%.
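This varying learning rate policy is a simple step schedule; a minimal sketch follows (the exact epoch boundaries at which each rate takes effect are an assumption — the text only gives the switch points):

```python
def cgp_cnn_schedule(epoch):
    """Step learning-rate schedule used to re-train the best networks:
    0.01 for warm-up, 0.1 from epoch 5, 0.01 from epoch 250,
    and 0.001 from epoch 375 onwards."""
    if epoch < 5:
        return 0.01
    if epoch < 250:
        return 0.1
    if epoch < 375:
        return 0.01
    return 0.001
```

In Keras, such a function could be attached to training via the `LearningRateScheduler` callback.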

Figure 3 CIFAR-10 results.

Method              Accuracy  # Parameters
DENSER              94.13%    10.81 x 10^6
CGP-CNN (ResSet)    94.02%    1.68 x 10^6
CGP-CNN (ConvSet)   93.25%    1.52 x 10^6
Snoek et al.        93.63%    –
CoDeepNEAT          92.7%     –
Highway Network     92.31%    –

The table of Figure 3 shows a comparison with the best results reported by other methods. An analysis of the results shows that DENSER is the one that reports the highest accuracy. The number of trainable parameters is much higher in our methodology because we allow the placement of fully-connected layers in the evolved CNNs. In addition, during evolution no prior knowledge is used to bias the search space.

Generalisation to other Datasets

To test the generalisation and scalability of the evolved networks, we take the best network generated by DENSER on the CIFAR-10 dataset and apply it to the classification of the CIFAR-100, MNIST, and Fashion-MNIST datasets. In order for the networks evolved on the CIFAR-10 dataset to work on the remaining ones, we only change the input and output layers to cope with the new input image size and number of classes of each problem.

Generalisation: CIFAR-100

The CIFAR-100 and CIFAR-10 datasets are very similar: the instances are the same, i.e., the same 32 x 32 RGB colour images, but the number of classes is considerably larger: 100 instead of 10. Therefore, we change the last layer of the model from Figure 2 to have 100 output neurons; the input is kept the same.

Using the same varying learning rate policy as before (starts at 0.01, and that on the 5th, 250th and 375th epochs changes to 0.1, 0.01, and 0.001, respectively) we obtain an average test accuracy of 73.32% (5 independent trains), with no data augmentation on the test set. If we augment each test instance 100 times, the average test accuracy slightly increases to 74.94%.

The training of each network is stochastic: the initial conditions differ, and, because of the data augmentation process, the networks are trained on different instances of the dataset. Thus, in order to further improve the results, we investigate whether the approach proposed by Graham to test the performance of fractional max-pooling increases the performance reported by our network. In brief, instead of a single network an ensemble is used, where each network in the ensemble is the result of an independent training of the same network. Using this method, an ensemble of the same network trained 5 times reports a test accuracy of 77.51%, an ensemble of 10 trains a test accuracy of 77.89%, and an ensemble of 12 trains a test accuracy of 78.14%. These results outperform those reported in the literature for the evolution of CNNs with a fairly standard data augmentation methodology.

Moreover, instead of forming ensembles with the same network structure, we also tested building an ensemble using the two best network topologies found by DENSER, similarly to what is done by Real et al. Following this methodology, we were able to increase the accuracy to 78.75%.
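Both ensembling variants reduce to averaging the confidence vectors of several trained models and picking the top class, whether the models share a topology or not. A minimal sketch, where each model is represented as a function from image to confidence vector:

```python
def ensemble_predict(models, image):
    """Average the per-class confidences of independently trained models
    (same topology, as in Graham's scheme, or mixed topologies, as in
    Real et al.) and return the class with the highest mean confidence."""
    n_classes = len(models[0](image))
    totals = [0.0] * n_classes
    for predict in models:
        for c, p in enumerate(predict(image)):
            totals[c] += p
    return max(range(n_classes), key=lambda c: totals[c] / len(models))
```

Because each training run converges to a different local optimum, the averaged confidences tend to be better calibrated than any single run's.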

Figure 4 CIFAR-100 results.

Method            Accuracy
DENSER            78.75%
Real et al.       77.00%
Graham            73.61%
Snoek et al.      72.60%
Highway Network   67.76%

The table of Figure 4 shows a comparison of the results obtained by DENSER on CIFAR-100 with those of other approaches.

Generalisation: MNIST

The training data in the CIFAR-10 and CIFAR-100 datasets is the same, and thus it is expected that the networks evolved for CIFAR-10 are capable of extracting features that also perform well on CIFAR-100. That is why we next test the generalisation ability of the best models found for the CIFAR-10 dataset on a completely different problem.

The MNIST dataset is composed of handwritten digits, and each digit (0 to 9) is centered in a 28 x 28 grayscale image. Therefore we need to adapt the input shape of the best evolved networks to have 784 (28 x 28 x 1) input neurons instead of 3072 (32 x 32 x 3).

With the varying learning rate policy, the average accuracy of 5 independent trains is 99.65% without data augmentation on the test set. Performing data augmentation on the test set increases the average accuracy to 99.69%.

The ensembling method by Graham, where each different train of the network is used as a part of the ensemble, further increases the accuracy to 99.70%. Using the two best networks found during the search for topologies for the CIFAR-10 does not provide any gains, with the performance being the same: 99.70%.

Figure 5 MNIST results.

Method            Accuracy
DENSER            99.70%
Graham            99.68%
Highway Network   99.54%

The table of Figure 5 compares the best performance of the topology found by DENSER for the CIFAR-10 dataset, trained on the MNIST dataset, with the results of other topologies trained using different approaches.

Generalisation: Fashion-MNIST

Although the performance of the best networks evolved on the CIFAR-10 dataset is high on the MNIST benchmark, it is well known that MNIST is an easy problem, where a simple network with one hidden layer can achieve close to perfect accuracy. For that reason we also test the models on the Fashion-MNIST dataset, which is arguably more difficult than the standard MNIST.

The Fashion-MNIST dataset is similar to MNIST in the sense that it is composed of 28 x 28 grayscale images. However, instead of digits, the images are of Zalando's articles. There are also 10 classes: t-shirt/top; trouser; pullover; dress; coat; sandal; shirt; sneaker; bag; and ankle boot. The size of the networks' input and output is the same as for the MNIST dataset.

As for the MNIST and CIFAR-100 problems, we first check the average classification performance without data augmentation on the test set, which is 94.23%. If we augment each test instance 100 times, the average accuracy increases to 94.70%.

The ensemble of the 5 trains of the best found model for the CIFAR-10 dataset reports an accuracy of 95.11%. This value further increases if we ensemble the trains of the two best models: 95.26%.



Figure 6 Fashion-MNIST results.

Method      Accuracy
DENSER      95.26%
ResNet18    94.90%
GoogleNet   93.70%
AlexNet     89.90%

Figure 6 shows the comparative performance of the model found by DENSER and other well known CNNs.

References

Snoek, Jasper, et al. “Scalable Bayesian optimization using Deep Neural Networks.” International Conference on Machine Learning. 2015.

Miikkulainen, Risto, et al. “Evolving Deep Neural Networks.” arXiv preprint arXiv:1703.00548 (2017).

Suganuma, Masanori, Shinichi Shirakawa, and Tomoharu Nagao. “A genetic programming approach to designing convolutional neural network architectures.” Proceedings of the Genetic and Evolutionary Computation Conference (GECCO ’17). ACM, 2017. 497-504.

Graham, Benjamin. “Fractional max-pooling.” arXiv preprint arXiv:1412.6071 (2014).

Real, Esteban, et al. “Large-scale evolution of image classifiers.” arXiv preprint arXiv:1703.01041 (2017).

Srivastava, Rupesh K., Klaus Greff, and Jürgen Schmidhuber. “Training very deep networks.” Advances in neural information processing systems. 2015.

In Press