Copyrights and Credit for: brokenmachine86

Many theoretical and practical problems can be reduced to an optimization problem. If you think of a function as a kind of program or machine that you feed an input and that returns an output, then the optimization problem is how to find the best input, as defined by some criterion applied to the corresponding output. For example, one of the simplest and most common optimization problems is finding the input that produces a minimum (or maximum) output, i.e. no other input value produces a smaller (or larger) output value.

Optimization is of great practical importance, since many critical real-world problems are in essence optimization problems. Think, for example, of routing packets on the internet. A packet leaving your computer needs to reach its destination in the most efficient way, in terms of passing through the fewest nodes, avoiding congestion and so on. In that sense, you need to select the best route for your packet. If you define a function that takes a given packet route and calculates the “cost” associated with using this route, then the problem reduces to finding the input to that function (i.e. the route) that gives you the minimum cost. So, it is a minimization problem.
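To make the routing example concrete, here is a minimal sketch in Python. The node names, the link costs and the list of candidate routes are all made up for illustration; a real router would not enumerate routes like this. It only shows the idea of a target function (`route_cost`) and of picking the input that minimizes it.

```python
# Hypothetical link costs (say, latency in ms) between nodes; illustrative only.
LINK_COST = {
    ("A", "B"): 4,
    ("B", "D"): 3,
    ("A", "C"): 2,
    ("C", "D"): 8,
    ("B", "C"): 1,
}

def route_cost(route):
    """Target function: total cost of a route given as a sequence of nodes."""
    return sum(LINK_COST[(a, b)] for a, b in zip(route, route[1:]))

# Hypothetical candidate inputs to the target function.
candidate_routes = [
    ("A", "B", "D"),
    ("A", "C", "D"),
    ("A", "B", "C", "D"),
]

# The optimization problem: find the input with the minimum output.
best_route = min(candidate_routes, key=route_cost)
print(best_route, route_cost(best_route))
```

With only three candidates you can simply try them all. The interesting cases are the ones where the set of possible inputs is far too large for that.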

There are many techniques and methods for doing optimization, and each has its advantages and disadvantages depending on the kind of problem, the application and the computational constraints, among other factors. Informally, a convex problem is an optimization problem in which the function you want to optimize (I will call it the target function) has the shape of a convex/concave surface. If the function is also continuous and differentiable, i.e. contains no jumps or sharp changes, then you can use gradient information to find its maximum/minimum. Intuitively, this just means following the slope of the function up or down until you reach a flat point, which will be your maximum/minimum. Unfortunately, despite how comparatively simple these problems are to optimize, many critical real-world problems don’t fall into this category. Such problems are called non-convex. Also intuitively, you can think of a non-convex target function as having a complex landscape with many hills and valleys.
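As a rough illustration of "following the slope until you reach a flat point", here is a minimal gradient descent sketch in Python. The target function `f`, the starting point, the learning rate and the tolerance are all assumptions chosen for the example, not values from any particular application.

```python
def gradient_descent(grad, x0, learning_rate=0.1, tolerance=1e-8, max_steps=10_000):
    """Walk downhill along the slope until the landscape is (almost) flat."""
    x = x0
    for _ in range(max_steps):
        slope = grad(x)
        if abs(slope) < tolerance:  # flat point reached
            break
        x -= learning_rate * slope  # step against the slope, i.e. downhill
    return x

def f(x):
    # Hypothetical convex target function with its single minimum at x = 3.
    return (x - 3.0) ** 2 + 1.0

def f_grad(x):
    # Derivative of f.
    return 2.0 * (x - 3.0)

x_min = gradient_descent(f_grad, x0=-5.0)
print(x_min, f(x_min))
```

Because `f` is convex, the flat point the procedure stops at is the one and only minimum, no matter where you start.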

In non-convex problems, you can have multiple minima/maxima. I will confine the discussion to minima, but all the concepts apply equally to maxima. In fact, you can usually turn a minimization problem into a maximization one (or vice versa) just by flipping the sign of the target function. Any one of these minima can be either a local minimum, i.e. lower than its neighborhood but not necessarily the lowest point in the whole landscape, or a global minimum, i.e. the lowest point in the whole landscape. With these complications, the search for the desired minimum in a non-convex problem is difficult, especially in a high-dimensional space. For example, think of a target function that takes different properties of a customer, say 10 of them, e.g. political orientation, nationality, gender, etc., and produces a probability of purchasing your product. What you may aim for is to pick the customer who is most likely to buy your product (i.e. maximize the purchasing probability) so you can target them. This is a 10-dimensional space, since each property can change independently, and that is still a small dimensionality compared with problems like optimizing neural networks, which have millions of dimensions.

Left: Convex function. Right: Non-convex function.
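To see why multiple minima are a problem, here is a small Python sketch that runs the same downhill walk from two different starting points on a non-convex function. The function `f`, the starting points and the learning rate are assumptions picked for illustration. The function has a local minimum near x = 1.35 and a lower, global minimum near x = -1.47.

```python
def gradient_descent(grad, x0, learning_rate=0.01, tolerance=1e-8, max_steps=10_000):
    """Follow the slope downhill from x0 until it is (almost) flat."""
    x = x0
    for _ in range(max_steps):
        slope = grad(x)
        if abs(slope) < tolerance:
            break
        x -= learning_rate * slope
    return x

def f(x):
    # Hypothetical non-convex target function with two valleys.
    return x ** 4 - 4.0 * x ** 2 + x

def f_grad(x):
    # Derivative of f.
    return 4.0 * x ** 3 - 8.0 * x + 1.0

x_from_right = gradient_descent(f_grad, x0=2.0)   # rolls into the shallower valley
x_from_left = gradient_descent(f_grad, x0=-2.0)   # rolls into the deeper valley

print(x_from_right, f(x_from_right))
print(x_from_left, f(x_from_left))
```

Both runs stop at a flat point, but only the one started on the left finds the global minimum. The run started on the right has no way of knowing that a deeper valley exists elsewhere.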

Evolutionary computation (EC) has emerged as a very powerful technique for generic optimization. Generic here means very little coupling to the details of the problem. Although you can integrate domain knowledge into EC, and although getting the best out of any algorithm requires you to be knowledgeable about the problem at hand, EC doesn’t assume any special properties of the target function, like continuity, differentiability or convexity. The only main requirement is that you can assign a measure of quality, or fitness in evolutionary terminology, to any input/output mapping.

On the origin of EC

Multiple measures are used to quantify how good a theory is, like having the smallest set of axioms, explaining observations well, making predictions and even elegance. However, I don’t think any theory is better than Darwin’s evolution by natural selection when it comes to the presupposed axioms. Basically, evolution by natural selection requires only two main components: random mutations and selection pressure. The idea is that at any given time you have a population of individuals that vary in their traits/properties due to random mutations and sexual crossover (crossover is the random blending between chromosomes that shuffles the genes you get from each of your parents). That means that the natural selection pressure at any given time will favor some individuals over others, since their properties are more “fit” for the selection criteria. For example, an efficient variant of a gene responsible for metabolism would be favored in an environment with scarce food resources, so its bearers will be at an advantage. They will survive, reproduce and have offspring that are more likely to inherit the gene. Consequently, the new generation will be better adapted to the harsh local environment, and over time the gene will tend to become more frequent in the population’s gene pool.

The idea of EC is just to apply these concepts in silico. In order to optimize a target function, you initialize a population of candidate solutions/inputs to your target function, then you evaluate their fitness (sometimes, as with the cost function for the packet route, the fitness is simply the output of the target function). Then, you select the best subset of individuals based on fitness. The individuals of this subset will be the parents of the next generation. You then mate these parents, applying crossover and mutation to produce the offspring, which form the next generation. This is usually repeated until a satisfactorily good solution is reached.
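The loop described above can be sketched in a few lines of Python. This is only an illustration under several assumptions of mine: individuals are lists of real numbers, the target function is a simple `sphere` function to be minimized, parents are the top-ranked individuals, crossover picks each gene from either parent, mutation adds Gaussian noise, and the loop runs for a fixed number of generations instead of stopping at a "good enough" solution. All names and parameter values are hypothetical.

```python
import random

def sphere(individual):
    """Hypothetical target function to minimize; its minimum is at the origin."""
    return sum(x * x for x in individual)

def fitness(individual):
    # Higher fitness is better, so negate the target we want to minimize.
    return -sphere(individual)

def random_individual(rng, num_genes=3):
    return [rng.uniform(-5.0, 5.0) for _ in range(num_genes)]

def crossover(mother, father, rng):
    # Uniform crossover (an assumption): each gene comes from either parent.
    return [rng.choice(pair) for pair in zip(mother, father)]

def mutate(individual, rng, sigma=0.1):
    # Gaussian mutation (an assumption): add small random noise to every gene.
    return [x + rng.gauss(0.0, sigma) for x in individual]

def evolve(population_size=40, num_parents=10, generations=200, seed=0):
    rng = random.Random(seed)
    # 1. Initialize a population of candidate solutions.
    population = [random_individual(rng) for _ in range(population_size)]
    for _ in range(generations):
        # 2. Evaluate fitness and 3. select the best subset as parents.
        parents = sorted(population, key=fitness, reverse=True)[:num_parents]
        # 4. Mate the parents (crossover, then mutation) to get the offspring.
        offspring = []
        while len(offspring) < population_size:
            mother, father = rng.sample(parents, 2)
            offspring.append(mutate(crossover(mother, father, rng), rng))
        # 5. The offspring become the next generation.
        population = offspring
    return max(population, key=fitness)

best = evolve()
print(best, sphere(best))
```

Note that nothing in `evolve` looks inside the target function. It only ever asks for a fitness value, which is exactly what makes the approach generic.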

No free lunch

Since EC mainly requires only a fitness value for each individual, it needs very little domain knowledge to function. However, as mentioned, for any optimization to work well, domain knowledge is indispensable. Domain knowledge can be integrated into EC mainly through the fitness function or the evolutionary operators, i.e. crossover and mutation. EC can be used to optimize nearly any kind of function, regardless of whether it is differentiable, continuous or convex, provided that a reasonable fitness can be calculated. Also, some optimization techniques, e.g. gradient-based ones, suffer from entrapment in local minima. As mentioned, a local minimum is a point that is lower than its neighborhood, but not necessarily the lowest point. Since gradient techniques follow the slope of the function landscape, they can follow the path down to a local minimum, and once they detect the flatness at the bottom, they stop there, since they can’t tell whether it is a local or a global minimum. EC, on the other hand, is far less prone to this problem, since it makes no assumptions about the target function and explores with a whole population rather than following the local slope from a single point. Since EC essentially evaluates different individuals/solutions at each generation, parallelization comes very naturally: each individual can be evaluated on a different computational unit (e.g. a CPU). This is especially useful if the fitness evaluation is expensive.
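As a small sketch of that parallelization, the Python standard library's `concurrent.futures` can map a fitness function over a population. The `expensive_fitness` function and the tiny population below are placeholders I made up. I use a thread pool here only to keep the example short and portable; for a CPU-heavy fitness function you would more likely reach for `ProcessPoolExecutor` or a cluster scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def expensive_fitness(individual):
    # Hypothetical stand-in for a costly evaluation (simulation, training run, ...).
    return -sum(x * x for x in individual)

def evaluate_population(population, fitness, max_workers=4):
    """Evaluate every individual independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fitness, population))

population = [(0.0, 1.0), (2.0, 2.0), (0.5, 0.5)]
scores = evaluate_population(population, expensive_fitness)
print(scores)
```

This works because no individual's fitness depends on any other individual's, so the evaluations can run in any order and on any worker.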

As you may be thinking now, especially if you are experienced in algorithms, “Why not use EC all the time? What’s the catch?” Indeed, there is no such thing as a free lunch. First of all, the main strength of EC, which is that it needs only the fitness, can be a drawback, for instance when the fitness is very expensive to calculate. For example, in neuroevolution (i.e. evolving neural networks), the fitness calculation usually requires training and testing the neural network, which is a very computationally expensive process. Another difficulty is the design of the representation, that is, how you encode your problem into a gene/attribute representation that makes the evolutionary search effective.

Unlike gradient-based techniques, which can be accelerated on commercial GPUs, EC has no commercial hardware analogue. Usually, if you are running an expensive EC experiment, you will need access to a cluster or an HPC system with hundreds of nodes to parallelize your computations across. Moreover, for EC to function well, you will need to choose a set of hyperparameters. Hyperparameters are a set of values and choices that control your experiment’s behavior, e.g. the mutation rate. These can’t be determined by the EC algorithm itself and usually take human experience and trial and error to figure out. Finally, although in practice EC performs reasonably well on most problems, in theory it offers no guarantee whatsoever of convergence, i.e. of arriving at a solution. Other techniques, like the gradient-based ones, can be analysed theoretically (though usually only for specific simple problems) and can be proved to converge within a given time complexity. This lack of guarantees, however, is shared by many other optimization techniques, especially on complex high-dimensional problems, and even by the gradient-based ones (as in non-convex problems), so we usually rely on empirical efficiency when evaluating them.

A brief taxonomy

There are different variations to how evolutionary concepts are implemented in silico. The main steps in an EC algorithm are

Fitness evaluation

Selection

Reproduction (mating)

As we discussed, fitness evaluation is the problem-specific part. EC delegates it entirely to you, the user of the algorithm. EC expects a fitness from you and it will do the rest. The other two steps are specific to the EC algorithm, and they are mainly where the different flavors of EC vary. Two concepts that will come in handy when discussing the EC flavors are genotype and phenotype. A genotype is the collection of genes contained in a given individual. In biology, the genotype (or the genome) is responsible, directly or indirectly, for every anatomical, physiological and psychological trait of an individual’s body and personality, through a complex influence during embryological development and throughout life. The specific collection of physical traits associated with or caused by a given genotype is called a phenotype.

The main three variations that I will be discussing briefly are

Genetic Algorithms (GA)

Evolution Strategies (ES)

Genetic Programming (GP)

I will then finish by briefly discussing an interesting extension to EC, namely Memetic Algorithms (MA).

Genetic Algorithms (GA)

GA are the most faithful to biological evolution. They require the problem, mainly the input to the target function, to be encoded into a genome. This is usually a string of bits, i.e. a string of 0s and 1s. The bit string is divided into parts called words, where each word (basically a sub-sequence of bits) can be decoded to give a real number that represents one of the input attributes. Selection is done using roulette wheel selection. This is a random selection, similar to spinning a roulette wheel multiple times to select the parents of the next generation. Each slot on the wheel corresponds to a given parent, but the slots differ in width according to the parent’s fitness, which increases the chance that highly fit parents get selected. As you may expect, the same parent can be selected more than once across different spins of the wheel.

Bit string. Source: Streichert, 2002.

Weighted Roulette wheel. Source: Streichert, 2002.
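Here is a minimal Python sketch of these two ideas, decoding a bit string into real-valued attributes and spinning a fitness-weighted wheel. Several details are my own assumptions: the genome is a Python string of "0" and "1" characters, every word has the same length, each word is mapped linearly onto a range [low, high], and fitness values are non-negative so they can serve directly as slot widths.

```python
import random

def decode(bits, word_length, low, high):
    """Split the bit string into words and map each word to a real number."""
    max_int = 2 ** word_length - 1
    values = []
    for start in range(0, len(bits), word_length):
        word = bits[start:start + word_length]
        as_int = int(word, 2)  # read the word as a binary integer
        # Linear scaling onto [low, high] (an assumption about the encoding).
        values.append(low + (high - low) * as_int / max_int)
    return values

def roulette_select(population, fitnesses, num_parents, rng):
    """Spin the wheel num_parents times; slot width is proportional to fitness.

    Assumes non-negative fitness values with at least one positive value.
    The same individual can be picked more than once.
    """
    return rng.choices(population, weights=fitnesses, k=num_parents)

# Two 8-bit words decoded onto the range [-1, 1].
print(decode("0000000011111111", word_length=8, low=-1.0, high=1.0))

rng = random.Random(42)
population = ["0000000011111111", "1111111100000000", "1010101001010101"]
fitnesses = [1.0, 3.0, 6.0]  # hypothetical fitness values
parents = roulette_select(population, fitnesses, num_parents=4, rng=rng)
print(parents)
```

In the terminology introduced earlier, the bit string plays the role of the genotype and the decoded list of numbers plays the role of the phenotype that the fitness function actually sees.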

Crossover is done between the selected parents to produce the offspring: two parents are paired, both strings are cut at the same random location (usually without respecting the word boundaries), and the resulting pieces are swapped between the two genomes. The offspring produced are then mutated as a final step to get the next generation. This is done by random bit flipping, i.e. turning a random 0 into a 1 or vice versa.
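A rough Python sketch of both operators follows, again with genomes as strings of "0" and "1" characters. The per-bit `mutation_rate` is an assumption on my part; it is one common way of deciding which bits to flip, but other schemes exist.

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    """Cut both genomes at the same random position and swap the tails."""
    # The cut ignores word boundaries; it only avoids the two ends of the string.
    cut = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

def bit_flip_mutation(bits, mutation_rate, rng):
    """Flip each bit independently with probability mutation_rate."""
    mutated = []
    for bit in bits:
        if rng.random() < mutation_rate:
            mutated.append("1" if bit == "0" else "0")
        else:
            mutated.append(bit)
    return "".join(mutated)

rng = random.Random(7)
child_a, child_b = one_point_crossover("00000000", "11111111", rng)
print(child_a, child_b)
print(bit_flip_mutation(child_a, mutation_rate=0.1, rng=rng))
```

Using all-zeros and all-ones parents makes the cut point easy to spot in the printed children, since each child switches from one parent's bits to the other's exactly once.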