Michael is a principal software engineer at Proficient Networks. He can be reached at [email protected]

Genetic Algorithms

A few months ago, the Texas legislature was in the midst of a slugfest over the state's congressional redistricting efforts. Republicans were in the driver's seat with 17 of 32 districts and Democrats were trying like anything to avoid losing additional seats. How bad was it? Well, Democrats were mad at Republicans for trying to pull a fast one; Republicans were mad at Democrats for leaving the state to avoid voting on the plan; Democrats were mad at Democrats for jumping ship and returning to Texas; and Republicans were mad at Republicans as they bickered over the spoils.

What a mess. You'd think that there would be an equitable and fair approach to creating these districts. All this led my friend Noah Kennedy to suggest that perhaps the problem could be solved in a more equitable (and perhaps "technocratic") manner: maybe it was just a simple matter of some well thought-out and well-executed software.

"Fair and equitable congressional redistricting" can be defined as (1) creating districts of equivalent population size while (2) minimizing each district's perimeter. The subdivision of the population would obviously be performed with a blind eye to the voting preferences of a given region. Given these parameters, the problem started to feel as if it might yield to an optimization algorithm.

While there are many ways to search for an optimal solution of this type, this problem looked similar to one I had solved before, when I used a genetic algorithm on the traveling salesman problem. Perhaps I could use a variation of the same approach to find an optimal redistricting solution for Texas. I was determined to help Texas.

A fair and equitable redistricting solution should provide:

Thirty-two equal population groups.

Compact district land representation to avoid gerrymandering.

Given these two points, how would you go about finding an optimal solution (the needle) among all the suboptimal solutions (the haystack)?

The first criterion (equal population size) appears easy enough to apply: just divide the state into 32 equal population groups. The second criterion acts as a constraint on the first and complicates matters. One tactic to satisfy the second condition would be to select 32 "centroids" within the state of Texas. A centroid would "attract" all the nearby population and thereby form a congressional district. These 32 districts would then satisfy the second condition, ensuring each district's compactness (that is, minimizing its perimeter length).

To test this approach, I used a year 2000 census dataset of 4390 data points with latitude, longitude, and population data. Population for each data point varied in size from 1000 to 12,000, with an average of about 5000 people. The job, therefore, was to find the set of 32 centroids out of the 4390 available points such that each district would have equal population sizes.

Genetic Algorithms

Genetic algorithms are a rather straightforward, flexible optimization technique modeled after evolutionary forces in nature (see the text box "Genetic Algorithms" for more details).

A typical genetic-algorithms approach uses an array of Booleans to represent an organism (each Boolean is considered a gene). An example of an organism is:

| 1 | 0 | 1 | 1 | 1 | 0 | 1 |...| 0   | 0   | 1 |
  1   2   3   4   5   6   7       n-2   n-1   n

The district representation cannot use Boolean values but needs to represent each gene as belonging to a specific district (therefore, a gene can have a value of 1 to 32):

| 1...32 | 1...32 | 1...32 |...| 1...32 | 1...32 |
     1        2        3          4389     4390

There would be 4390 genes in this representation (one for each point in the dataset). The array of genes within an organism represents a solution to the problem being solved (in this case, Texas redistricting). Every possible combination of genes within the array is considered a valid organism (or solution). Unfortunately, this representation does not lend itself well to the problem at hand, because most organisms would violate the compact-district constraint. The internal representation of the organism should be further simplified for this problem.

You can modify the organism so that each gene represents a centroid and can have a value of 1 to 4390 (one value for each point in the dataset). The organism then contains 32 genes:

| 1...4390 | 1...4390 | 1...4390 |...| 1...4390 | 1...4390 |
      1          2          3             31         32

Each gene now represents a centroid that will be used to assign all remaining local population data points. A limitation with this representation is that a gene's value must be unique within the organism (that is, no duplicate centroids within an organism). These changes force modifications to the standard genetic-algorithm operations (crossover and mutation) performed on the organism.
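The uniqueness requirement already matters when the first generation is created. The following is a minimal sketch, not the article's implementation: it assumes genes are stored as zero-based indices into the dataset rather than as PopPoint structures, and the function name makeRandomOrganism is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <unordered_set>
#include <vector>

// Hypothetical helper: builds one organism as a list of distinct
// centroid indices in the range [0, numPoints).
std::vector<std::size_t> makeRandomOrganism(std::size_t numPoints,
                                            std::size_t numGenes,
                                            std::mt19937 &rng)
{
    assert(numPoints > 0 && numGenes <= numPoints);  // uniqueness must be possible
    std::uniform_int_distribution<std::size_t> pick(0, numPoints - 1);
    std::unordered_set<std::size_t> used;
    std::vector<std::size_t> genes;
    genes.reserve(numGenes);
    while (genes.size() < numGenes) {
        const std::size_t candidate = pick(rng);
        if (used.insert(candidate).second)  // true only for a value not seen yet
            genes.push_back(candidate);
    }
    return genes;
}
```

For the Texas dataset, a call such as makeRandomOrganism(4390, 32, rng) would yield one candidate set of 32 centroids. Rejection sampling is cheap here because 32 is tiny compared with 4390.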

One final feature needs to be defined for this technique to work: rating an organism's fitness. How will the genetic algorithm determine whether a particular organism represents a good or poor solution? The fitness criterion can be defined as a partial standard-deviation calculation performed on the population spread within an organism. This is sufficient to rate the fitness of a solution. The lower the standard deviation, the better the solution; that is, the more equal the population sizes are across the 32 congressional districts. An important aspect of a good fitness criterion is that it represents a continuous solution space, with few or no discontinuities. A discontinuity may prevent convergence to an optimal solution. The partial standard-deviation equation as it applies to the fitness function is:

Fit = (PopDist - PopAvg) * (PopDist - PopAvg)

where PopDist is the population of the district and PopAvg is the average population size for all districts. Now, technocrats, flex your fingers and warm up your keyboards as we are ready to go to work and show the world what a little magic can do.
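As a rough illustration of how the per-district term could be combined into one rating, here is a minimal sketch. It assumes the terms are simply summed over all districts, which the article does not spell out, and the function name partialDeviation is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Sum of squared deviations from the mean district population.
// Lower is better; 0 means every district has the same population.
double partialDeviation(const std::vector<double> &districtPop)
{
    assert(!districtPop.empty());
    double total = 0.0;
    for (double p : districtPop)
        total += p;
    const double popAvg = total / static_cast<double>(districtPop.size());

    double fit = 0.0;
    for (double popDist : districtPop) {
        const double diff = popDist - popAvg;
        fit += diff * diff;  // (PopDist - PopAvg) * (PopDist - PopAvg)
    }
    return fit;
}
```

With two districts of 90 and 110 people, the average is 100 and the rating is 100 + 100 = 200; two districts of 100 each rate a perfect 0.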

Implementation and Design

The design consists of two main object hierarchies: the organism and the genetic population. The genetic population is a collection of organisms (see Figure 1). Both the organism and population behaviors have been generalized into base class implementations. Primitive data types have been templatized for these classes to allow for greater flexibility over the range of behaviors and characteristics for the organisms and populations.

GENOrganismBase is the base class for the organism (see Figure 1). It is templatized with the data type that represents each gene. The type is specified by the derived class DistrictOrganism; in this case, it is the PopPoint data structure. The PopPoint structure represents a single centroid location, the population of the centroid data point itself, plus the total population assigned to this centroid.

float          m_fLat;       //latitude
float          m_fLon;       //longitude
unsigned long  m_ulPop;      //gene's population only
unsigned long  m_ulTotalPop; //total district population
unsigned short m_usID;       //pop ID

With a typical genetic-algorithm approach, the template argument would normally be a Boolean rather than PopPoint. Again, using the PopPoint representation affects the mutation and crossover operations, which are implemented in the derived class, DistrictOrganism.

GENOrganismBase methods mateWith() and mutate() are pure virtual methods and are required to be overridden by the derived class (DistrictOrganism). The method mateWith() supports crossover operations, while mutate() supports (wild guess here) mutation. DistrictOrganism derives from GENOrganismBase and provides this specialization in behavior.

Within DistrictOrganism::mutate() (Listing One), mutation is no longer a random toggling of a Boolean cell value, but instead a random selection of one of 4390 values (provided that the value is not already represented within the gene sequence). Therefore, the organism's 32 current genes act as a constraint on mutation. Access to the full set of data is through the helper object DistrictBucket, which provides the method randomPick(). randomPick() lets the DistrictOrganism pass in a collection of data points to be excluded from the random-selection process, thereby preventing duplicate genes from being represented within the organism. The randomly selected PopPoint then replaces the selected gene. Mutation occurs at a frequency of fRate, which should be rare enough to allow convergence to an optimal solution, but not so rare that new paths to the optimal solution go untested.
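The internals of randomPick() are not shown in the listings, so the following is only a sketch of how an exclusion-aware pick might work. It assumes genes are zero-based dataset indices, and the names randomPickExcluding and mutateGenes are hypothetical stand-ins, not the article's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical analogue of DistrictBucket::randomPick(): draws an index in
// [0, numPoints) that does not already appear in 'exclude'.
std::size_t randomPickExcluding(std::size_t numPoints,
                                const std::vector<std::size_t> &exclude,
                                std::mt19937 &rng)
{
    assert(exclude.size() < numPoints);  // at least one value must remain
    std::uniform_int_distribution<std::size_t> pick(0, numPoints - 1);
    for (;;) {
        const std::size_t candidate = pick(rng);
        if (std::find(exclude.begin(), exclude.end(), candidate) == exclude.end())
            return candidate;
    }
}

// Replaces each gene with probability 'rate' while keeping all genes distinct.
void mutateGenes(std::vector<std::size_t> &genes, std::size_t numPoints,
                 double rate, std::mt19937 &rng)
{
    std::bernoulli_distribution flip(rate);  // plays the role of FLIP(fRate)
    for (std::size_t &gene : genes) {
        if (flip(rng))
            gene = randomPickExcluding(numPoints, genes, rng);
    }
}
```

Because the exclusion list is the organism's own gene list, a mutated gene can never collide with a sibling gene, including ones that were mutated earlier in the same pass.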

Crossover behaves in a similar fashion to mutation. In DistrictOrganism::mateWith() (Listing One), the crossing point of two organisms is a random location within the sequence of genes that is constrained to the unique set of genes between the two organisms (iLoc). This constraint is required so that duplicate cells don't appear as a result of a crossover operation. The organism is then spliced together with a partial set of the genes from the other organism at the crossover point.
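To make the uniqueness constraint concrete, here is a simplified variation on one-point crossover. It is not the mateWith() algorithm from Listing One: it assumes genes are plain dataset indices, takes the cut position directly instead of scaling it by the number of unique genes, and uses the hypothetical name crossoverUnique.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copies genes from 'other' into positions [0, cut) of a copy of 'self',
// skipping any gene that would duplicate one already in the child.
std::vector<std::size_t> crossoverUnique(const std::vector<std::size_t> &self,
                                         const std::vector<std::size_t> &other,
                                         std::size_t cut)
{
    assert(self.size() == other.size() && cut <= self.size());
    std::vector<std::size_t> child = self;
    for (std::size_t i = 0; i < cut; ++i) {
        // accept the other parent's gene only if the child does not hold it yet
        if (std::find(child.begin(), child.end(), other[i]) == child.end())
            child[i] = other[i];
    }
    return child;
}
```

If both parents contain distinct genes, the child does too, because every accepted gene is checked against the child's current contents before it is written.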

Population of Organisms

The DistrictPopulation class derives from the GENPopulation base class. The GENPopulation base class is templatized on the data type representing the organism (which needs to derive from GENOrganismBase); in this case, the derived DistrictOrganism implementation. Within the DistrictPopulation class, the main method of interest is the evolve() method (Listing Two). This method contains the processing loop that evolves each organism within a generation. Reproduction, crossover, and mutation are performed to produce a new generation of organisms based on the previous generation. Each iteration over the population selects two adjacent organisms and determines whether they should reproduce (at the reproduction rate). If these organisms reproduce, organisms possessing a higher fitness rating are given preference. Crossover and mutation operations are then applied to each organism. Finally, at the end of an evolution sequence, the method GENPopulation::evolutionComplete() is called, which rates the new generation of solutions.

The DistrictFitness class contains the fitness computation (derived from base class GENFitness) that performs the final and critically important step, rating the organisms according to their worthiness. Within this specific solution, the algorithm accesses the complete set of population data and assigns each population point to a gene until all population points have been assigned. Assignment is applied via a simple rule, where each population point is assigned to the nearest gene (centroid). Total populations are computed for each district, then a partial standard deviation is computed. It is a partial deviation (without the square root) because the square root is monotonic: organisms rank in the same order with or without it, so the computation can skip the CPU-expensive square root.
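The nearest-centroid rule can be sketched as follows. This is an illustration under stated assumptions, not the DistrictFitness code: the Point structure is a hypothetical stand-in for PopPoint, centroids are given as indices into the point list, and "nearest" is taken to mean plain squared distance in latitude/longitude, since the article does not say which distance measure it uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point {          // hypothetical stand-in for PopPoint
    float lat;
    float lon;
    unsigned long pop;
};

// Assigns every point to its nearest centroid and returns the total
// population collected by each centroid (one entry per district).
std::vector<unsigned long> districtTotals(const std::vector<Point> &points,
                                          const std::vector<std::size_t> &centroids)
{
    assert(!centroids.empty());
    std::vector<unsigned long> totals(centroids.size(), 0UL);
    for (const Point &p : points) {
        std::size_t best = 0;
        double bestDist = -1.0;  // negative means "no candidate yet"
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            const Point &ctr = points[centroids[c]];
            const double dLat = double(p.lat) - double(ctr.lat);
            const double dLon = double(p.lon) - double(ctr.lon);
            const double d = dLat * dLat + dLon * dLon;  // squared distance
            if (bestDist < 0.0 || d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        totals[best] += p.pop;  // a centroid is nearest to itself, so it counts too
    }
    return totals;
}
```

Comparing squared distances avoids a square root for the same reason the fitness calculation does: the ordering of candidates is unchanged. The resulting totals are what the partial standard deviation would then be computed over.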

The fitness computation is followed by a scaling operation. Scaling lets the implementation adjust the dynamic range of the fitness values and therefore aid the search for the optimal solution. The scaling can be overridden by the implementer just as the fitness is allowed to be overridden.

DistrictScale overrides the base class implementation, ScaleFitness, and adjusts the dynamic range of the fitness values. The ScaleFitness base class provides no modification (that is, it multiplies the passed-in value by one). To improve the chances of good organisms surviving into future generations, the fitness value is squared in the derived implementation. In initial runs of the genetic-algorithm system, I found that good organisms were often dying out before they were allowed to propagate to the next generation; therefore, I wanted to give good organisms an additional weight to survive the reproduction stage. Squaring the fitness skews the survival rate in favor of more fit organisms. In general, scaling, if applied correctly, provides faster convergence to an optimal solution.
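The effect of squaring can be seen in a small sketch. The class names IdentityScale and SquaredScale are hypothetical analogues of ScaleFitness and DistrictScale, and the sketch assumes two things the article does not state: that the fitness value has been mapped so that larger means better, and that survival is proportional to each organism's share of the total scaled fitness.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical analogue of the ScaleFitness base class: no modification.
class IdentityScale {
public:
    virtual ~IdentityScale() = default;
    virtual double scale(double fitness) const { return fitness; }
};

// Hypothetical analogue of DistrictScale: squares the fitness value.
class SquaredScale : public IdentityScale {
public:
    double scale(double fitness) const override { return fitness * fitness; }
};

// Share of the total scaled fitness held by organism i (assumed to be
// its chance of being favored during reproduction).
double selectionShare(const std::vector<double> &fitness, std::size_t i,
                      const IdentityScale &scaler)
{
    assert(i < fitness.size());
    double total = 0.0;
    for (double f : fitness)
        total += scaler.scale(f);
    assert(total > 0.0);
    return scaler.scale(fitness[i]) / total;
}
```

With two organisms rated 1 and 3, the better one holds 3/4 of the unscaled total but 9/10 of the squared total, which is the kind of extra weight the scaling is meant to give.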

Optimal Solutions

Figure 2 shows the convergence of the best gene by generation. The convergence trend shows that this approach continues to search for and find better solutions. Note that the x-axis is a log scale; therefore, continuing at the same rate of improvement will take exponentially more time. The shape of the convergence plot can be dramatically affected by the reproduction rate, crossover rate, and mutation rate. For example, a higher mutation rate gives a less smooth convergence plot.

The best solution found is in Figure 3 (computed after 9176 generations). The district population points are color coded to show where the optimal districts would exist in the state of Texas. Centroid points that constitute the center of a district are shown as numbers (black). Table 1 shows the same solution but with the population for each district and the difference from the average population each district would optimally have. This is a good solution, but further processing would probably yield an even better solution.

Given the structure of the data (4390 population points) and the limitation that districts were centered around "centroids," it would be possible to further converge on population equivalence between the congressional districts. Two approaches not applied here could be used to further improve this solution:

Change the nature of the centroid picking such that a finer granularity of selection on the centroid would be found.

Change the fitness function such that the fitness would allow "trading" population between adjacent districts to further even out the population differential between districts.

Additional work should be done to define district boundaries. One approach to defining a boundary representation for each district would be to compute the convex hull around each district (the group of population points) and then generate a Voronoi graph around these regions.

Conclusion

This work probably took less time than the effort expended in the halls of the Texas legislature. However, I suspect that the current political process in Texas would find little to embrace in this fair (and optimal) solution, so this becomes an exercise in possibilities. An infinitely more complex problem would be to model the political process and the resulting redistricting solution; for that, probably no amount of optimization magic can be applied.

DDJ

Listing One

void DistrictOrganism::mateWith(GENOrganismBase<PopPoint> &otherOrg, float fWhere)
{
    int iCt = 0, iLoc;
    GENOrganismBase<PopPoint>::GeneIter iter;

    //get number of unique cells between two genes
    iter = m_geneColl.begin();
    while (iter != m_geneColl.end()) {
        if (otherOrg.has(*iter) == false)
            ++iCt;
        ++iter;
    }
    //scale cell to this number
    iLoc = int(fWhere * float(iCt));
    iCt = 0;
    //splice at this location ensuring uniqueness
    iter = m_geneColl.begin();
    while (iter != m_geneColl.end() && iCt < iLoc) {
        if (otherOrg.has(*iter) == false) {
            //skip genes this organism already holds so that the
            //splice cannot introduce a duplicate
            PopPoint point = otherOrg.get(iCt);
            if (has(point) == false)
                *iter = point;
        }
        ++iCt;
        ++iter;
    }
}

void DistrictOrganism::mutate(float fRate)
{
    GENOrganismBase<PopPoint>::GeneIter iter;

    iter = m_geneColl.begin();
    while (iter != m_geneColl.end()) {
        if (FLIP(fRate)) {
            //passes in list of exclusion points
            *iter = m_pDistrictBucket->randomPick(m_geneColl);
        }
        ++iter;
    }
}


Listing Two

void DistrictPopulation::evolve()
{
    OrganismIter iterOne, iterTwo;
    DistrictOrganism orgOne, orgTwo;

    if (m_popPool->size() < 2)
        return;

    iterOne = m_popPool->begin();
    iterTwo = ++(m_popPool->begin());
    while (iterOne != m_popPool->end() && iterTwo != m_popPool->end()) {
        orgOne = *iterOne;
        orgTwo = *iterTwo;
        if (FLIP(m_fRepro)) {
            orgOne = doReproduction(orgOne);
            orgTwo = doReproduction(orgTwo);
        }
        doCrossover(orgOne, orgTwo);
        doMutation(orgOne);
        doMutation(orgTwo);
        addToNewGeneration(orgOne);
        addToNewGeneration(orgTwo);

        //advance to the next pair without stepping past end()
        ++iterOne; ++iterOne;
        ++iterTwo;
        if (iterTwo != m_popPool->end())
            ++iterTwo;
    }
    evolutionComplete();
}
