This Go project uses the genetic solver from my previous post to evolve a regular expression that matches all items in a set of wanted strings without matching any items in the set of unwanted strings. For example:

wanted := []string{"00", "01", "10"} unwanted := []string{"11"}

We desire a regular expression that finds strings from wanted without finding those from unwanted. We’ll start by determining the unique set of characters in the wanted strings:

func getUniqueCharacters(wanted []string) string { uniqueCharacters := make(map[string]bool) characters := "" for _, item := range wanted { for i := 0; i < len(item); i++ { token := item[i : i+1] if !uniqueCharacters[token] { characters += token uniqueCharacters[token] = true } } } return characters }

And we’ll combine those with a set of regex special characters to get our gene set:

const regexSpecials = "[]()|?*+" geneSet := getUniqueCharacters(wanted) + regexSpecials

Next, when we receive a candidate regex from the solver we’ll verify that it is a valid regex by trying to compile it.

func isValidRegex(candidate string) bool { if strings.Contains(candidate, "()") || strings.Contains(candidate, "??") { return false } _, err := regexp.Compile(candidate) return err == nil }

We also need a function to calculate the fitness of a given candidate based on the number of wanted and unwanted strings it matches.

func calculate(wanted, unwanted []string, geneSet string, candidate string) int { if !isValidRegex(candidate) { return math.MaxInt32 } regex := regexp.MustCompile("^(" + candidate + ")$") fitness := 0 for _, item := range wanted { if !regex.MatchString(item) { fitness++ } } for _, item := range unwanted { if regex.MatchString(item) { fitness += 10 } } return fitness }

Next we’ll need a way to visualize what’s happening:

disp := func(candidate string) { fmt.Println(candidate, "\t", calc(candidate), "\t", time.Since(start)) }

Finally we glue it all together as follows:

package main import ( "github.com/handcraftsman/GeneticGo" "fmt" "math" "regexp" "strings" "time" ) const regexSpecials = "[]()|?*+" func main() { wanted := []string{"AL", "AK", "AS", "AZ", "AR"} unwanted := []string{"AA"} geneSet := getUniqueCharacters(wanted) + regexSpecials calc := func(candidate string) int { return calculate(wanted, unwanted, geneSet, candidate) } start := time.Now() disp := func(candidate string) { fmt.Println(candidate, "\t", calc(candidate), "\t", time.Since(start)) } var solver = new(genetic.Solver) solver.MaxSecondsToRunWithoutImprovement = 3 solver.LowerFitnessesAreBetter = true solver.MaxRoundsWithoutImprovement = 10 var best = solver.GetBestUsingHillClimbing(calc, disp, geneSet, 10, 1, 0) fitness := calc(best) if fitness == 0 { println("

solved with: " + best) } else { println("

failed to find a solution") } fmt.Print("Total time: ") fmt.Println(time.Since(start)) }

Note: depending on the speed of your processor you may be able to reduce MaxSecondsToRunWithoutImprovement. For this problem I’ve set it to 1/10 second.

solver.MaxSecondsToRunWithoutImprovement = .1

Output:

$ go run samples/regex.go | 3 0 10 2 1.0001ms 1*0+ 1 131.0075ms 0*1?0* 0 335.0192ms solved with: 0*1?0* Total time: 335.0192ms

Cool!

Now, here’s a somewhat more interesting problem – repetition:

wanted := []string{"01", "0101", "010101"} unwanted := []string{"0011"} solver.MaxSecondsToRunWithoutImprovement = .1

Output:

$ go run samples/regex.go 0 3 1ms 01 2 1ms (01)+ 0 215.0123ms solved with: (01)+ Total time: 216.0123ms

Nailed it! And one more based on U.S. state mail codes:

wanted := []string{"AL","AK","AS","AZ","AR"} unwanted := []string{"AA"} solver.MaxSecondsToRunWithoutImprovement = .5

Output:

$ go run samples/regex.go R 5 0 AS 4 1.0001ms AK?L? 3 1.0560604s AS*L?K* 2 2.1301218s A[SKLZ] 1 2.62215s A[LRSZK] 0 3.1391795s solved with: A[LRSZK] Total time: 3.6392081s

The source code for this project is available from my GitHub repository.

Go on to part 5.