AUG95: ALGORITHM ALLEY

Biochemical Techniques Take On Combinatorial Problems

Peter Pearson

Peter is a cryptologist at Uptronics Inc., a cryptography and data-security company in San Jose, California. He can be reached at [email protected]

Readers of Dr. Dobb's Journal are accustomed to solving mathematical problems using "computers"--that is, boxes full of semiconductors, buses, RAM, and related gizmos. Consequently, it's hard to believe that a large class of difficult and intensely mathematical problems might be best solved not by pushing electrons through wires in a computer laboratory, but by mixing solutions in test tubes in a molecular-biology laboratory. Yet that is exactly the prospect suggested by Leonard Adleman, who applied the laboratory tools of modern molecular biology to the bogeyman problems of computer science (see "Molecular Computation of Solutions to Combinatorial Problems," by Leonard M. Adleman, Science, November 11, 1994).

There is a class of apparently intractable problems known as "NP-complete," which this article loosely calls "NP." It includes the well-known Traveling Salesman problem, as well as the Hamiltonian Circuit, Bin Packing, Graph 3-Colorability, Knapsack, and Generalized Instant Insanity (remember the Parker Brothers puzzle?) problems. Computer scientists and mathematicians have discovered many such problems, whose solution requires taking a (possibly large) number of objects and finding an arrangement that has a particular property or satisfies a requirement. Given the solution, you can quickly verify that it solves the problem, but there is no known "fast" way to find the solution.

A simple example is the Knapsack Problem: From a given finite set of integers, find a subset whose sum is a given x. Exhaustive search is a practical solution for small sets, but as the size of the set increases, the time required to find a solution increases faster than any power of the size. However, testing a candidate subset is easy: It requires no more additions than there are integers in the starting set.
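The contrast between finding and checking can be sketched in a few lines of Python. This is only an illustration: the function names and the sample integers are made up for the example, and the search is the plain exhaustive one described above.

```python
from itertools import combinations

def verify(subset, target):
    """Checking a candidate is cheap: one pass of additions."""
    return sum(subset) == target

def exhaustive_search(numbers, target):
    """Try every subset, smallest first; return the first hit, or None."""
    for size in range(len(numbers) + 1):
        for subset in combinations(numbers, size):
            if verify(subset, target):
                return subset
    return None

numbers = [3, 34, 4, 12, 5, 2]          # illustrative values only
print(exhaustive_search(numbers, 9))    # (4, 5)
print(exhaustive_search(numbers, 30))   # None
```

With six integers there are only 64 subsets to try. Each integer added to the set doubles that count, while the cost of `verify` grows by a single addition.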

To be admitted to the NP club, a problem must be proven equivalent to a problem already in NP. "Equivalent" means, casually speaking, that any instance of the new problem can be easily transformed into an instance of some NP problem and vice versa. This admission criterion guarantees that all problems in NP are about equally hard: If a quick way were found to solve Traveling Salesman problems, for example, then someone with a tough instance of the Knapsack problem could transform it into an instance of the Traveling Salesman Problem, solve it, and transform the solution back into the answer to the Knapsack problem. Thus, all members of NP stand or fall together.

Traditional estimates of these problems' difficulty assume that the problem is attacked on a conventional computer. The analysis of, say, a Knapsack problem might go something like this:

There are 100 integers in the whole set, so there are 2^100 (about 10^30) possible subsets. If I have to examine 10 percent of these subsets, using a million computers, each of which can examine a million subsets per second, then it will take 10^17 seconds, or three billion years.
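The estimate is easy to reproduce. The sketch below uses the exact value of two to the hundredth power rather than the rounded figure, so it lands slightly higher (roughly four billion years) but in the same order of magnitude. The variable names are illustrative.

```python
subsets = 2 ** 100                  # every subset of 100 integers
to_examine = subsets // 10          # 10 percent of them
rate = 10**6 * 10**6                # a million computers, a million subsets per second each
seconds = to_examine / rate
years = seconds / (365.25 * 24 * 3600)

print(f"{subsets:.2e} subsets, {seconds:.2e} seconds, {years:.2e} years")
```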

In a dramatic departure from conventional thinking, Adleman attacked a problem in NP using techniques found in molecular biology laboratories. (See the accompanying text box entitled "DNA Basics" for more background information.) He synthesized DNA molecules that represent randomly guessed answers, then searched through a huge number of them to pick out any correct answer. The number of guesses that can be tested with this approach is limited not by time and computing power, but by the number of DNA molecules you can handle. Since a gram of DNA might contain 10^18 smallish molecules, the millions of computers testing millions of subsets start looking puny in comparison.

Adleman solved the Directed Hamiltonian Path Problem: Given a map showing many cities and many one-way roads connecting cities, find an itinerary that starts at City A, ends at City Z, and passes through every other city exactly once.

In Adleman's approach, cities are represented by random "20-mers." That is, each city is assigned a sequence of 20 bases selected at random from the set {A,T,G,C}; see Figure 1. Roads are represented by 20-mers derived from the sequences of the cities they connect. For example, a road from City J to City D would be represented by a 20-mer whose first 10 bases are complementary to the first 10 bases of City J and whose last 10 bases are complementary to the last 10 bases of City D; see Figure 2. An exception is made for roads starting at the starting city or ending at the ending city (cities A and Z, in this example). These roads are extended by an extra 10 bases, so as to contain the full 20-base sequence complementary to the starting or ending city.
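The encoding can be sketched in Python under one stated assumption: every strand is written in its own reading direction, and "complementary" is modeled as the reverse complement defined in the "DNA Basics" box. The helper names (`random_city`, `road`) and the four-city path are hypothetical, chosen only for illustration.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Swap A<->T and G<->C, then reverse the reading order."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def random_city(rng, length=20):
    """A city is a random 20-mer over {A, T, G, C}."""
    return "".join(rng.choice("ATGC") for _ in range(length))

def road(cities, src, dst, start, end):
    """Road src->dst: complement of the first half of src, then of the last half of dst.
    Roads leaving `start` or entering `end` cover that whole city (10 extra bases)."""
    first = cities[src] if src == start else cities[src][:10]
    last = cities[dst] if dst == end else cities[dst][10:]
    return reverse_complement(first) + reverse_complement(last)

rng = random.Random(1995)
cities = {name: random_city(rng) for name in "AJDZ"}   # hypothetical four-city map
path = ["A", "J", "D", "Z"]
strand = "".join(road(cities, s, d, "A", "Z") for s, d in zip(path, path[1:]))

print(len(road(cities, "J", "D", "A", "Z")))   # 20
print(len(strand))                             # 80
print(strand == "".join(reverse_complement(cities[c]) for c in path))   # True
```

Under this model, the roads along a path from City A to City Z concatenate into a strand that carries one 20-base block per city visited, each block being the complement of that city's sequence.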

Using DNA-manipulation techniques developed by biologists, Adleman manufactured a bunch of "city" 20-mers and a bunch of "road" 20-mers, and mixed them all together in a pot. Because complementary DNA strands tend to stick together, a typical Road JD 20-mer will have its beginning half stuck to the beginning half of a City J 20-mer, and its ending half stuck to the ending half of a City D 20-mer; see Figure 3. The other half of the City D 20-mer will probably be stuck to some road that begins at City D, and so forth.

Next, Adleman added to this soup an enzyme that repairs "nicks" in DNA. This enzyme finds the places where the ends of two road 20-mers touch (in the middle of a city 20-mer) and welds the two ends together. (It also welds together the touching ends of city 20-mers.) The resulting DNA strands represent lists of roads that you can legally travel, called "itineraries."

Most itineraries look nothing like the answer to the problem: Some contain only a couple of roads, and some traverse one part of the map many times over without ever visiting some other part. Still, there was a chance that one of these DNA strands might represent the solution to the Directed Hamiltonian Path Problem, and Adleman had to find it.
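A rough software analogy for the soup is a large batch of random walks over the road map. The sketch below is a toy model, not a simulation of the chemistry: the five-city map is hypothetical (it is not Adleman's graph), and the 0.8 chance of adding another road is an arbitrary assumption.

```python
import random

# Hypothetical one-way road map: city -> cities reachable by a single road.
ROADS = {
    "A": ["B", "C"],
    "B": ["C", "D"],
    "C": ["B", "D"],
    "D": ["Z", "B"],
    "Z": [],
}

def random_itinerary(rng, max_roads=8):
    """Chain randomly chosen compatible roads, much as blind ligation would."""
    src = rng.choice([c for c in ROADS if ROADS[c]])
    itinerary = [src, rng.choice(ROADS[src])]
    while (ROADS[itinerary[-1]] and len(itinerary) <= max_roads
           and rng.random() < 0.8):
        itinerary.append(rng.choice(ROADS[itinerary[-1]]))
    return itinerary

def is_solution(itinerary, start="A", end="Z"):
    """Starts at A, ends at Z, and visits every city exactly once."""
    return (itinerary[0] == start and itinerary[-1] == end
            and sorted(itinerary) == sorted(ROADS))

rng = random.Random(0)
soup = [random_itinerary(rng) for _ in range(20_000)]
hits = {tuple(i) for i in soup if is_solution(i)}
print(len(soup), "random strands,", len(hits), "distinct solutions:", sorted(hits))
```

Only a small fraction of the strands are solutions, which is why the rest of the procedure is devoted to discarding everything else.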

He knew the length of the desired molecule: The number of roads taken must be one less than the number of cities on the map. Molecular biologists routinely use electrophoresis to separate DNA molecules by length: When an electric field pushes DNA molecules through a gel, longer molecules move more slowly. Adleman cut out the part of the gel containing strands of the desired length, extracted the DNA from the gel, and threw away everything else.

It was also obvious that all the molecules that don't pass through a given city could be discarded. Adleman did this using "City J" 20-mers attached to magnetic beads. These beads were mixed with the DNA from the gel, and time was allowed for itineraries that pass through City J to stick to the complementary sequence on the beads. He fished them out with a magnet and discarded everything else. By warming and changing the salinity of the solution, he unstuck the itinerary strands from the beads, giving a solution of right-sized itineraries that pass through City J.

Repeating this process for every city on the map leaves you with (if anything) an itinerary with exactly the required properties: It passes through every city; it can't pass through any city twice, because it's not long enough to hold the extra road; and the special handling of roads starting and ending at cities A and Z guarantees that City A must be first and City Z, last in the itinerary. Thus, if there's a molecule there, it's the answer.
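The two filtering steps can be mimicked on lists of city names. This sketch assumes the length bookkeeping implied by the encoding (20 bases per road, plus 10 at each special end) and assumes that City A can appear only first and City Z only last, as the extended roads require. The function names and the sample soup are hypothetical.

```python
def strand_length(itinerary, start, end):
    """Bases in the ligated road strand: 20 per road, plus 10 at each special end."""
    roads = len(itinerary) - 1
    return 20 * roads + 10 * (itinerary[0] == start) + 10 * (itinerary[-1] == end)

def gel_cut(soup, n_cities, start, end):
    """Electrophoresis step: keep strands of the target length, 20 bases per city."""
    return [i for i in soup if strand_length(i, start, end) == 20 * n_cities]

def bead_pull(soup, city):
    """Magnetic-bead step: keep strands that pass through `city`."""
    return [i for i in soup if city in i]

def find_paths(soup, cities, start, end):
    """Cut by length once, then pull with beads for each city in turn."""
    survivors = gel_cut(soup, len(cities), start, end)
    for city in cities:
        survivors = bead_pull(survivors, city)
    return survivors

soup = [
    ["A", "B", "C", "D", "Z"],        # the answer
    ["A", "B", "D", "Z"],             # skips C: too short
    ["A", "B", "C", "B", "D", "Z"],   # revisits B: too long
    ["B", "C", "B", "C", "D", "B"],   # right length, but never visits A or Z
    ["C", "D", "Z"],                  # a fragment
]
print(find_paths(soup, ["A", "B", "C", "D", "Z"], "A", "Z"))
# [['A', 'B', 'C', 'D', 'Z']]
```

Note that the fourth strand survives the gel and is eliminated only by the bead step for City A, which is why both kinds of filter are needed.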

A technique called the "Polymerase Chain Reaction" (PCR) can be used to duplicate many million-fold a single, special DNA sequence hidden in a soup of other DNA. PCR requires only that you know the sequence of the first several and last several bases of the sequence to be duplicated. Since you know the first 20 bases of the desired itinerary (because it starts at City A) and the last 20 bases (complementary to City Z), you can use PCR to make an abundance of the desired strand. To find where City J appears in the itinerary, PCR duplicates just the sequence from City A to City J, and the size of the resulting strands is measured by electrophoresis.
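The readout step amounts to sorting cities by product length. The sketch below assumes that the surviving strand carries one 20-base block per city, so that the product running from City A through a given city is 20 bases long for each city up to and including it. The names and the hidden itinerary are hypothetical.

```python
def pcr_product_length(itinerary, city, block=20):
    """Length of the strand amplified from the start city through `city`."""
    return block * (itinerary.index(city) + 1)

def recover_order(product_lengths):
    """Sort cities by the length of their start-to-city product."""
    return sorted(product_lengths, key=product_lengths.get)

hidden = ["A", "C", "B", "D", "Z"]    # the surviving strand, unknown to the experimenter
lengths = {city: pcr_product_length(hidden, city) for city in hidden}
print(lengths)                  # {'A': 20, 'C': 40, 'B': 60, 'D': 80, 'Z': 100}
print(recover_order(lengths))   # ['A', 'C', 'B', 'D', 'Z']
```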

Computer science is full of NP problems; usually, they relate to optimization. Typically, these problems need to be solved in seconds, and approximate solutions are usually acceptable. It's hard to imagine that very many of these problems would warrant interfacing a computer to an "NP coprocessor" with pumps, reagents, glassware, heaters, and gels. The small problem (a map with seven cities) on which Adleman demonstrated this technique took a week of lab work. A substantially larger problem probably wouldn't take appreciably longer, and the procedure could be automated to speed it up by a couple of orders of magnitude. Even so, this technique seems destined for use on large, nonurgent problems with very valuable answers.

Where are such problems found? Cryptology is one place. (It's no coincidence that Adleman is the "A" in the RSA public-key encryption algorithm.) For decades, cryptologists have been mining NP for problems around which ciphers might be built. The best pedigree for a cryptographic protocol is proof that breaking it is equivalent to solving some general problem in NP. But Adleman has knocked off the rails the traditional calculus of security that applies to these systems, and arguments that once postulated processors and microseconds may soon revolve around gallons of vat capacity.

References

Adleman, Leonard M. "Molecular Computation of Solutions to Combinatorial Problems." Science (November 11, 1994).

Garey, Michael R. and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA: W.H. Freeman, 1979.

Schneier, Bruce. "NP-Completeness." Dr. Dobb's Journal (September 1994).

DNA Basics

Deoxyribonucleic acid (DNA) is a long, thin, chain-like molecule made by connecting smaller molecules called "bases" (see Figure 4). Four different bases--typically called A, T, C, and G--occur. The number of bases in a DNA molecule can range from a mere handful to tens of millions. A DNA molecule can be specified by giving its sequence of bases in the order in which they appear, such as "ATCCATTAG...." Any sequence of bases is possible. There is directionality in a DNA strand, so the molecule AAAAATTTTT is not just TTTTTAAAAA viewed upside-down.

The A bases have a gentle attraction for the T bases, and the C bases for the G bases, such that two DNA molecules whose sequences "fit together" (see Figure 5) will tend to stick together. (This is the famous "double helix" configuration, though I ignore the helicity in my diagrams.) Two DNA sequences are called "complementary" if each equals the other in reverse order with As, Ts, Cs, and Gs replaced by Ts, As, Gs, and Cs, respectively.
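The definition of "complementary" translates directly into code. This is a minimal sketch with illustrative function names; the sequences are the examples used in this box.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Replace each base by its partner, then reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def are_complementary(first, second):
    """True if the two strands would pair up base for base."""
    return first == reverse_complement(second)

print(reverse_complement("ATCCATTAG"))                 # CTAATGGAT
print(are_complementary("AAAAATTTTT", "AAAAATTTTT"))   # True
print(are_complementary("AAAAATTTTT", "TTTTTAAAAA"))   # False
```

The last two lines show why directionality matters: AAAAATTTTT happens to be its own complement, yet it does not pair with TTTTTAAAAA.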

Over the past few decades, molecular biologists have discovered enzymes (large protein molecules that occur in cells) that perform such functions as cutting DNA molecules where certain sequences occur, assembling DNA molecules complementary to existing DNA molecules, linking separate DNA molecules into longer molecules, and more. These enzymes are now routinely harvested from bacteria and used in molecular-biology laboratories.

The typical cell in your body contains around 7 billion base pairs of DNA, with a total length of about 2 meters--one meter from each parent. This DNA occurs in 46 pieces called "chromosomes," and constitutes a vast library of recipes used in conducting the business of a cell. The DNA you got from each of your parents is thought to contain recipes for around 100,000 different proteins. A protein is a string of amino acids selected from a suite of 20, and the recipe for a protein is simply an ordered list of amino acids. Three consecutive DNA bases specify one amino acid in the protein, and the translation from DNA triplets to amino acids is done very tidily using a lookup table that also contains "end-of-protein" triplets. The replication of DNA (for dividing cells) and its translation into proteins are performed by complex proteins built (naturally) from recipes encoded in the DNA. A large part of the business of your body is carried out by proteins, and much of the rest is carried out by molecules built by proteins.
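The lookup-table idea can be shown with a handful of entries from the standard genetic code. The table below is deliberately partial (the full one has 64 triplets), and it is keyed on DNA triplets to match the simplified description above. The function name and the sample sequence are made up for the example.

```python
# A few entries of the standard genetic code; None marks an "end-of-protein" triplet.
CODONS = {
    "ATG": "Met", "TTT": "Phe", "GGC": "Gly", "GCT": "Ala",
    "AAA": "Lys", "TGG": "Trp",
    "TAA": None, "TAG": None, "TGA": None,
}

def translate(dna):
    """Read the sequence three bases at a time until an end-of-protein triplet."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODONS[dna[i:i + 3]]
        if amino_acid is None:
            break
        protein.append(amino_acid)
    return protein

print(translate("ATGTTTGGCAAATGGTAAGCT"))
# ['Met', 'Phe', 'Gly', 'Lys', 'Trp']
```

The trailing GCT is never translated because the TAA triplet ahead of it ends the protein.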

Multiplying the estimated 100,000 protein recipes (genes) by the typical length of a gene (a few thousand base pairs) gives a few hundred million--a small fraction of the total amount of DNA in the cell. It is presently unknown what function, if any, is served by all that extra DNA, but much of it consists of sequences that are stylistically different from protein recipes.

--P.P.

Figure 1: Adleman's DNA representation of a particular city.

Figure 2: Roads are represented by DNA sequences complementary to the cities they connect (sequences in the top row are read from right to left).

Figure 3: A mixture of roads and cities tends to self-assemble into possible itineraries.

Figure 4: (a) A schematic representation of the four bases from which DNA molecules are assembled. (b) Short DNA molecules are sometimes classified by the number of bases; for example, the illustrated 6-mer.

Figure 5: A double-stranded DNA molecule made from complementary 6-mers. Note that the backbones run in opposite directions.

Copyright © 1995, Dr. Dobb's Journal