Quick introduction: Generalizing the NK-model of fitness landscapes

February 23, 2019 by Artem Kaznatcheev

As regular readers of TheEGG know, I’ve been interested in fitness landscapes for many years. At its most basic, a fitness landscape is an almost unworkably vague idea: it is just a mapping from some description of organisms (usually a string corresponding to a genotype or phenotype) to fitness, alongside some notion of locality, i.e. some descriptions being closer to each other than to others. Usually, fitness landscapes are studied over combinatorially large genotypic spaces on many loci, with locality coming from something like point mutations at each locus. These spaces are exponentially large in the number of loci. As such, no matter how rapidly next-generation sequencing and fitness assays expand, we will not be able to treat a fitness landscape as simply an array of numbers and measure each fitness, at least not for any moderate or larger number of genes.

The space is just too big.

As such, we can’t consider an arbitrary mapping from genotypes to fitness. Instead, we need to consider compact representations.

Ever since Julian Z. Xue first introduced me to it, my favorite compact representation has probably been the NK-model of fitness landscapes. In this post, I will rehearse the definition of what I’d call the classic NK-model. But I’ll then consider how the model would have been defined if it had originally been proposed by a mathematician or computer scientist. I’ll call this the generalized NK-model and argue that it is not only mathematically more natural but also biologically more sensible.



For simplicity of presentation, let’s focus on biallelic systems.

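The classic NK-model is a fitness landscape on $\{0,1\}^n$. The n loci are arranged in a gene-interaction network where each locus $x_i$ is linked to K other loci $x^i_1, \ldots, x^i_K$ and has an associated fitness component function $f_i: \{0,1\}^{K+1} \rightarrow \mathbb{R}$. Given a genotype $x \in \{0,1\}^n$, we define fitness as $f(x) = \sum_{i=1}^{n} f_i(x_i x^i_1 \ldots x^i_K)$.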
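To make the definition concrete, here is a minimal Python sketch of a classic NK landscape. It is an illustration under my own assumptions rather than a canonical implementation: the helper names `make_classic_nk` and `classic_fitness` are hypothetical, the K neighbours of each locus are drawn uniformly at random, and each fitness component is a lookup table of uniform random values in [0, 1).

```python
import itertools
import random


def make_classic_nk(n, K, seed=0):
    """Build a random classic NK instance (hypothetical helper).

    Returns (neighbours, tables): neighbours[i] lists the K other loci
    linked to locus i, and tables[i] maps each assignment of those
    K + 1 loci to a fitness contribution.
    """
    rng = random.Random(seed)
    neighbours, tables = [], []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        neighbours.append(rng.sample(others, K))  # assumption: random wiring
        tables.append({
            bits: rng.random()  # assumption: uniform random components
            for bits in itertools.product((0, 1), repeat=K + 1)
        })
    return neighbours, tables


def classic_fitness(x, neighbours, tables):
    """Sum the n fitness components, one per locus."""
    total = 0.0
    for i, (nbrs, table) in enumerate(zip(neighbours, tables)):
        key = (x[i],) + tuple(x[j] for j in nbrs)
        total += table[key]
    return total


# Usage: a 5-locus landscape where each locus depends on 2 others.
neighbours, tables = make_classic_nk(n=5, K=2, seed=1)
print(classic_fitness((0, 1, 1, 0, 1), neighbours, tables))
```

Note that the number of components is tied to the number of loci: there is exactly one lookup table per locus, whatever the wiring.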

By varying K we can control the amount of epistasis in the landscape. With K = 0 we have a smooth landscape, and for higher K we can get various kinds of epistasis. The model also provides an upper bound of $n\binom{K+1}{2}$ on the number of gene pairs that have epistatic interactions, since each of the n fitness components involves K + 1 loci. This is the awkward part for me.

Consider the simplest non-trivial case: K = 1. Here, the fitness function will only have at most n fitness components, with each involving two loci. That means a connected gene-interaction graph would be either a tree or a tree with an extra edge. And a disconnected graph would have each component as a tree or tree with an extra edge. This seems incredibly restrictive given the $\binom{n}{2}$ possible pairs of loci.
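To get a feel for how restrictive this is, the following small Python sketch compares the most interacting pairs the classic model can express with the total number of pairs of loci. The function name `classic_pair_bound` is my own, and the count it returns is just the simple bound of n components times the pairs among the K + 1 loci that each component touches.

```python
from math import comb


def classic_pair_bound(n, K):
    """Upper bound on interacting pairs: n components, each on K + 1 loci."""
    return n * comb(K + 1, 2)


for n in (5, 10, 100):
    # columns: n, pairs expressible with K = 1, all possible pairs
    print(n, classic_pair_bound(n, 1), comb(n, 2))
# 5 5 10
# 10 10 45
# 100 100 4950
```

The expressible pairs grow linearly in n while the possible pairs grow quadratically, so the gap widens quickly.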

I think that a computer scientist or mathematician would never have given such an awkward definition. So let me propose another one (again, restricting to biallelic systems for simplicity of presentation).

The generalized NK-model is a fitness landscape on $\{0,1\}^n$. The n loci are arranged in a gene-interaction network which forms a hyper-graph where each interaction links a set $S$ of nodes together (such that $|S| \leq K + 1$) with an associated fitness component function $f_S: \{0,1\}^{|S|} \rightarrow \mathbb{R}$. If we let $E$ be the set of interactions and use $x[S]$ to mean the genotype $x$ on the loci in $S$, then given a genotype $x \in \{0,1\}^n$, we define fitness as $f(x) = \sum_{S \in E} f_S(x[S])$.
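As a rough illustration of the generalized definition, here is a minimal Python sketch in which the landscape is a dictionary from interaction sets to lookup tables. Everything named here is my own assumption for the sake of the example: the helpers `generalized_fitness`, `random_table` and `all_pairs_instance` are hypothetical, interaction sets are stored as sorted tuples of locus indices, and component values are uniform random numbers.

```python
import itertools
import random


def generalized_fitness(x, components):
    """Sum each component evaluated on x restricted to its loci S."""
    return sum(
        table[tuple(x[i] for i in S)]
        for S, table in components.items()
    )


def random_table(size, rng):
    """Random fitness component on `size` biallelic loci (assumption)."""
    return {
        bits: rng.random()
        for bits in itertools.product((0, 1), repeat=size)
    }


def all_pairs_instance(n, seed=0):
    """A K = 1 instance with every binary interaction plus n unary ones."""
    rng = random.Random(seed)
    components = {}
    for i in range(n):
        components[(i,)] = random_table(1, rng)
    for i, j in itertools.combinations(range(n), 2):
        components[(i, j)] = random_table(2, rng)
    return components


# Usage: 4 loci give 4 unary + 6 binary = 10 components.
components = all_pairs_instance(n=4, seed=1)
print(len(components))  # 10
print(generalized_fitness((1, 0, 1, 1), components))
```

The design choice worth noticing is that components are keyed by the set of loci rather than by a single focal locus, so the number of components is no longer tied to n. A classic NK instance is the special case with exactly n components, each of size K + 1.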

Now, if we turn back to the K = 1 case, we can have all possible binary interactions, plus up to n more unary interactions. This allows for a richer space of possible gene-interaction networks, which matters in particular if we want to understand what kinds of gene-interaction networks produce easy vs hard fitness landscapes.

More importantly, this allows our definition of the NK-model to align more clearly with well-studied theoretical computer science topics like valued-constraint satisfaction problems. For the details of the correspondence, take a look at Alexandru Strimbu’s post. This correspondence is probably the biggest mathematical advantage of the generalized definition.

But the advantage isn’t limited to mathematics. I think there is also an important aspect of biological interpretation to consider.

In particular, the classic NK-model seems to enshrine — or at least heavily favour — the one-gene one-function view of molecular biology.

The easiest way to interpret the fitness components in the NK-model is as ‘basic functions’ that together add up to the total fitness of the organism. In this way, the fitness components serve as a rudimentary decomposition of the genotype->phenotype map. But in this interpretation, if each gene is linked to a single fitness component then we are linking one gene to one function. Sure, that function is mediated by K other genes, but there are still no more of them than there are genes.

This seems like a strange dogma to just build into our model.

With the generalized model, we can avoid this. Instead, we can just talk of each component as a function in which the loci in S participate, without singling out any particular gene as the central cause of that function. To me, this feels more in keeping with how we should be thinking about molecular biology.

What do you think, dear reader?