For years I have struggled with a piece of advice given to me by my master's thesis advisor, Prof. David Wool. According to him, after completing the analyses of the data but before composing the discussion, one has to “take another good look at the data.” “Maybe the data is no good,” he said. “Maybe the data is good, but it is telling you a different story.”

Why have I struggled with this piece of advice? Because, soon after leaving Tel Aviv University to pursue a doctoral degree at the University of Texas Medical Center, I ceased working with manageable quantities of data and started working with “big data,” or what passed for big data in the early 1980s. At the beginning, the big data was manually tamable, say, eleven orthologous genes from seven taxa. I could easily look at the alignments and notice peculiarities that did not make sense. Soon, however, the data in the public databases that I used became too large for “taking another good look.” What was I supposed to see? Is it even possible to notice a meaningful pattern by merely visually examining the data?

In time, I learned that taking a good look at the data may mean something other than literally staring at the DNA sequences. It may, for example, entail performing weird statistical analyses that are unrelated to the research questions at hand. I would divide the data according to the laboratories that generated it, and compare nucleotide frequencies, codon usages, and autocorrelations among the different laboratories. Sometimes, the data from one laboratory looked exceptional in too many respects, and I knew that I needed to treat this data with extreme care or discard it altogether. My contemporaries at the Graduate School of Biomedical Sciences in 1984 may still recall my screaming in the corridors that I shall never again use DNA sequences generated at one particular laboratory in Naples.
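A per-laboratory comparison of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: the laboratory labels and sequences are invented toy values, and the helper names `nucleotide_frequencies`, `frequencies_by_lab`, and `max_deviation` are hypothetical, not a record of the analyses I ran in the 1980s. It pools nucleotide frequencies per laboratory and reports how far each laboratory strays from the pooled data.

```python
from collections import Counter

def nucleotide_frequencies(sequences):
    """Return the pooled relative frequency of A, C, G and T over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    return {base: (counts[base] / total if total else 0.0) for base in "ACGT"}

def frequencies_by_lab(records):
    """Group (lab, sequence) records by lab and compute frequencies per lab."""
    by_lab = {}
    for lab, seq in records:
        by_lab.setdefault(lab, []).append(seq)
    return {lab: nucleotide_frequencies(seqs) for lab, seqs in by_lab.items()}

def max_deviation(freqs, reference):
    """Largest absolute difference in any nucleotide frequency."""
    return max(abs(freqs[base] - reference[base]) for base in "ACGT")

# Invented toy records (assumption): lab labels and sequences are made up.
records = [
    ("lab_A", "ATGCATGCATGC"),
    ("lab_A", "ATGGCCATTACG"),
    ("lab_B", "ATGCGTACGTTA"),
    ("lab_C", "GGGGCGGGGCGG"),
]

per_lab = frequencies_by_lab(records)
pooled = nucleotide_frequencies(seq for _, seq in records)

for lab, freqs in per_lab.items():
    print(lab, round(max_deviation(freqs, pooled), 3))
```

A laboratory whose deviation is much larger than the others is not necessarily wrong, but it is the one whose data deserve the second look. The same grouping could be reused for codon usage or autocorrelations.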

I have used this method of taking a second good look at the data not only in my own research, but also when reading papers by other scientists. I use this method especially when the data show an intriguing pattern that is difficult to reconcile with existing theory. The reason for my being paranoically suspicious is that there are infinitely more wrong hypotheses out there than correct ones, and the wrong hypotheses have a tendency to be infinitely more interesting, intriguing, and publishable in Nature than the correct ones.

Let us now illustrate my “good-look” approach with an example. In 2008, Eugene Koonin and Yuri Wolf published a very long article entitled “Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world.”

In the article one can find the following passages:

“The sequenced bacterial genomes span two orders of magnitude in size, from ∼180 kb in the intracellular symbiont Carsonella rudii to ∼13 Mb in the soil bacterium Sorangium cellulosum. Remarkably, bacteria show a clear-cut bimodal distribution of genome sizes, with the highest peak at ∼2 Mb and the second, smaller one at ∼5 Mb… Although there are many genomes of intermediate size, this distribution suggests the existence of two, more or less distinct classes of bacteria, those with ‘small’ and those with ‘large’ genomes… The possibility remains that the bimodality of the bacterial genome size distribution is due to the bias of the genome sequencing efforts toward smaller genomes (such as those of symbionts and parasites) but with the growth of the genome collection, this explanation is becoming increasingly less plausible.”

In translation: the genome-size distribution in eubacteria has two peaks, and this bimodality is unlikely to be due to sampling. There are distinct classes of small genomes and large genomes.

What can explain this distribution? The explanation provided by Koonin and Wolf involves processes of genome contraction and genome expansion:

“It seems likely that the balance between the opposing trends of genome contraction caused by streamlining and degradation, and expansion via various routes shape are directly reflected in the size distribution of bacterial genomes, with the dominant peak shaped, primarily, by contraction and the second peak by expansion…”

Nick Lane in his “Energetics and genetics across the prokaryote-eukaryote divide” has a different explanation for the bimodality in eubacterial genome size.

“Incidentally, a bimodal distribution of genome size in bacteria is predicted on energetic grounds. Very small cells will be favoured in terms of replication speed, but must support a minimum number of genes for a free-living lifestyle. The smaller the cell, the smaller the absolute plasma membrane surface area for chemiosmotic coupling, which constrains energy per gene for very small cell sizes because the genome size is necessarily quite large in relation to surface area. As cells become larger, they presumably reach some kind of energetic optima in the region of 2,500-5000 genes. Much larger genomes, up to 10,000 genes, are not generally favoured, except in cells that have complex internal membranes, such as cyanobacteria and nitrifying bacteria. These complex bacteria are under less heavy selection pressure for replication speed (I assume because they draw on resources unavailable to other bacteria), but are still far less complex than quite mundane unicellular algae like Euglena. Why the gap? Many cyanobacteria with complex internal membranes are also polyploid, which permits a larger surface area of bioenergetic membrane, as in giant bacteria like Epulopiscium and Thiomargarita, but with all the costs and limitations discussed for them. I am not aware of systematic studies of metabolic rate in large cyanobacteria, but such studies could give invaluable insights into the energetic limitations of large bacterial genomes.”

Many other theories have been suggested in the literature to account for the bimodality found by Koonin and Wolf. Of course, many of the 216 articles that cite Koonin and Wolf (2008) deal with other subjects, but more than a few put forward ingenious theories to explain the weird genome-size distribution in eubacteria.

Let us now take a good look at the data. First, we repeated the 2008 analysis with data from 2014 (almost double the size of the 2008 data).

Let us now look at the particular columns that produce the bimodality: the numbered columns in the figure below.

My student, Yichen Zheng, found that columns 1&2 represent 537 genome entries from 221 eubacterial species. That is, 59% of the data in these columns consists of duplicates. The second peak, columns 5&6, represents 394 entries but only 163 species. Again, 59% of the entries consist of duplicates. In contrast, the control columns 3&4 represent 264 entries and 199 species. Only 25% of the entries are duplicates.
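The percentages follow directly from the counts, assuming that a "duplicate" is any entry beyond the first one for its species, so that the duplicate fraction is (entries − species) / entries. The short check below uses only the numbers quoted above; the function name `duplicate_fraction` is mine, chosen for illustration.

```python
def duplicate_fraction(entries, species):
    """Fraction of entries that are repeats of an already represented species.

    Assumption: every species contributes exactly one non-duplicate entry.
    """
    return (entries - species) / entries

# (entries, species) per pair of columns, as quoted in the text.
columns = {
    "1&2": (537, 221),
    "5&6": (394, 163),
    "3&4": (264, 199),
}

for label, (entries, species) in columns.items():
    print(f"columns {label}: {duplicate_fraction(entries, species):.0%} duplicates")
```

Rounded to whole percentages, this gives 59%, 59%, and 25%, matching the figures above.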

To put it another way, the two peaks result from sequencing multiple genomes belonging to some popular bacteria, such as Escherichia coli.

If one removes the duplicates, the genome-size distribution is reduced to a mundane, uninteresting, unimodal, blah distribution, which requires neither fancy talk nor ingenious formulations to explain its innards. There remains a small second peak, but I am sure that an additional close look at the data will cause it to disappear.
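To make the effect of deduplication concrete, here is a minimal sketch under loud assumptions: the species labels and genome sizes are invented toy values, not the 2014 data, and collapsing each species to its median genome size is one arbitrary choice among several (the first entry or the mean would serve equally well for illustration). The helper names `one_size_per_species` and `histogram` are hypothetical.

```python
from collections import defaultdict
from statistics import median

def one_size_per_species(entries):
    """Collapse (species, size_mb) entries to a single size per species.

    Assumption: the median is used as the representative size.
    """
    sizes = defaultdict(list)
    for species, size_mb in entries:
        sizes[species].append(size_mb)
    return {species: median(values) for species, values in sizes.items()}

def histogram(sizes, bin_width=1.0):
    """Count sizes per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    counts = defaultdict(int)
    for size in sizes:
        counts[int(size // bin_width)] += 1
    return dict(sorted(counts.items()))

# Invented toy entries (assumption): two heavily resequenced species
# and three species sequenced once each.
entries = [
    ("species_A", 5.1), ("species_A", 5.2), ("species_A", 5.4), ("species_A", 5.0),
    ("species_B", 2.1), ("species_B", 2.2), ("species_B", 2.3),
    ("species_C", 3.5),
    ("species_D", 4.2),
    ("species_E", 3.8),
]

raw = histogram(size for _, size in entries)
dedup = histogram(one_size_per_species(entries).values())

print("all entries:    ", raw)
print("one per species:", dedup)
```

In this toy set the raw histogram has peaks in the 2 Mb and 5 Mb bins, both created by the resequenced species, while the one-per-species histogram has a single peak in the 3 Mb bin.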

If one is to use the proper statistical jargon, the problem that we have identified is one of ascertainment bias or sampling bias. That is, our sample was collected in such a way that some members of the intended population were less likely to be included than others. This resulted in a biased sample, in which the various bacterial taxa were not equally likely to have been selected and some were selected an inordinate number of times. If this bias is not accounted for, results can erroneously be attributed to a real cause rather than to the method of sampling.

I still remember taking a memorable class in Biostatistics with Prof. David Wool in 1978. I particularly remember the session on sampling. He asked us to estimate the number of male and female students at Tel Aviv University. Where should we—the collectors of data—position ourselves around the campus? Near the School of Mathematics? Near the School of Elementary School Education? Near the Salad Bar at the Cafeteria? Near the Grill? Near the Men’s Room?