Consider garlic, a key ingredient of the Mediterranean diet: the USDA quantifies 67 nutritional components in raw garlic, indicating that this bulbous plant is particularly rich in manganese, vitamin B6 and selenium4. However, a clove of garlic contains more than 2,306 distinct chemical components5,6, from allicin, an organosulfur compound responsible for the distinct aroma of the freshly crushed herb, to luteolin, a flavone with reported protective effects in cardiovascular disease7. These components are listed in FooDB, a database representing the most comprehensive effort to integrate food composition data from specialized databases and experimental data. As of August 2019, FooDB records the presence of 26,625 distinct biochemicals in food8,9, a number that is expected to increase in the near future (see Supplementary Discussion 2). This exceptional chemical diversity could be viewed as the ‘dark matter’ of nutrition, as most of these chemicals remain largely invisible both to epidemiological studies and to the public at large.

Where does this remarkable chemical diversity come from? Living organisms require a large number of biochemicals to grow and survive in their limited environments, well beyond the nutritional components that we humans need in our diet. From an evolutionary perspective, plants are characterized by a particularly rich chemical composition, mainly because they are unable to outrun their predators; their defence is occasionally mechanical (for example, through the development of spikes) but is predominantly chemical, exercised through smell, taste and appearance. These chemical defences require an extensive secondary metabolism that produces a wide range of flavonoids, terpenoids and alkaloids. Polyphenols, a highly studied group of chemicals believed to be responsible for the health effects of tea and other plants, are the product of that secondary metabolism. The number of secondary metabolites is estimated to exceed 49,000 compounds, indicating that the roughly 26,000 chemicals currently assigned to food represent an incomplete assessment of the true complexity of the ingredients we consume10. Multiple environmental factors, from light to soil moisture, fertility and salinity, can influence the biosynthesis and accumulation of such secondary metabolites11. Humans and other animals, which can hunt for the necessary food sources, lack the ability to synthesize many molecules that their metabolism requires, such as ascorbic acid or alpha-linolenic acid, and therefore need a dietary source of these essential nutrients.

Overall, an analysis of USDA and FooDB data confirms that plants as a group have the highest chemical diversity, with approximately 2,000 chemicals detected in most examples. Yet 85% of these chemicals remain unquantified, meaning that while their presence has been detected or inferred, their concentration in specific food ingredients remains unknown (see Supplementary Discussion 2). For garlic, for example, FooDB reports a concentration for just 146 chemical components; the remaining 2,160 chemicals listed in FooDB are not quantified5,6. We therefore asked whether the scientific literature contains valuable information on food composition beyond that currently compiled by food databases. Indeed, experimental and analytical projects focused on specific foods and foodborne chemicals are published on a daily basis, and only a small fraction of them inform databases. To unveil this potentially hidden knowledge, we developed a pilot project, FoodMine, that uses natural language processing to mine the scientific literature and expand the available data on the biochemical composition of foods12.
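
The garlic figures quoted above can be checked with a few lines of arithmetic. The following is a minimal illustrative sketch, not part of FoodMine or FooDB; the constant and function names are hypothetical, and the only inputs are the two counts given in the text (2,306 listed components, 146 of them quantified).

```python
# Counts quoted in the text for garlic in FooDB (all names below are illustrative).
GARLIC_LISTED = 2306      # distinct chemical components listed for garlic
GARLIC_QUANTIFIED = 146   # components with a reported concentration


def quantified_fraction(quantified: int, listed: int) -> float:
    """Return the share of listed chemicals that have a reported concentration."""
    if listed <= 0:
        raise ValueError("listed must be a positive count")
    if not 0 <= quantified <= listed:
        raise ValueError("quantified must lie between 0 and listed")
    return quantified / listed


# Components that are listed but carry no concentration value.
garlic_unquantified = GARLIC_LISTED - GARLIC_QUANTIFIED
garlic_fraction = quantified_fraction(GARLIC_QUANTIFIED, GARLIC_LISTED)

print(f"unquantified components: {garlic_unquantified}")
print(f"quantified share: {garlic_fraction:.1%}")
```

The difference reproduces the 2,160 unquantified chemicals stated in the text, and the quantified share for garlic comes to roughly 6%.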

FoodMine identified 5,676 papers from PubMed that potentially report on the chemical composition of garlic. After filtering this list using machine learning, we manually evaluated 299 papers, of which 77 reported 1,426 individual chemical measurements pertaining to garlic’s chemical composition. Our pilot project recovered more unique quantified compounds than are catalogued by the USDA and FooDB together (see Supplementary Discussion 3 and Supplementary Table 1). For example, diallyl disulfide is known to contribute to garlic’s smell and taste, and is implicated in the reported health benefits of garlic, as well as in garlic allergy13,14. Although FoodMine found multiple publications reporting its concentration in garlic, the current databases do not offer quantified information for the compound. Furthermore, FoodMine identified information for 170 compounds that were not previously linked to garlic in either the USDA or the FooDB database (see Supplementary Discussion 3).
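
To put the screening numbers above in proportion, the sketch below computes how many papers survive each step. It is an illustration only and does not reproduce the FoodMine pipeline; the `Stage` class, the stage labels and the `retention` helper are hypothetical, and the counts are the ones reported in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the literature screen (labels are illustrative)."""
    name: str
    papers: int


# Paper counts reported in the text for the garlic pilot.
FUNNEL = [
    Stage("retrieved from PubMed", 5676),
    Stage("manually evaluated after machine-learning filtering", 299),
    Stage("reporting garlic measurements", 77),
]
MEASUREMENTS = 1426  # individual chemical measurements in the final 77 papers


def retention(stages: list[Stage]) -> list[float]:
    """Fraction of papers kept at each stage relative to the stage before it."""
    return [later.papers / earlier.papers
            for earlier, later in zip(stages, stages[1:])]


rates = retention(FUNNEL)
measurements_per_paper = MEASUREMENTS / FUNNEL[-1].papers

for stage, rate in zip(FUNNEL[1:], rates):
    print(f"{stage.name}: {stage.papers} papers ({rate:.1%} of previous stage)")
print(f"average measurements per reporting paper: {measurements_per_paper:.1f}")
```

On these counts, about 5% of the retrieved papers reached manual evaluation, about a quarter of those reported measurements, and each reporting paper contributed roughly 18.5 measurements on average.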

Taken together, we find that there is a wealth of exceptionally detailed information about food composition scattered across multiple literature sources. The current incompleteness of coverage in existing food composition databases is not due to a lack of interest in these chemicals or a lack of effort to map these chemical building blocks of food. Rather, it reflects the absence of systematic in-depth efforts to identify and catalogue the data scattered across multiple scientific communities and literature sources. As we discuss below, the high-throughput tools required to scan the scientific literature and to overcome these limitations have emerged in the past several years. Mobilizing them could set the stage for an in-depth and systematic understanding of the ways in which our food affects health.