Genome sequencing has fundamentally changed how plant biologists think about genes. All or nearly all genes can ultimately be associated with a gene model. However, many gene models appear to play little or no role in the traits of an organism. A range of structural, molecular, population and evolutionary features all show a separation between genes with known phenotypes and the overall set of annotated gene models. These different features could be combined to develop models that distinguish the genes that determine the traits of plants from the other annotated gene models, which are unlikely to play a role in doing so. Efforts to identify the subset of annotated gene models likely involved in specifying the characteristics of plants would aid a wide range of researchers.

I. Introduction As a plant geneticist, I spend a lot of time thinking about genes. Which genes are shared between species. Which genes are unique. Which genes stay in the same place in the genome. Which genes move around over evolutionary time. I am embarrassed to admit how long it took me to realize that there are two very different concepts that are each represented by the word ‘gene.’ The first concept is the definition of a gene that comes from Wilhelm Ludvig Johannsen, who first coined the word gene in 1905. A few years later he explained the origin of the word gene like this: ‘the word Gene is completely free from any hypothesis, it only expresses the assured fact that, at any rate, many characteristics of the organism are conditioned by special, separable, and therefore self‐existent fundamentals that occur in the gametes [translated from German].’ (Johannsen, 1909). In this original definition, a gene is defined by its impact on the characteristics of an organism. The second concept is the ‘gene model’ definition of a gene. A gene model is a region of the genome which is thought to be transcribed into RNA, and that RNA is, in turn, believed either to be translated into protein or to belong to one of a number of defined classes of noncoding RNA genes (Gerstein et al., 2007). The evidence for a gene model can come from one of several sources (Yandell & Ence, 2012; Mudge & Harrow, 2016). There may be direct evidence of transcription and/or translation in the target species. The sequence may show homology to a previously annotated gene either in the same species or a different one. The sequence may be identified by machine learning algorithms trained on other annotated genes. Small differences in which evidence is used to identify genes or how different evidence is weighted can produce big differences in how many genes are identified in a genome (Schnable et al., 2009; Kawahara et al., 2013; Jiao et al., 2017). 
Nowhere in the gene model definition of a gene is there a requirement that the gene play a role in specifying the characteristics of an organism. I am certainly not the first to point out the differences between these two definitions (Bennetzen et al., 2004; Graur et al., 2013; Doolittle et al., 2014).

II. Distinguishing features of gene models that correspond to genes Several recently published pieces of work provide evidence that there is a range of features which distinguish gene models corresponding to genes from those which do not. The existence of these differences also lends support to the assertion that, whereas all genes likely have corresponding gene models or eventually will, many gene models which appear to be genes based upon all sequence and molecular evidence may ultimately fail to meet the classical definition of a gene. To understand these differences, it is helpful to have a high confidence set of genes known to be free of false positive gene calls. In this Tansley insight, I will employ gene models with reported loss‐of‐function mutant phenotypes as a population of known true‐positive genes. The use of this set comes with an important caveat: not every Johannsen gene will produce a phenotype when mutated. Work in yeast, where characterization of the fitness consequences of both single and combinatorial mutations of all gene models is practical, indicates that, whereas knockouts of only 18% of yeast gene models are lethal under ideal conditions, significant numbers of additional genes produce lethal phenotypes when their knockouts are combined (as reviewed in Ooi et al., 2006). Intriguingly, pairs of individually nonlethal knockouts which combine to produce lethal phenotypes in yeast are often nonhomologous, indicating that these interactions are the result of something more complex than simple functional redundancy (Ooi et al., 2006). Even among the cases where a mutation in a single gene produces an observable phenotype, that phenotype may be observable only under certain growth conditions (Lloyd & Meinke, 2012). It is not presently feasible for plant biologists to phenotype mutants across the full range of growth conditions a plant may encounter in the field. 
Thus, although it is possible to confidently say that every gene model with a validated loss‐of‐function phenotype is indeed a gene by the Johannsen definition, the inverse – a gene model without a validated loss‐of‐function phenotype is not a gene – is not consistently true. Here I will focus on populations of known true positive genes in two model species for plant genetics, arabidopsis and maize (Schnable & Freeling, 2011; Lloyd & Meinke, 2012). What features distinguish these known true positive genes from the overall population of gene models? The average observed mRNA level for gene models across many tissues in maize shows a bimodal distribution (Walley et al., 2016) and the maize true positive gene set is near universally found within the higher distribution (Liang et al., 2019). In maize, extensive presence/absence variation has been described for both annotated gene models in the reference genome (Wang & Dooner, 2006; Springer et al., 2009; Swanson‐Wagner et al., 2010), as well as for gene model sequences present in the population but not in the reference genome (Hirsch et al., 2014; Springer et al., 2018; Sun et al., 2018). The maize true positive gene set shows almost no observable presence/absence variation (Liang et al., 2019) and data from arabidopsis are consistent with this finding (Bush et al., 2013). Genes exhibiting presence/absence variation in arabidopsis (Bush et al., 2013) are significantly less likely to overlap with the arabidopsis true positive gene set described by Lloyd & Meinke (2012) (P = 7.6e‐5; χ2 test). Depending on the set of maize gene model annotations used, 40–60% of all maize gene models are new ones not present in the most recent common ancestor of maize and closely related species (Schnable et al., 2009, 2012; Murat et al., 2010; Springer et al., 2018). 
These new gene models exhibit presence/absence variation at more than twice the rate of syntenically conserved genes (Schnable et al., 2011) and the lower of the two peaks in the bimodal distribution of average expression consists almost entirely of these genes (Walley et al., 2016). Nonsyntenic genes with evidence of transcription are less likely to exhibit evidence of translation than syntenic genes (Walley et al., 2016). The maize known true positive gene set includes very few of these new genes (Schnable & Freeling, 2011). Similar results can be obtained in arabidopsis: syntenically conserved gene models account for 66% of annotated arabidopsis gene models, but these conserved gene models are 7.5 times more likely to be present in the arabidopsis known true positive gene set than the remaining 34% of gene models (P < 2.2e‐16; χ2 test) (Lyons et al., 2008; Lloyd & Meinke, 2012). The remaining gene models consist of older genes which are conserved at syntenic orthologous regions of the genome in multiple species. These genes exhibit lower rates of presence/absence variation in maize (Schnable et al., 2011) and arabidopsis (new analysis based on data from Lyons et al., 2008; Bush et al., 2013): syntenically conserved gene models in arabidopsis are 6× less likely to exhibit presence/absence variation than nonsyntenic gene models (P < 2.2e‐16; χ2 test). These syntenically conserved genes near‐universally belong to the higher of the two modes of average mRNA expression level and show greater evidence of translation into protein (Walley et al., 2016). Maize genes with validated roles in specifying the characteristics of the organism therefore appear to experience stronger purifying selection (Liang et al., 2019). Syntenically conserved genes are more likely to show links to variation in quantitative or qualitative traits than nonsyntenic genes (Wallace et al., 2014; Schnable, 2015).
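The enrichment tests reported in this section compare counts in a two-by-two table (for example, syntenic vs nonsyntenic gene models, with vs without a reported loss‐of‐function phenotype). A minimal sketch of such a χ2 test is given here in Python. The function name `chi_square_2x2` and all counts are hypothetical placeholders invented for illustration; they are not the data underlying the P-values quoted above. The sketch also assumes a Pearson test without continuity correction, which may differ from the procedure used in the original analyses.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 d.f., no continuity correction) for
        [[a, b],
         [c, d]]
    Returns (statistic, p_value)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a non-zero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 d.f.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# HYPOTHETICAL counts, invented for illustration only.
syntenic_with_phenotype, syntenic_without = 300, 17700
nonsyntenic_with_phenotype, nonsyntenic_without = 20, 8980

stat, p_value = chi_square_2x2(
    syntenic_with_phenotype, syntenic_without,
    nonsyntenic_with_phenotype, nonsyntenic_without,
)

# Fold difference in the fraction of gene models with a reported phenotype.
rate_syntenic = syntenic_with_phenotype / (syntenic_with_phenotype + syntenic_without)
rate_nonsyntenic = nonsyntenic_with_phenotype / (
    nonsyntenic_with_phenotype + nonsyntenic_without
)
fold = rate_syntenic / rate_nonsyntenic

print(f"chi-square = {stat:.1f}, P = {p_value:.3g}, fold enrichment = {fold:.1f}")
```

With real data, the four counts would come from intersecting a syntenic gene list with a list of gene models that have validated phenotypes. A significant result indicates association between the two classifications and says nothing about whether any individual gene model is a Johannsen gene.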

III. Potential origins of gene models that do not correspond to genes A picture begins to emerge of two populations of gene models. One set shows average higher mRNA expression, evidence of stronger purifying selection, conservation at syntenic orthologous positions in the genome across multiple species, and is consistently present in the genomes of all individuals of a given species assayed. This population contains almost all of the genes linked to the characteristics of the organism (Fig. 1). The second set includes many more nonsyntenic gene models. These gene models are less likely to be transcribed, and those which are transcribed are less likely to be translated. In maize this second population shows substantial presence/absence variation among diverse lines (Schnable et al., 2011; Springer et al., 2018), and as a result, the size of this second population expands substantially if gene models not represented in the reference genome but present within some individuals of a species are included (Hirsch et al., 2014).
Figure 1 Integrated analysis of distinctions between known true positive genes and annotated gene models without annotated functions. (a) First two principal components (PCs) of variation for 39 656 maize genes included in the B73 RefGenV2 high confidence gene set. Data included in the original PC analysis were: up copy number variation (CNV; an increase in the number of copies of the sequence for a given gene model present in some individuals within a species relative to the reference genome) and frequency of CNV; down copy number variation/down presence/absence variation (a decrease in the number of copies of the sequence of a given gene model present in some individuals within a species relative to the reference genome, where presence/absence variation (PAV) is the subset of down copy number variation where the number of copies decreases from 1 to 0) and frequency of PAV (from Swanson‐Wagner et al., 2010); average RNA abundance across many tissues, number of tissues in which a gene model was transcribed to at least 1 FPKM, number of tissues in which a gene model was transcribed to at least 10 FPKM, presence of detectable translation, average protein abundance and number of tissues where translation was detected (from Walley et al., 2016); syntenic conservation in sorghum (from Zhang et al., 2017); and total gene length, exon number, total exon length and total intron length. Relative PC loadings for a representative subset of these factors are shown in black. The density distribution for all high confidence annotated maize gene models across these first two PCs is indicated in red, whereas the density distribution for a set of 109 maize genes identified through forward genetics as listed in Schnable & Freeling (2011) is indicated in blue. (b) Distribution of PC1 scores from (a) for both genes with validated loss‐of‐function phenotypes and gene models without. PC1 values inverted for consistency with panel (c). (c) Distribution of PC1 scores from Supporting Information Fig. S1 for arabidopsis genes with known loss‐of‐function phenotypes divided into four categories following the assignments of Lloyd & Meinke (2012) as well as annotated arabidopsis gene models without reported loss‐of‐function phenotypes. 
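The integrated analysis summarised in Fig. 1(a) can be sketched as a principal component analysis of standardized per-gene-model features. The Python sketch that follows is an illustration only. The eight gene models, their feature values and the restriction to four features are all invented, and the first PC is found by power iteration, a simple stand-in for whatever PCA routine was used to produce the figure.

```python
import math

# HYPOTHETICAL feature table, invented for illustration only.
# Columns: log2 mean FPKM, PAV frequency, syntenic (0/1), translated (0/1).
FEATURES = ["log2_mean_fpkm", "pav_frequency", "syntenic", "translated"]
GENE_MODELS = [
    [5.1, 0.00, 1, 1],  # rows 0-3: resemble conserved, well-expressed genes
    [4.6, 0.00, 1, 1],
    [5.8, 0.02, 1, 1],
    [3.9, 0.00, 1, 0],
    [0.4, 0.35, 0, 0],  # rows 4-7: resemble nonsyntenic, variable gene models
    [0.9, 0.20, 0, 0],
    [0.2, 0.50, 0, 0],
    [1.5, 0.10, 0, 1],
]


def standardize(rows):
    """Z-score every column (assumes no column is constant)."""
    columns = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((x - mean) ** 2 for x in col) / (len(col) - 1))
        columns.append([(x - mean) / sd for x in col])
    return [list(row) for row in zip(*columns)]


def covariance(rows):
    """Covariance matrix of already-centred rows."""
    n, k = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(k)]
        for i in range(k)
    ]


def first_pc(cov, iterations=200):
    """Leading eigenvector of a covariance matrix by power iteration."""
    k = len(cov)
    v = [1.0] + [0.0] * (k - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


z = standardize(GENE_MODELS)
loadings = first_pc(covariance(z))
# The sign of a PC is arbitrary; orient it so expression loads positively.
if loadings[0] < 0:
    loadings = [-x for x in loadings]
scores = [sum(l * x for l, x in zip(loadings, row)) for row in z]

for name, loading in zip(FEATURES, loadings):
    print(f"{name:>15}: {loading:+.2f}")
print("PC1 scores:", [round(s, 2) for s in scores])
```

In this toy table, expression, synteny and translation load in one direction on PC1 and PAV frequency loads in the other, so a single axis separates the two invented groups. The toy data were constructed to behave this way, so the separation shown here is not evidence about real genomes.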
Where does this second population of gene models – which, individually, appear less likely to play significant functional roles in determining the characteristics of plants – come from? Why are they present in the genome at all? Some are doubtless the product of transposon activity. Both helitrons and pack‐mules can capture whole genes or individual exons (Hanada et al., 2009; Barbaglia et al., 2012). Many of these captured exons are transcribed into mRNA, and some show evidence of translation (Hanada et al., 2009; Barbaglia et al., 2012). Non‐transposon‐mediated mechanisms also can create positionally novel gene models (Freeling et al., 2008; Woodhouse et al., 2010). New transcribed regions also can arise de novo from intergenic DNA, including both transcribed long noncoding RNAs and even transcribed sequences containing novel open reading frames (ORFs) (McLysaght & Guerzoni, 2015). The unifying factor is that these mechanisms provide a constantly renewing population of gene‐like sequences. Many of these new sequences will have homology to genes with known phenotypes. They may contain ORFs, be transcribed or even be translated into proteins. Yet these positionally new genomic sequences arise from neutral mechanisms of genome evolution. It is quite possible for a gene, duplicated in a new chromosomal context by a neutral mechanism, to take on a new function. However, it seems likely that, as with most forms of mutation, most such events are neutral; of those which are not neutral, most are deleterious, and some small number provide a fitness benefit (Ohta, 1992). There is, thus, no reason to expect that a given gene model, particularly one exhibiting the various markers that distinguish this second population of genes from those with known and characterized roles, will play a role in determining the ultimate characteristics of a particular plant.

IV. Future prospects for gene model classification Should gene models which lack evidence of translation into protein, or which show presence/absence variation, be removed from genome annotations? No. The patterns described above are trends, not hard and fast rules. What these patterns do teach us is that there is a great deal of information captured in the molecular, population genetic, and comparative genomic characteristics of any given gene model. These data can be used to classify which gene models are likely to produce phenotypes (Lloyd et al., 2015). The annotations of many plant genomes already make a first pass attempt at this by providing high and low confidence gene sets (Schnable et al., 2009; Kawahara et al., 2013; Jiao et al., 2017). However, a great deal of more recent information about how known true positive genes differ from the overall population of gene models is currently not considered when these classifications are made. A reanalysis of the most recent version of the B73 genome identified thousands of gene models which were located within transposons, exhibited presence/absence variation relative to W22, and were expressed only at low levels (Springer et al., 2018; Anderson et al., 2019). All of these data types could be integrated into more accurate and more quantitative machine‐learning‐based assessments of the probability that a given gene model will ultimately meet the Johannsen criteria to be defined as a gene (Fig. 1). The feasibility of using machine learning to make predictions by integrating multiple molecular and evolutionary features has been repeatedly demonstrated (Lloyd et al., 2018a,b; Washburn et al., 2019). One concern common to many applications of machine learning is that bias in the data used to train a model may produce misleading outcomes (Calders & Žliobaité, 2013). 
Validated true positive genes identified based on the phenotypic consequences of loss‐of‐function alleles are unlikely to contain false positives, but also unlikely to be a random sample of Johannsen genes. A host of factors encourage geneticists to work with genes that produce constitutive but non‐lethal phenotypes rather than either genes with lethal phenotypes or those with conditional phenotypes, and genes falling into the first category are likely overrepresented in the maize classical gene list (Schnable & Freeling, 2011). Replicating the analyses from Fig. 1(a) with arabidopsis (Supporting Information Fig. S1) demonstrated that the same principal component that predicted the likelihood that an annotated gene model was associated with a validated gene operated approximately equally well for genes with lethal, constitutive non‐lethal and conditional loss‐of‐function phenotypes (Fig. 1c) (Lloyd & Meinke, 2012). Bias also can be introduced by the use of molecular or population or evolutionary features which are available for some genes but not others. Estimates of purifying selection (Ka/Ks ratios) are a useful example. Evidence of purifying selection from Ka/Ks ratios can be strong supporting evidence for function (Hanada et al., 2009; Graur et al., 2013). However, informative scans for different types of selection rely on accurate knowledge of the evolutionary history of gene sequences. For genes conserved at syntenic orthologous locations, these assumptions are generally well supported. For nonsyntenic gene models, including those derived from helitron‐captured exons, particular care is necessary to confidently identify orthologous sequences (Yang & Bennetzen, 2009). Many nonsyntenic genes which are single copy across pairs of maize inbreds are nonallelic homeologues rather than alleles (Liu et al., 2012; Brohammer et al., 2018; Anderson et al., 2019). 
It also should be noted that these analyses focus on protein coding genes, as the majority of gene models with known loss‐of‐function phenotypes fall into this category. Trends are likely to be different for genes which influence phenotype through other mechanisms such as microRNA genes and lncRNAs. A number of factors including gene length, evidence of translation, syntenic conservation in related species and average mRNA abundance have substantial predictive power for odds of being a validated gene in both maize and arabidopsis (Figs 1a, S1). With the exception of syntenic conservation, which was not evaluated, these same factors also were found to predict the likelihood that a gene would be a focus of significant research efforts in humans (Stoeger et al., 2018). Repeated identification of some of the same predictors across different plant species and different eukaryotic lineages suggests it may be possible to employ transfer learning from well‐characterized genetic model species to also improve the identification of Johannsen genes in non‐model species (Lloyd et al., 2015). Whether trained in a single species directly or utilizing transfer learning, prediction models will necessarily be imperfect. Biology is a field of study notorious for the existence of exceptions to nearly every rule. Yet the existence of exceptions is not a compelling argument to avoid the study of what features make a gene model more or less likely to act as a classically defined gene, just as the existence of synonymous mutations which affect protein function (Kimchi‐Sarfaty et al., 2007) is not a compelling argument to abandon the distinction between nonsynonymous and synonymous SNPs within protein‐coding genes or to abandon more detailed efforts to predict the likely functional consequences of individual mutations based on either mechanistic (Cingolani et al., 2012) or observational models (Cooper et al., 2005).
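The kind of quantitative assessment proposed in this section can be sketched as a small classifier that turns per-gene-model features into a probability. The Python sketch that follows uses logistic regression fitted by gradient descent. This is an assumption made for illustration and is not the method of any study cited here. The training table, feature choice, learning rate and penalty are all hypothetical. Note also that a label of 0 means only 'no reported loss‐of‐function phenotype', which, as discussed above, is not the same as 'not a gene'.

```python
import math

# HYPOTHETICAL training data, invented for illustration only.
# Columns: log2 mean FPKM, PAV frequency, syntenic (0/1), translated (0/1).
TRAIN_X = [
    [5.1, 0.00, 1, 1], [4.6, 0.00, 1, 1], [5.8, 0.02, 1, 1], [3.9, 0.00, 1, 0],
    [0.4, 0.35, 0, 0], [0.9, 0.20, 0, 0], [0.2, 0.50, 0, 0], [1.5, 0.10, 0, 1],
]
# 1 = validated loss-of-function phenotype; 0 = none reported (unlabelled).
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, bias, x):
    """Estimated probability that a gene model resembles the positive set."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(xs, ys, lr=0.1, epochs=2000, l2=0.01):
    """Full-batch gradient descent with an L2 penalty on the weights."""
    n, k = len(xs), len(xs[0])
    weights, bias = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(xs, ys):
            err = predict(weights, bias, x) - y
            grad_b += err
            for j in range(k):
                grad_w[j] += err * x[j]
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


weights, bias = fit_logistic(TRAIN_X, TRAIN_Y)

# Two new, equally hypothetical gene models.
p_conserved = predict(weights, bias, [5.0, 0.00, 1, 1])
p_novel = predict(weights, bias, [0.5, 0.40, 0, 0])
print(f"conserved-like: {p_conserved:.2f}, novel-like: {p_novel:.2f}")
```

A real application would need many more gene models, held-out evaluation, and attention to the training biases described above, since the positive set is not a random sample of Johannsen genes and some features are unavailable for nonsyntenic gene models.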

V. Conclusion If nothing else, I hope that readers will remember that a gene model is a hypothesis about the presence of a gene. No one would consider all hypotheses equally likely to be true. Yet, we are all, and I include myself, sometimes guilty of the assumption that any gene model must provide some functional role. Some hypotheses, and some gene models, are strongly supported by multiple rounds of testing and evidence. Many hypotheses, and many gene models, are interesting ideas that have been barely tested, if at all. One of the key distinguishing factors of the best scientists I have had the opportunity to work with during my career is that they uniformly did an outstanding job of sorting through countless hypotheses to identify that subset where an experiment was likely to produce the most interesting outcomes, regardless of whether the hypothesis in question was found to be false.

Acknowledgements I would like to thank Zhikai Liang for informative discussions throughout the course of drafting this manuscript. I also owe thanks to Mitchell Feldmann, John Fowler, Justin Walley, and Damon Lisch, who all participated in a discussion on Twitter which sparked many of the ideas regarding the challenges of accurately determining purifying selection for gene models which both lack syntenic orthologues in other species and exhibit presence/absence variation among individuals of a single species. I also must thank an anonymous reviewer who proposed the analogy between predicting which gene models are more or less likely to have function, and prior efforts to predict which mutations are more or less likely to have fitness consequences.

Supporting Information Please note: Wiley Blackwell are not responsible for the content or functionality of any Supporting Information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office. Fig. S1 First two principal components of variation for 27 416 arabidopsis genes.