The completion of several genome sequencing projects in angiosperms has resulted in improved knowledge of the content and organisation of the flowering plant genomes. In gymnosperms, in the absence of a completely sequenced and ordered genome, recent efforts have been put toward improving knowledge of the gene space through several EST sequencing projects [33]; but the structural organisation of this gene space on the genome remains largely undetermined [34]. The spruce genetic map and analyses presented herein allow better comprehension of the genome macro-structure for a gymnosperm. These results combined with phylogenies reveal the relative proportion of gene duplications shared between angiosperms and gymnosperms or unique to gymnosperms, and how the seed plant genome has been reshuffled over time from a conifer perspective.

Gene distribution and density

To localise the GRRs, we implemented a statistical approach based on the kernel density function. This represents a technical improvement compared with existing methodologies given that we used an adaptive kernel approach to avoid the use of an arbitrarily fixed bandwidth. This approach allowed us to take into account the density observed locally to compute the bandwidth size. Because the number of genes currently positioned on the spruce genome represents around 6% of the estimated total number of genes [26], we applied stringent parameters in these analyses to reduce the rate of false positives. Thus, we may have underestimated the extent of GRRs. Besides these significant peaks, a few other peaks of kernel density that do not currently reach significance (Figure 3) may do so with an increased number of mapped genes. Indeed, Kolmogorov-Smirnov tests of homogeneity of gene distribution indicated that nine chromosomes had a significantly non-uniform distribution. Even so, there does not seem to be a widespread occurrence of GRRs on the spruce genome. In addition, the seven significant GRRs were distributed among seven chromosomes. This peculiar distribution suggests that GRRs may correspond to centromeric regions where, on genetic maps, markers tend to cluster due to more limited recombination.
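
The adaptive bandwidth idea described above can be illustrated with a minimal sketch. It assumes an Abramson-style scheme (a fixed-bandwidth pilot estimate followed by local bandwidths scaled by the inverse square root of the pilot density), which is one common way to build an adaptive kernel estimator and is not necessarily the exact procedure used in this study. The function names, the pilot bandwidth h0, the sensitivity alpha and the toy marker positions are all hypothetical, and the significance testing of peaks is not reproduced. A Kolmogorov-Smirnov distance to the uniform distribution is included for the homogeneity check, without its p-value.

```python
import math

SQRT_2PI = math.sqrt(2.0 * math.pi)

def fixed_kde(points, grid, h):
    """Gaussian kernel density estimate with one fixed bandwidth h."""
    n = len(points)
    return [
        sum(math.exp(-0.5 * ((g - p) / h) ** 2) for p in points) / (n * h * SQRT_2PI)
        for g in grid
    ]

def adaptive_kde(points, grid, h0, alpha=0.5):
    """Abramson-style adaptive KDE (assumed scheme, not the authors' code).

    The bandwidth of each point shrinks where the pilot density is high
    and widens where it is low.
    """
    n = len(points)
    pilot = fixed_kde(points, points, h0)  # pilot density at each data point
    geo_mean = math.exp(sum(math.log(f) for f in pilot) / n)
    local_h = [h0 * (f / geo_mean) ** (-alpha) for f in pilot]
    density = [
        sum(
            math.exp(-0.5 * ((g - p) / h) ** 2) / (h * SQRT_2PI)
            for p, h in zip(points, local_h)
        ) / n
        for g in grid
    ]
    return density, local_h

def ks_uniform_statistic(points, length):
    """Kolmogorov-Smirnov distance between positions and a uniform law on [0, length]."""
    xs = sorted(points)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = x / length
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Hypothetical gene positions (cM) on one linkage group, with a cluster near 50 cM.
positions = [2.0, 10.0, 21.0, 35.0, 48.0, 49.0, 50.0, 50.5, 51.0, 52.0,
             66.0, 80.0, 95.0, 110.0]
grid = [float(x) for x in range(0, 121)]

density, local_h = adaptive_kde(positions, grid, h0=5.0)
peak_cm = grid[density.index(max(density))]
ks_d = ks_uniform_statistic(positions, length=120.0)
print(f"density peak at {peak_cm} cM, KS distance to uniform = {ks_d:.3f}")
```

In this toy example the local bandwidths are narrowest inside the cluster, so the density peak is sharpened there while isolated markers are smoothed more heavily.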

In angiosperms, small genomes tend to consist of GRRs alternating with gene-poor regions. For example, the genic space of Arabidopsis thaliana represents 45% of the genome while the remaining 55% is 'gene-empty' and interspersed among genes as blocks ranging in size from a few hundred base pairs to 50 kb [35]. By contrast, plant species with larger genomes do not show such a contrasted gene distribution, in line with the pattern found here for the large spruce genome. Rather, they harbour a gradient of gene density along chromosomes, such as in maize [36], soybean [37] and wheat [38, 39]. In the soybean genome, a majority of the predicted genes (78%) are found in chromosome ends, whereas repeat-rich sequences are found in centromeric regions [40]. In conifers, retroelements have been reported as a large component of the genome, with some families well dispersed while others occur in centromeric or peri-centromeric regions (for example, see [41–45]). Thus, retroelements might have participated in shaping the distribution of genes along chromosomes by reducing the occurrence of GRRs.

The type of gene distribution along the genome bears consequences for the planning of genome sequencing strategies. For instance, a gene distribution of 'island' type implies that a deeper sequencing effort is necessary to reach a majority of the genes [38]. Though genetic distance does not equate to physical distance, the pattern seen here in spruce indicates that genetic maps alone, even if they included most of the gene complement, would be insufficient to anchor a significant portion of physical scaffolds, especially if these are small. In conifers, little is known about physical gene density in genomic sequences. In spruce, two partially sequenced BAC clones had a single gene per 172 kbp and 94 kbp, respectively, which represents a density at least 10-fold lower than the average gene density of the sequenced genomes of Arabidopsis, rice, poplar or grapevine [46]. In addition, the sequencing of four other randomly selected BAC clones in spruce failed to report any gene [45].

Tandemly arrayed genes and functional clusters

In the present analysis, we identified two types of gene clusters: arrays of genes duplicated in tandem and arrays of unrelated sequences sharing functional annotations. There were 51 arrays (TAGs) encompassing genes from the same family that were duplicated within 5 cM. They incorporated 6.9% (125) of the mapped genes and they could be indicative of small segmental duplications. Such TAGs were also reported in genomic sequences of model angiosperms: they involve 11.7% of the Arabidopsis genes and 6.7% of the rice genes [47]. Most of the spruce arrays (78.0%) included only two genes. Similar proportions were found in genome sequences of model angiosperms [47]. The largest spruce array found consisted of eight myb-r2r3 genes on chromosome 7 (Figure 1). Interestingly, seven of these myb-r2r3 genes belong to the same subgroup Sg4C [48]. The other genes belonging to Sg4C were not positioned on this linkage map. Spruce TAGs were significantly enriched in functions related to DNA binding, secondary metabolism and structural proteins (Figure 1). In Arabidopsis and rice, TAGs are under-represented among transcription factors and over-represented in enzymes [47]. GO analyses and expression data showed a strong correlation between tandem duplicates and biotic stress genes in Arabidopsis [49], leading the authors to suggest that 'tandem duplicates are likely important for adaptive evolution to rapidly changing environments'. In the myb-r2r3 gene array, the three genes named PgMyb5, PgMyb10 and PgMyb13 exhibited very different expression patterns [50]. The lack of co-expression of the genes mapped in arrays did not support a gene arrangement oriented by co-regulation. In these arrays, a majority of the genes were derived from duplications occurring after the GA split but were shared between spruce and pine, indicating from the perspective of geological time that expression divergence may occur quite rapidly after gene duplication [51].
Such a pattern is in accordance with the observation that if a new function is not acquired rapidly through neo-functionalisation, one duplicate tends to evolve towards a pseudogene and disappear [7]. A high frequency of pseudogenes has indeed been reported in conifer genomes [26, 42, 52]. However, there are exceptions to this neo-functionalisation trend among surviving duplicates, such as in the conifer Knox-I family. In this family, the closely located kn1 and kn2 arose from a duplication postdating the divergence between gymnosperms and angiosperms; nevertheless, neo-functionalisation has not happened yet between these duplicates in spite of the duplication occurring before the divergence between the spruce and pine lineages, more than 100 Mya [24]. Sub-functionalisation of these duplicates has been noted [24], conferring partial functional redundancy that might enhance survival and adaptation in these long-lived perennials. Several other cases of conifer-specific duplications might exist that imply partial redundancy of function instead of neo-functionalisation.
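
The TAG criterion used here (genes of the same family lying within 5 cM of each other on a chromosome) can be sketched as a simple grouping procedure. The sketch below assumes that consecutive members of an array are at most 5 cM apart, which is one possible reading of the criterion. The function name and the gene identifiers, positions and family labels in the toy input are hypothetical and do not correspond to mapped spruce loci.

```python
from collections import defaultdict

def find_tandem_arrays(genes, max_gap_cm=5.0):
    """Group same-family genes on one chromosome into candidate tandem arrays.

    genes: iterable of (gene_id, chromosome, position_cM, family).
    Assumption: consecutive array members are at most max_gap_cm apart.
    Returns a list of (chromosome, family, [gene_ids]) with at least two genes.
    """
    groups = defaultdict(list)
    for gene_id, chrom, pos, family in genes:
        groups[(chrom, family)].append((pos, gene_id))

    arrays = []
    for (chrom, family), members in sorted(groups.items()):
        members.sort()  # order genes by map position
        current = [members[0]]
        for pos, gene_id in members[1:]:
            if pos - current[-1][0] <= max_gap_cm:
                current.append((pos, gene_id))
            else:
                if len(current) >= 2:
                    arrays.append((chrom, family, [g for _, g in current]))
                current = [(pos, gene_id)]
        if len(current) >= 2:
            arrays.append((chrom, family, [g for _, g in current]))
    return arrays

# Hypothetical toy map: identifiers and positions are illustrative only.
toy_genes = [
    ("m1", 7, 40.0, "myb-r2r3"),
    ("m2", 7, 42.5, "myb-r2r3"),
    ("m3", 7, 46.0, "myb-r2r3"),
    ("m4", 7, 90.0, "myb-r2r3"),
    ("c1", 2, 10.0, "chs"),
    ("c2", 2, 11.0, "chs"),
    ("k1", 2, 10.5, "knox"),
    ("m5", 3, 41.0, "myb-r2r3"),
]

arrays = find_tandem_arrays(toy_genes)
genes_in_arrays = sum(len(ids) for _, _, ids in arrays)
fraction_in_arrays = genes_in_arrays / len(toy_genes)
print(arrays, fraction_in_arrays)
```

Requiring the whole array to span no more than 5 cM, rather than each consecutive gap, would be a stricter alternative and would split long arrays into smaller ones.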

We found three clusters made of co-expressed gene sequences that were similar to operon-like structures: two cases made of non-homologous sequences and a third one made of tandemly duplicated chalcone synthases. In angiosperms, only five such structures have been described and were associated with secondary metabolism and defence mechanisms [53]. Such metabolic clusters have emerged as a new and growing theme in plant biology [54]. The three clusters found in our study were similar to these functional clusters, except that their roles were not restricted to secondary metabolism. In our data, two other cases for which we could not obtain expression data were also good candidates for functional clusters. On chromosome 11, there were two co-localising pectin methylesterases. Moreover, a single group of two non-homologous sequences was clearly involved in secondary metabolism. This group encompassed one flavonol synthase and one glutathione synthase on chromosome 6 of Picea. Glutathione plays several important roles in the defence of plants against environmental threats. It is a substrate for glutathione S-transferases, enabling neutralisation of potentially toxic xenobiotics [55]. Thus, these flavonol and glutathione synthases may belong to a cluster of functionally related but non-homologous genes. Similarly, 80 co-expression clusters sharing the same GO term were described along the 3B wheat chromosome, suggesting the existence of adaptive complexes of co-functional genes [39]. In spruce, exhaustive transcriptomic resources have recently been developed [30]. Their analysis, combined with the positioning of additional genes on the genome, should allow us to determine whether adaptation at the metabolic level has contributed to shaping the organisation of the gene space.

Highly conserved organisation of the gene space between spruce and pine

Before conifer gene catalogues were available, the number of available orthologous markers to enable comparative studies of genome macro-structure between conifer species was highly limited [34]. A substantial conservation of the Pinaceae genome macro-structure was nevertheless suspected [34, 56, 57]. The extent of conservation in genome macro-structure could be estimated more exhaustively in our study, because we identified a much enlarged set of orthologous mapped gene loci (over 200) between spruce and pine. Synteny and collinearity between the spruce and loblolly pine genomes were very high. Lower collinearity was noted with the maritime pine genome, which likely resulted from the lower accuracy of the gene order based on the use of a smaller mapping population for this species [32]. Therefore, it is safe to assume that the organisation of the conifer gene space has been largely maintained over a period dating back 120 to 140 Mya, since the early diversification of Pinaceae in its main lineages in the Early Cretaceous [13, 14, 27]. Such a high level of conservation of the genome macro-structure has also been reported among angiosperm genomes from the Rosids and Asterids clades [58, 59], which diverged about 115 Mya [58–60], a time period similar to that of the pine-spruce divergence. By contrast, since the monocot-eudicot split 140 to 150 Mya [61], which slightly preceded the pine-spruce divergence, synteny has been largely disrupted between model monocots and dicots [62]. Such large discrepancies in the apparent rate of evolution of genome macro-structure are also conspicuous among angiosperm lineages, where it has been shown that the genomes of perennial species such as grape and poplar evolved slower than those of annual species such as Arabidopsis and rice [63].
These differences in evolutionary rates are also reminiscent of those in substitution rates between annual and perennial or woody seed plants, where various hypotheses related to mutation rate, generation time, population size and fixation rate have been proposed [64–68].
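
As an illustration of how synteny and collinearity between two genetic maps can be quantified from orthologous loci, the following minimal sketch uses two simple summaries: the fraction of loci falling on the best-matching chromosome of the other species (synteny), and the absolute Spearman rank correlation of map positions within each matching chromosome pair (collinearity). These particular metrics are assumptions made for illustration, since the measures are not specified in this section. The function names and the toy loci are hypothetical.

```python
import math
from collections import Counter, defaultdict

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation; None when it is undefined."""
    n = len(x)
    if n < 3:
        return None
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        return None
    return cov / math.sqrt(vx * vy)

def synteny_and_collinearity(pairs):
    """pairs: list of (chrom_a, pos_a, chrom_b, pos_b) for orthologous loci."""
    partners = defaultdict(Counter)
    for chrom_a, _, chrom_b, _ in pairs:
        partners[chrom_a][chrom_b] += 1
    # Best-matching chromosome of species B for each chromosome of species A.
    best = {a: counts.most_common(1)[0][0] for a, counts in partners.items()}

    by_chrom = defaultdict(list)
    for chrom_a, pos_a, chrom_b, pos_b in pairs:
        if chrom_b == best[chrom_a]:
            by_chrom[chrom_a].append((pos_a, pos_b))

    synteny = sum(len(v) for v in by_chrom.values()) / len(pairs)
    collinearity = {}
    for chrom_a, loci in by_chrom.items():
        rho = spearman([a for a, _ in loci], [b for _, b in loci])
        # Linkage group orientation is arbitrary, so the sign is dropped.
        collinearity[chrom_a] = None if rho is None else abs(rho)
    return synteny, collinearity

# Hypothetical orthologous loci between two maps (positions in cM).
toy_pairs = [
    ("1", 5.0, "A", 3.0), ("1", 20.0, "A", 18.0), ("1", 40.0, "A", 45.0),
    ("1", 60.0, "A", 52.0), ("1", 30.0, "B", 10.0),
    ("2", 10.0, "B", 90.0), ("2", 25.0, "B", 70.0), ("2", 50.0, "B", 30.0),
]
synteny, collinearity = synteny_and_collinearity(toy_pairs)
print(synteny, collinearity)
```

With such summaries, a lower collinearity value for one species pair can reflect uncertainty in local gene order on one of the maps rather than true rearrangements, which is why mapping population size matters for the comparison.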

Age and organisation of gene duplicates

The phylogenetic analysis of 157 gene families indicated a large imbalance in favour of ancient duplications predating the GA split versus more recent duplications postdating the GA split. Since the genes sampled in the present study were identified after sequencing ESTs, one could argue that the sample might be biased towards expression patterns that are possibly related to high sequence conservation (for example, [69]), hence artificially increasing the ratio of ancient versus recent gene duplications detected in the present study. However, the ratio was highly asymmetric (eight to one), and we showed that the genes and families involved in our study were representative of a large array of molecular functions and biological processes seen in the most recent spruce gene catalogue, implicating both conserved and less conserved gene families [26]. Gene annotations in conifers [26, 70, 71] also do not favour this hypothesis. Indeed, the most complete catalogue of expressed genes for a conifer, which was based on a large effort involving the sequencing of 23,589 full-length cDNA inserts, has recently enabled the reporting of the most exhaustive comparison of homologous genes sequenced both in angiosperms and gymnosperms [26]. The results indicated that the spruce protein families were largely overlapping with those of completely sequenced angiosperm model plants [26]. Comparing the occurrence of the Pfam domains in the spruce gene catalogue with completely sequenced genomes from model plant species showed that only 28 protein domains were statistically over-represented in spruce and most of them were involved in metabolism, stress response and retrotransposition. Moreover, the gene coding portion of the spruce genome was evaluated at around thirty thousand transcribed genes, a number in the same range as that observed for model angiosperm genomes [26].
The in-depth study of a few transcription factor families also showed that conifers lack some members in specific subfamilies while containing more genes in closely related subfamilies that were derived from duplication events postdating the GA split [24, 48, 72]. These various observations suggest that the conifer genes that are highly divergent from their angiosperm homologues are rare in the sequence resources developed so far, in spite of the fact that these resources relied on the investigation of a diversity of tissues and conditions [26]. In the future, the availability of the genome sequence may allow the discovery of more conifer-specific genes that could be highly duplicated; but we do not expect to find them in abundance, as suggested by the present phylogenetic analysis.

A large majority of spruce gene pairs were translocated and most of these translocations occurred before the GA split, affecting a large majority of the 157 gene families analysed. By contrast, genes duplicated after the GA split were located overwhelmingly in close proximity on the same chromosome and often organised in tandem. These trends were consistent with the observation that the physical distances between duplicates on the Caenorhabditis elegans genome increase with time, due to chromosomal rearrangements and other mutational events [73]. Nevertheless, this pattern is not always clear, for instance in Arabidopsis gene families where the occurrences of tandem duplications and segmental duplications are negatively correlated [47, 74]. In this model plant, unequal cross-over and gene loss were proposed as possible mechanisms leading to the counter-selection of tandem duplications [74].

The observed large excess of ancient duplications predating the GA split over more recent duplications postdating the GA split is consistent with the hypothesis of relative stasis in the gymnosperm lineage leading to conifers and the little evidence for a recent large expansion of the gene space. While a single whole genome duplication has been hypothesised to have affected the common ancestor of seed plants around 350 Mya [20], evidence for more recent widespread polyploidy in the gymnosperm lineage after its divergence from the angiosperm lineage was not found in the present study, in agreement with results from cytological studies reporting a rare occurrence of this phenomenon in gymnosperms [18, 75]. Variation in basic chromosome number in the diploid gymnosperms would rather be the result of chromosome fusion or fission [75]. For instance, such fission would have led to the additional chromosome seen in Douglas fir, relative to other Pinaceae [56]. If so, some of the translocations hypothesised in the present study after the GA split could also be the result of ancient chromosomal fissions increasing the basic chromosome number in the lineage leading to spruce and pine.

While more recent duplications specific to the gymnosperm lineage leading to extant conifers were detected, the stasis of genome macro-structure noted in this lineage is in concordance with that observed between the spruce and pine genomes, which corresponds to a period exceeding 100 My since the last common ancestor of Pinaceae [13, 14, 27]. Such slow rates of genome evolution parallel the slow rates of speciation and patterns of reticulate evolution noted in Pinaceae taxa [76, 77] and their archaic morphological features and life history [14, 78]. Such multiple coincidences reinforce the idea that perennial nature and large historical population sizes are key factors in the slow evolution of conifers [65]. The large excess of ancient duplications detected in the present study also indicates that much gene expansion has occurred in the plant lineage before the divergence between gymnosperms and angiosperms in the Late Carboniferous. Part of this expansion in the primitive land plants might be related to the major burst of duplications noted in transcription factors and coinciding with the water-to-land transition of plants [79]. Further studies implicating other divisions of green plants are needed to better comprehend the temporal dynamics of gene family expansions and reshuffling of the plant genome before the emergence of modern seed plants.