Our approach consists of three conceptual steps: first synteny is used to narrow down candidate co-orthologs to local tRNA clusters. In the second step each of the initial candidate sets is partitioned further based on sequence similarity on preserved relative order, resulting in an estimated co-orthology relation. These are further refined and corrected using the fact that orthology relations must have cograph structures. In the remainder of this section we describe the individual steps formally and discuss possible future improvements.

From synteny to candidate orthologs

We consider a set Σ of species or genomes. Each genome a∈Σ comprises a discrete set of loci. Genomic coordinates establish an order relation ≺ among loci. Since genetic elements have an intrinsic reading direction the order ≺ is either the same or the inverse of the coordinate system. We write \(\bar {u}^{a}\) for the reverse complement of locus u a on genome a. Note that u a≺v a is equivalent to \(\bar v^{a}\prec \bar {u}^{a}\). Since the reverse complement of a locus is also a valid locus we arbitrarily choose the orientation.

For a subset of loci we assume that they evolve independently by vertical inheritance and are not subject to duplication in the set of species under consideration. We say that two tRNAs t a and t b in genomes a and b, respectively, are 1:1 orthologs, if t a is the only ortholog in genome a of t b in genome b, and vice versa. Therefore we know (or can compute) the 1:1 orthologs of p a in a set of species Σ p ⊆Σ. We will refer to such a set of orthologous loci p={p a|a∈Σ p } as an anchor. An anchor p may connect all or only a subset Σ p ⊆Σ. The orthologs within an anchor are defined to be oriented in the same reading direction. Therefore, if p and q are anchors with p a≺q a then p b≺q b for all a,b∈Σ p ∩Σ q . That is, we assume that anchors preserve synteny including relative reading direction in the set of genomes of interest. We can therefore write p≺q.

Now we consider a set T of loci of interest; in our case tRNAs. None of the t a∈T gives rise to an anchor, i.e., we assume that the multiple, nearly identical sequences are present in the genome. We make two basic, simplifying assumptions:

There are anchors p and q such that p a ≺ t a ≺ q a .

A pair of anchors can be chosen such that the relative order such that the order of homologous loci in preserved between p and q.

Both assumptions are approximations to reality. Condition (S1) stipulates that the locus t a of interest is not too close to the end of contig, scaffold, or chromosome. It will be violated essentially by incomplete data and flaws in genome assemblies. Condition (S2) is a more severe restriction. It allows only unduplicated vertical inheritance and tandem duplications of individual loci. It explicitly rules out genome arrangement between anchors sufficiently close to the locus of interest and also neglects tandem duplications affecting more than a single gene. In essence it forces us to treat a multi-locus tandem duplication as if it was a combination of unduplicated vertical inheritance combined with the insertion of the second copy of tandem pair. It could be replaced in future work by the weaker condition that genomic evolution by tandem duplication, and regarrangements is confined to regions delimited by the anchors.

The exact nature of the anchors is irrelevant for the workflow. In a very conservative approach, known sets of orthologous protein-coding mRNAs are used. If a more fine-grained resolution is desired, one can use e.g. blocks of genome-wide multiple sequence alignment (see Methods).

Tight anchors

Condition (S1) allows us to obtain initial candidates for orthology assignments. We assume that they have a set of homologous elements T a for each genome a∈Σ. If p and q are anchors with p a≺t a≺q a then any t ′∈T b with t ′≺p b or q b ≺t ′ cannot be co-ortholog of t a in genome b.

The practical difficulty is that in general we might not have anchors that cover all species of interest but only a subset of them. For any “query” locus t a and any species b∈Σ, b≠a we therefore define a pair of tight anchor for t a into b as a pair of anchors p b (t a):={p a,p b} and q b (t a)={q a,q b} such that (i) p a≺t a≺q a and (ii) the pair (p b (t a),q b (t a)) is minimal in the sense that there is no further anchor u=(u a,u b) with p a≺u a≺t a or t a≺u a≺q a, Fig. 1.

Fig. 1 Tight anchors for t into species b. Possible anchors are indicated by the grey boxes. The tight anchors are the anchors closest to t (marked in lighter grey) that connect species a and b. By synteny, the only possible orthologs of t are the three loci indicated by white circles Full size image

Under our assumption (S1), there is a unique pair of tight anchors of t a into b for every b∈Σ. In practice, however, there may be exceptions: in the case of genome arrangement or a fragmented genome assembly the anchor points p b (t a) and q b (t a) may be located on very far apart or even on different chromosomes, contigs, or scaffold. We refer to the Methods section for the handling of such exceptions.

Condition (S2), or even a much weaker locality assumption, implies that only the homologs t ′∈T b enclosed by the pair of tight anchors for t a are possible co-orthologs of t a in genome b.

Candidate graph

From the sets of homologous loci T a and a collection of anchors on Σ we construct the candidate graph Γ c as follows. The vertices Γ c are the annotated homologs, i.e., \(T=\bigcup _{a\in \Sigma } T_{a}\). An edge between t a∈T a and t b∈T b is inserted if p b (t a)≺t b≺q b (t a), i.e., if t b is located between the pair of tight anchors from t a into b. In order to accomodate some local inversions and/or assembly errors one might want to relax this definition and to draw an edge between t a and every locus t∈T b so that p b (t a)≺t≺q b (t a) or \(p_{b}(t^{a})\prec \bar t\prec q_{b}(t^{a})\). By construction, the true orthology relation is a sub-graph of Γ c , see Fig. 2 a. Its nodes are the tRNAs and there is an edge between two tRNAs if they are possibly orthologuous, thus if they are flanked by the same tight anchors and belong to distinct species.

Fig. 2 Stepwise refinement of the candidate graph Γ c . a The graph Γ c represents the possible orthology assignments among tRNA loci derived from the synteny anchors. Only genes from different species can be orthologs, hence no edges connect loci in the same species. b Based on sequence similarity edges are removed between tRNAs from different isoacceptor families. c A modified Needleman-Wunsch alignment algorithm is used to identify order-preserving subgroups. This step admits local tandem duplications but not duplications of larger subclusters Full size image

The graph Γ c is not sufficient to completely solve the orthology problem because in general two tRNA loci \({t^{a}_{i}}\) and \({t^{a}_{j}}\) will not be separated by anchors. The available anchors in fact may enclose entire tRNA clusters, see Fig. 1. For tRNAs, however, we can clearly distinguish subgroups by sequence similarity. In particular, tRNAs of different isoacceptor families (i.e., those that are loaded with different aminoacids) and within these, most subgroups with distinct anticodons, exhibit clearly separate sequences. We therefore can prune the edge set of Γ c by removing all edges that connect tRNA loci with clearly distinct sequences. We therefore require that the genetic distance satisfies \(d_{G}({t^{a}_{i}},{t^{b}_{j}})<\varepsilon \) for all edges of the pruned candidate graph, which we denote by Γ a , see Fig. 2 b. The threshold ε is chosen as an upper bound on divergence of genes in phylogenetic range of interest (see Methods for details).

Order preservation within clusters

Assumption (S2), stipulates that co-orthologous loci preserve relative order. In the context of tRNA clusters, this amounts to the assumption that tRNAs within a gene cluster proliferate by means of single gene tandem duplications or by retroposition-like insertions.

The relationship between clustered tRNAs in two species corresponds to a generalized version of an alignment problem. In order to see this, we consider each tRNA cluster as an ordered list of tRNAs and tRNA pseudogenes \({t^{a}_{i}}\) and \({t^{b}_{j}}\) in the two genomes a and b. For the sake of the argument let us first neglect gene duplications and consider insertion, deletion, and remolding only. In this case the correspondences between orthologous loci form an order-preserving matching in the induced subgraph of Γ a restricted to every pair of species. This amounts to an alignment of the tRNA loci in T a with those in T b with alignment edges allowed only between loci that are connected by an edge in Γ a .

Modified Needleman-Wunsch alignment

In order to account for local, i.e., order-preserving duplications we can extend the alignment model in a simple manner. In the usual setting of matchings, one locus \({t^{a}_{i}}\) can match at most a single locus \({t^{b}_{j}}\). Otherwise one of \({t^{a}_{i}}\) and \({t^{b}_{j}}\) is deleted. This is called a 1:1 alignment. In the simplest extension also 1:2 and 2:1 matches are allowed, i.e, two positions \(({t^{a}_{i}}, t^{a}_{i+1})\) may collectively match a single position \({t^{b}_{j}}\), or vice versa. More generally, p:q matches may be considered. Such extensions to one-to-many or many-to-many matches lead to quite simple modification of the Needleman-Wunsch [37] algorithm (see Methods for details). Such extensions have been proposed and applied to natural language data, see e.g. [38, 39] and a means of extracting co-linear clusters of phylogenetic footprints [40].

As stated above, condition (S2) is a restrictive approximation that rules out tandem duplications of subclusters larger than a single locus as well as any local genome rearrangements. More inclusive assumption could be made instead. The full duplication-loss alignment problem that allows copying of subclusters of arbitrary size is APX hard [41], but a practicable dynamic programming heuristic is available [42]. Recently, it was extended further in OrthoAlign to include also genome rearrangements [43]. In principle these approaches could be substituted into our workflow. We are content here with the simpler list alignment method, Fig. 2 c, despite its shortcomings because it allows us to avoid the problem of estimating weight parameters for complex duplications and rearrangments operations. As an alternative to alignment-like approaches for disentangling the history of individual loci it may also be fruitful to consider generalizations of gene order methods, see e.g. MLGO [44–47], albeit at least in the data we considered here duplications and losses are by far the dominating events.

Linear coordinate interpolation

An even stricter alternative to the list alignments is to assume that not only the relative order but also the relative distances of gene are preserved. Given two anchors p and q and an annotated tRNA position t a, one can then estimate the position of its putative ortholog t ∗ by linear interpolation:

$$ t^{*} = p^{b} + \frac{q^{b}-p^{b}+1}{q^{a}-p^{a}+1} t^{a} $$ (1)

The fraction in Eq.(1) is simply the ratio of the sequence lengths between the corresponding anchor points in the two genomes b and a, resp., see Fig. 3. This simple estimate was used e.g. in [48, 49] in the context of phylogenetic footprinting. The tRNA in b closest to the predicted position t ∗ is the best estimate for a 1:1 ortholog. It is important to note that the linear interpolation method can detect only 1:1 orthologs and will fail for co-orthologs arising from gene duplications.

Fig. 3 Determining orthologs by linear coordinate transformations Full size image

Estimated orthology graph

The alignment edges predicted by the pairwise generalized alignment algorithm serve our best estimates for the orthology relation. For 1:2 duplications an edge is inserted from the “original” to both “copies”; in the more general case of p:q duplications, we accept all edges of the complete bipartite graph corresponding to the p:q duplication. Superimposing all pairwise alignments yields the estimated orthology graph Γ o , conceptually shown in the bottom row of Fig. 4. It contains only edges between tRNAs that can be orthologs according to their sequence similarity, and all connected components of Γ o are order preserving since their edges result from the order-preserving alignment step, see Fig. 4. By construction Γ o is a spanning subgraph of Γ a , which in turn is a spanning subgraph of the initial candidate graph Γ c . In general, Γ o will consist of many small connected components, each comprising members of a single tRNA family that locally has expanded and contracted by duplication and loss events.

Fig. 4 Scheme of step-wise orthology identification. Top genomic organization of tRNAs (colored symbols). “Secure anchors” such as known orthologous proteins are shown as gray ovals. Sequence-unique alignment blocks are indicated as thick dark-gray lines. These anchors subdivide the genome into syntenic clusters forming the connected components of the graph of candidates Γ c , here shown for blocks of a genome-wide alignment as delimiters. Each cluster forms a connected component of Γ c . Pairwise generalized list alignments leads to an estimate of the co-orthology relation for each group of homologous tRNAs. Each of these estimated graphs is then corrected to the nearest cograph Full size image

Cographs and orthology

Recent results in phylogenetic combinatorics [35, 36, 50–53] show that orthology relations are cographs. There are many equivalent characterizations for this well-studied class. In particular, G is a cograph if it does not contain a P 4 , a path on 4 vertices, as an induced subgraph [54]. In particular, complete graphs are cographs. A cograph is associated with a unique cotree, which corresponds to the (not necessarily fully resolved) gene tree with labels at the interior vertices that identify speciation and duplication events, respectively [35, 36].

We expect that Γ o is already a very good approximation to tRNA orthology. Various sources of noise, however, will introduce violations of the cograph structure. Therefore, the orthology estimates can be improved further by editing Γ o to the nearest cograph. This amounts to inserting and deleting the minimal number of edges so that all P 4 s are destroyed. Although the cograph editing problem is NP hard [55], this is not a practical problem here. It is not difficult to see that the connected components C i of Γ o can be edited independently of each other [36]. Empirically, we observe that most connected components of Γ o are complete graphs and this already correct cographs.

From the final, corrected orthology estimates \(\hat {C}_{i}\) it is now straightforward to infer the evolutionary events. The cographs \(\hat {C}_{i}\) themselves provide direct information on the tandem duplication events. To this end it suffices to convert the \(\hat {C}_{i}\) into its equivalent cotree [54], from which the duplication events can be directly read off. Deletion events as well as gain events in which a particular locus was settled are obtained by mapping each of the \(\hat {C}_{i}\) to the species tree. We use Dollo parsimony [56] approach to derive the numbers of gain, losses, and duplications from the co-ortholog groups. Duplication events identified from the cographs \(\hat {C}_{i}\) are counted separately from gains.