Before introducing the genus trace, we recall what the genus is and how it can be used in the analysis of biopolymers. The genus of RNA structures has been considered before, e.g. in refs. 1–8, and for proteins in ref. 9. However, in those works the genus was computed only for the entire chain, and in the RNA case taking into account only canonical Watson-Crick base pairs. Here we show that much more detailed information is revealed once the genus is computed for various types of bonds in a given structure, including non-canonical base pairs such as those involved in helix backbone packing interactions in RNA. Moreover, the genus trace that we introduce in what follows captures much more information than the genus of the whole chain alone.

What is genus and how to compute it?

Consider a polymer-like chain consisting of a number of residues, with bonds connecting various pairs of these residues, as in the example in Fig. 1(a). The structure of such a chain can be presented in the form of a chord diagram. A chord diagram consists of b horizontal intervals (called backbones) that represent one or more polymer-like chains, and n arcs (chords) that represent bonds, connect pairs of residues, and are drawn as half-circles in the upper half-plane. In this work we consider configurations with only one backbone, \(b=1\). A chord diagram corresponding to the structure in Fig. 1(a) is shown in Fig. 1(b). Such diagrams are commonly used to present the structure of RNA chains3,4. A stack of parallel chords contributes to the genus in the same way as a single chord, so each set of parallel chords can be replaced by one chord, as in Fig. 1(c). Furthermore, to compute the genus it is advantageous to replace all backbones and chords by ribbons of finite width, also as in Fig. 1(c). In this way we obtain a two-dimensional surface with r boundaries, which, after shrinking the backbone to a small circle, can be drawn in a smooth way on an auxiliary surface of genus g (i.e. having g “holes”), as in Fig. 1(d). The genus of a chord diagram is defined as the genus of this auxiliary surface. This genus can be determined from the Euler formula

$$b-n=2-2g-r.$$ (1)

Figure 1 How to compute the genus. (a) A chain with several bonds (in blue and orange) connecting various pairs of residues (black dots). (b) Chord diagram representing the same structure. (c) Parallel chords replaced by a single chord, and then, together with the backbone, replaced by ribbons, whose single boundary is shown in red. (d) After shrinking the backbone to a small circle, the ribbon diagram can be smoothly drawn on the surface of a torus, whose genus is g = 1.

For example, in Fig. 1(c) there is \(b=1\) backbone, \(n=2\) chords, and \(r=1\) boundary (drawn in red). Substituting these values into the Euler formula (1) gives \(1-2=2-2g-1\), hence \(g=1\), so that the auxiliary surface is a torus, see Fig. 1(d).
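The boundary count r can also be obtained combinatorially, without drawing the ribbon diagram. The Python sketch below is one way to do this, assuming a single backbone (\(b=1\)) and chords given as pairs of distinct residue indices. The function names `chord_genus` and `collapse_stacks` are illustrative and are not taken from the original work. Each boundary is traced by alternately crossing a chord and stepping to the next chord endpoint along the closed backbone, and Eq. (1) is then solved for g.

```python
def chord_genus(chords):
    """Genus of a one-backbone chord diagram given as (i, j) residue pairs.

    Assumes all chord endpoints are distinct (no bifurcations).
    """
    endpoints = sorted(p for chord in chords for p in chord)
    if len(set(endpoints)) != len(endpoints):
        raise ValueError("endpoints must be distinct; resolve bifurcations first")
    n = len(chords)
    if n == 0:
        return 0  # bare backbone: one boundary, genus 0
    # Relabel endpoints 0..2n-1 in backbone order; unpaired residues are ignored.
    rank = {p: k for k, p in enumerate(endpoints)}
    partner = {}
    for i, j in chords:
        partner[rank[i]] = rank[j]
        partner[rank[j]] = rank[i]
    # Count boundaries r as cycles of: cross the chord, then move to the
    # next endpoint along the backbone (closed into a circle).
    seen, r = set(), 0
    for start in range(2 * n):
        if start in seen:
            continue
        r += 1
        k = start
        while k not in seen:
            seen.add(k)
            k = (partner[k] + 1) % (2 * n)
    # Eq. (1) with b = 1:  1 - n = 2 - 2g - r  =>  g = (n + 1 - r) / 2
    return (n + 1 - r) // 2


def collapse_stacks(chords):
    """Keep only the outermost chord of every stack of parallel chords."""
    endpoints = sorted(p for chord in chords for p in chord)
    rank = {p: k for k, p in enumerate(endpoints)}
    ranked = {(min(rank[i], rank[j]), max(rank[i], rank[j])): (i, j)
              for i, j in chords}
    # A chord (a, b) is redundant if (a - 1, b + 1) is also a chord.
    return [orig for (a, b), orig in sorted(ranked.items())
            if (a - 1, b + 1) not in ranked]


crossing_stacks = [(1, 10), (2, 9), (5, 14), (6, 13)]  # two crossing stacks
print(chord_genus(crossing_stacks))                    # 1
print(collapse_stacks(crossing_stacks))                # [(1, 10), (5, 14)]
```

Collapsing the stacks is not required for correctness here, since the boundary count already gives the same genus with or without it; it only reduces the diagram to the form drawn in Fig. 1(c).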

Note that if no chords intersect in a given chord diagram then \(g=0\); in this case the chord diagram is called planar. In particular, a large complicated RNA with a secondary structure in which all base pairs are nested has genus \(g=0\), so it is quite simple from the point of view of this paper. Furthermore, for a fixed number of chords and backbones the genus cannot exceed a maximal value: since \(r\ge 1\), Eq. (1) gives \(g\le (n-b+1)/2\), i.e. \(g\le n/2\) for a single backbone. We also recall that chord diagrams are used by mathematicians to characterize moduli spaces of Riemann surfaces, while physicists reinterpret them as a particular class of Feynman diagrams arising in certain quantum field theories or matrix models4,7. Certain properties of chord diagrams have also been discussed in ref. 10.

Types of bonds and bifurcations

To determine the genus, for example using the formula (1), one simply considers all bonds in a given chain. However, in various contexts, in particular for biomolecules, one can distinguish between various types of bonds. In this work we propose to consider such a distinction; as we will see, this provides new information about those different types of bonds. For RNA, an important classification of base pairs has been introduced by Leontis and Westhof11,12. They noticed that RNA bases can be regarded as triangles with three different edges, referred to as: Hoogsteen edge (denoted HG or H), Watson-Crick edge (denoted WC or W), and Sugar or Shallow Groove edge (denoted S or SG), see Fig. 2(a). A base pair is formed by any of these three edges of one nucleotide with an edge of another nucleotide. Depending on the orientation, a given base pair may have a trans (t) or cis (c) configuration. A base pair is denoted by specifying its orientation and the edges it involves; for example, cWW denotes base pairs formed by two Watson-Crick edges in the cis configuration. Altogether there are 12 types of base pairs, which we list here in the order corresponding to the frequency of their occurrence13, and which will be important in what follows:

$$\begin{array}{l}{\rm{cWW}},{\rm{tHS}},{\rm{tWH}},{\rm{tSS}},{\rm{cWS}},{\rm{tWS}},\\ {\rm{cHS}},{\rm{tWW}},{\rm{cWH}},{\rm{tHH}},{\rm{cSS}},{\rm{cHH}}.\end{array}$$ (2)
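The count of 12 follows from combining the two orientations with the six unordered pairs of edges. The short Python sketch below enumerates these combinations and checks them against the list in Eq. (2). The helper `canonical` and the constant names are illustrative only, and the sketch assumes that the order of the two edge letters in a name is irrelevant (so that, for instance, tHS and tSH would denote the same type).

```python
from itertools import combinations_with_replacement

EDGES = ("W", "H", "S")    # Watson-Crick, Hoogsteen, Sugar edges
ORIENTATIONS = ("c", "t")  # cis, trans

# Order of Eq. (2), i.e. by decreasing frequency of occurrence.
EQ2_ORDER = ["cWW", "tHS", "tWH", "tSS", "cWS", "tWS",
             "cHS", "tWW", "cWH", "tHH", "cSS", "cHH"]


def canonical(name):
    """Map a name such as 'tHS' to (orientation, unordered set of edges)."""
    return (name[0], frozenset(name[1:]))


# 2 orientations x 6 unordered edge pairs = 12 types.
ALL_TYPES = {(o, frozenset(pair))
             for o in ORIENTATIONS
             for pair in combinations_with_replacement(EDGES, 2)}

print(len(ALL_TYPES))                                   # 12
print({canonical(n) for n in EQ2_ORDER} == ALL_TYPES)   # True
```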

Figure 2 Resolving bifurcations. (a) Each nucleotide has three edges: Hoogsteen (denoted HG or H), Watson-Crick (WC or W), and Shallow Groove or Sugar (SG or S). (b) An example of a bifurcation in RNA structure. (c) Resolving a bifurcation: overlapping chord endpoints are replaced by separate residues, which are then sorted so that no additional crossings are introduced.

For proteins one can also consider various types of contacts, e.g. between various secondary structures: between two helices, or two β-strands, or between a helix and a β-strand.

In the computation of the genus for biomolecules we have to take into account an important subtlety that we refer to as bifurcations: a given residue may form bonds with more than one other residue. For example, in RNA a given nucleotide may form base pairs with more than one other nucleotide, as in Fig. 2(b), and in proteins a given amino acid may be in contact with more than one other amino acid. In the language of chord diagrams this means that more than one chord is attached at the same place on the backbone, which is not an allowed configuration, and in such a case the genus cannot be computed. To deal with this subtlety we split each residue into as many residues as the number of bonds it forms, and sort the endpoints of chords in such a way that the chords ending in those residues introduce no intersections in the corresponding chord diagram. This ensures that the genus can be computed, that it is not increased by artificially introduced crossings, and that it takes a well defined minimal possible value. An example of such sorting, for the bifurcation shown in Fig. 2(b), is shown in Fig. 2(c).
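The splitting and sorting step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not necessarily the procedure used to produce the results in this paper: the function name `resolve_bifurcations` is hypothetical, each bonded pair is assumed to be listed once, and the ordering rule (copies attached to partners on the left come first, and within each side the partner with the larger index comes first) is one choice under which chords sharing a residue end up either nested or side by side.

```python
from collections import defaultdict


def resolve_bifurcations(bonds):
    """Split multiply bonded residues and renumber all chord endpoints.

    bonds: list of (i, j) residue pairs with i != j, each pair listed once.
    Returns chords on distinct integer endpoints, in the input order.
    """
    partners = defaultdict(set)
    for i, j in bonds:
        if i == j:
            raise ValueError("a residue cannot be bonded to itself")
        partners[i].add(j)
        partners[j].add(i)

    # slot[(res, partner)] = position of the copy of `res` that carries
    # the chord to `partner`, counted from the left among the copies of `res`.
    slot = {}
    for res, others in partners.items():
        left = sorted((p for p in others if p < res), reverse=True)
        right = sorted((p for p in others if p > res), reverse=True)
        for k, p in enumerate(left + right):
            slot[(res, p)] = k

    # Renumber all (residue, slot) copies consecutively along the backbone.
    points = sorted((res, k) for (res, _), k in slot.items())
    new_index = {pt: n for n, pt in enumerate(points)}
    return [(new_index[(i, slot[(i, j)])], new_index[(j, slot[(j, i)])])
            for i, j in bonds]


# Residue 1 is bonded to both 5 and 9: its two copies give nested chords.
print(resolve_bifurcations([(1, 5), (1, 9)]))   # [(1, 2), (0, 3)]
```

Chords that cross in the original structure remain crossing after the renumbering, so only the crossings that would be an artefact of the splitting are avoided.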

Genus classification of RNA structures

Once we know how the genus is defined, it is of interest to compute it for all known RNA structures. Such a computation, restricted to canonical (cWW) base pairs, was first conducted in ref. 3. Here we show that taking into account other types of base pairs reveals much more interesting information. Moreover, since the work of ref. 3 many more RNA structures have been identified, so it is also worthwhile to compute the genus for all of them.

We computed the genus for RNA structures with better than 3.0 Å resolution in the PDB database (also deposited in the BGSU RNA Site database14). Out of the total of 1240 structures, 565 did not contain errors and were of appropriate form for the genus computation (carried out for the first backbone in the structure). In the genus computations we considered three classes of base pairs. First, we considered only canonical (cWW) base pairs. Second, we considered all possible base pairs excluding the sugar-sugar (cSS and tSS) base pairs, as SS interactions are largely responsible for helix packing interactions and therefore necessarily introduce crossing chords that increase the genus. Third, we took into account all possible base pairs listed in Eq. (2). Results for these three classes are shown in Fig. 3, respectively by red triangles (cWW interactions), green dots (all but sugar-sugar interactions), and blue crosses (all base pairs). Each triangle, dot, or cross corresponds to one RNA chain, and its coordinates correspond respectively to the length (the number of nucleotides) of this chain and the genus (of the corresponding auxiliary surface).
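The three classes amount to three filters on an annotated list of base pairs. The Python sketch below illustrates this, assuming a hypothetical input format of `(i, j, type)` triples with the type names of Eq. (2); the function name `select_class`, the class labels, and the toy annotation are illustrative and do not correspond to any real structure.

```python
SUGAR_SUGAR = {"cSS", "tSS"}


def select_class(pairs, which):
    """Return the (i, j) bonds belonging to one of the three classes.

    pairs: iterable of (i, j, base_pair_type) triples.
    which: 'cWW' (canonical only), 'no_SS' (all but sugar-sugar) or 'all'.
    """
    if which == "cWW":
        keep = lambda t: t == "cWW"
    elif which == "no_SS":
        keep = lambda t: t not in SUGAR_SUGAR
    elif which == "all":
        keep = lambda t: True
    else:
        raise ValueError(f"unknown class: {which!r}")
    return [(i, j) for i, j, t in pairs if keep(t)]


# Toy annotation (made up for illustration only).
toy = [(1, 20, "cWW"), (2, 19, "cWW"), (5, 30, "tHS"), (7, 40, "tSS")]
print(select_class(toy, "cWW"))    # [(1, 20), (2, 19)]
print(select_class(toy, "no_SS"))  # [(1, 20), (2, 19), (5, 30)]
print(select_class(toy, "all"))    # [(1, 20), (2, 19), (5, 30), (7, 40)]
```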

Figure 3 Genus computed for all known RNA structures. Each point represents one structure, and its coordinates denote respectively its length and genus. Red triangles correspond to genus computed taking into account only canonical (cWW) base pairs. Green dots represent genus computed for all base pairs apart from sugar-sugar (cSS and tSS) interactions. Blue ×’s correspond to genus computed for all base pairs.

From the data in Fig. 3 several conclusions can be drawn. First, sugar-sugar interactions indeed contribute significantly to the genus, as expected. Second, for each of the three classes of base pairs, the dependence of the genus g on the length of the chain d (the number of nucleotides) is linear to a good approximation, as shown in Fig. 3, and takes the form:

$$\begin{array}{ll}{\rm{Only}}\,\mathrm{cWW}: & g=0.005d,\\ {\rm{All}}\,{\rm{but}}\,\mathrm{SS}: & g=0.040d,\\ {\rm{All}}\,{\rm{base}}\,\mathrm{pairs}: & g=0.076d.\end{array}$$ (3)

The slopes 0.040 and 0.076 are about 8 and 15 times higher than the slope obtained with cWW only, so the genus indeed depends significantly on non-canonical base pairs. Moreover, note that the slope for canonical base pairs (red triangles) is of the same order as the slope \(\simeq \)0.003 found in ref. 3. However, our computation involves many more RNA structures, known as of 2018, and the linear character of the resulting plot in Fig. 3 is much more evident than that of the plot in ref. 3.
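The quoted factors follow directly from the slopes in Eq. (3), as the Python sketch below shows. It also includes a least-squares estimate of a slope for a line constrained to pass through the origin, which is one reading of the intercept-free form of Eq. (3); the fitting procedure actually used is not stated in the text, so this choice is an assumption. The function name `slope_through_origin` is illustrative and the sample data are synthetic.

```python
def slope_through_origin(lengths, genera):
    """Least-squares slope a of the model g = a * d (no intercept)."""
    num = sum(d * g for d, g in zip(lengths, genera))
    den = sum(d * d for d in lengths)
    return num / den


# Slopes quoted in Eq. (3).
SLOPES = {"cWW": 0.005, "no_SS": 0.040, "all": 0.076}
ratios = {name: a / SLOPES["cWW"] for name, a in SLOPES.items()}
print({name: round(x, 1) for name, x in ratios.items()})
# {'cWW': 1.0, 'no_SS': 8.0, 'all': 15.2}

# Synthetic, exactly linear data (not taken from any real structure).
print(round(slope_through_origin([100, 200, 400], [1, 2, 4]), 6))  # 0.01
```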

Moreover, Fig. 3 clearly shows that RNA structures are divided into three main groups: those shorter than 1000 nucleotides (450 structures), those with length between 1000 and 2500 nucleotides (72 structures), and those longer than 2500 nucleotides (43 structures). The second and the third groups correspond respectively to small and large subunits of ribosome structures, whose genus is very large. It is in the range \(50 < g < 200\) for the following 71 structures:

1FJG, 1IBL, 1N32, 1N33, 1XNQ, 1XNR, 2F4V, 2UUA, 2UXB, 2UXC, 2UXD, 2VQE, 3J7Y, 3J9M, 3J9W, 3JAM, 3JBU, 3JBV, 3JCS, 3JCT, 3T1Y, 4B3T, 4BTS, 4DR6, 4DR7, 4GKJ, 4GKK, 4JV5, 4JYA, 4K0K, 4KHP, 4TUE, 4U26, 4V19, 4V4Q, 4V50, 4V5G, 4V5K, 4V6E, 4V7M, 4V83, 4V84, 4V85, 4V8N, 4V8U, 4V92, 4V9I, 4V9L, 4V9R, 4W29, 4WSM, 4XEJ, 5A2Q, 5AJ3, 5AN9, 5E7K, 5E81, 5EL4, 5IB8, 5IBB, 5IT7, 5IT9, 5J7L, 5JU8, 5JUP, 5MC6, 5O5J, 5T2A, 5T5H, 5T7V, 5TCU

and in the range g ≥ 200 for 40 structures:

1NJP, 1QVF, 1S72, 1VQ6, 1VQL, 1VQM, 1VQN, 1VQO, 3J6B, 3J79, 3J7P, 3J7Q, 3J7R, 4IOA, 4U4U, 4UG0, 4V88, 4V8C, 4V8E, 4V8P, 4V8Q, 4V91, 4V9F, 4V9Q, 4WFA, 5AJ0, 5DM6, 5FDU, 5HL7, 5L3P, 5LZD, 5MGP, 5MMI, 5MMM, 5MRC, 5ND8, 5O61, 5T2C, 5UMD, 5X8P

Most of the remaining structures with at most 1000 chords have much lower genus (computed for all base pairs): for 229 structures the genus is simply \(g=0\), and for 225 structures it is lower than 50.

Note that when all base pairs (with or without sugar-sugar interactions) are taken into account, the genus grows significantly, and for small or large subunits of the ribosome it is of the order of several hundred. These values are much larger than the genus computed only for canonical base pairs, which is of the order of 20–30 even for the long large ribosomal subunits. As we will see in what follows, the properties of the genus trace are most interesting when the genus of the whole chain takes large values. For this reason, long ribosomal subunits with all non-canonical base pairs will be of particular interest to us in the following analysis.

We also computed the genus for designed artificial RNA structures, such as the “Peano curve” or the “smiley face”15,16. However, even though these structures are quite long, their genus does not exceed 20, even when all base pairs are considered. This is because these artificial structures use fewer tertiary motifs to achieve 3D compaction than natural RNA structures do. Since the artificial structures consist of only cWW interactions, they end up having fairly low genus compared to natural folded RNAs of similar length.