The circular chromosome of R. prowazekii strain Madrid E has 1,111,523 bp and an average G+C content of 29.1% (Figs 1, 2). The genome contains 834 complete open reading frames with an average length of 1,005 bp. Protein-coding genes represent 75.4% of the genome and 0.6% of the genome encodes stable RNA. We have assigned biological roles to 62.7% of the identified genes and pseudogenes; 12.5% of the identified genes match hypothetical coding sequences of unknown function and the remaining 24.8% represent unusual genes with no similarities to genes in other organisms (Table 1). Multivariate statistical analysis has shown that there is no major variation in codon-usage patterns among genes that are expressed in different amounts, indicating that codon-usage patterns in R. prowazekii may be dominated mainly by mutational forces14. G+C-content values at the three codon positions average 40.4, 31.2 and 18.6%, and these values are similar at different positions in the genome. We classified the open reading frames with significant sequence-similarity scores to gene sequences in the public databases into functional categories (Table 1) that allow comparisons with the metabolic profiles of other bacterial genomes15,16,17,18,19,20,21,22,23.
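
For illustration, G+C content at each of the three codon positions could be computed along the following lines. This is a minimal sketch in Python, assuming each gene is supplied as an in-frame coding sequence string; the function name gc_by_codon_position is hypothetical and is not the software used for the analysis reported here.

```python
def gc_by_codon_position(cds):
    """Return the G+C fraction at codon positions 1, 2 and 3 of one coding sequence.

    Assumption: `cds` starts in frame; trailing bases that do not
    complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop an incomplete final codon
    fractions = []
    for pos in range(3):
        bases = cds[pos:usable:3]  # every third base, starting at this codon position
        gc = sum(1 for base in bases if base in "GC")
        fractions.append(gc / len(bases) if bases else 0.0)
    return fractions


# Toy example with three codons: GCA GCT GAA
first, second, third = gc_by_codon_position("GCAGCTGAA")
```

Averaging these per-gene fractions over all annotated genes would give genome-wide values comparable to those quoted above, although the exact weighting scheme is an assumption.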

Figure 1: Overall structure of the R. prowazekii genome. The putative origin of replication is at 0 kb. The outer scale indicates the coordinates (in base pairs). The positions of pseudogenes are highlighted with death's heads. The distribution of genes is shown on the first two rings within the scale. The location and direction of transcription of rRNA genes are shown by pink arrows and of tRNA genes by black arrows. The next circle inwards shows GC-skew values measured over all bases in the genome. Red and purple colours denote positive and negative signs, respectively. The window size was 10,000 nucleotides and the step size was 1,000 nucleotides. The central circles show GC-skew values calculated for third positions in the codon only. GC-skew values were calculated separately for genes located on the outer strand (green) and on the inner strand (blue). To allow easier visual inspection, the signs of the values calculated for genes located on the inner strand have been reversed.

Non-coding DNA. The coding content of previously sequenced bacterial genomes is, on average, 91%, ranging from 87% in Haemophilus influenzae to 94% in Aquifex aeolicus. In comparison, a large fraction of the R. prowazekii genome, 24%, represents non-coding DNA (Fig. 3). A small fraction of this corresponds to pseudogenes (0.9% of the genome) and less than 0.2% of the genome is accounted for by non-coding repeats. The remaining 22.9% contains no open reading frames of significant length and it has the low G+C content (mean 23.7%) that is characteristic of spacer sequences in the R. prowazekii genome14. A region of 30 kilobases (kb) located at position 886–916 kb contains as much as 41.6% non-coding DNA and 11.5% pseudogenes. The non-coding DNA in this region has a slightly, but significantly, higher G+C content (mean 27.3%) than non-coding DNA in other areas of the genome (mean 23.7%) (P < 0.001), indicating that it may correspond to inactivated genes that are being degraded by mutation (Fig. 3).
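
The non-coding fractions quoted in this section follow from the coding fractions reported earlier. A short arithmetic check is sketched below in Python; the variable names are illustrative only, and the repeat fraction is entered as its stated upper bound of 0.2%, so the spacer figure is approximate.

```python
# Percentages of the genome, as reported in the text
protein_coding = 75.4
stable_rna = 0.6
pseudogenes = 0.9
noncoding_repeats = 0.2  # reported as "less than 0.2%"; upper bound used here

# Everything that is neither protein-coding nor stable RNA
noncoding_total = round(100.0 - protein_coding - stable_rna, 1)

# Non-coding DNA left after removing pseudogenes and repeats (spacer sequence)
spacer = round(noncoding_total - pseudogenes - noncoding_repeats, 1)

print(noncoding_total, spacer)
```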

Figure 3: G+C content in intergenic regions longer than 20 bp in the R. prowazekii genome. The empty circles correspond to spacer sequences located at 886 to 916 kb, a region with an unusually large fraction of non-coding DNA and pseudogenes.

Origin of replication. The origin of replication has not been experimentally identified in the R. prowazekii genome, but we identified dnaA at ∼750 kb. However, the genes flanking the dnaA gene differ from the conserved motifs found in Escherichia coli and Bacillus subtilis (rnpA–rpmH–dnaA–dnaN–recF–gyrB). In R. prowazekii, the genes rnpA and rpmH are located in the vicinity of dnaA, but in the reverse orientation compared to the consensus motif, and dnaN, recF and gyrB are located elsewhere.

The origin and terminus of replication in microbial genomes are often associated with transitions in GC-skew values, (G − C)/(G + C)24. In R. prowazekii we observe transitions in the GC-skew values at around 0 and 500–600 kb (Fig. 1). There is a weak asymmetry in the distribution of genes in the two strands, such that the first half of the genome has a 1.6-fold higher gene density on one strand and the second half of the genome has a 1.6-fold higher gene density on the other strand. The shift in coding-strand bias correlates with the shift in GC-skew values. As most genes are transcribed in the direction of replication in microbial genomes, the origin of replication may correspond to the shift in GC-skew values at the position that we have chosen as the start point for numbering. Indeed, several short sequence stretches that are characteristic of dnaA-binding motifs are found in the intergenic region between genes RP001 and RP885 at 0 kb, supporting this interpretation.
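
As an illustration of the GC-skew measure, (G − C)/(G + C), evaluated in sliding windows, a minimal Python sketch follows. The defaults mirror the window size of 10,000 nucleotides and step size of 1,000 nucleotides given in the legend to Fig. 1. The function name gc_skew_windows is hypothetical, and for simplicity the sketch treats the sequence as linear, so windows do not wrap around the end of the circular chromosome; this is not the program used to produce the figure.

```python
def gc_skew_windows(seq, window=10_000, step=1_000):
    """Return (window_start, skew) pairs, with skew = (G - C) / (G + C).

    Assumptions: `seq` is a plain nucleotide string treated as linear,
    and a window containing no G or C is given a skew of 0.0.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skew = (g - c) / (g + c) if (g + c) else 0.0
        results.append((start, skew))
    return results


# Toy example: a G-rich stretch followed by a C-rich stretch gives a sign change
toy_skew = gc_skew_windows("GGGG" + "CCCC", window=4, step=4)
```

A change of sign between neighbouring windows, as in the toy example, is the kind of transition that is interpreted above as a candidate origin or terminus of replication.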

Stable RNA sequences and repeat elements. We identified 33 genes encoding transfer RNA, corresponding to 32 different isoacceptor-tRNA species. There is a single copy of each of the rRNA genes, with rrs located more than 500 kb away from the rrl–rrf gene cluster (Fig. 1). Comparison of the sequences from ten different Rickettsia species indicates that the disruption of the rRNA gene operon preceded the divergence of the typhus group and spotted fever group Rickettsia (S.G.E.A. et al., unpublished observations). In addition, the genome contains a short sequence with similarity to a 213-nucleotide RNA molecule in Bradyrhizobium japonicum that may regulate transcription25.

There are unusually few repeat sequences in this genome. We identified four different types of repeat sequence, all of which are located in intergenic regions. There is a sequence of 80 bp that is repeated seven times downstream of rpmH and rnpA in the dnaA region. A repetitive sequence of 325 bp is found at two intergenic regions that are more than 80 kb apart, downstream of the genes ksgA and rnh, respectively. A 440-bp-long repetitive sequence has been identified at two intergenic sites, 140 kb apart; one of these sites is downstream of rrf and the other is downstream of pdhA and pdhB. Finally, two similar sequences of 730 bp are located immediately next to each other at 850 kb.

Paralogous families. We have identified 54 paralogous gene families comprising 147 gene products. Of these, 125 have an assigned function. Most paralogues encode proteins with transport functions, such as the ABC transporters, the proline/betaine transporters and the ATP/ADP transporters. Five paralogous genes located next to each other at 115 kb encode putative integral membrane proteins with unknown functions.