Proteogenomics is the application of mass spectrometry-derived proteomic data for testing and refining predicted genetic models. Cyanobacteria, the only prokaryotes capable of oxygenic photosynthesis, are the ancestors of chloroplasts in plants and play crucial roles in global carbon and nitrogen cycles. We developed an integrated proteogenomic workflow and tested it on a model cyanobacterium, Synechococcus 7002, grown under various conditions. We obtained a nearly complete genome translational profile of this model organism. In addition, a holistic view of posttranslational modification (PTM) events is provided using the same dataset, and the results provide insights into photosynthesis. The entire proteogenomics pipeline is applicable to any sequenced prokaryote and could be applied as a standard part of genome annotation projects.

We describe an integrated workflow for proteogenomic analysis and global profiling of posttranslational modifications (PTMs) in prokaryotes and use the model cyanobacterium Synechococcus sp. PCC 7002 (hereafter Synechococcus 7002) as a test case. We found more than 20 different kinds of PTMs, and a holistic view of PTM events in this organism grown under different conditions was obtained without specific enrichment strategies. Among 3,186 predicted protein-coding genes, 2,938 gene products (>92%) were identified. We also identified 118 previously unidentified proteins and corrected 38 predicted gene-coding regions in the Synechococcus 7002 genome. This systematic analysis not only provides comprehensive information on protein profiles and the diversity of PTMs in Synechococcus 7002 but also provides some insights into photosynthetic pathways in cyanobacteria. The entire proteogenomics pipeline is applicable to any sequenced prokaryotic organism, and we suggest that it should become a standard part of genome annotation projects.

Proteogenomics refers to the use of mass spectrometry-derived proteomic data to refine genome annotation (1) and has been applied to the identification of previously unidentified genes and the correction and validation of predicted genes in various organisms (2–4). It is an important tool for integrating protein-level information into the genome annotation process and can greatly improve genome annotation quality. The same experimental proteomic datasets are also useful in identifying posttranslational modifications (PTMs) on a proteome-wide level (5, 6). Many cellular proteins undergo appreciable amounts of PTM in response to certain stimuli, and this dynamic process occurs in various cell compartments to dictate the fate and activity of the modified proteins (7). Identification and mapping of PTMs in proteins have been improved dramatically, mainly due to increases in the sensitivity, speed, accuracy, and resolution of mass spectrometry (MS). However, system-wide identification of multiple PTMs remains a highly challenging task, especially in situations where some reversible PTMs are induced by a particular stimulus and are present for only a short period (8). To the best of our knowledge, very few studies have used proteogenomic datasets to analyze PTM events comprehensively across a genome (9, 10).

In this study, we developed a proteogenomic approach to carry out genome annotation and whole-proteome analysis of PTMs in prokaryotes by using high-resolution, high-accuracy MS data and the cyanobacterium Synechococcus sp. PCC 7002 (hereafter Synechococcus 7002) as a test case. Cyanobacteria are a morphologically diverse group of Gram-negative bacteria and are the only prokaryotes capable of oxygenic photosynthesis (11). It is estimated that more than half of the photosynthetic activity on Earth is contributed by cyanobacteria (12). Cyanobacteria make substantial contributions to global CO2 assimilation, O2 production, and N2 fixation and are the progenitors of chloroplasts in higher plants (13). Cyanobacterial habitats are highly diverse, and cyanobacterial cells adjust their cellular activities in response to a wide range of environmental cues and stimuli. Recently, cyanobacteria have attracted great interest due to their crucial roles in global carbon and nitrogen cycles and their ability to produce clean and renewable biofuels such as hydrogen (14–16). Synechococcus 7002 is a unicellular, marine cyanobacterium and a model organism for studying photosynthetic carbon fixation and the development of biofuels (17, 18). However, although the genome of Synechococcus 7002 is fully sequenced, it is annotated only by in silico methods (www.ncbi.nlm.nih.gov/), with a large portion (1,210 out of 3,186) of protein-coding genes annotated as hypothetical proteins (17). Therefore, a comprehensive analysis is needed to provide experimental support for the genome annotation so as to facilitate systems-level analysis. Using our method, we validated the predicted protein-coding genes, identified previously unidentified genes, and corrected gene initiation and stop-codon positions in Synechococcus 7002; directional RNA-Seq was used to confirm the expression of a number of the previously unidentified genes identified in this study.
More importantly, we characterized PTM features on a proteome-wide level using the same experimental proteomic datasets and identified many different PTM types that may play important roles in cellular functions. Our proteogenomic data provided significant information for revising the genome annotation of Synechococcus 7002 and offered insights into the physiology of this model organism. The method and approach can also be used to study genome annotation and cellular protein PTMs in other organisms.

Results

Proteogenomic Strategy for the Analysis of Synechococcus 7002. The aim of this study was to provide an experimental catalog of the genome-wide gene expression and PTMs in Synechococcus 7002 and to use this information to refine genome annotations. To enhance coverage of the expressed genome, cultures from eight different growth conditions were combined, and total protein extracts were isolated. This treatment mimicked native conditions experienced by Synechococcus 7002 and allowed greater PTM representation in the isolated proteins. A total of 52 samples were generated and subjected to nanoscale liquid chromatography coupled to tandem mass spectrometry (nano LC-MS/MS) analysis on a high-resolution LTQ-Orbitrap Elite mass spectrometer. Using five different algorithms, MS-derived data were searched against (i) a protein database comprising the protein sequences of Synechococcus 7002 from CyanoBase and (ii) a database of a six-frame translated genome of Synechococcus 7002. The complete workflow is summarized in Fig. 1. Peptide spectrum matches (PSMs) were filtered for first-rank assignments that passed a 1% false discovery rate (FDR) threshold. The complete list of peptides identified in our study, along with PSM scores, charge, m/z value, and mass error, is provided in Table S1A [Tables S1–S9 are available at www.peptideatlas.org (accession no. PASS00285)]. We achieved a mean absolute mass deviation of 0.005006 Da (Fig. 2A). The average absolute peptide mass accuracy was 1.96 parts per million (ppm) (SD, 1.78 ppm), and more than 94% of the identified PSMs had less than 5 ppm mass error, confirming the high accuracy of peptide data obtained from the mass spectrometer (Fig. S1A). Fig. 2B shows the distribution of peptides and proteins identified from the two fractionation methods applied for analyzing the cell-lysate proteins. 
Sequest, Mascot, MaxQuant, pFind, and X!Tandem searches identified 238,918; 239,779; 252,696; 241,601; and 268,763 peptides, respectively, resulting in the identification of 55,862 unique peptides.

Fig. 1. Experimental and bioinformatic workflow of the proteogenomic analysis. Protein extracts were prepared from Synechococcus 7002 cultures grown to exponential phase (flask 1) and stationary phase (flask 2) and exposed to stresses, including iron deficiency (flask 3), phosphate deficiency (flask 4), nitrogen deficiency (flask 5), A5 deficiency (flask 6), high light (flask 7), and high salt concentration (2.5 M) (flask 8). After protein extraction, proteins were subjected to Glu-C and trypsin digestion, producing a peptide mixture. The mixture was analyzed by means of nano-LC-MS/MS with an LTQ-Orbitrap Elite mass spectrometer. MS/MS peptide spectra were searched against specific organism genome sequences, validating and correcting genomic annotations, as well as identifying previously unidentified protein-coding genes and diverse PTMs.

Fig. 2. Overview of the proteogenomic results. (A) Scatter plot showing the distribution of the mass errors of all of the identified peptides. (B) Venn diagram illustrating the relative contribution of the different fractionation methods to the total number of peptides and proteins identified. (C) Venn diagram illustrating the coverage of several levels (whole, protein-coding, expressed, detected) of the Synechococcus 7002 genome. (D) Proteome landscape of Synechococcus 7002. An overview of peptide data against the genome was generated using the Circos software (45). The concentric circles from the periphery to the center represent (i) Synechococcus 7002 chromosomes and six plasmids, (ii) proteins encoded by the genome, (iii) GC content of the Synechococcus 7002 genome, (iv) peptides identified in this study, genome search-specific peptides (GSSPs), (v) previously unidentified gene models (1, intergenic genes; 2, different frame with annotated genes; and 3, opposite strand to existing genes discovered in this study), and (vi) revised gene models (N-terminal changes in annotated genes).
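The mass-accuracy figures quoted above (average absolute error in ppm, and the share of PSMs below 5 ppm) follow from a simple calculation per PSM. The sketch below shows that calculation in Python; the function names and the three example m/z pairs are hypothetical and are not taken from the study's data or software.

```python
def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Signed mass error of one PSM in parts per million (ppm)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def mean_absolute_ppm(errors_ppm) -> float:
    """Average of the absolute ppm errors over all PSMs."""
    errors = list(errors_ppm)
    return sum(abs(e) for e in errors) / len(errors)


def fraction_within(errors_ppm, tolerance_ppm: float = 5.0) -> float:
    """Fraction of PSMs whose absolute error is below the tolerance."""
    errors = list(errors_ppm)
    return sum(abs(e) < tolerance_ppm for e in errors) / len(errors)


# Hypothetical (observed, theoretical) m/z pairs, for illustration only.
pairs = [(500.0010, 500.0000), (750.0030, 750.0000), (1000.0100, 1000.0000)]
errors = [ppm_error(obs, theo) for obs, theo in pairs]
```

Applied to a full PSM table, `mean_absolute_ppm` and `fraction_within` would give the two kinds of summary statistic reported in the text.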

Protein Expression Analysis of Synechococcus. The CyanoBase database (genome.microbedb.jp/cyanobase/) lists 3,186 protein-coding genes in the Synechococcus 7002 genome (3.41 Mb, released 2012). Unique peptides were mapped onto the genome-translated database, and proteins identified by at least two unique peptides or by a single peptide with manual validation were reported. In total, we identified 2,938 Synechococcus 7002 proteins at 1% FDR (2,699 proteins with at least two unique peptides and 239 proteins with single-peptide identification), which are listed in Table S1A. Proteins identified based on shared peptide evidence were listed separately (Table S1B). This comprehensive dataset enabled us to address the general features of our proteomic experiments, especially with respect to coverage of the genome sequence by the detected peptides. We first defined the protein-coding part of the genome by mapping the 3,186 annotated proteins onto the chromosome and plasmids (3.41 Mb) (Fig. 2C); this result revealed that at least 87.1% (2.97 Mb) of the genome is annotated in the protein database and is therefore protein coding. We next used the sequences of all 2,938 proteins identified in our dataset to estimate the size of the expressed genome, which corresponded to 98.3% (2.92 Mb) of the annotated protein-coding regions. Finally, mapping the detected peptide sequences onto the chromosome and plasmids captured 2.30 Mb of the raw genome sequence or 77.4% of the protein-coding genome (Fig. 2C). The average sequence coverage per identified protein was 44.4%, and 1,211 proteins had peptide evidence for more than 50% of their sequences (Fig. S1B). Each protein was represented by 1–357 distinct peptides, with an average of 19 peptides per identified protein (Fig. S1C). As an illustration of the high coverage of many of the proteins, Fig. S1D depicts the identification of peptides mapping to 100% of the phycocyanin alpha subunit (SYNPCC7002_A2210) gene product, where we identified 194 unique peptides based on 27,145 PSMs.
It is clear from the data that our approach using more than one method each of protein isolation, protein/peptide fractionation, and database search resulted in an increased number of peptide and protein identifications from the same sample. The high quality of our protein identifications was demonstrated by the following: the FDR at the peptide level was lower than 1.0%; the average absolute peptide-mass accuracy was 1.96 ppm; 92% of 2,938 identified proteins were mapped by at least two unique peptides; and all modified peptides and peptides singly assigned to proteins were manually verified. Among all of the identified proteins, a large number of hypothetical proteins were identified in Synechococcus 7002 (Table S1C). The Synechococcus 7002 genome annotation contains 1,210 (38%) hypothetical proteins, which are functionally uncharacterized due to lack of sequence similarity to any known genes from model prokaryotes: i.e., proteins predicted on the basis of the nucleic acid sequences only and protein sequences with unknown function. In this study, we identified 918 proteins by MS that are annotated as hypothetical proteins, among which 311 were assigned by at least 100 PSMs, and only 59 were single-peptide identifications (Fig. S2A). BLAST searches were performed against the National Center for Biotechnology Information (NCBI) database to obtain functional descriptions of the hypothetical proteins, and 99 MS-detected hypothetical proteins had assigned Gene Ontology (GO) terms (Fig. S2B). The distribution of identified proteins among different biological processes, molecular functions, and cellular localizations is illustrated in Fig. S2C. Thus, our results provide important clues for future functional studies of these hypothetical proteins.
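Per-protein sequence coverage, as reported above, is the percentage of residues covered by at least one identified peptide. A minimal Python sketch of that calculation follows; it assumes exact substring matching of peptides to the protein sequence, and the example protein and peptides are hypothetical.

```python
def sequence_coverage(protein: str, peptides) -> float:
    """Percentage of residues in `protein` covered by at least one peptide."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:  # mark every occurrence of the peptide
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Hypothetical 16-residue protein and peptides, for illustration only.
protein = "MKTAYIAKQRQISFVK"
partial = sequence_coverage(protein, ["MKTAYIAK", "QISFVK"])       # 14 of 16 residues
full = sequence_coverage(protein, ["MKTAYIAK", "AKQR", "QISFVK"])  # overlapping peptide fills the gap
```

Overlapping peptides are counted once per residue, so coverage never exceeds 100%.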

Integration and Visualization of Proteomics and RNA-Seq Transcriptomic Data. Increasing numbers of investigators are now incorporating RNA-Seq information with proteomic data to gain a more complete understanding of cellular systems and improve genome annotation (19–21). The comparative analysis of RNA and protein levels can be used as a validation tool to generate a protein atlas with higher reliability. In this study, we performed directional RNA sequencing, and the results are presented in Table S1D. We further integrated the proteomic data described above with the transcriptome determined by RNA-Seq to facilitate the validation of the proteomic results. In-depth transcriptome analyses of RNA-Seq data from Synechococcus cultures under eight different conditions have shown that >97% of the annotated genome is expressed in these cellular states, and our proteome data set covers >93% of these RNA-Seq data (Fig. 3A). We have thus generated a comprehensive protein database that covers nearly the entire expressed Synechococcus proteome. To compare peptide versus RNA abundance, we generated a scatter plot of mRNA expression [reads per kilobase per million mapped reads (RPKM), 2,160,264 reads] mapped to a known gene (x axis) versus the protein expression (PSMs, 1,212,428 spectra) (y axis) falling within the region. The correlation value (r = 0.318) calculated using Spearman’s rank correlation coefficient indicated only a weak positive correlation between protein expression (PSMs) and mRNA expression (RPKM) (Fig. 3B). This result is in line with previous studies, which have shown that protein expression is influenced by an array of posttranscriptional regulatory mechanisms and that correlation between protein and mRNA levels is generally modest (22, 23); it is also highly similar to the results of previous analyses on Medicago truncatula (24), human, and mouse (25).
These data were combined with the genome data on Synechococcus 7002 and made publicly accessible online (lag.ihb.ac.cn:8080/), where they can be browsed by gene, protein, peptide, and PTM. Fig. 3C shows a screenshot of ABrowse displaying genomic, RNA-Seq, and proteomic data from Synechococcus 7002. This platform helps visualize all of the potential reading frames in the Synechococcus 7002 genome and has the capability to help browse and query the data to identify regions of interest quickly with respect to structural annotation (e.g., previously unidentified genes or identified peptides and their PTMs). The annotation entries shown in the main browsing canvas of ABrowse are all clickable, and their corresponding detailed information can be displayed in the “Entry Detail” tab of the detailed-information/user-space panel. In addition, all of the modified peptides listed in the main browsing canvas were highlighted in light blue, and the previously unidentified genes identified in this work were highlighted in light yellow in the protein-coding gene models region. The interactive visual analytics tool provides a user-friendly web interface to browse, search, retrieve, and update information on the Synechococcus 7002 genome, and it is a step toward integration of proteomic (peptide-centric) and transcriptomic (RNA-Seq) data with current genome-location coordinates, allowing for in-depth studies of individual genes and their protein counterparts as well as more global studies using systems-biology approaches.

Fig. 3. Comparison and integration of proteomics and transcriptomics data. (A) Comparison showing the overlap of protein identifications with the transcriptome determined using RNA-Seq in Synechococcus 7002. (B) Correlation between mRNA expression (RPKM) and protein expression (PSMs) in genes detected at both the mRNA and protein levels. (C) Screenshot of the main ABrowse genome browser interface. The genome browser interface consists of the navigation bar, the browsing canvas, and the detailed-information/user-space panel, which was used to covisualize experimental peptides and RNA-Seq data for Synechococcus 7002. The sequences of the protein and the peptides can be seen by zooming into the protein sequence track. This type of covisualization can be done on a large scale to comprehensively integrate the proteome with the genome and the transcriptome.
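The mRNA-versus-protein comparison rests on two calculations: RPKM normalization of read counts and Spearman's rank correlation between per-gene RPKM and PSM counts. The Python sketch below uses the standard RPKM definition and a plain rank-based Spearman coefficient with average ranks for ties; the function names are hypothetical, and the code is an illustration rather than the software used in the study.

```python
import math


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(rpkm_values, psm_counts)` on per-gene vectors would yield a coefficient comparable in kind to the r = 0.318 reported above.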

Identification of Previously Unidentified Genes. We also compared peptide sequences searched against a six-frame translated genome database with those present in the protein database. We identified a set of unique orphan peptides that were not represented among the predicted proteins of Synechococcus. These peptides, designated as genome search-specific peptides (GSSPs), mapped to unique locations on the Synechococcus 7002 genome. Out of the 2,778 GSSPs identified in this study, 486 peptides either mapped to regions of the genome where no gene was annotated or did not match the gene model they were mapping to. Two gene-prediction programs, FgenesB and GeneMark, were used to identify ORFs in the region to which these GSSPs were mapped. Our in-house program was also used to identify ORFs after incorporating a wide range of information, including the peptide score, an initiation codon, the number of peptide hits, and the length of the previously unidentified protein. Using these programs, we identified 118 previously unidentified protein-coding regions and modified the annotation of 38 existing gene models, all of which were assigned by at least two unique GSSPs. A graphical representation of all previously unidentified and corrected proteins identified in this study is shown in Fig. 2D. Further, we used comparative genomic strategies to investigate conservation of previously unidentified gene models across related species. The presence of orthologous genes in other species provides further support for the previously unidentified gene structures; conversely, absence of conserved homologous protein-coding regions in other genomes may indicate that these genes or gene regions may be unique to Synechococcus 7002. Twenty-two of the previously unidentified ORFs had been previously annotated in other species, 80 previously unidentified genes were supported by FgenesB/GeneMark ORF predictions, and the remaining genes were predicted using our in-house program. 
Table S2 A–C lists the previously unidentified genes found in this study along with their genomic coordinates and the supporting peptide evidence. Among the previously unidentified genes identified, 13 were intergenic, 41 were frame-shifted, and 64 were on the strand opposite to an existing gene annotation. Fig. 4 depicts an example of a previously unidentified gene that is located in the intergenic region between two predicted genes, SynPCC7002_A0518c and SynPCC7002_A0519. Another previously unidentified protein of 100 amino acids (SynPCC7002_DZ003) was discovered in the region spanning 11,658–11,960 on plasmid pAQ4 (Fig. S3). Nine GSSPs revealed a previously unidentified gene (SynPCC7002_Z0076) encoded by the opposite strand of the SynPCC7002_A0539 gene and were identified as part of the gene model encoding a 91-amino acid protein predicted by FgenesB and GeneMark and supported by directional RNA-Seq data (Fig. S4).

Fig. 4. Identification of a previously unidentified gene, Z0010, based on peptides mapping to an intergenic region and RNA sequencing evidence. (A) Sixteen peptides mapped to an intergenic region between genes SynPCC7002_A0518c and SynPCC7002_A0519. Gene prediction algorithms FgenesB and GeneMark supported the presence of this additional gene. In addition, a protein corresponding to this previously unidentified gene has been annotated in the Leptolyngbya sp. PCC 7376 genome. (B) Protein sequence of a previously unidentified gene. Identified region is indicated by red text. (C) RNA sequencing evidence also supports the expression of the previously unidentified gene Z0010.
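The GSSP concept can be illustrated with a short six-frame translation sketch: the genome is translated in all six reading frames, and a peptide counts as genome search-specific if it occurs in a translated frame but in none of the annotated proteins. The Python code below uses the standard genetic code; the helper names and the exact-substring matching are simplifying assumptions, not the search strategy of the engines used in the study.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate codon by codon; '*' marks stops, 'X' unknown codons."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame(seq: str) -> dict:
    """Return translations keyed '+1'..'+3' and '-1'..'-3'."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def genome_specific_peptides(peptides, genome: str, annotated_proteins):
    """Peptides found in a six-frame translation but in no annotated protein."""
    frames = six_frame(genome).values()
    return [p for p in peptides
            if any(p in f for f in frames)
            and not any(p in prot for prot in annotated_proteins)]


frames = six_frame("ATGAAATAA")
```

In practice the six-frame database would be built once and searched by the MS search engines; the substring test here only shows the set logic that defines a GSSP.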

N Terminus Validation and Correction. Apart from identification of previously unidentified ORFs, we corrected some gene coordinates using GSSP data. Using this strategy, we corrected 38 gene models by N-terminal extension with 105 unique GSSPs. Table S2D contains the list of genes for which modification in the structure was suggested by our analysis, and this table contains information about previously assigned coordinates, our modifications, and the corresponding peptide evidence. Fig. S5A depicts an example of correction of a gene model by extension of the N terminus. The current coordinates of the gene SynPCC7002_A0031, which codes for the Dps family DNA-binding stress response protein, are 28,318–28,854. We found seven peptides mapping upstream of the gene, and comparative genomic analysis detected a homologous protein in Synechococcus elongatus PCC 7942. In addition, both FgenesB and GeneMark gene-prediction algorithms predicted gene models that extended the gene in the 5′ direction, in agreement with the newly annotated start codon. Thus, a combination of GSSPs, homology, and RNA-Seq–based analysis allowed us to extend the gene model 48 nucleotides upstream from the previous start codon. We further analyzed protein translation start sites (TSSs) by probing N-terminal–specific modifications. Formylation occurs on the initiator methionine, and N-terminal acetylation occurs on the second amino acid after the initiator methionine is cleaved. These modifications directly mark the TSS for a protein-coding gene. Based on semiproteolytic peptides identified at <1% FDR and their upstream codons in the genome, we confirmed the annotated TSS for 52 genes (Table S2E). Fig. S5B depicts two N-terminally formylated unique peptides that facilitate confirmation of the TSS for SynPCC7002_A1803c, a protein involved in the carbon dioxide-concentrating mechanism.
We also identified N-terminal peptides that confirmed the TSS of three revised gene models (SynPCC7002_A0315, SynPCC7002_A1929, and SynPCC7002_A0646) (Table S2E).

Identification of Proteins with Noncanonical Translation Initiators. In prokaryotes, translation initiation is typically mediated by the start codon ATG. In addition, GTG and TTG are used as alternative start codons in agreement with the preference: ATG > GTG > TTG (26, 27). Studies on Escherichia coli have estimated the frequency of start codon use as ATG 83%, GTG 14%, and TTG 3% (28). In Synechococcus 7002, the frequencies of initiation codons were ATG 83.11%, GTG 9.51%, TTG 7.31%, and CTG 0.06%. Additionally, it has been reported that ATT, ATA, and ATC are rare noncanonical translation initiators (28, 29). A set of 19 putative proteins with noncanonical start codons (ATT or ATA), supported by 43 unique GSSPs, was identified and is reported in Table S2F. Fig. S6A illustrates an example of the use of peptides to predict a nontraditional start codon. We identified two peptides that map to the opposite strand of SynPCC7002_A0826c; however, none of the common ATG, GTG, or TTG start codons were found upstream of the region spanned by these two peptides. A survey of available literature on translation initiation revealed that ATA is known to function as a rare translation initiator, suggesting that this previously unidentified gene Z0045 may use ATA as the start codon. A representative MS/MS spectrum of the peptide INDAALRKE is shown in Fig. S6A, and the nearly complete b and y ion series provide additional confidence in the annotation. The previously unidentified protein is supported by two gene-prediction algorithms, FgenesB and GeneMark, and we validated the expression of this gene at the level of the transcript with seven mRNA reads and a 2.55 RPKM value. The use of this noncanonical translation initiation codon may imply a specific regulatory mechanism, and further experiments are needed to characterize the fundamental regulatory mechanisms for cyanobacterial cells.
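The reasoning in the Z0045 example (no canonical start codon upstream of the peptide evidence, so a rare initiator is considered) can be sketched as an in-frame upstream scan. The Python function below walks codon by codon from the first peptide-supported codon toward the 5′ end until it meets a stop codon and records every canonical or rare start codon on the way. The function name, the input convention, and the toy sequences are hypothetical; this is not the in-house program described in the text.

```python
CANONICAL_STARTS = ("ATG", "GTG", "TTG")
RARE_STARTS = ("ATT", "ATA", "ATC")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_start_candidates(seq: str, peptide_start: int):
    """Scan in frame from `peptide_start` (0-based index of the first base of
    the first peptide-supported codon) toward the 5' end.

    Returns (position, codon, kind) tuples ordered from nearest to farthest;
    the scan stops at the first in-frame stop codon."""
    seq = seq.upper()
    candidates = []
    pos = peptide_start
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break
        if codon in CANONICAL_STARTS:
            candidates.append((pos, codon, "canonical"))
        elif codon in RARE_STARTS:
            candidates.append((pos, codon, "rare"))
        pos -= 3
    return candidates


# Toy sequences (coding strand), for illustration only.
only_rare = upstream_start_candidates("TAAATAGCTAAAGAAGCT", 6)
mixed = upstream_start_candidates("TAAATGGCTATAAAAGCT", 12)
```

When the returned list contains no canonical entry, as in the first toy sequence, a rare initiator such as ATA becomes the only in-frame candidate, which mirrors the argument made for Z0045.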

Identification of Peptides with Stop Codon Read-Through. Aside from noncanonical translation initiation, protein translation can sometimes continue through a stop codon, a mechanism known as “stop codon read-through” (30–32). Two examples of this mechanism are the TGA and TAG stop codons that sometimes code for the rare amino acids selenocysteine and pyrrolysine, respectively (33, 34). In this study, we translated all of the TGA and TAG codons into selenocysteine (U) and pyrrolysine (O), respectively, while keeping the third stop codon, TAA, as the true stop codon. We found 23 read-through candidates with 43 unique GSSPs, and Table S2G provides a complete list of the peptides identified with their genomic coordinates and RPKM values. Annotated peptide spectrum matches of the identified peptides with stop codon read-through have been deposited in the PeptideAtlas database and can be accessed using the accession number PASS00285. Fig. S6B depicts a previously unidentified gene model in which two genes are merged into one based on RNA-Seq and MS experimental evidence of translation of the TAG stop codon as pyrrolysine (O). Read-through has been observed across the animal kingdom, and its widespread use suggests an additional level of regulatory complexity (31). Therefore, the identified read-through in cyanobacteria offers a rich opportunity to enhance our understanding of the underlying molecular mechanisms and regulation of the read-through process more generally.
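The recoding rule described above (TGA read as selenocysteine U, TAG read as pyrrolysine O, TAA kept as the only true stop) amounts to a small change to the codon table used when building the search database. A minimal Python sketch, assuming the standard genetic code and a hypothetical function name:

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Read-through table: TGA -> U (selenocysteine), TAG -> O (pyrrolysine);
# TAA remains '*', the only codon that terminates translation here.
READTHROUGH = dict(STANDARD, TGA="U", TAG="O")


def translate_readthrough(seq: str) -> str:
    """Translate from the first base until TAA or the end of the sequence."""
    protein = []
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        aa = READTHROUGH.get(seq[i:i + 3], "X")  # 'X' for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy ORF: ATG AAA TAG GCT TGA CAT TAA GGG
example = translate_readthrough("ATGAAATAGGCTTGACATTAAGGG")
```

Peptides containing U or O that match spectra then point to candidate read-through events, which, as in the text, still need supporting evidence such as RNA-Seq coverage.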

Identification of Signal Peptides. A signal peptide is a short N-terminal sequence that targets a protein for export or transport to a desired cellular location (35). Signal peptides are essential for proper cellular function in both eukaryotes and prokaryotes (35). The average length of a signal peptide in Gram-negative bacteria is estimated to be 25 amino acids, with most signal peptides between 20 and 30 amino acids (36). Although knowledge of signal peptides is important for understanding protein function, they are difficult to confirm experimentally, and computational predictions are used to fill the gap. SignalP (37) and PrediSi (38) are two popular signal peptide prediction tools. Proteomic evidence can greatly increase the number of experimentally confirmed signal peptides and the confidence of signal peptide predictions. We analyzed our peptide annotations to confirm or refute signal peptide predictions. The lists of predicted signal peptides by SignalP, PrediSi, and MS/MS analysis are provided in Table S3. A clear sequence motif, which closely matches motifs used by SignalP and PrediSi, emerges when we examine the sequence immediately upstream of the 107 putative signal peptides predicted by MS/MS analysis (Fig. S7A). SignalP and PrediSi predict 130 and 332 proteins with signal peptides, respectively. However, there is a substantial discrepancy between these tools: only 81 signals are predicted by both. LC-MS/MS evidence provides a strategy to resolve the discrepancies between SignalP and PrediSi and identify signal peptides missed by both tools. Fig. S7B compares our predicted signal peptides with the predictions made by SignalP and PrediSi. Our results confirm 12 signal peptide predictions. For 119 confirmed proteins, the MS/MS results include peptides upstream of the cleavage site predicted by SignalP/PrediSi and thus represent evidence against SignalP/PrediSi predictions (Table S3D).
Therefore, we refute 25 sites predicted by SignalP and 110 sites predicted by PrediSi (with 8 refuted sites predicted by both tools) (Fig. S7C).
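The logic for confirming or refuting a predicted cleavage site from peptide evidence can be sketched as follows. The rule set is an assumption made for illustration: a prediction is treated as refuted when any identified peptide starts inside the predicted signal region (that is, extends upstream of the cleavage site), and as confirmed when a peptide starts exactly at the first residue of the predicted mature protein. The function name and inputs are hypothetical, and the study's exact criteria may differ.

```python
def evaluate_cleavage_site(signal_length: int, peptide_starts) -> str:
    """Classify a predicted signal peptide using MS/MS peptide positions.

    signal_length: predicted number of residues in the signal peptide, so the
        mature protein is predicted to begin at residue signal_length + 1.
    peptide_starts: 1-based start positions of identified peptides in the
        full-length precursor sequence.
    """
    starts = list(peptide_starts)
    mature_start = signal_length + 1
    if any(s <= signal_length for s in starts):
        return "refuted"      # peptide evidence upstream of the cleavage site
    if mature_start in starts:
        return "confirmed"    # peptide begins exactly at the predicted site
    return "no evidence"


# Hypothetical predictions with a 25-residue signal peptide.
results = [
    evaluate_cleavage_site(25, [26, 40]),
    evaluate_cleavage_site(25, [10, 26]),
    evaluate_cleavage_site(25, [60]),
]
```

Running the same check against the SignalP and PrediSi predictions separately would produce per-tool tallies of confirmed and refuted sites of the kind summarized above.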

Global Posttranslational Modification Discovery in Synechococcus 7002. As mentioned above, proteogenomic data may be used to identify PTMs on the proteome-wide level. Much less is known about the type, frequency, and function of PTMs in prokaryotes, even for intensively studied model organisms such as E. coli, than in eukaryotes. PTM information gained from MS/MS data of cells grown under different conditions will contribute to an understanding of PTMs in prokaryotic organisms. Recently, we described a phosphoproteomic analysis of Synechococcus 7002 by using protein/peptide prefractionation, TiO2 enrichment, and LC-MS/MS analysis (39). In this study, we first analyzed the mass spectra using the MODa (MODification via alignment) algorithm (40), which allows for unrestrictive PTM searches with no limitation on the number of modifications per peptide and incorporates several unique features to improve the sensitivity and accuracy of peptide identification and to limit increases in false positives and false negatives. In all, 278,859 spectra of 70,042 unique peptides and 2,469 proteins were identified by MODa. Among these results, 76,103 spectra of 28,905 unique peptides and 1,666 proteins contained PTMs with sizes accepted by the MODa search, regardless of the modification classification in Unimod. Within these PTMs, 7,026 spectra of 3,059 unique peptides and 690 proteins contained in vivo PTMs. These findings, along with the parameters for search and output column descriptions, are summarized in Table S4A. Table S4B presents 25 common modification types and the frequency of each. Because the false positive rate is low, it is extremely unlikely that any of these modification types represent a computational artifact. Moreover, all of the selected modification types in Table S4B are supported by studies on other species, further reinforcing the conclusion that they are not artifacts.
MODa was used to (i) discover and identify unexpected modifications and (ii) assign posttranslationally modified peptide sequences to MS/MS spectra, but MODa cannot be used to assign the localization of modifications to specific amino acids. To address this problem, the MaxQuant computational proteomics platform was used to search for the specific PTM and determine the localization probability of modifications in peptides (41). The identified peptides with different PTMs were further validated by manual inspection of MS/MS data (see criteria in SI Materials and Methods). MaxQuant takes advantage of high-resolution data such as those obtained by the LTQ-Orbitrap instruments and employs algorithms that determine the mass precision and accuracy of individual peptides (41). This method leads to greatly enhanced peptide mass accuracy that can be used as a filter in database searching. Our approach identified 23 different PTMs on 6,704 unique peptides from 2,230 Synechococcus 7002 proteins with high confidence (FDR <1%). Fig. 5A shows the numbers and types of PTMs identified by MaxQuant in Synechococcus 7002. We further confirmed the localization accuracy of these modification sites using probability-based PTM scores. These scores were used to rank peptides in MaxQuant searches from the beginning, and they determined the localization probability of modifications in the peptides. For the modified peptides detected, 11,839 modification sites were determined with a localization probability higher than 0.75 (Table S5A). In the other 1,977 peptides, the modification sites could not be unambiguously determined from the mass spectra but are limited to the short amino acid sequences in the peptides. Details of the identified peptides with each PTM, including their protein IDs, sequences, search algorithm scores, PTM scores, and localization P values, are provided in Table S5A. 
To evaluate the identification performance of PTM events by different algorithms, we compared the modified peptides and proteins identified by both PTM search algorithms, and Fig. S7D presents the overlap between MODa and MaxQuant search results. It shows that 649 unique modified peptides and 1,518 modified proteins were identified by both PTM search algorithms. Table S4B presents the common modification sites identified by the two PTM search algorithms. We also compiled a list of all proteins identified in this study along with the number of unique peptides, coverage, UniProt accessions, NCBI RefSeq accessions, protein product description, domains, Gene Ontology information, and signal peptide and transmembrane domain information in Table S5B.

Fig. 5. Summary of the identified proteins with PTMs in Synechococcus 7002. (A) Distribution of the number of proteins and sites with different PTMs in Synechococcus 7002. (B) Distribution of the predicted, identified, and modified proteins of Synechococcus 7002 according to their molecular functions. Different PTMs are represented with different colors. GO category classification of Synechococcus 7002 proteins as predicted from their genome annotations. *, Denotes overrepresentation.
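The comparison between the MODa and MaxQuant results amounts to a set intersection over modified peptide (or protein) identifiers. The sketch below assumes that identifiers from both tools have already been normalized to a common string notation; that normalization step, the function name, and the example peptides are all hypothetical.

```python
def overlap_summary(moda_ids, maxquant_ids):
    """Return identifiers shared by, and unique to, each search result."""
    moda, maxquant = set(moda_ids), set(maxquant_ids)
    return {
        "both": moda & maxquant,
        "moda_only": moda - maxquant,
        "maxquant_only": maxquant - moda,
    }


# Made-up modified-peptide identifiers in an assumed shared notation.
moda_peptides = {"AK(ac)STR", "GSY(ph)TK", "MK(su)LVK"}
maxquant_peptides = {"AK(ac)STR", "MK(su)LVK", "TLK(me3)ER"}

summary = overlap_summary(moda_peptides, maxquant_peptides)
for group, ids in summary.items():
    print(group, len(ids))
```

The same function can be applied to protein accessions to obtain the protein-level overlap.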

Biological Relevance of Modified Proteins. Previous conservation analysis indicated that posttranslationally modified proteins are more conserved than unmodified proteins from prokaryotes to eukaryotes (5). Owing to the high coverage and large number of modified proteins identified, we compared the conservation levels between MS-detected modified and unmodified proteins. In this study, we searched for orthologs of Synechococcus 7002 proteins against 652 bacterial species across the phylogenetic tree, as well as against 18 Archaea and 61 eukaryotes, by performing 2D BLASTP. As shown in Fig. S7E, the MS-detected proteins were, on average, more conserved than the unidentified proteins in the Synechococcus 7002 database throughout the three superkingdoms (P < 0.01). In addition, our analysis indicated that most MS-detected modified proteins are more highly conserved than the MS-detected total proteins (P < 0.01), suggesting that the modified proteins could be involved in the conserved function class and translation machinery. Table S6 compares some important PTMs between Synechococcus 7002 and other species. This set of modified proteins supports the emerging view that PTM events are general and fundamental regulatory processes that occur in both prokaryotes and eukaryotes and opens the way for their functional and evolutionary analysis in cyanobacteria. The entire set of predicted proteins, MS-detected proteins, and modified proteins was searched against the NCBI COG (Clusters of Orthologous Genes) database and classified into a wide range of functional classes. The distribution of protein proportions in different functional classes is illustrated in Fig. 5B and Tables S7 and S8. In particular, certain sets of modified proteins were statistically overrepresented (P < 0.05) in certain functional classes (Fig. 5B). For example, the identified proteins with methylation, persulfide modification, farnesylation, and hydroxymethylation PTMs were overrepresented in photosynthesis and respiration processes, suggesting that these PTM events are extensively involved in photosynthesis in Synechococcus 7002.
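The text does not state which statistical test produced the overrepresentation P values. One common choice for this kind of functional-class enrichment is a one-sided hypergeometric test, sketched below under that assumption. The function name and all counts in the example are hypothetical and are not taken from Tables S7 and S8.

```python
from math import comb


def hypergeom_upper_tail(k, population, class_size, sample_size):
    """P(X >= k) for a hypergeometric draw.

    population  : all proteins considered (e.g., all predicted proteins)
    class_size  : proteins in the functional class of interest
    sample_size : proteins carrying a given PTM
    k           : modified proteins observed in that class
    """
    upper = min(class_size, sample_size)
    total = comb(population, sample_size)
    favorable = sum(
        comb(class_size, i) * comb(population - class_size, sample_size - i)
        for i in range(k, upper + 1)
    )
    return favorable / total


# Made-up counts: 1,000 proteins, 50 in the class, 40 carrying the PTM,
# 8 of which fall in the class (about 2 would be expected by chance).
p_value = hypergeom_upper_tail(8, 1000, 50, 40)
print("overrepresented" if p_value < 0.05 else "not significant")
```

In practice, testing many PTM and functional-class combinations would normally call for a multiple-testing correction, which this sketch omits.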

PTM Events Involved in Photosynthesis. It has been reported that protein PTM events are involved in the regulation of photosynthesis in cyanobacteria (39). According to the proteogenomic dataset contributed by this study, many proteins in the photosynthetic apparatus carry diverse PTMs (Table S9), as illustrated in Fig. 6A. Mapping PTM events to the major components of the photosynthetic apparatus will facilitate the integration of proteogenomic data with biological function and may thereby provide insight into the potential functional relevance of the identified PTMs. To confirm the existence of the identified PTMs in Synechococcus 7002 further, we carried out immunoblotting analyses of Synechococcus 7002 total cell lysates using pan antibodies against acetylation (lysine, abbreviated as K), trimethylation (K), dimethylation (K), butyrylation (K), crotonylation (K), phosphorylation (tyrosine, abbreviated as Y), and succinylation (K). The specificity of these antibodies was confirmed as in previous reports (42, 43). Strong signals were detected for all seven antibodies tested (Fig. 6B). Interestingly, the succinylation (K) and phosphorylation (Y) signals from Synechococcus 7002 cells changed after the cells were treated with high light for 2 h, suggesting that these two PTM events may play a role in the regulation of photosynthesis in Synechococcus 7002. Apart from the high light treatment, we also performed an immunoblot analysis of global protein and PTM levels upon different stress inductions, and the results are shown in Fig. S7F. Although we cannot validate all of the predicted modifications at this time owing to the lack of available experimental data about PTMs for Synechococcus 7002, future research may confirm many of these identified modifications and begin to uncover their biological functions.

Fig. 6. Overview of the PTM events involved in the photosynthesis process. (A) Working scheme to delineate the PTM events in photosynthesis pathways in Synechococcus 7002. Modified proteins identified in Synechococcus 7002 are shown using a black arrow. Different PTMs are shown as squares with different colors. (B) Relative proteome-wide PTM levels in Synechococcus 7002 after a 2-h exposure to high light compared with standard conditions. Immunoblots were probed using antibodies for acetyllysine, succinyllysine, butyryllysine, crotonyllysine, dimethyllysine, trimethyllysine, and phosphotyrosine. Coomassie blue staining shows equal loading amounts (Left). HL, high light treatment; C, untreated control.