Single Amino Acid Variant Encyclopedia for Proteogenomic Analysis

In a multi-omics setting, proteogenomics merges proteomics and genomics by combining mass spectrometry with high-throughput sequencing. The field currently has two primary objectives: aiding genome annotation and deciphering the complexity of the proteome. Gene models can be refined using mass spectrometry-based identifications of similar or homologous peptides. Additionally, high-throughput sequencing data from RNAseq or ribosome profiling experiments make it possible to identify novel proteoforms arising from novel translation initiation sites (at cognate or near-cognate start codons), novel transcript isoforms, sequence variation, or novel open reading frames in intergenic or untranslated genic regions.

Other proteogenomics investigations concentrate on antibody sequencing, immunogenic peptide discovery, or venom peptides. To facilitate this cross-omics research, a growing number of bioinformatics tools and resources have become available. Some of these tools serve only particular phases of a proteogenomics study, such as building custom sequence databases from next-generation sequencing (NGS) results for matching against mass spectrometry (MS) fragmentation spectra. A few integrative tools that carry out comprehensive proteogenomics analyses have also appeared over the past few years. Some of these are implemented in a web-based framework such as Galaxy, while others are provided as stand-alone solutions.
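To make the idea of a custom sequence database concrete, here is a minimal Python sketch that appends single amino acid variant entries to a set of reference proteins and emits FASTA text. It is an illustration under stated assumptions, not the method of any particular tool: the `Variant` record, the `build_custom_database` helper, the header naming scheme, and the toy sequence are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """A single amino acid variant; position is 1-based (hypothetical record layout)."""
    protein_id: str
    position: int
    ref: str
    alt: str


def apply_variant(sequence: str, variant: Variant) -> str:
    """Return the protein sequence with one residue substituted."""
    index = variant.position - 1
    if not 0 <= index < len(sequence):
        raise ValueError(f"position {variant.position} is outside the sequence")
    if sequence[index] != variant.ref:
        raise ValueError(
            f"expected {variant.ref} at position {variant.position}, "
            f"found {sequence[index]}"
        )
    return sequence[:index] + variant.alt + sequence[index + 1:]


def build_custom_database(reference: dict[str, str], variants: list[Variant]) -> str:
    """Build FASTA text with each reference protein plus one entry per variant."""
    records = [f">{pid}\n{seq}" for pid, seq in reference.items()]
    for v in variants:
        mutated = apply_variant(reference[v.protein_id], v)
        # Assumed header convention: <protein id>_<ref><position><alt>
        records.append(f">{v.protein_id}_{v.ref}{v.position}{v.alt}\n{mutated}")
    return "\n".join(records) + "\n"


# Toy input: a made-up sequence and variant, for illustration only.
reference = {"PROT1": "MKTAYIAKQR"}
variants = [Variant("PROT1", 4, "A", "V")]
fasta = build_custom_database(reference, variants)
print(fasta)
```

Checking the reference residue before substituting guards against variants called on a different protein isoform or annotation version, which is a common source of silent errors when sequencing-derived calls are mapped onto protein sequences.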

As NGS methods become more accessible and affordable, they are frequently used alongside matching MS-based proteomics experiments. Incorporating genomics, transcriptomics, or translatomics data (from genome sequencing, RNAseq, and ribosome profiling, respectively) yields a more comprehensive search space for MS/MS identification. Transcript abundance, translation efficiency, translation initiation site location, somatic versus germline mutations, splice variation, and the delineation of novel coding regions are just a few examples of the information these NGS-based technologies can provide to support and improve MS-based peptide and protein identification.
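One way to see how sequence variation enlarges the MS/MS search space is to digest a reference and a variant protein in silico and compare the resulting peptides. The sketch below is illustrative only: it assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), allows no missed cleavages, and uses made-up sequences; the `tryptic_peptides` helper and its `min_length` parameter are hypothetical.

```python
import re


def tryptic_peptides(sequence: str, min_length: int = 6) -> set[str]:
    """Digest a protein with a simplified trypsin rule and no missed cleavages."""
    # Zero-width split: after K or R, but not when the next residue is P.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    # Drop empty or very short fragments, which are rarely identifiable.
    return {p for p in pieces if len(p) >= min_length}


# Made-up sequences: the variant carries an A -> T substitution at position 13.
reference_protein = "MSLDEAGTRVLPAEWQKHTSGDFR"
variant_protein = "MSLDEAGTRVLPTEWQKHTSGDFR"

reference_peptides = tryptic_peptides(reference_protein)
variant_peptides = tryptic_peptides(variant_protein)

# Peptides that only a variant-aware database would contain.
novel_peptides = variant_peptides - reference_peptides
print(sorted(novel_peptides))
```

Only the peptide spanning the substituted residue is new; the rest of the variant protein digests to peptides already present in the reference, which is why a single amino acid variant typically adds few new candidates to the search space.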