Common Data Analysis Pipeline (CDAP)
The CPTAC program supports analyses of the mass spectrometry raw data (mapping of spectra to peptide sequences and protein identification) for the public using a Common Data Analysis Pipeline (CDAP). The data types available on the public portal are described below. A general overview of this pipeline can be downloaded here.
Mass Spectrometry Data
RAW (Vendor) Format
Mass-spectrometry data is uploaded by the Proteome Characterization Centers (PCCs) as RAW or vendor format files corresponding to the mass spectrometers used to acquire the spectra. These files are usually very large and can be read only with the mass spectrometer vendor’s libraries, typically on Windows-based operating systems, either directly or through open-source projects that integrate these vendor libraries, such as the ProteoWizard project. The spectral data in RAW files are considered unprocessed, although in some cases the acquisition software of the mass spectrometer may process the data in real time before recording it.
mzML Format
The RAW format spectra are converted to HUPO Proteome Standards Initiative (PSI) compliant mzML format at the Data Coordinating Center (DCC). This standardized XML format for mass-spectrometry data is generated using the MSConvert tool from the ProteoWizard project. In this process, each spectrum is transformed to a peak-list using the vendor’s peak-picking algorithms. These spectral datafiles are smaller than the RAW format spectral datafiles and are completely operating-system and programming language agnostic. These files can be viewed using the ProteoWizard SeeMS tool and converted to other peak list formats suitable for analysis by tandem-mass-spectrometry search engines using the MSConvert tool. A list of commercial and open-source tools supporting the mzML format can be found at the PSI site.
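Because mzML is plain XML, it can be inspected with any standard XML parser. The following is a minimal sketch, not part of the CDAP, that counts spectra per MS level using only the Python standard library. The embedded fragment is a hand-written, heavily stripped-down illustration (it is not schema-complete and not CPTAC data), and the function name `count_ms_levels` is hypothetical. The namespace and the `MS:1000511` ("ms level") accession come from the PSI mzML format and controlled vocabulary.

```python
import xml.etree.ElementTree as ET
from collections import Counter

MZML_NS = {"mz": "http://psi.hupo.org/ms/mzml"}
MS_LEVEL = "MS:1000511"  # PSI-MS controlled vocabulary accession for "ms level"

# Hand-written, stripped-down fragment for illustration only (not CPTAC data).
EXAMPLE = """<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <run id="example_run">
    <spectrumList count="3">
      <spectrum index="0" id="scan=1" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1"/>
      </spectrum>
      <spectrum index="1" id="scan=2" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
      <spectrum index="2" id="scan=3" defaultArrayLength="0">
        <cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="2"/>
      </spectrum>
    </spectrumList>
  </run>
</mzML>"""


def count_ms_levels(mzml_text):
    """Return a dict mapping MS level to the number of spectra at that level."""
    root = ET.fromstring(mzml_text)
    levels = Counter()
    for spectrum in root.iterfind(".//mz:spectrum", MZML_NS):
        for param in spectrum.iterfind("mz:cvParam", MZML_NS):
            if param.get("accession") == MS_LEVEL:
                levels[int(param.get("value"))] += 1
    return dict(levels)


if __name__ == "__main__":
    print(count_ms_levels(EXAMPLE))
```

For real datafiles, which are large, a streaming parser or a dedicated mzML library is likely a better fit than loading the whole document into memory.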
Peptide-Spectrum Match Data
The primary, or first-level, analysis of the spectra uploaded by the Proteome Characterization Centers (PCCs) is the matching of tandem mass spectra to peptide sequences. Tandem mass spectrometry search engines match the spectra to peptide sequences from protein sequence databases, score the matches, and output the best peptide-spectrum matches (PSMs) for each spectrum. PSMs are then filtered by score and statistical significance so that only the most reliable PSMs are retained. Typically, only the best PSM per spectrum is kept. Each PSM links an identifier for the spectrum, the peptide sequence, any post-translational modifications on the peptide, and a list of identifiers for the protein sequences found to contain the peptide sequence. In addition, depending on the analysis pipeline, PSMs may be annotated with additional information, such as iTRAQ reporter ion intensities and post-translational modification localization scores.
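The filtering step described above can be sketched as follows. This is an illustration under stated assumptions, not the CDAP implementation: the `PSM` record, its field names, the convention that a higher score is better, and the 0.01 q-value cutoff are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    spectrum_id: str   # identifier of the matched spectrum
    peptide: str       # matched peptide sequence
    score: float       # assumption: higher means a better match
    q_value: float     # assumption: estimated false-discovery rate at this score


def best_psms(psms, max_q_value=0.01):
    """Keep PSMs passing the significance cutoff, then the best one per spectrum."""
    best = {}
    for psm in psms:
        if psm.q_value > max_q_value:
            continue  # not statistically reliable enough
        current = best.get(psm.spectrum_id)
        if current is None or psm.score > current.score:
            best[psm.spectrum_id] = psm
    return best


if __name__ == "__main__":
    candidates = [
        PSM("scan=10", "PEPTIDEK", 42.0, 0.001),
        PSM("scan=10", "PEPTLDEK", 17.5, 0.004),
        PSM("scan=11", "SAMPLER", 9.1, 0.200),
    ]
    for spectrum_id, psm in best_psms(candidates).items():
        print(spectrum_id, psm.peptide)
```

A spectrum whose candidates all fail the cutoff, such as `scan=11` here, simply contributes no PSM.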
RAW PSM Format
The Common Data Analysis Pipeline (CDAP) implemented for CPTAC by NIST produces tab-separated-value format files containing PSMs generated by MS-GF+ for each CPTAC study. The current reference protein database used for human-in-mouse xenograft tumor pooled samples is a concatenation of RefSeq H. sapiens (build 37), RefSeq M. musculus (build 37), and the sequence for S. scrofa (porcine) trypsinogen. The FASTA file used for analysis of human TCGA samples and ovarian cancer tumors includes RefSeq H. sapiens (build 37) and the sequence for S. scrofa (porcine) trypsinogen.
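Tab-separated PSM files of this kind can be read with a standard CSV parser. The sketch below assumes a simplified layout: the column names (`SpectrumID`, `Peptide`, `Proteins`, `QValue`), the semicolon-separated protein list, and the accessions are hypothetical and do not reproduce the actual CDAP column set, which is described in the pipeline's own documentation.

```python
import csv
import io

# Hypothetical columns and values for illustration; not the real CDAP layout.
EXAMPLE_TSV = (
    "SpectrumID\tPeptide\tProteins\tQValue\n"
    "scan=10\tPEPTIDEK\tPROT_A;PROT_B\t0.001\n"
    "scan=12\tSAMPLER\tPROT_C\t0.004\n"
)


def read_psm_table(handle):
    """Parse a tab-separated PSM table into a list of dictionaries."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append(
            {
                "spectrum_id": row["SpectrumID"],
                "peptide": row["Peptide"],
                "proteins": row["Proteins"].split(";"),
                "q_value": float(row["QValue"]),
            }
        )
    return rows


if __name__ == "__main__":
    for psm in read_psm_table(io.StringIO(EXAMPLE_TSV)):
        print(psm["spectrum_id"], psm["peptide"], psm["proteins"])
```

With a real file, the `io.StringIO` wrapper would be replaced by `open(path, newline="")`.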
Download Common Data Analysis Pipeline Bioinformatic Methods
Reference mass spectral peptide libraries may be downloaded freely from peptide.NIST.gov.
PCCs may also analyze the spectral data and provide PSMs in other formats, including IDPicker3 database and MS-GF+ mzIdentML. Separate documents will describe the details of these analysis pipelines and document PSM formats.
mzIdentML PSM Format
Raw PSMs from the CDAP or the PCCs are converted to HUPO Proteome Standards Initiative (PSI) compliant mzIdentML format at the Data Coordinating Center (DCC). This standardized XML format for PSMs is generated using a tool written at the DCC with support from the ProteoWizard project. In this process, the PSMs are standardized and normalized for consumption by third-party data-processing pipelines. PSM normalization includes realignment of peptide sequences to current RefSeq/UniProt protein sequence databases to obtain peptide start and end positions, a consistent accession format, and human-readable descriptions; normalization of all post-translational modifications with UNIMOD accessions and PSI conventions for N-terminal modifications; recomputation of all theoretical masses from elemental composition; extraction of precursor m/z and retention time data from spectral datafiles; and verification and population of mzML nativeIDs as spectral identifiers. Controlled vocabulary terms are used wherever possible. A list of commercial and open-source tools supporting the mzIdentML format can be found at the PSI site.
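One of the normalization steps listed above, recomputing theoretical masses from elemental composition, can be illustrated with a short sketch. It is not the DCC tool: it covers unmodified peptides only, the function names are hypothetical, and the monoisotopic element masses and proton mass are standard published values typed in by hand. The theoretical precursor m/z for charge $z$ is $(M + z \cdot m_p)/z$, where $M$ is the neutral monoisotopic peptide mass and $m_p$ is the proton mass.

```python
# Monoisotopic masses of the most abundant isotopes, in daltons.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}
PROTON = 1.007276466812
ELEMENTS = ("C", "H", "N", "O", "S")

# Elemental composition (C, H, N, O, S) of each amino acid residue.
RESIDUES = {
    "G": (2, 3, 1, 1, 0), "A": (3, 5, 1, 1, 0), "S": (3, 5, 1, 2, 0),
    "P": (5, 7, 1, 1, 0), "V": (5, 9, 1, 1, 0), "T": (4, 7, 1, 2, 0),
    "C": (3, 5, 1, 1, 1), "L": (6, 11, 1, 1, 0), "I": (6, 11, 1, 1, 0),
    "N": (4, 6, 2, 2, 0), "D": (4, 5, 1, 3, 0), "Q": (5, 8, 2, 2, 0),
    "K": (6, 12, 2, 1, 0), "E": (5, 7, 1, 3, 0), "M": (5, 9, 1, 1, 1),
    "H": (6, 7, 3, 1, 0), "F": (9, 9, 1, 1, 0), "R": (6, 12, 4, 1, 0),
    "Y": (9, 9, 1, 2, 0), "W": (11, 10, 2, 1, 0),
}
WATER = (0, 2, 0, 1, 0)  # added once per peptide for the free termini


def composition_mass(counts):
    """Monoisotopic mass of an elemental composition given as (C, H, N, O, S)."""
    return sum(MONOISOTOPIC[el] * n for el, n in zip(ELEMENTS, counts))


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    residues = sum(composition_mass(RESIDUES[aa]) for aa in sequence)
    return residues + composition_mass(WATER)


def precursor_mz(sequence, charge):
    """Theoretical m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


if __name__ == "__main__":
    print(round(peptide_mass("PEPTIDEK"), 5), round(precursor_mz("PEPTIDEK", 2), 5))
```

Handling modifications would require adding the elemental composition of each modification, for example from its UNIMOD entry, before summing.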
Download mzIdentML Format Bioinformatic Methods
Protein Reports
The protein reports are based on the peptide-spectrum matches from the CDAP and provide protein identification and quantitation for both ‘label-free’ and 4plex iTRAQ™ workflows with a common reference sample. These results are based on a conservative gene-based generalized parsimony analysis developed by the Edwards lab. Peptides are associated with genes, rather than protein identifiers, and genes with at least two unshared peptide identifications are inferred. The resulting gene list is estimated to have a false-discovery rate of at most 0.01%. A summary of the gene-based generalized parsimony analysis is provided in the protein identification summary report.
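The "at least two unshared peptides" rule can be illustrated with a deliberately simplified sketch. It is not the Edwards lab generalized parsimony analysis: here a peptide counts as unshared only if it maps to exactly one gene, whereas the real analysis may treat shared peptides differently. The function name, gene names, and peptide sequences are hypothetical.

```python
from collections import defaultdict


def infer_genes(peptide_to_genes, min_unshared=2):
    """Return genes supported by at least `min_unshared` peptides unique to them.

    Simplifying assumption: a peptide is "unshared" when it maps to one gene only.
    """
    unshared = defaultdict(set)
    for peptide, genes in peptide_to_genes.items():
        if len(genes) == 1:
            (gene,) = genes
            unshared[gene].add(peptide)
    return sorted(
        gene for gene, peptides in unshared.items() if len(peptides) >= min_unshared
    )


if __name__ == "__main__":
    evidence = {
        "PEPTIDEK": {"GENE1"},
        "SAMPLER": {"GENE1"},
        "SHAREDPEPK": {"GENE1", "GENE2"},
        "LONELYR": {"GENE2"},
    }
    print(infer_genes(evidence))
```

In this toy input, the second gene has only one peptide of its own, so it is not reported even though it also matches a shared peptide.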
CDAP Protein Report Description