Frequently Asked Questions

Overview

What data is hosted by the CPTAC (Clinical Proteomic Tumor Analysis Consortium) Data Portal?
The portal hosts the mass spectrometry data from the CPTAC program. A key component is the proteogenomic profiling of tumors from the breast, colorectal, and ovarian cancer programs in The Cancer Genome Atlas (TCGA). The portal also hosts data from the Clinical Proteomic Technologies for Cancer Initiative (2006 to 2011) and from external programs.

What research groups generate these data?
The CPTAC consists of five teams that form a network of Proteome Characterization Centers (PCCs).

What are the data use policies for files downloaded from the CPTAC Data Portal?
The CPTAC program abides by the Amsterdam principles established at the 2008 International Summit on Proteomics Data Release and Sharing Policy. It has established the following policy to clarify the freedom of CPTAC and non-CPTAC users to publish findings using CPTAC data (Responsible Use of CPTAC Data).

There are no limitations on submitting manuscripts to a journal and subsequent publications containing analyses using any CPTAC data set if the data set meets one of the following three freedom-to-publish criteria:

A global analysis paper has been published on that tumor type or sample set; or
15 months have passed since the final raw data file of a given tumor type was made public on the CPTAC Data Portal; or
The author or presenter receives specific approval from the CPTAC Steering Committee.

The specific status of each tumor dataset is displayed on the study page.

How do I cite this work in publications?
The CPTAC program requests that publications using data from this program include the following statement:
“Data used in this publication were generated by the Clinical Proteomic Tumor Analysis Consortium (NCI/NIH).”

The following manuscripts may also be cited:
CPTAC program overview
Ellis, M.J., Gillette, M., Carr, S.A., Paulovich, A.G., Smith, R.D., Rodland, K.K., Townsend, R.R., Kinsinger, C., Mesri, M., Rodriguez, H., Liebler, D.C., on behalf of the Clinical Proteomic Tumor Analysis Consortium (CPTAC), 2013. Connecting genomic alterations to cancer biology with proteomics: The NCI Clinical Proteomic Tumor Analysis Consortium. Cancer Discovery 3:1108-1112.

The CPTAC Data Portal
Edwards, N.J., Oberti, M., Thangudu, R.R., Cai, S., McGarvey, P.B., Jacob, S., Madhavan, S., and Ketchum, K.A. The CPTAC Data Portal: A Resource for Cancer Proteomics Research. J Proteome Res. 2015 Apr 15.

How to Download Data

Do I need the Aspera Connect Client Plug-in for file transfer?
Yes, if you want high-speed file transfer. Without the Aspera Connect Client Plug-in, the "Download" links on the study page will not work. You can still download data without Aspera by using the HTTP protocol (see "Can I download the data without using Aspera?" below), but this is significantly slower than data transfer with Aspera.

Where can I get the Aspera Connect Client Plug-in?
The client plug-in can be downloaded from http://downloads.asperasoft.com/connect2. The Aspera download site automatically recognizes your operating system and will recommend the correct client plug-in for your machine.

Where can I get documentation for the Aspera connect client that I installed on my computer?
Information on the Aspera Connect Web Browser Plug-in is found at:
http://asperasoft.com/software/transfer-clients/connect-web-browser-plug-in

I received an error message that the Aspera Client Plug-in was unable to authenticate using Port 33001. What does this mean?
The Aspera Connect Server at the CPTAC DCC uses nonstandard ports for security: UDP 33001 for file transfer and TCP 33001 for user authentication (via SSH). If you are working behind the firewall of a university or research institute, ask your IT security staff to open both ports, UDP 33001 and TCP 33001.
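
Before contacting IT staff, it can help to confirm whether the TCP side is actually blocked. The following is a minimal Python sketch, not a DCC-provided tool; the function name tcp_port_open is illustrative, and the host name is the one used in the command-line example in this FAQ. It only tests TCP 33001. UDP reachability cannot be confirmed with a simple connect, so a True result does not guarantee that transfers will work.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


# Example call (host taken from the ascp example in this FAQ):
# tcp_port_open("cptc-xfer.uis.georgetown.edu", 33001)
```

A False result from inside your institution, combined with a True result from an outside network, suggests that a local firewall is the likely cause.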

My internet connection was interrupted. Is there a way I can set my transfers in the Aspera Connect Client to resume automatically?
Go to Aspera Connect "Preferences" on your machine. In the Transfers tab, enable the auto-retry function by checking the "Automatically retry failed transfers" box and entering the number of retry attempts that suits your situation. You can also manually click the retry icon to restart the download.

Can I use the Aspera command line to download data?
Yes, there are two ways to use the Aspera Command Line:

1) Direct Access from a Linux system

a) Install the Aspera Connect client on your Linux system (http://asperasoft.com/software/transfer-clients/connect-web-browser-plug-in).
b) The default install location is the user's home directory. If the Aspera Connect client is installed in a different location, modify the paths in the example command below.
c) Run the following command to test the installation (the final "." is the local destination, here the current directory):
~/.aspera/connect/bin/ascp -v -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty -P 33001 -O 33001 -l 50M -T -Q --user public --host cptc-xfer.uis.georgetown.edu --mode recv /Phase_II_Data/CompRef/CompRef_Proteome_BI/CompRef_Proteome_BI_mzML.cksum .
d) If (c) is successful, replace '/Phase_II_Data/CompRef/CompRef_Proteome_BI/CompRef_Proteome_BI_mzML.cksum' with the desired file or folder name.
e) Example folder names:
/Phase_II_Data/CompRef
/Phase_II_Data/TCGA_Colorectal_Cancer
f) Additional folder names and individual dataset names can be obtained by browsing the web portal.
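
For repeated downloads, the test command above can be wrapped in a short script. This is a hedged sketch and not the DCC's own Python script mentioned in option 2. It reuses only the options shown in the example command. The function names and the explicit local destination directory argument are assumptions added for illustration, and the paths assume the default install location.

```python
import os
import subprocess

# Default install location from the FAQ; adjust if installed elsewhere.
ASCP = os.path.expanduser("~/.aspera/connect/bin/ascp")
KEY = os.path.expanduser("~/.aspera/connect/etc/asperaweb_id_dsa.putty")
HOST = "cptc-xfer.uis.georgetown.edu"


def build_ascp_command(remote_path, dest_dir="."):
    """Assemble the ascp argument list from the FAQ example for one remote path."""
    return [
        ASCP, "-v",
        "-i", KEY,
        "-P", "33001", "-O", "33001",
        "-l", "50M", "-T", "-Q",
        "--user", "public",
        "--host", HOST,
        "--mode", "recv",
        remote_path,
        dest_dir,  # assumption: local destination directory as the last argument
    ]


def download(remote_path, dest_dir="."):
    """Run ascp; raises subprocess.CalledProcessError on a non-zero exit status."""
    subprocess.run(build_ascp_command(remote_path, dest_dir), check=True)


# Example (not executed here):
# download("/Phase_II_Data/CompRef", dest_dir="cptac_data")
```

Passing the arguments as a list avoids shell quoting problems if a folder name ever contains spaces.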

2) A Python executable script allows direct file transfer from the CPTAC DCC public portal and is subject to the CPTAC DCC public portal data use agreement. The script can be run from the command prompt to perform whole-directory or single-file transfers. The script can be obtained here.

Can I download the data without using Aspera?
The DCC offers access to the CPTAC data using the HTTP protocol. Look for the “Http Data Access” link on each study page, or access the URL https://cptc-xfer.uis.georgetown.edu/publicData directly.
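
If you prefer to script HTTP downloads, a minimal Python sketch follows. It is illustrative only: the function name fetch is hypothetical, and the layout of paths under the publicData URL is not described in this FAQ, so copy the exact file URL from the "Http Data Access" listing rather than guessing it.

```python
import shutil
import urllib.request

BASE_URL = "https://cptc-xfer.uis.georgetown.edu/publicData"


def fetch(url, local_path, chunk_size=1024 * 1024):
    """Stream url to local_path in chunks so large files are not held in memory."""
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return local_path


# Example (not executed here); the path after BASE_URL is a placeholder:
# fetch(BASE_URL + "/<path copied from the HTTP listing>", "downloaded_file")
```

Because HTTP transfers lack the Aspera integrity checks, verifying the download against the corresponding .cksum file afterwards is a sensible precaution.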

How do I access the data in compressed files with a .tar.gz file-extension?
On Linux and OSX systems, use the system tar and gzip command-line tools. On Windows, the 7z suite of file-compression tools has been tested and successfully uncompresses even the very large compressed files.
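
As a cross-platform alternative, the Python standard library can also unpack .tar.gz files. The sketch below is an assumption-labeled example and has not been tested by the DCC on the very large CPTAC archives; the function name extract_tar_gz is illustrative.

```python
import os
import tarfile


def extract_tar_gz(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Newer Python versions offer extraction filters that reject unsafe
        # members (absolute paths, links pointing outside dest_dir).
        if hasattr(tarfile, "data_filter"):
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return names


# Example (not executed here); the archive name is a placeholder:
# extract_tar_gz("downloaded_dataset.tar.gz", "unpacked")
```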

Data Integrity

What are the .cksum files for?
The checksum (.cksum) files provide the sha1 and md5 hashes and the file size, in bytes, of each file. They make it possible to verify that the contents of files downloaded from the CPTAC data portal match the content on the portal.

How can I verify the checksums?
On Linux and OSX systems, the traditional ls, md5sum, and sha1sum programs compute the same file sizes and hashes as those contained in the .cksum files. In addition, the DCC offers a command-line program, cksum, for generating and checking .cksum files. See Checksums, under the Help tab.
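
Where md5sum and sha1sum are not available, the same three values can be computed with the Python standard library. This is a sketch, not a replacement for the DCC's cksum program. The layout of the .cksum files is not described in this FAQ, so the expected values are passed in by hand rather than parsed, and the function names are illustrative.

```python
import hashlib
import os


def file_digests(path, chunk_size=1024 * 1024):
    """Return the size in bytes plus md5 and sha1 hex digests, reading the file once."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return {
        "size": os.path.getsize(path),
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
    }


def matches(path, expected_size, expected_md5, expected_sha1):
    """Compare a local file against values copied from a .cksum file."""
    actual = file_digests(path)
    return (
        actual["size"] == int(expected_size)
        and actual["md5"] == expected_md5.lower()
        and actual["sha1"] == expected_sha1.lower()
    )
```

Reading in fixed-size chunks keeps memory use constant, which matters for multi-gigabyte raw and mzML files.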

How can the Aspera infrastructure help ensure file-integrity?
The DCC has configured the Aspera Connect Server to use integrity verification for each transmitted data block. Furthermore, the Aspera client will only download files that are missing or different from the files on the server, using file size and sparse checksums to determine whether files on the local filesystem differ from those on the server. The command-line program, cptacpublic (see above), for headless execution of Aspera downloads can also be configured to require the Aspera client to compute full-file checksums. Finally, checksum files (see above) can be used to provide an orthogonal check of downloaded file integrity.

Experimental Design and Data Formats

Where can I find protocols for the preparation of tumor samples and methods for mass spectrometry?
Each laboratory reports details of their experimental protocol in their publications. Links to the CPTAC publications can be found on the Available Studies tab, in the third column. Prior to publication, metadata files are provided with details of sample file naming, instruments and instrumental parameters. These files are available for download from each study page under the data set column "meta".

Where can I find the assignment of biospecimens to iTRAQ labels?
In studies using iTRAQ labels, an iTRAQ Sample Mapping file is available for download from each study page under the data set column "meta". In the TCGA Ovarian and Breast Cancer Studies, this file is also provided under the section "Biospecimens and Metadata Files."

What data formats are available?
Raw (Vendor) format
RAW or vendor format files corresponding to the mass spectrometers used to acquire the spectra.

mzML
The RAW format spectra are converted to HUPO Proteome Standards Initiative (PSI) compliant mzML format at the Data Coordinating Center (DCC).

Raw PSM format
The Common Data Analysis Pipeline (CDAP) implemented for CPTAC by NIST produces tab-separated-value format files containing peptide-spectrum matches (PSMs) generated by MS-GF+ for each CPTAC study.

mzIdentML PSM Format
Raw PSMs from the CDAP or the PCCs are converted to HUPO Proteome Standards Initiative (PSI) compliant mzIdentML format at the Data Coordinating Center (DCC).

Detailed descriptions are here

Is original instrument data retrievable from the CPTAC Data Portal?
Yes, on the data download pages, specify ‘raw’ as the data type desired.

Where can I find spectral data format information?
Spectral data is available in vendor RAW format and in HUPO PSI format mzML files from the study pages. Select datatypes “raw” or “mzML”.

Where can I find details of the PSM data formats? For example, what do iTRAQ flags signify?
Data format details begin on page 8 of Software Programs and Output Files of the Common Data Analysis Pipeline. Three flags defined on page 10 (I, M, and D) signify iTRAQ signal purity and abundance.

How are lists of peptides and their intensities generated by the CDAP at NIST?
See details provided in CDAP Description and CDAP Results Overview.

Where can I find details of the XML format PSMs?
The XML format PSMs are in HUPO PSI format mzIdentML files. The document mzIdentML Format Peptide-Spectrum-Matches describes the transformation of CDAP format PSM data to mzIdentML.

Where can I find a detailed description of the protein reports?
See the document CDAP Protein Report Description.

Common Data Analysis Pipeline

What data is from the CPTAC Common Data Analysis Pipeline (CDAP)?
The CPTAC program supports analyses of the mass spectrometry raw data (mapping of spectra to peptide sequences and protein identification) for the public using a Common Data Analysis Pipeline (CDAP).

Why is a Common Data Analysis Pipeline (CDAP) used?
While each laboratory thoroughly analyzes and publishes on its own data, there is considerable interest in cross-study analyses. To facilitate cross-study comparisons, all spectral data is processed by the CDAP to ensure uniformly formatted results with consistent identification acceptance thresholds. See CDAP Results Overview for more information.

How and why would published protein reports differ from the CDAP results?
Each Proteome Characterization Center selects search engines, reference databases, other data analysis programs, and parameters to generate the most informative and comprehensive analysis for each study. While a committee of Proteome Characterization Center members agreed on the publicly accessible and well documented tools and methods for the common pipeline, the same scientists are free to select different software and sequence databases for their own analyses. A description of the different strategies for peptide assignment is summarized in CDAP Results Overview.

What types of analyses were performed on each tumor type in the CDAP? Are they directly comparable?
All data were processed using a Common Data Analysis Pipeline described in the CDAP Results Overview document. In addition, each contributing laboratory (Proteome Characterization Center, PCC) analyzed their own data. The specific methods they used are described in the publications posted on the CPTAC Overview page.

Were any normal samples analyzed in the Colorectal cancer study?
Normal colon tissue was analyzed using the same protocols as for the TCGA samples, and is found in Normal Colon Epithelium Samples. Note that the normal colon samples are not matched normals from the donors of the TCGA/CPTAC tumor samples.

Were any normal samples analyzed in the Breast or Ovarian cancer studies?
No. A pooled reference sample was used in the iTRAQ control channel.

How can I get relative protein abundance for my genes from the Breast cancer study?
Download the TCGA_Breast_BI_Proteome_CDAP_Protein_Report.r1 dataset using the “Prot” datatype selector. The tab-separated-values format protein report TCGA_Breast_BI_Proteome_CDAP.r1.itraq.tsv provides relative protein abundance by sample. Rows correspond to proteins, while columns correspond to TCGA samples. The “XXXX Log Ratio” columns contain the relative abundance of sample XXXX, with respect to the pooled reference sample, as log ratios (base 2). The “XXXX Unshared Log Ratio” columns contain the relative abundance of sample XXXX computed using only those peptide ions whose peptide sequences are associated with a single inferred protein.
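
The column layout described above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: the name of the identifier column ("Gene" here) is hypothetical and should be replaced with the actual header in the downloaded report, and the function names are illustrative. Only the “XXXX Log Ratio” and “XXXX Unshared Log Ratio” column suffixes come from the description above.

```python
import csv
import io


def log_ratios_for(tsv_text, protein_id, id_column="Gene"):
    """Return {sample: log2 ratio} for one row, using the 'XXXX Log Ratio' columns."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    suffix = " Log Ratio"
    for row in reader:
        if row[id_column] != protein_id:
            continue
        ratios = {}
        for column, value in row.items():
            # Keep 'XXXX Log Ratio' but skip 'XXXX Unshared Log Ratio'.
            if not column.endswith(suffix) or column.endswith(" Unshared" + suffix):
                continue
            try:
                ratios[column[: -len(suffix)]] = float(value)
            except (TypeError, ValueError):
                pass  # missing or non-numeric cell
        return ratios
    return {}


def fold_change(log2_ratio):
    """Convert a base-2 log ratio to a linear ratio versus the pooled reference."""
    return 2.0 ** log2_ratio


# Example (not executed here):
# with open("TCGA_Breast_BI_Proteome_CDAP.r1.itraq.tsv", newline="") as fh:
#     ratios = log_ratios_for(fh.read(), "TP53")
```

A log ratio of 1.0 corresponds to twice the abundance of the pooled reference, and -1.0 to half of it.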

How can I get relative protein abundance for my genes from the Colorectal cancer study?
Download the TCGA_Colon_VU_Proteome_CDAP_Protein_Report.r1 dataset using the “Prot” datatype selector. The tab-separated-values format protein report TCGA_Colon_Proteome_CDAP.r1.spectral_counts.tsv provides spectral count protein abundance by sample. Rows correspond to proteins, while columns correspond to TCGA samples. The “XXXX Spectral Count” columns contain the spectral count values for sample XXXX. The “XXXX Unshared Spectral Count” columns contain the spectral count values for sample XXXX computed using only those peptide ions whose peptide sequences are associated with a single inferred protein. Similarly, protein abundance based on integration of precursor peaks is available in protein report TCGA_Colon_Proteome_CDAP.r1.precursor_area.tsv.

How is the consistency and reproducibility of CPTAC spectral data assessed?
NIST performed quality assessment using parameters derived from each of the output files from quantitation and isotope analysis. Each participating laboratory pre-tested its experimental protocol in the system suitability studies using human-in-mouse xenograft breast cancer tumor reference material (CompRef), which was distributed to all groups for lab-to-lab and within-laboratory performance checks. The same CompRef materials are run between TCGA samples for quality control, and the resulting ‘interstitial’ CompRef analyses are made available for download. See CDAP Results Overview for additional description.

Will mass spectral library spectra result from these data?
Yes, this process has begun. Mass spectral files accumulated by the CPTAC project currently represent more than 100 million mass spectra. The mass spectrum of each unique peptide sequence exhibits a characteristic, reproducible pattern of mass/charge vs. intensity, much like an individual’s fingerprint. Consequently, mass spectral libraries of previously characterized components permit very rapid compound identification. Beginning in the 1970s, the NIST Mass Spectrometry Data Center established repositories of compound-specific mass spectral data useful for rapid recognition of simple chemical structures such as drugs, pesticides, steroids, and amino acids. These libraries and the associated spectral matching software have been widely accepted in analytical laboratories worldwide. More recently, NIST has distributed to the public, after several steps of curation, libraries of tandem mass spectra of peptides recorded using liquid chromatographic separation, electrospray ionization, and ion trap-type instrumentation. Some of the CPTAC data has already been incorporated in the NIST Human peptide libraries (ion trap and collision cell). However, iTRAQ, phospho-, and glyco-peptides will require separate data compilations. Reference mass spectral peptide libraries resulting from these studies may be downloaded freely from peptide.NIST.gov.

How should I cite the Common Data Analysis Pipeline (CDAP)?
A publication describing the CDAP is being prepared. A link will appear here once the publication has been accepted. CDAP is supported by Steve Stein (stephen.stein@nist.gov), Sandy Markey (sanford.markey@nist.gov), and Jeri Roth (jeri.roth@nist.gov) at NIST, and by Paul Rudnick (paul.rudnick@spectragen-informatics.com) from Spectragen Informatics.

Additional Help

Who should I contact if I need assistance?
If you are having problems with the CPTAC Public Data Portal, please contact cptac.dcc.help@esacinc.com.

How can I request new features for the CPTAC Public Data Portal?
Feature requests, suggestions, and comments are always welcome. Select the blue Feedback button at the top of the page, or send an e-mail to cptac.dcc.help@esacinc.com.