dbGaP downloads

This documentation assumes that access has already been approved for a given dbGaP study with a particular dbGaP accession. dbGaP accessions are in the form of phs000... . The processes for navigating the dbGaP system to assess genotype and alignment availability, retrieving the available data, and loading the data for AnnoDB/ATAV are outlined below.

All raw downloaded dbGaP data is located in /nfs/seqscratch10/tx_temp/dbGaP-6013/

Assessing data availability

There are four general categories of data available on dbGaP:

  • metadata about the study and included samples
  • phenotype data (gender, affected status, ethnicity, quantitative traits, etc.)
  • genotype data either from sequencing (VCFs) or from chip/array-based assays
  • aligned read data from the Sequence Read Archive (SRA)

Metadata, phenotype data, and genotype data are found on the dbGaP Authorized Access website. The credentials needed here are the eRA Commons credentials for David Goldstein or another authorized individual.

In order to view the approved studies for this account, log in and select the 'My Requests' tab.
Click the 'Request Files' link for a particular study.
On the following page, a collapsed tree of links is shown. Expand the links by clicking the '+'. Alternatively, the dbGaP File Selector link, also on this 'Request Files' page, makes this process much easier.

The most useful metadata file available can be found on the SRA site, accessible by clicking the 'SRA data (reads and reference alignments)' link, then following the SRA Run Selector link. Clicking the 'RunInfo Table' button downloads a tab-delimited text file (named 'SraRunTable.txt') that can be used as a manifest for all individual samples within the study.
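
As a minimal sketch of using the run table as a manifest, the following Python reads a tab-delimited file and indexes its rows by run accession. The column names used here ('Run', 'Sample_Name', 'Assay_Type') and the accessions are assumptions for illustration; check the header of the downloaded 'SraRunTable.txt' and adjust the key accordingly.

```python
import csv
import io


def read_run_table(handle, key="Run"):
    """Index the rows of a tab-delimited run table by the `key` column."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or key not in reader.fieldnames:
        raise ValueError(f"column {key!r} not found in header: {reader.fieldnames}")
    return {row[key]: row for row in reader}


# Tiny in-memory stand-in for SraRunTable.txt (all values are made up).
example = io.StringIO(
    "Run\tSample_Name\tAssay_Type\n"
    "SRR0000001\tsubjA\tWXS\n"
    "SRR0000002\tsubjB\tWXS\n"
)
manifest = read_run_table(example)
print(sorted(manifest))
```

For the real file, open 'SraRunTable.txt' with open(path, newline="") and pass the handle to read_run_table in place of the in-memory example.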

In addition, back on the 'Access Request' page, the 'Phenotype and Genotype' tab may contain files with 'Sample_Attributes' in the title, ending with a '.txt' or '.gz' extension. These files often contain information necessary to link Genotype/Phenotype data for individual samples to the SRA alignment data.

Notable Phenotype files include those with 'Subject_Phenotypes' in the title, ending with a '.txt' or '.gz' extension. In our experience, the '.xml' files do not contain useful data.

Sequencing Genotype files include 'vcf' in the title and are usually multi-sample VCFs. The 'Sample_Attributes' files are usually required to link sample names within the VCF to the identifiers found in the 'SraRunTable.txt' file.
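
The linking step can be sketched as a two-stage lookup: VCF sample name to a linking identifier (from a 'Sample_Attributes' file), then linking identifier to SRA runs (from 'SraRunTable.txt'). This is only an illustration. Every column name below ('VCF_SAMPLE', 'SAMPLE_ID', 'Sample_Name', 'Run') is hypothetical and varies from study to study, and the rows are made up.

```python
def link_vcf_samples(vcf_samples, attributes, run_rows,
                     attr_vcf_col="VCF_SAMPLE", attr_link_col="SAMPLE_ID",
                     run_link_col="Sample_Name", run_col="Run"):
    """Map each VCF sample name to its SRA runs via a shared linking column.

    Returns (linked, unlinked): a dict of VCF sample -> list of runs, and a
    list of the VCF samples for which no run could be found.
    """
    # linking identifier -> runs (a subject may have several runs)
    runs_by_link = {}
    for row in run_rows:
        runs_by_link.setdefault(row[run_link_col], []).append(row[run_col])
    # VCF sample name -> linking identifier
    link_by_vcf = {row[attr_vcf_col]: row[attr_link_col] for row in attributes}

    linked, unlinked = {}, []
    for name in vcf_samples:
        runs = runs_by_link.get(link_by_vcf.get(name), [])
        if runs:
            linked[name] = runs
        else:
            unlinked.append(name)
    return linked, unlinked


# Made-up rows standing in for parsed Sample_Attributes and SraRunTable files.
attributes = [{"VCF_SAMPLE": "NA_A", "SAMPLE_ID": "subjA"},
              {"VCF_SAMPLE": "NA_B", "SAMPLE_ID": "subjB"}]
run_rows = [{"Run": "SRR0000001", "Sample_Name": "subjA"},
            {"Run": "SRR0000002", "Sample_Name": "subjA"}]
linked, unlinked = link_vcf_samples(["NA_A", "NA_B", "NA_C"], attributes, run_rows)
print(linked, unlinked)
```

Reporting the unlinked samples explicitly makes it easier to spot a wrong choice of linking column before any data is loaded.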

Note that many VCF files for different sub-studies or from different sequencing centers may exist within the 'Access Request' page for a particular dbGaP accession, so care must be taken to identify which sub-studies from which centers are being accessed and processed.

As mentioned above, Alignment data is available from the SRA website. Sample alignment data can be retrieved from the SRA Run Selector individually, or in bulk using the SRA command-line download tool with appropriate credentials.

Retrieving dbGaP data

Metadata, Phenotype data, and Genotype data are downloaded directly from the dbGaP File Selector by using the left-hand panes to filter the file list displayed on the right, checking the files of interest, and clicking the 'Get Data Files' button. Aspera (ascp) is needed to retrieve the files. Within our system, the command is:

/home/sally1/.aspera/connect/bin/ascp -QTr -l 300M -k 1 -i "/home/sally1/.aspera/connect/etc/asperaweb_id_dsa.openssh" -W [paste data string here]

The text that follows '-W' in the 'Run ascp manually...' text box on the 'Data-request' page can be copied and pasted into the above command.

Each individual run of the aspera command will create a folder in your working directory with a name matching the Request #. This directory will have the same structure as was visualized on the 'Access Request' page under the 'Phenotype and Genotype' tab, containing the files you selected in the dbGaP File Selector.

For organizational purposes, it is better to limit the number of individual downloads: try to select all files of interest and retrieve them in a single run of aspera.
In addition, the person who owns the account (David Goldstein) will get an email notifying them that the download request has been fulfilled. The number of these emails should be limited since they can easily be confused with new dbGaP study access approvals.

Alignment data is best downloaded in bulk using the SRA toolkit. From SRA files, one can extract BAM or FastQ files, depending on the need (for example whether one plans on re-aligning the data locally, or is happy to base further analysis on pre-aligned data).
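
A minimal sketch of preparing bulk extraction commands from a list of run accessions follows. It assumes the SRA Toolkit programs 'fastq-dump' and 'sam-dump' are installed and already configured with the dbGaP repository key (that configuration is not shown), and the flags should be checked against the installed toolkit version. The function name, output directory, and accessions are hypothetical.

```python
import shlex


def extraction_command(run, fmt, outdir="."):
    """Return an argument list for extracting one SRA run.

    fmt is "fastq" (for local re-alignment) or "sam" (keep the submitted
    alignment; sam-dump writes SAM records to standard output).
    """
    if fmt == "fastq":
        return ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir, run]
    if fmt == "sam":
        return ["sam-dump", run]
    raise ValueError(f"unsupported format: {fmt!r}")


# Print the commands rather than running them, so they can be reviewed
# or submitted to the cluster scheduler.
commands = [shlex.join(extraction_command(run, "fastq", outdir="fastq"))
            for run in ["SRR0000001", "SRR0000002"]]
print("\n".join(commands))
```

Because sam-dump emits SAM text, producing a BAM file requires piping its output through a converter such as 'samtools view -b'.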

Ingest into AnnoDB/ATAV analyses

The ingest process for dbGaP data depends on which data were downloaded. By far the simplest and most reliable method for incorporating dbGaP data into AnnoDB is to retrieve the FastQ files from the SRA and run the samples through the standard alignment/genotyping pipeline. No additional processing is required for this method. The only reason not to choose this method is limited cluster resources, such that running the downloaded cohort of samples through the pipeline is infeasible.

A method that is less computationally expensive, but potentially prone to introducing artifacts into downstream analysis, is to load the genotypes found in the downloaded VCFs directly into AnnoDB. This is challenging because the VCF format can vary considerably from study to study and between originating sequencing centers (e.g. Broad, Baylor, WashU). In addition, the sample identifiers used in the VCF may not match the sample identifiers used in the SraRunTable or in the Phenotype files. Linking this data at a large scale can be time consuming. Ultimately, what is required to proceed with this method is a single text file with one sample per line and the following columns (comma- or tab-delimited):

  1. dbGaP study accession
  2. dbGaP subject ID (found within the SraRunTable, Sample_Attributes, or Phenotype file)
  3. sample ethnicity (found within the SraRunTable, Sample_Attributes, or Phenotype file)
  4. sample gender (found within the SraRunTable, Sample_Attributes, or Phenotype file)
  5. linking identifier (if one is used)
  6. Sample ID within VCF file
  7. full path to VCF file

It may be helpful to make a separate version of this file for each VCF file being processed.
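
As a rough sketch, the sample file described above can be checked and split per VCF with a few lines of Python. The field labels in COLUMNS are illustrative names for the seven columns listed, the file is assumed to have no header line, and the example rows (including the placeholder accession) are made up.

```python
import csv
import io
from collections import defaultdict

# Illustrative labels for the seven columns, in the order listed above.
COLUMNS = ["study_accession", "subject_id", "ethnicity", "gender",
           "linking_id", "vcf_sample_id", "vcf_path"]


def read_sample_file(handle):
    """Parse a header-less, comma- or tab-delimited seven-column sample file."""
    text = handle.read()
    first_line = text.split("\n", 1)[0]
    delimiter = "\t" if "\t" in first_line else ","
    records = []
    for number, fields in enumerate(csv.reader(io.StringIO(text), delimiter=delimiter), 1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"line {number}: expected {len(COLUMNS)} columns, got {len(fields)}")
        records.append(dict(zip(COLUMNS, (f.strip() for f in fields))))
    return records


def group_by_vcf(records):
    """Group sample records by the VCF file they are read from."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["vcf_path"]].append(rec)
    return dict(grouped)


example = io.StringIO(
    "phsXXXXXX,subj1,European,F,link1,NA_A,/data/a.vcf\n"
    "phsXXXXXX,subj2,European,M,,NA_B,/data/a.vcf\n"
    "phsXXXXXX,subj3,African,F,link3,NA_C,/data/b.vcf\n"
)
records = read_sample_file(example)
per_vcf = group_by_vcf(records)
print({path: len(rows) for path, rows in per_vcf.items()})
```

An empty linking identifier is accepted, since that column only applies when a linking identifier is used.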

A Perl script was written to process such a file, construct a CHGVID for each sample, and parse the multi-sample VCF file, loading all alternative genotype data into the AnnoDB tables. This script is goldsteinlab/Bioinformatics/scripts/ .

Note that any individual VCF may carry its own idiosyncrasies in which variant call quality metrics are presented and how genotypes are encoded. Some key differences to be aware of include:

  • Whether or not indels are included
  • Whether or not multi-site SNVs are included
  • Whether GATK VQSR was run, or what filtration methods (or Tranche values) were used to set the FILTER column
  • Which QC scores are available in the INFO field and what their ranges are
  • Whether genotypes correspond to the typical encodings (0/0, 0/1, 1/2, etc.) for homozygous ref, het, etc.
  • Which FORMAT data is available (GQ, DP, AD, etc.) and how it is encoded
  • Whether the VCF has already been annotated and, if so, with what database/program. These annotations typically just need to be stripped.
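
Before loading a new VCF, a quick survey of its records can answer several of the questions above. The following is a minimal sketch, not the loading script itself: it tallies FILTER values, INFO keys, FORMAT keys, and genotype strings, and counts sites with a length-changing ALT allele or with more than one ALT allele. The indel test is a simple length comparison that ignores symbolic alleles, and the example records are made up.

```python
import io
from collections import Counter

SKIP_ALTS = {".", "*"}  # missing / spanning-deletion placeholders


def survey_vcf(handle, max_records=None):
    """Tally FILTER, INFO keys, FORMAT keys and genotype strings in a VCF."""
    summary = {
        "filters": Counter(), "info_keys": Counter(), "format_keys": Counter(),
        "genotypes": Counter(), "indel_sites": 0, "multi_alt_sites": 0,
    }
    seen = 0
    for line in handle:
        if line.startswith("#"):
            continue  # meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete record
        ref, alts = fields[3], fields[4].split(",")
        summary["filters"][fields[6]] += 1
        summary["info_keys"].update(
            item.split("=", 1)[0]
            for item in fields[7].split(";") if item not in ("", "."))
        if len(alts) > 1:
            summary["multi_alt_sites"] += 1
        if any(len(a) != len(ref) for a in alts
               if a not in SKIP_ALTS and not a.startswith("<")):
            summary["indel_sites"] += 1
        if len(fields) > 9:
            keys = fields[8].split(":")
            summary["format_keys"].update(keys)
            if "GT" in keys:
                gt = keys.index("GT")
                for sample in fields[9:]:
                    parts = sample.split(":")
                    if gt < len(parts):
                        summary["genotypes"][parts[gt]] += 1
        seen += 1
        if max_records is not None and seen >= max_records:
            break
    return summary


example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\t.\tA\tG\t50\tPASS\tDP=20;VQSLOD=3.1\tGT:GQ:DP\t0/1:99:10\t0/0:80:10\n"
    "1\t200\t.\tAT\tA,ATT\t40\tLowQual\tDP=15\tGT:GQ\t1/2:60\t./.:.\n"
)
report = survey_vcf(example)
print(report)
```

For a compressed file, gzip.open(path, "rt") can supply the handle, and max_records limits the scan to the first records of a large multi-sample VCF.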

After parsing the multi-sample VCF, the task remains to replicate the produced text files to the slave AnnoDB servers using LOAD DATA INFILE... or mysqlimport.
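
As an illustration of the LOAD DATA INFILE route, the helper below builds one MySQL statement per produced text file. It assumes the files are tab-delimited with one row per line. The table name and path in the example are hypothetical and do not reflect the actual AnnoDB schema.

```python
def load_data_statement(path, table, local=False):
    """Build a MySQL LOAD DATA statement for one tab-delimited text file."""
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    # Escape backslashes and single quotes so the path is a valid SQL string.
    quoted = path.replace("\\", "\\\\").replace("'", "\\'")
    keyword = "LOAD DATA LOCAL INFILE" if local else "LOAD DATA INFILE"
    return (f"{keyword} '{quoted}' INTO TABLE `{table}` "
            "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")


# Hypothetical file and table names, for illustration only.
statement = load_data_statement("/tmp/called_variant.txt", "called_variant")
print(statement)
```

Setting local=True produces the LOCAL variant, which reads the file from the client host rather than from the database server.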

See the EVS individual data download wiki for more information.