Individual-level EVS VCFs were downloaded, containing genotypes for ~5,500 of the 6,503 samples in the aggregate dataset. The remaining ~1,000 samples are not available for use as individual controls and cannot be downloaded.
Approval was granted for the following 15 dbGaP studies/accessions:
- JHS Heart Cohorts phs000402
- MESA Heart Cohorts phs000403
- Bronchiectasis phs000518
- Familial A-Fib phs000362
- ARIC Heart Cohorts phs000398
- CARDIA phs000399
- WHI phs000281
- PAH Lung Cohort phs000290
- Asthma Lung Cohort phs000422
- Aortic Disease phs000347
- FHS Heart Cohorts phs000401
- COPDGene ESP phs000296
- CHS Heart Cohorts phs000400
- LungGO LHS COPD phs000291
- Cystic Fibrosis Lung Cohort phs000254
In AnnoDB, the CHGVID for each sample from these studies takes the form:
evs_[aa or ea]_[study accession]_[subject ID]
The 'v' at the end of the third & fourth examples indicates that the string following the third underscore ('_') in the CHGVID ('5862' and 'bi_12785') is the name of the sample in the VCF, not the dbGaP subject ID. This occurred when no subject ID could be determined because phenotype data or metadata were missing.
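The naming convention above can be sketched as a small parser. This is only an illustration, not the code used by AnnoDB: the function and field names are hypothetical, and it assumes the 'v' marker is appended directly to the final field, which the description above suggests but does not state outright. Under that assumption, a genuine subject ID that happens to end in 'v' would be misclassified.

```python
from typing import NamedTuple


class ChgvId(NamedTuple):
    population: str    # "aa" or "ea"
    accession: str     # study accession, e.g. "phs000402"
    identifier: str    # dbGaP subject ID, or the VCF sample name
    is_vcf_name: bool  # True when identifier is the VCF sample name


def parse_chgvid(chgvid: str) -> ChgvId:
    """Split an EVS CHGVID into its parts (hypothetical helper)."""
    # Split on the first three underscores only: the last field may itself
    # contain underscores (e.g. 'bi_12785').
    parts = chgvid.split("_", 3)
    if len(parts) != 4 or parts[0] != "evs" or parts[1] not in ("aa", "ea"):
        raise ValueError(f"not an EVS CHGVID: {chgvid!r}")
    _, population, accession, identifier = parts
    # ASSUMPTION: a trailing 'v' is appended directly to the last field to
    # flag that it is a VCF sample name rather than a dbGaP subject ID.
    is_vcf_name = identifier.endswith("v")
    if is_vcf_name:
        identifier = identifier[:-1]
    return ChgvId(population, accession, identifier, is_vcf_name)
```

For example, with a made-up ID, parse_chgvid("evs_ea_phs000402_bi_12785v") would report the identifier 'bi_12785' as a VCF sample name. The pairing of that sample name with that accession is invented for illustration.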
The original dbGaP downloads for all of these studies are dispersed within the
/nfs/seqscratch10/tx_temp/dbGaP-6013/ directory. The attached spreadsheet lists the sub-directories containing the data for each study, along with the number of samples, the number of VCF files, and the samples per VCF file within each study.
The working directory used to load all of the VCF genotype data into AnnoDB is located at
/nfs/seqscratch10/ANNOTATION/GROUP_DATA/ . There is a sub-directory for each study accession. Within each study accession directory there is a sub-directory for each VCF file for that study.
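The two-level layout described above (study accession, then one sub-directory per VCF file) could be walked with a short script such as the following. It is a sketch only: the function name is hypothetical, and it assumes the study directories are named after their accessions and so begin with "phs", which lets it skip other directories that sit alongside them.

```python
from pathlib import Path


def list_vcf_dirs(root):
    """Map each study-accession directory name to its per-VCF sub-directory names."""
    layout = {}
    for study_dir in sorted(Path(root).iterdir()):
        # ASSUMPTION: study directories are named by accession ("phs...").
        if not study_dir.is_dir() or not study_dir.name.startswith("phs"):
            continue
        layout[study_dir.name] = sorted(
            child.name for child in study_dir.iterdir() if child.is_dir()
        )
    return layout
```

Calling list_vcf_dirs("/nfs/seqscratch10/ANNOTATION/GROUP_DATA/") would return a dictionary keyed by accession, which can be compared against the per-study VCF counts in the attached spreadsheet.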
The files used to link the phenotype data and SraRunTable entries to the VCF files were produced as a team effort within the bioinformatics group. They are located within the directory
/nfs/seqscratch10/ANNOTATION/GROUP_DATA/evs_sample_lists/ as *.csv files.
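Since the column layout of these linking files is not described here, a reasonable first step is simply to list each file with its first row. The sketch below does that; the function name is hypothetical, and treating the first row as a header is an assumption.

```python
import csv
from pathlib import Path


def csv_first_rows(directory):
    """Return {file name: first row} for every *.csv file in the directory."""
    first_rows = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as handle:
            # ASSUMPTION: the first row is a header; an empty file yields [].
            first_rows[path.name] = next(csv.reader(handle), [])
    return first_rows
```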