We downloaded 5503 EVS samples of the 6503 that are available in the aggregate dataset. The other 1000 samples are not available for individual control use and cannot be downloaded.
Notes on the EVS individual data download process from dbGaP are available on a separate wiki page.
We performed sex checks using chr X het:hom ratios, ethnicity checks using EIGENSTRAT axes, and relatedness checks using KING (the relatedness checks also identified highly contaminated samples). In total, 156 samples were excluded for failing one of these checks (mostly relatedness checks for being second-degree relatives).
The remaining 5347 samples will be added as controls to your analysis if you use --include-evs-sample all (1951 if you use aa, 3396 if you use ea).
Investigations comparing against our samples showed p-value inflation when performing collapsing analyses including all of these samples. The most useful remedy for the inflation was to exclude a new list of EVS artifacts (described here http://redmine2.chgv.lsrc.duke.edu/projects/bioinfo_tools/wiki/Artifacts). We recommend removing these artifacts from your analyses using the --exclude-variant command. They are not yet part of the --exclude-artifacts command as we need to finalize the list now that certain ATAV features are working correctly with the dataset.
We also identified 469 whites who were contributing to the inflation (and 437 blacks who would likely contribute to inflation in a black collapsing). We recommend excluding them from your collapsing analyses. To do this, add the samples in the attached file to your usual --sample list instead of using the --include-evs-sample command.
This leaves you 2927 whites and 1514 blacks that can be added to analyses. We are continuing to investigate ways to use these samples and may change the list in the future. In particular, we have not yet tried performing a collapsing analysis on blacks. If you have anything to add based on your own experiences, please please please contact Liz (firstname.lastname@example.org)!
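The sample counts quoted above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch: the variable names are made up for illustration, and it assumes "whites" corresponds to the ea group and "blacks" to the aa group.

```python
# Counts taken from the text above; variable names are illustrative only.
downloaded = 5503
failed_checks = 156  # sex, ethnicity, or relatedness failures
passed_checks = downloaded - failed_checks

# Samples added by --include-evs-sample ea / aa, per the text.
passed_by_group = {"ea": 3396, "aa": 1951}

# Samples flagged as contributing (or likely to contribute) to inflation.
inflation_flagged = {"ea": 469, "aa": 437}

# The per-group counts should add up to the total that passed the checks.
assert sum(passed_by_group.values()) == passed_checks

# Samples left over for collapsing analyses in each group.
usable = {
    group: passed_by_group[group] - inflation_flagged[group]
    for group in passed_by_group
}
print(passed_checks, usable)
```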
The EVS samples come from numerous studies and were downloaded in ~90 different VCFs. Different levels of QC measures are available for the samples, and many samples have certain QC fields missing. By default, if you set a cutoff like --gq 20 and a variant has no GQ in a certain sample, then that sample's genotype will be set to missing. If you want to include such genotypes, use the new option --include-qc-missing. Also, many EVS samples do not have indels called.
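The handling of missing QC fields described above can be pictured with a small sketch. This is not ATAV's code: the function name and arguments are hypothetical, and it assumes the cutoff is inclusive (a GQ equal to the cutoff passes).

```python
from typing import Optional


def keep_genotype(gq: Optional[int], min_gq: int,
                  include_qc_missing: bool = False) -> bool:
    """Return True if the genotype is kept, False if it is set to missing.

    Hypothetical illustration of a --gq cutoff combined with
    --include-qc-missing; the real ATAV logic may differ in detail.
    """
    if gq is None:
        # No GQ recorded for this sample: keep the genotype only when
        # missing QC values have been explicitly allowed.
        return include_qc_missing
    # Assumption: the cutoff is inclusive.
    return gq >= min_gq


# A sample with no GQ is dropped by default and kept when the flag is on.
print(keep_genotype(None, 20))
print(keep_genotype(None, 20, include_qc_missing=True))
```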
The coverage for the EVS samples is calculated using the aggregate data already available. We correct for the number of EVS samples actually used in the analysis (thus, if you only want to use 100 white EVS samples, ATAV will use the data available for the aggregate 4300 white EVS samples and will scale it down to 100 white EVS samples when calculating MAFs). Indels were not called in all EVS samples, so indel MAFs are corrected for that smaller number of samples as well. Also, all EVS coverage is based on the % of samples covered at least 8x, whereas most work in the CHGV uses 10x. There is nothing to be done about this except to be aware of it.
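One plausible reading of the scaling described above is a simple proportion: the aggregate fraction of samples covered at 8x is applied to the number of EVS samples actually in use. The sketch below illustrates that reading only. The function name is hypothetical, and how ATAV actually applies the correction is an assumption here.

```python
def scaled_covered_samples(pct_covered_8x: float, n_evs_used: int) -> float:
    """Estimate how many of the EVS samples in use are covered at >= 8x.

    pct_covered_8x: aggregate % of EVS samples covered at >= 8x at a site.
    n_evs_used:     number of EVS samples included in this analysis.
    Assumption: coverage in the subset mirrors the aggregate proportion.
    """
    if not 0.0 <= pct_covered_8x <= 100.0:
        raise ValueError("pct_covered_8x must be between 0 and 100")
    if n_evs_used < 0:
        raise ValueError("n_evs_used must be non-negative")
    return pct_covered_8x * n_evs_used / 100.0


# Using 100 white EVS samples at a site where 90% of the aggregate
# white samples are covered at 8x.
print(scaled_covered_samples(90.0, 100))
```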
Comparisons between the aggregate MAFs and ATAV-calculated MAFs based on the individual-level data show that >99% of SNVs and >97% of indels have values within 1% of expectation. Most of the discrepancies seem to be due to differences between us and EVS in how data are filtered. Because the ~90 EVS VCFs we downloaded provide different quality data, it is not possible to perfectly replicate the values given in the aggregate EVS dataset.
More detailed phenotype data is available for nearly all of the samples loaded into AnnoDB. These are currently stored in tab-delimited text files that can be opened with Excel at:
/nfs/seqscratch10/tx_temp/dbGaP-6013/PHENOTYPES/
These files were pulled directly from dbGaP and have meaningful headers. The data is broken up by consent group, for each individual study accession (phs000.. identifier).
Note that one study accession, phs000422 (encompassing 191 samples), had no phenotype data available on the dbGaP site.
In order to find phenotype data given a particular AnnoDB EVS sample name (CHGVID), follow this simple procedure:
- Extract the study accession and the dbGaP subject ID from the sample name (CHGVID) of interest. The names are formatted like: evs_[ea/aa]_[study accession ID]_[subject ID].
- Locate the files whose names contain the study accession ID within the directory /nfs/seqscratch10/tx_temp/dbGaP-6013/PHENOTYPES/. Download/copy to open with Excel if necessary.
- Search all matching files for the subject ID, usually found in the 1st column of the file.
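The procedure above could be scripted roughly as follows. This is a minimal sketch, not an existing tool: the function names are hypothetical, the sample name in the usage example is made up, and it assumes the subject ID appears as a whole tab-separated field somewhere in the row.

```python
from pathlib import Path
from typing import List, NamedTuple, Tuple

PHENOTYPE_DIR = Path("/nfs/seqscratch10/tx_temp/dbGaP-6013/PHENOTYPES")


class EvsSampleName(NamedTuple):
    population: str       # "ea" or "aa"
    study_accession: str  # phs... identifier
    subject_id: str       # dbGaP subject ID


def parse_chgvid(chgvid: str) -> EvsSampleName:
    """Split evs_[ea/aa]_[study accession ID]_[subject ID] into its parts."""
    # Split at most 3 times so underscores inside the subject ID survive.
    parts = chgvid.split("_", 3)
    if len(parts) != 4 or parts[0] != "evs" or parts[1] not in ("ea", "aa"):
        raise ValueError(f"not an EVS sample name: {chgvid!r}")
    return EvsSampleName(parts[1], parts[2], parts[3])


def find_phenotype_rows(chgvid: str,
                        phenotype_dir: Path = PHENOTYPE_DIR
                        ) -> List[Tuple[str, str]]:
    """Return (file name, row) pairs whose fields include the subject ID."""
    sample = parse_chgvid(chgvid)
    hits: List[Tuple[str, str]] = []
    if not phenotype_dir.is_dir():
        return hits
    # Look for files whose name contains the study accession ID.
    for path in sorted(phenotype_dir.rglob(f"*{sample.study_accession}*")):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for line in handle:
                row = line.rstrip("\n")
                # The subject ID is usually in the 1st column, but check
                # every tab-separated field to be safe.
                if sample.subject_id in row.split("\t"):
                    hits.append((path.name, row))
    return hits


if __name__ == "__main__":
    # Hypothetical sample name, for illustration only.
    for file_name, row in find_phenotype_rows("evs_ea_phs000123_456789"):
        print(file_name, row, sep="\t")
```

Matching whole fields rather than substrings avoids hits where the subject ID is only part of a longer number elsewhere in the row.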
Data later became available linking samples from phs000254 to phenotype data, so the CHGVIDs were updated to make this connection easier. See the attached 'phs000254_name_update' spreadsheet to map to the original CHGVIDs.