Reference SNV Data Sets¶
This page documents sets of SNVs that have been generated for various purposes, mostly for ATAV-related analyses. For ATAV usage, refer to Variant Level Filter Options for details on the specific parameters to use. Throughout, <PATH> is shorthand for the path to the variant file.
Variants for Cryptic Relatedness/Ancestry Pruning¶
This is a set of SNPs on the 1MDuo Illumina chip that are exonic, have intermediate MAF, and are in linkage equilibrium in a test set of samples. This set of variants is used in our production pipeline for cryptic relatedness inference and for confirming documented genetic relationships. These are rs IDs, so use
A total of 2783 internally sequenced samples were combined with 2504 samples from the 1000 Genomes Phase 3 project. The combined set of 5287 samples comprises the following ancestries: 2911 Caucasian, 184 Middle East, 368 Hispanic, 539 East Asian, 529 South Asian, and 756 African. SNVs from this set of samples were filtered for >= 5% MAF and >= 95% genotyping rate (i.e. >= 10x coverage), and were then LD-pruned. The suffix "Roche" indicates that this can be an appropriate set of SNPs when using kits similar to the Roche kit, which has been used for most of our samples. Use the --variant <PATH> parameter.
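The MAF and genotyping-rate filters described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `SiteCounts` record, the function names, and the site IDs are hypothetical, sites are assumed to be diploid and autosomal, and the LD-pruning step is not shown.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Per-site summary counts (hypothetical record layout)."""
    variant_id: str
    n_samples: int    # total samples in the cohort
    n_called: int     # samples with a genotype call at sufficient coverage
    alt_alleles: int  # alternate allele count among the called samples


def minor_allele_frequency(site: SiteCounts) -> float:
    # Assumes diploid genotypes, so each called sample contributes 2 alleles.
    total_alleles = 2 * site.n_called
    if total_alleles == 0:
        return 0.0
    alt_freq = site.alt_alleles / total_alleles
    return min(alt_freq, 1.0 - alt_freq)


def passes_filters(site: SiteCounts, min_maf: float = 0.05,
                   min_call_rate: float = 0.95) -> bool:
    if site.n_samples == 0:
        return False
    call_rate = site.n_called / site.n_samples
    return call_rate >= min_call_rate and minor_allele_frequency(site) >= min_maf


# Made-up example sites; the IDs are placeholders, not real rs IDs.
sites = [
    SiteCounts("site_A", 5287, 5200, 1040),  # MAF 0.10, call rate ~0.98
    SiteCounts("site_B", 5287, 5200, 104),   # MAF 0.01, too rare
    SiteCounts("site_C", 5287, 4000, 2000),  # call rate ~0.76, too low
]
kept = [s.variant_id for s in sites if passes_filters(s)]
```

Only `site_A` satisfies both thresholds here; the other two fail on MAF and on genotyping rate respectively.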
This SNP set was generated in the same manner as the "Roche" version, but has been restricted to the targeted regions of the Nextera 37MB capture kit. We observed poor performance when samples captured with this kit were included without restricting to this (smaller) subset of the genome. This is our most recently created (as of 2/2017) set of informative SNPs, so we suggest using it unless an analyst chooses to construct a custom set. For ATAV usage, use
Variants Identified as Artifactual¶
De Novo Artifacts¶
These are variants that were reported as putative de novo variants in more than 5 unrelated probands from a set of 604 Duke-sequenced trios. This is included in
Aggregate EVS Artifacts¶
These are protein-coding variants that had a two-tailed Fisher's exact test p < 2e-8 (to account for multiple testing) for an allelic imbalance between 654 European American non-disease ascertained samples sequenced at Duke and ESP's 4300 European American samples. This is included in
Individual EVS Artifacts¶
This list was created by comparing 3400 EVS European American samples to 1501 European American non-disease ascertained samples sequenced at Duke (/nfs/svaprojects2/liz/alsgrctrl_vswhitehealthy.txt). A Fisher's exact test was performed on all variants using dominant, recessive, genotypic, and allelic models, and any variant with p < 1e-8 (to account for multiple testing) was classified as an artifact. There is also a larger list that contains all variants with p < 1e-5. This is not included in
Duke Sequencing Artifacts, Roche kit vs TruSeq 65 MB kit¶
Variants from 715 non-disease ascertained European American samples processed with the 65 MB kit were tested with four models (dominant, recessive, genotypic, and allelic) against variants from 734 non-disease ascertained European American samples processed with the Roche kit, and all variants with p < 2e-8 (to account for multiple testing) are listed here. There is also a more lenient version, which contains all variants with p < 1e-5. This is not included in
Variants from 108 schizophrenia controls from phs000473 (from dbGaP) were tested with four models (dominant, recessive, genotypic, and allelic) against variants from 801 samples sequenced at Duke utilizing the 65 MB kit, and all variants with p < 1e-5 from any model are included here. This is not included in
Variants from 53 individuals of European ancestry (from Guy Rouleau, using the 50 MB v4 kit) were tested with four models (dominant, recessive, genotypic, and allelic) against variants from 1503 European ancestry controls from Duke. Variants that were coding, in the UTR, or intronic, and that had p < 1e-3 (N.B. the threshold is not strict here due to the low sample size), are listed here. These are not included in --exclude-artifacts, and their general use is not encouraged given the specifics of the controls used.
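The artifact lists in this section all rest on Fisher's exact test applied to count tables. As a rough illustration of the allelic case only, here is a pure-Python two-tailed test on a 2x2 table of alternate versus reference allele counts. The function name and the example counts are assumptions made for this sketch (only the cohort sizes 53 and 1503 come from the text), and it is not the code that produced these lists.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(x) for x in range(lo, hi + 1)]
    p_obs = prob(a)
    # Sum every table that is no more probable than the observed one; the
    # small tolerance guards against floating-point ties.
    return min(sum(p for p in probs if p <= p_obs * (1 + 1e-7)), 1.0)


# Hypothetical allele counts for a single variant:
# 53 test samples -> 106 alleles, 1503 controls -> 3006 alleles.
test_alt, test_ref = 20, 86
ctrl_alt, ctrl_ref = 30, 2976
p_example = fisher_exact_two_tailed(test_alt, test_ref, ctrl_alt, ctrl_ref)
is_flagged = p_example < 1e-3  # lenient threshold quoted for this list
```

For large cohorts a library routine such as `scipy.stats.fisher_exact` would be the more usual choice; the hand-written version is used here only to keep the sketch dependency-free.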
Variants from 94 Hispanics without high psychopathy were tested with three models (dominant, recessive, and allelic) against variants from 2521 non-Hispanics, and any non-synonymous variant with p < 0.05 is listed here. These are not included in --exclude-artifacts and are unlikely to be used for other studies.
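To make the dominant, recessive, and allelic models concrete, the sketch below collapses per-group genotype counts into the 2x2 table each model would test. It assumes the models are defined with respect to the alternate allele, which the text does not state; the function names and the genotype counts are hypothetical, chosen only so that the group sizes match the 94 and 2521 samples mentioned above. The genotypic model is omitted because it uses the full 2x3 genotype table rather than a collapsed one.

```python
from typing import Dict, List, Tuple

Genotypes = Tuple[int, int, int]  # (hom_ref, het, hom_alt) sample counts


def collapse(counts: Genotypes, model: str) -> Tuple[int, int]:
    """Collapse genotype counts into the two columns tested under a model."""
    hom_ref, het, hom_alt = counts
    if model == "dominant":   # carriers of at least one alt allele vs. none
        return het + hom_alt, hom_ref
    if model == "recessive":  # homozygous alt vs. everyone else
        return hom_alt, hom_ref + het
    if model == "allelic":    # alt allele count vs. ref allele count
        return het + 2 * hom_alt, 2 * hom_ref + het
    raise ValueError(f"unknown model: {model}")


def model_tables(group1: Genotypes, group2: Genotypes) -> Dict[str, List[List[int]]]:
    """Return one 2x2 table per model, with one row per group."""
    return {
        model: [list(collapse(group1, model)), list(collapse(group2, model))]
        for model in ("dominant", "recessive", "allelic")
    }


# Made-up genotype counts: 94 samples in group 1, 2521 in group 2.
tables = model_tables((80, 12, 2), (2400, 115, 6))
```

Each resulting table can then be passed to a Fisher's exact test, and a variant is listed if it falls below the chosen threshold.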