Project

General

Profile

Collapsing

Please read and follow the guidelines from the collapsing workflow which is based on previously published IGM collapsing analyses performed for idiopathic pulmonary fibrosis (Petrovski et al. 2017) and common epilepsies (Epi4K Consortium 2017).

Reference papers:

Special notes:

  • If you are using gnomad filters, please make sure not to include attached IGM samples.
  • For most collapsing analyses, you will want to run the coverage comparison function first and then use the pruned exon file and coverage information from that output in your subsequent collapsing analyses.
  • By providing the --ccds-only option and also a gene-boundaries file (which can be generated by running a coverage comparison analysis) through --gene-boundaries option , the function will allow user to perform collapsing analysis on filtered genes and exons.
  • --collapsing-comp-het function does not work well for XY genes. People will usually have the same variant called on both chromosomes but with a different name/position, so they will look like they are compound het when they are not.
  • More details on this method and the way to run it outside of ATAV can be found here: http://redmine.igm.cumc.columbia.edu/projects/bioinfo_tools/wiki/LogisticLinear_collapsing_analysis
  • This function support to do gene domains collapsing, please check below --gene-boundaries option.
  • When calling this function:
    1. using --covariate option alone will trigger to calculate logistic p
    2. using --quantitative option alone will trigger to calculate linear p
    3. using both --covariate option and --quantitative option will trigger to calculate linear p with sample quantitative values and covariate values included

Command examples:

atav.sh --collapsing-dom --sample PATH_TO_SAMPLE_FILE --out PATH_TO_OUTPUT_DIR
atav.sh --collapsing-dom --sample PATH_TO_SAMPLE_FILE --ccds-only --gene-boundaries PATH_TO_GENE_BOUNDARY_FILE --out PATH_TO_OUTPUT_DIR

Command to generate permutation based empirical Q-Q plots:

/nfs/goldstein/software/python2.7.7/bin/python \
/nfs/goldstein/software/atav_home/lib/generate_qq.py \
PATH_TO_COLLAPSING_SUMMARY_FILE  \
PATH_TO_COLLAPSING_MATRIX_FILE \
OUTPUT_FILE_PATH

This script generates a Q-Q plot based on 1000 random permutations of case/control labels. In other words, it generates the distribution of expected p-values in the absence of association between genotypes and disease status. Within each permutation, the p-values are sorted. When all permutations are finished, the mean of each p-value rank is calculated (i.e. the mean of the lowest p-value from each permutation, the mean of the second lowest p-value from each permutation, etc.). This is then plotted against the observed p-values. The plot also includes the 2.5th percentile and 97.5th percentile of expected p-values at each rank, which are displayed as green and yellow lines, respectively. The lambda factor displayed on the plot is the slope of the regression line of the observed vs. expected points, after removing p-values of 1 (or -log10(p-value) of 0) and p-values of genome-wide significance after Bonferroni correction.

Command options:

--collapsing-dom: trigger collapsing function for dominant model.

This function “collapses” all qualified variants in a gene into one measurement, and it calls samples carriers if they have any qualified variant in the gene.

--collapsing-rec: trigger collapsing function for recessive model.

This function “collapses” all qualified recessive variants in a gene into one measurement, and it calls samples carriers if they have any qualified variant in the gene.

--collapsing-comp-het: trigger collapsing compound het function.

This function calls samples carriers if they are either hom for a qualified variant in a gene or have two qualified variants in the same gene.

--loo-af: the allele frequency is calculated based on all samples in sample file (ignoring the one where the variant was observed)

--min-covered-sample-binomial-p: exclude all variants with binomial coverage comparison test p-value below specified threshold.

--gene-boundary: specify a gene-boundary file to indicate which exonic regions you want to include in your analysis. This file is defined by a gene name followed by its region(exon) information. When the gene-boundary option is specified, a variant not only has to be in the regions from the gene-boundary file but also has to match the gene name or gene domain name to be output.

Ex:

gene name: AVPR1B 1 (206224439..206225382,206230806..206231144) 1283
gene domain name: CFHR5_-_0 1 (196946795..196946852,196952015..196952209,196953091..196953095) 258

Note: There are 4 columns in this format,separated by space. Column 1 is the gene name or gene domain name; column 2 is the chromosome (1,2,...X,Y); column 3 is a list of regions(exons) that one wants to use to define the gene, separated by comma, enclosed by parenthesis, with each region in the format of region_start..region_end; column 4 is the total count of sites from all regions in column 3. The start/stop positions in gene-boundaries file is one based.
CCDS gene boundaries file directory: /nfs/goldstein/software/atav_home/data/ccds (currently recommend using /nfs/goldstein/software/atav_home/data/ccds/addjusted.CCDS.genes.index.r14.txt; this file has 2bp added to the ends of each exon to catch splice sites)
Gene domain example file: /nfs/goldstein/software/atav_home/data/gene/dRVIS_domain_index_withoutUTR.txt

--region-boundary: specify a region-boundaries file to indicate which exonic regions you want to include in your analysis.

Format: region_name chr:start-end,chr:start-end,chr:start-end
Ex. hsa-mir-6859-1 1:17369-17436,1:18369-18436 (comma delimited regions)

--read-coverage-summary: specify the output file from coverage summary function then ATAV will append gene relevant avgCase, avgCtrl, absDiff values to collapsing summary output.

--quantitative: specify a quantitative file to include all your interested samples and relevant quantitative traits.

File format: sample name and quantitative trait (tab delimiter)

--covariate: specify a covariate file to include all your interested samples and relevant covariates.

File Header is required: Familty ID, Individual ID, Covar1, Covar2, Covar3_cat... (tab delimited)
The sample list in covariate file should match sample file.
The Firth logistic regression will be automatically triggered once you use --covariate option.

--mann-whitney-test: trigger to run variant Mann Whitney Test script downstream.

All the Command Options are available to use in this function.

Output:

gene.sample.matrix.txt: a matrix where each cell represents qualifying/not status for a sample at a gene.

genotypes.csv: output qualified variants. The output format is the same as --list-var-geno function with three additional columns.
  • Covered Case: Total # of case samples with coverage at this variant site (at specified coverage depth)
  • Covered Ctrl: Total # of control samples with coverage at this variant site (at specified coverage depth)
  • Covered Sample Binomial P (two sided): P-value result of two-sided binomial coverage comparison test

summary.csv: summarize qualified variants & samples within one gene & output ordered by fisher p values. (output linear p value and logistic p value for each genes)

  • Total Variant: Total # of qualified variants in this gene in this sample of cases and controls
  • Total SNV: Total # that are SNVs
  • Total Indel: Total # that are indels
  • Qualified Case: Number of cases with a qualifying variant in this gene
  • Unqualified Case: Number of cases without a qualifying variant in this gene
  • Qualified Case Freq: proportion of cases with qualifying variants
  • Qualified Ctrl: Number of ctrls with a qualifying variant in this gene
  • Unqualified Ctrl: Number of ctrls without a qualifying variant in this gene
  • Qualified Ctrl Freq: proportion of ctrls with qualifying variants
  • Enriched Direction
  • Fet P: p-value for an imbalance between proportion of cases and ctrls with a rare variant in this gene
  • Linear P
  • Logistic P

comphet.csv: output qualified combination variants. (includes people as qualified if 1) the person has 2 het variants or 2) the person has at least 1 hom variant) (only output for --collapsing-comp-het function)

sample.variant.count.csv

  • Sample Name
  • Total Sample Variant
  • Total Sample SNV
  • Total Ref SNV
  • Total Het SNV
  • Total Hom SNV
  • Total Sample INDEL
  • Total Ref INDEL
  • Total Het INDEL
  • Total Hom INDEL

Please check Output Columns for common columns.

Difference:

While the new collapsing doesnt do a MAF-all like function, what it does is a leave-one-out like MAF, where maf is calculated based on all samples in sample file (ignoring the one where the variant was observed). This feature allows us to run collapsing on singletons, but also removes the inherent bias of filtering out based on same controls that you then run your stats one.

Note:

We recommend leveraging the available QC metrics to permit analyses on increasingly cleaner sets of variant calls. For options, see: Genotype Level Filter Options