Collapsing Compound Het

Special Note:
  • For most collapsing analyses, you will want to run the coverage comparison function first and then use the pruned exon file and coverage information from that output in your subsequent collapsing analyses.
  • This function does not work well for XY genes. The same variant is usually called on both chromosomes under a different name/position, so a person can look compound het when they are not.

Command examples:
--collapsing-comp-het --sample $SAMPLE_FILE --out $OUTPUT_PATH
--collapsing-comp-het --sample $SAMPLE_FILE --ccds-only --gene-boundaries $GENE_BOUNDARY_FILE --out $OUTPUT_PATH

Command options:

--collapsing-comp-het: trigger collapsing compound het function.

Note: This function calls a sample a carrier if the sample is either hom for a qualified variant in a gene or has two qualified variants in the same gene.
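
The carrier rule in the note above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the tool's actual implementation: the function name `call_carriers`, the input tuple layout, and the genotype labels "het"/"hom" are all assumptions made for this example.

```python
from collections import defaultdict

def call_carriers(genotypes):
    """Return the set of (sample, gene) pairs that count as carriers.

    `genotypes` is an iterable of (sample, gene, variant_id, genotype)
    tuples for qualified variants only, where genotype is "het" or "hom".
    This input layout is hypothetical and chosen for illustration.
    """
    hom_seen = set()              # (sample, gene) pairs with at least one hom variant
    variants = defaultdict(set)   # (sample, gene) -> distinct qualified variants
    for sample, gene, variant_id, genotype in genotypes:
        key = (sample, gene)
        variants[key].add(variant_id)
        if genotype == "hom":
            hom_seen.add(key)
    # Carrier: hom for one qualified variant, or two qualified variants in the gene.
    return {key for key, ids in variants.items()
            if key in hom_seen or len(ids) >= 2}
```

A sample with a single het qualified variant in a gene is not a carrier under this rule, while a sample with two het variants or one hom variant in the same gene is.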

--gene-boundaries: specify a gene-boundaries file to indicate which exonic regions you want to include in your analysis. Each line of this file gives a gene name followed by its region (exon) information.

example: AVPR1B 1 (206224439..206225382,206230806..206231144) 1283
Note: There are 4 columns in this format, separated by spaces. Column 1 is the gene name; column 2 is the chromosome (1,2,...X,Y); column 3 is a comma-separated list of the regions (exons) used to define the gene, enclosed in parentheses, with each region in the format region_start..region_end; column 4 is the total count of sites across all regions in column 3. The start/stop positions in the gene-boundaries file are one-based.
CCDS gene boundaries file directory: /nfs/goldstein/goldsteinlab/software/atav_home/data/ccds
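
As a sanity check on the format described above, a gene-boundaries line can be parsed and its site count verified with a short Python sketch. The function name `parse_gene_boundary` and the regular expression are assumptions made for this example, not part of the tool. The sketch treats both region_start and region_end as included in the region, which is what makes the example line add up to 1283.

```python
import re

# gene, chromosome, "(start..end,start..end,...)", total site count
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+\(([^)]*)\)\s+(\d+)$")

def parse_gene_boundary(line):
    """Parse one gene-boundaries line into (gene, chrom, regions, site_count)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a gene-boundaries line: %r" % line)
    gene, chrom, region_text, total = match.groups()
    regions = []
    for part in region_text.split(","):
        start, end = part.split("..")
        regions.append((int(start), int(end)))
    # Positions are one-based and both ends are counted.
    site_count = sum(end - start + 1 for start, end in regions)
    if site_count != int(total):
        raise ValueError("column 4 is %s but regions cover %d sites"
                         % (total, site_count))
    return gene, chrom, regions, site_count

example = "AVPR1B 1 (206224439..206225382,206230806..206231144) 1283"
gene, chrom, regions, site_count = parse_gene_boundary(example)
```

For the example line, the two regions cover 944 and 339 sites, which sum to the 1283 given in column 4.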

--covariate: specify a covariate file that lists all the samples you are interested in and their relevant covariates.

File format: Family ID, Individual ID, Covar1, Covar2, Covar3_cat...
The sample list in the covariate file should match the sample file.

--loo-maf: leave-one-out (loo) MAF, calculated from all samples in the sample file except the one where the variant was observed.

--loo-maf-rec: loo minor allele frequency for filtering recessive variants.

--loo-mhgf-rec: loo minor homozygous genotype frequency for filtering recessive variants.

--loo-comb-freq: apply a frequency filter for the co-occurrence in controls.

All the Common Command Options are available to use in this function.


Output files:

gene.sample.matrix.txt: a matrix where each cell gives the qualifying/not qualifying status of a sample at a gene.
comphet.csv: lists the qualified variant combinations. A person is included as qualified if they have (1) 2 het variants or (2) at least 1 hom variant.
summary.csv: summarizes qualified variants and samples within each gene, ordered by Fisher p-value.


The new collapsing does not offer a MAF-all-like function. Instead it uses a leave-one-out MAF, where the MAF is calculated from all samples in the sample file except the one where the variant was observed. This allows us to run collapsing on singletons, and it also removes the inherent bias of filtering on the same controls that you then run your statistics on.

In this function, --loo-maf, --loo-maf-rec and --loo-mhgf-rec are all calculated in this leave-one-out way.
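
The leave-one-out idea can be illustrated with a minimal Python sketch. It shows one reading of the description above and is not the tool's actual calculation, which may treat missing genotypes, coverage and sex chromosomes differently. The function name `loo_maf` and the input layout (sample name mapped to the number of alternate alleles, 0, 1 or 2, at one diploid site) are assumptions made for this example.

```python
def loo_maf(alt_allele_counts, left_out):
    """Minor allele frequency at one site, computed without `left_out`.

    `alt_allele_counts` maps sample name -> alternate allele count (0, 1, 2).
    The layout is hypothetical and assumes a diploid site with no missing calls.
    """
    others = [count for sample, count in alt_allele_counts.items()
              if sample != left_out]
    if not others:
        return 0.0
    alt_freq = sum(others) / (2 * len(others))
    # The minor allele is whichever allele is rarer among the remaining samples.
    return min(alt_freq, 1 - alt_freq)

# A singleton: only s1 carries the variant.
site = {"s1": 1, "s2": 0, "s3": 0, "s4": 0}
singleton_maf = loo_maf(site, "s1")   # frequency seen from the carrier's side
other_maf = loo_maf(site, "s2")       # frequency when a non-carrier is left out
```

Leaving the carrier out of a singleton site gives a frequency of zero, so the variant is not removed by a frequency threshold on account of the very sample in which it was observed.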