Kinship Pruning

This tool performs kinship pruning (removing related individuals) on a cohort of interest, as described below. The protocol is to create a PED and MAP pair for the samples and variants of interest, run KING to estimate pairwise kinship coefficients, and then pass this script the PED file (or, alternatively, a file in our custom sample format) together with the KING output file. The script produces a new, pared-down PED file from which at least one individual in every pair of relatives has been removed.

This method is fast, spares the analyst from manually removing relatives when building an initial sample set, and ensures consistency: a related pair cannot be included by mistake just because the two samples are not annotated as related. Specifying -v/--verbose generates a log of the actions performed to arrive at the final sample set.
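The two command lines in this protocol can be assembled as in the minimal sketch below. The KING invocation and the run_kinship.py flags are taken from the usage text further down this page. The file names (cohort.bed, cohort.ped, king.kin0, cohort.pruned.ped) are hypothetical placeholders, and the sketch assumes a binary BED fileset for KING already exists. Which KING output file to pass as KINSHIP_FILE is also an assumption here.

```python
import shlex

# Path of the pruning script as given on this page.
RUN_KINSHIP = "/nfs/goldstein/software/atav_home/lib/run_kinship.py"


def build_commands(bed_file, ped_file, kinship_file, output_ped, threshold=0.0884):
    """Return (king_cmd, prune_cmd) as argument lists; nothing is executed."""
    # KING invocation as recommended in the script's help text.
    king_cmd = ["king", "-b", bed_file, "--kinship", "--related", "--degree", "3"]
    # Options first, then the two positional arguments PED_FILE and KINSHIP_FILE.
    prune_cmd = [
        RUN_KINSHIP,
        "-r", str(threshold),
        "-v",
        "-o", output_ped,
        ped_file,
        kinship_file,
    ]
    return king_cmd, prune_cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    commands = build_commands(
        "cohort.bed", "cohort.ped", "king.kin0", "cohort.pruned.ped"
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

Building the commands as lists keeps them safe to hand to subprocess.run without shell quoting problems. Printing them first makes it easy to check the arguments before running anything.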

N.B. The analyst must choose a set of SNPs with which to generate the PED. We have built a set that should be appropriate for most purposes, located at /nfs/goldstein/software/atav_home/data/variant/informative_snps.ld_pruned.37MB.txt. Briefly, it was generated by taking intermediate-MAF SNPs from a large set of exome samples of varied ancestry, restricting them to the targeted regions of the Nextera 37 MB kit (the smallest set of targeted regions among our samples), and finally LD-pruning the result.
Additionally, the parameter -r RELATEDNESS_THRESHOLD defaults to 0.0884. This is the value recommended by the authors of KING for removing second-degree or closer relatives.
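To see where 0.0884 comes from, the sketch below assumes the cutoffs follow the 2^-(d + 1.5) pattern for relatives of degree d that KING's documentation uses for its inference ranges. The function names are hypothetical and are not part of run_kinship.py. The comparison is strict because the help text says coefficients above the threshold are treated as related.

```python
def degree_lower_bound(degree):
    """Assumed lower kinship bound for relatives of the given degree."""
    return 2.0 ** -(degree + 1.5)


def is_related(kinship, threshold=0.0884):
    """Pairs with a kinship coefficient above the threshold count as related."""
    return kinship > threshold


if __name__ == "__main__":
    for degree in (1, 2, 3):
        print(degree, round(degree_lower_bound(degree), 4))
    # Expected kinship values: full siblings 0.25, half siblings 0.125,
    # first cousins 0.0625.
    for label, kinship in (("full siblings", 0.25),
                           ("half siblings", 0.125),
                           ("first cousins", 0.0625)):
        print(label, is_related(kinship))
```

With the default threshold, full and half siblings are flagged as related. First cousins (third-degree) are not flagged unless -r is lowered.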

/nfs/goldstein/software/atav_home/lib/run_kinship.py --help
usage: run_kinship.py [-h] [-r RELATEDNESS_THRESHOLD]
                      [--sample_coverage_summary SAMPLE_COVERAGE_SUMMARY]
                      [--seed SEED] [-v] [-o OUTPUT]
                      PED_FILE KINSHIP_FILE

Take a KING kinship file, PED/FAM/sample file (for getting phenotype),
and optional coverage summary file in order to generate a list of samples
to remove iteratively as follows:
    1. remove affected that is related to the most other affecteds
        a. break ties by removing the affected distantly related to
            the most other affecteds
        b. break ties by removing the affected related to the most unaffecteds
        c. break ties by removing the affected distantly related
            to the most unaffecteds
        d. (optional) break ties by removing the affected with the least
            coverage
    2. remove unaffected that is related to the most other unaffecteds
        a. break ties by removing the unaffected that is related to the most
            affecteds
        b. break ties by removing the unaffected that is distantly related to
            the most affecteds
        c. break ties by removing the unaffected that is distantly related to
            the most unaffecteds
        d. (optional) break ties by removing the unaffected with the least
            coverage
    3. remove unaffected that is related to the most affecteds
        a. break ties by removing the unaffected that is distantly related to
            the most affecteds
        b. break ties by removing the unaffected that is distantly related to
            the most unaffecteds
        c. (optional) break ties by removing the unaffected with the least
            coverage

KING should be run like: king -b <bed_infile> --kinship --related --degree 3

Written by Brett Copeland <bc2675@cumc.columbia.edu>

positional arguments:
  PED_FILE              a PED/MAP/ATAV sample file to get phenotype
  KINSHIP_FILE          the KING kinship output file to read

optional arguments:
  -h, --help            show this help message and exit
  -r RELATEDNESS_THRESHOLD, --relatedness_threshold RELATEDNESS_THRESHOLD
                        consider kinship coefficients above this value to be
                        related (default: 0.0884)
  --sample_coverage_summary SAMPLE_COVERAGE_SUMMARY
                        break ties by removing the sample with the lowest
                        coverage as indicated in this file (default: None)
  --seed SEED           set a random seed to guarantee the same results each
                        time (default: None)
  -v, --verbose         verbose mode (default: False)
  -o OUTPUT, --output OUTPUT
                        the output file (default: <open file '<stdout>', mode
                        'w' at 0x7f20474bc150>)
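The iterative removal described in the help text can be illustrated with the simplified sketch below. It is not the script's actual implementation. It keeps only the three stages and one tie-breaker per stage (the count of relatives in the other phenotype group), then breaks remaining ties at random. The "distantly related" and coverage tie-breakers are omitted. The input shape, a list of (id1, id2, kinship) tuples plus a set of affected IDs, is a hypothetical choice for illustration.

```python
import random
from collections import defaultdict


def prune(pairs, affected, threshold=0.0884, seed=None):
    """Return the set of sample IDs a simplified greedy pruning would remove."""
    rng = random.Random(seed)
    relatives = defaultdict(set)
    for id1, id2, kinship in pairs:
        if kinship > threshold:  # "above this value" counts as related
            relatives[id1].add(id2)
            relatives[id2].add(id1)

    removed = set()

    def count(sample, want_affected):
        # Number of remaining relatives of `sample` in the requested group.
        return sum(1 for r in relatives[sample]
                   if r not in removed and (r in affected) == want_affected)

    def remove_one(candidate_is_affected, primary_is_affected):
        # Score each candidate by (relatives in primary group, relatives in
        # the other group) and remove one of the top-scoring candidates.
        scored = []
        for sample in sorted(relatives):
            if sample in removed:
                continue
            if (sample in affected) != candidate_is_affected:
                continue
            primary = count(sample, primary_is_affected)
            if primary == 0:
                continue
            scored.append(((primary, count(sample, not primary_is_affected)), sample))
        if not scored:
            return False
        best = max(score for score, _ in scored)
        removed.add(rng.choice([s for score, s in scored if score == best]))
        return True

    # Stage 1: affected vs. affecteds; stage 2: unaffected vs. unaffecteds;
    # stage 3: unaffected vs. affecteds.
    for candidate, primary in ((True, True), (False, False), (False, True)):
        while remove_one(candidate, primary):
            pass
    return removed


if __name__ == "__main__":
    pairs = [("A1", "A2", 0.25), ("A2", "A3", 0.25), ("A1", "C1", 0.12),
             ("C1", "C2", 0.30), ("C3", "C4", 0.05)]
    print(sorted(prune(pairs, affected={"A1", "A2", "A3"}, seed=1)))
```

Because mixed pairs are handled last and only by removing the unaffected member, this ordering tends to retain affected samples. Passing a seed makes the random tie-breaking repeatable, which mirrors the purpose of the script's --seed option.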