Cohort Selection

Tool(s) for Selecting Cases & Controls that Meet User Specified Criteria
N.B. This tool is intended to build an initial cohort. It is the analyst's responsibility to take the resulting sample set and pare it down to one appropriate for analysis (for example, we do NOT perform any relatedness checks here). In addition, the default QC criteria can be changed to meet your specific needs.

Run /nfs/goldstein/software/cohort_selection/ --help to see options:
./ --help
usage: [-h]
                       [--sequencing_type {Exome,Genome,Custom_Capture} [{Exome,Genome,Custom_Capture} ...]]
                       [--sequencing_preference {Exome,Genome,Custom_Capture}]
                       [--bases_covered_10x BASES_COVERED_10X]
                       [--dbsnp_overlap_snv DBSNP_OVERLAP_SNV]
                       [--max_contamination MAX_CONTAMINATION]
                       [--ethnicity {African,Caucasian,EastAsian,Hispanic,MiddleEastern,SouthAsian} [{African,Caucasian,EastAsian,Hispanic,MiddleEastern,SouthAsian} ...]]
                       [--min_genotyping_rate MIN_GENOTYPING_RATE]
                       [--log_file_name LOG_FILE_NAME]
                       [--qc_file_name QC_FILE_NAME]

Assist user in selecting a set of cases and controls that meet certain user
specified criteria for downstream analysis

positional arguments:
  sample_file_name      Provide name of output sample list

optional arguments:
  -h, --help            show this help message and exit
  --sequencing_type {Exome,Genome,Custom_Capture} [{Exome,Genome,Custom_Capture} ...]
                        Specify one or more types of sequencing to include
                        (default: ['Exome'])
  --sequencing_preference {Exome,Genome,Custom_Capture}
                        Specify a preferred sequencing type in case a sample
                        has multiple database entries (default: Exome)
                        Include samples where self-declared sex does not match
                        that inferred from sequencing (default: False)
  --bases_covered_10x BASES_COVERED_10X
                        Require all samples to have at least this fraction of
                        bases in the CCDS regions to have 10x coverage
                        (default: 0.9)
  --dbsnp_overlap_snv DBSNP_OVERLAP_SNV
                        Require this fraction of overlap with dbSNP (SNVs)
                        (default: 0.9)
  --max_contamination MAX_CONTAMINATION
                        Require all samples to have <= this amount of
                        contamination (default: 0.03)
  --ethnicity {African,Caucasian,EastAsian,Hispanic,MiddleEastern,SouthAsian} [{African,Caucasian,EastAsian,Hispanic,MiddleEastern,SouthAsian} ...]
                        Specify one or more permitted ethnicities (default:
  --min_genotyping_rate MIN_GENOTYPING_RATE
                        Require all samples to have at least this genotyping
                        rate for ethnicity prediction (default: 0.9)
  --log_file_name LOG_FILE_NAME
                        Provide name of log file name, if desired (default:
  --qc_file_name QC_FILE_NAME
                        Provide name of output sample QC file (default: None)
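
The options shown above with repeated bracketed choices (for example --sequencing_type and --ethnicity) accept several values. The layout of the help text suggests Python's argparse, but that is an inference; the sketch below is a hypothetical reconstruction of only two options and the positional argument, not the tool's actual source. The function name build_parser and the file name cohort.txt are made up for illustration.

```python
import argparse

# Choices copied from the help text above.
SEQ_TYPES = ["Exome", "Genome", "Custom_Capture"]


def build_parser():
    """Hypothetical parser covering a small subset of the documented options."""
    parser = argparse.ArgumentParser(
        description="Illustrative subset of the cohort selection options",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("sample_file_name",
                        help="Provide name of output sample list")
    # nargs="+" is what lets one option take several space-separated values.
    parser.add_argument("--sequencing_type", nargs="+", choices=SEQ_TYPES,
                        default=["Exome"],
                        help="Specify one or more types of sequencing to include")
    parser.add_argument("--max_contamination", type=float, default=0.03,
                        help="Require all samples to have <= this amount of "
                             "contamination")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["cohort.txt", "--sequencing_type", "Exome", "Genome"])
print(args.sequencing_type)
```

If the tool does use argparse, a multi-value option greedily consumes the values that follow it, so in this sketch the positional sample file name is given before --sequencing_type rather than after it.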

If a parameter accepts one or more arguments, use a space to separate them. For example, to include both exome and genome samples, use --sequencing_type Exome Genome.

This tool is only intended to filter based on the following criteria:
  • sex typing mismatch (OPTIONAL): default is to not include these
  • sequencing type
  • CCDS regions covered at least at 10x
  • dbSNP overlap
  • contamination
  • ethnicity probability (if more than one ethnicity is specified, we accept any sample which has >= 80% probability of any one of the specified ethnicities; N.B. this does not handle admixed samples)
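
The criteria listed above can be sketched as a single pass/fail check per sample. This is a minimal illustration, not the tool's implementation: the SampleQC record, its field names, and the passes_qc function are hypothetical, and the behaviour when no ethnicity is specified (no ethnicity filtering) is an assumption. The thresholds are the defaults from the help text plus the 80% ethnicity probability cutoff stated above.

```python
from dataclasses import dataclass, field


@dataclass
class SampleQC:
    """Hypothetical per-sample QC record; field names are illustrative only."""
    name: str
    sequencing_type: str
    bases_covered_10x: float      # fraction of CCDS bases covered at >= 10x
    dbsnp_overlap_snv: float      # fraction of SNVs overlapping dbSNP
    contamination: float
    sex_mismatch: bool            # declared sex differs from inferred sex
    ethnicity_probs: dict = field(default_factory=dict)


def passes_qc(sample, sequencing_types=("Exome",), bases_covered_10x=0.9,
              dbsnp_overlap_snv=0.9, max_contamination=0.03,
              ethnicities=None, allow_sex_mismatch=False,
              min_ethnicity_prob=0.8):
    """Return True if the sample meets every criterion."""
    if sample.sex_mismatch and not allow_sex_mismatch:
        return False
    if sample.sequencing_type not in sequencing_types:
        return False
    if sample.bases_covered_10x < bases_covered_10x:
        return False
    if sample.dbsnp_overlap_snv < dbsnp_overlap_snv:
        return False
    if sample.contamination > max_contamination:
        return False
    if ethnicities:
        # Accept if ANY one permitted ethnicity reaches the cutoff.
        # Assumption: an empty or missing list means no ethnicity filter.
        return any(sample.ethnicity_probs.get(e, 0.0) >= min_ethnicity_prob
                   for e in ethnicities)
    return True


clear = SampleQC("s1", "Exome", 0.95, 0.97, 0.01, False, {"Caucasian": 0.93})
admixed = SampleQC("s2", "Exome", 0.95, 0.97, 0.01, False,
                   {"Caucasian": 0.5, "Hispanic": 0.5})
permitted = ["Caucasian", "Hispanic"]
print(passes_qc(clear, ethnicities=permitted))    # True
print(passes_qc(admixed, ethnicities=permitted))  # False
```

The second sample shows why admixed samples are not handled: neither of its probabilities reaches the cutoff, so it is rejected even though both of its likely ethnicities are permitted.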

The selection tool will display all samples that meet your specified QC criteria, grouped by broad phenotype. After you select CASE and CONTROL phenotypes, the tool will display the capture kits (if applicable) that apply to the selected samples. You can select any number of capture kits to include in your cohort, and the tool will proceed to generate a sample file in ATAV-ready format.

If you have provided --qc_file_name, the tool will write all available QC information for the final list of selected samples to the specified file. This can take a few moments.

If you have provided --log_file_name, the tool will keep a record of the usage details (including QC criteria, selected phenotypes, selected capture kits, etc.) in the specified file. Further runs will append to this file rather than overwriting.
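
The append-rather-than-overwrite behaviour corresponds to opening the file in append mode. The sketch below illustrates that idea only; the function name append_run_record and the one-JSON-object-per-line record layout are hypothetical, and the real log format is not documented here.

```python
import json
from datetime import datetime


def append_run_record(log_file_name, qc_criteria, phenotypes, capture_kits):
    """Append one usage record to the log, leaving earlier records intact."""
    record = {
        "timestamp": datetime.now().isoformat(),
        "qc_criteria": qc_criteria,
        "selected_phenotypes": phenotypes,
        "selected_capture_kits": capture_kits,
    }
    # Mode "a" creates the file if needed and otherwise writes at the end,
    # so repeated runs accumulate instead of replacing each other.
    with open(log_file_name, "a") as handle:
        handle.write(json.dumps(record) + "\n")
```

Calling this function twice with the same file name leaves two records in the file, one per run.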