Cohort Selection
Tool(s) for Selecting Cases & Controls that Meet User Specified Criteria
NB. This tool is intended to build an initial cohort. It is the analyst's responsibility to take the resulting sample set and pare it down to one appropriate for analysis (for example, we do NOT perform any relatedness checks here). In addition, the default QC criteria can be changed to meet your specific needs.
Run /nfs/goldstein/software/cohort_selection/build_cohort.py --help to see options:
    ./build_cohort.py --help
    usage: build_cohort.py [-h]
                           [--sequencing_type {Exome,Genome,Custom_Capture} [{Exome,Genome,Custom_Capture} ...]]
                           [--sequencing_preference {Exome,Genome,Custom_Capture}]
                           [--include_sex_typing_mismatches]
                           [--bases_covered_10x BASES_COVERED_10X]
                           [--dbsnp_overlap_snv DBSNP_OVERLAP_SNV]
                           [--max_contamination MAX_CONTAMINATION]
                           [--ethnicity {African,Caucasian,EastAsian,Hispanic,MiddleEastern,SouthAsian} [{African,Caucasian,EastAsian,Hispanic,MiddleEastern,SouthAsian} ...]]
                           [--min_genotyping_rate MIN_GENOTYPING_RATE]
                           [--log_file_name LOG_FILE_NAME]
                           [--qc_file_name QC_FILE_NAME]
                           sample_file_name

    Assist user in selecting a set of cases and controls that meet certain user
    specified criteria for downstream analysis

    positional arguments:
      sample_file_name      Provide name of output sample list

    optional arguments:
      -h, --help            show this help message and exit
      --sequencing_type {Exome,Genome,Custom_Capture} [{Exome,Genome,Custom_Capture} ...]
                            Specify one or more types of sequencing to include
                            (default: ['Exome'])
      --sequencing_preference {Exome,Genome,Custom_Capture}
                            Specify a preferred sequencing type in case a sample
                            has multiple database entries (default: Exome)
      --include_sex_typing_mismatches
                            Include samples where self-declared sex does not
                            match that inferred from sequencing (default: False)
      --bases_covered_10x BASES_COVERED_10X
                            Require all samples to have at least this fraction
                            of bases in the CCDS regions to have 10x coverage
                            (default: 0.9)
      --dbsnp_overlap_snv DBSNP_OVERLAP_SNV
                            Require this fraction of overlap with dbSNP (SNVs)
                            (default: 0.9)
      --max_contamination MAX_CONTAMINATION
                            Require all samples to have <= this amount of
                            contamination (default: 0.03)
      --ethnicity {African,Caucasian,EastAsian,Hispanic,MiddleEastern,SouthAsian} [{African,Caucasian,EastAsian,Hispanic,MiddleEastern,SouthAsian} ...]
                            Specify one or more permitted ethnicities (default:
                            None)
      --min_genotyping_rate MIN_GENOTYPING_RATE
                            Require all samples to have at least this genotyping
                            rate for ethnicity prediction (default: 0.9)
      --log_file_name LOG_FILE_NAME
                            Provide name of log file name, if desired (default:
                            None)
      --qc_file_name QC_FILE_NAME
                            Provide name of output sample QC file (default: None)
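In the usage text, an option followed by a repeated `[{...} ...]` group accepts several values. The layout of the help output looks like Python's argparse, where this pattern is what an option declared with `nargs="+"` produces. The following is a minimal sketch under that assumption, not the actual source of build_cohort.py; `build_parser` is a hypothetical name and only two of the options are re-created.

```python
import argparse

# Hypothetical re-creation of part of the interface shown in the help text.
# The real build_cohort.py may declare its options differently.
SEQUENCING_TYPES = ["Exome", "Genome", "Custom_Capture"]


def build_parser():
    parser = argparse.ArgumentParser(
        description="Sketch of a cohort-selection command line",
        # Appends "(default: ...)" to each help string, as in the output above.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    # nargs="+" collects one or more space-separated values into a list.
    parser.add_argument(
        "--sequencing_type",
        nargs="+",
        choices=SEQUENCING_TYPES,
        default=["Exome"],
        help="Specify one or more types of sequencing to include",
    )
    parser.add_argument(
        "--max_contamination",
        type=float,
        default=0.03,
        help="Require all samples to have <= this amount of contamination",
    )
    parser.add_argument("sample_file_name", help="Provide name of output sample list")
    return parser


if __name__ == "__main__":
    # Equivalent to: cohort.txt --sequencing_type Exome Genome
    args = build_parser().parse_args(
        ["cohort.txt", "--sequencing_type", "Exome", "Genome"]
    )
    print(args.sequencing_type, args.max_contamination, args.sample_file_name)
```

An option declared this way keeps consuming values until the next option flag, which is why the sketch passes the positional `sample_file_name` first.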
If a parameter accepts one or more arguments, use a space to separate them. For example, to include both exome and genome samples, use --sequencing_type Exome Genome.
This tool is only intended to filter based on the following criteria:
- sex typing mismatch (OPTIONAL): default is to not include these
- sequencing type
- CCDS regions covered at least at 10x
- dbSNP overlap
- contamination
- ethnicity probability (if more than one ethnicity is specified, we accept any sample that has >= 80% probability of belonging to any one of them; N.B. this does not handle admixed samples)
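The criteria above can be read as a single pass/fail check per sample. The sketch below illustrates that reading with the documented default thresholds. The `Sample` class, its field names, and `passes_qc` are hypothetical, since the tool's database schema is not described here. Treating the dbSNP overlap as a minimum is also an assumption, and `--min_genotyping_rate` is left out because its interaction with ethnicity prediction is not spelled out.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence


@dataclass
class Sample:
    # Hypothetical record layout; the real tool reads these from a database.
    name: str
    sequencing_type: str
    sex_mismatch: bool          # self-declared sex differs from inferred sex
    bases_covered_10x: float    # fraction of CCDS bases covered at >= 10x
    dbsnp_overlap_snv: float    # fraction of SNVs overlapping dbSNP
    contamination: float
    ethnicity_probs: Dict[str, float] = field(default_factory=dict)


def passes_qc(
    sample: Sample,
    sequencing_types: Sequence[str] = ("Exome",),
    include_sex_typing_mismatches: bool = False,
    bases_covered_10x: float = 0.9,
    dbsnp_overlap_snv: float = 0.9,
    max_contamination: float = 0.03,
    ethnicities: Optional[Sequence[str]] = None,
    min_ethnicity_prob: float = 0.8,
) -> bool:
    """Return True if the sample meets every requested criterion."""
    if sample.sequencing_type not in sequencing_types:
        return False
    if sample.sex_mismatch and not include_sex_typing_mismatches:
        return False
    if sample.bases_covered_10x < bases_covered_10x:
        return False
    if sample.dbsnp_overlap_snv < dbsnp_overlap_snv:  # assumed to be a minimum
        return False
    if sample.contamination > max_contamination:
        return False
    if ethnicities:
        # Accept the sample if ANY permitted ethnicity reaches the threshold.
        # An admixed sample with no single probability >= 0.8 is rejected.
        if not any(
            sample.ethnicity_probs.get(e, 0.0) >= min_ethnicity_prob
            for e in ethnicities
        ):
            return False
    return True


if __name__ == "__main__":
    s = Sample("demo", "Exome", False, 0.95, 0.97, 0.01,
               {"Caucasian": 0.55, "Hispanic": 0.45})
    print(passes_qc(s))                                            # no ethnicity filter
    print(passes_qc(s, ethnicities=["Caucasian", "Hispanic"]))     # admixed sample
```

In this sketch the demo sample passes when no ethnicity filter is given but fails once ethnicities are specified, because neither probability reaches 0.8.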
The selection tool will display all samples that meet your specified QC criteria, grouped by broad phenotype. After you select CASE and CONTROL phenotypes, the tool will display the capture kits (if applicable) that apply to the selected samples. You can select any number of capture kits to include in your cohort, and the tool will proceed to generate a sample file in ATAV-ready format.
If you have provided --qc_file_name, the tool will write all available QC information for the final list of selected samples to the specified file. This can take a few moments.
If you have provided --log_file_name, the tool will keep a record of the usage details (including QC criteria, selected phenotypes, selected capture kits, etc.) in the specified file. Further runs will append to this file rather than overwriting.
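The append behaviour matters if you reuse one log file across several runs. The sketch below shows one way such a log could be written so that earlier records survive. The function name `append_run_record` and the one-JSON-object-per-line layout are assumptions made for illustration, and the tool's actual log format may differ.

```python
import json
from datetime import datetime


def append_run_record(log_file_name, record):
    """Append one run's details to the log, leaving earlier runs intact."""
    entry = {"timestamp": datetime.now().isoformat(timespec="seconds")}
    entry.update(record)
    # Mode "a" creates the file if it is missing and otherwise writes at the
    # end, so repeated runs accumulate instead of overwriting.
    with open(log_file_name, "a") as log:
        log.write(json.dumps(entry) + "\n")


if __name__ == "__main__":
    append_run_record("cohort.log", {
        "max_contamination": 0.03,
        "case_phenotypes": ["example_case_phenotype"],      # placeholder values
        "control_phenotypes": ["example_control_phenotype"],
    })
```

Because each run adds to the same file, start a new log file name if you want the records of different cohorts kept apart.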