sample file: /nfs/goldstein/software/atav_home/data/sample/swgr_GS00_case.txt
using server sva12
--region 1,2,3...X (no Y or MT)
include wellderly qc missing
It is supported in all ATAV functions; please use --min-coverage instead of --wellderly-min-coverage in the --coverage-comparison or --coverage-summary functions.
Wellderly samples, also labeled SWGR, are a set of genomes produced by Complete Genomics sequencing and provided to Liz Cirulli by collaborators at Scripps. In total there are 534 genome samples in this data set with genotype and coverage data available. Because of its size, this data set is not stored with the other data in AnnoDB, but in a separate structure that exists only on AnnoDB Slave SVA12.
In AnnoDB the sample names range from 'swgr_GS000008112' to 'swgr_GS000029076'. The list of sample names is attached at the bottom.
Since the SWGR samples were produced by Complete Genomics sequencing rather than by Illumina HiSeqs, they have gone through a completely different alignment and genotyping pipeline than all other samples within AnnoDB. As such there are many idiosyncrasies with this data set that should be explained.
The original data consists of masterVarBeta files located at /nfs/goldsteindata/liz/SWGR/masterVarBeta/ and multi-sample VCF files produced by the collaborators, one per chromosome, located at /nfs/goldsteindata/liz/SWGR/. The masterVarBeta files are used to create Coverage Profile data, while variant call data is extracted from the VCFs. In addition to the 534 loaded genome samples, there are 40 samples with masterVarBeta files that do not have genotypes within the VCF files (GS000034172 through GS000035681). A list of the 40 associated masterVarBeta files is also attached to this post.
- Exact coverage was not available for these samples. The masterVarBeta text files provide only lists of regions labeled 'no-call'. Outside of these regions, a minimum of 10x coverage was assumed and used to generate standard AnnoDB Coverage Profile data.
- The variant call QC scores for these samples are limited to filtered_coverage, reads_ref, reads_alt, a Genotype Quality, and a pass/fail status
- Variant calls with a 'fail' status were annotated by the Complete Genomics pipeline as 'VQLOW'
- The variant call set includes genotypes reported with one allele missing, as in '1/.', '0/.' or './2'. In these cases the pass/fail status for non-homozygous ref genotypes is set to 'intermediate'.
- Indels from this data set were ignored while reading the VCF, and so are not loaded into AnnoDB
- The VCF includes multiple, phased SNVs at a single locus (e.g. TCA/AGT). These were reduced to separate single-base SNVs across multiple loci (e.g. T/A, C/G, A/T) in order to be correctly processed by the AnnoDB system.
- The DP and AD columns are missing for 0/. calls, so they have been set to 10 for consistency with how hom-refs are otherwise treated. The GQ is also missing for 0/. calls and has been set to 20.
- If you use --var-status pass,fail,intermediate, then 0/. calls are included as hom ref and 1/. calls are included as het. If you specify --var-status pass, then 0/. and 1/. calls are set to missing.
The masterVarBeta-to-Coverage Profile transformation is implemented in the Perl script
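As an illustration of the idea only (this is not the Perl script itself), the following Python sketch derives the intervals that would be treated as covered at the assumed 10x minimum, given a list of no-call regions. The function name, the use of 0-based half-open coordinates, and the chromosome length argument are assumptions made for the example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed 0-based, half-open [start, end)

ASSUMED_MIN_COVERAGE = 10  # coverage assumed outside no-call regions


def covered_intervals(chrom_length: int, no_call: List[Interval]) -> List[Interval]:
    """Return the intervals NOT inside any no-call region.

    These are the stretches that would be assigned ASSUMED_MIN_COVERAGE.
    Overlapping or unsorted no-call regions are tolerated.
    """
    covered: List[Interval] = []
    cursor = 0  # first position not yet accounted for
    for start, end in sorted(no_call):
        # Clamp the region to the chromosome bounds.
        start = max(0, min(start, chrom_length))
        end = max(0, min(end, chrom_length))
        if start > cursor:
            covered.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        covered.append((cursor, chrom_length))
    return covered


if __name__ == "__main__":
    for start, end in covered_intervals(100, [(10, 20), (15, 30), (90, 100)]):
        print(f"{start}\t{end}\t{ASSUMED_MIN_COVERAGE}x")
```

Sorting first and tracking a single cursor keeps the pass linear after the sort and handles overlapping no-call regions without a separate merge step.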
The multi-position SNV reduction is implemented in the Perl script
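The reduction itself can be sketched in a few lines of Python. This is a hedged illustration of the technique, not a port of the Perl script: the function name and tuple layout are hypothetical, and skipping positions where the two bases are identical is an assumption that the text above does not confirm.

```python
from typing import List, Tuple

Snv = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def split_phased_mnv(chrom: str, pos: int, ref: str, alt: str) -> List[Snv]:
    """Split a multi-base substitution into single-base SNVs.

    Example: TCA/AGT at position p becomes T/A at p, C/G at p+1, A/T at p+2.
    """
    if len(ref) != len(alt):
        # Length-changing alleles are indels, which are not handled here.
        raise ValueError("ref and alt must have the same length")
    snvs: List[Snv] = []
    for offset, (r, a) in enumerate(zip(ref, alt)):
        if r == a:
            continue  # assumption: identical bases carry no variant
        snvs.append((chrom, pos + offset, r, a))
    return snvs


if __name__ == "__main__":
    for snv in split_phased_mnv("21", 1000, "TCA", "AGT"):
        print(*snv, sep="\t")
```

Note that splitting in this way discards the phasing between the individual bases.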
The SWGR multi-sample VCF parsing is implemented in the Perl script
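To make the handling of half-missing genotypes concrete, here is a small Python sketch of how a call such as 0/. or 1/. could be interpreted under a given --var-status setting. It is an illustration under stated assumptions, not the actual parsing code: the function name is hypothetical, it assumes that the presence of 'intermediate' in the status list is what admits these calls (the text above only describes the "pass,fail,intermediate" and "pass" settings), and it assumes that a call like ./2 is treated the same way as 1/.

```python
from typing import Iterable


def interpret_partial_call(genotype: str, var_status: Iterable[str]) -> str:
    """Classify a half-missing genotype as 'hom_ref', 'het' or 'missing'."""
    alleles = genotype.split("/")
    if len(alleles) != 2 or alleles.count(".") != 1:
        raise ValueError(f"not a half-missing genotype: {genotype!r}")
    # Assumption: partial calls are only admitted when 'intermediate' is allowed.
    if "intermediate" not in set(var_status):
        return "missing"
    called = alleles[0] if alleles[1] == "." else alleles[1]
    # A single called reference allele is treated as hom ref;
    # any called non-reference allele is treated as het.
    return "hom_ref" if called == "0" else "het"


if __name__ == "__main__":
    permissive = ["pass", "fail", "intermediate"]
    for gt in ("0/.", "1/.", "./2"):
        print(gt, interpret_partial_call(gt, permissive),
              interpret_partial_call(gt, ["pass"]))
```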
Additionally, the called_snv output contains numerous cases where a line was truncated, with the rest of that line appearing on the next line. Josh put together a quick script to read through the data and clean it up:
/nfs/goldstein/goldsteinlab/jb371/ATAV_testing/issue1045/fixCalledSNVLines.py
It takes the called_snv output on standard input.
python /nfs/goldstein/goldsteinlab/jb371/ATAV_testing/issue1045/fixCalledSNVLines.py < /nfs/seqscratch10/ANNOTATION/GROUP_DATA/SWGR/chr21.simplified_snvs.ANNOTATED/called_snv_chr21 > called_snv_chr21_fixed
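For readers without access to that script, the general technique of rejoining split lines can be sketched as follows. This is not the logic of fixCalledSNVLines.py, which is not reproduced here. The sketch assumes the file is tab-delimited with a fixed number of fields per record; the `expected_fields` parameter is hypothetical and would need to be set to the real column count.

```python
import sys
from typing import Iterable, List


def rejoin_truncated_lines(lines: Iterable[str], expected_fields: int,
                           sep: str = "\t") -> List[str]:
    """Concatenate consecutive lines until each record has enough fields."""
    fixed: List[str] = []
    pending = ""
    for raw in lines:
        pending += raw.rstrip("\n")
        if pending.count(sep) + 1 >= expected_fields:
            fixed.append(pending)
            pending = ""
    if pending:
        # Trailing fragment that never reached the expected width.
        fixed.append(pending)
    return fixed


if __name__ == "__main__":
    # Usage (hypothetical column count of 3):
    #   python rejoin.py 3 < called_snv_chr21 > called_snv_chr21_fixed
    if len(sys.argv) > 1:
        for record in rejoin_truncated_lines(sys.stdin, int(sys.argv[1])):
            print(record)
```

One limitation of counting fields: a break that falls inside the last field cannot be detected this way, because the first fragment already has the expected number of fields.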