dbGaP uploads

The Bioinformatics team has been responsible for preparing BAM and VCF files for transfer to dbGaP. The IT team has managed the physical data transfer, but it must first be given a list of paths to the properly formatted data.

BAM file uploads

The BAM files produced by the alignment/genotyping pipeline need no modification prior to dbGaP transfer. The easiest way to generate the file paths is to query the AlignSeqFileLoc column of the SequenceDB.seqdbClone database table and append the following string to each path:

/combined/[sample_name]_final.bam
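
As a minimal sketch of the path construction described above, the Python snippet below appends the suffix to an AlignSeqFileLoc value. The function name bam_path and the example directory are hypothetical. The database query itself is left out because it depends on the local SequenceDB setup.

```python
import posixpath


def bam_path(align_seq_file_loc: str, sample_name: str) -> str:
    """Return the final BAM path for one sample.

    align_seq_file_loc is the value read from the AlignSeqFileLoc column;
    the suffix /combined/[sample_name]_final.bam is appended to it.
    """
    base = align_seq_file_loc.rstrip("/")  # tolerate a trailing slash
    return posixpath.join(base, "combined", f"{sample_name}_final.bam")


if __name__ == "__main__":
    # Hypothetical AlignSeqFileLoc value, for illustration only.
    print(bam_path("/path/to/AlignSeqFileLoc/", "otepi7775y1"))
```

Writing one such path per line to a text file would give the kind of path list the IT team needs.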

VCF file uploads

Before a VCF can be uploaded, its header section must be modified and any snpEff annotations must be removed. Two scripts exist to automate this process:

goldsteinlab/Bioinformatics/scripts/createReplaceVcfHeaderScript.pl and goldsteinlab/Bioinformatics/scripts/remove_vcf_snpeff_annotations.pl

createReplaceVcfHeaderScript.pl creates a bash script called replace_header.sh, which can be run locally without any parameters or submitted to an SGE cluster. remove_vcf_snpeff_annotations.pl is executed from within replace_header.sh.

Usage:
perl goldsteinlab/Bioinformatics/scripts/createReplaceVcfHeaderScript.pl -s [sample_name] -t [sequence_type] -d [scratch directory]

Example:
perl goldsteinlab/Bioinformatics/scripts/createReplaceVcfHeaderScript.pl -s otepi7775y1 -t genome -d /nfs/seqscratch10/ALIGNMENT/samples/otepi7775y1/
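
When several samples need processing, the command lines can be generated rather than typed by hand. The sketch below is one possible way to do that in Python. It only assembles the documented -s, -t, and -d arguments; the function name build_command and the samples list are hypothetical, and nothing is executed.

```python
import shlex

SCRIPT = "goldsteinlab/Bioinformatics/scripts/createReplaceVcfHeaderScript.pl"


def build_command(sample_name, sequence_type, scratch_dir):
    """Return the argument list for one createReplaceVcfHeaderScript.pl call."""
    return ["perl", SCRIPT, "-s", sample_name, "-t", sequence_type, "-d", scratch_dir]


if __name__ == "__main__":
    # Hypothetical batch: (sample_name, sequence_type, scratch directory).
    samples = [
        ("otepi7775y1", "genome", "/nfs/seqscratch10/ALIGNMENT/samples/otepi7775y1/"),
    ]
    for sample in samples:
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(build_command(*sample)))
```

The returned list can also be passed directly to subprocess.run if the commands should be launched from Python instead of printed.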

Output:
Running the example above creates a directory structure similar to that of the alignment/genotyping pipeline, rooted at the specified scratch directory. The created sub-directories are combined/, Logs/, and Scripts/. The dbGaP-compatible VCF file and the intermediate output files are written to combined/, standard output and standard error logs are written to Logs/, and the script that does the actual work, replace_header.sh, is created in Scripts/.

The final file produced is named [sample_name].analysisReady.vcf.gz (compressed with bgzip).
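
Before handing paths to the IT team, it can help to confirm that the expected outputs exist. The sketch below is a hypothetical check, not part of the existing scripts. It assumes only the layout described above (combined/, Logs/, Scripts/replace_header.sh, and the final VCF in combined/). The final VCF will be reported as missing until replace_header.sh has actually been run.

```python
import os


def expected_outputs(scratch_dir, sample_name):
    """Map a short label to each path the layout description says should exist."""
    combined = os.path.join(scratch_dir, "combined")
    return {
        "combined_dir": combined,
        "logs_dir": os.path.join(scratch_dir, "Logs"),
        "work_script": os.path.join(scratch_dir, "Scripts", "replace_header.sh"),
        "final_vcf": os.path.join(combined, f"{sample_name}.analysisReady.vcf.gz"),
    }


def missing_outputs(scratch_dir, sample_name):
    """Return the labels of expected paths that are not present on disk."""
    paths = expected_outputs(scratch_dir, sample_name)
    return [label for label, path in paths.items() if not os.path.exists(path)]


if __name__ == "__main__":
    # Scratch directory and sample name taken from the example above.
    scratch = "/nfs/seqscratch10/ALIGNMENT/samples/otepi7775y1/"
    print(missing_outputs(scratch, "otepi7775y1"))
```

An empty list means every expected path was found; the check does not validate the contents of the VCF.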

Any SnpEff annotations present in the original VCF are removed from the final VCF. This is done by the utility script goldsteinlab/Bioinformatics/scripts/remove_vcf_snpeff_annotations.pl, which reads a VCF from standard input and writes the VCF, with annotations removed, to standard output.
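
To illustrate the standard-input to standard-output filter pattern, here is a minimal Python sketch of annotation stripping. It is not the logic of remove_vcf_snpeff_annotations.pl, which is not shown here. It assumes that the annotations live in the INFO keys EFF, ANN, LOF, and NMD and in header lines starting with ##SnpEff or declaring those keys; adjust both lists to match the actual VCFs.

```python
import sys

# Assumed snpEff INFO keys; not taken from remove_vcf_snpeff_annotations.pl.
SNPEFF_INFO_KEYS = {"EFF", "ANN", "LOF", "NMD"}
# Assumed header prefixes: snpEff metadata lines and the INFO declarations.
SNPEFF_HEADER_PREFIXES = ("##SnpEff",) + tuple(
    f"##INFO=<ID={key}," for key in sorted(SNPEFF_INFO_KEYS)
)


def strip_snpeff(line):
    """Return the line without snpEff annotations, or None to drop the line."""
    if line.startswith("##"):
        return None if line.startswith(SNPEFF_HEADER_PREFIXES) else line
    if line.startswith("#"):
        return line  # the #CHROM column header line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return line  # not a full VCF record; pass through unchanged
    kept = [
        entry for entry in fields[7].split(";")
        if entry.split("=", 1)[0] not in SNPEFF_INFO_KEYS
    ]
    fields[7] = ";".join(kept) or "."  # VCF uses "." for an empty INFO column
    return "\t".join(fields) + newline


def main():
    for line in sys.stdin:
        out = strip_snpeff(line)
        if out is not None:
            sys.stdout.write(out)


if __name__ == "__main__":
    main()
```

Saved under a hypothetical name such as strip_snpeff.py, the filter could sit in a pipe between decompression and recompression, for example: bgzip -dc in.vcf.gz | python strip_snpeff.py | bgzip -c > out.vcf.gz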