SAGA Sample Ingest¶
The SAGA Sample Ingest pipeline is responsible for copying sample data into HDFS and staging the data into the SAGA data structure expected by the Spark processes which run the analysis. The pipeline consists of three steps, which will be described in further detail below.
- Transfer preparation
- Novel Variant/Annotation Ingest
- Sample Coverage & Genotype Ingest
Each step is implemented in its own Perl script; all of them can be found at
The steps must be executed sequentially: Step 3 cannot be carried out for a given set of samples before Step 2 has completed, and Step 1 cannot run while Step 3 is in progress. Flag files are written to
/nfs/seqscratch10/ANNOTATION/HDFS_TRANSPORT/IN_PROGRESS/ to enforce this ordering.
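The flag-file convention described above can be sketched as follows. This is an illustrative sketch only: the helper names (`step_start`, `step_finish`) and the flag filenames are assumptions, not what the real Perl scripts use; only the `IN_PROGRESS` directory path comes from the document.

```shell
#!/bin/sh
# Hypothetical sketch of the flag-file ordering: a step refuses to start
# while any conflicting step's flag exists, then writes its own flag.
# FLAG_DIR comes from the doc; the helper and flag names are made up.
FLAG_DIR="${FLAG_DIR:-/nfs/seqscratch10/ANNOTATION/HDFS_TRANSPORT/IN_PROGRESS}"

step_start() {              # step_start <my_flag> <conflicting_flag>...
    my_flag="$1"; shift
    for f in "$@"; do
        if [ -e "$FLAG_DIR/$f" ]; then
            echo "refusing to start: $f is in progress" >&2
            return 1
        fi
    done
    touch "$FLAG_DIR/$my_flag"
}

step_finish() { rm -f "$FLAG_DIR/$1"; }
```

Under this sketch, Step 1 would call something like `step_start step1.flag step3.flag`, so it cannot begin while Step 3's flag is present.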
As the AnnoDB Master script runs, it copies data for each sample into a special directory structure on seqscratch10, rooted at
Within this root directory, there are 5 key sub-directories:
Within the first four directories (NOVEL_VARIANTS to CVG_PROFILE) there is one sub-directory per date on which the AnnoDB Master Pipeline script was executed. Within each date sub-directory are text files for each sample loaded on that day. These filenames follow this convention:
[prep_id]_[sample_type].[data_type].txt
Here are a few examples:
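As a sketch of the naming convention, the following composes and decomposes one such filename with shell parameter expansion. The `prep_id` and `sample_type` values here are invented for illustration, not real samples.

```shell
# Illustrative only: build and parse a staging filename following the
# [prep_id]_[sample_type].[data_type].txt convention. Values are made up.
prep_id=12345
sample_type=exome
data_type=novel_variants

fname="${prep_id}_${sample_type}.${data_type}.txt"
echo "$fname"                    # 12345_exome.novel_variants.txt

# Parse it back apart:
base="${fname%.txt}"             # strip the .txt suffix
parsed_data_type="${base#*.}"    # everything after the first dot
stem="${base%%.*}"               # everything before the first dot
parsed_prep_id="${stem%%_*}"     # before the first underscore
parsed_sample_type="${stem#*_}"  # after the first underscore
```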
The HDFS_TRANSPORT/IN_PROGRESS directory is used exclusively as a temporary location for data files while the ingest process is running. Individual files are moved from their original directories into this
IN_PROGRESS directory near the beginning of the ingest process, after they have been successfully copied into HDFS. After the ingest process is completed, these files are deleted from the IN_PROGRESS directory.
Corresponding AnnoDB Pipeline Steps¶
A series of AnnoDB Pipeline Steps have been defined for the SAGA ingest process; they are updated by the Perl scripts throughout the procedure. These pipeline steps can be used to track precisely which samples are at which stage of the SAGA ingest pipeline, and can be useful for debugging. The pipeline steps are:
- HDFS_AssignGroupID: sample has been assigned an hdfs_group_id in the
sample AnnoDB table
- HDFS_NovelVariantsLoaded: all novel variants from this sample have been loaded to the
saga.variants table in HDFS
- HDFS_NovelVariantAnnotationsLoaded: all novel variant annotations from this sample have been loaded to the
saga.variant_annotations table in HDFS
- HDFS_HomoRefNovelVariants: Homozygous-Reference genotypes have been added to the
saga.genotype_groups table in HDFS for old samples
- HDFS_CvgProfile: The coverage profile data for this sample has been added to the
saga.sample_read_coverage_groups table in HDFS
- HDFS_SampleLoad: A record for this sample has been added to the
saga.clean_sample table in HDFS
- HDFS_SNV_genotype_join: SNV Genotype records for this sample have been added to the
saga.genotype_groups table in HDFS
- HDFS_INDEL_genotype_join: Indel Genotype records for this sample have been added to the
saga.genotype_groups table in HDFS
- HDFS_HOMREF_genotype_join: All homozygous-reference genotype records for this sample have been added to the
saga.genotype_groups table in HDFS
- HDFS_GenotypeComplete: All genotype data for this sample has been loaded; sample is fully available for SAGA analysis
- HDFS_LoadCleanup: Scratch/temporary data for this sample has been expunged from HDFS as well as from the IN_PROGRESS directory
Step 1: Transfer Preparation¶
This step (implemented as
goldsteinlab/software/Annotation_Pipeline/perl/SAGA/saga_ingest_prepare_hdfs_transfer.pl ) requires a single command-line argument, which specifies the date whose samples should be loaded:
perl goldsteinlab/software/Annotation_Pipeline/perl/SAGA/saga_ingest_prepare_hdfs_transfer.pl -d 2014-11-01 > transport_log 2>&1
This step has 5 main objectives, which are applied individually to each sample found in the source-data directory indicated by the date parameter:
- Verify the sample can be found in AnnoDB & retrieve data from the sample table
- Assign an HDFS Group ID to the sample (used to partition the data)
- Create the sample's clean_sample record, write to
- Transfer clean_sample, novel_variants, novel_annotations, cvg_profile, and called_variation data into HDFS using
hadoop fs -put
- Move the above data files into the IN_PROGRESS directory
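Objectives 4 and 5 above might look like the following dry-run sketch, which prints the commands rather than executing them. The source/target path layout and the uppercase local directory names are assumptions extrapolated from the conventions described in this document, and the sample name is invented.

```shell
#!/bin/sh
# Dry-run sketch (not the real script's code): print the hadoop fs -put
# and move commands for one sample. Directory layout is assumed from the
# conventions above; the sample name is made up.
SRC=/nfs/seqscratch10/ANNOTATION/HDFS_TRANSPORT
HDFS_ROOT=/annodb/incoming_samples

stage_sample() {            # stage_sample <date_dir> <prep_id>_<sample_type>
    date_dir="$1"; sample="$2"
    for data_type in novel_variants novel_annotations cvg_profile called_variation clean_sample; do
        # Assumption: local staging dirs are the uppercased data-type names.
        src_dir=$(printf '%s' "$data_type" | tr 'a-z' 'A-Z')
        local_file="$SRC/$src_dir/$date_dir/$sample.$data_type.txt"
        echo "hadoop fs -put $local_file $HDFS_ROOT/$data_type/"
        echo "mv $local_file $SRC/IN_PROGRESS/"
    done
}

stage_sample 2014-11-01 12345_exome
```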
The HDFS directory tree that the data is moved into is rooted at
/annodb/incoming_samples/ , and follows a structure similar to that of the
HDFS_TRANSPORT directory, with 1 directory for each data type:
This step can be carried out for multiple dates before moving on to Step #2. Steps #2 & #3 operate more efficiently, in terms of the number of samples processed per unit of time, when there are many samples to load. Ideally, Steps #2 & #3 should be run on batches of more than 50 exomes or 75 custom_captures.
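A batching check along the lines suggested above could be sketched as follows. The helper names are hypothetical, and for testability this lists a local directory; against the real cluster the `ls` would be something like `hadoop fs -ls /annodb/incoming_samples/novel_variants/`.

```shell
#!/bin/sh
# Sketch: count staged samples per sample_type before kicking off Step 2,
# using the >50 exome / >75 custom_capture guideline above. Helper names
# are made up; the filename convention is [prep_id]_[sample_type].*.txt.
count_staged() {            # count_staged <dir> <sample_type>
    ls "$1" 2>/dev/null | grep -c "_$2\." || true
}

enough_to_load() {          # enough_to_load <dir>
    exomes=$(count_staged "$1" exome)
    captures=$(count_staged "$1" custom_capture)
    [ "$exomes" -gt 50 ] || [ "$captures" -gt 75 ]
}
```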
Step 2: Novel Variant/Annotation Ingest¶
The objective of this step in the SAGA ingest pipeline is to incorporate novel variants and annotations into the SAGA data structure, including adding homozygous-reference genotypes for the novel variants to each currently-existing sample in the saga.genotype_groups table. As mentioned above, this step is optimally run only after more than 50 samples have been processed with Step 1: Transfer Preparation.
This step is implemented in the script
This step requires no command-line arguments, but rather will simply process all samples with data located in the HDFS directory
/annodb/incoming_samples/novel_variants/. The following sanity checks are performed to help reduce the chance that a misplaced sample gets loaded twice or is skipped:
- Ensure that samples with entries in the novel_variants HDFS directory also have entries in the novel_annotations HDFS directory, and that the number of samples in each is the same
- Confirm that the pipeline status is 'submitted' or 'running' for each sample for AnnoDB Pipeline Step IDs 21 & 22 ( HDFS_NovelVariantsLoaded & HDFS_NovelVariantAnnotationsLoaded )
- Confirm that a data file was found for each sample with 'submitted' or 'running' status for these steps.
- Ensure there are no duplicate variants across all samples (using chr_pos_ref_alt as a key), both within the novel set and in comparison to the full set in saga.variants
If any of the above sanity checks fail, the script will die immediately, reporting the failure observed.
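The duplicate-variant check could be sketched with standard text tools, assuming (and this is an assumption about the file format) that each staged novel-variant file carries one chr_pos_ref_alt key per line:

```shell
#!/bin/sh
# Sketch of the duplicate-variant sanity check: surface any
# chr_pos_ref_alt key that appears more than once across the given files.
# Assumes one key per line; the helper name is hypothetical.
check_no_dup_variants() {   # check_no_dup_variants <file>...
    dups=$(cat "$@" | sort | uniq -d)
    if [ -n "$dups" ]; then
        echo "duplicate chr_pos_ref_alt keys found:" >&2
        echo "$dups" >&2
        return 1
    fi
}
```

Checking against the full set in saga.variants would additionally require pulling (or joining against) the existing keys from HDFS, which this local sketch does not attempt.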
The remaining objectives that are performed within this stage include:
- Build external HIVE tables surrounding the data in
- Build a temporary, partitioned, compressed 'novel_variant_noncarrier_genotype_groups' table by joining saga.sample_read_coverage_groups with the table built around
- Insert data from 'novel_variant_noncarrier_genotype_groups' into the saga.genotype_groups table, then delete the temporary table
- Insert data from /annodb/incoming_samples/novel_variants/ into saga.variants, and from /annodb/incoming_samples/novel_variant_annotations/ into saga.variant_annotations
- Drop the external HIVE tables built in step 1, and remove the data from
These 5 steps complete AnnoDB Pipeline Steps 21 through 23 as listed above.
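The shape of the HiveQL that the five objectives above imply might look like the following dry-run sketch, which prints (rather than executes) the plan. The statement bodies, column lists, and the staging table names are assumptions; only the table names and locations mentioned in this document are taken from it.

```shell
#!/bin/sh
# Dry-run sketch: print an outline of the HiveQL implied by the five
# objectives. Statement details and staging table names are made up;
# saga.* table names and HDFS locations come from the doc.
print_hive_plan() {
    cat <<'EOF'
-- 1. external tables over the staged files
CREATE EXTERNAL TABLE staged_novel_variants (...)
    LOCATION '/annodb/incoming_samples/novel_variants/';
-- 2. temporary partitioned, compressed noncarrier table (join sketch)
CREATE TABLE novel_variant_noncarrier_genotype_groups AS
    SELECT ... FROM saga.sample_read_coverage_groups c
    JOIN staged_novel_variants v ON ...;
-- 3. fold noncarrier genotypes into saga.genotype_groups, drop the temp table
INSERT INTO TABLE saga.genotype_groups SELECT ... ;
DROP TABLE novel_variant_noncarrier_genotype_groups;
-- 4. load the new variants and annotations
INSERT INTO TABLE saga.variants SELECT ... FROM staged_novel_variants;
-- 5. drop the external tables and clear the staging data
DROP TABLE staged_novel_variants;
EOF
}

print_hive_plan
```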
Step 3: Sample Coverage & Genotype Ingest¶
This stage incorporates the coverage and called variation data into the SAGA data structure. The coverage data is simply copied, whereas the genotype data requires some transformation described in detail below.
This step should typically be performed immediately after Step 2, but this is not required. Multiple rounds of Step 2 can in theory be performed before running Step 3.
This step is implemented in the perl script