SAGA Sample Ingest

The SAGA Sample Ingest pipeline is responsible for copying sample data into HDFS and staging the data into the SAGA data structure expected by the Spark processes which run the analysis. The pipeline consists of three steps, which will be described in further detail below.

  1. Transfer preparation
  2. Novel Variant/Annotation Ingest
  3. Sample Coverage & Genotype Ingest

Each step is implemented in its own Perl script; all of the scripts can be found in goldsteinlab/software/Annotation_Pipeline/perl/SAGA/.

The steps must be executed one at a time, in order. Step 3 cannot be carried out for a given set of samples before Step 2 has completed, and Step 1 cannot run while Step 3 is in progress. Flag files are written to /nfs/seqscratch10/ANNOTATION/HDFS_TRANSPORT/IN_PROGRESS/ to enforce this ordering.
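The flag-file mechanism can be pictured with a minimal Python sketch. It is not the Perl implementation: the flag-file names below are hypothetical, and the sketch only covers mutual exclusion (one step at a time), not the rule that Step 3 must wait for Step 2 to complete for a given sample set.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical flag-file names; the names used by the real Perl scripts
# are not documented on this page.
FLAG_NAMES = {1: "STEP1.flag", 2: "STEP2.flag", 3: "STEP3.flag"}


def acquire_flag(flag_dir: Path, step: int) -> Path:
    """Create the flag file for `step`, refusing to start if any step is flagged."""
    active = [name for name in FLAG_NAMES.values() if (flag_dir / name).exists()]
    if active:
        raise RuntimeError(f"cannot start step {step}; already in progress: {active}")
    flag = flag_dir / FLAG_NAMES[step]
    # O_EXCL makes creation fail if another process created the same flag first.
    fd = os.open(flag, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)
    return flag


def release_flag(flag: Path) -> None:
    """Remove the flag file once the step has finished."""
    flag.unlink()
```

A caller would wrap the step body in acquire_flag / release_flag (ideally in a try/finally) so that a crashed run does not silently leave the directory locked without anyone noticing.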

Source Data

As the AnnoDB Master script runs, it copies data for each sample into a special directory structure on seqscratch10, rooted at /nfs/seqscratch10/ANNOTATION/HDFS_TRANSPORT/.

Within this root directory, there are 5 key sub-directories:


Within the first four directories (NOVEL_VARIANTS to CVG_PROFILE) there is one sub-directory per date on which the AnnoDB Master Pipeline script was executed. Each date sub-directory contains text files for each sample loaded on that day. The filenames follow this convention:

[ prep_id ]_[ sample_type ].[ data_type ].txt
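
As an illustration of this convention, the following Python sketch splits a filename into its three fields. It assumes that prep_id contains no underscore and that neither sample_type nor data_type contains a period; the filenames in the usage note are made up for the example and are not real samples.

```python
from typing import NamedTuple


class SampleFile(NamedTuple):
    prep_id: str
    sample_type: str
    data_type: str


def parse_sample_filename(filename: str) -> SampleFile:
    """Parse '<prep_id>_<sample_type>.<data_type>.txt' into its fields."""
    parts = filename.split(".")
    if len(parts) != 3 or parts[2] != "txt":
        raise ValueError(f"unexpected filename layout: {filename!r}")
    stem, data_type, _ext = parts
    # Split on the FIRST underscore so a sample_type such as
    # 'custom_capture' keeps its own underscore.
    prep_id, sep, sample_type = stem.partition("_")
    if not (prep_id and sep and sample_type and data_type):
        raise ValueError(f"missing field in filename: {filename!r}")
    return SampleFile(prep_id, sample_type, data_type)
```

For a hypothetical file named 12345_custom_capture.cvg_profile.txt this returns prep_id "12345", sample_type "custom_capture", and data_type "cvg_profile".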

Here are a few examples:


The HDFS_TRANSPORT/IN_PROGRESS directory is used exclusively as a temporary location for data files as the ingest process is running. Individual files are moved from their original directories into this IN_PROGRESS directory near the beginning of the ingest process, after they have been successfully copied into the HDFS system. After the ingest process is completed, these files are deleted from the IN_PROGRESS directory.

Corresponding AnnoDB Pipeline Steps

A series of AnnoDB Pipeline steps has been defined for the SAGA ingest process; the Perl scripts update them throughout the procedure. These pipeline steps can be used to track precisely which samples are at which stage of the SAGA ingest pipeline, which is also useful for debugging. The pipeline steps are:

Pipeline Step ID = 20 - HDFS_AssignGroupID: sample has been assigned a hdfs_group_id in the sample AnnoDB table
Pipeline Step ID = 21 - HDFS_NovelVariantsLoaded: all novel variants from this sample have been loaded to the saga.variants table in HDFS
Pipeline Step ID = 22 - HDFS_NovelVariantAnnotationsLoaded: all novel variant annotations from this sample have been loaded to the saga.variant_annotations table in HDFS
Pipeline Step ID = 23 - HDFS_HomoRefNovelVariants: Homozygous-reference genotypes have been added to the saga.genotype_groups table in HDFS for old samples
Pipeline Step ID = 24 - HDFS_CvgProfile: The coverage profile data for this sample has been added to the saga.sample_read_coverage_groups table in HDFS
Pipeline Step ID = 25 - HDFS_SampleLoad: A record for this sample has been added to the saga.clean_sample table in HDFS
Pipeline Step ID = 26 - HDFS_SNV_genotype_join: SNV genotype records for this sample have been added to the saga.genotype_groups table in HDFS
Pipeline Step ID = 27 - HDFS_INDEL_genotype_join: Indel genotype records for this sample have been added to the saga.genotype_groups table in HDFS
Pipeline Step ID = 28 - HDFS_HOMREF_genotype_join: All homozygous-reference genotype records for this sample have been added to the saga.genotype_groups table in HDFS
Pipeline Step ID = 30 - HDFS_GenotypeComplete: All genotype data for this sample has been loaded; sample is fully available for SAGA analysis
Pipeline Step ID = 31 - HDFS_LoadCleanup: Scratch/temporary data for this sample has been expunged from HDFS as well as the IN_PROGRESS directory
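
For debugging, it can help to hold the step list as a lookup table. The Python sketch below uses the IDs and names listed above; how the completed step IDs for a sample are fetched from AnnoDB is outside its scope, and the pending_steps helper is a hypothetical name introduced for illustration.

```python
from typing import Iterable, List, Tuple

# Step IDs and names as listed above (there is no step 29 in the list).
PIPELINE_STEPS = {
    20: "HDFS_AssignGroupID",
    21: "HDFS_NovelVariantsLoaded",
    22: "HDFS_NovelVariantAnnotationsLoaded",
    23: "HDFS_HomoRefNovelVariants",
    24: "HDFS_CvgProfile",
    25: "HDFS_SampleLoad",
    26: "HDFS_SNV_genotype_join",
    27: "HDFS_INDEL_genotype_join",
    28: "HDFS_HOMREF_genotype_join",
    30: "HDFS_GenotypeComplete",
    31: "HDFS_LoadCleanup",
}


def pending_steps(completed_ids: Iterable[int]) -> List[Tuple[int, str]]:
    """Return (id, name) pairs for SAGA ingest steps not yet completed, in ID order."""
    done = set(completed_ids)
    unknown = done.difference(PIPELINE_STEPS)
    if unknown:
        raise ValueError(f"not SAGA ingest step IDs: {sorted(unknown)}")
    return [(sid, name) for sid, name in sorted(PIPELINE_STEPS.items()) if sid not in done]
```

A sample whose completed steps are 20 through 23 would report HDFS_CvgProfile (24) as its first pending step.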

Step 1: Transfer Preparation

This step (implemented as goldsteinlab/software/Annotation_Pipeline/perl/SAGA/ ) requires a single command-line argument specifying the date whose samples should be loaded: -d [YYYY]-[MM]-[DD].

For example:
perl goldsteinlab/software/Annotation_Pipeline/perl/SAGA/ -d 2014-11-01 > transport_log 2>&1

This step has 5 main objectives, which are individually applied on each sample found in the source-data directory indicated by the date parameter:

  1. Verify the sample can be found in AnnoDB & retrieve data from the sample table
  2. Assign an HDFS Group ID to the sample (used to partition the data)
  3. Create the sample's clean_sample record and write it to IN_PROGRESS
  4. Transfer clean_sample, novel_variants, novel_annotations, cvg_profile, and called_variation data into HDFS using hadoop fs -put
  5. Move the above data files into the IN_PROGRESS directory

The HDFS directories that the data is moved to are rooted at /annodb/incoming_samples/ and follow a structure similar to that of the HDFS_TRANSPORT directory, with one directory for each data type: sample_data, novel_variants, novel_annotations, cvg_profile, called_snvs, and called_indels.
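
The transfer itself can be sketched in Python as a helper that builds the hadoop fs -put command for one file. This is only an illustration, not the Perl script's logic: it assumes files are placed directly under the data-type directory, which this page does not confirm, and build_put_command is a hypothetical name.

```python
import posixpath
from typing import List

HDFS_ROOT = "/annodb/incoming_samples"
# Data-type directories named in the text above.
HDFS_DATA_DIRS = (
    "sample_data",
    "novel_variants",
    "novel_annotations",
    "cvg_profile",
    "called_snvs",
    "called_indels",
)


def build_put_command(local_path: str, hdfs_data_dir: str) -> List[str]:
    """Build (but do not run) a 'hadoop fs -put' command for one local file."""
    if hdfs_data_dir not in HDFS_DATA_DIRS:
        raise ValueError(f"unknown HDFS data directory: {hdfs_data_dir!r}")
    # Assumption: files go directly under the data-type directory.
    destination = posixpath.join(HDFS_ROOT, hdfs_data_dir) + "/"
    return ["hadoop", "fs", "-put", local_path, destination]
```

The returned list can be passed to subprocess.run(cmd, check=True); moving the local file into IN_PROGRESS only after that call succeeds mirrors the ordering of objectives 4 and 5.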

This step can be carried out for multiple dates before moving on to Step 2. Steps 2 and 3 process more samples per unit of time when there are many samples to load. Ideally, each run of Steps 2 and 3 should load more than 50 exomes or 75 custom_captures.

Step 2: Novel Variant/Annotation Ingest

The objective of this step in the SAGA ingest pipeline is to incorporate novel variants and annotations into the SAGA data structure. This includes adding homozygous-reference genotypes for the novel variants to the saga.genotype_groups table for each sample that already exists there. As mentioned above, this step is best run only after more than 50 samples have been processed with Step 1: Transfer Preparation.

This step is implemented in the script goldsteinlab/software/Annotation_Pipeline/perl/SAGA/

This step requires no command-line arguments; it processes all samples with data located in the HDFS directory /annodb/incoming_samples/novel_variants/. The following sanity checks are performed to reduce the chance that a misplaced sample gets loaded twice or is skipped:

  • Ensure the samples with entries in the novel_variants HDFS directory also have entries in the novel_annotations HDFS directory, and that the number of samples in each is the same
  • Confirm that the pipeline status is 'submitted' or 'running' for each sample for AnnoDB Pipeline Step IDs 21 & 22 (HDFS_NovelVariantsLoaded & HDFS_NovelVariantAnnotationsLoaded)
  • Confirm that a data file was found for each sample with 'submitted' or 'running' status for these steps
  • Ensure there are no duplicate variants across all samples (using chr_pos_ref_alt as a key), both within the novel set and in comparison to the full set in saga.variants

If any of the above sanity checks fail, the script will die immediately, reporting the failure observed.
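
The four checks above can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not the Perl script's code: the function names and input shapes are hypothetical, and a comparison against the full saga.variants table would in practice be done inside Hive rather than with a Python set.

```python
from collections import Counter
from typing import Dict, Iterable, Set

ALLOWED_STATUSES = {"submitted", "running"}
CHECKED_STEP_IDS = (21, 22)


def check_sample_sets(variant_samples: Set[str], annotation_samples: Set[str]) -> None:
    """Both HDFS directories must hold exactly the same samples."""
    mismatch = variant_samples ^ annotation_samples
    if mismatch:
        raise RuntimeError(f"samples present in only one directory: {sorted(mismatch)}")


def check_statuses(samples: Iterable[str], status: Dict[str, Dict[int, str]]) -> None:
    """Steps 21 and 22 must be 'submitted' or 'running' for every sample."""
    for sample in sorted(samples):
        for step_id in CHECKED_STEP_IDS:
            value = status.get(sample, {}).get(step_id)
            if value not in ALLOWED_STATUSES:
                raise RuntimeError(f"sample {sample}: step {step_id} has status {value!r}")


def check_files_present(pending_samples: Iterable[str], samples_with_files: Set[str]) -> None:
    """Every sample pending steps 21/22 must have a data file."""
    missing = sorted(set(pending_samples) - samples_with_files)
    if missing:
        raise RuntimeError(f"no data file found for: {missing}")


def check_no_duplicate_variants(novel_keys: Iterable[str], existing_keys: Set[str]) -> None:
    """No chr_pos_ref_alt key may repeat within the novel set or already exist."""
    counts = Counter(novel_keys)
    repeated = sorted(key for key, n in counts.items() if n > 1)
    known = sorted(key for key in counts if key in existing_keys)
    if repeated or known:
        raise RuntimeError(f"duplicate variants: within novel set {repeated}, already loaded {known}")
```

Each function raises on the first failure it finds, matching the fail-fast behaviour described above.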

The remaining objectives that are performed within this stage include:

  1. Build external Hive tables around the data in /annodb/incoming_samples/novel_variants/ and /annodb/incoming_samples/novel_annotations/
  2. Build a temporary, partitioned, compressed 'novel_variant_noncarrier_genotype_groups' table by joining saga.sample_read_coverage_groups with the table built around /annodb/incoming_samples/novel_variants/
  3. Insert data from 'novel_variant_noncarrier_genotype_groups' into the saga.genotype_groups table, then delete the temporary table
  4. Insert data from /annodb/incoming_samples/novel_variants/ into saga.variants, and from /annodb/incoming_samples/novel_annotations/ into saga.variant_annotations
  5. Drop the external Hive tables built in step 1, and remove the data from /annodb/incoming_samples/novel_variants/ and /annodb/incoming_samples/novel_annotations/

These 5 steps complete AnnoDB Pipeline Step IDs 21 through 23 as listed above.

Step 3: Sample Coverage & Genotype Ingest

This stage incorporates the coverage and called variation data into the SAGA data structure. The coverage data is simply copied, whereas the genotype data requires some transformation described in detail below.

This step should typically be performed immediately after Step 2, but this is not required. Multiple rounds of Step 2 can in theory be performed before running Step 3.

This step is implemented in the perl script goldsteinlab/software/Annotation_Pipeline/perl/SAGA/