
FlashPCA

Uses Flashpca 2 (https://github.com/gabraham/flashpca) to obtain principal components and plink nearest neighbors to perform outlier detection. Flashpca 2 can accommodate a much larger number of samples than eigenstrat (smartpca), which handles around 20k.

Procedure:

  1. Run plink ped -> bed from atav ped file with --mind 0.99
    This excludes samples with more than 99% of their ped file genotype information missing (i.e. 0's in more than 99% of ped file variant positions)
  2. Run flashpca; verify root mean sq error from flashpca
  3. Plot eigenvalues, percent variance explained
  4. Plot eigenvectors for first 3 dimensions (1 vs 2, 2 vs 3, 1 vs 3)
  5. Perform outlier detection (where applicable)
    1. use plink --neighbour 1 <numNeighbor>
    2. Filter plink's output.nearest file based on z-score-thresh for outliers as described above
    3. Color outliers from previous eigenvector plots
    4. Repeat main steps 1-4 with the outliers removed
  6. Generate new sample file by removing samples removed by plink's --mind 0.99 and outliers
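
The outlier filter in step 5 (sub-step 2) could be sketched in Python roughly as follows. This is a minimal sketch and not the pipeline's actual code. It assumes the .nearest file is whitespace-delimited with a header row that includes FID, IID and Z columns, and that a sample counts as an outlier if any of its neighbour rows has a Z score below the threshold. The function name find_outliers and the example threshold are hypothetical.

```python
def find_outliers(nearest_lines, z_thresh):
    """Return (FID, IID) pairs with a nearest-neighbour Z score below z_thresh.

    nearest_lines: iterable of text lines from a plink .nearest file
                   (assumed: header row naming at least FID, IID and Z).
    z_thresh:      hypothetical z-score-thresh value, e.g. -3.0.
    """
    col = None          # maps column name -> index, filled from the header
    seen = set()        # avoid reporting a sample once per neighbour row
    outliers = []
    for line in nearest_lines:
        fields = line.split()
        if not fields:
            continue    # skip blank lines
        if col is None:
            col = {name: i for i, name in enumerate(fields)}
            continue
        sample = (fields[col["FID"]], fields[col["IID"]])
        z = float(fields[col["Z"]])
        # With --neighbour 1 N there are N rows per sample; flag the
        # sample if any of them falls below the threshold (assumption).
        if z < z_thresh and sample not in seen:
            seen.add(sample)
            outliers.append(sample)
    return outliers


# Example usage with a file on disk (file name taken from the Output list):
# with open("plink_outlier.nearest") as handle:
#     for fid, iid in find_outliers(handle, z_thresh=-3.0):
#         print(fid, iid)
```

Reading the column positions from the header rather than hard-coding them keeps the sketch tolerant of plink versions that add or drop trailing columns.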

Output:

  1. flashpca runs once, or twice if outlier removal takes place; the flashpca logs are concatenated to *flashpca.log
  2. plink's bed, bim, fam and log files from method step 1.
  3. eigenvectors, principal components, percent variance explained and eigenvalues files from method step 2
  4. Plot pdfs: eigenvalues_flashpca.pdf, pve_flashpca.pdf from method step 3
  5. Plot pdf: eigenvectors_flashpca.pdf from method step 4
    1. plink_outlier.nearest, plink_outlier.log from method step 5a
    2. outlier_file.txt from method step 5b
    3. color outliers plot pdf: plot_eigenvectors_flashpca_color_outliers.pdf from method step 5c
    4. plink_outlier_removed.bed, bim, fam, log, eigenvalues_flashpca_outliers_removed, eigenvectors_flashpca_outliers_removed, pcs_flashpca_outliers_removed, pve_flashpca_outliers_removed, eigenvalues_flashpca_outliers_removed.pdf, pve_flashpca_outliers_removed.pdf, plot_eigenvectors_flashpca_outliers_removed.pdf from method step 5d
  6. pruned_sample_file.txt from method step 6
    The names of the outliers, the flashpca / plink commands, and counts of the number of cases and controls at each step are also printed to the stdout file.
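
The sample pruning and the case/control counts could be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the pipeline's implementation: the sample name is assumed to sit in the second whitespace-delimited column of the sample file (the name_col parameter is hypothetical), the outliers are assumed to be available as a plain list of sample names, and the function names are invented for this example. The .fam layout used here (IID in column 2, phenotype in column 6 with 1 = control and 2 = case) is the standard plink one.

```python
def prune_samples(sample_lines, fam_lines, outlier_names, name_col=1):
    """Keep sample-file lines whose sample survived --mind and is not an outlier.

    sample_lines:  lines of the original sample file (layout assumed).
    fam_lines:     lines of the plink .fam written after --mind 0.99.
    outlier_names: iterable of sample names flagged as outliers.
    name_col:      0-based column holding the sample name (assumption).
    """
    kept_by_plink = {line.split()[1] for line in fam_lines if line.strip()}
    outliers = set(outlier_names)
    pruned = []
    for line in sample_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[name_col]
        if name in kept_by_plink and name not in outliers:
            pruned.append(line)
    return pruned


def count_cases_controls(fam_lines):
    """Count cases (phenotype 2) and controls (phenotype 1) in .fam lines."""
    counts = {"cases": 0, "controls": 0}
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # not a complete .fam row
        if fields[5] == "2":
            counts["cases"] += 1
        elif fields[5] == "1":
            counts["controls"] += 1
    return counts
```

Filtering the original sample file line by line, instead of rebuilding it from the .fam, leaves any extra columns in the sample file untouched.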