# FlashPCA

Uses Flashpca 2 (https://github.com/gabraham/flashpca) to obtain principal components, and plink nearest neighbors to perform outlier detection. Flashpca 2 can accommodate a much larger number of samples than eigenstrat (smartpca), which handles around 20k.

### Procedure:

1. Run plink ped -> bed conversion on the atav ped file with --mind 0.99.

   This excludes samples with more than 99% of their ped file genotype information missing (i.e. 0's in more than 99% of ped file variant positions).
2. Run flashpca; verify the root mean squared error reported by flashpca.
3. Plot eigenvalues and percent variance explained.
4. Plot eigenvectors for the first 3 dimensions (1 vs 2, 2 vs 3, 1 vs 3).
5. Perform outlier detection (where applicable):
   - a. Use plink --neighbour 1 <numNeighbor>.
   - b. Filter plink's .nearest output file based on z-score-thresh for outliers as described above.
   - c. Color outliers on the previous eigenvector plots.
   - d. Repeat steps 1-4 minus outliers.

6. Generate a new sample file by removing the samples removed by plink's --mind 0.99 and the outliers.
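The command-line side of steps 1, 2 and 5a can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the function names, the file prefixes and the default of 10 dimensions are hypothetical, and the flashpca `--check` invocation is written as recalled from the flashpca README, so it is worth confirming against `flashpca --help` before use. The functions only build argument lists; nothing is executed.

```python
import shlex
from typing import List


def plink_make_bed_cmd(ped_prefix: str, out_prefix: str, mind: float = 0.99) -> List[str]:
    """Step 1: ped -> bed conversion, dropping samples above the missingness cutoff."""
    return ["plink", "--file", ped_prefix, "--mind", str(mind),
            "--make-bed", "--out", out_prefix]


def flashpca_cmd(bed_prefix: str, ndim: int = 10) -> List[str]:
    """Step 2: run flashpca. Output file names are assumed, not taken from the pipeline."""
    return ["flashpca", "--bfile", bed_prefix, "--ndim", str(ndim),
            "--outval", "eigenvalues_flashpca",
            "--outvec", "eigenvectors_flashpca",
            "--outpc", "pcs_flashpca",
            "--outpve", "pve_flashpca"]


def flashpca_check_cmd(bed_prefix: str) -> List[str]:
    """Step 2 (verification): ask flashpca to report the error of a previous run.

    Assumption: --check re-reads the eigenvalue/eigenvector files written above.
    """
    return ["flashpca", "--bfile", bed_prefix, "--check",
            "--outval", "eigenvalues_flashpca",
            "--outvec", "eigenvectors_flashpca"]


def plink_neighbour_cmd(bed_prefix: str, num_neighbors: int,
                        out_prefix: str = "plink_outlier") -> List[str]:
    """Step 5a: nearest-neighbour report for neighbours 1..num_neighbors."""
    return ["plink", "--bfile", bed_prefix,
            "--neighbour", "1", str(num_neighbors), "--out", out_prefix]


if __name__ == "__main__":
    # Print the commands instead of running them; "atav" and "plink" are placeholder prefixes.
    for cmd in (plink_make_bed_cmd("atav", "plink"),
                flashpca_cmd("plink"),
                flashpca_check_cmd("plink"),
                plink_neighbour_cmd("plink", 5)):
        print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` once the prefixes match the real files.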

### Output:

- flashpca runs once or twice, depending on whether outlier removal takes place; the log of each run is concatenated to *flashpca.log
- plink's bed, bim, fam and log files from method step 1
- eigenvectors, principal components, percent variance explained and eigenvalues files from method step 2
- Plot pdfs: eigenvalues_flashpca.pdf, pve_flashpca.pdf from method step 3
- Plot pdf: eigenvectors_flashpca.pdf from method step 4
- plink_outlier.nearest, plink_outlier.log from method step 5a
- outlier_file.txt from method step 5b
- Color outliers plot pdf: plot_eigenvectors_flashpca_color_outliers.pdf from method step 5c
- plink_outlier_removed.bed, bim, fam, log, eigenvalues_flashpca_outliers_removed, eigenvectors_flashpca_outliers_removed, pcs_flashpca_outliers_removed, pve_flashpca_outliers_removed, eigenvalues_flashpca_outliers_removed.pdf, pve_flashpca_outliers_removed.pdf, plot_eigenvectors_flashpca_outliers_removed.pdf from method step 5d

- pruned_sample_file.txt from method step 6
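How outlier_file.txt and pruned_sample_file.txt might be derived from plink_outlier.nearest can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: the function names are hypothetical, the threshold value of -4.0 is a placeholder for z-score-thresh, and the sketch assumes that an outlier is a sample whose Z column falls below that threshold for any reported neighbour. The sample list is treated as plain IIDs, which may differ from the real sample file layout.

```python
from typing import Iterable, List, Set, Tuple


def find_outliers(nearest_lines: Iterable[str], z_thresh: float = -4.0) -> List[Tuple[str, str]]:
    """Return (FID, IID) pairs with Z below z_thresh for at least one neighbour row."""
    lines = iter(nearest_lines)
    header = next(lines).split()
    # Locate columns by name rather than position.
    fid_i, iid_i, z_i = header.index("FID"), header.index("IID"), header.index("Z")
    outliers: List[Tuple[str, str]] = []
    seen: Set[Tuple[str, str]] = set()
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            z = float(fields[z_i])
        except ValueError:
            continue  # skip rows with a non-numeric Z value
        key = (fields[fid_i], fields[iid_i])
        if z < z_thresh and key not in seen:
            seen.add(key)
            outliers.append(key)
    return outliers


def prune_samples(sample_ids: Iterable[str], removed_ids: Set[str]) -> List[str]:
    """Keep the samples that were neither dropped by --mind nor flagged as outliers."""
    return [s for s in sample_ids if s not in removed_ids]


if __name__ == "__main__":
    # Made-up rows in the .nearest column layout, for illustration only.
    nearest = [
        "FID IID NN MIN_DST Z FID2 IID2 PROP_DIFF",
        "F1 S1 1 0.80 0.5 F2 S2 0.1",
        "F2 S2 1 0.80 0.5 F1 S1 0.1",
        "F3 S3 1 0.60 -5.2 F1 S1 0.9",
    ]
    outliers = find_outliers(nearest)
    removed = {iid for _, iid in outliers} | {"S4"}  # S4 stands in for a --mind exclusion
    print(outliers)
    print(prune_samples(["S1", "S2", "S3", "S4"], removed))
```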

The names of the outliers, the flashpca / plink commands, and counts of the number of cases and controls at each step are also printed to the stdout file.
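The case and control counts could, for example, be taken from the plink fam file produced at each step. The sketch below assumes the standard six-column fam layout, where the sixth column is the phenotype (1 = control, 2 = case, anything else treated as missing); the function name is hypothetical and the pipeline may compute its counts differently.

```python
from collections import Counter
from typing import Dict, Iterable


def count_case_control(fam_lines: Iterable[str]) -> Dict[str, int]:
    """Count cases, controls and missing phenotypes from fam-formatted lines."""
    labels = {"1": "control", "2": "case"}
    counts = Counter({"case": 0, "control": 0, "missing": 0})
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        counts[labels.get(fields[5], "missing")] += 1
    return dict(counts)


if __name__ == "__main__":
    # Made-up fam rows: FID IID PAT MAT SEX PHENOTYPE
    fam = [
        "F1 S1 0 0 1 2",
        "F2 S2 0 0 2 1",
        "F3 S3 0 0 1 -9",
    ]
    print(count_case_control(fam))
```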