The proposed method, a localized intolerance model using Bayesian regression (LIMBR), is intended to stabilize estimates of genic intolerance, leading to lower variability and improved predictive utility. This is validated by identifying regions enriched for ClinVar pathogenic variants, with AUC up to 0.86 for de novo missense variants in Online Mendelian Inheritance in Man (OMIM) genes. The scores are obtained by fitting a Bayesian multilevel model in which domains are clustered within genes, using the canonical transcript found in the Conserved Domain Database (http://www.ncbi.nlm.nih.gov/cdd/).
The method regresses the number of missense variants on the total number of variants, using 123,136 exome sequence samples in the Genome Aggregation Database (gnomAD, http://gnomad.broadinstitute.org/). It extends the existing RVIS and sub-RVIS methods. The sub-RVIS method does not account for the fact that each sub-region lies within a gene and therefore has a natural relationship to the other sub-regions in the same gene. LIMBR leverages that shared information to increase the power to detect pathogenic regions.
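The two ideas above, a regression of missense counts on total counts and the sharing of information among sub-regions of the same gene, can be illustrated with a minimal sketch. This is not the LIMBR implementation: it swaps the full Bayesian multilevel model for an ordinary least-squares fit followed by a simple empirical-Bayes shrinkage toward each gene's mean residual. The function names, the input arrays, and the two variance parameters (`sigma2_within`, `tau2_between`) are all hypothetical and chosen only for illustration.

```python
import numpy as np


def regional_residuals(total, missense):
    """Fit missense ~ total by ordinary least squares and return the residuals.

    A negative residual means fewer missense variants than expected for that
    sub-region, i.e. the sub-region looks more intolerant (RVIS-style idea).
    """
    total = np.asarray(total, dtype=float)
    missense = np.asarray(missense, dtype=float)
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(total, missense, 1)
    return missense - (intercept + slope * total)


def shrink_within_genes(resid, gene_ids, sigma2_within, tau2_between):
    """Pull each sub-region residual toward the mean residual of its gene.

    Assumes a normal-normal model with known variances (an illustrative
    simplification): the shrinkage weight is tau2 / (tau2 + sigma2).
    """
    resid = np.asarray(resid, dtype=float)
    gene_ids = np.asarray(gene_ids)
    weight = tau2_between / (tau2_between + sigma2_within)
    pooled = np.empty_like(resid)
    for gene in np.unique(gene_ids):
        mask = gene_ids == gene
        gene_mean = resid[mask].mean()
        # Keep the gene mean, scale down deviations from it.
        pooled[mask] = gene_mean + weight * (resid[mask] - gene_mean)
    return pooled
```

With noisy sub-region estimates (large `sigma2_within` relative to `tau2_between`) the weight approaches 0 and every sub-region score collapses to its gene mean. With precise estimates the weight approaches 1 and the scores stay close to the unpooled residuals, which corresponds to treating each sub-region independently.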
Please feel free to email Tristan J. Hayeck at firstname.lastname@example.org for more information. The manuscript is being revised before submission, and updates to this wiki are pending. If you'd like to be contacted when the manuscript is out, please let Tristan know.
"Improved Pathogenic Variant Localization using a Hierarchical Model of Sub-regional Intolerance" (In preparation)
Abstract: Different parts of a gene can be of differential importance to development and health. This regional heterogeneity is also apparent in the distribution of disease mutations, which often cluster in particular regions of disease genes. The ability to precisely estimate functionally important sub-regions of genes will be key in correctly deciphering relationships between genetic variation and disease. Previous methods have had some success using standing human variation to characterize this variability in importance by measuring sub-regional intolerance, i.e., the depletion in functional variation from expectation within a given region of a gene. However, the ability to precisely estimate local intolerance was restricted by the fact that only information within a given sub-region is used, leading to instability in local estimates, especially for small regions. Here, we overcome that limitation by borrowing information across genes using a Bayesian hierarchical model. We fit the model using standing human variation from 123,136 individuals in the Genome Aggregation Database (gnomAD). We show that our approach stabilizes estimates, leading to lower variability and improved predictive utility. Specifically, using pathogenic mutations from ClinVar, we show that our approach more effectively identifies regions enriched for pathogenic variants, and we are able to show significant correlations between sub-region intolerance and the distribution of pathogenic variation in disease genes, with AUC up to 0.86 for de novo missense variants in Online Mendelian Inheritance in Man (OMIM) genes. This latter result immediately suggests that considering the intolerance of the regions in which variants are found may improve diagnostic interpretation within exons. We also illustrate the utility of integrating regional intolerance into gene-level disease association tests with a study of known disease genes for epileptic encephalopathy.