Cory McLean
Cory is a staff software engineer in Google Health who leads the Genomics research team. His research interests broadly include applying machine learning to the analysis and interpretation of genomic data, and publishing tools and methods as open-source software. Prior to Google, Cory was at 23andMe, where he developed algorithms and tools to improve identity-by-descent detection, haplotype phasing, and genotype imputation, and to apply genetic association study results to drug development. Cory received a PhD in computer science from Stanford, where he developed computational methods to understand vertebrate gene regulation, and a BS in computer science from MIT.
Authored Publications
Longitudinal fundus imaging and its genome-wide association analysis provides evidence for a human retinal aging clock
Sara Ahadi
Kenneth A Wilson Jr
Orion Pritchard
Ajay Kumar
Enrique M Carrera
Ricardo Lamy
Jay M Stewart
Avinash Varadarajan
Pankaj Kapahi
Ali Bashir
eLife (2023)
Background
Biological age, distinct from an individual’s chronological age, has been studied extensively through predictive aging clocks. However, these clocks have limited accuracy on short time-scales. Deep learning approaches on imaging datasets of the eye have proven powerful for a variety of quantitative phenotype inference tasks and provide an opportunity to explore organismal aging and tissue health.
Methods
Here we trained deep learning models on fundus images from the EyePacs dataset to predict individuals’ chronological age. These predictions lead to the concept of a retinal aging clock which we then employed for a series of downstream longitudinal analyses. The retinal aging clock was used to assess the predictive power of aging inference, termed eyeAge, on short time-scales using longitudinal fundus imaging data from a subset of patients. Additionally, the model was applied to a separate cohort from the UK Biobank to validate the model and perform a GWAS. The top candidate gene was then tested in a fly model of eye aging.
Findings
EyeAge predicted chronological age with a mean absolute error of 3.26 years, which is much lower than that of other aging clocks. Additionally, eyeAge was highly independent of blood marker-based measures of biological age (e.g. “phenotypic age”), maintaining a hazard ratio of 1.026 even in the presence of phenotypic age. Longitudinal studies showed that the resulting models were able to predict individuals’ aging on time-scales of less than a year with 71% accuracy. Notably, we observed a significant individual-specific component to the prediction. This observation was confirmed by the identification of multiple GWAS hits in the independent UK Biobank cohort. Knockdown of the top hit, ALKAL2, which was previously shown to extend lifespan in flies, also slowed age-related decline in vision in flies.
Interpretation
In conclusion, predicted age from retinal images can be used as a biomarker of biological aging in a given individual, independently from phenotypic age. This study demonstrates the utility of the retinal aging clock for studying aging and age-related diseases and for quantitatively measuring aging on very short time-scales, potentially opening avenues for quick and actionable evaluation of gero-protective therapeutics.
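An aging clock of this kind is typically read out as the gap between model-predicted and chronological age. A minimal sketch of computing such an age-acceleration residual (toy data and variable names invented here, not the paper's pipeline):

```python
import numpy as np

def age_acceleration(predicted_age, chronological_age):
    """Residual of predicted age after regressing out chronological age.

    Positive values suggest faster-than-expected aging for that individual.
    """
    predicted_age = np.asarray(predicted_age, dtype=float)
    chronological_age = np.asarray(chronological_age, dtype=float)
    # Fit predicted ~ slope * chronological + intercept, then take residuals.
    slope, intercept = np.polyfit(chronological_age, predicted_age, deg=1)
    return predicted_age - (slope * chronological_age + intercept)

# Toy example: the third subject's retina "looks" several years older than the trend.
chron = np.array([40.0, 50.0, 60.0, 70.0])
pred = np.array([41.0, 51.0, 67.0, 71.0])
accel = age_acceleration(pred, chron)
```

Because the fit includes an intercept, the residuals average to zero across the cohort; only the individual-specific deviation remains.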
Multimodal LLMs for health grounded in individual-specific data
Anastasiya Belyaeva
Shravya Shetty
Andrew Carroll
Nick Furlotte
ICML Workshop on Machine Learning for Multimodal Healthcare Data (2023)
Large language models (LLMs) have shown an impressive ability to solve tasks in a wide range of fields, including health. Within the health domain, there are many data modalities that are relevant to an individual’s health status. To effectively solve tasks related to individual health, LLMs will need the ability to use a diverse set of features as context. However, the best way to encode and inject complex high-dimensional features into the input stream of an LLM remains an active area of research. Here, we explore the ability of a foundation LLM to estimate disease risk given health-related input features. First, we evaluate serialization of structured individual-level health data into text along with in-context learning and prompt tuning approaches. We find that the LLM performs better than random in the zero-shot and few-shot cases, and has comparable and often equivalent performance to baseline after prompt tuning. Next, we propose a way to encode complex non-text data modalities into the token embedding space and then use this encoding to construct multimodal sentences. We show that this multimodal LLM achieves better or equivalent performance compared to baseline models. Overall, our results show the potential for using multimodal LLMs grounded in individual health data to solve complex tasks such as risk prediction.
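The serialization idea can be illustrated by flattening structured features into a text fragment for the prompt. A rough sketch (the feature names and sentence format below are invented for illustration, not taken from the paper):

```python
def serialize_features(features: dict) -> str:
    """Render structured health features as a plain-text prompt fragment."""
    lines = [f"{name.replace('_', ' ')}: {value}" for name, value in features.items()]
    return "The patient has the following attributes. " + "; ".join(lines) + "."

prompt = serialize_features({"age": 54, "bmi": 27.1, "smoking_status": "former"})
```

The resulting string can then be prepended to a task instruction, which is the zero-/few-shot setting the abstract evaluates before moving to embedding-space encodings.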
Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models
Babak Behsaz
Babak Alipanahi
Zachary Ryan Mccaw
Davin Hill
Tae-Hwi Schwantes-An
Dongbing Lai
Andrew Carroll
Brian Hobbs
Michael Cho
Nature Genetics (2023)
Chronic obstructive pulmonary disease (COPD), the third leading cause of death worldwide, is highly heritable. While COPD is clinically defined by applying thresholds to summary measures of lung function, a quantitative liability score has more power to identify genetic signals. Here we train a deep convolutional neural network on noisy self-reported and International Classification of Diseases labels to predict COPD case-control status from high-dimensional raw spirograms and use the model's predictions as a liability score. The machine-learning-based (ML-based) liability score accurately discriminates COPD cases and controls, and predicts COPD-related hospitalization without any domain-specific knowledge. Moreover, the ML-based liability score is associated with overall survival and exacerbation events. A genome-wide association study on the ML-based liability score replicates existing COPD and lung function loci and also identifies 67 new loci. Lastly, our method provides a general framework to use ML methods and medical-record-based labels that does not require domain knowledge or expert curation to improve disease prediction and genomic discovery for drug design.
Unsupervised representation learning improves genomic discovery for lung function and respiratory disease prediction
Babak Behsaz
Zachary Ryan Mccaw
Davin Hill
Robert Luben
Dongbing Lai
John Bates
Howard Yang
Tae-Hwi Schwantes-An
Yuchen Zhou
Anthony Khawaja
Andrew Carroll
Brian Hobbs
Michael Cho
medRxiv (2023)
High-dimensional clinical data are becoming more accessible in biobank-scale datasets. However, effectively utilizing high-dimensional clinical data for genetic discovery remains challenging. Here we introduce a general deep learning-based framework, REpresentation learning for Genetic discovery on Low-dimensional Embeddings (REGLE), for discovering associations between genetic variants and high-dimensional clinical data. REGLE uses convolutional variational autoencoders to compute a non-linear, low-dimensional, disentangled embedding of the data with highly heritable individual components. REGLE can incorporate expert-defined or clinical features and provides a framework to create accurate disease-specific polygenic risk scores (PRS) in datasets which have minimal expert phenotyping. We apply REGLE to both respiratory and circulatory systems: spirograms which measure lung function and photoplethysmograms (PPG) which measure blood volume changes. Genome-wide association studies on REGLE embeddings identify more genome-wide significant loci than existing methods and replicate known loci for both spirograms and PPG, demonstrating the generality of the framework. Furthermore, these embeddings are associated with overall survival. Finally, we construct a set of PRSs that improve predictive performance of asthma, chronic obstructive pulmonary disease, hypertension, and systolic blood pressure in multiple biobanks. Thus, REGLE embeddings can quantify clinically relevant features that are not currently captured in a standardized or automated way.
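A polygenic risk score of the kind REGLE's GWAS results feed into is, at its simplest, a weighted sum of allele dosages. A minimal sketch with toy effect sizes (not the paper's weights):

```python
import numpy as np

def polygenic_score(dosages, effect_sizes):
    """PRS per individual: sum over variants of allele dosage (0/1/2) x effect size."""
    dosages = np.asarray(dosages, dtype=float)            # shape (individuals, variants)
    effect_sizes = np.asarray(effect_sizes, dtype=float)  # shape (variants,)
    return dosages @ effect_sizes

dosages = np.array([[0, 1, 2],
                    [2, 0, 1]])
effects = np.array([0.2, -0.1, 0.05])
scores = polygenic_score(dosages, effects)
```

In the paper's setting, one such score is built per embedding coordinate and the set is then combined into a disease-specific predictor.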
Accurate human genome analysis with Element Avidity sequencing
Andrew Carroll
Bryan Lajoie
Daniel Cook
Kelly N. Blease
Kishwar Shafin
Lucas Brambrink
Maria Nattestad
Semyon Kruglyak
bioRxiv (2023)
We investigate the new sequencing technology Avidity from Element Biosciences. We show that Avidity whole genome sequencing matches mapping and variant calling accuracy with Illumina at high coverages (30x-50x) and is noticeably more accurate at lower coverages (20x-30x). We quantify base error rates of Element reads, finding lower error rates, especially in homopolymer and tandem repeat regions. We use Element’s ability to generate paired-end sequencing with longer insert sizes than typical short-read sequencing. We show that longer insert sizes result in even higher accuracy, with long insert Element sequencing giving noticeably more accurate genome analyses at all coverages.
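Base error rates like these are conventionally reported on the Phred scale, Q = −10·log10(error rate). A quick sketch of the conversion:

```python
import math

def phred_q(error_rate: float) -> float:
    """Phred-scaled quality: Q20 = 1% error, Q30 = 0.1%, Q40 = 0.01%."""
    return -10.0 * math.log10(error_rate)

def error_rate_from_q(q: float) -> float:
    """Inverse mapping: Phred quality back to a per-base error probability."""
    return 10.0 ** (-q / 10.0)

q30 = phred_q(1e-3)
```

On this scale, "lower error rates in homopolymer regions" translates directly into higher Q values for bases in those contexts.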
DeepNull models non-linear covariate effects to improve phenotypic prediction and association power
Zachary R. Mccaw
Nicholas A. Furlotte
Andrew Carroll
Babak Alipanahi
Nature Communications (2022)
Genome-wide association studies (GWASs) examine the association between genotype and phenotype while adjusting for a set of covariates. Although the covariates may have non-linear or interactive effects, due to the challenge of specifying the model, GWAS often neglect such terms. Here we introduce DeepNull, a method that identifies and adjusts for non-linear and interactive covariate effects using a deep neural network. In analyses of simulated and real data, we demonstrate that DeepNull maintains tight control of the type I error while increasing statistical power by up to 20% in the presence of non-linear and interactive effects. Moreover, in the absence of such effects, DeepNull incurs no loss of power. When applied to 10 phenotypes from the UK Biobank (n = 370K), DeepNull discovered more hits (+6%) and loci (+7%), on average, than conventional association analyses, many of which are biologically plausible or have previously been reported. Finally, DeepNull improves upon linear modeling for phenotypic prediction (+23% on average).
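The DeepNull idea can be sketched as two stages: fit a flexible model of phenotype on covariates, then include that model's prediction as an extra covariate in the association model. In the sketch below, a polynomial least-squares fit stands in for the paper's deep network (an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(40, 70, size=n)
# Phenotype with a non-linear covariate effect that a linear adjustment would miss.
phenotype = 0.01 * (age - 55) ** 2 + rng.normal(scale=0.5, size=n)

# Stage 1: learn the (possibly non-linear) covariate -> phenotype map.
# DeepNull uses a deep network; a quadratic least-squares fit stands in here.
design = np.column_stack([np.ones(n), age, age ** 2])
coef, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
deepnull_covariate = design @ coef

# Stage 2: a GWAS would now regress phenotype on each variant's genotype,
# adjusting for the original covariates plus this learned prediction.
residual = phenotype - deepnull_covariate
```

Capturing the non-linear covariate effect shrinks the residual variance, which is the mechanism behind the power gains the abstract reports.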
DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer
Aaron Wenger
Andrew Walker Carroll
Armin Töpfer
Ashish Teku Vaswani
Daniel Cook
Felipe Llinares
Gunjan Baid
Howard Cheng-Hao Yang
Jean-Philippe Vert
Kishwar Shafin
Maria Nattestad
Waleed Ammar
William J. Rowell
Nature Biotechnology (2022)
Genomic analysis requires accurate sequencing at sufficient coverage and over difficult genome regions. Through repeated sampling of a circular template, Pacific Biosciences developed long (10-25kb) reads with high overall accuracy, but lower homopolymer accuracy. Here, we introduce DeepConsensus, a transformer-based approach which leverages a unique alignment loss to correct sequencing errors. DeepConsensus reduces errors in PacBio HiFi reads by 42%, compared to the current approach. We show this increases the yield of PacBio HiFi reads at Q20 by 9%, at Q30 by 27%, and at Q40 by 90%. With two SMRT cells of HG003, reads from DeepConsensus improve hifiasm assembly contiguity (NG50 4.9Mb to 17.2Mb), increase gene completeness (94% to 97%), reduce false gene duplication rate (1.1% to 0.5%), improve assembly base accuracy (QV43 to QV45), and reduce variant calling errors by 24%.
Genome-wide association studies (GWAS) are used to identify genetic variants significantly correlated with a target disease or phenotype as a first step to detect potentially causal genes. The availability of high-dimensional biomedical data in population-scale biobanks has enabled novel machine-learning-based phenotyping approaches in which machine learning (ML) algorithms rapidly and accurately phenotype large cohorts with both genomic and clinical data, increasing the statistical power to detect variants associated with a given phenotype. While recent work has demonstrated that these methods can be extended to diseases for which only low quality medical-record-based labels are available, it is not possible to quantify changes in statistical power since the underlying ground-truth liability scores for the complex, polygenic diseases represented by these medical-record-based phenotypes are unknown. In this work, we aim to empirically study the robustness of ML-based phenotyping procedures to label noise by applying varying levels of random noise to vertical cup-to-disc ratio (VCDR), a quantitative feature of the optic nerve that is predictable from color fundus imagery and strongly influences glaucoma referral risk. We show that the ML-based phenotyping procedure recovers the underlying liability score across noise levels, significantly improving genetic discovery and PRS predictive power relative to noisy equivalents. Furthermore, initial denoising experiments show promising preliminary results, suggesting that improving such methods will yield additional gains.
How DeepConsensus Works
Aaron Wenger
Anastasiya Belyaeva
Andrew Carroll
Armin Töpfer
Ashish Teku Vaswani
Daniel Cook
Felipe Llinares
Gunjan Baid
Howard Yang
Jean-Philippe Vert
Kishwar Shafin
Maria Nattestad
Waleed Ammar
William J. Rowell
(2022)
These are slides for a public video about DeepConsensus.
Knowledge distillation for fast and accurate DNA sequence correction
Anastasiya Belyaeva
Daniel Cook
Kishwar Shafin
Daniel Liu
Armin Töpfer
Aaron Wenger
William J. Rowell
Howard Yang
Andrew Carroll
Maria Nattestad
Learning Meaningful Representations of Life (LMRL) Workshop NeurIPS 2022
Accurate genome sequencing can improve our understanding of biology and the genetic basis of disease. The standard approach for generating DNA sequences from PacBio instruments relies on HMM-based models. Here, we introduce Distilled DeepConsensus - a distilled transformer-encoder model for sequence correction, which improves upon the HMM-based methods with runtime constraints in mind. Distilled DeepConsensus is 1.3x faster and 1.5x smaller than its larger counterpart while improving the yield of high quality reads (Q30) over the HMM-based method by 1.69x (vs. 1.73x for larger model). With improved accuracy of genomic sequences, Distilled DeepConsensus improves downstream applications of genomic sequence analysis such as reducing variant calling errors by 39% (34% for larger model) and improving genome assembly quality by 3.8% (4.2% for larger model). We show that the representations learned by Distilled DeepConsensus are similar between faster and slower models.
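Knowledge distillation of this kind typically trains the small model against the large model's temperature-softened output distribution. A generic numpy sketch of the soft-target loss (not DeepConsensus's actual training code; the temperature value is an illustrative choice):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence from the teacher's softened distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([4.0, 1.0, 0.5])
loss_same = distillation_loss(teacher, teacher)            # student matches teacher
loss_diff = distillation_loss(np.array([0.5, 1.0, 4.0]), teacher)
```

The temperature spreads probability mass over non-argmax classes, so the student also learns the teacher's relative preferences rather than only its hard labels.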
DeepTrio: Variant Calling in Families Using Deep Learning
Gunjan Baid
Howard Yang
Maria Nattestad
Sidharth Goel
bioRxiv (2021)
Every human inherits one copy of the genome from their mother and another from their father. Parental inheritance helps us understand the transmission of traits and genetic diseases, which often involve de novo variants and rare recessive alleles. Here we present DeepTrio, which learns to analyze child-mother-father trios from the joint sequence information, without explicit encoding of inheritance priors. DeepTrio learns how to weigh sequencing errors, mapping errors, de novo rates, and genome context directly from the sequence data. DeepTrio has higher accuracy on both Illumina and PacBio HiFi data when compared to DeepVariant. Improvements are especially pronounced at lower coverages (with 20x DeepTrio roughly equivalent to 30x DeepVariant). As DeepTrio learns directly from data, we also demonstrate extensions to exome calling and calling with duos (child and one parent) solely by changing the training data. DeepTrio includes pre-trained models for Illumina WGS, Illumina exome, and PacBio HiFi.
Machine learning guided aptamer discovery
Ali Bashir
Geoff Davis
Michelle Therese Dimon
Qin Yang
Scott Ferguson
Zan Armstrong
Nature Communications (2021)
Aptamers are discovered by searching a large library for sequences with desirable binding properties. These libraries, however, are physically constrained to a fraction of the theoretical sequence space and limited to sampling strategies that are easy to scale. Integrating machine learning could enable identification of high-performing aptamers across this unexplored fitness landscape. We employed particle display (PD) to partition aptamers by affinity and trained neural network models to improve physically-derived aptamers and predict affinity in silico. These predictions were used to locally improve physically derived aptamers as well as identify completely novel, high-affinity aptamers de novo. We experimentally validated the predictions, improving aptamer candidate designs at a rate 10-fold higher than random perturbation, and generating novel aptamers at a rate 448-fold higher than PD alone. We characterized the explanatory power of the models globally and locally and showed successful sequence truncation while maintaining affinity. This work combines machine learning and physical discovery, uses principles that are widely applicable to other display technologies, and provides a path forward for better diagnostic and therapeutic agents.
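The model-guided local improvement loop can be sketched as greedy single-base mutation under a learned scoring function. In this sketch, a toy motif-matching scorer stands in for the trained neural network:

```python
def greedy_improve(seq, score, alphabet="ACGT", rounds=10):
    """Repeatedly accept the single-base substitution that raises the score most."""
    best = seq
    for _ in range(rounds):
        candidates = [best[:i] + b + best[i + 1:]
                      for i in range(len(best)) for b in alphabet if b != best[i]]
        challenger = max(candidates, key=score)
        if score(challenger) <= score(best):
            break  # local optimum under the model; stop searching
        best = challenger
    return best

# Toy scorer: fraction of positions matching a hypothetical high-affinity motif.
target = "GGATTACA"
score = lambda s: sum(a == b for a, b in zip(s, target)) / len(target)
improved = greedy_improve("GCATTACG", score)
```

The real system explores the model's fitness landscape far more broadly (including de novo generation), but the accept-if-better loop captures the local-improvement half of the approach.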
Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology
Babak Alipanahi
Babak Behsaz
Zachary Ryan Mccaw
Emanuel Schorsch
Lizzie Dorfman
Sonia Phene
Andrew Walker Carroll
Anthony Khawaja
American Journal of Human Genetics (2021)
Genome-wide association studies (GWAS) require accurate cohort phenotyping, but expert labeling can be costly, time-intensive, and variable. Here we develop a machine learning (ML) model to predict glaucomatous features from color fundus photographs. We used the model to predict vertical cup-to-disc ratio (VCDR), a diagnostic parameter and cardinal endophenotype for glaucoma, in 65,680 Europeans in the UK Biobank (UKB). A GWAS of ML-based VCDR identified 299 independent genome-wide significant (GWS; P ≤ 5×10⁻⁸) hits in 156 loci. The ML-based GWAS replicated 62 of 65 GWS loci from a recent VCDR GWAS in the UKB for which two ophthalmologists manually labeled images for 67,040 Europeans. The ML-based GWAS also identified 93 novel loci, significantly expanding our understanding of the genetic etiologies of glaucoma and VCDR. Pathway analyses support the biological significance of the novel hits to VCDR, with select loci near genes involved in neuronal and synaptic biology or known to cause severe Mendelian ophthalmic disease. Finally, the ML-based GWAS results significantly improve polygenic prediction of VCDR in independent datasets.
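Counting genome-wide significant hits is a straightforward threshold on the per-variant association p-values. A sketch with toy p-values:

```python
GWS_THRESHOLD = 5e-8  # conventional genome-wide significance level

def genome_wide_significant(pvalues, threshold=GWS_THRESHOLD):
    """Indices of variants passing the genome-wide significance threshold."""
    return [i for i, p in enumerate(pvalues) if p <= threshold]

hits = genome_wide_significant([3e-9, 0.04, 5e-8, 1e-3])
```

Grouping such hits into independent loci additionally requires linkage-disequilibrium clumping, which this sketch omits.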
Accurate, scalable cohort variant calls using DeepVariant and GLnexus
Helen Li
Michael F. Lin
Andrew Walker Carroll
Bioinformatics (2021)
Motivation
Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging.
Results
We introduce an open-source cohort-calling method that uses the highly-accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimized the method across a range of cohort sizes, sequencing methods, and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently-generated GATK Best Practices pipeline.
Availability and Implementation
We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-sourced, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later.
Supplementary information
Supplementary data are available at Bioinformatics online.
SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression
Advances in Neural Information Processing Systems (2021)
Logistic regression remains one of the most widely used tools in applied statistics, machine learning and data science. Practical datasets often have a substantial number of features d relative to the sample size n. In these cases, the logistic regression maximum likelihood estimator (MLE) is biased, and its standard large-sample approximation is poor. In this paper, we develop an improved method for debiasing predictions and estimating frequentist uncertainty for such datasets. We build on recent work characterizing the asymptotic statistical behavior of the MLE in the regime where the aspect ratio d/n, instead of the number of features d, remains fixed as n grows. In principle, this approximation facilitates bias and uncertainty corrections, but in practice, these corrections require an estimate of the signal strength of the predictors. Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude. The bias correction that this facilitates also reduces the variance of the predictions, yielding narrower confidence intervals with higher (valid) coverage of the true underlying probabilities and parameters.
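The bias in question can be seen in a small simulation: with the aspect ratio d/n held fixed at a nonzero value, the unpenalized logistic MLE systematically inflates coefficient magnitudes. A numpy-only sketch (Newton's method with a tiny ridge term for numerical stability; parameter choices are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 200                                     # aspect ratio d/n = 0.2
beta = rng.normal(scale=np.sqrt(5.0 / d), size=d)    # moderate signal strength
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta))).astype(float)

def fit_logistic_mle(X, y, iters=25, ridge=1e-6):
    """Newton's method for the (essentially unpenalized) logistic MLE."""
    beta_hat = np.zeros(X.shape[1])
    for _ in range(iters):
        z = np.clip(X @ beta_hat, -30.0, 30.0)       # guard against overflow
        p = 1.0 / (1.0 + np.exp(-z))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None]) + ridge * np.eye(X.shape[1])
        beta_hat = beta_hat + np.linalg.solve(hessian, X.T @ (y - p))
    return beta_hat

beta_hat = fit_logistic_mle(X, y)
# Inflation factor: regress the fitted coefficients on the true ones.
inflation = float(beta @ beta_hat / (beta @ beta))
```

In the classical n >> d regime this inflation factor would be near 1; in the fixed-aspect-ratio regime it stays noticeably above 1, which is exactly the bias SLOE's signal-strength estimate makes correctable.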
A population-specific reference panel for improved genotype imputation in African Americans
Jared O’Connell
Meghan Moreno
Helen Li
Nadia Litterman
Elizabeth Noblin
Anjali Shastri
Elizabeth H. Dorfman
Suyash Shringarpure
23andMe Research Team
Adam Auton
Andrew Carroll
Communications Biology (2021)
There is currently a dearth of accessible whole genome sequencing (WGS) data for individuals residing in the Americas with Sub-Saharan African ancestry. We generated whole genome sequencing data at intermediate (15×) coverage for 2,294 individuals with large amounts of Sub-Saharan African ancestry, predominantly Atlantic African admixed with varying amounts of European and American ancestry. We performed extensive comparisons of variant callers, phasing algorithms, and variant filtration on these data to construct a high quality imputation panel containing data from 2,269 unrelated individuals. With the exception of the TOPMed imputation server (which notably cannot be downloaded), our panel substantially outperformed other available panels when imputing African American individuals. The raw sequencing data, variant calls and imputation panel for this cohort are all freely available via dbGaP and should prove an invaluable resource for further study of admixed African genetics.
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Dan Moldovan
Babak Alipanahi
Alex Beutel
Christina Chen
Jon Deaton
Shaobo Hou
Ghassen Jerfel
Yian Ma
Akinori Mitani
Andrea Montanari
Christopher Nielsen
Thomas Osborne
Rajiv Raman
Kim Ramasamy
Martin Gamunu Seneviratne
Shannon Sequeira
Harini Suresh
Victor Veitch
Journal of Machine Learning Research (2020)
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
In this work, we investigate variant calling across a pedigree of mosquito (Anopheles gambiae) genomes. Using rates of Mendelian violation, we assess pipelines developed to call variation in humans when applied to mosquito samples. We demonstrate the ability to rapidly retrain DeepVariant without the need for a gold standard set by using sites that are consistent versus inconsistent with Mendelian inheritance. We show that this substantially improves calling accuracy by tuning DeepVariant for the new genome context. Finally, we generate a model for accurate variant calling on low-coverage mosquito genomes and a corresponding variant callset.
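Mendelian consistency of a biallelic genotype call can be checked directly: coding genotypes as the count of alternate alleles (0/1/2), the child must be able to draw one allele from each parent. A minimal sketch:

```python
def mendelian_consistent(child: int, mother: int, father: int) -> bool:
    """True if the child's diploid genotype (0/1/2 alt-allele count) can be
    formed from one allele transmitted by each parent."""
    def alleles(g):
        # Possible transmitted alleles: homozygotes give one value, hets give either.
        return {0} if g == 0 else {1} if g == 2 else {0, 1}
    return any(m + f == child for m in alleles(mother) for f in alleles(father))

# A hom-alt child with a hom-ref mother is a Mendelian violation.
violation = not mendelian_consistent(child=2, mother=0, father=1)
```

Sites scored this way across a pedigree give the violation rate used above to assess pipelines and to select consistent sites for retraining.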
An open resource for accurately benchmarking small variant and reference calls
Justin M. Zook
Jennifer McDaniel
Nathan D. Olson
Justin M. Wagner
Hemang Parikh
Haynes Heaton
Sean A. Irvine
Len Trigg
Rebecca Truty
Francisco M. De La Vega
Chunlin Xiao
Stephen Sherry
Marc Salit
Nature Biotechnology, vol. 37 (2019), 561–566
Benchmark small variant calls are required for developing, optimizing and assessing the performance of sequencing and bioinformatics methods. Here, as part of the Genome in a Bottle (GIAB) Consortium, we apply a reproducible, cloud-based pipeline to integrate multiple short- and linked-read sequencing datasets and provide benchmark calls for human genomes. We generate benchmark calls for one previously analyzed GIAB sample, as well as six genomes from the Personal Genome Project. These new genomes have broad, open consent, making this a ‘first of its kind’ resource that is available to the community for multiple downstream applications. We produce 17% more benchmark single nucleotide variations, 176% more indels and 12% larger benchmark regions than previously published GIAB benchmarks. We demonstrate that this benchmark reliably identifies errors in existing callsets and highlight challenges in interpreting performance metrics when using benchmarks that are not perfect or comprehensive. Finally, we identify strengths and weaknesses of callsets by stratifying performance according to variant type and genome context.
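The benchmarking metrics reduce to precision and recall of a callset against the truth set. A simplified sketch treating variant calls as position/allele tuples (production comparison engines such as hap.py also handle representational differences between equivalent calls):

```python
def callset_metrics(truth: set, calls: set):
    """Precision and recall of a variant callset against benchmark truth calls."""
    tp = len(calls & truth)              # calls matching the benchmark
    fp = len(calls - truth)              # calls absent from the benchmark
    fn = len(truth - calls)              # benchmark variants that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
calls = {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr2", 99, "T", "C")}
precision, recall = callset_metrics(truth, calls)
```

Stratifying these counts by variant type and genome context, as the abstract describes, amounts to computing the same metrics within subsets of the truth and call sets.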
Summary
Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome.
Availability and implementation
GenomeWarp is written in Java. All source code and the user manual are freely available at https://github.com/verilylifesciences/genomewarp.
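Conceptually, the core transformation maps a coordinate through aligned blocks between the two assemblies. A highly simplified sketch (toy block mapping; real chain files and the conservative variant re-representation GenomeWarp performs are far more involved):

```python
def lift_position(pos, blocks):
    """Map a 0-based source-assembly position through aligned blocks.

    Each block is (src_start, dst_start, length). Returns None when the
    position falls in an unaligned gap, where no confident mapping exists.
    """
    for src_start, dst_start, length in blocks:
        if src_start <= pos < src_start + length:
            return dst_start + (pos - src_start)
    return None

# Toy mapping: two aligned blocks separated by a 10-base insertion in the target.
blocks = [(0, 0, 1000), (1000, 1010, 500)]
new_pos = lift_position(1200, blocks)
```

Returning None for unaligned gaps mirrors the conservative stance described above: dropping ambiguous sites minimizes false variants in the target genome.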
Sequential regulatory activity prediction across chromosomes with convolutional neural networks
David Kelley
Yakir Reshef
Genome Research (2018)
Functional genomics approaches to better model genotype-phenotype relationships have important applications toward understanding genomic function and improving human health. In particular, thousands of noncoding loci associated with diseases and physical traits lack mechanistic explanation. Here, we develop the first machine-learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. Using convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable GWAS loci fine mapping.
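Convolutional models like this consume DNA as a one-hot matrix, one row per position with a 1 marking the base. A standard encoding sketch:

```python
import numpy as np

BASES = "ACGT"

def one_hot_dna(seq: str) -> np.ndarray:
    """(length, 4) one-hot encoding; ambiguous bases (e.g. N) become all-zero rows."""
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            out[i, j] = 1.0
    return out

x = one_hot_dna("ACGTN")
```

Stacking these matrices for long input windows gives the sequence-only input from which the network predicts regulatory activity profiles.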
A universal SNP and small-indel variant caller using deep neural networks
Scott Schwartz
Dan Newburger
Jojo Dijamco
Nam Nguyen
Pegah T. Afshar
Sam S. Gross
Lizzie Dorfman
Mark A. DePristo
Nature Biotechnology (2018)
Despite rapid advances in sequencing technologies, accurately calling genetic variants present in an individual genome from billions of short, errorful sequence reads remains challenging. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships between images of read pileups around putative variant sites and true genotype calls. The approach, called DeepVariant, outperforms existing state-of-the-art tools. The learned model generalizes across genome builds and mammalian species, allowing nonhuman sequencing projects to benefit from the wealth of human ground-truth data. We further show that DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, including deep whole genomes from 10X Genomics and Ion Ampliseq exomes, highlighting the benefits of using more automated and generalizable techniques for variant calling.
A deep learning approach to pattern recognition for short DNA sequences
Akosua Busia
Clara Fannjiang
Lizzie Dorfman
Mark DePristo
bioRxiv (2018)
Sequence-to-sequence alignment is a widely-used analysis method in bioinformatics. One common use of sequence alignment is to infer information about an unknown query sequence from the annotations of similar sequences in a database, such as predicting the function of a novel protein sequence by aligning to a database of protein families or predicting the presence/absence of species in a metagenomics sample by aligning reads to a database of reference genomes. In this work we describe a deep learning approach to solve such problems in a single step by training a deep neural network (DNN) to predict the database-derived labels directly from the query sequence. We demonstrate the value of this DNN approach on a hard problem of practical importance: determining the species of origin of next-generation sequencing reads from 16s ribosomal DNA. In particular, we show that when trained on 16s sequences from more than 13,000 distinct species, our DNN can predict the species of origin of individual reads more accurately than existing machine learning baselines and alignment-based methods like BWA or BLAST, achieving absolute performance within 2.0% of perfect memorization of the training inputs. Moreover, the DNN remains accurate and outperforms read alignment approaches when the query sequences are especially noisy or ambiguous. Finally, these DNN models can be used to assess metagenomic community composition on a variety of experimental 16s read datasets. Our results are a first step towards our long-term goal of developing a general-purpose deep learning model that can learn to predict any type of label from short biological sequences.
Image-based screening is a powerful technique to reveal how chemical, genetic, and environmental perturbations affect cellular state. Its potential is restricted by the current analysis algorithms that target a small number of cellular phenotypes and rely on expert-engineered image features. Newer algorithms that learn how to represent an image are limited by the small amount of labeled data for ground-truth, a common problem for scientific projects. We demonstrate a sensitive and robust method for distinguishing cellular phenotypes that requires no additional ground-truth data or training. It achieves state-of-the-art performance classifying drugs by similar molecular mechanism, using a Deep Metric Network that has been pre-trained on consumer images and a transformation that improves sensitivity to biological variation. However, our method is not limited to classification into predefined categories. It provides a continuous measure of the similarity between cellular phenotypes that can also detect subtle differences such as from increasing dose. The rich, biologically-meaningful image representation that our method provides can help therapy development by supporting high-throughput investigations, even exploratory ones, with more sophisticated and disease-relevant models.
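The continuous similarity measure between cellular phenotypes is typically a distance between embedding vectors. A cosine-similarity sketch, with toy vectors standing in for the Deep Metric Network's image representations:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

treated = [0.9, 0.1, 0.4]   # embedding of a drug-treated well (toy values)
control = [0.8, 0.2, 0.5]   # embedding of a control well (toy values)
sim = cosine_similarity(treated, control)
```

Because the measure is continuous rather than categorical, small shifts in the embedding (e.g. from an increasing dose) register as graded changes in similarity.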