Lucy Colwell

Lucy is a research scientist at Google Research who works closely with colleagues from GAS and Brain to better understand the relationship between the sequence and function of biological macromolecules. Her broader research interests involve understanding how Google's strengths in experimental design and machine learning can be applied to the discovery and production of proteins for use in a diverse range of applications.
Authored Publications
Google Publications
Other Publications
    Deep diversification of an AAV capsid protein by machine learning
    Ali Bashir
    Sam Sinai
    Nina K. Jain
    Pierce J. Ogden
    Patrick F. Riley
    George M. Church
    Eric D. Kelsic
    Nature Biotechnology (2021)
    Modern experimental technologies can assay large numbers of biological sequences, but engineered protein libraries rarely exceed the sequence diversity of natural protein families. Machine learning (ML) models trained directly on experimental data without biophysical modeling provide one route to accessing the full potential diversity of engineered proteins. Here we apply deep learning to design highly diverse adeno-associated virus 2 (AAV2) capsid protein variants that remain viable for packaging of a DNA payload. Focusing on a 28-amino acid segment, we generated 201,426 variants of the AAV2 wild-type (WT) sequence yielding 110,689 viable engineered capsids, 57,348 of which surpass the average diversity of natural AAV serotype sequences, with 12–29 mutations across this region. Even when trained on limited data, deep neural network models accurately predict capsid viability across diverse variants. This approach unlocks vast areas of functional but previously unreachable sequence space, with many potential applications for the generation of improved viral vectors and protein therapeutics.
    Rethinking Attention with Performers
    Valerii Likhosherstov
    David Martin Dohan
    Peter Hawkins
    Jared Quincy Davis
    Lukasz Kaiser
    Adrian Weller
    accepted to ICLR 2021 (oral presentation) (to appear)
    We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
    Machine learning-guided protein design is rapidly emerging as a strategy to find high fitness multi-mutant variants. In this issue of Cell Systems, Wittman et al. analyze the impact of design decisions for machine learning-assisted directed evolution (MLDE) on its ability to navigate a fitness landscape and reliably find global optima.
    Being able to design biological sequences like DNA or proteins to have desired properties would have considerable impact in medical and industrial applications. However, doing so presents a challenging black-box optimization problem that requires multiple rounds of expensive, time-consuming experiments. In response, we propose using reinforcement learning (RL) for biological sequence design. RL is a flexible framework that allows us to optimize generative sequence policies to achieve a variety of criteria, including diversity among high-quality sequences discovered. We use model-based RL to improve sample efficiency, where at each round the policy is trained offline using a simulator fit on functional measurements from prior rounds. To accommodate the growing number of observations across rounds, the simulator model is automatically selected at each round from a pool of diverse models of varying capacity. On the tasks of designing DNA transcription factor binding sites, designing antimicrobial proteins, and optimizing the energy of Ising models based on protein structures, we find that model-based RL is an attractive alternative to existing methods.
    Evaluating Attribution for Graph Neural Networks
    Alexander B Wiltschko
    Brian Lee
    Jennifer Wei
    Wesley Qian
    Yiliu Wang
    Advances in Neural Information Processing Systems 33 (2020)
    Interpretability of machine learning models is critical to scientific understanding, AI safety, and debugging. Attribution is one approach to interpretability, which highlights input dimensions that are influential to a neural network’s prediction. Evaluation of these methods is largely qualitative for image and text models, because acquiring ground truth attributions requires expensive and unreliable human judgment. Attribution has been comparatively understudied for graph neural networks (GNNs), a model class of growing importance that makes predictions on arbitrarily-sized graphs. Graph-valued data offer an opportunity to quantitatively benchmark attribution methods, because challenging synthetic graph problems have computable ground-truth attributions. In this work we adapt commonly-used attribution methods for GNNs and quantitatively evaluate them using the axes of attribution accuracy, stability, faithfulness and consistency. We make concrete recommendations for which attribution methods to use, and provide the data and code for our benchmarking suite. Rigorous and open source benchmarking of attribution methods in graphs could enable new methods development and broader use of attribution in real-world ML tasks.
    The use of black-box optimization for the design of new biological sequences is an emerging research area with potentially revolutionary impact. The cost and latency of wet-lab experiments require methods that find good sequences in few experimental rounds of large batches of sequences, a setting that off-the-shelf black-box optimization methods are ill-equipped to handle. We find that the performance of existing methods varies drastically across optimization tasks, posing a significant obstacle to real-world applications. To improve robustness, we propose population-based optimization (PBO), which generates batches of sequences by sampling from an ensemble of methods. The number of sequences sampled from any method is proportional to the quality of sequences it previously proposed, allowing PBO to combine the strengths of individual methods while hedging against their innate brittleness. Adapting the population of methods online using evolutionary optimization further improves performance. Through extensive experiments on in-silico optimization tasks, we show that PBO outperforms any single method in its population, proposing both higher quality single sequences as well as more diverse batches. By its robustness and ability to design diverse, high-quality sequences, PBO is shown to be a new state-of-the-art approach to the batched black-box optimization of biological sequences.
    Critiquing Protein Family Classification Models Using Sufficient Input Subsets
    Brandon Michael Carter
    Jamie Alexander Smith
    Theo Sanderson
    ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2019) (to appear)
    In many application domains, neural networks are highly accurate and have been deployed at large scale. However, users often do not have good tools for understanding how these models arrive at their predictions. This has hindered adoption in fields such as the life and medical sciences, where researchers require that models base their decisions on underlying biological phenomena rather than peculiarities of the dataset introduced, e.g., as a function of when and how the data were collected. In response, we propose a set of methods for critiquing deep learning models, and demonstrate their application for protein family classification, a task for which high-accuracy models have considerable potential impact. Our methods extend the recently-introduced sufficient input subsets technique (SIS), which we use to identify the subset of locations (SIS) in each protein sequence that is sufficient for classification. Our suite of tools analyzes these SIS to shed light on the decision making criteria employed by models trained on this task. These tools expose that while these deep models may perform classification for biologically-relevant reasons, their behavior varies considerably across choice of network architecture and parameter initialization. While the techniques that we develop are specific to the protein sequence classification task, the approach taken generalizes to a broad set of scientific contexts in which model interpretability is essential. We encourage further application of our techniques for interrogating machine learning models trained on other scientifically relevant tasks.
    Using attribution to decode binding mechanism in neural network models for chemistry
    Ankur Taly
    Federico Monti
    Proceedings of the National Academy of Sciences (2019), pp. 201820657
    Deep neural networks have achieved state-of-the-art accuracy at classifying molecules with respect to whether they bind to specific protein targets. A key breakthrough would occur if these models could reveal the fragment pharmacophores that are causally involved in binding. Extracting chemical details of binding from the networks could potentially lead to scientific discoveries about the mechanisms of drug actions. But doing so requires shining light into the black box that is the trained neural network model, a task that has proved difficult across many domains. Here we show how the binding mechanism learned by deep neural network models can be interrogated, using a recently described attribution method. We first work with carefully constructed synthetic datasets, in which the 'fragment logic' of binding is fully known. We find that networks that achieve perfect accuracy on held out test datasets still learn spurious correlations due to biases in the datasets, and we are able to exploit this non-robustness to construct adversarial examples that fool the model. The dataset bias makes these models unreliable for accurately revealing information about the mechanisms of protein-ligand binding. In light of our findings, we prescribe a test that checks for dataset bias given a hypothesis. If the test fails, it indicates that the model must be simplified or regularized and/or that the training dataset requires augmentation.
    A Comparison of Generative Models for Sequence Design
    David Dohan
    Ramya Deshpande
    Olivier Chapelle
    Babak Alipanahi
    Machine Learning in Computational Biology Workshop (2019)
    In this paper, we compare generative models of different complexity for designing DNA and protein sequences using the Cross Entropy Method.
    Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate ~1/3 of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17929 families of the Pfam database. Using the Pfam seed sequences we establish a rigorous benchmark assessment and find that a dilated convolutional model reduces the error of state-of-the-art BLASTp and pHMM models by a factor of nine. With 80% of the full Pfam database we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space far from existing families, allowing sequences from novel families to be classified. We anticipate that deep learning models will be a core component of future general-purpose protein function prediction tools.
    Machine learning (ML) models trained to predict ligand binding to single proteins have achieved remarkable success, but cannot make predictions about protein targets other than the one they are trained on. Models that make predictions for multiple proteins and multiple ligands, known as drug-target interaction (DTI) models, aim to solve this problem but generally have lower performance. In this work, we improve the performance of DTI models by taking advantage of the accuracy of single protein/ligand binding models. Specifically, we first construct individual protein/ligand binding models for all training proteins with some experimental data, then use each individual model to make predictions for all remaining ligands, against the corresponding protein target. Finally, we use the known and predicted ligand binding data for all targets in a DTI model to make predictions for the unseen test proteins. This approach significantly improves performance; most importantly, some of our models are able to achieve Areas Under the Receiver Operating Characteristic curve (AUCs) exceeding 0.9 on test datasets that contain only unseen proteins and unseen ligands.
    Biological Sequences Design using Batched Bayesian Optimization
    Zelda Mariet
    Ramya Deshpande
    David Dohan
    Olivier Chapelle
    NeurIPS workshop on Bayesian Deep Learning (2019)
    Being able to effectively design biological sequences like DNA and proteins would have transformative impact on medicine. Currently, the most popular method in the life sciences for performing design is directed evolution, which explores sequence space by making small mutations to existing sequences. Alternatively, Bayesian optimization (BO) provides an attractive framework for model-based black-box optimization, and has achieved many recent successes in life sciences applications. However, within the ML community, most large-scale BO efforts have focused on hyper-parameter tuning. These methods often do not translate to biological sequence design, where the search space is over a discrete alphabet, wet-lab experiments are run with considerable parallelism (1K-100K sequences per batch), and experiments are sufficiently slow and expensive that only a few rounds of experiments are feasible. This paper discusses the particularities of batched BO on a large discrete space, and investigates the design choices that must be made in order to obtain robust, scalable, and experimentally successful models within this unique context.
    Glycation changes molecular organization and charge distribution in type I collagen fibrils
    Sneha Bansode
    Uliana Bashtanova
    Rui Li
    Jonathan Clark
    Karin H. Müller
    Anna Puszkarska
    Ieva Goldberga
    Holly H. Chetwood
    David G. Reid
    Jeremy N. Skepper
    Catherine M. Shanahan
    Georg Schitter
    Patrick Mesquida
    Melinda J. Duer
    Scientific Reports, vol. 10 (2020), pp. 3397
    Rapid discovery and evolution of orthogonal aminoacyl-tRNA synthetase–tRNA pairs
    Daniele Cervettini
    Shan Tang
    Stephen D. Fried
    Julian C. W. Willis
    Louise F. H. Funke
    Jason W. Chin
    Nature Biotechnology, vol. 38 (2020), 989–999
    The Effect of Debiasing Protein–Ligand Binding Data on Generalization
    Vikram Sundar
    J. Chem. Inf. Model., vol. 60 (2019), 56–62
    A polymer physics framework for the entropy of arbitrary pseudoknots
    Ofer Kimchi
    Tristan Cragnolini
    Biophysical Journal, vol. 117 (2019), pp. 520-532
    Computational approaches to therapeutic antibody design: established methods and emerging trends
    Richard A. Norman
    Francesco Ambrosetti
    Alexandre M.J.J. Bonvin
    Sebastian Kelm
    Sandeep Kumar
    Konrad Krawczyk
    Briefings in Bioinformatics, vol. 21 (2019), 1549–1567
    Collagen-inspired self-assembly of twisted filaments
    MJ Falk
    A Duwel
    Phys. Rev. Lett., vol. 123 (2019), pp. 238102
    Statistical and machine learning approaches to predicting protein–ligand interactions
    Current Opinion in Structural Biology, vol. 49 (2018), pp. 123-128
    Power law tails in phylogenetic systems
    Chongli Qin
    PNAS, vol. 115 (2018), pp. 690-695
    Proline provides site-specific flexibility for in vivo collagen
    Wing Ying Chow
    Chris J Forman
    Dominique Bihan
    Anna M Puszkarska
    Rakesh Rajan
    David G Reid
    David A Slatter
    David J Wales
    Richard W Farndale
    Melinda J Duer
    Scientific Reports, vol. 9 (2018), pp. 13809
    Analysis of nanobody paratopes reveals greater diversity than classical antibodies
    Laura S Mitchell
    Protein Engineering, Design and Selection, vol. 31 (2018), 267–275
    Comparative analysis of nanobody sequence and structure data
    Laura S. Mitchell
    Proteins: Structure, Function, and Bioinformatics, vol. 86 (2018), 697–706