Exploiting small and complex datasets for drug discovery
Statistics Seminar
5th May 2023, 1:00 pm – 2:00 pm
Fry Building, 2.41
Machine learning has been used in drug discovery for many years, and the field presents interesting challenges to statisticians. Graph neural networks (GNNs) have garnered particular attention recently, as they achieve state-of-the-art performance in certain supervised learning and sequential optimisation tasks. However, classical kernel methods and Gaussian process regression remain competitive on smaller datasets. It is widely agreed that the benchmarks used to assess predictive methods fail to represent the intricacies of real chemical datasets, which are typically multivariate and sparse, explore a small subset of molecular space, and have a low signal-to-noise ratio.
In this talk, I will discuss two projects motivated by these characteristics of real data. The first project deals with the popular Tanimoto similarity on molecular fingerprints. We introduce a generalisation that yields differentiable, scalable and interpretable features for kernel methods. Scalability is achieved through an oblivious subspace embedding which is continuous in the input and enjoys good approximation properties. The second project compares two approaches to probabilistic regression and meta-learning: Gaussian processes and neural processes. In each case, we employ realistic and challenging benchmarks from the Dockstring package.
This talk is joint work with Austin Tripp and Miguel García-Ortegón.