## Overview

Research in the Statistics Institute covers a wide range of topics, broadly unified under the theme of computational statistics and data science, but also including foundations and core statistics methods, and a range of applications in science and commerce. All of the topics below include high-dimensional and highly-structured datasets.

## Bayesian Modelling and Analysis

In the Bayesian approach to statistics, observables, predictands, and model parameters are all treated as random variables, which allows the observations to be incorporated by conditioning, and probabilistic predictions to be made by integrating out the model parameters. This powerful unifying framework has become far more accessible in the last twenty years, owing to improvements in computer power, and in stochastic algorithms for conditioning and integrating, particularly Markov Chain Monte Carlo (MCMC) algorithms. This framework also allows us to develop more complex statistical models, suitable for modern high-dimensional and highly-structured data. It is not uncommon for a Bayesian model to have hundreds or even thousands of parameters, but the effective number of parameters, which is data-determined, can be far fewer.
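As a minimal sketch of conditioning and integrating out, consider the simplest conjugate case: a Beta prior on an unknown success rate, updated with binomial observations. The function names and the data (7 successes, 3 failures) are hypothetical and chosen only for illustration; realistic models rarely admit closed forms like this, which is why stochastic algorithms such as MCMC are needed.

```python
from fractions import Fraction


def beta_binomial_update(a, b, successes, failures):
    """Condition a Beta(a, b) prior on binomial observations.

    Returns the parameters of the Beta posterior (conjugate update).
    """
    return a + successes, b + failures


def predictive_success_probability(a, b):
    """Probability that the next trial succeeds, with the unknown
    success rate integrated out under a Beta(a, b) distribution."""
    return Fraction(a, a + b)


# Hypothetical data: 7 successes and 3 failures, uniform Beta(1, 1) prior.
post_a, post_b = beta_binomial_update(1, 1, successes=7, failures=3)
print(post_a, post_b)                                  # 8 4
print(predictive_success_probability(post_a, post_b))  # 2/3
```

The prediction averages over every value of the success rate that remains plausible after seeing the data, rather than plugging in a single estimate.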

Bayesian methods are now mainstream in Machine Learning, and they are also widely used in more complex applications such as signal processing, target-tracking, protein folding, and genetic epidemiology, and in many applications involving latent processes, including spatial statistics. Bayesian decision theory is crucial in the development of transparent early warning systems, for example for extreme weather, or for volcanic eruptions.

Current Bayesian applications in the School include global-scale spatial statistics for measuring and predicting sea-level rise, spatio-temporal modelling of long-term daily air-pollution data, and analysis of ancestry in population genetics. Theoretical work on Bayesian analysis includes high-dimensional sparse computation, visualisation for model-checking, Bayesian asymptotics, and asymptotic approximation in inverse problems.

**Staff in this area: Ganesh Ayalvadi, Mathieu Gerber, Peter Green, Dan Lawson, Anthony Lee, Jonathan Rougier, Simon Wood, Feng Yu**

## High-Dimensional and Highly-Structured Data

High-dimensional statistics studies data whose dimension is larger than that treated in classical statistical theory. There has been a dramatic surge of interest and activity in high-dimensional statistics over the past two decades, due to new applications and datasets, and theoretical advances. With high-dimensional data, statistical activities such as variable selection, estimation, and hypothesis testing must scale conservatively with the number of cases, and this rules out many traditional statistical approaches. Theoretical work in high-dimensional data studies sequential, iterative, and approximation approaches, to establish whether they scale conservatively, and what their statistical properties are.

Many high-dimensional datasets are also highly-structured: relational data, for example, which are ubiquitous across a wide range of application areas including public health, life science, social science, and finance, and of course social media. The mathematical representation of a relational network is a graph, comprising vertices (sometimes called ‘nodes’) and edges between vertices. For example, a graph can represent brain voxels and their connectivity, to understand brain structure; or, in epidemiology, individuals and their contacts, to help policy-makers to mitigate harm and to manage outbreaks; or, in cyber-security, a computer network, in order to monitor it for suspicious changes in behaviour. In some applications, such as cyber-security, the graph is pre-specified and interest lies in how graph-structured data evolve in time. In other applications, such as neuroscience, the graph itself must be inferred from data such as brain scans. Inferring and exploring graphs involves a ‘combinatorial explosion’ of edges and paths, and many key operations on graphs are known to be NP-hard (widely regarded as computationally intractable), and must be approximated.

Current research in the School focuses on developing methods for uncovering the sparse/low-rank structure in high-dimensional data, such as factor analysis of high-dimensional panels of financial indicators and price processes, and Bayesian methods for structural learning and inference about relationships from DNA mixtures in forensic statistics.

**Staff in this area: Haeran Cho, Peter Green, Oliver Johnson, Tobias Kley, Dan Lawson, Anthony Lee, Patrick Rubin-Delanchy, Feng Yu, Yi Yu**

## Modern Regression Methods

The aim of regression modelling is to determine how the distribution of a noisy response variable depends on one or more independent variables, or ‘covariates’. Despite having been around for more than two hundred years, regression modelling is a very active and fast-paced research area. Modern regression methods are not limited to a continuous response, or to modelling the conditional mean of the response (Generalized Linear Models, quantile regression). They can also include a wide variety of non-linear covariate effects, which are constructed using basis expansions or stochastic processes (Generalized Additive Models, GAMs). This makes these regression models much more adaptable to the empirical relationship between the covariates and the response, although it also introduces the danger of overfitting, which tends to undermine predictive performance: controlling overfitting is a major topic in regression modelling.

From a practical point of view, the most pressing challenge in regression modelling is developing estimation methods that can handle large datasets, including techniques such as sequential learning and parallel computing. Machine Learning (ML) is a major user of modern regression methods, and ML research is a productive source of new algorithms for modern regression: modern regression is in the intersection of ML and computational statistics. Current research in the School includes developing theory and more efficient computational methods for GAMs; well founded methods for smooth additive quantile regression; scalable computation for smooth regression models in general; efficient INLA methods for non-sparse models; big model/data visualization; controlling spatial confounding in complex regression models.

**Staff in this area: Matteo Fasiolo, Tobias Kley, Arne Kovac, Guy Nason, Simon Wood**

## Monte Carlo Computation

Monte Carlo methods are simulation algorithms designed to compute answers to deterministic questions using random numbers. They are used in statistics principally to compute probabilities and expectations in complex stochastic models, and are at the origin of the ever-increasing popularity of Bayesian methods. Monte Carlo methods were first imported into Statistics from Physics: they originated in Los Alamos and the atomic bomb project, although there is reference to them as far back as the ancient Babylonians of Biblical times. Now they are applied across many scientific fields, including engineering, aerospace, image and speech recognition and robot navigation.
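The idea of answering a deterministic question with random numbers can be sketched in a few lines. The example below is illustrative only: it estimates the integral of x² over [0, 1], whose exact value is 1/3, by averaging over uniform random draws. The function name `monte_carlo_mean` and the sample size are assumptions made for this sketch.

```python
import random


def monte_carlo_mean(f, draw, n, rng):
    """Estimate E[f(X)] by averaging f over n independent draws of X."""
    total = 0.0
    for _ in range(n):
        total += f(draw(rng))
    return total / n


rng = random.Random(0)  # fixed seed so the run is repeatable
# The integral of x**2 over [0, 1] equals E[U**2] for U uniform on [0, 1];
# its exact value is 1/3.
estimate = monte_carlo_mean(lambda x: x * x, lambda r: r.random(), 100_000, rng)
print(estimate)  # close to 1/3, with an error of order 1/sqrt(n)
```

The error shrinks like 1/sqrt(n) regardless of the dimension of the integral, which is what makes the approach attractive for complex stochastic models.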

Modern data structures, like streaming data and high-dimensional datasets, are challenging for traditional Monte Carlo methods such as the well-known Metropolis-Hastings algorithm and the Gibbs sampler (both based on reversibility), because they cannot efficiently take advantage of the growing parallel computing power of modern computers. Current research in the School includes non-reversible and continuous-time Markov chain Monte Carlo methods, distributed particle filters, and stochastic optimisation algorithms.

**Staff in this area: Christophe Andrieu, Mark Beaumont, Mathieu Gerber, Anthony Lee, Vlad Tadic, Nick Whiteley**

## Multiscale Methods

In recent years multiscale methods have revolutionised the modelling and analysis of phenomena in a number of different disciplines. The ‘multiscale’ paradigm typically involves a multiscale representation, and then manipulation of that representation to achieve a desired goal. Practical applications include: modelling communications network traffic – such as queues on routers – and image compression. The JPEG 2000 image standard is based on wavelet compression, as is the FBI fingerprint database. In addition, wavelets have the ability to sparsify systems, and have transformed previously challenging problems into ones which admit elegant solutions, using methods of high-dimensional mathematics and statistics.
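To illustrate what sparsifying means, the sketch below applies the simplest wavelet transform, the unnormalised Haar transform, to a short piecewise-constant signal. This is an illustrative toy and not the wavelet scheme used in JPEG 2000; the function name `haar_transform` and the signal are assumptions made for the example.

```python
def haar_transform(signal):
    """Unnormalised Haar wavelet transform of a list whose length is a power of two.

    Returns [overall average, coarsest detail, ..., finest details].
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients at this scale go in front of finer-scale ones.
        details = [(a - b) / 2 for a, b in pairs] + details
        approx = [(a + b) / 2 for a, b in pairs]
    return approx + details


signal = [4, 4, 4, 4, 8, 8, 8, 8]   # piecewise constant, one jump
coeffs = haar_transform(signal)
print(coeffs)                        # [6.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(sum(c != 0 for c in coeffs))   # 2
```

Eight data values are summarised by two non-zero coefficients: the overall average and a single coarse-scale difference that locates the jump.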

Multiscale methods are often required in large-scale spatial statistics, where it is necessary to merge observational datasets having very different spatial footprints. Examples include GPS measurements made at a single point, LIDAR measurements made along a transect, and GRACE satellite measurements which average over an area of hundreds of square kilometres. These observational datasets are crucial in determining current and future sea levels, and understanding the impact of climate change.

Current research in the School includes ‘lifting’: using second-generation wavelets to tackle more realistic problems, where data are not uniformly spaced or arise on some complex manifold. Lifting methods provide computationally efficient methods of producing approximate wavelet coefficients, which share most of the attractive properties of first-generation wavelets, including sparsity, efficiency and the ability to manipulate objects and systems at multiple scales. Research also covers hierarchical multi-resolution methods (e.g. in the context of electricity demand forecasting using smart meter data).

**Staff in this area: Haeran Cho, Matteo Fasiolo, Arne Kovac, Guy Nason, Jonathan Rougier**

## Optimisation under Uncertainty

Classical optimisation deals with problems in which the objective function is precisely known, and the challenge is to develop efficient algorithms for problems with a large number of variables and constraints. But in many practical applications, either the parameters in the objective function are uncertain, or the function itself is unknown. Such problems require a combination of inferring the objective function by choosing actions and observing rewards (or costs), and optimising over the imperfectly inferred objective function.
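A multi-armed bandit is perhaps the simplest setting in which actions must be chosen both to learn about rewards and to collect them. The sketch below uses the epsilon-greedy heuristic, one of many possible strategies and not necessarily one studied in the School; the arm success probabilities, the number of rounds, and the value of `epsilon` are hypothetical.

```python
import random


def epsilon_greedy(success_probs, n_rounds, epsilon, rng):
    """Play a Bernoulli bandit whose arm success probabilities are hidden
    from the learner; return the pull counts and estimated mean rewards."""
    n_arms = len(success_probs)
    pulls = [0] * n_arms
    means = [0.0] * n_arms
    for _ in range(n_rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=lambda i: means[i])  # exploit
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]      # running average
    return pulls, means


rng = random.Random(1)
pulls, means = epsilon_greedy([0.2, 0.5, 0.8], n_rounds=5000, epsilon=0.1, rng=rng)
print(pulls)  # the arm with success probability 0.8 is expected to dominate
```

With probability `epsilon` the learner explores a random arm to refine its estimates; otherwise it exploits the arm that currently looks best.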

A common approach to such problems involves the use of probabilistic models to describe the objective function, and to provide a framework for jointly dealing with inference and optimisation. Moreover, many practical problems of this type are too large and complex to admit exact solutions. Therefore, a common approach is to derive bounds on the achievable performance of *any* algorithm, and to develop heuristics and show that these achieve performance close to the proven bounds.

Current research in the School touches upon multi-armed bandits, Markov decision processes and reinforcement learning, optimisation on random graphs, and applications to communications and computer science.

**Staff in this area: Ganesh Ayalvadi, Jonathan Rougier, Vlad Tadic**

## Statistical Genetics

Our evolutionary history lies embedded within our genome. Statistical genetics addresses questions such as: Have there been population bottlenecks or population explosions in the past? Has another group contributed to the genome of the current population? Is a certain genetic mutation under natural selection? The recent study showing roughly 2% of human DNA has Neanderthal origin, and strong genetic evidence that supports the out-of-Africa origin of modern humans, are both outstanding examples of how we can infer historical events using only a sample of DNA extracted at present.

Statistical genetics is highly interdisciplinary and draws on many fields, including genetics, molecular biology, computer science, statistics and probability theory. Huge datasets such as the UK Biobank and the 100,000 Genomes Project are becoming available as a result of low-cost genotyping and next-generation sequencing, and so this is yet another high-dimensional and highly-structured data challenge, where the structure comes from the complexity of how genetic patterns are shared within and across generations.

Current research in the School focuses on how to construct computationally-efficient models that extract information from sequence data or whole-genome datasets to infer population parameters. Examples include the selection coefficients of mutations, fine-scale population structure, and historical migration rates of one subgroup into another group, as well as looking for evidence of evolution and response to environmental change in the past, using whole-genome sequence data sampled from various sites and at various historical times. There are strong links with the Integrative Epidemiology Unit at Bristol, allowing methodology to be translated into application.

**Staff in this area: Mark Beaumont, Dan Lawson, Patrick Rubin-Delanchy, Feng Yu**

## Time Series Analysis

Time series are observations on variables indexed by time, or some other meaningful ordering. They are frequently collected in many areas such as finance, medicine, engineering, natural and social sciences. One example from the world of finance is daily quotes of share indices, such as the FTSE 100. Time series analysis aims at: (i) finding a model that provides a good description of the main features of the data, and (ii) given the model and the data, forecasting and/or controlling the future evolution of the process. These two stages of analysis often require the development of novel procedures and algorithms which depend on the particular problem at hand. One important and wide-ranging example is statistical signal processing, where modern statistical methods are applied across a variety of information engineering activities, such as telecommunications, target tracking, sensor data fusion, and signal and image processing.

Among many branches of time series analysis, change-point analysis allows some stochastic properties of the data to be time-varying. Change-point detection problems have a relatively long history dating back at least to World War 2. This area is now going through a renaissance due to the emergence of complex data types, for instance high-dimensional vectors, high-dimensional matrices and networks. In particular, there has been renewed interest in computationally fast and statistically efficient methods for change-point problems, in response to the emergence of large data sets observed in highly non-stationary environments. Current change-point research in the School covers theoretical properties of algorithms for a variety of models, and a range of applications including: the detection of DNA copy number aberrations in cancer research, structural break analysis in large financial datasets, and anomaly detection in computer networks which may indicate a cyber-attack.
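As a minimal sketch of change-point detection, the example below locates a single change in the mean of a univariate series by maximising the classical CUSUM statistic. It is a textbook baseline rather than any of the methods developed in the School; the function name `cusum_change_point` and the noise-free test series are assumptions made for the example.

```python
import math


def cusum_change_point(x):
    """Return (k, statistic) for the split x[:k] | x[k:] that maximises the
    CUSUM statistic for a single change in mean."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(x)
    best_k, best_stat = 1, -1.0
    left_sum = 0.0
    for k in range(1, n):
        left_sum += x[k - 1]
        # Difference between the mean before and the mean after the split.
        mean_gap = left_sum / k - (total - left_sum) / (n - k)
        stat = math.sqrt(k * (n - k) / n) * abs(mean_gap)
        if stat > best_stat:
            best_k, best_stat = k, stat
    return best_k, best_stat


# Hypothetical series: mean 0 for 50 points, then mean 5 for 50 points.
series = [0.0] * 50 + [5.0] * 50
print(cusum_change_point(series))  # (50, 25.0)
```

The weight sqrt(k(n − k)/n) balances the two segment lengths, so that splits near either end of the series are not favoured merely because one of the two means is estimated from very few points. With noisy data the estimated location is approximate, and a threshold on the statistic is needed to decide whether any change is present at all.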

Time series research in the School also includes: (i) work on non-stationary time series in terms of local autocovariance, partial autocovariance and spectral estimation, with applications and improvements to forecasting in these situations; and (ii) questions related to the time series sample rate: for example, is it possible to tell whether a series should be sampled at a faster rate, given a series at a particular rate, and whether this is cost-effective?

**Staff in this area: Haeran Cho, Tobias Kley, Guy Nason, Patrick Rubin-Delanchy, Vlad Tadic, Yi Yu, Nick Whiteley**