Bayesian inference on high-dimensional Seemingly Unrelated Regression, applied to metabolomics data
23rd March 2018, 3:30 pm – 4:30 pm
Main Maths Building, SM3
Increasingly, epidemiologists are collecting multiple high-dimensional molecular data sets on large cohorts of people. The interest is in finding associations between these data sets and with genetic variants. To do this effectively, these multivariate data sets should be modelled jointly, taking into account correlations in the data.
Sparse solutions are usually required, and performing variable selection in this setting is critical.
We present a Bayesian Seemingly Unrelated Regressions (SUR) model for associating metabolomics outcomes with genetic variants,
allowing for both sparse variable selection and sparse covariance between the outcomes.
This model can be fit using a Gibbs sampler, but this quickly becomes computationally infeasible as
the dimensions of the problem grow. Previous work has relied on one of two alternative simplifying assumptions,
either assuming independence between the outcomes (Bottolo et al. 2011, Lewin et al. 2015)
or selecting predictors jointly for all the outcomes (Bhadra and Mallick 2013, Bottolo et al. 2013).
In both of these simplified cases, conjugate priors can be used on the regression coefficients and variances/covariances.
In order to overcome some of the computational difficulty with the general SUR model,
we use a reparametrisation of the model in which the likelihood factorises completely into a product of
conditional distributions, and build an MCMC sampler capable of handling real molecular biology data involving
hundreds or thousands of responses.
We extend previous work in this direction by allowing for a more general prior distribution that permits sparse covariance estimation
through the use of the reparametrised Hyper-Inverse-Wishart distribution, and we show that it is possible to improve considerably
the computational aspects of the method. Structural inference alone, on both variable and covariance selection, can be performed via pseudo-marginal MCMC.
The proposed method is applied to simulated data to illustrate the computational gains, and is further demonstrated on a metabolomics
(highly structured, with strong correlations) versus genetic variants data set from the Northern Finland Birth Cohort.
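
The factorisation of the likelihood into conditional distributions mentioned in the abstract can be illustrated in the simplest possible setting. The sketch below is not the speakers' implementation; it only checks numerically, for two outcomes, that the joint Gaussian density of the residuals equals the product of a marginal and a conditional univariate density. All function names and numbers are hypothetical and chosen for illustration.

```python
import math


def normal_logpdf(x, mean, var):
    """Log density of a univariate normal N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def joint_logpdf(u1, u2, s11, s12, s22):
    """Log density of a zero-mean bivariate normal, computed directly
    from the 2x2 covariance matrix [[s11, s12], [s12, s22]]."""
    det = s11 * s22 - s12 ** 2
    quad = (s22 * u1 ** 2 - 2.0 * s12 * u1 * u2 + s11 * u2 ** 2) / det
    return -math.log(2.0 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


def factorised_logpdf(u1, u2, s11, s12, s22):
    """The same log density written as log p(u1) + log p(u2 | u1)."""
    slope = s12 / s11                # coefficient of u1 in the regression of u2 on u1
    cond_var = s22 - s12 ** 2 / s11  # variance of u2 given u1
    return normal_logpdf(u1, 0.0, s11) + normal_logpdf(u2, slope * u1, cond_var)


# Made-up residuals and covariance entries, for illustration only.
u1, u2 = 0.7, -1.2
s11, s12, s22 = 2.0, 0.8, 1.5

direct = joint_logpdf(u1, u2, s11, s12, s22)
factorised = factorised_logpdf(u1, u2, s11, s12, s22)

# Both routes give the same value up to floating-point rounding.
assert math.isclose(direct, factorised, rel_tol=1e-9)
```

With more than two outcomes the same idea applies sequentially: each outcome's residual is regressed on the residuals of the outcomes before it, so the joint density becomes a product of univariate terms.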