Data mining epidemiological relationships: integration of causal analysis with published evidence
Population health research is being transformed by the increasing wealth of complex data. New high-dimensional epidemiological datasets provide novel opportunities for systematic approaches to understanding the relationships between risk factors and disease outcomes. This programme will build on our successes in collating data and implementing software to automate causal inference using Mendelian randomization (MR-Base, www.mrbase.org), and in literature mining (MELODI, www.melodi.biocompute.org.uk). We will implement a new graph database (EpiGraphDB) that integrates causal estimates with comprehensive data on relationships between traits, risk factors, biomarkers, intervention targets and diseases. Using EpiGraphDB we will develop new methods to explore the relationships between risk factors and disease, enabling new causal hypotheses to be generated and explored.
Aims and Objectives
We aim to: (a) systematically integrate biological contextual information with causal estimates generated using Mendelian randomization; (b) develop novel approaches to identifying, validating and prioritising potential causal estimates in the context of a wide array of other information; (c) use our database to inform the development of new Mendelian randomization methods that address pleiotropy; and (d) apply the data and approaches from (a) to (c) to investigate causal risk factors in cardiovascular disease and cancer.
MR-Base (www.mrbase.org) comprises an openly accessible R package, a web application and a comprehensive database of genome-wide association studies (GWAS), the MRC-IEU GWAS database, which includes 3.4 billion genotype/phenotype association results from 974 GWAS published by 36 consortia. MR-Base enables automation of two-sample Mendelian randomization (MR) using 11 MR methods (including MRC-IEU methods that address pleiotropy).
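To make the idea of two-sample MR concrete, the following is a minimal sketch of the standard fixed-effect inverse-variance weighted (IVW) estimator, which combines per-variant Wald ratios from GWAS summary statistics. It is not the MR-Base implementation or its API: the function names `wald_ratio` and `ivw_estimate` are hypothetical, the input values are made-up numbers chosen only for illustration, and the sketch assumes independent variants and first-order weights (ignoring uncertainty in the variant-exposure associations).

```python
from math import sqrt


def wald_ratio(beta_exposure, beta_outcome):
    """Causal estimate from a single variant: outcome effect / exposure effect."""
    return beta_outcome / beta_exposure


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate across independent variants.

    Equivalent to a weighted regression of outcome effects on exposure
    effects through the origin, with weights 1 / se_outcome**2.
    Returns (estimate, standard_error).
    """
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx
                      for w, bx in zip(weights, beta_exposure))
    return numerator / denominator, sqrt(1.0 / denominator)


# Illustrative (made-up) summary statistics for three variants.
bx = [0.10, 0.20, 0.30]   # variant-exposure effects
by = [0.05, 0.10, 0.15]   # variant-outcome effects
se = [0.01, 0.01, 0.01]   # standard errors of the variant-outcome effects

estimate, std_err = ivw_estimate(bx, by, se)
print(f"IVW estimate: {estimate:.3f} (SE {std_err:.3f})")
```

Methods that address pleiotropy typically relax the assumption, implicit here, that every variant affects the outcome only through the exposure.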
LD Hub (ldsc.broadinstitute.org), developed in collaboration with Ben Neale (Broad Institute) and Dave Evans (University of Queensland), is a web application that performs linkage disequilibrium (LD) score regression. This enables us to automatically estimate the genetic correlations between a wide range of traits and diseases.
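As a rough illustration of how LD score regression yields a genetic correlation, the sketch below regresses per-variant test statistics on LD scores: the slope for each trait's chi-square statistics is proportional to its heritability, and the slope for the product of the two traits' z-scores is proportional to their genetic covariance. This is a deliberately simplified sketch, not LD Hub's code: it uses unweighted least squares, whereas the published method uses weighted regression with block-jackknife standard errors. The function names and the synthetic inputs are assumptions made for illustration.

```python
from math import sqrt


def ols_slope(x, y):
    """Slope of an ordinary least-squares fit of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def genetic_correlation(ld_scores, z1, z2, n1, n2, m):
    """Simplified genetic correlation from LD score regression slopes.

    ld_scores: LD score of each variant
    z1, z2:    z-scores of each variant for trait 1 and trait 2
    n1, n2:    GWAS sample sizes; m: number of variants
    """
    chi2_1 = [z * z for z in z1]
    chi2_2 = [z * z for z in z2]
    z_product = [a * b for a, b in zip(z1, z2)]

    # Slope of chi-square on LD score is N * h2 / M for each trait.
    h2_1 = ols_slope(ld_scores, chi2_1) * m / n1
    h2_2 = ols_slope(ld_scores, chi2_2) * m / n2
    # Slope of z1 * z2 on LD score is sqrt(N1 * N2) * rho_g / M.
    rho_g = ols_slope(ld_scores, z_product) * m / sqrt(n1 * n2)

    return rho_g / sqrt(h2_1 * h2_2)


# Synthetic, noise-free inputs: trait 2 z-scores are proportional to trait 1.
ld = [1.0, 2.0, 3.0, 4.0]
z_trait1 = [sqrt(1.0 + 0.5 * score) for score in ld]
z_trait2 = [2.0 * z for z in z_trait1]
print(genetic_correlation(ld, z_trait1, z_trait2, n1=1000, n2=1000, m=len(ld)))
```

With real data the heritability estimates can be noisy or negative, which is one reason the weighted, jackknifed procedure is used in practice.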
MELODI (www.melodi.biocompute.org.uk) mines the biomedical literature for mechanistic relationships between epidemiological concepts using a graph database (Neo4j, neo4j.com) that incorporates concepts (phenotypes, diseases, etc.) from MeSH (Medical Subject Headings) and from SemMedDB, a database of semantic predications extracted from the literature using the SemRep text-mining software.
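The kind of query a concept graph supports can be sketched as follows: literature-derived statements are stored as subject-predicate-object triples, and candidate mechanisms are found by looking for intermediate concepts that link an exposure to an outcome in two steps. This is an in-memory sketch only, not MELODI's schema or its Neo4j queries; the function names, concept names and predicate labels are all placeholders invented for illustration and do not represent real findings.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (subject, predicate, object) triples by subject."""
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        graph[subject].add((predicate, obj))
    return graph


def find_intermediates(graph, source, target):
    """Return two-step paths source -> intermediate -> target.

    Each path is (source, predicate_1, intermediate, predicate_2, target).
    """
    paths = []
    for predicate_1, intermediate in sorted(graph.get(source, ())):
        for predicate_2, obj in sorted(graph.get(intermediate, ())):
            if obj == target:
                paths.append(
                    (source, predicate_1, intermediate, predicate_2, target))
    return paths


# Placeholder triples; names and predicates are hypothetical.
triples = [
    ("risk_factor_X", "AFFECTS", "biomarker_1"),
    ("biomarker_1", "CAUSES", "disease_Y"),
    ("risk_factor_X", "ASSOCIATED_WITH", "biomarker_2"),
    ("biomarker_2", "AFFECTS", "other_trait"),
]

graph = build_graph(triples)
for path in find_intermediates(graph, "risk_factor_X", "disease_Y"):
    print(" -> ".join(path))
```

In a graph database the same question is expressed as a path pattern, which scales to longer paths and much larger concept sets than this nested loop.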