New data mining tool MELODI can point researchers in the most promising direction

23 January 2018

A new data mining tool known as MELODI (Mining Enriched Literature Objects to Derive Intermediates) has been developed by researchers at the MRC IEU and published in the International Journal of Epidemiology. MELODI can search through two lists of articles to find known and unknown mechanisms linking two biomedical concepts. This can aid researchers in forming hypotheses and carrying out further research in the most promising direction for disease treatment and prevention.

Scientific research creates vast amounts of literature on potential mechanisms linking risk factors to disease. Traditional ways of sorting through this information include manually filtering articles, selecting them by impact factor or citation count, and relying on media reports or word of mouth. These methods can introduce bias and may mean that resources and research are not targeted towards the most promising mechanisms. MELODI provides a new method for selecting information based on occurrences of biomedical terms within sets of articles.

MELODI operates by comparing two sets of articles. These sets can be generated by searching PubMed for a defined term (e.g. ‘body mass index’) – which can be carried out in MELODI – or by manually curating two lists of articles relating to defined biomedical topics.

MELODI also enriches the analysed texts to increase search efficacy. This is done by comparing the number of times a term occurs within the two sets of articles to the background rate in the entire database. This means MELODI promotes terms that appear more frequently than would be expected by chance, so commonly used terms such as ‘patients’ or ‘cells’ will not compete with potential intermediates.
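The enrichment idea described above can be illustrated with a minimal Python sketch. This is not MELODI's actual implementation: the function names, the simple rate-ratio score, the threshold of 2.0 and all of the counts below are assumptions made up for illustration, and a real tool would more likely use a formal statistical test. The sketch scores each term by how much more often it appears in an article set than in the background, then keeps the terms that are enriched in both sets as candidate intermediates.

```python
def enrichment(term_counts, n_articles, background_counts, n_background):
    """Score each term by its rate in the article set divided by its background rate."""
    scores = {}
    for term, count in term_counts.items():
        background = background_counts.get(term, 0)
        if background == 0:
            continue  # no background rate to compare against
        set_rate = count / n_articles
        background_rate = background / n_background
        scores[term] = set_rate / background_rate
    return scores


def shared_enriched_terms(scores_a, scores_b, threshold=2.0):
    """Return terms whose score meets the threshold in both article sets."""
    return sorted(
        term
        for term, score in scores_a.items()
        if score >= threshold and scores_b.get(term, 0.0) >= threshold
    )


# Hypothetical counts: number of articles mentioning each term.
background_counts = {"patients": 500, "cells": 400, "insulin": 20}
n_background = 1000

set_a_counts = {"patients": 5, "cells": 4, "insulin": 4}  # e.g. articles on an exposure
set_b_counts = {"patients": 6, "insulin": 3}              # e.g. articles on a disease

scores_a = enrichment(set_a_counts, 10, background_counts, n_background)
scores_b = enrichment(set_b_counts, 10, background_counts, n_background)

candidates = shared_enriched_terms(scores_a, scores_b)
print(candidates)
```

In this toy data, ‘patients’ and ‘cells’ occur at roughly their background rate, so they score close to 1 and are filtered out, while ‘insulin’ is far more frequent in both sets than in the background and is kept.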

Epidemiological studies in particular may benefit from data mining, as MELODI can identify possible mechanisms connecting exposures and outcomes (e.g. alcohol intake and liver cancer), which can then be followed up by Mendelian Randomisation (MR) to determine whether these mechanisms are causal. MELODI can also be used to investigate how known causal mechanisms operate (e.g. how increased alcohol intake increases the risk of liver cancer).

The researchers used two case studies to demonstrate the benefits of MELODI. Firstly, MELODI identified the gene SP1 as a potential intermediate between the gene ERG and prostate cancer, showing how it can find novel intermediates. Secondly, it revealed potential mechanisms underlying the causal relationship (identified using MR) between carnitine and pancreatic cancer, including the potential role of insulin and/or fatty acid oxidation. These findings can pave the way for future research.

However, there are still some limitations MELODI is unable to overcome. Firstly, the way articles are structured and the order in which terms are used can sometimes limit the effectiveness of data mining. Future implementation of machine learning and natural language processing may be able to overcome this. Secondly, the published literature which MELODI analyses may also be subject to bias, due partly to negative results often going unpublished. This means MELODI may present a biased picture of what is known about a topic.

Despite these limitations, MELODI remains a powerful tool for pointing researchers in the right direction and further investigating the mechanisms of disease.

Dr Benjamin Elsworth, the study’s lead author and senior research associate at the MRC IEU, said: "The volume of biomedical literature is growing rapidly with over 1 million articles published every year. The task of manually reading and summarising this data for a single concept is unthinkable, and identifying overlapping terms between concepts, impossible. However, within such a vast dataset there may be hidden nuggets of information, perhaps published in disparate journals, even in different decades, that can be used to derive novel mechanisms linking two concepts. MELODI has proved this is indeed possible, and is being used by researchers all over the world to explore more of the data within the scientific literature, uncovering connections never seen before."

The full study can be found here:

