Disease Detective Part 2: Exploring mechanistically-related diseases through shared phenotypic profiles
28th February 2017

Author: SciBite Team

In Part 1 of our series on Rare Disease Day, we explored how SciBite’s technology can help to identify experts across the globe – an essential part of furthering treatment and therapies for rare diseases.

In Part 2, we’ll look at a fresh way of enabling scientific researchers, whether in pharmaceutical R&D or in medical institutes, to deepen their investigations and consider new links.

What’s the problem?

One of the biggest headaches a researcher faces is the huge volume of published literature they need to mine.  The conundrum is how to quickly identify the most important and relevant points.  Fast and accurate distillation is key.

Now, text mining is already out there, so you may be wondering what it is that SciBite can bring to the semantic analytics arena.

We offer a two-pronged solution: our high-quality VOCabs (hand-curated ontologies tailored to the scientific domain) and our super-fast TERMite engine, which liberates data that might otherwise have remained buried.  The advantages of this over deep NLP approaches are:

a) the simplicity with which we can identify concepts among multiple sources and formats

b) the ease with which you can plug this into your existing systems

And the results?

  1. It enables you to find direct links in literature more readily
  2. You’re able to find new links which may have never been previously (or explicitly) stated
  3. You gain a better understanding of the mechanisms behind disease - unravelling how and why someone gets it, its behaviour, development, what it looks like and its weak spot.

That gives you the start of a journey that could lead to approaches such as gene therapy and, eventually, a potential treatment.

Understanding the disease is key

So let’s demonstrate this technology on a real-life rare disease and its related conditions.

Friedreich’s Ataxia is described on the Rare Disease Day website as:

“…a genetic, progressive, neurodegenerative movement disorder, with a mean age of onset between 10 and 15 years. Initial symptoms may include unsteady posture, frequent falling, and progressive difficulty walking due to impaired ability to coordinate voluntary movements (ataxia).”

What we’re aiming for here is a better characterisation of this rare disease based on its similarities to more widely understood conditions.

Step 1: 

We ran TERMite across 25 million Medline abstracts and extracted co-occurring pairs of conditions and clinical signs.

TERMite results from Medline abstracts
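
As a rough illustration of what "co-occurring pairs" means here, the sketch below counts how many abstracts mention both a condition and a clinical sign. It is not TERMite's API or output format: the `conditions` and `signs` keys, the function name and the toy records are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def count_cooccurrences(annotated_abstracts):
    """Count, per (condition, sign) pair, the abstracts mentioning both."""
    pair_counts = Counter()
    for abstract in annotated_abstracts:
        # Sets ensure a pair is counted once per abstract, however often
        # each term is repeated in the text.
        conditions = set(abstract["conditions"])
        signs = set(abstract["signs"])
        pair_counts.update(product(conditions, signs))
    return pair_counts


# Toy annotations, invented for illustration only (hypothetical record layout).
abstracts = [
    {"conditions": ["Friedreich's Ataxia"], "signs": ["ataxia", "frequent falling"]},
    {"conditions": ["Friedreich's Ataxia", "Condition B"], "signs": ["ataxia", "ataxia"]},
    {"conditions": ["Condition B"], "signs": ["sign X"]},
]

pair_counts = count_cooccurrences(abstracts)
for pair, count in pair_counts.most_common():
    print(pair, count)
```

Counting once per abstract, rather than once per mention, keeps a single long paper from dominating the totals.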

Step 2:

We performed a statistical analysis of the results to score the most scientifically interesting relationships.  This enables the user to filter results and avoid those unwanted network hairballs, where so many links are drawn that nothing useful can be read from the graph.
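
The post does not say which statistic was used for this scoring step. One common choice for co-occurrence data is pointwise mutual information (PMI), which compares how often a pair is seen together with how often it would be expected by chance. The sketch below uses PMI purely as an assumption; the function name, the `min_count` threshold and the toy counts are all hypothetical.

```python
import math


def pmi_scores(pair_counts, condition_counts, sign_counts, n_docs, min_count=2):
    """Score each (condition, sign) pair with pointwise mutual information.

    PMI = log2(observed co-occurrences / co-occurrences expected by chance).
    Pairs seen fewer than `min_count` times are dropped as too noisy.
    """
    scores = {}
    for (condition, sign), n_pair in pair_counts.items():
        if n_pair < min_count:
            continue
        expected = condition_counts[condition] * sign_counts[sign] / n_docs
        scores[(condition, sign)] = math.log2(n_pair / expected)
    return scores


# Toy counts, invented for illustration only.
n_docs = 100
condition_counts = {"Condition A": 10}
sign_counts = {"specific sign": 10, "common sign": 50, "rare sign": 1}
pair_counts = {
    ("Condition A", "specific sign"): 5,
    ("Condition A", "common sign"): 5,
    ("Condition A", "rare sign"): 1,
}

scores = pmi_scores(pair_counts, condition_counts, sign_counts, n_docs)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy data both signs co-occur with the condition five times, but only the specific sign scores above zero, because the common sign appears in half of all documents anyway. Filtering on a score like this is one way to thin out a graph before drawing it.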


Step 3:

We then loaded the results into a Neo4j graph database, providing us with scalable and flexible retrieval. 
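
For readers who want to try something similar, here is a minimal sketch of how scored pairs could be turned into a Cypher load statement. The node labels (`Condition`, `Phenotype`), the relationship type (`HAS_PHENOTYPE`), the `score` property and the example values are assumptions for illustration, not the schema actually used. The block only prepares the query and its parameters; the commented lines show how they could be sent with the official neo4j Python driver if a database is running.

```python
# Cypher statement: one MERGE per node and relationship, so that
# re-running the load does not create duplicates.
LOAD_QUERY = """
UNWIND $rows AS row
MERGE (c:Condition {name: row.condition})
MERGE (p:Phenotype {name: row.phenotype})
MERGE (c)-[r:HAS_PHENOTYPE]->(p)
SET r.score = row.score
"""


def to_rows(scores):
    """Convert {(condition, phenotype): score} into query parameter rows."""
    return [
        {"condition": condition, "phenotype": phenotype, "score": score}
        for (condition, phenotype), score in sorted(scores.items())
    ]


# Toy scores, invented for illustration only.
rows = to_rows({
    ("Friedreich's Ataxia", "ataxia"): 2.5,
    ("Condition B", "ataxia"): 1.2,
})
print(rows)

# Sending the batch (requires the `neo4j` package and a running server;
# the URI and credentials below are placeholders):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
#   with driver.session() as session:
#       session.run(LOAD_QUERY, rows=rows)
#   driver.close()
```

Passing all rows as a single `$rows` parameter and unwinding them server-side avoids one round trip per relationship, which matters at the scale of millions of abstracts.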

Step 4:

Here you can see an initial visualisation of the graph database using Linkurious.  The image below shows the major phenotypes associated with Friedreich’s Ataxia.


Step 5:

Now, let’s interrogate this knowledgebase.

How Friedreich’s Ataxia shares multiple phenotypes with Huntington’s Disease

Now that we can calculate the major phenotypes associated with thousands of conditions, we can compare their phenotype profiles and apply similarity scoring algorithms.
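
The post does not name the similarity scoring algorithms applied. A simple stand-in is the Jaccard index: the number of phenotypes two conditions share, divided by the number of phenotypes either of them has. The sketch below ranks conditions against a query condition on that basis; the function names, condition labels and phenotype sets are placeholders invented for this example.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union| (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar(query, profiles):
    """Rank all other conditions by phenotype-profile similarity to `query`."""
    ranked = [
        (name, jaccard(profiles[query], phenotypes))
        for name, phenotypes in profiles.items()
        if name != query
    ]
    # Highest similarity first; drop conditions with nothing in common.
    return sorted(
        [(name, score) for name, score in ranked if score > 0],
        key=lambda item: -item[1],
    )


# Placeholder phenotype profiles, invented for illustration only.
profiles = {
    "Friedreich's Ataxia": {"phenotype 1", "phenotype 2", "phenotype 3", "phenotype 4"},
    "Condition B": {"phenotype 1", "phenotype 2", "phenotype 3", "phenotype 5"},
    "Condition C": {"phenotype 1", "phenotype 6"},
    "Condition D": {"phenotype 7"},
}

ranking = rank_similar("Friedreich's Ataxia", profiles)
for name, score in ranking:
    print(name, round(score, 2))
```

A weighted measure, such as cosine similarity over the association scores, would let strongly associated phenotypes count for more than weak ones; Jaccard is used here only because it is the easiest to read.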

The next image shows the conditions that have the most similar phenotype profiles to Friedreich’s Ataxia:

Indications related by similar phenotype profiles. The numbers on the grey lines represent the relative similarity score for each pair of conditions

We can also export the data as a list of the related indications and their major shared phenotypes (from the Neo4j interface into Excel).
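
If the phenotype profiles are available outside the graph database, a similar list can be produced as a CSV file, which Excel opens directly. This is a minimal sketch under that assumption and not the Neo4j export itself; the column names, function name and profiles are hypothetical.

```python
import csv
import io


def export_related(query, profiles, out):
    """Write one CSV row per condition sharing phenotypes with `query`."""
    writer = csv.writer(out)
    writer.writerow(["related_indication", "shared_phenotypes"])
    for name, phenotypes in sorted(profiles.items()):
        if name == query:
            continue
        shared = sorted(profiles[query] & phenotypes)
        if shared:
            writer.writerow([name, "; ".join(shared)])


# Placeholder profiles, invented for illustration only.
profiles = {
    "Friedreich's Ataxia": {"phenotype 1", "phenotype 2", "phenotype 3"},
    "Condition B": {"phenotype 1", "phenotype 2"},
    "Condition C": {"phenotype 9"},
}

# Written to an in-memory buffer here; pass an open file to save to disk.
buffer = io.StringIO()
export_related("Friedreich's Ataxia", profiles, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```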


If you’re an expert in the field, you may be thinking that many of these indications are well known, but keep scanning down the list - less well known information may become apparent.

Let me make this clear - this was all worked out by the computer with no prior knowledge of the condition: a computer which can now also characterise thousands of other conditions in the same way.

Exploiting the power of this analysis

So now it’s time to explore the associated genes for these phenotypically related conditions. By doing this, we’ll get an idea of where there are knowledge gaps for how these conditions might be mechanistically related. We can also show potential areas where these gaps might be filled. 

By overlaying gene association data from DisGeNET, we can see some conditions with many known gene associations. However, for Friedreich’s Ataxia, there is only one - frataxin (FXN).


Are there any conditions with lots of gene associations? Yes - you can see Peripheral Neuropathies has a huge number of associated genes – these are linked because of the sheer amount of research done in this area.

By contrast, take a look at our Friedreich’s Ataxia.  There are clearly huge gaps in mechanistic understanding and we can see that there’s not a great deal of investigation.

Going back to FXN, and to help get an idea of where it might fit in with the other gene/protein entities displayed on the graph, we added in protein-protein interaction data from iRefIndex. This fills in some of the gaps from the above image and we now see FXN interacting with several genes that are known to be associated with phenotypically related conditions. In doing so, we’re building up a picture of related conditions and their underlying genetic mechanisms.
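
The idea of overlaying the two datasets can be sketched as a simple join: take the interaction partners of a seed gene, then keep those that are associated with at least one phenotypically related condition. The function name and data layout below are assumptions, and this is not the DisGeNET or iRefIndex file format. The FXN, PASK and Peripheral Neuropathies links follow the example discussed in this post; `GENE_X`, `GENE_Y`, `GENE_Z` and "Unrelated Condition" are made-up placeholders.

```python
from collections import defaultdict


def bridging_genes(seed_gene, interactions, gene_disease, related_conditions):
    """Find interactors of `seed_gene` associated with related conditions.

    interactions: iterable of undirected (gene_a, gene_b) pairs
    gene_disease: iterable of (gene, condition) associations
    Returns {gene: set of related conditions it is associated with}.
    """
    neighbours = set()
    for gene_a, gene_b in interactions:
        if gene_a == seed_gene:
            neighbours.add(gene_b)
        elif gene_b == seed_gene:
            neighbours.add(gene_a)
    neighbours.discard(seed_gene)  # ignore self-interactions

    conditions_by_gene = defaultdict(set)
    for gene, condition in gene_disease:
        conditions_by_gene[gene].add(condition)

    related = set(related_conditions)
    result = {}
    for gene in neighbours:
        hits = conditions_by_gene.get(gene, set()) & related
        if hits:
            result[gene] = hits
    return result


# Toy data: only the FXN/PASK rows mirror the post; the rest is invented.
interactions = [("FXN", "PASK"), ("GENE_X", "FXN"), ("GENE_Y", "GENE_Z")]
gene_disease = [
    ("FXN", "Friedreich's Ataxia"),
    ("PASK", "Peripheral Neuropathies"),
    ("GENE_X", "Unrelated Condition"),
    ("GENE_Y", "Peripheral Neuropathies"),
]

candidates = bridging_genes(
    "FXN", interactions, gene_disease, ["Peripheral Neuropathies"]
)
print(candidates)
```

Here `GENE_X` is dropped because its only association is with an unrelated condition, and `GENE_Y` is dropped because it does not interact with FXN, which leaves PASK as the only candidate.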


The incredibly useful thing about this method is that we’ve brought together three sets of data:

  1. Text-mined data from Medline (courtesy of TERMite) – seen here in yellow lines
  2. Gene-disease associations from DisGeNET – pink lines
  3. Protein-protein interaction data from iRefIndex – orange lines

Once some interesting and plausible hypotheses have been derived from the graphs, an individual can help to drive research in new directions.

For example, the gene entity PASK (PAS domain containing serine/threonine kinase), seen on the image above, interacts with FXN and is also known to be associated with Peripheral Neuropathies, which the analysis identified as one of the conditions most phenotypically similar to Friedreich’s Ataxia. Similarly, SDHA (succinate dehydrogenase complex, subunit A – you can see why it’s shortened!) is linked to a number of related conditions.

Could this be a new area of research?

What we love at SciBite about using our software in conjunction with Linkurious in this way is exactly that – opening up new possibilities.  And opening them up quickly, in an easily readable, presentable and accessible format, leaving researchers more time to, well, research.

We’ve written a White Paper on how we used a combination of network analysis and Machine Learning techniques to liberate data buried in millions of Medline abstracts.  To find out more about our work and how we could best help you, please contact us with your name, contact details and your organisation.  We’d love to hear from you.
