To achieve the best results in exploring target-disease biology, researchers regularly apply multiple approaches. Reviewing direct protein-disease relationships is a simple but powerful technique for gaining an initial view of the current situation.
Often a project team will move into a new disease area and want a summary of all the targets linked to that disease. There are also gene family initiatives, where one wishes to learn of all the associations between diseases and a particular family of proteins. Both are, of course, key questions in drug repurposing studies.
The links between these entities may be direct ([PROTEIN] causes [DISEASE]) or more distant, such as a relationship to something in a pathway that may be modulated by a drug aimed at another step. Direct gene/protein-disease associations are very powerful, but they only take you so far.
There are many ideas about how integrated data can take you further, but ultimately these still rely on direct protein/disease linkage.
How to build on these approaches is a question we often discuss with our customers: is it possible to delve deeper into the phenotypes associated with a disease?
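To make the two questions above concrete (all targets for a disease, all diseases for a protein family), here is a minimal sketch of indexing direct associations in both directions. The record layout and every identifier in it are placeholders we made up for illustration, not real associations or any particular product's data model.

```python
from collections import defaultdict

# Placeholder records (protein, protein family, disease); not real associations.
ASSOCIATIONS = [
    ("PROTEIN_A", "FAMILY_1", "DISEASE_X"),
    ("PROTEIN_B", "FAMILY_1", "DISEASE_Y"),
    ("PROTEIN_C", "FAMILY_2", "DISEASE_X"),
]

# Build one index per question so each lookup is a single dictionary access.
targets_by_disease = defaultdict(set)
diseases_by_family = defaultdict(set)
for protein, family, disease in ASSOCIATIONS:
    targets_by_disease[disease].add(protein)
    diseases_by_family[family].add(disease)

# "Which targets are linked to this disease?"
print(sorted(targets_by_disease["DISEASE_X"]))
# "Which diseases are linked to this protein family?"
print(sorted(diseases_by_family["FAMILY_1"]))
```

Both lookups only ever return what was recorded as a direct link, which is exactly the limitation described above.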
Phenotypes are phenomena such as ‘reduced T-cell function’, ‘increased fat absorption’ and so on. A number of use cases explore phenotypic data, and they all center upon the principle that a disease can be described in terms of its phenotypes, and that those phenotypes are a springboard into the deeper biological mechanism of the condition.
How do you search for and identify phenotypes? There are the well-established Mammalian Phenotype (MPO) and Human Phenotype (HPO) ontologies, which provide a very popular and powerful standard for tagging phenotypic data. The key question is whether they can be utilised by text analytics tools to automatically identify phenotypes in text.
If we take a look at the HPO itself, we can see it is a pretty comprehensive list (over 15,000 terms right now) with synonyms and definitions.
We can see that the HPO team have done a great job in defining these phenotypes, yet blindly applying the HPO in a naive text-matching tool will yield pretty low recall. Let’s take HP_00006479 – Abnormality of the dental pulp. Using a raw HPO export as a text-mining dictionary will match the label as written, but not the many ways authors rephrase it.
There are probably tens, if not hundreds, of variations of this phrase in real-world text.
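The recall problem is easy to reproduce. The sketch below is a deliberately naive matcher of the kind described above: it looks for the ontology label verbatim and nothing else. The three test sentences are invented for illustration, and the one-entry dictionary stands in for a raw HPO export.

```python
# Stand-in for a raw ontology export: identifier -> label, copied from the text above.
DICTIONARY = {
    "HP_00006479": "abnormality of the dental pulp",
}

def naive_match(text, dictionary):
    """Return IDs whose label occurs verbatim (case-insensitive) in the text."""
    lowered = text.lower()
    return [term_id for term_id, label in dictionary.items() if label in lowered]

# Invented sentences: the first uses the label as written, the others rephrase it.
sentences = [
    "Radiographs revealed an abnormality of the dental pulp.",
    "The dental pulp was abnormal in both molars.",
    "Pulp abnormalities were noted in the affected teeth.",
]

hits = [naive_match(sentence, DICTIONARY) for sentence in sentences]
for sentence, found in zip(sentences, hits):
    print(found, "<-", sentence)
```

Only the first sentence is tagged. The other two describe the same observation, but word order and inflection defeat an exact string match.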
At SciBite, our named entity recognition engine, TERMite, is designed to manage this variation and specifically understands phenotype-style language. It can take the HPO and MPO libraries as they are and automatically identify many different variations of these key phrases. Couple that with some considerable human curation and you have a pretty great tool for recognising HPO and MPO terms within text.
Going Beyond The Ontologies
Is that it? Are we done? Not quite! While the HPO and MPO are fantastic resources (they really are very impressive community efforts), we need to think more about what a phenotype actually is:
The set of observable characteristics of an individual resulting from the interaction of its genotype with the environment.
A phenotype is an observable characteristic, some kind of phenomenon or perturbation to a biological sub-system. If you agree with this, then the set of potential phenotypes is as close to infinite as makes no difference. Is ‘reduced lymphocyte activity’ different from ‘impaired lymphocyte activity’ and from ‘impaired lymphocyte function’?
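One way to see why this question matters is to try collapsing such phrases to a canonical form. The sketch below does so with two tiny equivalence tables. Those tables are our own assumptions for illustration; whether ‘impaired’ really means the same as ‘reduced’, or ‘activity’ the same as ‘function’, is exactly the judgment call in question.

```python
# Hypothetical equivalence tables; every mapping here is an editorial assumption.
QUALIFIERS = {
    "reduced": "decreased",
    "impaired": "decreased",
    "decreased": "decreased",
    "increased": "increased",
    "elevated": "increased",
}
ATTRIBUTES = {
    "activity": "function",
    "function": "function",
}

def canonical(phrase):
    """Map a 'qualifier entity attribute' phrase to a canonical triple, or None."""
    tokens = phrase.lower().split()
    if len(tokens) != 3:
        return None  # this sketch only handles three-word phrases
    qualifier, entity, attribute = tokens
    return (
        QUALIFIERS.get(qualifier, qualifier),
        entity,
        ATTRIBUTES.get(attribute, attribute),
    )

phrases = [
    "reduced lymphocyte activity",
    "impaired lymphocyte activity",
    "impaired lymphocyte function",
]
forms = {canonical(phrase) for phrase in phrases}
print(forms)
```

With these tables the three phrases collapse into one concept. Remove a single mapping and they split apart again, which shows how much the answer depends on curation choices.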
If we consider all the possible biological entities, systems, processes and concepts that could be involved, it’s a staggering number of potential phenotypes: too many to expect high-quality hand-curated ontologies to represent them all. Many will never be found in real life, but of course we don’t know up front which ones will and will not be reported.
The fact is that no matter how good your input vocabulary is (HPO, MPO or an internal list), it is never likely to cover all potential phenotypes. We’ve spent a lot of time understanding just what a phenotype is, and how to detect them.
Finding Phenotypes with SciBite
We have tried to address the issues discussed above by combining a very deep HPO/MPO search with a collection of algorithms specifically designed to identify potential phenotypes de novo, without the need for a dictionary. This means we can find words or phrases that look like phenotypes in the correct context and return these to the user. As a result, we have evolved from scanning the 12,000 or so catalogued phenotypes to the order of many millions. Our existing users can see both the ontology-derived and algorithmically derived results at the same time and send the portions of interest on for downstream analysis.
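To give a feel for what dictionary-free detection can mean, here is a toy sketch that flags phrases shaped like "qualifier + one or two words + attribute". It is not the SciBite algorithm, which is not described here. The word lists, the pattern and the test sentence are all assumptions chosen for illustration, reusing the two example phenotypes quoted earlier.

```python
import re

# Hypothetical cue words; a real system would need far richer context handling.
QUALIFIERS = ["reduced", "increased", "impaired", "abnormal", "elevated", "decreased"]
ATTRIBUTES = ["function", "activity", "absorption", "proliferation"]

# qualifier, then one or two intervening words, then an attribute noun
PATTERN = re.compile(
    r"\b(?:%s)\s+(?:[\w-]+\s+){1,2}(?:%s)\b"
    % ("|".join(QUALIFIERS), "|".join(ATTRIBUTES)),
    re.IGNORECASE,
)

def candidate_phenotypes(text):
    """Return lower-cased phrases that look like phenotypes; no dictionary lookup."""
    return [match.group(0).lower() for match in PATTERN.finditer(text)]

text = (
    "Knockout mice showed reduced T-cell function and "
    "increased fat absorption after feeding."
)
candidates = candidate_phenotypes(text)
print(candidates)
```

Neither phrase has to exist in an ontology to be found. The trade-off is precision: a pattern this loose will also pick up phrases that are not phenotypes, which is why context and curation still matter.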
In this way, we’re hoping to really start opening up the Disease –> Phenotype –> Rich Biology that’s currently locked away in scientific text, and to help generate new insight into many debilitating conditions.
If you’d like to talk more about our work on disease phenotypes or other similar challenges, please contact us on firstname.lastname@example.org. We look forward to seeing if we can help!