High Performance Ontology Engineering
11th July 2016
Author: Lee Harland
One of the key aims of SciBite is to help our customers work with public ontologies in text mining applications. While these ontologies are very valuable resources, they are often built for the purpose of data organisation, not text mining. Relying on unmodified public ontologies for text mining will often lead to very poor results.
How to improve the outputs? One of SciBite's key offerings is our collection of high-performance ontologies fine-tuned to achieve the best possible results in text mining applications. We have developed a comprehensive workflow for expert and algorithm-based dictionary enrichment resulting in a collection of over 50 biomedical topics with over 20 million entries. This enormous reference library of terms is complemented by the advanced linguistic processing within our TERMite engine.
The NCBI Taxonomy classification of organisms is a typical example. Many use cases in the analysis of biomedical literature require organism identification and, as the most comprehensive resource in this area, NCBI Taxonomy is often seen as a perfect fit. However, as noted above, this resource is built for classification, not for text mining. This leads to some significant hurdles when applying it to text mining, such as those below.
Challenge 1 - “Simple” Synonym Coverage
Helicobacter cinaedi (a bacterium formerly known as Campylobacter cinaedi) can cause bacteremia in immunocompromised people. If we look this up in NCBI Taxon we get a pretty good result, providing both its new and legacy name:
The abbreviated forms of organism names (such as H. cinaedi) are used extensively in the literature but will not be found using the terms above. OK, that’s not a big problem: can’t we just look for [first initial] [second word] to cover this?
It’s here that the ambiguity of science steps in to prove it’s never quite as easy as that. Let’s take science’s favourite insect, the fruit fly, Drosophila melanogaster (TAXON7227). If we simply take the first word and abbreviate it to D. melanogaster, we find that it is commonly used in the literature. But what about the damselfly, TAXON1219116?
Using this approach, we’d create D. melanogaster for this organism too, so all of the same papers would map to the damselfly as well, which would of course be incorrect. Thus, our simple example has become a lot more complex.
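The collision described above can be sketched in a few lines of Python. This is only an illustration of the naive [first initial] [second word] rule, not how TERMite works. The two taxon IDs come from the text; the damselfly genus name and the key used for Helicobacter cinaedi are placeholders invented for the example.

```python
from collections import defaultdict

# Illustrative data only. The two TAXON IDs are quoted in the post; the genus
# "Damselflygenus" and the key "EXAMPLE_H_CINAEDI" are made-up placeholders.
SPECIES = {
    "TAXON7227": "Drosophila melanogaster",
    "TAXON1219116": "Damselflygenus melanogaster",  # placeholder genus name
    "EXAMPLE_H_CINAEDI": "Helicobacter cinaedi",    # placeholder identifier
}


def abbreviate(name):
    """Return '[first initial]. [rest of name]', or None for single-word names."""
    parts = name.split()
    if len(parts) < 2:
        return None
    return f"{parts[0][0]}. {' '.join(parts[1:])}"


def find_collisions(species):
    """Map each abbreviation shared by more than one taxon to the taxon IDs."""
    by_abbrev = defaultdict(list)
    for taxon_id, name in species.items():
        abbrev = abbreviate(name)
        if abbrev is not None:
            by_abbrev[abbrev].append(taxon_id)
    return {a: sorted(ids) for a, ids in by_abbrev.items() if len(ids) > 1}


collisions = find_collisions(SPECIES)
for abbrev, taxon_ids in collisions.items():
    print(f"{abbrev} is ambiguous between: {', '.join(taxon_ids)}")
```

Any abbreviation that appears in the collision map cannot be assigned to a single organism by string matching alone; it needs context-based disambiguation.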
Fortunately, our TERMite text-analysis engine is designed for just this scenario. Its advanced ambiguity processing capabilities will ensure that the right melanogaster is found.
Challenge 2 – The Usual Suspect
Like any resource not designed for text mining, NCBI Taxonomy has its fair share of ‘difficult’ synonyms.
- TAXON274808 Bathymaster signatus. Common name: “searcher”
- TAXON71765 Lavinia exilicauda. Common name: “hitch”
- TAXON143334 Poromitra oscitans. Common name: “yawning”
We’ve also got a lot of names of (mostly female) people and places in there too, particularly with genus level entries such as:
- TAXON381266 Alexandra
- TAXON34586 Katharina
- TAXON63672 Turbo and in fact, TAXON538939 Turbo and even TAXON538943 unclassified Turbo
There are literally thousands of these in the data, and they will produce a lot of noise if used without prior curation. With a database of 750,000 entries, one cannot simply load it into Excel and spend the night on the sofa checking each synonym.
SciBite has advanced dictionary production tools that rapidly screen for these issues and allocate the correct processing instruction to TERMite, thus removing millions of potential false positives from the results. When you run our version of the Taxonomy ontology, you can be confident you’ll be seeing a lot fewer of these ladies in your results!
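To give a feel for what automated screening of this kind might look like, here is a minimal Python sketch. It is not SciBite's tooling: the two word sets are tiny stand-ins for the real reference lists (an English word list, a list of given names) that a production screen would need, and the flag labels are invented for the example. The taxon entries are the ones quoted above.

```python
# Stand-in reference lists; a real screen would load far larger resources.
COMMON_WORDS = {"searcher", "hitch", "yawning", "turbo"}
GIVEN_NAMES = {"alexandra", "katharina"}


def flag_synonym(synonym):
    """Return a hypothetical flag describing why a synonym is risky, or 'ok'."""
    key = synonym.strip().lower()
    if key in COMMON_WORDS:
        return "common_word"
    if key in GIVEN_NAMES:
        return "given_name"
    return "ok"


# (taxon ID, synonym) pairs taken from the examples in the post.
ENTRIES = [
    ("TAXON274808", "Bathymaster signatus"),
    ("TAXON274808", "searcher"),
    ("TAXON71765", "hitch"),
    ("TAXON143334", "yawning"),
    ("TAXON381266", "Alexandra"),
    ("TAXON34586", "Katharina"),
]

# Keep only the synonyms that need special handling before text mining.
flagged = [
    (taxon_id, synonym, flag_synonym(synonym))
    for taxon_id, synonym in ENTRIES
    if flag_synonym(synonym) != "ok"
]

for taxon_id, synonym, flag in flagged:
    print(f"{taxon_id}\t{synonym}\t{flag}")
```

A flagged synonym need not be deleted; it could instead be marked as requiring supporting context (for example, the scientific name appearing in the same document) before a match is accepted.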
Challenge 3 – Fitting Your Purpose
Finally, there are a lot of entries in there that make sense from a taxonomy perspective but (arguably) aren’t really suited to text mining. Some examples include:
- TAXON32644 unidentified
- TAXON87828 environmental samples
- TAXON12440 Non-A, non-B hepatitis virus
- TAXON2387 transposons
There are also large blocks of organisms derived from sequence records which are highly unlikely to be of interest apart from some very specialised applications. For example, there are thousands of entries such as this.
Clearly, we won’t be finding these in text. At best, they bulk up the dictionary and reduce processing speed; at worst, they will introduce the dreaded false positive results. Again, our dictionary processing experience comes into play, helping to identify and remove these entries and the potential issues they could introduce.
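Pruning of this sort can be sketched as a simple exclusion filter. The Python below is an assumption-laden illustration rather than SciBite's actual process: the exclusion set is built only from the names listed above, and the single regular expression (names starting with "unclassified") is an example rule chosen for the sketch. A real filter would need many more rules, including ones for the sequence-derived entries.

```python
import re

# Exact names (lower-cased) taken from the examples in the post.
EXCLUDED_NAMES = {
    "unidentified",
    "environmental samples",
    "non-a, non-b hepatitis virus",
    "transposons",
}

# Example pattern rule; chosen for illustration only.
EXCLUDED_PATTERNS = [re.compile(r"^unclassified\b")]


def keep_entry(name):
    """Return True if the entry name looks useful for text mining."""
    key = name.strip().lower()
    if key in EXCLUDED_NAMES:
        return False
    return not any(pattern.search(key) for pattern in EXCLUDED_PATTERNS)


names = [
    "Helicobacter cinaedi",
    "unidentified",
    "environmental samples",
    "Non-A, non-B hepatitis virus",
    "transposons",
    "unclassified Turbo",
]

kept = [name for name in names if keep_entry(name)]
print(kept)
```

Dropping such entries before matching keeps the dictionary smaller and removes strings that would only ever match by accident.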
Starting in Pole Position
Whilst publicly available ontologies are often a popular place to start in text mining activities, there are many inherent factors that reduce the overall accuracy and confidence of the results. To be clear, this is not a criticism of the resources themselves, which are incredibly valuable contributions to our scientific ecosystem. Rather, using resources that were never intended for text analytics will always yield poor results unless time is taken to fine-tune them for this purpose.
SciBite’s VOCabs are specifically designed for text mining and can address the issues discussed above, along with many more associated with sources not built for this purpose. Whether it’s a public or private ontology you want to rapidly scan over millions of documents – why not drop us a line today and see for yourself the quality of the results we can help you achieve?