Fuzzy matching now in TERMite
22nd November 2016
Author: Neal Dunkinson
TERMite, our high-performance text recognition engine, is designed to identify key concepts in biomedical text regardless of the synonyms used. A drawback of this approach, however, is that terms must generally be spelt correctly to be recognised.
Highly proofread sources such as MEDLINE or ClinicalTrials.gov contain very few errors, so this issue has a negligible effect on recall there. Other document types, such as patents and many internal documents, are not proofed to the same degree and contain a larger proportion of such errors.
As a number of customers had expressed an interest in expanding TERMite’s capabilities in identifying matches to misspelt words, we have developed a new fuzzy matching feature in the latest release of TERMite.
Managing non-exact representation in text
Fuzzy matching requires the alignment of misspelt words to known dictionary terms. Once activated within TERMite, this feature invokes a set of algorithms designed to identify incorrectly spelt words.
For instance, this PubMed article concerning the gene Galectin-3 uses the technically incorrect term “Galactin” (‘a’ instead of ‘e’) in the first sentence. This is actually a common mistake in MEDLINE, with over 20 articles using the incorrect ‘a’ form. A similar issue occurs with the misspelling of Vimentin as Vimintin in other articles.
A different example, based on spacing rather than spelling, comes from this cancer article, which notes a measurement of carcinoembryonic antigen protein levels. The standard name for this gene does not separate “carcino” and “embryonic”, but very occasionally the separated form is used, as in the aforementioned article.
Patents are well known to be a source of significant numbers of spelling and transposition errors. A good example is this patent that lists a number of diseases, including Creutzfeldt-Jakob disease as “Creutzfeld- Jacob Disease”, a misspelling in both elements of the name.
All of these forms are identified using TERMite’s fuzzy matching feature.
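To make the idea concrete, here is a minimal Python sketch of how misspelt or oddly spaced forms like those above could be aligned to dictionary terms. It is not TERMite's actual algorithm, which this post does not describe. It assumes a plain Levenshtein edit distance, a normalisation step that ignores case, spaces and hyphens, and a tiny illustrative dictionary; the function names and the `max_distance` threshold are hypothetical.

```python
import re


def normalise(text: str) -> str:
    """Lower-case and drop spaces and hyphens so spacing variants compare equal."""
    return re.sub(r"[\s\-]+", "", text.lower())


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_lookup(candidate, dictionary, max_distance=2):
    """Return (term, distance) for the closest dictionary term within
    max_distance edits of the candidate, or None if nothing is close enough."""
    key = normalise(candidate)
    best = None
    for term in dictionary:
        distance = edit_distance(key, normalise(term))
        if distance <= max_distance and (best is None or distance < best[1]):
            best = (term, distance)
    return best


# Illustrative dictionary only; a real vocabulary would be far larger.
DICTIONARY = [
    "Galectin-3",
    "Vimentin",
    "carcinoembryonic antigen",
    "Creutzfeldt-Jakob disease",
]

for text in ["Galactin-3", "Vimintin", "carcino embryonic antigen",
             "Creutzfeld- Jacob Disease"]:
    print(text, "->", fuzzy_lookup(text, DICTIONARY))
```

In this sketch the two single-letter misspellings sit one edit away from their dictionary terms, the spacing variant is an exact match after normalisation, and the patent form of Creutzfeldt-Jakob disease is two edits away (a missing ‘t’ and ‘c’ for ‘k’). A fixed threshold is a simplification: short terms would usually need a stricter limit than long ones.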
By default, TERMite’s fuzzy matching assumes that the entity will be spelt correctly at least once somewhere in the text. This can be changed to show all fuzzy matches in the text, though given the huge number of similar words there may be a large number of hits!
From now on, whether it's down as a hitaminc receptor, histamine recep-tor or histaminereceptor - we'll help you find the true meaning of the term and improve your search results across the corpus.
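The default behaviour described here can be sketched as a simple filter over the hits found in one document: a fuzzy hit is kept only if the same dictionary term also appears with a correct spelling. The `Hit` structure and the `require_exact_anchor` flag are hypothetical names chosen for illustration and are not TERMite settings.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    surface: str  # text as it appears in the document
    term: str     # dictionary term the text was matched to
    exact: bool   # True if the surface form is a correctly spelt match


def filter_hits(hits: List[Hit], require_exact_anchor: bool = True) -> List[Hit]:
    """Keep fuzzy hits only when their term is also matched exactly
    somewhere in the same document, unless the check is switched off."""
    if not require_exact_anchor:
        return list(hits)
    anchored_terms = {hit.term for hit in hits if hit.exact}
    return [hit for hit in hits if hit.exact or hit.term in anchored_terms]


document_hits = [
    Hit("histamine receptor", "histamine receptor", exact=True),
    Hit("hitaminc receptor", "histamine receptor", exact=False),
    Hit("oxytocics", "oxytocin receptor", exact=False),
]

# Default: the misspelt histamine receptor is kept because the correct
# spelling also occurs; the unanchored fuzzy hit is dropped.
print(filter_hits(document_hits))

# Switched off: every fuzzy hit is returned.
print(filter_hits(document_hits, require_exact_anchor=False))
```

Requiring a correctly spelt occurrence trades some recall for precision, which is why showing all fuzzy matches can produce a much larger number of hits.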
Supporting dictionary curation
When building dictionaries, TERMite offers a suite of tools to help users gain the best levels of precision and recall for their input terms. Fuzzy matching can also help in this process by identifying closely related terms that may not be in the dictionary. For example, if a customer were interested in Signaling lymphocytic activation molecule, they may not have thought to use the 'lymphocyte' variant of the name often used by authors. To aid with identification of such variants, TERMite offers a simplified fuzzy term identification workflow, which dictionary developers can use to find them rapidly.
Of course, some spelling changes lie in more of a grey area. For instance, a search using our protein type dictionary for 'oxytocin receptor' identified a number of papers (such as this) through the fuzzy match of 'oxytocics', a term describing a class of drugs that act on oxytocin receptors. It is, of course, very debatable whether oxytocics is a true synonym for oxytocin receptors - it all depends on the research the customer is performing.
With fuzzy matching enabled, the user can now choose to include or exclude this type of term on a case-by-case basis.
If you'd like to see how our fuzzy matching feature works on your data or have any questions, please get in touch.
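As a rough illustration of this kind of curation aid (not the TERMite workflow itself), the sketch below uses `get_close_matches` from Python's standard `difflib` module to flag phrases seen in text that are close to, but not already in, a dictionary. The function name, the sample phrases and the 0.9 similarity cutoff are assumptions made for the example.

```python
from difflib import get_close_matches


def suggest_variants(observed_phrases, dictionary, cutoff=0.9):
    """Group observed phrases that are missing from the dictionary under
    the dictionary term they most closely resemble."""
    known = {term.lower() for term in dictionary}
    candidates = sorted(known)
    suggestions = {}
    for phrase in observed_phrases:
        key = phrase.lower()
        if key in known:
            continue  # already covered by the dictionary
        closest = get_close_matches(key, candidates, n=1, cutoff=cutoff)
        if closest:
            suggestions.setdefault(closest[0], []).append(phrase)
    return suggestions


dictionary = ["signaling lymphocytic activation molecule", "oxytocin receptor"]
observed = [
    "signaling lymphocyte activation molecule",
    "signaling lymphocytic activation molecule",
    "histamine receptor",
]

print(suggest_variants(observed, dictionary))
```

Here only the 'lymphocyte' variant is proposed for review; the exact dictionary term and the unrelated phrase are ignored. Lowering the cutoff would surface looser candidates, at the cost of more terms for a curator to check.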