As many of our regular visitors will know, the focus of our work here at SciBite is unlocking the knowledge held in the vast amount of biomedical text researchers have access to. We use many different techniques to do this, all embedded in our award-winning text analytics software. As you might expect, machine learning and artificial intelligence techniques are at the heart of what we do, and are particularly important for our enrichment activities where we look to expand the coverage of concepts and terms of many of the world’s major biomedical ontologies. It was in the midst of such an activity that we noticed an interesting output from the system, suggesting the English word ‘bum’ as an adverse medical event alongside words such as ‘lightning’ and ‘frostbite’!
It may help to take a step back and explain. We use a technique called ‘word embedding’ to embed words into an n-dimensional vector space – or, in English, to convert words into sets of numbers. Taken together, the numbers in each set capture semantic information about the word, learned by passing masses of text data through a system known as a neural network. We can feed these sets of numbers into simple mathematical operations; for example, ‘uncle – male + female’ returns the result ‘aunt’, and ‘Paris – France + England’ returns ‘London’. Similarly, ‘citalopram – depression + schizophrenia’ returns ‘risperidone’, a drug commonly used to treat schizophrenia.
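To make the arithmetic concrete, here is a minimal sketch in Python. It is not our production code: the three-dimensional vectors are invented by hand purely for illustration (real embeddings are learned from text and have hundreds of dimensions), and the names `TOY_VECTORS`, `cosine` and `analogy` are hypothetical.

```python
import math

# Hand-made toy vectors (assumed values, for illustration only).
# Dimension 0 loosely encodes gender, 1 "family member", 2 "royalty".
TOY_VECTORS = {
    "male":   (1.0, 0.0, 0.0),
    "female": (-1.0, 0.0, 0.0),
    "uncle":  (1.0, 1.0, 0.0),
    "aunt":   (-1.0, 1.0, 0.0),
    "king":   (1.0, 0.0, 1.0),
    "queen":  (-1.0, 0.0, 1.0),
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(start, minus, plus, vectors=TOY_VECTORS):
    """Return the word whose vector is closest to start - minus + plus."""
    target = tuple(
        s - m + p
        for s, m, p in zip(vectors[start], vectors[minus], vectors[plus])
    )
    # Exclude the query words themselves, as is conventional for analogies.
    candidates = [w for w in vectors if w not in {start, minus, plus}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("uncle", "male", "female"))
```

With these toy values, ‘uncle – male + female’ lands exactly on the vector for ‘aunt’. With learned embeddings the result is only ever the nearest word, not an exact match.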
We can also calculate ‘distances’ between words to determine how similar they are, and this is a functionality that is particularly useful when constructing vocabularies. The most important aspect of a vocabulary entry is the set of synonyms, and when words occupy a similar semantic space in our word embeddings, there is a good chance that they are synonymous. In tandem with careful human curation, this helps to make the coverage of our vocabs as thorough as possible. And not only does it help experts to recall related words more quickly, it also, occasionally, comes up with an association that no human would have thought to check.
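As a rough sketch of how such a similarity lookup might work, the Python below ranks the words in a tiny vocabulary by cosine similarity to a query word. The vectors are invented toy values, the function names are hypothetical, and cosine similarity is an assumption on our part here; it is a common choice of measure, but other distance measures can be used in the same way.

```python
import math

# Invented toy vectors (assumed values); real embeddings are learned from text.
TOY_VECTORS = {
    "tumour":   (0.90, 0.10, 0.10),
    "tumor":    (0.88, 0.12, 0.10),
    "neoplasm": (0.80, 0.20, 0.20),
    "fracture": (0.10, 0.90, 0.20),
    "aspirin":  (0.10, 0.20, 0.90),
}


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, top_n=2, vectors=TOY_VECTORS):
    """Return the top_n (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; these are candidates for a curator to review.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


for candidate, score in most_similar("tumour"):
    print(f"{candidate}\t{score:.3f}")
```

The output is a shortlist rather than a verdict: a high score suggests a possible synonym, and a human curator decides whether it belongs in the vocabulary.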
And now we get to the bottom of the matter.
Among the most similar words to ‘bum’ are ‘battlefield’, ‘polytrauma’ and ‘firecracker’. Either there’s a side to life science literature that we are unfamiliar with, or there is something else going on here. You may have figured it out already. How about if I tell you that the second most similar word to ‘bum’ is ‘scald’? The first is ‘burn’. The word ‘bum’, as it most commonly occurs in life science literature, is, in fact, a misreading of the word ‘burn’, where the ‘r’ and ‘n’ have blurred together to become an ‘m’; probably an artefact of transcription errors, particularly when PDF files are converted into raw text.
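The ‘rn’ to ‘m’ confusion can also be checked at the level of spelling. The sketch below is not how our system found the link (that came from the embeddings); it is a simple, complementary illustration that undoes the confusion and looks the result up in a list of known terms. The confusion table contains only the pair described above, and `OCR_CONFUSIONS`, `ocr_variants`, `likely_originals` and `KNOWN_TERMS` are all hypothetical names.

```python
# Each pair is (text as read, text as possibly printed). Only the 'rn' -> 'm'
# confusion from this article is listed; others could be added the same way.
OCR_CONFUSIONS = [("m", "rn")]

# A tiny stand-in for a curated vocabulary (assumed contents).
KNOWN_TERMS = {"burn", "scald", "frostbite", "modern"}


def ocr_variants(token, confusions=OCR_CONFUSIONS):
    """Generate spellings obtained by undoing one confusion at one position."""
    variants = set()
    for seen, printed in confusions:
        start = token.find(seen)
        while start != -1:
            variants.add(token[:start] + printed + token[start + len(seen):])
            start = token.find(seen, start + 1)
    return variants


def likely_originals(token, vocabulary=KNOWN_TERMS):
    """Return known terms that `token` could be a misreading of."""
    return sorted(ocr_variants(token) & vocabulary)


print(likely_originals("bum"))
print(likely_originals("modem"))
```

A check like this only proposes candidates. Whether ‘bum’ in a given sentence really means ‘burn’ still depends on context, which is where the embeddings and curated vocabularies come in.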
Here are some examples:
As we can see, the algorithms have correctly identified these articles as concerning events that could be considered traumatic to a human. They’ve subsequently gone on to identify the words and phrases they think the article pertains to … and have ended up at what is technically, if not scientifically, the right conclusion!
This, of course, is an edge case – but it does go some way to showing how thoroughly we construct our vocabularies and ontologies, and the value they can bring to handling even very noisy data. And, as you can see from the pictures above, misreading of text is an issue that affects articles in the real world – even relatively recent ones!
Furthermore, where our word embeddings feed forward into other aspects of our broader AI strategy, such as classifying documents or identifying sentences with certain relations, it is reassuring to know that systematic errors, like common misreadings of words, are counteracted by our methods for representing data. It also illustrates the point that while machine learning techniques are very powerful, they lack the deep scientific understanding of a human. Fortunately, that understanding is exactly what is captured in ontologies, which is why we believe that combining strong, deep ontologies with cutting-edge machine learning techniques is by far the best approach for text analytics.
© SciBite Limited / Registered in England & Wales No. 07778456