The Powerful Combination of Semantics-Based Machine Learning and Domain Expertise

In this blog, hear about our CTO James Malone's recent talk on semantics-based machine learning and domain expertise, given at the Pistoia Alliance's Spring Virtual Conference on a day dedicated to emerging science and technologies.


SciBite is one of several members of the Pistoia Alliance working to address challenges in making better use of big data in global life sciences R&D. In April, our CTO James Malone took part in the Alliance's Spring Virtual Conference on a day dedicated to emerging science and technologies. In his talk "Combining Deep Learning, Semantics and Domain Expertise to Detect Patterns in Biomedical Text," James took viewers on a journey through examples of semantics-based machine learning (ML) that overcame hurdles in extracting information from varied sources of biomedically relevant content. In the process, James made a compelling argument for the power of SciBite's approach of combining ML with subject matter expertise. "Know your data, understand the problem, and select your solution accordingly," James said. "Semantics matter and are a powerful tool to capture, enrich, and even sub-select data to help know more about content before training, but language models require 'fine-tuning' for different language types and that's where subject matter expertise is essential."

Seeding Machine Learning Models with Rapidly Generated Micro-Vocabularies

James described an ML strategy based on "seeding" named entity recognition (NER). "Seed" terms are passed to an ML model that extracts similar terms from text based on the contexts in which those terms occur. For example, both lung and heart occur alongside words such as 'surgery', 'function' and 'medication', so the model clusters them together. Each seed thus generates a cluster of terms of varied relevance. Pass those terms as seeds in a second iteration and each will generate its own word cluster. The resulting pattern of word clusters is the beginning of semantic categories which, though noisy at first, can be pruned by subject matter experts to shape meaningful classifications through an iterative seeding and pruning process.
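As a rough illustration of the seeding idea (a minimal sketch, not SciBite's actual model, which uses learned embeddings rather than raw co-occurrence counts), the steps above can be mimicked with simple context vectors: build a vector of words that co-occur with each term, then accept a candidate into the seed set when its context vector resembles that of an existing seed. The corpus, seeds and threshold here are all invented for the example:

```python
import math
from collections import Counter

def context_vector(term, corpus, window=2):
    """Count words co-occurring with `term` within a window, across sentences."""
    vec = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            if w == term:
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[words[j]] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def expand_seeds(seeds, candidates, corpus, threshold=0.5):
    """One seeding iteration: keep candidates whose contexts resemble a seed's."""
    vecs = {t: context_vector(t, corpus) for t in set(seeds) | set(candidates)}
    grown = set(seeds)
    for cand in candidates:
        if any(cosine(vecs[cand], vecs[s]) >= threshold for s in seeds):
            grown.add(cand)
    return grown

corpus = [
    "lung surgery improved lung function",
    "heart surgery improved heart function",
    "kidney surgery improved kidney function",
    "the quarterly report summarised the budget figures",
]
# "heart" and "kidney" share the seed's contexts; "report" does not.
print(sorted(expand_seeds({"lung"}, ["heart", "kidney", "report"], corpus)))
# → ['heart', 'kidney', 'lung']
```

Running this a second time with the grown set as seeds is the iterative step James described, with subject matter experts pruning the noisy additions between rounds.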


The advantage of this strategy is that, by pairing ML with domain expertise, models can be developed rapidly and applied flexibly to numerous applications.

Building De Novo Ontologies and Learning New Languages

James and his team of ML experts have, for example, trained a transformer model to scale up the seeding process and constructed a 6,000-term vocabulary for genetic variation. Beginning with a handful of seed terms, the model generated increasingly rich term groupings, which were validated by subject experts as the relational backbone of the vocabulary. This approach was even successfully applied alongside translation to enable English speakers to develop a Japanese vocabulary with thousands of terms, despite not speaking a word of Japanese; the final result was then validated by Japanese-speaking staff. Finally, the flexibility of the strategy addresses some of the language variation that has stymied efforts to capture relevant information from patient-side accounts in Facebook, Reddit, Twitter, forums and other types of real-world data sources. Models trained to understand how specific term categories, such as medications or symptoms, appear within a sentence can infer phrases that look like they should belong to those categories based on the language used. So, for example, "could not sleep", although not in an ontology, can be recognised and annotated as a symptom, which is not possible with a model trained to look for specific terms, like "insomnia."
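The contrast between dictionary lookup and context-based inference can be sketched in a few lines. This is a toy illustration only: a real model would learn its context cues from training data, whereas here the ontology, the context patterns and the example post are all hand-written assumptions:

```python
import re

# Toy ontology lookup: only exact terms match.
SYMPTOM_ONTOLOGY = {"insomnia", "headache", "nausea"}

# Hand-written context patterns standing in for what a trained model learns:
# phrases appearing in these positions "look like" symptoms.
SYMPTOM_CONTEXTS = [
    r"suffer(?:ing|ed)? from ([a-z ]+?)(?:[.,]|$)",
    r"(could not [a-z]+)",
]

def annotate_symptoms(text):
    """Return (phrase, method) pairs: dictionary hits plus context-inferred spans."""
    text = text.lower()
    hits = [(t, "dictionary") for t in SYMPTOM_ONTOLOGY if t in text]
    for pattern in SYMPTOM_CONTEXTS:
        for m in re.finditer(pattern, text):
            hits.append((m.group(1).strip(), "context"))
    return hits

post = "Ever since the new dose I could not sleep."
print(annotate_symptoms(post))
# → [('could not sleep', 'context')]
```

A lookup-only approach would return nothing for this post, since "could not sleep" appears in no ontology; the context rule still captures it as symptom-like.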

Processing Vast Context Spaces

Going beyond term extraction, the strategy also presents an opportunity to support Bidirectional Encoder Representations from Transformers (BERT) models that deliver answers to natural language questions within a given chunk of text. Identifying the chunks that might contain the right answer can pose a challenge when the information space is very extensive or ambiguous. Flexible NER of the kind James described can filter out the most salient paragraphs from a vast information space to deliver a narrower search field where answers are most likely to be found. This semantics-based sub-selection of data can streamline real-time processing and is something we are building into SciBiteSearch.
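The sub-selection step can be illustrated with a minimal sketch, assuming the entities have already been extracted from the user's question by NER. Everything here (the corpus, the entity list, the scoring by mention counts) is invented for the example; only the shortlisted paragraphs would be passed on to an expensive BERT reader:

```python
def score_paragraph(paragraph, entities):
    """Count how many times query-relevant entities appear in a paragraph."""
    text = paragraph.lower()
    return sum(text.count(e.lower()) for e in entities)

def select_paragraphs(paragraphs, entities, top_k=2):
    """Sub-select the paragraphs most likely to contain an answer,
    narrowing the search field before the BERT reader runs."""
    ranked = sorted(paragraphs, key=lambda p: score_paragraph(p, entities),
                    reverse=True)
    return [p for p in ranked[:top_k] if score_paragraph(p, entities) > 0]

corpus = [
    "Metformin is a first-line medication for type 2 diabetes.",
    "The conference was held virtually this spring.",
    "Insulin therapy is required when metformin alone fails to control diabetes.",
]
entities = ["metformin", "diabetes"]  # assumed NER output from the question
# Keeps the two diabetes paragraphs, drops the irrelevant one.
print(select_paragraphs(corpus, entities))
```

In practice the filter would use the semantically enriched annotations described above rather than raw string counts, but the shape of the pipeline is the same: cheap semantic filtering first, expensive reading second.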

Our goal at SciBite is to create applied AI that is flexible and ready to be used. It should be scalable, it should be seamless, and it should allow the user to focus on doing high-quality science with data rather than puzzling over the “nuts and bolts” of how to do it. James’ presentation at the Pistoia Alliance is one example of our continuous quest to create new ways of generating more insights from our data.

Watch the full recorded presentation from the Pistoia Alliance virtual conference, or learn more about our state-of-the-art AI platform SciBiteAI, combining deep learning with powerful semantic algorithms.

Watch the full presentation

Related articles

  1. SciBite CTO James Malone joins ISB Panel to discuss AI and the Future of Biocuration

    Our very own SciBite CTO James Malone was invited to take part in a panel discussion as part of this year's virtual Biocuration Conference, where he shared his views in a thought-provoking discussion on "The Future of Biocuration."

  2. A Semantic approach to creating Machine Learning training data using Ontologies, Wikipedia and not Sherlock Holmes

    SciBite's CTO James Malone explains how a semantic approach to using ontologies is essential to successfully creating machine learning training data sets. In this blog he discusses how Sherlock Holmes (amongst others) made an appearance when we looked to exploit the efforts of Wikipedia to identify articles relevant to the life science domain for a language model project.


How could the SciBite semantic platform help you?

Get in touch with us to find out how we can transform your data

Contact us