A De Novo Vocabulary Approach for Producing AI Models

In this blog, hear how our SciBiteAI team demonstrated a de novo vocabulary approach for generating a machine learning model that allows researchers to identify and annotate text containing mutant descriptors.

Optimised vocabularies with comprehensive collections of domain-specific synonyms are key for researchers analysing or mapping their data. However, creating vocabularies is typically time- and resource-intensive. A broad syntax, for example, presents curators with huge challenges when compiling new libraries. We have been able to demonstrate de novo vocabulary generation using a combination of machine learning techniques (Word2Vec and BioBERT), our rules-based VOCabs and subject matter expertise to create a genetic variation (GenVar) artificial intelligence model.

SciBiteAI Genetic Variation model

Genetic variants are mutations, or permanent changes, to DNA sequences. Mutations may be implicated in gene regulatory networks and disease pathways, making them interesting entities for biomedical studies, for example when assessing target druggability.

Within textual data, however, mutations can be described in a myriad of ways: as changes to the DNA code; as changes to the amino acid sequence of a protein produced from a mutated gene; or using IDs assigned by the Single Nucleotide Polymorphism (SNP) database.
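
To make these three forms concrete, here is a minimal Python sketch that spots them with simple regular expressions. It is an illustration only and not how the GenVar model works: the pattern names, the simplified HGVS-style patterns and the example sentence (including the made-up ID rs12345) are all assumptions. Real-world nomenclature is far more varied than these patterns allow, which is one reason a learned model is attractive.

```python
import re

# Hypothetical, heavily simplified patterns for the three descriptor styles.
PATTERNS = {
    # SNP database IDs such as "rs12345"
    "snp_id": re.compile(r"\brs\d+\b"),
    # Coding DNA changes such as "c.518_519del" or "c.76A>T"
    "dna_change": re.compile(
        r"\bc\.\d+(?:_\d+)?(?:del|dup|ins[ACGT]+|[ACGT]>[ACGT])"
    ),
    # Protein changes such as "p.Tyr173Trpfs*18" or "p.Gly12Asp"
    "protein_change": re.compile(
        r"\bp\.[A-Z][a-z]{2}\d+(?:[A-Z][a-z]{2}(?:fs\*\d+)?|\*)"
    ),
}


def find_descriptors(text):
    """Return (label, matched_text) pairs in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, found) for _, label, found in sorted(hits)]


# Illustrative sentence, not taken from any paper.
example = "The variant c.518_519del (p.Tyr173Trpfs*18) and the SNP rs12345 were reported."
for label, found in find_descriptors(example):
    print(label, found)
```

Hand-written rules like these break down quickly as new ways of writing a mutation turn up in the literature.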

Thanks to our SciBiteAI team, including Spyroula Masiala, who led the development of this approach, we have used this AI modelling approach, together with TERMite, our named entity recognition platform, to develop a GenVar AI model that allows researchers to identify and annotate text containing these mutant descriptors. Learn more about SciBiteAI.

Demonstrating De Novo Vocabulary Generation

In the following excerpt from Molecular Genetics and Genomic Medicine (see Figure 1), an association between Lynch syndrome and a mutated gene is described. The GenVar AI model is able to identify the mutant MLH1 gene, along with the nomenclature that describes the specific change (a deletion at positions 518-519 within the coding DNA).

The GenVar NER model also annotates changes observed in proteins. In this same example, a frame-shift mutation is identified within specific amino acids (tyr173trpfs*18) of the mutant protein.

Figure 1. Use of the GenVar model to identify a DNA and Gene-Protein mutation

Mutations are frequently described in the literature using Single Nucleotide Polymorphism (SNP) database IDs. The GenVar model has been designed to index SNP IDs, as shown in the following examples for insulin-like growth factor I, taken from a study of colon cancer reported in the journal La Tunisie Medicale (see Figure 2).

Figure 2. Use of the GenVar model to identify SNP IDs

The process for de novo vocabulary generation of the GenVar NER AI model was as follows:

1. Genetic variation terms, for example SNP IDs and DNA and protein mutations, were extracted from PubMed Central papers.

2. A Word2Vec model, trained on scientific genetic variation literature, was then used to generate related terms for these “seed” entities (see Figure 3). Whilst the use of Word2Vec to generate related terms isn’t novel, we are able to leverage our rules-based vocabularies and curators to validate these suggestions.

Figure 3. A Word2Vec model was used to generate related terms from “seed” entities

3. In this proof-of-concept project, we were able to validate the dictionary of mutant terms produced by the Word2Vec model against the publicly available Clinical Variant and SNP databases. As described in Figure 4, an automated “keep and toss” method was employed to cross-check the Word2Vec suggestions against these resources, with validated terms being added to a genetic variation vocabulary.

Figure 4. The process for de novo vocabulary generation

4. This process was repeated in rounds to build up a genetic variation vocabulary of around 6,000 terms.

5. A training set of PubMed articles that included genetic variation references was then annotated with this genetic variation vocabulary.

6. The GenVar model was generated from this annotated content using the BioBERT algorithm.
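
Steps 2 to 5 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the SciBiteAI implementation: the hand-made vectors stand in for a trained Word2Vec model, the REFERENCE set stands in for the public variant databases, and every term, function name and the B-GENVAR label is hypothetical.

```python
import math

# Made-up 3-d vectors standing in for a trained Word2Vec model (assumption).
VECTORS = {
    "rs1001": (0.90, 0.10, 0.00),
    "rs1002": (0.85, 0.15, 0.05),
    "rs1003": (0.80, 0.20, 0.10),
    "polymorphism": (0.70, 0.40, 0.10),
    "patient": (0.00, 0.20, 0.90),
}
# Stand-in for the reference databases used to validate suggestions (assumption).
REFERENCE = {"rs1001", "rs1002", "rs1003"}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def related_terms(term, top_n=3):
    """Suggest the top_n terms closest to `term` in the embedding space."""
    scored = [(cosine(VECTORS[term], vec), other)
              for other, vec in VECTORS.items() if other != term]
    return [other for _, other in sorted(scored, reverse=True)[:top_n]]


def build_vocabulary(seeds, rounds=3):
    """Grow a vocabulary from seed terms using repeated keep-and-toss rounds."""
    vocab, tossed, frontier = set(seeds), set(), set(seeds)
    for _ in range(rounds):
        candidates = {c for t in frontier for c in related_terms(t)} - vocab
        kept = candidates & REFERENCE   # keep: confirmed by the reference set
        tossed |= candidates - kept     # toss: everything else
        if not kept:
            break
        vocab |= kept
        frontier = kept                 # validated terms seed the next round
    return vocab, tossed


def annotate(sentence, vocab):
    """Tag whitespace-separated tokens that appear in the vocabulary."""
    tagged = []
    for token in sentence.split():
        core = token.strip(".,;:()")
        tagged.append((token, "B-GENVAR" if core in vocab else "O"))
    return tagged


vocab, tossed = build_vocabulary({"rs1001"})
print(sorted(vocab), sorted(tossed))
print(annotate("Carriers of rs1002 were enrolled.", vocab))
```

Token and label pairs of this kind are the usual input for fine-tuning a BERT-style token classifier, which is where step 6 would pick up.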

In evaluation studies, the GenVar AI model exhibited impressive accuracy, precision and recall scores against “unseen” PubMed articles that included genetic variations.
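
For readers less familiar with these metrics, the sketch below shows one way token-level accuracy, precision and recall can be computed. The labels are invented purely to exercise the function and are not GenVar evaluation data; the function name and the GENVAR label are likewise assumptions.

```python
def score(gold, predicted, positive="GENVAR"):
    """Token-level accuracy, precision and recall for one positive label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented labels for eight tokens: one missed variant, one false alarm.
gold = ["O", "GENVAR", "O", "GENVAR", "O", "O", "GENVAR", "O"]
pred = ["O", "GENVAR", "O", "O", "O", "GENVAR", "GENVAR", "O"]
print(score(gold, pred))
```

Precision falls when the model tags tokens that are not variants, while recall falls when it misses variants that are present, so the two are best read together.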

We have demonstrated a machine learning approach for rapidly generating an NER AI model that accurately describes genetic variations from a few seed terms. To quote Aeschylus, “From a small seed, a mighty trunk may grow”.

As well as being able to retrain this genetic variation model with other datasets, we can use this methodology to produce a whole host of NER AI models where the “conventional” process is time-consuming or unwieldy, ranging from biomedical models, such as drug sentiment analysis, to models for industry-specific equipment in the manufacturing sector.

Learn more about SciBiteAI or get in touch with the team to find out more about our GenVar AI model.

You can also download our SciBiteAI use case on leveraging Machine Learning models.
