A Helping Hand: Annotation of the COVID-19 Open Research Dataset for the Scientific Research Community

In this blog, find out how the SciBite team has responded to the call to arms that The White House issued to the tech community when it released the COVID-19 Open Research Dataset (CORD-19), in the hope of uncovering insights and answering high-priority scientific questions related to COVID-19.


In March 2020, in response to the global coronavirus disease pandemic, a consortium including The White House released an Open Research Dataset (CORD-19), along with a call to arms to the tech community to help uncover insights and answer high-priority scientific questions related to COVID-19. The call asked researchers around the world to apply or develop techniques for analysing a large collection of COVID-19 publication data, which has been made open access to help the scientific community. The original datasets can be accessed on Kaggle.

Since then, SciBite has been working continually with our semantic data analytics software to produce versions of this data annotated with biomedical ontologies, which we have been releasing publicly under a GPL license to offer a helping hand. Facilitated by our CTO James Malone, one of our Machine Learning experts, Oliver Giles, has worked with our ontologies team, including Jane Lomax, Rachael Huntley and Anneli Karlsson, and used our software to produce 8 million sentence-level annotations on this dataset to help those trying to identify biomedical entities within the text.

Data science approaches, such as machine learning, can benefit from the normalisation of the biomedical concepts we have identified. Similarly, knowledge graph building can exploit links between concepts identified in the text and extract them as relationships. The open annotation and publication of this data is our early contribution to this call to arms. A publication on this work is now available via bioRxiv.

Summary of Method

To identify relevant concepts, we needed a focused set of ontologies enriched with concepts pertinent to COVID-19 research, alongside our existing ontologies covering broad areas of interest: drugs, genes, indications and phenotypes. The new COVID-19-specific ontologies we developed were also released as vocabularies on our SciBiteLabs website for anyone to use.

To create the annotations, we applied TERMite, our named entity recognition (NER) engine, to the text in the CORD-19 data (titles, abstracts and, where available, body text). By injecting these results back into the original JSON as an additional set of objects, we aimed to prevent compatibility issues for individuals and groups who had already begun working with the initial release of this data. In addition, to enable more focused analysis, we tabulated and transposed the annotated data to create a set of sentence co-occurrences that expedites sentence-focused relationship extraction.
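The two steps above can be pictured with a minimal Python sketch. It is an illustration only, not SciBite's pipeline: a toy dictionary matcher stands in for TERMite, the vocabulary and concept IDs are invented placeholders, and the flat `title` / `abstract` / `body_text` layout and the `annotations` key are assumptions rather than the actual schema of the CORD-19 or annotated files.

```python
import json
import re
from collections import Counter
from itertools import combinations

# Hypothetical toy vocabulary: surface form -> (entity type, concept ID).
# The IDs are invented placeholders, not real ontology identifiers.
VOCAB = {
    "hydroxychloroquine": ("DRUG", "DRUG:0001"),
    "ribavirin": ("DRUG", "DRUG:0002"),
    "covid-19": ("INDICATION", "IND:0001"),
}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def annotate_sentence(sentence):
    """Return dictionary matches in one sentence, ordered by start offset."""
    hits = []
    for term, (entity_type, concept_id) in VOCAB.items():
        for m in re.finditer(re.escape(term), sentence, flags=re.IGNORECASE):
            hits.append({
                "id": concept_id,
                "type": entity_type,
                "text": m.group(0),
                "start": m.start(),
                "end": m.end(),
            })
    return sorted(hits, key=lambda h: h["start"])


def annotate_paper(paper):
    """Copy the paper and add an 'annotations' list; existing keys are untouched."""
    annotated = json.loads(json.dumps(paper))  # deep copy of JSON-like data
    annotations = []
    for section in ("title", "abstract", "body_text"):  # assumed flat layout
        text = paper.get(section) or ""
        for index, sentence in enumerate(split_sentences(text)):
            entities = annotate_sentence(sentence)
            if entities:
                annotations.append({
                    "section": section,
                    "sentence_index": index,
                    "sentence": sentence,
                    "entities": entities,
                })
    annotated["annotations"] = annotations
    return annotated


def sentence_cooccurrences(annotated):
    """Count unordered pairs of distinct concept IDs found in the same sentence."""
    counts = Counter()
    for entry in annotated["annotations"]:
        ids = sorted({e["id"] for e in entry["entities"]})
        counts.update(combinations(ids, 2))
    return counts
```

Because the annotations are added under a new key, code that reads only the original fields keeps working unchanged, which is the compatibility concern described above. The co-occurrence counts give one row per concept pair, a convenient starting table for sentence-level relationship extraction.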

Summary of Results

TERMite identified and annotated over 45 million entities, covering 62,746 unique ontology concepts. As can be observed in the table below, many of the drugs found, both in March and more recently, have been heavily investigated by researchers around the world, either as a potential treatment (e.g. Hydroxychloroquine, Ribavirin) or for a possible role in poorer clinical outcomes (e.g. Angiotensin).

Ontology vocabulary

Drug most frequent hits

This data has already helped in developing knowledge graphs and data integration portals, and it is our hope that it will be put to further use in the data-driven approaches used in 2020 and beyond.

To learn more, take a look at the publication on this work, which is now available via bioRxiv, or get in touch with the team if you’d like more information.
