Enterprise FAIR Data

The challenges of harmonising data to be Findable, Accessible, Interoperable and Reusable


The explosion of data in the life sciences has led many pharmaceutical companies to change the way in which they conduct research and development. Until recently, data were generated and used with a specific analysis in mind.

Today, emerging technologies are unlocking the data-rich workflows needed for researchers to take large sets of historic data and apply them to new questions and applications.

This approach is changing the way organisations view and value data, both their own and that available in the public domain: from owning and using it once for a specific purpose, to sharing and re-using it for any number of potentially disparate projects.

However, the way in which organisations capture and manage data is fundamental to the success of this approach, and a wider scientific community initiative has resulted in the establishment of FAIR data principles to ensure that data is Findable, Accessible, Interoperable and Reusable.

While initially focused on public domain sources, FAIR data principles are rapidly gaining acceptance within the pharmaceutical industry.

Evolving Life Science Data Models

The traditional life science model in which innovation was primarily an internal process utilising internally-developed applications and data sources has changed radically over the past few years.

Pharmaceutical companies now require access to a wide range of data, including:-

  • Public domain sources (e.g. PubMed, ClinicalTrials.gov, FDA);
  • Commercial intelligence (e.g. Sitetrove, Pharmaprojects, Pharmapremia);
  • Data provided by contract research organisations (CROs).

They may also wish to collaborate with academic institutions, software and service providers. This form of innovation inevitably takes place in a heterogeneous environment or remotely via the cloud. In all cases, the value of FAIR data principles becomes more and more apparent.

Definition of FAIR Data

The term FAIR was first coined at a Lorentz workshop in 2014, and the resulting FAIR data principles were published in 2016 as The FAIR Guiding Principles for Scientific Data Management and Stewardship by Mark D. Wilkinson et al.

Since 2016, FAIR data principles have been adopted by the European Union (EU) together with a growing number of research organisations and universities.

They are also increasingly being adopted by pharmaceutical and other commercial organisations as the standard for managing both scientific and business data.

A brief overview of what the FAIR principles mean for research data within life sciences is given below:-

Findable

  • Data are assigned a unique and persistent identifier;
  • Data are described with rich and machine-readable metadata;
  • Data and metadata are searchable and easy to find.
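As a minimal sketch of the Findable principles, a record can pair a unique, persistent identifier with rich, machine-readable metadata; the dataset and field names below are hypothetical, chosen only for illustration:

```python
import uuid

def make_metadata_record(title, keywords, creator):
    """Build a minimal machine-readable metadata record with a persistent identifier."""
    return {
        "identifier": f"urn:uuid:{uuid.uuid4()}",  # unique and persistent identifier
        "title": title,
        "keywords": keywords,                      # rich metadata supporting search
        "creator": creator,
    }

record = make_metadata_record(
    "Kinase inhibition bioassay results",
    ["bioassay", "kinase", "IC50"],
    "Example Lab",
)
```

In practice the identifier would come from a registry such as a DOI service rather than a locally generated UUID, but the principle is the same: the identifier and metadata travel with the data.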

Accessible

  • Data and metadata can be retrieved by their identifier, read and accessed via a standardised communications protocol;
  • Access to research data should be as open as possible and as restricted as necessary for more sensitive data;
  • Metadata are accessible even after the data are no longer available.
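The last point can be sketched with a "tombstone" record: the catalogue entry, keyed by its persistent identifier, remains retrievable even when the data itself has been withdrawn (the identifier and fields here are hypothetical):

```python
# A hypothetical data catalogue keyed by persistent identifier.
catalogue = {
    "hdl:10.500/example-assay-7": {
        "title": "Assay panel 7",
        "status": "withdrawn",  # the data has been removed...
        "data_url": None,       # ...but the metadata record remains accessible
    },
}

meta = catalogue["hdl:10.500/example-assay-7"]
```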

Interoperable

  • Data and metadata are presented with standardised, documented and accessible semantic descriptions;
  • Data and metadata use standardised vocabularies, terminologies and ontologies;
  • Data and metadata are described with references to others so that it is possible to understand the relations between data.
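For illustration, interoperability often comes down to mapping local synonyms onto a standardised vocabulary. This minimal sketch normalises free-text terms to the ChEBI identifier for aspirin; the synonym list itself is hypothetical:

```python
# Hypothetical local synonym table mapping free-text terms to a shared ontology ID.
# CHEBI:15365 is the ChEBI identifier for acetylsalicylic acid (aspirin).
VOCAB = {
    "acetylsalicylic acid": "CHEBI:15365",
    "aspirin": "CHEBI:15365",
    "asa": "CHEBI:15365",
}

def normalise(term):
    """Return the standard ontology identifier for a free-text term, if known."""
    return VOCAB.get(term.strip().lower())
```

Once every source speaks in the same identifiers, records created in different systems can be joined and compared directly.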

Reusable

  • Data and metadata contain multiple types of contextual information, such as their scientific purpose;
  • Data and metadata are associated with detailed provenance information;
  • Data and metadata are structured and documented in accordance with applicable domain-relevant standards and formats.

The Cost of being unFAIR

European Union Research

In May 2018 the EU published a report (Cost-benefit analysis for FAIR research data) in which it estimated the cost of not having FAIR research data across the EU data market and EU data economy.

Seven indicators were defined to estimate the cost of not having FAIR research data: time spent, storage costs, licence costs, research retraction, double funding, interdisciplinarity and potential economic growth.

To provide estimates, they first assessed the inefficiencies arising in research activities due to the absence of FAIR data. From these different levels of inefficiency, they computed the time wasted due to not having FAIR data and the associated costs. They also estimated the cost of extra licences that researchers would have to pay to access data that would otherwise be open under the FAIR principles. Finally, they looked at the additional storage costs linked to the absence of FAIR data: inaccessible data leads to the creation of additional copies of the data which would not be required if the FAIR principles were in place.

Summing all these costs, the EU report found that not having FAIR research data costs the European economy at least €10.2bn every year. By drawing a rough parallel with the European open data economy, they concluded that the downstream inefficiencies arising from not implementing FAIR could account for a further €16bn annually.

Research in the United States

According to recent Gartner research, “the average financial impact of poor data quality on organisations is $9.7 million per year”.

In their survey Extracting business value from the 4 V’s of big data, IBM also discovered that in the US alone, businesses lose $3.1 trillion annually due to poor data quality. Even more startling was their finding that “1 in 3 business leaders don’t trust the information they use to make decisions”.

Challenges to the Adoption of FAIR Data Principles

Unstructured Legacy Data

Unstructured data in electronic lab notebooks (ELNs), proprietary databases, PDFs, SharePoint folders, etc., represent a challenge for any FAIR data initiative.

A common example of this is bioassay data, which may be unsearchable for any number of reasons:-

  • It is not consistently tagged;
  • Many terms have multiple names or identifiers;
  • No use of open standards or common terminology.
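The impact on searchability can be shown with a small sketch: a naive exact-match search over inconsistently tagged records misses most hits, whereas a synonym-aware search built on a shared vocabulary finds them all (the records and synonym table here are hypothetical):

```python
# Three bioassay records tagging the same target three different ways.
records = [
    {"assay": "A1", "target": "TNF"},
    {"assay": "A2", "target": "TNF-alpha"},
    {"assay": "A3", "target": "tumour necrosis factor"},
]

# Shared synonym table mapping every variant to one preferred label.
SYNONYMS = {"tnf": "TNF", "tnf-alpha": "TNF", "tumour necrosis factor": "TNF"}

def exact_search(query):
    return [r for r in records if r["target"] == query]

def synonym_search(query):
    preferred = SYNONYMS.get(query.lower(), query)
    return [r for r in records if SYNONYMS.get(r["target"].lower()) == preferred]
```

An exact search for "TNF" returns one record; the synonym-aware search returns all three.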

Data Silos

Data silos are another obstacle to FAIR data principles. Systems or infrastructure added via acquisition or merger are frequently not accessible to other parts of the organisation.

Recovering Historical Data

There is a great deal of historical data trapped in data silos, proprietary databases, spreadsheets, etc. that may still be of intrinsic value in today’s pharmaceutical research programs.

A well-known example of this can be found in the publication of research on the Origin of CRISPR-Cas Technology by Francisco J.M. Mojica et al. By “trawling the literature”, Mojica was able to connect their work to that undertaken several years earlier by Yoshizumi Ishino et al. on sequencing of the iap gene.

However, examples that rely on manual review are rare, and recovering historical data assets via retrospective, manual curation is expensive and may be impractical or even impossible:-

  • Most of the associated metadata is now missing;
  • The personnel who created the data in the first place have probably moved on;
  • The technology employed in the original project is very likely obsolete or no longer supported.

In these circumstances, automation offers the most cost-effective and practical solution, and may create new opportunities for leveraging historical data.
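As one illustration of such automation, structured metadata can sometimes be recovered programmatically from conventions buried in legacy artefacts, such as file names. The naming pattern below is invented for the example:

```python
import re

# Hypothetical legacy file names encoding assay type, date and plate number.
LEGACY = ["HTS_2011-03-04_plate17.csv", "ELISA_2009-11-30_plate02.csv"]

PATTERN = re.compile(
    r"(?P<assay>\w+)_(?P<date>\d{4}-\d{2}-\d{2})_plate(?P<plate>\d+)\.csv"
)

def recover_metadata(name):
    """Extract structured metadata from a legacy file name, or None if unmatched."""
    m = PATTERN.match(name)
    return m.groupdict() if m else None
```

Rules like this will never recover everything, but applied in bulk they can re-attach enough metadata to make large legacy collections findable again.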

Biological Complexity

As described above, the “Findable” criterion of FAIR requires data to be described using “…rich and machine-readable metadata”. However, machine-readable representations of biological information can quickly become extremely complex.

FAIR data principles provide a framework for addressing this complexity.

Ontology Management Challenges

Multiple competing ontologies and vocabularies within an organisation are usually indicative of several challenges:-

  • Disparate terminologies, which frequently overlap with high levels of redundancy;
  • Multiple names or identifiers for the same entity;
  • Ownership is hard to establish, making it difficult for users to edit or contribute;
  • No version control;
  • Governance is usually top-down and inflexible;
  • Little or no use of open standards, making FAIR compliance hard to achieve;
  • A growing need to integrate business as well as life science data.

Whether home-grown or proprietary, ontology management problems such as these make it hard to perform federated searches and will require rationalisation as part of any FAIR data initiative.
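A first step in such rationalisation is simply measuring the overlap. This sketch compares two hypothetical departmental vocabularies and surfaces terms that carry conflicting identifiers:

```python
# Two hypothetical departmental vocabularies with overlapping, redundant entries.
chem_vocab = {"aspirin": "LOCAL:001", "ibuprofen": "LOCAL:002"}
pharma_vocab = {"aspirin": "PH-77", "paracetamol": "PH-12"}

# Terms present in both vocabularies...
overlap = set(chem_vocab) & set(pharma_vocab)

# ...and the conflicting identifiers each side assigns to them.
conflicts = {term: (chem_vocab[term], pharma_vocab[term]) for term in overlap}
```

Each conflict found this way is a candidate for mapping both local identifiers onto a single shared ontology term.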

Cultural Change within the Organisation

Changing the culture to value FAIR data principles is one of the most challenging tasks facing any organisation.

Data curation has generally been an under-funded and under-appreciated aspect of research, but it is a vital part of the process and needs to be treated in this way. Investing in technology is necessary, but insufficient by itself: organisations also need to invest in the people tasked with generating the data that drives their research efforts.

Pillars of FAIR Data

Overview

SciBite provides two crucial pillars for any implementation of FAIR data principles, eliminating data duplication, providing a consistent set of terms and ontologies across all data sources, and making legacy data searchable:-

  • Findability: Assigning a unique and persistent identifier to data and describing it with rich, machine-readable metadata, making it searchable and easy to find.
  • Interoperability: Presenting data using standardised semantic descriptions and a common set of vocabularies and ontologies, allowing users to understand individual data elements and the relations between different elements.

CENtree Ontology Management

CENtree provides a centralised, enterprise-ready resource for ontology management and transforms the experience of maintaining and releasing ontologies for research-led businesses.

CENtree leverages machine learning techniques to support ontology management by suggesting parent classes, synonyms and relationship connections when new terms are being added.

TERMite Text Analysis Engine

TERMite (TERM identification, tagging and extraction) is our high-performance named entity recognition (NER) and extraction engine.

Coupled with our hand-curated VOCabs, it can recognise and extract relevant terms found in scientific text transforming unstructured content into rich, machine-readable data.
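The underlying idea of dictionary-based named entity recognition can be sketched in a few lines. This is a simplified stand-in for illustration only, not the TERMite API, and the ontology identifiers are shown purely as examples:

```python
import re

# Illustrative vocabulary mapping terms to an entity type and ontology identifier.
VOCAB = {
    "egfr": ("GENE", "HGNC:3236"),
    "gefitinib": ("DRUG", "CHEBI:49668"),
}

def annotate(text):
    """Find vocabulary terms in text and return ontology-linked annotations."""
    hits = []
    for term, (etype, ident) in VOCAB.items():
        for m in re.finditer(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            hits.append({"term": m.group(0), "type": etype, "id": ident, "start": m.start()})
    return sorted(hits, key=lambda h: h["start"])

hits = annotate("Gefitinib inhibits EGFR signalling.")
```

A production engine adds synonym expansion, disambiguation and scale, but the output is the same in spirit: unstructured text transformed into ontology-linked, machine-readable annotations.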

DOCstore Search Engine

DOCstore enables researchers to harness the power of semantic search to rapidly scan multiple biomedical sources.

It supports a wide range of use cases, from identifying new drug discovery opportunities to monitoring the competitive landscape for a disease of interest.

A Typical FAIR Workflow

The diagram below illustrates a typical ontology-centric workflow based on FAIR data principles. Our CENtree ontology manager sits at the centre of this workflow.


Step 1: Edits come into CENtree from all parts of the organisation. All approved staff can contribute to this process subject to agreed controls and governance. The ontologies created within CENtree can then be served to all parts of the organisation.

Step 2: Ontologies can be served to machine learning algorithms either as tagged, structured text via TERMite or directly as ontology artefacts.

Step 3: Ontologies can be pushed into “smart forms” as part of an organisation’s data registry (e.g. assay registration, Omics, etc.).

Step 4: CENtree can output TERMite VOCabs directly, allowing for the automated transformation of legacy data or to produce ontology-annotated text.

Step 5: Ontologies can be consumed directly by other applications within the organisation.

FAIR as an Enabler for Machine Learning

FAIR data principles are ideal for creating the quality training data required by machine learning algorithms.

FAIR data principles can assist with several of the most important aspects of creating successful machine learning models:-

  • Acquiring and curating data;
  • Helping project teams understand how the data can be employed (i.e. under what licence);
  • Assisting in feature extraction by making features more readily identifiable and extractable;
  • Incorporating domain heuristics (e.g. from the ontologies employed to describe the data).
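As a sketch of the feature-extraction point: once records carry ontology annotations, the annotation identifiers themselves become model features with almost no extra work. The records, labels and identifiers below are hypothetical:

```python
# Ontology-annotated records: the annotation IDs serve directly as features.
records = [
    {"text": "...", "annotations": ["CHEBI:15365", "HGNC:3236"], "label": 1},
    {"text": "...", "annotations": ["CHEBI:15365"], "label": 0},
]

# Build a binary feature matrix: one column per ontology identifier.
vocab = sorted({a for r in records for a in r["annotations"]})
X = [[1 if a in r["annotations"] else 0 for a in vocab] for r in records]
y = [r["label"] for r in records]
```

Because the identifiers are standardised, features extracted from one dataset line up with those from any other FAIR-annotated source.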

These tasks are frequently the most time-consuming and expensive aspects of developing machine learning models.

Customer Use Cases

  • Pistoia Alliance FAIR Toolkit;
  • More than FAIR: Unlocking the value of your bioassay data.


Master Data Management (MDM)

Overview

Master Data Management (MDM) and FAIR data share many of the same objectives and challenges; the primary difference is the environment in which they operate.

While FAIR is primarily concerned with data sources and processes within the scientific arena, the focus of MDM is on commercial enterprises.

Depending on their size, organisations in this space typically have data on customers, employees, vendors, suppliers, parts, products, locations, contacts, accounts and business policies.

A typical commercial IT infrastructure for master data management is shown below.


Defining Master Data Management

From the Gartner IT Glossary: What is MDM?

  • Master data management (MDM) is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency and accountability of the enterprise’s official shared master data assets.
  • Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise including customers, prospects, citizens, suppliers, sites, hierarchies and chart of accounts.

Challenges Presented by MDM

The challenges presented by master data are similar to those encountered when adopting FAIR data principles:-

  • Complexity: Organisations typically have complex data quality issues with master data, especially with customer and address data from legacy systems;
  • Overlap: There is often a high degree of overlap in master data, for example: organisations storing customer data across many separate systems;
  • Data Modelling: Organisations typically lack the skills and systems to model data accurately;
  • Governance: As with FAIR data, poor information governance (stewardship, ownership and policies) around master data leads to inefficiencies across the organisation.
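The overlap problem can be sketched with a naive record-matching pass across two hypothetical customer systems, keyed on a normalised name and postcode:

```python
# Hypothetical customer records held separately by CRM and billing systems.
crm = [{"name": "ACME Ltd.", "postcode": "CB1 2AB"}]
billing = [{"name": "Acme Ltd", "postcode": "cb1 2ab"}]

def key(rec):
    """Normalise a record into a comparable (name, postcode) key."""
    name = rec["name"].lower().rstrip(".")
    postcode = rec["postcode"].lower().replace(" ", "")
    return (name, postcode)

# Records sharing a key across systems are candidate duplicates.
dupes = {key(r) for r in crm} & {key(r) for r in billing}
```

Real MDM platforms use far more sophisticated fuzzy matching, but the principle is the same: normalisation first, then matching against a shared key.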

Rewards of a Successful MDM Program

While MDM initiatives represent a major investment in time and resources for any organisation, the potential rewards are also substantial:-

  • A universal, shared, trusted view of customer data for marketing, sales and service purposes;
  • Improved lead times to launch new products;
  • Synchronised product and location data across the supply chain;
  • Improved business operations for more effective decision making.

Mastering your Enterprise Data

The diagram below illustrates the processes and systems for master data management in any large enterprise.


SciBite’s ontology management, semantic search and text mining capabilities play a vital role in mastering data by providing a consistent metadata model and allowing for the efficient processing of unstructured and semi-structured documents.

In performing these tasks, SciBite’s platform enables organisations to eliminate data duplication, provide a consistent set of terms and ontologies across all data sources, and make legacy data searchable.

Want to learn more?

Get in touch with the team to discuss how we can help you clean your data

Contact us

Use cases

Eliminating the Data Preparation Burden

For most pharmaceutical companies, extracting insight from heterogeneous and ambiguous data remains a challenge. The era of data-driven R&D is motivating investment in technologies such as machine learning to provide deeper insights into new drug development strategies.

The quality of data directly impacts the accuracy and reliability of results of computational approaches. However, the work required to achieve clean, high quality data can be costly, often prohibitively so, requiring data scientists to spend the majority of their time as ‘data janitors’, rather than actually analysing data.

SciBite provides an integrated, cost-effective solution to significantly reduce the time and cost associated with the process of data cleansing, normalisation and annotation. The output ensures that downstream integration and discovery activities are based on high quality, contextualised data.

Read the full use case

How could the SciBite semantic platform help you?

Get in touch with us to find out how we can transform your data

Contact us