Enterprise FAIR Data

The challenges of harmonising data to be Findable, Accessible, Interoperable and Reusable


The explosion of data in the life sciences has led many pharmaceutical companies to change how they conduct research and development. Until recently, data was generated and used with a specific analysis in mind.

Today, emerging technologies are unlocking the data-rich workflows needed for researchers to take large sets of historical data and apply them to new questions and applications.

This approach is changing how organizations view and value data, both their own and that available in the public domain: from owning and using it once for a specific purpose to sharing and re-using it for any number of potentially disparate projects.

However, how organizations capture and manage data is fundamental to this approach’s success. A broader scientific community initiative has established FAIR data principles to ensure that data is Findable, Accessible, Interoperable, and Reusable.

While initially focused on public domain sources, FAIR data principles are rapidly gaining acceptance within the pharmaceutical industry.

Evolving Life Science Data Models

The traditional life science model, in which innovation was primarily an internal process utilizing internally-developed applications and data sources, has changed radically over the past few years.

Pharmaceutical companies now require access to a wide range of data, including:

  • Public domain sources (e.g., PubMed, ClinicalTrials.gov, FDA);
  • Commercial intelligence (e.g., Sitetrove, Pharmaprojects, Pharmapremia);
  • Data provided by contract research organizations (CROs).

They may also wish to collaborate with academic institutions and with software and service providers. This innovation inevitably takes place in heterogeneous environments, whether on-premises or in the cloud. In all cases, the value of FAIR data principles becomes increasingly apparent.

Definition of FAIR Data

The term FAIR was first coined at a Lorentz workshop in 2014, and the resulting FAIR data principles were published in 2016 as The FAIR Guiding Principles for Scientific Data Management and Stewardship by Mark D. Wilkinson et al.

Since 2016, FAIR data principles have been adopted by the European Union (EU) and a growing number of research organizations and universities.

They are also increasingly being adopted by pharmaceutical and other commercial organizations as the standard for managing both scientific and business data.

A brief overview of what the FAIR principles mean for research data within the life sciences is given below, followed by a small illustrative metadata sketch:

Findable

  • Data are assigned a unique and persistent identifier;
  • Data are described with rich and machine-readable metadata;
  • Data and metadata are searchable and easy to find.

Accessible

  • Data and metadata can be retrieved by their identifier, read, and accessed via a standardized communications protocol;
  • Access to research data should be as open as possible and as restricted as necessary for more sensitive data;
  • Metadata is accessible even after the data are no longer available.

Interoperable

  • Data and metadata are presented with standardized, documented, and accessible semantic descriptions;
  • Data and metadata use standardized vocabularies, terminologies, and ontologies;
  • Data and metadata include qualified references to other data and metadata, so that the relations between datasets can be understood.

Reusable

  • Data and metadata contain multiple types of contextual information, such as their scientific purpose;
  • Data and metadata are associated with detailed provenance information;
  • Data and metadata are structured and documented by applicable domain-relevant standards and formats.
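
To ground these principles, here is a minimal sketch (in Python, for illustration) of what a machine-readable metadata record for a dataset might look like. The identifier, URLs, vocabulary terms, and field values are hypothetical examples, not a prescribed FAIR or SciBite schema.

```python
import json

# Illustrative only: a minimal, machine-readable metadata record for a dataset.
# All identifiers, URLs, and values below are hypothetical examples.
dataset_metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    # Findable: a globally unique, persistent identifier plus rich metadata
    "identifier": "https://doi.org/10.1234/example-assay-2021",  # hypothetical DOI
    "name": "Kinase inhibition bioassay panel (example)",
    "description": "Dose-response measurements for a panel of kinase inhibitors.",
    "keywords": ["bioassay", "kinase", "dose-response"],
    # Accessible: retrievable via a standard protocol, with explicit access terms
    "url": "https://data.example.org/assays/12345",              # hypothetical URL
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # Interoperable: terms drawn from shared vocabularies and ontologies
    "measurementTechnique": "http://purl.obolibrary.org/obo/BAO_0000015",  # BioAssay Ontology term (illustrative)
    # Reusable: provenance and contextual information
    "creator": {"@type": "Organization", "name": "Example Research Lab"},
    "dateCreated": "2021-06-01",
}

print(json.dumps(dataset_metadata, indent=2))
```

Even a record this small touches all four principles: a persistent identifier, a standard retrieval URL and license, ontology-based terminology, and basic provenance.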

The Cost of being unFAIR

European Union Research

In May 2018, the EU published a report (Cost-benefit analysis for FAIR research data) in which they estimated the cost of not having FAIR research data across the EU data market and EU data economy.

Seven indicators were defined to estimate the cost of not having FAIR research data: Time spent, cost of storage, license costs, research retraction, double funding, interdisciplinarity, and potential economic growth.

To provide estimates, they first assessed the inefficiencies arising in research activities due to the absence of FAIR data. From these different levels of inefficiency, they computed the time wasted due to not having FAIR data and the associated costs. They also estimated the cost of the extra licenses that researchers would have to pay to access data that would otherwise be open under the FAIR principles. Finally, they looked at the additional storage costs linked to the absence of FAIR data: inaccessible data leads to the creation of additional copies that would not be required if the FAIR principles were in place.

Combining all these costs, the EU report found that not having FAIR research data costs the European economy at least €10.2bn every year. Drawing a rough parallel with the European open data economy, it concluded that the downstream inefficiencies arising from not implementing FAIR could account for a further €16bn annually.

Research in the United States

According to recent Gartner research, “the average financial impact of poor data quality on organizations is $9.7 million per year”.

In its survey Extracting business value from the 4 V’s of big data, IBM also discovered that in the US alone, businesses lose $3.1 trillion annually due to poor data quality. An even more startling finding: “1 in 3 business leaders don’t trust the information they use to make decisions”.

Challenges to the Adoption of FAIR Data Principles

Unstructured Legacy Data

Unstructured data in electronic lab notebooks (ELNs), proprietary databases, PDFs, SharePoint folders, etc., represent a challenge for any FAIR data initiative.

A typical example is bioassay data, which can be rendered unsearchable for any number of reasons (see the sketch after this list):

  • It is not consistently tagged;
  • Many terms have multiple names or identifiers;
  • No use of open standards or common terminology.
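
A minimal sketch of the problem and its remedy, assuming a small, hypothetical synonym table: once variant names are normalized to a single canonical identifier, records that a plain-text search would miss become findable again.

```python
from typing import Optional

# Hypothetical synonym table mapping free-text tags to a canonical identifier.
SYNONYMS = {
    "egfr": "HGNC:3236",
    "erbb1": "HGNC:3236",
    "epidermal growth factor receptor": "HGNC:3236",
    "her1": "HGNC:3236",
}

def normalize_tag(tag: str) -> Optional[str]:
    """Map a free-text tag to its canonical identifier, if known."""
    return SYNONYMS.get(tag.strip().lower())

# Three legacy records describing the same target that a naive text search would treat as unrelated.
legacy_tags = ["EGFR", "ErbB1", "epidermal growth factor receptor"]
print({tag: normalize_tag(tag) for tag in legacy_tags})
# All three resolve to HGNC:3236, so a single identifier-based query now finds every record.
```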

Data Silos

Data silos are another obstacle to FAIR data principles. Systems or infrastructure added via acquisition or merger are frequently not accessible to other parts of the organization.

Recovering Historical Data

A great deal of historical data trapped in data silos, proprietary databases, spreadsheets, etc., may still be of intrinsic value in today’s pharmaceutical research programs.

A well-known example of this can be found in the publication of research on the Origin of CRISPR-Cas Technology by Francisco J.M. Mojica et al. By “trawling the literature,” Mojica was able to connect his work to that undertaken several years earlier by Yoshizumi Ishino et al. on the sequencing of the IAP gene.

However, examples that rely on manual review are rare, and recovering historical data assets via retrospective manual curation is expensive and may be impractical or even impossible:

  • Most of the associated metadata is now missing;
  • The personnel who created the data in the first place have probably moved on;
  • The original project’s technology is likely obsolete or no longer supported.

In these circumstances, automation offers the most cost-effective and practical solution and may create new opportunities for leveraging historical data.

Biological Complexity

As described above, the “Findable” criterion of FAIR requires data to be described using “…rich and machine-readable metadata”. However, machine-readable representations of biological information can quickly become highly complex.

FAIR data principles provide a framework for addressing this complexity.
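
As a small illustration of that complexity, the sketch below (using the rdflib library) encodes a single gene-disease statement as ontology-grounded triples. Even this toy example draws on several identifier schemes, and the relation namespace used here is a purely hypothetical placeholder.

```python
# A minimal sketch of how even a single biological statement expands into
# several machine-readable, ontology-grounded triples.
from rdflib import Graph, Literal, Namespace, RDFS, URIRef

OBO = Namespace("http://purl.obolibrary.org/obo/")
EX = Namespace("https://example.org/relation/")      # hypothetical relation namespace for illustration

g = Graph()
gene = URIRef("https://identifiers.org/hgnc:1100")   # BRCA1 (HGNC:1100)
disease = OBO.MONDO_0007254                          # breast cancer (MONDO)

g.add((gene, RDFS.label, Literal("BRCA1")))
g.add((disease, RDFS.label, Literal("breast cancer")))
g.add((gene, EX.associatedWith, disease))            # illustrative predicate, not a standard ontology term

print(g.serialize(format="turtle"))
```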

Ontology Management Challenges

Multiple competing ontologies and vocabularies within an organization are usually indicative of several challenges:

  • Disparate terminologies. These frequently overlap with high levels of redundancy;
  • Multiple names or identifiers for the same entity;
  • Ownership is hard to establish, making it difficult for users to edit or contribute;
  • No version control;
  • Governance is usually top-down and inflexible;
  • Little or no use of open standards, making FAIR compliance hard to achieve;
  • Difficulty integrating business and life science data, something today’s organizations increasingly need to do.

Whether home-grown or proprietary, ontology management problems such as these make it hard to perform federated searches and will require rationalization as part of any FAIR data initiative.

Cultural Change within the Organization

Changing the culture to value FAIR data principles is one of the most challenging tasks facing any organization:

Data curation has generally been an underfunded and under-appreciated aspect of research, but it is a vital part of the process and needs to be treated this way. Investing in technology is necessary but insufficient by itself: organizations also need to invest in the people tasked with generating the data that drives their research efforts.

Pillars of FAIR Data

Overview

By eliminating data duplication, providing a consistent set of terms and ontologies across all data sources, and making legacy data searchable, SciBite provides two crucial pillars for any implementation of FAIR data principles:

  • Findability: assigning a unique and persistent identifier to data and describing it with rich, machine-readable metadata, making it searchable and easy to find.
  • Interoperability: presenting data using standardized semantic descriptions and a common set of vocabularies and ontologies, allowing users to understand individual data elements and the relations between them.

CENtree Ontology Management

CENtree provides a centralized, enterprise-ready resource for ontology management and transforms the experience of maintaining and releasing ontologies for research-led businesses.

CENtree leverages machine learning techniques to support ontology management by suggesting parent classes, synonyms, and relationship connections when new terms are added.
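
As a rough, generic illustration of the idea, and emphatically not CENtree's actual algorithm, the sketch below ranks candidate parent classes for a new term by simple label similarity against a handful of existing class labels.

```python
# A toy illustration of suggesting candidate parent classes for a new term by
# fuzzy label similarity. This is NOT how CENtree works internally; it only
# sketches the general idea of machine-assisted ontology editing.
from difflib import SequenceMatcher

existing_classes = {        # identifiers shown for flavour; treat as illustrative
    "EFO:0000311": "cancer",
    "EFO:0000616": "neoplasm",
    "EFO:0000305": "breast carcinoma",
}

def suggest_parents(new_label: str, top_n: int = 2):
    """Rank existing class labels by string similarity to the new term's label."""
    scored = [
        (round(SequenceMatcher(None, new_label.lower(), label.lower()).ratio(), 2), cid, label)
        for cid, label in existing_classes.items()
    ]
    return sorted(scored, reverse=True)[:top_n]

print(suggest_parents("triple-negative breast carcinoma"))
```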

TERMite Text Analysis Engine

TERMite (TERM identification, tagging, and extraction) is our high-performance named entity recognition (NER) and extraction engine.

Coupled with our hand-curated VOCabs, it can recognize and extract relevant terms found in scientific text, transforming unstructured content into rich, machine-readable data.
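
To illustrate the general concept of dictionary-driven named entity recognition (a deliberately simplified stand-in, not TERMite itself or its VOCabs), the sketch below tags known terms in a sentence and returns canonical identifiers from a tiny, hypothetical dictionary.

```python
import re

# A hypothetical micro-dictionary standing in for a curated vocabulary (not a SciBite VOCab).
VOCAB = {
    "aspirin": ("DRUG", "CHEBI:15365"),
    "asthma": ("DISEASE", "DOID:2841"),
    "egfr": ("GENE", "HGNC:3236"),
}

def tag_entities(text: str):
    """Return (surface form, entity type, identifier, span) for every dictionary hit."""
    hits = []
    for term, (etype, ident) in VOCAB.items():
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((match.group(0), etype, ident, match.span()))
    return sorted(hits, key=lambda h: h[3])

sentence = "The study examined aspirin use, asthma outcomes, and EGFR expression."  # fictional example text
for hit in tag_entities(sentence):
    print(hit)
```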

SciBite Search

SciBite Search enables researchers to harness the power of semantically-enriched scientific search to scan multiple biomedical sources rapidly.

It supports a wide range of use cases, from identifying new drug discovery opportunities to monitoring the competitive landscape for a disease of interest.

A Typical FAIR Workflow

The diagram below illustrates a typical ontology-centric workflow based on FAIR data principles. Our CENtree ontology manager sits at the center of this workflow.

[Diagram: ontology-centric FAIR workflow with CENtree at the center]

Step 1: Edits come into CENtree from all parts of the organization. All approved staff can contribute to this process subject to agreed controls and governance. The ontologies created within CENtree can then serve all aspects of the organization.

Step 2: Ontologies can be served to machine learning algorithms either as tagged, structured text via TERMite or directly as ontology artifacts.

Step 3: Ontologies can be pushed into “smart forms” as part of an organization’s data registry (e.g., assay registration, Omics, etc.).

Step 4: CENtree can output TERMite VOCabs directly, allowing for the automated transformation of legacy data or to produce ontology-annotated text.

Step 5: Ontologies can be consumed directly by other applications within the organization.

FAIR as an Enabler for Machine Learning

FAIR data principles are ideal for creating the quality training data required by machine learning algorithms.

FAIR data principles can assist with several of the most critical aspects of creating successful machine-learning models:

  • Acquiring and curating data;
  • Helping project teams understand how the data may be used (e.g., what license applies);
  • Assisting in feature extraction by making features more readily identifiable and extractable;
  • Incorporating domain heuristics (e.g., from the ontologies employed to describe the data).

These tasks are frequently the most time-consuming and expensive aspects of developing machine learning models.
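
A brief sketch of the feature-extraction point: once records carry canonical, ontology-grounded identifiers rather than free text, assembling a training matrix becomes a mechanical step. The records and identifiers below are invented for illustration.

```python
# Toy illustration: canonical identifiers make feature extraction mechanical.
# Without normalization, 'EGFR', 'ErbB1' and 'HER1' would each become a separate feature.
records = [
    {"compound": "CHEBI:45783", "target": "HGNC:3236", "active": 1},  # illustrative record
    {"compound": "CHEBI:45783", "target": "HGNC:1100", "active": 0},  # illustrative record
    {"compound": "CHEBI:15365", "target": "HGNC:3236", "active": 0},  # illustrative record
]

# One-hot encode each record over the shared identifier space.
features = sorted({r["compound"] for r in records} | {r["target"] for r in records})
X = [[int(f in (r["compound"], r["target"])) for f in features] for r in records]
y = [r["active"] for r in records]

print(features)
for row, label in zip(X, y):
    print(row, label)
```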

Customer Use Cases

  • Pistoia Alliance FAIR Toolkit;
  • More than FAIR: Unlocking the value of your bioassay data.

Master Data Management (MDM)

Overview

Master Data Management (MDM) and FAIR data share many of the same objectives and challenges; the primary difference is the environment in which they operate.

While FAIR is primarily concerned with data sources and processes within the scientific arena, MDM focuses on commercial enterprises.

Depending on their size, organizations in this space typically have data on customers, employees, vendors, suppliers, parts, products, locations, contacts, accounts, and business policies.

A typical commercial IT infrastructure for master data management is shown below.

[Diagram: typical commercial IT infrastructure for master data management]

Defining Master Data Management

From the Gartner IT Glossary: What is MDM?

  • Master data management (MDM) is a technology-enabled discipline in which business and IT work together to ensure the uniformity, accuracy, stewardship, semantic consistency, and accountability of the enterprise’s official shared master data assets.
  • Master data is the consistent and uniform set of identifiers and extended attributes that describes the core entities of the enterprise, including customers, prospects, citizens, suppliers, sites, hierarchies, and chart of accounts.

Challenges Presented by MDM

The challenges presented by master data are similar to those encountered when adopting FAIR data principles:

  • Complexity: Organizations typically have complex data quality issues with master data, especially with customer and address data from legacy systems;
  • Overlap: There is often a high degree of overlap in master data, for example, organizations storing customer data across many separate systems (see the sketch after this list);
  • Data Modelling: Organizations typically lack the skills and systems to model data accurately;
  • Governance: As with FAIR data, poor information governance (stewardship, ownership, and policies) around master data leads to inefficiencies across the organization.
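
As a simplified illustration of the overlap and complexity challenges, the sketch below merges customer records held in two hypothetical systems by normalizing a matching key; the field names and data are invented for the example.

```python
# Toy illustration of de-duplicating customer master records held in two
# separate systems. Field names and records are invented for the example.
def match_key(record):
    """Normalize the fields used to decide whether two records describe the same customer."""
    return (record["email"].strip().lower(), record["name"].strip().lower())

crm_records = [
    {"name": "Acme Pharma Ltd", "email": "INFO@ACME-PHARMA.COM", "source": "CRM"},
]
erp_records = [
    {"name": "acme pharma ltd", "email": "info@acme-pharma.com", "source": "ERP"},
]

golden = {}
for record in crm_records + erp_records:
    key = match_key(record)
    # Keep one 'golden' record per customer and remember which systems referenced it.
    golden.setdefault(key, {**record, "sources": []})["sources"].append(record["source"])

for key, record in golden.items():
    print(key, "->", record["sources"])
```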

Rewards of a Successful MDM Program

While MDM initiatives represent a significant investment in time and resources for any organization, the potential rewards are also substantial:

  • A universal, shared, trusted view of customer data for marketing, sales, and service purposes;
  • Shorter lead times for launching new products;
  • Synchronized product and location data across the supply chain;
  • Improved business operations for more effective decision-making.

Mastering your Enterprise Data

The diagram below illustrates the processes and systems for master data management in any large enterprise.

[Diagram: master data management processes and systems across the enterprise]

SciBite’s ontology management, semantic search, and text mining capabilities play a vital role in mastering data by providing a consistent metadata model and allowing for the efficient processing of unstructured and semi-structured documents.

In performing these tasks, SciBite’s platform enables organizations to eliminate data duplication, provide a consistent set of terms and ontologies across all data sources, and make legacy data searchable.

Want to learn more?

Get in touch with the team to discuss how we can help you clean your data

Contact us

Use cases

Eliminating the Data Preparation Burden
[Use Case]

For most pharmaceutical companies, extracting insight from heterogeneous and ambiguous data remains a challenge. The era of data-driven R&D is motivating investment in technologies such as machine learning to provide deeper insights into new drug development strategies.

The quality of data directly impacts the accuracy and reliability of the results of computational approaches. However, the work required to achieve clean, high quality data can be costly, often prohibitively so, requiring data scientists to spend the majority of their time as ‘data janitors’, rather than actually analysing data.

SciBite provides an integrated, cost-effective solution to significantly reduce the time and cost associated with the process of data cleansing, normalisation and annotation. The output ensures that downstream integration and discovery activities are based on high quality, contextualised data.

Read the full use case

How could the SciBite semantic platform help you?

Get in touch with us to find out how we can transform your data

Contact us