The explosion of data in the life sciences has led many pharmaceutical companies to change how they conduct research and development. Until recently, data was generated and used with a specific analysis in mind.
Today, emerging technologies are unlocking the data-rich workflows needed for researchers to take large sets of historical data and apply them to new questions and applications.
This approach is changing how organizations view and value data, both their own and that available in the public domain: from owning and using it once for a specific purpose to sharing and re-using it for any number of potentially disparate projects.
However, how organizations capture and manage data is fundamental to this approach’s success. A broader scientific community initiative has established FAIR data principles to ensure that data is Findable, Accessible, Interoperable, and Reusable.
While initially focused on public domain sources, FAIR data principles are rapidly gaining acceptance within the pharmaceutical industry.
The traditional life science model, in which innovation was primarily an internal process utilizing internally-developed applications and data sources, has changed radically over the past few years.
Pharmaceutical companies now require access to a wide range of data, including:
They may also wish to collaborate with academic institutions, software, and service providers. This innovation inevitably occurs in a heterogeneous environment or remotely via the cloud. In all cases, the value of FAIR data principles becomes increasingly apparent.
The term FAIR was first coined at a Lorentz workshop in 2014, and the resulting FAIR data principles were published in 2016 as The FAIR Guiding Principles for Scientific Data Management and Stewardship by Mark D. Wilkinson et al.
Since 2016, FAIR data principles have been adopted by the European Union (EU) and a growing number of research organizations and universities.
They are also increasingly being adopted by pharmaceutical and other commercial organizations as the standard for managing both scientific and business data.
A brief overview of what the FAIR principles mean for research data within life sciences is given below:
In May 2018, the EU published a report (Cost-benefit analysis for FAIR research data) estimating the cost of not having FAIR research data across the EU data market and EU data economy.
Seven indicators were defined to estimate the cost of not having FAIR research data: Time spent, cost of storage, license costs, research retraction, double funding, interdisciplinarity, and potential economic growth.
To provide estimates, they first assessed the inefficiencies arising in research activities due to the absence of FAIR data. From these different levels of inefficiency, they computed the time wasted due to not having FAIR data, and the associated costs. They also estimated the cost of extra licenses that researchers would have to pay to access data that would otherwise be open under the FAIR principles. Finally, they looked at the additional storage costs linked to the absence of FAIR data: inaccessible data leads to the creation of additional copies, which would not be required if the FAIR principles were in place.
Computing all these costs, the EU report found that the absence of FAIR research data costs the European economy at least €10.2bn every year. Drawing a rough parallel with the European open data economy, it concluded that the downstream inefficiencies arising from not implementing FAIR could account for a further €16bn annually.
According to recent Gartner research, “the average financial impact of poor data quality on organizations is $9.7 million per year”.
In its survey Extracting business value from the 4 V's of big data, IBM also discovered that in the US alone, businesses lose $3.1 trillion annually due to poor data quality. An even more startling finding: "1 in 3 business leaders don't trust the information they use to make decisions".
Unstructured data in electronic lab notebooks (ELNs), proprietary databases, PDFs, SharePoint folders, etc., represents a challenge for any FAIR data initiative.
A typical example of this is bioassay data, which can be rendered unsearchable for any number of reasons:
Data silos are another obstacle to FAIR data principles. Systems or infrastructure added via acquisition or merger are frequently not accessible to other parts of the organization.
A great deal of historical data trapped in data silos, proprietary databases, spreadsheets, etc., may still be of intrinsic value in today’s pharmaceutical research programs.
A well-known example of this can be found in the publication of research on the Origin of CRISPR-Cas Technology by Francisco J.M. Mojica et al. By “trawling the literature,” Mojica was able to connect their work to that undertaken several years earlier by Yoshizumi Ishino et al. on sequencing of the IAP gene.
However, examples that rely on manual review are rare, and recovering historical data assets via retrospective manual curation is expensive and may be impractical or even impossible:
In these circumstances, automation offers the most cost-effective and practical solution and may create new opportunities for leveraging historical data.
As described above, the "Findable" criterion of FAIR requires data to be described using "…rich and machine-readable metadata". However, machine-readable representations of biological information can quickly become highly complex.
FAIR data principles provide a framework for addressing this complexity.
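To make the idea of rich, machine-readable metadata concrete, the sketch below shows a minimal metadata record for a hypothetical dataset. The field names, the `is_findable` check, and the ontology IDs are all illustrative assumptions, not a formal standard:

```python
# Illustrative sketch: a minimal machine-readable metadata record,
# loosely modelled on FAIR's "Findable" criterion. Field names and
# ontology IDs are hypothetical examples, not a formal schema.
dataset_metadata = {
    "identifier": "doi:10.0000/example-dataset",   # globally unique, persistent ID
    "title": "Kinase inhibition bioassay results",
    "description": "IC50 measurements for a compound series against a target kinase",
    "keywords": ["bioassay", "kinase", "IC50"],
    "license": "CC-BY-4.0",
    "ontology_annotations": {
        "organism": "NCBITaxon:9606",  # Homo sapiens
        "assay_type": "BAO:XXXXXXX",   # placeholder for a BioAssay Ontology term
    },
}

# The minimum fields a discovery service might require (an assumption
# for this sketch, not a published requirement).
REQUIRED_FIELDS = {"identifier", "title", "license", "ontology_annotations"}

def is_findable(record):
    """Check that the record carries the minimum metadata needed for discovery."""
    return REQUIRED_FIELDS.issubset(record)
```

A record missing its persistent identifier or license would fail this check, flagging it for curation before publication.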
Multiple competing ontologies and vocabularies within an organization are usually indicative of several challenges:
Whether home-grown or proprietary, ontology management problems such as these make it hard to perform federated searches and will require rationalization as part of any FAIR data initiative.
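The rationalization problem can be illustrated with a toy synonym map: competing vocabularies are reconciled by resolving every surface form to one canonical identifier (here, real ChEBI IDs), so a federated search matches records however they were originally annotated. The mapping table itself is a made-up example:

```python
# Hypothetical sketch: rationalising competing vocabularies by mapping
# every synonym onto a single canonical ontology identifier.
# The ChEBI IDs are real; the mapping table is illustrative.
canonical = {
    "acetylsalicylic acid": "CHEBI:15365",
    "aspirin": "CHEBI:15365",
    "asa": "CHEBI:15365",
    "paracetamol": "CHEBI:46195",
    "acetaminophen": "CHEBI:46195",
}

def normalise(term):
    """Resolve any known synonym to its canonical ID, or None if unknown."""
    return canonical.get(term.strip().lower())
```

With this in place, a search for "aspirin" and a search for "acetylsalicylic acid" resolve to the same identifier, so neither misses records annotated with the other label.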
Changing the culture to value FAIR data principles is one of the most challenging tasks facing any organization:
Data curation has generally been an underfunded and under-appreciated aspect of research, but it is a vital part of the process and needs to be treated this way. Investing in technology is necessary but insufficient by itself: organizations also need to invest in the people tasked with generating the data that drives their research efforts.
SciBite provides crucial pillars for any implementation of FAIR data principles: eliminating data duplication, providing a consistent set of terms and ontologies across all data sources, and making legacy data searchable:
CENtree provides a centralized, enterprise-ready resource for ontology management and transforms the experience of maintaining and releasing ontologies for research-led businesses.
CENtree leverages machine learning techniques to support ontology management by suggesting parent classes, synonyms, and relationship connections when new terms are added.
TERMite (TERM identification, tagging, and extraction) is our high-performance named entity recognition (NER) and extraction engine.
Coupled with our hand-curated VOCabs, it can recognize and extract relevant terms found in scientific text, transforming unstructured content into rich, machine-readable data.
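The general technique behind vocabulary-driven entity recognition can be shown with a toy dictionary matcher. This is emphatically not TERMite's implementation or API, just a minimal sketch of the idea, with a three-entry vocabulary invented for the example (the ontology IDs shown are real):

```python
import re

# Toy sketch of dictionary-based named entity recognition: scan text for
# known surface forms and emit typed, ontology-linked annotations.
# Not SciBite's implementation; vocabulary invented for illustration.
VOCAB = {
    "EGFR": ("GENE", "HGNC:3236"),
    "gefitinib": ("DRUG", "CHEBI:49668"),
    "lung cancer": ("INDICATION", "MONDO:0008903"),
}

def tag(text):
    """Return (surface form, entity type, ontology ID, span) for each hit."""
    hits = []
    for surface, (etype, oid) in VOCAB.items():
        for m in re.finditer(re.escape(surface), text, flags=re.IGNORECASE):
            hits.append((m.group(0), etype, oid, m.span()))
    return sorted(hits, key=lambda h: h[3])
```

Running `tag("Gefitinib targets EGFR in lung cancer.")` yields three typed hits in reading order, turning a free-text sentence into structured, ontology-linked data.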
SciBite Search enables researchers to harness the power of semantic search to scan multiple biomedical sources rapidly.
It supports a wide range of use cases, from identifying new drug discovery opportunities to monitoring the competitive landscape for a disease of interest.
The diagram below illustrates a typical ontology-centric workflow based on FAIR data principles. Our CENtree ontology manager sits at the center of this workflow.
Step 1: Edits come into CENtree from all parts of the organization. All approved staff can contribute to this process subject to agreed controls and governance. The ontologies created within CENtree can then serve all aspects of the organization.
Step 2: Ontologies can be served to machine learning algorithms either as tagged, structured text via TERMite or directly as ontology artifacts.
Step 3: Ontologies can be pushed into “smart forms” as part of an organization’s data registry (e.g., assay registration, Omics, etc.).
Step 4: CENtree can output TERMite VOCabs directly, allowing for the automated transformation of legacy data or to produce ontology-annotated text.
Step 5: Ontologies can be consumed directly by other applications within the organization.
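In outline, the serving side of this workflow can be sketched as follows: a centrally managed ontology is flattened into a lookup vocabulary for annotation (Step 4) and queried for the child terms a registration "smart form" would offer (Step 3). All names and data structures here are hypothetical, not the CENtree or TERMite interfaces; the EFO IDs are real:

```python
# Hypothetical sketch of an ontology-centric workflow. A small ontology
# fragment (real EFO IDs, simplified structure) is flattened into a
# surface-form -> ID vocabulary and served to a registration form.
ontology = {
    "EFO:0000408": {"label": "disease", "synonyms": [], "parent": None},
    "EFO:0000311": {"label": "cancer", "synonyms": ["malignant tumour"],
                    "parent": "EFO:0000408"},
}

def to_vocab(onto):
    """Flatten an ontology into a lookup vocabulary for text annotation."""
    vocab = {}
    for oid, entry in onto.items():
        vocab[entry["label"].lower()] = oid
        for syn in entry["synonyms"]:
            vocab[syn.lower()] = oid
    return vocab

def dropdown_options(onto, parent_id):
    """Serve a smart form the child terms of a given parent (Step 3)."""
    return [e["label"] for e in onto.values() if e["parent"] == parent_id]
```

Because every downstream consumer derives from the same central ontology, an edit made once propagates to annotation pipelines, registration forms, and other applications alike.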
FAIR data principles are ideal for creating the quality training data required by machine learning algorithms.
FAIR data principles can assist with several of the most critical aspects of creating successful machine-learning models:
These tasks are frequently the most time-consuming and expensive aspects of developing machine learning models.
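One concrete way FAIR identifiers pay off in model training: examples recorded under different synonyms can be pooled under one canonical label rather than fragmenting across spellings. A minimal sketch with invented measurements (the ChEBI ID is real):

```python
from collections import defaultdict

# Illustrative sketch: three measurements recorded under three synonyms
# of the same compound. With a FAIR-style canonical identifier they pool
# into one training label instead of three. Data values are invented.
records = [
    ("aspirin", 1.2),
    ("acetylsalicylic acid", 1.1),
    ("ASA", 1.3),
]
synonyms = {
    "aspirin": "CHEBI:15365",
    "acetylsalicylic acid": "CHEBI:15365",
    "asa": "CHEBI:15365",
}

def pool_by_id(rows):
    """Group measurement values by canonical ontology identifier."""
    pooled = defaultdict(list)
    for name, value in rows:
        pooled[synonyms[name.lower()]].append(value)
    return dict(pooled)
```

Without the identifier mapping, a model would see three sparse, seemingly unrelated entities; with it, all three observations reinforce a single label.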
Pistoia Alliance FAIR Toolkit:
More than FAIR: Unlocking the value of your bioassay data:
Master Data Management (MDM) and FAIR data share many of the same objectives and challenges; the primary difference is the environment in which they operate.
While FAIR is primarily concerned with data sources and processes within the scientific arena, MDM focuses on commercial enterprises.
Depending on their size, organizations in this space typically have data on customers, employees, vendors, suppliers, parts, products, locations, contacts, accounts, and business policies.
A typical commercial IT infrastructure for master data management is shown below.
From the Gartner IT Glossary: What is MDM?
The challenges presented by master data are similar to those encountered when adopting FAIR data principles:
While MDM initiatives represent a significant investment in time and resources for any organization, the potential rewards are also substantial:
The diagram below illustrates the processes and systems for master data management in any large enterprise.
SciBite’s ontology management, semantic search, and text mining capabilities play a vital role in mastering data by providing a consistent metadata model and allowing for the efficient processing of unstructured and semi-structured documents.
In performing these tasks, SciBite’s platform enables organizations to eliminate data duplication, provide a consistent set of terms and ontologies across all data sources, and make legacy data searchable.
Get in touch with the team to discuss how we can help you clean your data.
Contact us
For most pharmaceutical companies, extracting insight from heterogeneous and ambiguous data remains a challenge. The era of data-driven R&D is motivating investment in technologies such as machine learning to provide deeper insights into new drug development strategies.
The quality of data directly impacts the accuracy and reliability of the results of computational approaches. However, the work required to achieve clean, high quality data can be costly, often prohibitively so, requiring data scientists to spend the majority of their time as ‘data janitors’, rather than actually analysing data.
SciBite provides an integrated, cost-effective solution to significantly reduce the time and cost associated with the process of data cleansing, normalisation and annotation. The output ensures that downstream integration and discovery activities are based on high quality, contextualised data.
Get in touch with us to find out how we can transform your data.
© SciBite Limited / Registered in England & Wales No. 07778456