The explosion of data in the life sciences has led many pharmaceutical companies to change the way in which they conduct research and development. Until recently, data were generated and used with a specific analysis in mind.
Today, emerging technologies are unlocking the data-rich workflows needed for researchers to take large sets of historic data and apply them to new questions and applications.
This approach is changing the way organisations view and value data, both their own and that available in the public domain: from owning and using it once for a specific purpose, to sharing and re-using it for any number of potentially disparate projects.
However, the way in which organisations capture and manage data is fundamental to the success of this approach, and a wider scientific community initiative has resulted in the establishment of the FAIR data principles to ensure that data is Findable, Accessible, Interoperable and Reusable.
While initially focused on public domain sources, FAIR data principles are rapidly gaining acceptance within the pharmaceutical industry.
The traditional life science model in which innovation was primarily an internal process utilising internally-developed applications and data sources has changed radically over the past few years.
Pharmaceutical companies now require access to a wide range of data, including:-
They may also wish to collaborate with academic institutions, software and service providers. This form of innovation inevitably takes place in a heterogeneous environment or remotely via the cloud. In all cases, the value of the FAIR data principles becomes increasingly apparent.
The term FAIR was first coined at a Lorentz workshop in 2014, and the resulting FAIR data principles published in 2016 as The FAIR Guiding Principles for Scientific Data Management and Stewardship by Mark D. Wilkinson et al.
Since 2016, FAIR data principles have been adopted by the European Union (EU) together with a growing number of research organisations and universities.
They are also increasingly being adopted by pharmaceutical and other commercial organisations as the standard for managing both scientific and business data.
A brief overview of what the FAIR principles mean for research data within life sciences is given below:-
In May 2018 the EU published a report (Cost-benefit analysis for FAIR research data) in which they estimated the cost of not having FAIR research data across the EU data market and EU data economy.
Seven indicators were defined to estimate the cost of not having FAIR research data: Time spent, cost of storage, licence costs, research retraction, double funding, interdisciplinarity and potential economic growth.
To provide estimates, they first assessed the inefficiencies arising in research activities due to the absence of FAIR data. From these different levels of inefficiency, they computed the time wasted due to not having FAIR data and the associated costs. They also estimated the cost of extra licences that researchers would have to pay to access data that would otherwise be open under the FAIR principles. Finally, they looked at the additional storage costs linked to the absence of FAIR data: inaccessible data leads to the creation of additional copies that would not be required if the FAIR principles were in place.
Summing all these costs, the EU report found that not having FAIR research data costs the European economy at least €10.2bn every year. By drawing a rough parallel with the European open data economy, they concluded that the downstream inefficiencies arising from not implementing FAIR could account for a further €16bn annually.
According to recent Gartner research, “the average financial impact of poor data quality on organisations is $9.7 million per year”.
In their survey Extracting business value from the 4 V’s of big data, IBM also discovered that in the US alone, businesses lose $3.1 trillion annually due to poor data quality. Even more startling was their finding that “1 in 3 business leaders don’t trust the information they use to make decisions”.
Unstructured data in electronic lab notebooks (ELNs), proprietary databases, PDFs, SharePoint folders, etc., represent a challenge for any FAIR data initiative.
A common example of this is bioassay data which can be rendered unsearchable for any number of reasons:-
Data silos are another obstacle to FAIR data principles. Systems or infrastructure added via acquisition or merger are frequently not accessible to other parts of the organisation.
There is a great deal of historical data trapped in data silos, proprietary databases, spreadsheets, etc. that may still be of intrinsic value in today’s pharmaceutical research programs.
A well-known example of this can be found in the publication of research on the Origin of CRISPR-Cas Technology by Francisco J.M. Mojica et al. By “trawling the literature”, Mojica was able to connect their work to that undertaken several years earlier by Yoshizumi Ishino et al. on sequencing of the iap gene.
However, examples that rely on manual review are rare, and recovering historical data assets via retrospective, manual curation is expensive and may be impractical or even impossible:-
In these circumstances, automation offers the most cost-effective and practical solution, and may create new opportunities for leveraging historical data.
As described above, the “Findable” criterion of FAIR requires data to be described using “…rich and machine-readable metadata”. However, machine-readable representations of biological information can quickly become extremely complex.
FAIR data principles provide a framework for addressing this complexity.
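To make the idea of rich, machine-readable metadata concrete, the sketch below shows a minimal metadata record for a dataset. The field names, ontology identifiers and URL are illustrative assumptions, not a prescribed FAIR schema; the point is that each FAIR letter maps to a concrete, machine-checkable field.

```python
import json

# A minimal, illustrative metadata record for an assay dataset.
# Field names and IDs are assumptions for the sake of example,
# not an official FAIR specification.
record = {
    "identifier": "doi:10.0000/example-dataset",  # Findable: globally unique, persistent ID
    "title": "Kinase inhibition assay, compound series A",
    "license": "CC-BY-4.0",                       # Reusable: explicit usage terms
    "keywords": [
        # Interoperable: terms resolved to shared ontology IDs, not free text alone
        {"label": "Homo sapiens", "ontology_id": "NCBITaxon:9606"},
        {"label": "protein kinase inhibitor", "ontology_id": "CHEBI:38637"},
    ],
    "access_url": "https://example.org/datasets/series-a",  # Accessible: retrieval endpoint
}

# Because the record is plain structured data, any downstream system can
# parse it without human interpretation.
print(json.dumps(record, indent=2))
```

A record like this can be indexed, validated and exchanged automatically, which is precisely what free-text descriptions in a PDF or ELN entry cannot offer.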
Multiple competing ontologies and vocabularies within an organisation are usually indicative of several challenges:-
Whether the ontologies involved are home-grown or proprietary, management problems such as these make it hard to perform federated searches and will require rationalisation as part of any FAIR data initiative.
Changing the culture to value FAIR data principles is one of the most challenging tasks facing any organisation:-
Data curation has generally been an under-funded and under-appreciated aspect of research, but it is a vital part of the process and needs to be treated in this way. Investing in technology is necessary, but insufficient by itself: organisations also need to invest in the people tasked with generating the data that drives their research efforts.
SciBite provides two crucial pillars for any implementation of FAIR data principles, eliminating data duplication, providing a consistent set of terms and ontologies across all data sources and making legacy data searchable:-
CENtree provides a centralised, enterprise-ready resource for ontology management and transforms the experience of maintaining and releasing ontologies for research-led businesses.
CENtree leverages machine learning techniques to support ontology management by suggesting parent classes, synonyms and relationship connections when new terms are being added.
TERMite (TERM identification, tagging and extraction) is our high-performance named entity recognition (NER) and extraction engine.
Coupled with our hand-curated VOCabs, it can recognise and extract relevant terms found in scientific text, transforming unstructured content into rich, machine-readable data.
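The general technique can be illustrated with a toy dictionary-based tagger. This is not TERMite's API; the vocabulary, entity types and ontology IDs below are invented for illustration, and the sketch only shows how a curated vocabulary turns unstructured text into structured, ontology-linked annotations.

```python
import re

# Illustrative vocabulary: surface term -> (entity type, ontology ID).
# All entries are assumptions for this example, not SciBite VOCabs content.
vocab = {
    "BRCA1": ("GENE", "HGNC:1100"),
    "breast cancer": ("INDICATION", "MONDO:0007254"),
    "tamoxifen": ("DRUG", "CHEBI:41774"),
}

def tag(text):
    """Return (surface form, entity type, ontology ID) hits found in text."""
    hits = []
    for term, (etype, onto_id) in vocab.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((m.group(0), etype, onto_id))
    return hits

sentence = "Tamoxifen is used in BRCA1-linked breast cancer."
for surface, etype, onto_id in tag(sentence):
    print(f"{surface}\t{etype}\t{onto_id}")
```

Even this naive version shows the payoff: once terms carry ontology IDs, a sentence becomes queryable data rather than opaque text. A production engine adds disambiguation, synonym handling and scale well beyond this sketch.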
DOCstore enables researchers to harness the power of semantic search to rapidly scan multiple biomedical sources.
It supports a wide range of use cases, from identifying new drug discovery opportunities to monitoring the competitive landscape for a disease of interest.
The diagram below illustrates a typical ontology-centric workflow based on FAIR data principles. Our CENtree ontology manager sits at the centre of this workflow.
Step 1: Edits come into CENtree from all parts of the organisation. All approved staff can contribute to this process subject to agreed controls and governance. The ontologies created within CENtree can then be served to all parts of the organisation.
Step 2: Ontologies can be served to machine learning algorithms either as tagged, structured text via TERMite or directly as ontology artefacts.
Step 3: Ontologies can be pushed into “smart forms” as part of an organisation’s data registry (e.g. assay registration, Omics, etc.).
Step 4: CENtree can output TERMite VOCabs directly, allowing for the automated transformation of legacy data or to produce ontology-annotated text.
Step 5: Ontologies can be consumed directly by other applications within the organisation.
FAIR data principles are ideal for creating the high-quality training data required by machine learning algorithms.
FAIR data principles can assist with several of the most important aspects of creating successful machine learning models:-
These tasks are frequently the most time-consuming and expensive aspects of developing machine learning models.
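One of those time-consuming tasks, label normalisation, can be sketched briefly: when synonyms are mapped onto a single canonical ontology ID, training labels become consistent before any model sees them. The synonym map below is an illustrative assumption, not a real ontology export.

```python
# Illustrative synonym map: free-text label -> canonical ontology ID.
# Entries are assumptions for this example only.
synonyms = {
    "acetylsalicylic acid": "CHEBI:15365",
    "aspirin": "CHEBI:15365",
    "asa": "CHEBI:15365",
    "paracetamol": "CHEBI:46195",
    "acetaminophen": "CHEBI:46195",
}

def normalise(label):
    """Collapse a free-text label onto its canonical ontology ID."""
    return synonyms.get(label.strip().lower(), "UNMAPPED")

raw_labels = ["Aspirin", "acetylsalicylic acid", "Acetaminophen", "ASA"]
canonical = [normalise(label) for label in raw_labels]
print(canonical)  # four raw spellings collapse to two canonical IDs
```

Without this step, a model would treat "Aspirin" and "ASA" as unrelated classes, fragmenting the training signal; with it, the data behaves as FAIR intends, consistent and reusable across projects.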
Pistoia Alliance FAIR Toolkit:
More than FAIR: Unlocking the value of your bioassay data:
Master Data Management (MDM) and FAIR data share many of the same objectives and challenges; the primary difference is the environment in which they operate.
While FAIR is primarily concerned with data sources and processes within the scientific arena, the focus of MDM is on commercial enterprises.
Depending on their size, organisations in this space typically have data on customers, employees, vendors, suppliers, parts, products, locations, contacts, accounts and business policies.
A typical commercial IT infrastructure for master data management is shown below.
From the Gartner IT Glossary: What is MDM?
The challenges presented by master data are similar to those encountered when adopting FAIR data principles:-
While MDM initiatives represent a major investment in time and resources for any organisation, the potential rewards are also substantial:-
The diagram below illustrates the processes and systems for master data management in any large enterprise.
SciBite’s ontology management, semantic search and text mining capabilities play a vital role in mastering data by providing a consistent metadata model and allowing for the efficient processing of unstructured and semi-structured documents.
In performing these tasks, SciBite’s platform enables organisations to eliminate data duplication, provide a consistent set of terms and ontologies across all data sources, and make legacy data searchable.
Get in touch with the team to discuss how we can help you clean your data.
For most pharmaceutical companies, extracting insight from heterogeneous and ambiguous data remains a challenge. The era of data-driven R&D is motivating investment in technologies such as machine learning to provide deeper insights into new drug development strategies.
The quality of data directly impacts the accuracy and reliability of results of computational approaches. However, the work required to achieve clean, high quality data can be costly, often prohibitively so, requiring data scientists to spend the majority of their time as ‘data janitors’, rather than actually analysing data.
SciBite provides an integrated, cost-effective solution to significantly reduce the time and cost associated with the process of data cleansing, normalisation and annotation. The output ensures that downstream integration and discovery activities are based on high quality, contextualised data.
Get in touch with us to find out how we can transform your data
© SciBite Limited / Registered in England & Wales No. 07778456