Data Catalogs are quickly becoming a core technology for large enterprises looking to better manage their data. Given the sheer scale of major companies, there will be perhaps thousands of data repositories, covering potentially billions of data sets. Researchers looking to use/reuse this data face significant challenges, not least of which is finding an individual dataset in the first place. This is where data catalogs promise to help, bringing in data with key metadata into a centralized environment, akin to a “data supermarket,” to provide a one-stop-shop to find the data needed. While there are many commercial providers of data catalog software in the marketplace, many companies choose to build their own systems specific to their needs.
Some data catalog systems will look to add a certain level of standardization and metadata to the data they are importing. For instance, normalizing “M/Male/Men/Boy” to “MALE” to provide more consistency across data from different sources. This is actually more complex than it seems: What are the rules for this change (for instance, does “Boy”=”Man”)?; Who maintains these rules? Should they be applied to all data? Should they be applied to all data? Should the original data be changed, or copies made? This becomes even more complex where large numbers of diverse data are added to the catalog. Indeed, it is important to note that the role of a data catalog is not to extensively enrich and manipulate the incoming data – that could have downstream consequences if users are looking for “the original data”, for instance, for regulatory submission. Nevertheless, data catalogs are rightly seen as a key foundation for many data science activities within large enterprises.
In an ideal world, everyone would use exactly the same language to describe the same physical things. While such a world would be fantastic from a data processing perspective, it’d likely make life pretty boring! Nevertheless, organizations are faced with a challenge of how to implement some level of standardization in preference to complete chaos. For instance, whenever you buy something online you’re almost always presented with a drop-down menu to select your country of residence (which will often be in alphabetical order, and I wonder how many companies make the majority of their income from customers in Afghanistan vs those lower down the alphabet!). e-Commerce sites do this so you don’t make a mistake in typing in your country, removing a potential error in the financial transaction and thus increasing the efficiency of the system. The same is true inside large companies, it is much better for everyone to use the same names or identifiers for compounds, countries, employees, suppliers, projects, and many more things (also known as “entities”) to generate a greater degree of consistency in the data.
Master Data Management (MDM) tools are often employed to address this critical need. These systems are generally thought of as repositories of the “ground truth” and serve this to multiple I.T. applications across the enterprise. Many MDM systems go beyond this and allow the construction of reference models and data flow pipelines to understand the complex relationship between different data systems. In recent years, the line between data catalogs and master data management has become blurred, with many companies offering solutions that encompass both.
While data standardization is good, it still lacks the depth of information required to power the latest generation of analytics and answer business questions. For instance, lets say we standardized all different ways of writing about the humble mouse, mapping words such as “Mice”, “mouse model”, “M. Musculus” and so on to a common term, “MOUSE”. That does start to achieve some form of standardization, but we cannot go beyond basic Boolean search. For instance, many users may want to search for data on “any rodent”. While the data is there, the computer does not understand that a “MOUSE” is a rodent, as are rats, Guinea pigs, and other critical experimental models. Fundamental data retrieval for things like “all kinases”, “all pyrimidine compounds”, “all anti-inflammatories” are not solved by standardization, to address these, we need semantics.
Representing the relationships between entities is the core function of an ontology. Ontologies such as Uberon represent the relationships between body organs and tissues, the BioAssay Ontology represents key parts of drug discovery and the Gene Ontology represents a deep understanding of cellular processes. Together with many other cornerstones, ontologies help bridge the gap between humans and machines. Indeed, there are many studies that demonstrate the synergistic power of two key pillars of Artificial Intelligence, namely semantics, and deep learning, to answer very complex scientific and medical questions. Watch our webinar on scaling the data mountain with Ontologies, Deep Learning and FAIR to learn more.
While many organizations understand the power of ontologies, they have struggled in some situations due to the lack of highly tuned tools designed to maintain them. Many ontology solutions have suffered from:
It is for these reasons we created SciBite’s ontology management platform CENtree – a 21st century resource designed specifically for life-science enterprises and employing a unique Artificial Intelligence (AI) engine to assist ontology management. For more information on CENtree, download the datasheet or watch our webinar on mastering enterprise-level ontologies for people and applications.
Having described three critical pieces of enterprise reference data management, one may wonder how they fit together. As described above, we’re seeing the merging of master data management and data cataloging tools, though the key functions of the two remain quite different. But a key question concerns the role of semantics and where/how should semantic technologies be deployed within this stack? There is no ‘right’ answer here, and every organization will have its own architecture and needs. However, there are two generic models which are likely most prevalent.
1. Enhancing Data Catalogs with Built-in Semantics
An obvious starting point would be to integrate semantic enrichment technology within your data catalog. The advantage is that the catalog itself becomes much more accessible to your users, through semantics they can now ask questions such as “return data on any kinase” and others outlined above. Such an architecture could look like the following.
Here we have a mix of data sources, some of which are compliant with the companies MDM reference data, some which are not. SciBite’s technologies are able to take the MDM references and ensure these are applied across the entire data catalog, building in a much greater degree of standardization. However, the benefits go further, as query and analytical tools are able to leverage the power of ontologies to ask complex questions which invoke the meaning of data, far beyond traditional keyword-based searches. A good example of such an implementation can be seen at Pfizer and AstraZeneca who each use SciBite’s award-winning technology to build intelligent data catalogs.
Read more about the pivotal role of Semantic Enrichment in the evolution of Data Commons at Pfizer.
Read more about how SciBite’s semantic platform powers AstraZeneca’s smart form application.
2. Adding Semantic Enrichment at the Knowledge Graph phase
The first approach is one we’ve seen employed by a number of our customers, but it doesn’t fit all use cases. Where there are very large data catalogs spanning vast data resources across all aspects of global enterprises, semantic enrichment can be very useful to help narrow down searches to specific classes of data. However, it can also be overkill to label billions of data points in a large data catalog that may never be of interest. Thus, a second model is a “just-in-time” approach where data is enriched with semantics once it has been selected for downstream processing. The architecture here is a little different:
Here we rely on the standard data catalog flow, but once data have been selected, the application of ontologies over data and associated metadata allows for the creation of much richer knowledge graphs. We have seen such workflows work in quite a few customer use cases and have previously outlined why this approach of knowledge graphs and semantics is so powerful.
When one thinks of data catalogs, the image of a large infrastructure spanning all parts of an enterprise probably comes to mind. Indeed, any web search for the phrase will return a myriad of companies providing their view on how large organisations can benefit from bringing together data across the company. However, we’ve also seen a need for more “local” data catalogs, perhaps just on a departmental basis, that don’t need to be exposed companywide but require more nuanced tailoring to individual group needs.
Many of our customers employ our DOCstore semantic search engine technology to address these “Data Catalogue-lite” uses cases. For instance:
Creating a database of entity-entity relationships (drug->adverse event, gene->phenotype) data for use cases such as pharmacovigilance and drug repurposing. Check out our use case on a modern, cost-effective approach to Pharmacovigilance
As an instance of DOCstore can be set up in minutes, our customers can create quick prototypes to demonstrate the utility of semantically enriched data portals. Often the prototype matures into the end solution, but sometimes is simply the starting point for a demonstration of how future tools should evolve. Either way, because DOCstore is designed to implicitly understand data semantics, it can serve as a very powerful mechanism to expose company data and deliver on the FAIR (Findable, Accessible, Interoperable, Reusable) mantra so important to today’s life science IT infrastructure.
Data Catalogues and Master Data Management tools provide a critical and foundational data infrastructure within any large corporation. However, additional layers are required in order to truly see the value from such an investment. Semantics represents such a critical addition, providing a mechanism for use of the data in real-world scenarios. SciBite’s API-first, integration-centric software is designed to enable semantics within these environments, be they enterprise-wide or small department-specific installations.
If your organization is looking to improve its data management, whether you’re a large organization in need of enterprise-level technology or simply on a departmental basis, semantic enrichment technology like SciBite’s is key in ultimately producing machine-readable data to help you power the latest in technological advances.
In our latest blog we discuss the challenges life sciences companies, like LifeArc, face in keeping up-to-date with scientific literature, and how semantic enrichment technology can automate this process to reduce the time spent mining data by up to 80%.Read
Ontologies have become a key piece of infrastructure for organizations as they look to manage their metadata to improve the reusability and findability of their data. This is the final blog in our blog series 'Ontologies with SciBite' we discuss how ontology enrichment is essential in maintaining clean data. Follow the blog series to learn how we've addressed the challenges associated with both consuming and developing ontologies.Read
Get in touch with us to find out how we can transform your data
© SciBite Limited / Registered in England & Wales No. 07778456