One of the most critical pieces of technology here at SciBite is our finely-tuned named entity recognition (NER) engine, TERMite. Its job is to find mentions of genes, diseases, drugs, companies, processes, chemicals and anything else of interest in text, and return them to people and systems for data insight and analysis. A common use case is document search, where our tools help users sift through thousands of documents in their organisation to find those that really matter.
One issue with indexing across an entire organisation is that we humans are quite messy and will often leave personal data on our work computers. Depending on how diligent you are, one day you might decide to spring clean your laptop and file the clutter of documents on your desktop into a “2017” folder. Maybe that folder gets dragged into an “other” folder, and so on, until eventually, it ends up on some shared network folder in your organisation, mixed in with other project and company information. Consider the following snippet which found its way onto a shared network drive:
While this may present a ‘noise’ issue, there’s also a more serious consequence: the potential exposure of sensitive personal information through search portals. If this document is indexed by a search engine and someone then searches for “contract”, they could see your bank details and phone number. One of our customers raised exactly this concern and wondered whether we could identify such concepts in text and ‘redact’ them in a pre-processing step, so that the search engine never sees the personal information.
Given SciBite is all about “understanding” text, this fits well within our processing capabilities. Phone, bank and social-security numbers are mostly digits in very consistent formats, so developing TERMite recognition modules that use algorithms to identify them is straightforward, and TERMite can be set up very quickly to find the required concepts. The next question is how to remove them from the document without also redacting other numbers (such as data points and compound reference numbers), so that as much of the document’s integrity as possible is maintained.
It’s here that we leverage another key element of TERMite’s processing module: the use of document context. TERMite scans the document for words and phrases which indicate what the document or sentence is ‘about’ and uses these to determine whether or not it should generate a match. For example, we use this to distinguish when the word “hedgehog” refers to the small mammal and when it’s used to mean the major cancer-controlling gene of the same name. In our case, we can use clues such as colleague names or phrases like “bank account” or “transfer money” to indicate that the document contains personal information, and then redact it as such.
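To make the idea of context-gated matching concrete, here is a minimal Python sketch. It is not how TERMite is implemented: the patterns, the cue list, and the names `PATTERNS`, `CONTEXT_CUES`, `sentences` and `find_sensitive_numbers` are all assumptions made for illustration. It only reports a number when the same sentence also contains a cue phrase.

```python
import re

# Illustrative UK-style formats only (assumption); real formats vary widely.
PATTERNS = {
    "ACCOUNT_NUMBER": re.compile(r"\b\d{8}\b"),
    "SORT_CODE": re.compile(r"\b\d{2}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b0\d{4}\s?\d{6}\b"),
}

# Hypothetical cue phrases suggesting a sentence is "about" personal details.
CONTEXT_CUES = ("bank account", "sort code", "transfer money", "phone", "call me")


def sentences(text):
    """Yield (offset, sentence) pairs using a naive split after '.', '!' or '?'."""
    start = 0
    for gap in re.finditer(r"(?<=[.!?])\s+", text):
        yield start, text[start:gap.start()]
        start = gap.end()
    if start < len(text):
        yield start, text[start:]


def find_sensitive_numbers(text):
    """Return (label, start, end) spans for numbers found in cue-bearing sentences."""
    hits = []
    for offset, sentence in sentences(text):
        lowered = sentence.lower()
        if not any(cue in lowered for cue in CONTEXT_CUES):
            continue  # no contextual evidence, so leave numbers in this sentence alone
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(sentence):
                hits.append((label, offset + match.start(), offset + match.end()))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    sample = (
        "Please transfer money to my bank account 12345678, sort code 12-34-56. "
        "Compound 87654321 showed 45% inhibition."
    )
    for label, start, end in find_sensitive_numbers(sample):
        print(label, sample[start:end])
```

In this sketch the compound reference number in the second sentence is left alone because nothing in that sentence hints at personal information. The sentence splitter is deliberately naive (it would break on abbreviations, for instance), which is one reason a production system needs something far more robust.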
Here’s the result of this stage, after asking TERMite to process this document and scan for “sensitive numbers”. The image below depicts a human-readable version of the API response:
As expected, TERMite has correctly identified personal information as indicated by the shaded areas of the text. Now what we’d like to do is to ask TERMite to redact this information. It’s here we make use of another key TERMite feature – output templating. This feature allows us to completely define what the output of the system looks like, without the need to write any code. Templates are simple text files that instruct TERMite how to create a vast array of different formats from HTML to MS-Word, XML and JSON through to specialist formats such as Cytoscape, OpenBEL and RDF. We can use this feature to create a “redacted” template, which blocks out areas of the original input text that are identified as personal information.
Here’s part of that template, where we’ve set “redactedEntities=*”, which means anything TERMite finds as significant should be redacted. This could instead be set to block specific hits but allow others through.
Re-executing the analysis specifying the use of the redacted output format gives us the following result (human readable version) from the TERMite API:
The personal information has been completely replaced with the value of the “redactedTag” from the template. The document can now be further processed and added to the department or company’s search tool without fear of exposing the information.
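The redaction step itself can be sketched in a few lines of Python. This is an illustration of the general technique only, not TERMite’s template syntax or API: the `redact` function is hypothetical, and its `redacted_tag` and `redacted_entities` parameters simply mirror the “redactedTag” and “redactedEntities” settings mentioned above. It assumes hits arrive as (label, start, end) character spans.

```python
def redact(text, hits, redacted_tag="[REDACTED]", redacted_entities="*"):
    """Replace selected hit spans in text with redacted_tag.

    hits: iterable of (label, start, end) character spans (assumed format).
    redacted_entities: "*" to redact every hit, or a set of labels to redact.
    """
    selected = [
        hit for hit in hits
        if redacted_entities == "*" or hit[0] in redacted_entities
    ]
    pieces = []
    cursor = 0
    for _label, start, end in sorted(selected, key=lambda hit: hit[1]):
        if start < cursor:
            # Overlaps a span already redacted: extend the hidden region.
            cursor = max(cursor, end)
            continue
        pieces.append(text[cursor:start])  # keep untouched text before the hit
        pieces.append(redacted_tag)        # substitute the tag for the hit
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    sample = "Account 12345678, sort code 12-34-56."
    spans = [
        ("ACCOUNT_NUMBER", sample.index("12345678"), sample.index("12345678") + 8),
        ("SORT_CODE", sample.index("12-34-56"), sample.index("12-34-56") + 8),
    ]
    print(redact(sample, spans))
    print(redact(sample, spans, redacted_entities={"SORT_CODE"}))
```

Passing a set of labels instead of “*” corresponds to blocking certain hits while letting others through. Overlapping spans are merged so that no part of a flagged region leaks into the output.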
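To summarise, this example demonstrates the ability of TERMite to identify sensitive numbers in text, use document context to avoid redacting unrelated numbers, and redact the matches through output templating without the need to write any code.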
Of course, the example above is a simple one. Much more detailed personal information, such as addresses, staff appraisal results and salaries, may be found in documents, but for each of these you can develop TERMite recognition modules that find these fragments and redact them through the same processing pipeline.
So now you can see the true flexibility of the SciBite Semantic Platform in how it can find important and sensitive things, and then remove those things from documents. It’s a great use case that we know holds value for our customers.
Lee Harland, CSO & Founder
Lee founded SciBite in 2013, spotting a gap in robust, industry-centric text analytics solutions. He has extensive experience in life sciences with a PhD in genetics from King’s College London, followed by more than 15 years leading semantic web, data integration and text-mining efforts as applied to industrial life sciences. He has published many papers in these areas and serves as an advisor and collaborator to a number of initiatives such as Open PHACTS, BioMedBridges, the Experimental Factor Ontology/Biosamples and ChEMBL groups.
© SciBite Limited / Registered in England & Wales No. 07778456