SciBite demonstrates the scalability of semantic analysis on commodity hardware

Semantics as a Service on a Raspberry Pi
24th May 2017

Author: Lee Harland

It’s Bio IT World 2017 and SciBite is at the super busy Expo and Conference (we’re at Booth #532). Those of you who were there last year may have seen our little computer cluster built on the fabulous Raspberry Pi. If you’ve not heard of it, the Raspberry Pi is a tiny, bare-bones, ultra-cheap, fully fledged Linux computer. It has inspired a generation of children of all ages to go and have fun hacking together software and learning to code.

At SciBite we’re big fans of the work of the Raspberry Pi Foundation (more on that later) and of the little machines themselves. At last year’s Bio IT we showcased our DOCstore system running on a Pi cluster. Powered by Elastic, DOCstore is our semantically enabled search engine, providing scientifically aware search and analysis capabilities at the click of a button. Running it on our own cluster came in handy when the conference network went down, allowing SciBite to keep on going – think of it as SciBite: even in the face of adversity…


DOCstore in action

Anyway, we’ve brought our little Pi cluster to this year’s Bio IT too. “Why are you doing this?” I hear you ask.  We know that the Raspberry Pi is not the hardware infrastructure of choice for major pharma, so why demonstrate this at the show?

Our Pi cluster last year


Well, apart from it being a whole lot of fun getting our enterprise-grade semantic platform up and running on some $35 computers, for DOCstore we really want to demonstrate scalability on standard hardware.

Specifically, we’re able to create deep semantic indexes across large (we’re talking multi-million here) document databases using a solution based on Elasticsearch. The latter is specifically designed to scale horizontally across multiple nodes.

When we engineered the system, we built it from the ground up to support distributed servers within an enterprise or in the cloud. The Raspberry Pi experiment shows that the system scales as intended and that it can run on very low-spec commodity hardware. Just imagine the power of this when distributed across a high-powered cluster.
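For readers curious about what scaling horizontally means in practice, here is a minimal sketch of how documents can be spread across nodes. It is not DOCstore code. It follows the simplified routing rule described in the Elasticsearch documentation (a hash of the routing value, modulo the number of primary shards), but uses CRC32 as a stand-in for the hash Elasticsearch really uses, and the names `shard_for` and `distribute` are made up for illustration.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Pick a shard for a document ID.

    Assumption: CRC32 stands in for Elasticsearch's real hash function;
    only the 'hash modulo shard count' idea is being illustrated.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


def distribute(doc_ids, num_primary_shards):
    """Group document IDs by the shard each one is routed to."""
    shards = {n: [] for n in range(num_primary_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_primary_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    # Hypothetical document IDs spread over four shards (say, one per Pi).
    docs = [f"doc-{n}" for n in range(1000)]
    for shard, members in distribute(docs, 4).items():
        print(f"shard {shard}: {len(members)} documents")
```

Because the same ID always hashes to the same shard, any node can work out where a document lives, and each node only has to index and search its own share of the collection.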

Good looking and powerful


This year we’ll also have TERMite running on a Pi. TERMite is our cutting-edge named entity recognition software. It recognises concepts in text, such as drug names or diseases, and excels at correctly identifying entities. It’s taken quite a bit of effort to make this happen. TERMite works by utilising large thesauri consisting of millions of synonyms across tens of thousands of entities. Storing that data across multiple ontologies takes a fair chunk of RAM. One of the great features of TERMite is that the same software running on your corporate server will run on your laptop – allowing users to work with powerful text analytics even when not connected to the internet. We have a lot of users who really like this feature, and so to further support them, last year we developed a “low memory mode” TERMite setting that reduces the RAM requirements by up to 50%. This means that you can now run analyses with some of the most common ontologies (gene, disease, drug, etc.) in just a couple of GB of RAM.
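To give a flavour of thesaurus-driven entity recognition in general, here is a toy sketch. It is emphatically not how TERMite is implemented: the four-entry `THESAURUS`, the `annotate` function and the greedy longest-match strategy are all assumptions chosen to keep the example short.

```python
import re

# Hypothetical toy thesaurus: lowercase synonym -> (entity type, preferred name).
# A real vocabulary would hold millions of synonyms.
THESAURUS = {
    "aspirin": ("DRUG", "aspirin"),
    "acetylsalicylic acid": ("DRUG", "aspirin"),
    "heart attack": ("DISEASE", "myocardial infarction"),
    "myocardial infarction": ("DISEASE", "myocardial infarction"),
}

# Longest synonym, in words, so we know how far ahead to look.
MAX_WORDS = max(len(synonym.split()) for synonym in THESAURUS)


def annotate(text):
    """Return thesaurus matches found in text, preferring the longest match."""
    tokens = [(m.group().lower(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    hits = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tok[0] for tok in tokens[i:i + n])
            if candidate in THESAURUS:
                entity_type, preferred = THESAURUS[candidate]
                start, end = tokens[i][1], tokens[i + n - 1][2]
                hits.append({"type": entity_type, "name": preferred,
                             "text": text[start:end],
                             "start": start, "end": end})
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no synonym starts at this token
    return hits


if __name__ == "__main__":
    for hit in annotate("Acetylsalicylic acid after a heart attack"):
        print(hit)
```

Even in this toy version, every synonym has to sit in memory as a dictionary key, which hints at why millions of synonyms across multiple ontologies add up to a fair chunk of RAM.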

While this is really valued by our customers, it sadly didn’t allow us to run TERMite on a Pi, where we only have 1GB of RAM to play with (less, as we grudgingly have to give the operating system a bit to work with too!). So that was the challenge: to compress the TERMite vocabularies even further than low memory mode does, to fit all three major vocabs into around 700MB of RAM. A lot of evening and weekend tinkering later, we did it! And, as you’ll see at Bio IT, we can now set up a fully functional RESTful semantic annotation server on a $35 computer.

Our rainbow Pi cluster at Bio IT World 17

Again, you may ask: “Apart from it being immense fun, why did you bother?”

Well, think about it this way: improvements in automotive technology from sports such as Formula 1 eventually make their way into the cars we all buy. At SciBite, some of the techniques we used to compress and efficiently scan data for this project are already finding their way into the next release of TERMite, helping to increase the efficiency and performance of that product. So it turns out the TERMite-on-a-Pi project was not just an excuse to geek out after all…

And this isn’t the end of SciBite’s involvement with these machines. SciBite is proud to support the Raspberry Pi Foundation’s educational mission. It’s a fantastic organisation that works to democratise the digital world – enabling people from around the globe to gain access to computing and digital making. They provide outreach, education and free resources so that the next generation can learn to code and is empowered to shape our digital future.

So, now you see just how scalable SciBite’s technology is – together with Raspberry Pi and Elasticsearch, you can adapt it from a $35 machine to a huge server.  Pretty nifty.

To find out more about how SciBite could transform your data, drop us a line on info@scibite.com.