Taking Semantic Search to Full Text
28th September 2017
Author: Mike Iarrobino
When engaging in semantic search, many researchers opt for article abstracts over full-text articles because abstracts are easily accessible via biomedical databases like MEDLINE and come in XML, a format widely used for encoding documents so that computer programs can parse them. Although using abstracts seems like a reasonable approach, there are major advantages to searching across the full text of an article. For example, abstracts often omit essential facts and relationships, secondary findings, and adverse event data.
Access More Facts
While abstracts do provide some valuable information, researchers need access to full-text articles to get the best results from semantic search efforts.
- Full-text articles provide more information than abstracts. The difference is in both volume and type of information, including detailed descriptions of methods and protocols and the complete study results. Authors often include only their most important findings in the abstract, leaving secondary study findings, discoveries, observations and other critical insights only in the full-text article.
- Abstracts often exclude or underrepresent data. Given the size limitations of abstracts, results that are less relevant to the main finding are often left out. In some cases, critical information may reside in a footnote of the full text. By interrogating all of a given text, including bibliographic information, researchers can gain richer results that reveal vital patterns and information in the documents.
- New discoveries are more likely to be mentioned in the full text of articles before appearing in abstracts. Following initial publication of a new discovery in a journal, the research is often repeated and included in other publications. But there is a substantial delay between when that discovery appears in full articles and when that information appears in abstracts. In fact, it can take one to two years for discoveries to appear in the abstract of a subsequent article, according to a study conducted by publisher Elsevier.
- Full-text articles are more likely to contain information on adverse events. Per a study published in BMC Medical Research Methodology, “Abstracts published in high impact factor medical journals underreport harm even when the articles provide information in the main body of the article.” This missing information can reduce the value of abstracts as the only “raw material” in searches, especially in pharmacovigilance use cases, or when researchers want to make novel connections that haven’t been a major focus of the literature.
Unearth More Relationships
Full-text articles also contain more relationships between named entities than abstracts. According to a study published in the Journal of Biomedical Informatics, only 8% of the scientific claims made in full-text articles were found in their abstracts.
The same Elsevier study compared the use of abstracts and full-text articles to derive relevant information about drugs and proteins that affect the progression of fibromyalgia. The researchers found 31 relationships in the literature by mining abstracts and an additional 53 relationships when they ran the same search across the full-text articles.
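To put those counts in proportion, here is a minimal Python sketch of how abstract-only and full-text results might be compared. The function name `coverage` and the placeholder relationship IDs are hypothetical and do not come from the study; the sketch also assumes that every relationship found in abstracts appears in the full text as well, which is how "an additional 53" reads.

```python
def coverage(abstract_rels: set, fulltext_rels: set) -> dict:
    """Compare relationships mined from abstracts with those mined from full text."""
    all_rels = abstract_rels | fulltext_rels
    only_fulltext = fulltext_rels - abstract_rels
    share = len(abstract_rels) / len(all_rels) if all_rels else 0.0
    return {
        "total": len(all_rels),                # distinct relationships overall
        "abstract_share": share,               # fraction visible from abstracts alone
        "fulltext_only": len(only_fulltext),   # relationships missed by abstract-only mining
    }


# Placeholder IDs standing in for the 31 + 53 relationships reported above
# (hypothetical labels; only the counts come from the article).
abstract_rels = {f"rel{i}" for i in range(31)}
fulltext_rels = abstract_rels | {f"rel{i}" for i in range(31, 84)}

stats = coverage(abstract_rels, fulltext_rels)
print(stats["total"], stats["fulltext_only"])   # 84 53
print(f"{stats['abstract_share']:.1%}")         # 36.9%
```

Under that assumption, abstract-only mining would have surfaced roughly 37% of the relationships that the full-text search found.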
A recent study conducted by bioinformaticians at the University of Copenhagen and the Technical University of Denmark confirms that vital information goes undiscovered when mining abstracts rather than full-text articles. Using a named entity recognition system, the team analyzed more than 15 million full-text scientific documents published between 1823 and 2016 and compared the full-text findings to corresponding results from a matching set of MEDLINE abstracts.
The team extracted protein-protein, disease-gene, and protein subcellular associations. In every case, the results showed that mining the full-text article corpus outperformed the same analysis using abstracts only. The biggest performance gain from mining full-text articles was in the associations found between diseases and genes (see figure below).
While article abstracts yield some information, there are limitations to what can be discovered through that process. Researchers need access to the full text of the articles to ensure they don’t miss vital data and undiscovered assertions that can lead to new discoveries.
CCC (Copyright Clearance Center) and SciBite offer an integrated solution to help organizations improve the results of semantic enrichment initiatives, reduce costs and simplify copyright compliance. For more information, visit www.copyright.com/SciBiteDOCstore.
About the author
Mike Iarrobino, Product Manager, CCC
Mike Iarrobino is CCC's product manager for content and rights workflow solution RightFind® XML for Mining. He has previously managed marketing technology and content discovery products at FreshAddress, Inc., and HCPro, Inc. He speaks at webinars and conferences on the topics of content discovery and data management, and loves to get into conversations about the nature of free will.