Embedding biomedical abstracts
In the past two years of working as an NLP developer in the biopharma industry, one of the most interesting projects I worked on dealt with how best to create embedded representations of biomedical literature. The short (and maybe not so surprising) answer is that there is no “best way” to embed documents: it is highly task dependent. The medium-length answer is that for tasks concerned with high-level, document-level semantics, what the field of pragmatics might call “topics” or themes, one may do better to identify key word tokens in the document and pool their word-level embeddings than to embed the documents wholesale using “fancier” techniques such as BERT. The long answer is the following paper from KONVENS, the German conference on NLP (based on a poster originally presented at WeCNLP 2019), which I’ll summarize a bit below:
https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_30.pdf
This paper grew out of a summer project that I created for some wonderful data science students at the University of Michigan (they were fine with me being a Buckeye). I had been tasked with designing a recommendation system for scientific literature, and because I didn’t have much to go on in terms of labeled data, I wanted to take an unsupervised approach: embed the documents with pre-trained language models, then use the embeddings to find documents similar to those our scientists had already expressed interest in. I envisioned that the similarity would operate at the topic level (e.g., find other papers about adverse effects of TNF inhibitors), and so imagined two broad approaches that could work:
Lexical pooling: Extract key terms/phrases from the abstracts, embed the corresponding word tokens into a vector space (either a dense vector space using fastText embeddings pre-trained on biomedical corpora, or a sparse vector space with good ol’ tf-idf), and then pool those embeddings into a single document embedding, in this case by taking the mean; a minimal sketch follows this list.
Sequence embedding: Use a language model pre-trained on biomedical literature (either BioBERT or the NCBI’s sent2vec-based model) to embed the abstracts as sequences of tokens.
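To make the first approach concrete, here is a minimal sketch of the sparse variant of lexical pooling. The term extraction step is stubbed out and the example terms are purely illustrative; the dense fastText variant is sketched a bit further down.

```python
# Minimal sketch of the sparse lexical-pooling variant: each abstract becomes a
# tf-idf vector over its extracted key terms. Term extraction is stubbed out
# here, and the example terms are purely illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer

# Pretend these key terms were already extracted from two abstracts.
docs_as_terms = [
    ["tnf inhibitor", "adverse event", "rheumatoid arthritis"],
    ["glioblastoma", "craniotomy", "progression-free survival"],
]

# A callable analyzer lets us treat each abstract as a pre-tokenized bag of terms.
vectorizer = TfidfVectorizer(analyzer=lambda terms: terms)
doc_vectors = vectorizer.fit_transform(docs_as_terms)  # one sparse row per abstract
```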
We evaluated a number of variants of these approaches on two tasks designed to probe topic-level semantics: (1) classifying which departments medical school publications came from (e.g., Neurology, Surgery, Neurosurgery), and (2) correlating the vector similarity of document embeddings with their similarity in terms of gold-standard MeSH (Medical Subject Headings) labels. Details are in the paper, but the upshot is this: on both topic-oriented tasks the lexical pooling methods did better, and the overall “winner” was NCBI’s biomedically trained fastText embeddings pooled over keywords extracted using entity recognition and noun phrase dependency parsing. This is roughly what that pipeline looks like:
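The sketch below is my illustration, with the tooling as assumptions on my part: scispaCy’s en_core_sci_sm model stands in for the entity recognizer and noun phrase parser, and the fastText vector path is a hypothetical local copy of the NCBI biomedical vectors.

```python
# Sketch of the winning pipeline: extract entities and noun phrases from an
# abstract with a biomedical spaCy model, then mean-pool their fastText vectors.
import numpy as np
import spacy
from gensim.models.fasttext import load_facebook_vectors

nlp = spacy.load("en_core_sci_sm")  # scispaCy biomedical pipeline (assumed choice)
vectors = load_facebook_vectors("biomedical_fasttext.bin")  # hypothetical local path

def embed_abstract(text: str) -> np.ndarray:
    doc = nlp(text)
    # Key terms = recognized entities plus noun phrases from the parse.
    terms = {ent.text.lower() for ent in doc.ents}
    terms |= {chunk.text.lower() for chunk in doc.noun_chunks}
    # fastText composes vectors from subword n-grams, so multiword and
    # out-of-vocabulary terms still get a vector.
    return np.mean([vectors[t] for t in sorted(terms)], axis=0)
```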
This approach seemed deceptively simple to me at first compared to the more sophisticated methods it outperformed, such as pooling the penultimate layer of contextual embeddings from BioBERT. But then, putting on my linguist’s hat, I realized it’s not so surprising. For these tasks we really only care about things like ‘is this a clinical trial?’ or ‘is this a neurosurgery paper?’, and while large transformer models excel at encoding the sequence-level syntactic abstractions that are crucial to tasks like question answering or machine translation, they are overkill for pinning down high-level thematic content, and so are likely only to add noise where a simpler model would suffice.
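For contrast, here is what that BioBERT baseline looks like in sketch form; the Hugging Face checkpoint id is an assumption on my part, and the original experiments predate this exact API.

```python
# Sketch of the "fancier" baseline: mean-pool the penultimate layer of
# BioBERT's contextual token embeddings into one document vector.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "dmis-lab/biobert-v1.1"  # assumed BioBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

def embed_abstract_biobert(text: str) -> torch.Tensor:
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    penultimate = outputs.hidden_states[-2]    # shape: (1, seq_len, hidden_size)
    return penultimate.mean(dim=1).squeeze(0)  # average over the token dimension
```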
In the end, we used the approach sketched above in our recommendation engine, and in testing the recommendations it produced were indeed highly relevant. So this was a case where the prior research both guided a practical application and was vindicated by it.