Natural language generation for clinical study reports

Clinical trials can be exciting, bringing news of promising new treatments, whether they be novel mRNA vaccines or a cancer treatment with unprecedented results.  Outside of the industry, not many people realize just how much work and documentation goes into conducting a clinical trial.  All clinical studies begin with the protocol, a document outlining in great detail how the trial will be conducted, including methodologies, patient populations, measurements and analyses to be performed, and so on.  The protocol must be approved by regulatory agencies such as the FDA before the trial begins and must be adhered to very closely.  As the trial is conducted, the research team and a team of medical writers will begin to build up a longer document called the clinical study report, or CSR.  This document, sometimes thousands of pages long, recapitulates much of the information outlined in the protocol, but adds the results of the trial as it happened.  Finally, if the trial is successful or otherwise academically interesting, the key findings from the CSR are extracted and re-worded to create a much shorter manuscript to submit to scientific journals, in hopes that the study and its impact will be published to the broader scientific community.  This documentation process, from protocol to CSR to manuscript, is outlined below.

There is great potential to speed up this process by more efficiently transferring information from one document to the next.  Here, I outline an ongoing project with some colleagues of mine at AbbVie (Mehmed Sarıyıldız, Mark Ciaccio, Ankit Kumar Singh and Brian Martin), recently presented at the Bio IT World conference, in which we identified two such opportunities.


From protocol to CSR:  Verb tense conversion


Protocols tend to be written in future tense.  They outline what will be done in the proposed trial.  Given the requirement of close adherence to the protocol, when the study is actually conducted, there are many instances where a passage from the protocol needs to be copied and pasted directly into the CSR, but the verb tense needs to be changed to past, because it is no longer about what will be done; it is about what was done.  The following is a sample of a passage that might get such treatment.

There are a number of off-the-shelf pre-trained deep learning models that we can leverage, with some added logic on top, to identify future tense verbs and replace them with past tense ones.  But like many things, it’s easier said than done.  Here are some of the cases we want to handle:

In addition, there are edge cases where we run into the complexities of English syntax:

Note:  tense conversion tools could be created for any language with sufficiently accurate part-of-speech and dependency parsing models; however, because of complexities like these, some language-specific knowledge and logic is required to make sure that the results of the tense conversion respect the grammatical and syntactic rules of the language!

With all of these cases in mind, we were able to leverage spaCy, and particularly the biomedical English models provided by scispaCy, together with custom logic to ensure grammaticality, and wrap it all up using FastAPI to create a simple tense conversion tool.  For the substitution logic, we used the tenseflow library as a starting point but made many changes to handle the cases described above.
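To make the approach concrete, here is a minimal sketch of the core idea, assuming spaCy and a scispaCy model are installed.  The future_to_past function and its tiny lookup table are illustrative placeholders that handle only the simple "will + base verb" pattern, not the full case logic of our tool.

import spacy

# Minimal sketch (not the production tool): assumes the scispaCy biomedical
# English model "en_core_sci_sm" is installed; any spaCy English model with
# part-of-speech tags would also work for this toy example.
nlp = spacy.load("en_core_sci_sm")

# Tiny illustrative lookup for irregular past forms; the real substitution
# logic (adapted from tenseflow) covers far more cases.
PAST_FORMS = {"be": "was", "receive": "received", "undergo": "underwent"}

def future_to_past(text: str) -> str:
    """Convert simple 'will <verb>' constructions to past tense."""
    doc = nlp(text)
    out = []
    skip_next = False
    for i, tok in enumerate(doc):
        if skip_next:
            skip_next = False
            continue
        # Detect "will"/"shall" followed by a base-form verb (tag VB).
        if tok.lower_ in {"will", "shall"} and i + 1 < len(doc) and doc[i + 1].tag_ == "VB":
            verb = doc[i + 1]
            past = PAST_FORMS.get(verb.lemma_, verb.lemma_ + "ed")  # naive fallback
            out.append(past + verb.whitespace_)
            skip_next = True
        else:
            out.append(tok.text_with_ws)
    return "".join(out)

print(future_to_past("Patients will receive the study drug and will undergo weekly assessments."))
# -> "Patients received the study drug and underwent weekly assessments."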


It’s a relatively modest thing, but it will save some manual work, and it is a testament to the power of many of these off-the-shelf models.

From CSR to manuscript:  Text summarization with SBERT and GPT-J


Going from CSR to manuscript is a much harder problem because of the amount of information compression – the manuscript is much shorter than the CSR, containing a distillation of the key points, either condensing or entirely omitting much of the information from that longer document.  We are therefore treating this as a text summarization problem, which we can further divide into two steps of refinement:  extractive summarization and abstractive summarization.


Extractive summarization via regression on semantic similarity (SBERT)


Imagine that I am a medical writer tasked with writing a rough draft of a manuscript reporting a clinical study, and I am using the CSR as my guide.  I may locate the most relevant passages where the key methods and results are laid out, and my first step may be to copy and paste these passages in their entirety into my manuscript file, and then further refine the passages by shortening them and connecting them to improve flow.  The first of these two steps corresponds to the task of extractive summarization – I am not composing new passages of text, but rather extracting from existing text.  Taking this approach, we trained an extractive summarization model using a curated set of CSR + published article pairs for studies that had already been published.  The premise of our model is that, for any given passage from the CSR, we can use its maximum semantic similarity to some passage from the corresponding article as a stand-in for relevance.  In other words, if a passage were copy-pasted wholesale from CSR to manuscript, it would receive a maximum similarity score of 1, since there is an identical passage in the article.  Irrelevant passages from the CSR are expected to have lower scores.


In this case, semantic similarity was calculated independently of the relevance model, as the cosine similarity between embeddings derived from SBERT (Sentence-BERT; Reimers and Gurevych, 2019).  Taking the maximum similarity between a CSR passage and any manuscript passage as the label, we trained simple regression models to predict how relevant a CSR passage would be.  The best-performing lightweight model (a gradient boosting regressor over n-grams), while not noise-free, does a reasonable job of predicting scores on a held-out test set:
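The pipeline below is a minimal sketch of this labeling-and-regression idea, assuming the sentence-transformers and scikit-learn packages; the SBERT checkpoint name and the toy passage lists are illustrative stand-ins, not our actual data or configuration.

from sentence_transformers import SentenceTransformer, util
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import make_pipeline

# Any SBERT checkpoint can stand in here; "all-MiniLM-L6-v2" is a common default.
sbert = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative placeholders for passages from one CSR / published-article pair.
csr_passages = [
    "Patients were randomized 1:1 to drug or placebo.",
    "Case report forms were archived according to internal procedures.",
]
article_passages = [
    "Participants were randomly assigned 1:1 to receive drug or placebo.",
]

# Label each CSR passage with its maximum cosine similarity to any article passage.
csr_emb = sbert.encode(csr_passages, convert_to_tensor=True)
art_emb = sbert.encode(article_passages, convert_to_tensor=True)
labels = util.cos_sim(csr_emb, art_emb).max(dim=1).values.cpu().numpy()

# Lightweight relevance model: a gradient boosting regressor over n-gram features.
relevance_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    GradientBoostingRegressor(),
)
relevance_model.fit(csr_passages, labels)

# At inference time, score passages from a new CSR and keep the highest-scoring ones.
scores = relevance_model.predict(["Subjects will be followed for 24 weeks after dosing."])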

From here, we can use predicted scores to extract the most relevant passages for extractive summarization.


Abstractive summarization with GPT-J


This is where it gets really interesting.  Can an AI literally write a rough draft of a manuscript, given the output of the CSR extraction model, condensing each input passage further and producing an end result that flows like natural English text?  This is an example of abstractive summarization, which differs from extractive summarization in that brand new text is being composed.  The most cutting-edge tool we have for these sorts of natural language generation (NLG) tasks is the GPT family of models.  OpenAI’s GPT-3 language model can famously compose extremely convincing text.  GPT-J is a similar model that, while smaller in size and noisier in its NLG capabilities, has the benefit of being (1) open-source and (2) easier to deploy because of its smaller size.  (Note:  we are also excited to begin experimenting with Facebook AI’s new open-source GPT offerings, the OPT family of models.)


GPT models are, at their core, highly sophisticated auto-complete systems.  To use them out of the box for specific tasks like abstractive summarization, one must perform what is commonly referred to as prompt engineering.  In other words, we must give the model an input prompt that guides it toward performing the task we need as part of its predictive output.  One example would be to supply as input a concatenation of (1) the passage of text we want to summarize and (2) some framing device such as “In summary,”, “In other words,” or even “TL;DR”.  Then, when the model predicts the text that would follow, it does so with an eye toward the summarization task.
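As a rough sketch of what this looks like in code (assuming the Hugging Face transformers library and the public "EleutherAI/gpt-j-6B" checkpoint; the passage and generation settings are illustrative, not our production configuration):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "EleutherAI/gpt-j-6B"  # public GPT-J checkpoint; large download
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# An extracted CSR passage to condense (placeholder), plus a framing device
# that steers the model's continuation toward a summary.
passage = "..."
prompt = passage + "\n\nIn summary,"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=120,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode only the newly generated tokens (everything after the prompt).
summary = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(summary)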


This works surprisingly well out of the box.  Below is a real example of such a summarization that I created with GPT-J.

On the one hand, this is a remarkably coherent piece of English text, which does pertain to the study at hand, and at first glance appears to distill the information in the input down to a denser form.  But there is one problem…


The Hallucination Effect


So-called hallucination (see Ji et al., 2022, for an overview) occurs when NLG models such as GPT-J produce text that, while coherent on the surface, is semantically dubious in some way.  In the context of summarization, this means that the summary output introduces “facts” that were never in the input!  In the example above, the culprit is the bit about “no new safety findings”.  The input text deals mostly with the efficacy of the drug and does not claim that there were no new safety findings (though that could well be true).  This sort of hallucination is highly undesirable.  How do we fix it?  The answer, we believe, lies in task-specific fine-tuning…


Where to go next


The current state of the project is that we are exploring the best ways to fine-tune the GPT-J model (or similar models) to produce exactly the kind of summaries we want.  By training it on this specific task, we believe that we can mitigate hallucinations and create a system that will be able to compose rough drafts of manuscripts from CSRs through a combination of extractive summarization and iterative abstractive summarization.
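We have not settled on a recipe yet, but one plausible shape for such fine-tuning is sketched below: a hypothetical example, assuming the Hugging Face transformers and datasets libraries, with placeholder data, in which the causal language model is trained on CSR passages concatenated with the same “In summary,” framing device and a reference summary.

from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "EleutherAI/gpt-j-6B"  # a smaller causal LM works for quick experiments
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder (CSR passage, reference summary) pairs; in practice these would
# come from curated CSR/manuscript alignments.
pairs = [
    {"passage": "...", "summary": "..."},
]

def to_text(example):
    # Use the same framing device at training time as at inference time.
    return {"text": example["passage"] + "\n\nIn summary, " + example["summary"]}

dataset = Dataset.from_list(pairs).map(to_text)
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gptj-csr-summarizer",
        per_device_train_batch_size=1,
        num_train_epochs=1,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()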


This is an exploratory project, and we do not expect that AI will be writing scientists’ papers any time soon (nor do we necessarily want it to).  However, the impact this could have is to reduce the burden on medical writers by giving them a starting point.  Imagine an AI-produced summary of a clinical study report, where each passage is linked back to the pages in the CSR document it came from.  Researchers and writers would be free to use or alter passages of the rough draft where deemed appropriate, and, in all other cases, at least get a pointer to where in the CSR they should be looking for the most relevant information.  In this way, we can use AI and NLG to enhance, not replace, human workflows.