Natural language generation for clinical study reports, part 2: The dog that caught the car

This is a follow-up to this post from last year about NLG applications. In that post, I outlined some promising directions for applying open-source GPT models to the summarization of extremely long clinical study report (CSR) documents, and I suggested that fine-tuning the GPT-J model on pairs of CSRs and their corresponding (much shorter) scientific publications might do the trick. Here I show initial results of that experiment. (The post from last year was based on a presentation at Bio-IT World 2022; this follow-up is based on a presentation from this year’s Meeting of the Illinois Language and Linguistics Society; all outputs are the result of joint work.)

After some experimentation we decided on a two-step fine-tuning procedure for GPT-J. At a high level, it looks like this:

We structured the data into sequences such that each sequence consists of two “chunks”, with each sequence overlapping the prior sequence by one chunk (see the sketch below). For example, the first sequence of a text consists of chunk 0 and chunk 1, the second consists of chunk 1 and chunk 2, and so on. This lets us simulate a sliding window through the document, so that GPT-J can in principle learn to summarize arbitrarily long documents. Moreover, each chunk from each document is labeled with metadata consisting of a study identifier (which clinical study the CSR or publication is about) and a document type (either CSR or publication manuscript). The study identifier provides the linkage between a CSR and its corresponding publication. For example, the model will see sequences of chunks of text from the “ACHIEVE 1” clinical study report, then sequences of chunks of text from the scientific publication about the “ACHIEVE 1” clinical study, and will know through the identifier that these two documents pertain to the same study.
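
Here is a minimal sketch of how such overlapping, metadata-tagged sequences could be built. The chunk size, the tag format, and the use of the Hugging Face tokenizer are illustrative assumptions rather than our exact setup:

```python
# Minimal sketch of the overlapping two-chunk sequence construction described above.
# Chunk size and metadata header format are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
CHUNK_TOKENS = 1000  # leaves room for the metadata header within GPT-J's 2048-token context (assumed)

def chunk_document(text, chunk_tokens=CHUNK_TOKENS):
    """Split a document into fixed-size token chunks."""
    ids = tokenizer(text)["input_ids"]
    return [ids[i:i + chunk_tokens] for i in range(0, len(ids), chunk_tokens)]

def build_sequences(text, study_id, doc_type):
    """Yield overlapping two-chunk sequences, each tagged with study/document metadata."""
    header = tokenizer(f"[STUDY={study_id}] [DOC={doc_type}]\n")["input_ids"]  # hypothetical tag format
    chunks = chunk_document(text)
    for i in range(len(chunks) - 1):
        # sequence i = chunk i + chunk i+1, so consecutive sequences share one chunk
        yield header + chunks[i] + chunks[i + 1]

# Example: sequences from a CSR and its linked publication share the same study tag.
# csr_seqs = list(build_sequences(csr_text, "ACHIEVE 1", "CSR"))
# pub_seqs = list(build_sequences(pub_text, "ACHIEVE 1", "PUBLICATION"))
```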

Step 1: Fine-tune on document pairs

Every CSR in this step has a corresponding publication, and through the linked metadata, the model is in theory able to learn not only the structure of each type of document, but the correspondences between them. This is where the model learns the summarization task, in essence.
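
A rough sketch of what this first fine-tuning pass could look like, using Hugging Face’s Trainer as a stand-in for our actual training setup; the dataset wrapper and hyperparameters shown here are assumptions for illustration:

```python
# Sketch of the step-1 fine-tuning pass over interleaved CSR + publication sequences.
# Hyperparameters and the dataset wrapper are illustrative assumptions.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

class PairedSequenceDataset(Dataset):
    """Wraps the metadata-tagged sequences built above."""
    def __init__(self, sequences):
        self.sequences = sequences
    def __len__(self):
        return len(self.sequences)
    def __getitem__(self, idx):
        ids = torch.tensor(self.sequences[idx])
        # standard causal-LM objective: labels are the inputs themselves
        return {"input_ids": ids, "labels": ids}

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
args = TrainingArguments(
    output_dir="gptj-csr-pub",      # assumed output path
    num_train_epochs=1,             # assumed
    per_device_train_batch_size=1,  # assumed; GPT-J needs heavy memory-saving tricks in practice
    learning_rate=1e-5,             # assumed
)
# train_sequences = CSR + publication sequences for every study that has a linked publication
# Trainer(model=model, args=args, train_dataset=PairedSequenceDataset(train_sequences)).train()
```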

Step 2: Fine-tune further on the CSR only for the study we want to summarize

Whereas the CSR+publication GPT-J model has learned the structure of CSRs, the structure of scientific manuscripts, and how one maps to the other, this study-specific fine-tuning step teaches the model the specific findings of the study we want to summarize. By treating this as a separate, secondary fine-tuning step, we reset the optimizer state and hyperparameters and induce a recency bias, so that the model is freshly attuned to the semantic content of that particular study.
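
Continuing the sketch above, the second pass simply reloads the step-1 checkpoint and fine-tunes again on sequences from the single target CSR, with a fresh optimizer and schedule (paths and hyperparameters are again assumed):

```python
# Sketch of the study-specific second fine-tuning pass.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gptj-csr-pub")  # step-1 checkpoint (assumed path)

# Sequences from only the CSR we want summarized, e.g. "ACHIEVE 2":
# target_seqs = list(build_sequences(achieve2_csr_text, "ACHIEVE 2", "CSR"))

args = TrainingArguments(
    output_dir="gptj-achieve2",     # assumed
    num_train_epochs=1,             # assumed
    per_device_train_batch_size=1,
    learning_rate=1e-5,             # fresh optimizer state and schedule = the "recency bias" described above
)
# Trainer(model=model, args=args, train_dataset=PairedSequenceDataset(target_seqs)).train()
```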

Step 3: Inference

This is where we give the doubly fine-tuned study-specific model the title and very start of a scientific manuscript, and have it start writing the rest. The overlapping chunked sequence approach allows us to ask the model to write as many tokens as we’d like, up until the point where the model generates a special “end-of-document” token. For our initial experiments, we generated only the first 2048 tokens, corresponding roughly to the abstract section.
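
A sketch of what inference could look like with the doubly fine-tuned model, assuming the same hypothetical tag format as above; the decoding settings, and the use of the standard end-of-sequence token as a stand-in for the special end-of-document token, are illustrative:

```python
# Sketch of inference: prompt with the study tag plus the manuscript title and
# opening words, then let the model generate the rest.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("gptj-achieve2")  # step-2 checkpoint (assumed path)

prompt = "[STUDY=ACHIEVE 2] [DOC=PUBLICATION]\nTitle: ..."  # title + very start of the manuscript
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_length=2048,              # roughly the abstract, as in our initial experiments
        do_sample=True,               # assumed decoding strategy
        top_p=0.9,                    # assumed
        eos_token_id=tokenizer.eos_token_id,  # stand-in for the special end-of-document token
    )
print(tokenizer.decode(output[0], skip_special_tokens=True))

# To continue past 2048 tokens, the last chunk of generated text would be fed back
# in as the first chunk of the next sequence, mirroring the sliding window used in training.
```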

Analysis of sample output

Here is the start of a representative sample output, tested on “ACHIEVE 2”:

For brevity I’m not going through the entire output, but this sample contains everything I need to make my point, which is this:

  1. Compared to earlier efforts, we are seeing remarkable coherence, cohesion and lack of hallucination, but

  2. There are still significant factual errors (and this sample output contains the two biggest ones).

In reality, there were three treatment groups in this trial, not two. And the trial participants were allowed to take rescue medication from 2 to 48 hours after their treatment, not only up to 2 hours.

What we’re finding in these initial experiments is leaps and bounds ahead of all other efforts that I’ve been a part of to create an AI model that processes 1000+ page CSR documents and writes rough-draft manuscripts about them. But — hopefully not surprisingly — it is not quite ready for prime time. At least not without some further tweaks and some substantial safeguards.

Mitigating factual errors

One obvious way to reduce the number of factual errors is to scale up from the 6B-parameter GPT-J to larger models. Our initial experiments with the much smaller GPT-2 models, using the same training structure and parameters, yielded mostly nonsense. The leap in quality from GPT-2 to GPT-J was simply enormous. Will that scaling continue?

Two more ideas: First, GPT models seem to have unique challenges when it comes to generating stats or other numerical values. These are, at least anecdotally, the most-hallucinated facts. So, given that we very much want to prevent any factual errors from ever going into even a rough draft of a paper, it would behoove us to simply post-process the outputs in a way that masks out all numerical values and forces the human medical writer(s) to retrieve these values from the raw results data. Second, it could be possible to fact-check GPT’s outputs by using sentence similarity models to map generated sentences back to similar sentences in the original CSR document (similar to how the new LLM-based search assistants like LaMDA or the new Bing assistant map their generated outputs to retrieved web search results). Safeguards such as these will allow this technology to be deployed in a more responsible way. Speaking of responsible…
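
To make these two safeguards concrete, here is a minimal sketch of both: a regex-based pass that masks numerical values for the medical writers to fill in from the raw data, and a sentence-similarity lookup that maps each generated sentence back to its closest CSR sentence. The regex, the placeholder token, and the choice of sentence-transformers model are all illustrative assumptions:

```python
# Sketch of the two safeguards described above, assuming sentence-transformers is available.
import re
from sentence_transformers import SentenceTransformer, util

NUMBER_RE = re.compile(r"\d+(?:\.\d+)?%?")

def mask_numbers(text, placeholder="[VALUE]"):
    """Replace every numeric value so medical writers must pull numbers from the raw results data."""
    return NUMBER_RE.sub(placeholder, text)

sim_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed choice of similarity model

def nearest_source_sentence(generated_sentence, csr_sentences):
    """Map a generated sentence back to its most similar CSR sentence for human review."""
    gen_emb = sim_model.encode(generated_sentence, convert_to_tensor=True)
    csr_emb = sim_model.encode(csr_sentences, convert_to_tensor=True)
    scores = util.cos_sim(gen_emb, csr_emb)[0]
    best = int(scores.argmax())
    return csr_sentences[best], float(scores[best])

# Example:
# mask_numbers("Patients could take rescue medication from 2 to 48 hours.")
# -> "Patients could take rescue medication from [VALUE] to [VALUE] hours."
```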

The dog that caught the car

The technology is getting closer than I ever imagined to being able to write someone’s publication rough draft for them by analyzing and summarizing a much larger raw data report. But should we be doing this? Of course, the burden of ensuring the quality of paper submissions will always rest with the human experts, and they will have the tools necessary to fact-check and edit any AI-generated rough-draft suggestions. But we need to make sure that we are not increasing the probability of errors going unnoticed. In general, we need to think very carefully about risk when we undertake any application of generative LLMs.

I use the metaphor of the dog that caught the car. I think many practitioners of applied NLP have not until recently thought very deeply about questions of could vs. should, because the technologies were not mature enough to be threatening. Now we are able to accomplish things that once were the domain of sci-fi, and a new question arises: Now what? How do we actually deploy these technologies in a way that is safe and useful to businesses? Here are some questions I think should always underlie this sort of work, if it is to be done the right way:

  • Does this application of LLMs raise risk?

  • Is it placing unfair responsibility on users?

  • Is it automating a job it shouldn’t?

  • Will it introduce harmful bias?

  • Does it fail to meet explainability requirements?

  • Can the task be done with less costly methods?

  • Is there a danger that this project will give AI a bad name?

With these guiding questions in mind, as we undertake this work, we should aim to:

  • Understand risks and limitations

  • Educate the public about the risks

  • Enhance human workflows, don’t replace them

  • Protect users by putting guardrails in place

  • Evaluate alternatives to LLMs, always

I’ll get down from my soapbox now and get back to improving my models :-)