What are the risk factors for poor COVID-19 outcomes?

This post details some work I did this past spring related to the CORD-19 challenge, which was posted to Kaggle here.

The CORD-19 data set contains over 30,000 biomedical publications pertaining to COVID-19-related coronaviruses and respiratory illnesses. The broad challenge here centers on information extraction of many stripes, but I was immediately drawn to the challenge of extracting information about risk factors. Some of my colleagues had the idea to use SQuAD-fine-tuned BERT or BioBERT to simply ask questions of the data set, such as, “is smoking a risk factor for COVID-19?” This idea comes up rather often, in fact, and one of the big roadblocks is that using the large BERT model to answer questions, while impressively accurate, is rather slow on large sets of documents. BERT’s maximum sequence length is 512 tokens, so every document must be split into 512-token chunks before inference can proceed. Multiply this by tens of thousands, or in some cases tens of millions, of biomedical abstracts, and you’ll be waiting all day for your answer if you can’t convince your company to pay for more GPUs.
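To make the cost concrete, here is a minimal sketch of that chunking step using the HuggingFace transformers tokenizer. This is illustrative rather than the original pipeline's code; the 128-token stride and the stand-in document are assumptions.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    document = "Patients with hypertension showed elevated risk of severe illness. " * 200  # stand-in long text
    question = "Is smoking a risk factor for COVID-19?"

    # Every (question, document) pair must be packed into 512-token windows;
    # overlapping windows (stride) keep answers from being cut at a boundary.
    enc = tokenizer(
        question,
        document,
        max_length=512,
        truncation="only_second",          # truncate the document, never the question
        stride=128,
        return_overflowing_tokens=True,
    )
    print(len(enc["input_ids"]), "windows to run through BERT for one document")

Every window is a full forward pass through the model, which is where the compute time goes.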

One way to mitigate this practical problem is to use a less computationally intensive method to pre-filter your documents down to a small subset which are maximally likely to contain the relevant answers. This idea is especially useful if you know in advance the general topic of the question, as in this case, where we were specifically interested in risk factors. If we can simply tell the system that the topic we are engaging with is ‘Risk Factors’, the system can pull out the top 1000 or so records with the most information on risk factors, and run the model inference over these more relevant documents. And similarly for other important pre-defined topics.

If we buy into this approach, then our practical problem of reducing compute time for Q/A is reduced to a more traditional NLP challenge: how do we auto-tag the documents in our data set with topic labels such as ‘Risk Factors’? Here we can leverage transfer learning from MeSH (Medical Subject Headings), a set of gold-standard, expert-annotated keywords from the PubMed database. While many of the articles in the CORD-19 data set don’t have MeSH annotations, because they are either too new or not on PubMed, we can train a model on a large number of existing, non-CORD-19 PubMed abstracts and annotations, then apply that model to score, rank, and filter CORD-19 documents by their relevance to the topic we’re interested in (in this case, ‘Risk Factors’). Here’s what I did.

Learning from MeSH, Step 1:  Creating a Dataset

Extract all database records which have both an abstract and a non-empty set of MeSH term annotations. The "positive" set is all such records which have 'Risk Factors' as a MeSH topic (or substitute whatever topics you care about). The "negative" set is the complement of the positive set.

Balance the data set by setting the lengths of the positive and negative sets equal to each other (truncating the longer one at random); concatenate the resulting sets, labeling positives 1 and negatives 0. The result for 'Risk Factors' is a classic binary text classification data set with over a million data points.  More than enough to train a decent model.
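In pandas, both steps fit in a few lines. This is a minimal sketch, assuming the records live in a CSV with 'abstract' and 'mesh_terms' columns (hypothetical names; my actual pipeline differed in its storage details):

    import pandas as pd

    # Keep records with both an abstract and MeSH annotations.
    records = pd.read_csv("pubmed_records.csv").dropna(subset=["abstract", "mesh_terms"])

    # Positive set: 'Risk Factors' among the MeSH terms; negative set: everything else.
    has_topic = records["mesh_terms"].str.contains("Risk Factors")
    pos, neg = records[has_topic], records[~has_topic]

    # Balance: truncate the larger class at random, label, concatenate, shuffle.
    n = min(len(pos), len(neg))
    pos = pos.sample(n=n, random_state=0).assign(label=1)
    neg = neg.sample(n=n, random_state=0).assign(label=0)
    dataset = pd.concat([pos, neg]).sample(frac=1, random_state=0)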

Learning from MeSH, Step 2:  Tokenizer and Integer Coder

I used a custom biomedical English tokenizer similar to other word-piece tokenizers used for transformers and other sequence-based models. This tokenizer iteratively tokenizes using frequency thresholds, where the frequencies are taken from our PubMed database:  (1) pull out all "words" using a simple regex; (2) if a word's frequency is below a threshold f, pull it apart into component syllable tokens (using pyphen); (3) if a given syllable's frequency is below a threshold f', pull it apart into component character tokens.  The effect is that biomedically common words are left intact, while other words are segmented into word parts.  This keeps the vocabulary size under control and requires no "out of vocabulary" tokens, which is crucial for sequence-based models like RNNs and transformers.  Here is an example tokenization:

Sentence:  "The word 'inhibitor' is intact, but the word 'banana' is not."

Tokens:  ['The', 'word', 'inhibitor', 'is', 'intact', 'but', 'the', 'word', '#ba', '#n', '#a', '#n', '#a', 'is', 'not', '.']
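A minimal sketch of this scheme, using pyphen for the syllable splits; the frequency table and the thresholds f and f' are assumptions standing in for the counts taken from our PubMed database:

    import re
    import pyphen

    dic = pyphen.Pyphen(lang="en_US")

    def tokenize(text, freq, f=1000, f_syll=100):
        """Frequency-gated tokenizer sketch; `freq` maps strings to corpus counts."""
        tokens = []
        for tok in re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text):
            if not tok.isalpha() or freq.get(tok, 0) >= f:
                tokens.append(tok)                          # punctuation and common words stay intact
                continue
            for syll in dic.inserted(tok).split("-"):       # rare word: split into syllables
                if freq.get(syll, 0) >= f_syll:
                    tokens.append("#" + syll)
                else:
                    tokens.extend("#" + ch for ch in syll)  # rare syllable: split into characters
        return tokens

Given a frequency table in which 'inhibitor' and the syllable 'ba' are common but 'banana' and 'na' are rare, this reproduces the tokenization above.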

Learning from MeSH, Step 3:  Creating a DataLoader

Piggybacking off of previous projects, I decided to implement this exercise in PyTorch.  One aspect of PyTorch that I've been wanting to learn is how Dataset and DataLoader objects work.  In the past, I have done the work of loading my data and converting it to tensors mostly from scratch, and probably in an unoptimized way compared to what comes bundled with PyTorch.  With some ease I was able to get a DataLoader to pull abstracts and labels from multiple CSV files and create batched tensors of any given batch size.  The code for this is in the Kaggle post linked at the beginning of this post.
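Here is a condensed sketch of that Dataset/DataLoader pairing; the file pattern, column names, and the stand-in tokenizer are assumptions, and the real code is in the Kaggle post:

    import glob
    from collections import defaultdict

    import pandas as pd
    import torch
    from torch.utils.data import Dataset, DataLoader

    # Stand-ins; in practice, use the Step 2 tokenizer and its vocabulary.
    tokenize_fn = lambda text: text.lower().split()
    vocab = defaultdict(lambda: 1)      # id 0 is reserved for padding

    class AbstractDataset(Dataset):
        """Map-style Dataset over multiple CSVs of (abstract, label) rows."""
        def __init__(self, csv_glob):
            self.data = pd.concat([pd.read_csv(p) for p in glob.glob(csv_glob)],
                                  ignore_index=True)

        def __len__(self):
            return len(self.data)

        def __getitem__(self, i):
            row = self.data.iloc[i]
            ids = [vocab[t] for t in tokenize_fn(row["abstract"])]
            return torch.tensor(ids), torch.tensor(float(row["label"]))

    def collate(batch):
        # Pad each batch to its own longest sequence (0 = padding id).
        seqs, labels = zip(*batch)
        return torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True), torch.stack(labels)

    loader = DataLoader(AbstractDataset("mesh_*.csv"), batch_size=32,
                        shuffle=True, collate_fn=collate)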

Learning from MeSH, Step 4:  Training a Model

There are lots of things I could have done at this point.  I ended up training a small transformer model, knowing that transformers are likely overkill for binary text classification.  As with Step 2, I did it this way mainly as a learning exercise: another goal of mine was to play with the torch.nn implementation of TransformerEncoder.  I learned some good lessons about it, namely that (1) the TransformerEncoder does not "come with" an initial embedding layer, so you need to add that to the model yourself, and (2) positional encodings (the additional embeddings that use sine/cosine functions to encode where in the sequence a token is located) are also not included.  I used the PositionalEncoding implementation from PyTorch's transformer tutorial, which was straightforward to integrate.
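Here is a minimal sketch of the resulting kind of model, with both lessons applied; the hyperparameters are illustrative, and the PositionalEncoding is adapted (batch-first) from the PyTorch tutorial:

    import math
    import torch
    import torch.nn as nn

    class PositionalEncoding(nn.Module):
        """Sine/cosine positional encodings, adapted from PyTorch's transformer tutorial."""
        def __init__(self, d_model, max_len=512):
            super().__init__()
            position = torch.arange(max_len).unsqueeze(1)
            div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
            pe = torch.zeros(1, max_len, d_model)
            pe[0, :, 0::2] = torch.sin(position * div_term)
            pe[0, :, 1::2] = torch.cos(position * div_term)
            self.register_buffer("pe", pe)

        def forward(self, x):                  # x: (batch, seq, d_model)
            return x + self.pe[:, : x.size(1)]

    class TopicClassifier(nn.Module):
        def __init__(self, vocab_size, d_model=128, nhead=4, num_layers=2):
            super().__init__()
            # Lesson (1): the encoder has no embedding layer of its own.
            self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
            # Lesson (2): positional encodings must also be added manually.
            self.pos = PositionalEncoding(d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers)
            self.head = nn.Linear(d_model, 1)  # single logit for the binary label

        def forward(self, ids):                # ids: (batch, seq), 0 = padding
            x = self.pos(self.embed(ids) * math.sqrt(self.embed.embedding_dim))
            x = self.encoder(x, src_key_padding_mask=(ids == 0))
            return self.head(x.mean(dim=1)).squeeze(-1)  # naive mean-pool, then classify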

Learning from MeSH, Step 5:  Evaluation

My MeSH data set consists of 979 files, each with about 1,100 data points.  I trained on 900 of them and evaluated on the other 79.  Accuracy on the balanced evaluation set after training was 90%.  This is a great result considering that MeSH annotations are actually made using the full article text, not just the abstract, so there is expected to be a sub-100% ceiling on how well a MeSH term can in principle be predicted from the abstract alone.

Learning from MeSH, Step 6:  Applying the Model

The trained model was then used to generate "risk_factor" scores between 0 and 1 for every abstract in the CORD-19 data set.  These scores can then be used to rank and filter the data (using thresholds, the default being 0.5) by how likely each abstract is to instantiate the topic 'Risk Factors'. Let's explore the results and applications of this model a bit further.
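Scoring is just a sigmoid over the classifier's logits. A sketch, reusing the hypothetical names from the earlier steps; `cord19_loader` is assumed to be an unlabeled DataLoader over the CORD-19 abstracts:

    import torch

    @torch.no_grad()
    def risk_factor_scores(model, loader):
        model.eval()
        scores = []
        for ids in loader:                                     # unlabeled batches of token ids
            scores.extend(torch.sigmoid(model(ids)).tolist())  # logits -> [0, 1]
        return scores

    scores = risk_factor_scores(model, cord19_loader)
    ranked = sorted(zip(scores, abstracts), key=lambda p: p[0], reverse=True)
    top_1000 = [text for score, text in ranked[:1000] if score > 0.5]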

Observation #1:  The highest-ranked CORD-19 abstracts tend to explicitly mention 'risk' or 'risk factors'... Here's one of the top-ranked abstracts by risk_factor score:

Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis

Abstract Background An outbreak of Novel Coronavirus (COVID-19) in Wuhan, China, the epidemic is more widespread than initially estimated, with cases now confirmed in multiple countries. Aims The aim of the meta-analysis was to assess the prevalence of comorbidities in the COVID-19 infection patients and the risk of underlying diseases in severe patients compared to non-severe patients. Methods A literature search was conducted using the databases PubMed, EMBASE, and Web of sciences until February 25, 2020. Risk ratio (OR) and 95% confidence intervals (CIs) were pooled using random-effects models. Results Eight studies were included in the meta-analysis, including 46248 infected patients. The result showed the most prevalent clinical symptom was fever (91 ± 3, 95% CI 86-97%), followed by cough (67 ± 7, 95% CI 59-76%), fatigue (51 ± 0, 95% CI 34-68%) and dyspnea (30 ± 4, 95% CI 21-40%). The most prevalent comorbidity were hypertension (17 ± 7, 95% CI 14-22%) and diabetes (8 ± 6, 95% CI 6-11%), followed by cardiovascular diseases (5 ± 4, 95% CI 4-7%) and respiratory system disease (2 ± 0, 95% CI 1-3%). Compared with the Non-severe patient, the pooled odds ratio of hypertension, respiratory system disease, cardiovascular disease in severe patients were (OR 2.36, 95% CI: 1.46-3.83), (OR 2.46, 95% CI: 1.76-3.44) and (OR 3.42, 95% CI: 1.88-6.22), respectively. Conclusion We assessed the prevalence of comorbidities in the COVID-19 infection patients and found underlying disease, including hypertension, respiratory system disease and cardiovascular, may be a risk factor for severe patients compared with Non-severe patients.

Such mentions are common among the highest-ranked abstracts.

Observation #2:  ... but that's not all it's learning! Here's an example of an abstract with a score greater than 0.99, which does indeed discuss risk factors (see the sentence on poor prognostic factors), but which does not explicitly name them as such:

The 2019 Novel Coronavirus Outbreak - A Global Threat The 2019 Novel Corona virus infection (COVID 19) is an ongoing public health emergency of international significance. There are significant knowledge gaps in the epidemiology, transmission dynamics, investigation tools and management. In this article, we review the available evidence about this disease. Every decade has witnessed the evolution of a new coronavirus epidemic since the last three decades. The varying transmission patterns, namely, nosocomial transmission and spread through mildly symptomatic cases is an area of concern. There is a spectrum of clinical features from mild to severe life threatening disease with major complications like severe pneumonia, ARDS, acute cardiac injury and septic shock. Presence of bilateral ground glass opacity and consolidation on imaging in appropriate clinical background should raise a suspicion about COVID 19. Poor prognostic factors include Multilobular infiltration on chest imaging, Lymphopenia, Bacterial co-infection, Smoking history, Chronic medical conditions like Hypertension and age >60 years (MuLBSTA score). Diagnosis is confirmed with PCR based testing of appropriate respiratory samples. Management is primarily supportive, with newer antivirals (lopinavir ritonavir and Remdesivir) under investigation. Role of steroids is still inconclusive. Standard infection control and prevention techniques should be followed. Vigilant screening of suspected cases and their contacts is important. Isolation of symptomatic cases and home quarantine of asymptomatic contacts is recommended. To conclude, controlling this highly transmissible disease requires international co-ordination.

Conversely, mentions of "risk" are also found in very low-ranked articles which do not discuss risk factors for illness.

Using this model to refine a question answering model

Now back to our original purpose: ranking these documents for question answering. If we take the top 1,000 of the original 30,000 in terms of relevance to risk factors and ask ‘what are the risk factors for COVID-19?’, the result is not only faster but (in my qualitative judgment) cleaner and more accurate, with less noise in the output, than asking the same question of the entire data set. Below is a sample of top answers from these top 1,000 abstracts, displayed in a web interface cooked up by my wonderful colleagues to illustrate this system.

Transfer learning from MeSH + document pre-filtering + SQuAD-trained BERT = risk factors for severe COVID
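To round out the picture, here is a hedged sketch of this last stage using HuggingFace's question-answering pipeline. The checkpoint name is an assumption (my colleagues' interface was not built on this exact stack), and top_1000 carries over from the scoring sketch in Step 6:

    from transformers import pipeline

    qa = pipeline("question-answering",
                  model="bert-large-uncased-whole-word-masking-finetuned-squad")

    question = "What are the risk factors for COVID-19?"
    answers = [qa(question=question, context=abstract) for abstract in top_1000]
    answers.sort(key=lambda a: a["score"], reverse=True)  # each answer carries a confidence score
    for a in answers[:10]:
        print(f'{a["score"]:.3f}  {a["answer"]}')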

Finally, I should mention that I was able to successfully use this method to create a collection of literature related to coronavirus drug design (learning from the MeSH topic ‘Drug Design’), which I in turn used to seed recommendations from the system mentioned here to find and share the newest literature on possible targets and repurposed drugs to treat COVID-19.