AI played a part in spotting COVID-19. Now, it's helping to fight it.

Date posted
1 April 2020
Reading time
24 Minutes
Austin Tanney


On 11 March 2020, the novel coronavirus outbreak was declared a global pandemic by the World Health Organisation (WHO). At the time of writing (30 March 2020) there are 723,732 cases of coronavirus worldwide, and 34,000 deaths have been attributed to the disease.

All over the world, countries are closing borders and businesses and issuing guidance to citizens to self-isolate, to stay at home and to maintain strict social distancing.  

Back in December, the AI company BlueDot identified and flagged patterns in Wuhan, China 9 days before the WHO announced the emergence of this novel coronavirus.  

So, AI played a part in spotting the disease, but does AI have a part to play in fighting it? The short answer to this is 'yes, absolutely'. There is almost too much to write about when considering how AI can help fight this, and any future pandemics, so I am going to focus on one element.  

How much do we know about this novel virus? How much do we know about similar viruses?  

Well, it turns out we know quite a lot. 

On 16 March 2020 the White House issued a 'Call to Action to the Tech Community on New Machine Readable COVID-19 Dataset'. This unique dataset was collated in a collaboration between Microsoft, the National Library of Medicine (NLM), the Chan Zuckerberg Initiative (CZI) and the Allen Institute for AI, coordinated by Georgetown University, and represents over 29,000 articles, over 13,000 of which have full text.  

That's a lot of information to try and digest. If we assume that each article takes 10 minutes to read, just reading all this information would take 201 days, and that's if you were somehow able to read 24 hours a day. Of course, even if you did read all this information, how much of it would you take in? Even for experts for whom this is their life, livelihood and passion, you're still going to be hard pressed to consume all this information in your career. In a state of emergency, during a pandemic, we need to get real, reliable information fast. Can we use AI to digest, analyse and summarise the key findings from this data? 

The dataset was published on the Allen Institute's website and a challenge was established on the Kaggle platform to enable the citizen data scientists of the world to see what can be extracted. The response has been incredible, with 355 kernels already published publicly. Here at Kainos, a few of our data scientists and AI engineers from our AI Practice decided to look at this data. Natural Language Processing (NLP) is one of our key areas of expertise, so this seemed like an ideal fit.  

The Kaggle competition has 10 associated tasks looking at a wide range of areas, from the genetics and evolution of the virus to the ethical and social science considerations. We decided to focus on two of those tasks that we believe have the potential to be useful and actionable by health authorities: firstly, what do we know about COVID-19 risk factors, and secondly, what has been published about medical care? Extracting something of value from this massive corpus of data about either of these areas could be very beneficial.  

Risk Factors: 

There are some clearly understood risk factors for COVID-19. We know that it affects the elderly far more severely. We know that immunosuppressed individuals are at much higher risk. It seems clear that children are at lower risk. However, globally we have seen exceptions. Is there anything that we can extract that gives us a better handle on the risk factors? 

To analyse this data, we started by clustering the documents. The objective was to group similar documents together and give us smaller, more focussed subsets of documents for analysis. We also used language detection models to filter out non-English papers, which would be problematic for analysis. We then carried out NLP analysis using keyword and key phrase extraction, which allowed us to split and prioritise the documents across thematic areas of interest.  
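As a minimal sketch of this clustering step, the snippet below uses TF-IDF features and k-means from scikit-learn. The toy abstracts, the choice of libraries and the cluster count are illustrative assumptions, not the exact pipeline the team used (and the language-detection filter is assumed to have run beforehand):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Four toy abstracts standing in for the 29,000-paper corpus,
# already filtered down to English-language text.
abstracts = [
    "Cardiovascular disease is a risk factor for severe coronavirus outcomes.",
    "Diabetes is a risk factor for severe coronavirus outcomes in patients.",
    "Influenza vaccination campaigns during seasonal influenza epidemics.",
    "Forecasting seasonal influenza epidemics with surveillance data.",
]

# TF-IDF turns each abstract into a weighted term vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(abstracts)

# Group similar documents; on the real corpus k would be tuned.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)
```

The two topical groups (risk factors vs. influenza epidemiology) should fall into different clusters, giving smaller subsets to analyse separately.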

After clustering and grouping the papers, we then used NLP techniques to extract the most important and relevant information using text summarisation and Named Entity Recognition (NER). NER enables us to identify key entities within text data using a range of predefined categories. There are a range of software libraries available for this kind of analysis, and we specifically used a Python library developed for analysis of medical and scientific text (SciSpaCy). We used these models to identify diseases across the coronavirus dataset and to highlight diseases similar to coronavirus, as well as other medical conditions which could result in an individual being more at risk of hospitalisation or death from coronavirus.
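SciSpaCy ships pretrained statistical NER models for biomedical text. As a dependency-free stand-in, the sketch below does dictionary matching against a small hand-picked disease list to show the shape of the output; this gazetteer approach is an assumption for illustration, not the learned models the team actually used:

```python
import re

# Hand-picked disease gazetteer standing in for SciSpaCy's
# statistical NER model (an illustrative simplification).
DISEASE_TERMS = ["cardiovascular disease", "diabetes",
                 "pulmonary disease", "lymphopenia"]

text = ("Patients with cardiovascular disease or diabetes showed worse "
        "outcomes, and lymphopenia was frequently reported on admission.")

# Emit (entity, label) pairs in the same shape a NER model would return.
entities = [(term, "DISEASE") for term in DISEASE_TERMS
            if re.search(re.escape(term), text.lower())]
print(entities)
```

Run over the whole corpus, tuples like these can then be aggregated to count how often each condition is mentioned and in how many documents.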


This analysis enables us to extract and collate the most common diseases and conditions mentioned in the overall dataset. As one would expect, the most mentioned diseases and symptoms relate directly to coronavirus. What we are most interested in, though, is what other conditions are mentioned. A range of diseases were detected, including cardiovascular disease, diabetes, pulmonary disease and lymphopenia, with cardiovascular disease being a very clear risk factor. Some of these can be seen in the output above. The diagram below shows a further analysis of the key conditions identified, the number of documents that mention them and how often each term is used. Obviously, the more documents that mention a condition, and the more commonly it is mentioned, the more likely it is to be relevant. 


Understanding the findings from these papers in relation to these diseases and medical conditions is a difficult task, but this is where the key interest lies. For example, we could consider the following questions: do we know if having diabetes poses a higher risk of hospitalisation? What research has been completed in this area? 

The biggest challenge we face here is not just in identifying the key terms, but also in trying to summarise the extracted data in a meaningful way. If we can summarise the key findings in the papers of interest, this can give real insight into the significance of risk factors.  

This is a perfect example of how AI can be used to improve how we do things. We are not aiming to replace scientists, researchers and doctors - we are simply creating tools that enable them to gather relevant information more efficiently. In the time of a pandemic, the more quickly and efficiently we can extract relevant and accurate data, the faster we can react and give guidance to healthcare workers and to patients.  

Medical Care: 

As of today, we don't have a vaccine for coronavirus or a cure for COVID-19. Some promising candidates that we know of are in clinical trials, but beyond this, in the vast corpus of literature that we have access to, is there anything that can help health professionals in caring for those with the disease, particularly those who may have some of the aforementioned risk factors? 

As with the identification of risk factors, it is clear that there is information in this corpus of data around medical care, but how do we find it?  How do we summarise it and how do we decide what is useful information and what is not? 

In the first instance, and because no publication on NLP is complete without a word cloud, we simply examined the most common words within the abstracts of these scientific papers.
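The word-frequency step behind a word cloud can be sketched with the standard library alone. The stop-word list and the three abstracts below are toy stand-ins, not the real corpus:

```python
import re
from collections import Counter

# Tiny stop-word list; in practice a fuller list (e.g. NLTK's) is used.
STOP_WORDS = {"the", "of", "in", "and", "a", "to", "with", "is", "for"}

abstracts = [
    "Clinical features of patients infected with the novel coronavirus.",
    "Epidemiology of the coronavirus outbreak in Wuhan.",
    "Treatment of patients with severe coronavirus pneumonia.",
]

# Tokenise, lower-case, drop stop words, then count what remains.
tokens = re.findall(r"[a-z]+", " ".join(abstracts).lower())
counts = Counter(t for t in tokens if t not in STOP_WORDS)
print(counts.most_common(3))
```

The resulting counts are exactly what a word-cloud library sizes its words by.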


Once we know what the most common words in the abstracts are, the next step is to find the relationships between the words in these documents. For the more technically minded, our approach here used Gensim's Word2Vec algorithm, which finds a vector representation of each word in the text. Words with similar meanings will have similar vectors. 

As an example, after training, our model can generate the most similar words related to 'flu':


One of the biggest problems we face in trying to extract the right information from a complex dataset like this is reducing it to a smaller, more manageable and more useful form. This process of dimensionality reduction creates a smaller dataset that retains the key information we need. Principal Component Analysis (PCA) is a technique for reducing dimensionality in a dataset and bringing out strong patterns. PCA has been applied to the word vectors of the words presented above in order to show how closely associated they are with each other.
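The projection itself can be sketched with scikit-learn's PCA. The random 50-dimensional vectors below are stand-ins for trained Word2Vec embeddings, so the coordinates are not meaningful; only the mechanics are:

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 50-d vectors standing in for trained word embeddings.
rng = np.random.default_rng(0)
words = ["flu", "h5n1", "h7n9", "h3n2", "h1n1"]
vectors = rng.normal(size=(len(words), 50))

# Project the 50-d vectors down to 2 components for plotting;
# nearby points in the projection indicate similar vectors.
coords = PCA(n_components=2).fit_transform(vectors)
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```

These 2-d coordinates are what the scatter plot above is built from.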


A PCA projection was carried out for the 9 words most like influenza with results presented above. We can make the following observations: 

  1. 'hpai', 'h5n1', 'h7n9', 'h3n2' sit within a cluster in the centre-left. Upon closer inspection 'h5n1' is also called 'Highly Pathogenic Asian Avian Influenza' or 'hpai'.  
  2. 'ph1n1', 'h1n1' and 'pdm09' form separate clusters. 
  3. The word 'flu' is separate to these clusters as it is a generic term. 

By making use of the same word vectors that drive the PCA graphs above, we can also provide a list of the most similar words to any keyword of choice. As there have been recent articles in the media where anti-malarial medication has been indicated as a potential treatment, 'anti-malarial' and 'hydroxychloroquine' have been included in the word similarity analysis:


It is interesting to note that 'anti-cov' ranks in the top 10 most similar words for both 'anti-malarial' and 'hydroxychloroquine', which is used to treat malaria.  

With respect to some of the drugs mentioned in the media, we have carried out a frequency count of these medications. 

Anti-malarial: 30 

Hydroxychloroquine: 16 

Lopinavir: 959 

Favipiravir: 48 

Remdesivir: 27 

Sarilumab: 1  

It is worth noting that all these drugs have been or are now in trials for treating COVID-19, with Favipiravir and Remdesivir both showing promise. Hydroxychloroquine, which has one of the lowest frequencies of mention, has been shown not to be beneficial.  
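A frequency count of this kind can be sketched as follows. The two paper snippets and the resulting counts are toy data, not the CORD-19 figures quoted above:

```python
import re
from collections import Counter

# Fixed list of drug names to scan for (a subset of those above).
DRUGS = ["hydroxychloroquine", "lopinavir", "favipiravir", "remdesivir"]

papers = [
    "Remdesivir and favipiravir are under evaluation in clinical trials.",
    "A trial of hydroxychloroquine showed no benefit; remdesivir differed.",
]

# Count every occurrence of each drug name across the corpus.
counts = Counter()
for text in papers:
    lowered = text.lower()
    for drug in DRUGS:
        counts[drug] += len(re.findall(drug, lowered))

print(counts["remdesivir"])  # 2
```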

After carrying out the analysis on keywords, we also wanted to investigate the relationships between the papers themselves. For this we used doc2vec rather than word2vec as this enables a broader comparison. We used this along with topic modelling (LDA) and clustering (kMeans) to visualise the similarity between the papers. This is shown below. 


As we can see above, while most of the papers sit within this larger cluster, there are some outliers. Taking a closer look at the common terms in these outliers, some of them are focused on epidemiology and on studying previous, influenza-like outbreaks. The topic modelling gave a better insight into which outbreaks were like COVID-19. It noted that coronavirus appeared as an important term in one of them, mentioned alongside MERS (Middle East Respiratory Syndrome). MERS is another disease caused by a coronavirus. Based on the NLP analysis of these papers, COVID-19 has a distinct similarity to MERS. MERS, SARS and COVID-19 are all caused by a form of coronavirus, but this analysis indicating a greater similarity to MERS (which is much less contagious) could be interesting for researchers looking at ways of treating the disease.

This work was done by a small team in a very short amount of time, but for me it really highlights the value of AI and NLP. To be able to contextualise and summarise documents in a way that is more easily accessed and utilised by key workers is a perfect example of how NLP can benefit healthcare. We consistently see that the best use cases for AI are those where we use it to improve efficiencies and remove unnecessary human labour, enabling people to focus on doing what they do best.

About the author

Austin Tanney