Update on BCG — COVID-19 Challenge. How Data Scientists Can Help with the Research
The global spread of COVID-19 has confronted data science and AI with new challenges. To be helpful in this difficult situation, many professionals in the field have started to self-organise in groups, share new ideas and develop them. One example is the ongoing BCG-COVID-19 AI challenge/hackathon that we organised. It aims to help analyse the hypothesis that the century-old tuberculosis (TB) vaccine called BCG could reduce COVID-19 mortality.
Finding a treatment for COVID-19 can take several years. Researchers using ecological study designs have identified a correlation between BCG vaccination and a lower negative impact of COVID-19. Correlation does not imply causation, however. Without enough evidence we should be careful, because premature enthusiasm could lead to excessive purchasing of the vaccine, which in turn could cause supply shortages, warns Prof. Madhukar Pai. The BCG vaccine is needed to protect children in low- and middle-income countries (LMICs) from childhood TB. The ecological studies of the relationship between the BCG vaccine and COVID-19 have used data from the BCG World Atlas, a project initiated by Prof. Pai. The Atlas is neither perfect nor complete, but it is the only available database of its kind. The hypothesis can be confirmed only by clinical trials; approximately 30 of them have already started, but the results will not be available for many months.
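To make the "correlation does not imply causation" caveat concrete, here is a minimal sketch of how a confounder can produce a pooled correlation that vanishes within groups. Every number in it is invented purely to show the mechanism and says nothing about real countries; the income group as confounder, the row layout and the names `pearson`, `correlation` and `TOY_COUNTRIES` are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented toy rows, NOT real data:
# (income_group, has_bcg_policy, deaths_per_million)
TOY_COUNTRIES = [
    ("lower", 1, 8), ("lower", 1, 10), ("lower", 1, 12), ("lower", 0, 10),
    ("higher", 0, 98), ("higher", 0, 100), ("higher", 0, 102), ("higher", 1, 100),
]


def correlation(rows):
    """Correlation between BCG policy and mortality for the given rows."""
    return pearson([r[1] for r in rows], [r[2] for r in rows])


# Pooled over all toy countries, BCG policy and mortality look related ...
pooled = correlation(TOY_COUNTRIES)

# ... but within each income group the relationship disappears.
within = {
    group: correlation([r for r in TOY_COUNTRIES if r[0] == group])
    for group in ("lower", "higher")
}

print(f"pooled correlation: {pooled:.2f}")
for group, value in within.items():
    print(f"within {group}-income group: {value:.2f}")
```

In this toy set the pooled correlation is clearly negative, yet it is zero inside each income group, because the invented income group drives both the vaccination policy and the mortality figure. Real ecological data can hide the same effect behind many more variables, which is one reason only clinical trials can settle the question.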
After the first papers on the BCG vaccine hypothesis appeared, Clinical Research Physician Tsvetan Biyukov asked me whether I could do some analysis with Machine Learning. In response I came up with the idea of a hackathon. His expertise would be valuable on the panel of judges, so I invited him. After he agreed, I asked Adrian Wright, CEO of Estafet, the consultancy I work for, whether he would support such a hackathon, and his answer was positive as well. Through my work on an interesting project at Elsevier, a Fair Data and Data Science platform called Entellect, I knew about an ongoing initiative by Elsevier employees who volunteer on COVID-19 analysis. After I shared the hackathon idea with a couple of them, Anita de Waard (VP Research Collaborations at Elsevier) supported it and offered to work with us on the initiative. Her initial idea was to contact scientists and ask whether we could help them. She contacted Prof. Pai, who warned that we should be very careful not to create hype around the BCG vaccine before the clinical trial results are ready. He connected us with Alice Zwerling, Assistant Professor at the University of Ottawa, who is currently responsible for the BCG World Atlas, and she became one of the organisers of the hackathon.
Our first concerns were: what data would be useful for the analysis, in what format it should be structured, which scientists should be contacted, where the hackathon would run, how we could find data scientists to work on the tasks, who the judges would be, and where we could find the money for the awards.
In preparation for the first task we created a backlog of all the tasks we could think of and started working on them with the help of volunteers. We contacted Jun Sato (a businessman and author of a popular blog about BCG and COVID-19) and asked him whether we could work together. We created a website where anyone willing to help can contribute by gathering data. We started collecting information in spreadsheets such as “BCG Vaccine Policy Data Sources per Country”, “BCG Strain”, “BCG-COVID-19 clinical trials” and “BCG-COVID-19 Scientific Papers”.
One of Estafet's developers built a web scraper that automatically searches the internet for information about BCG vaccine policies in more than 200 countries and extracts it. Another developer wrote a small tool that automatically translates non-English texts into English with the help of the Google Translate API. Some of the volunteers from the Timeherous platform and Estafet helped us with manual search and review of the content. Other developers kindly contributed by extracting, cleaning and unifying useful information such as COVID-19 test, case and mortality statistics, as well as relevant country-level information (income group, TB incidence, lockdown and quarantine measures, population, ethnicity, etc.). Apart from Alice Zwerling and Tsvetan, we asked two more domain experts whether they would be willing to act as judges, and they agreed not only to judge but also to help with the organisation. The first is Preslav Nakov, Principal Scientist at the “Qatar Computing Research Institute, HBKU”, who is highly experienced in NLP and Data Science. The second is Tim Miller, VP of Elsevier Life Sciences Platform Solutions, who leads the Entellect platform project.
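The cleaning and unifying step mentioned above could look roughly like the sketch below. It is not the code the volunteers wrote; it only illustrates one way to merge per-country rows from several sources into a single record per country. The field names (`iso3`, `bcg_policy`, `deaths`, `population`), the helper names and the rows themselves are hypothetical placeholders.

```python
from collections import defaultdict


def unify_by_country(*sources):
    """Merge per-country rows from several sources, keyed by country code.

    For each field, the first non-empty value encountered wins. A country
    appears in the result only if it has at least one non-empty field.
    """
    merged = defaultdict(dict)
    for rows in sources:
        for row in rows:
            code = row["iso3"].strip().upper()  # normalise the join key
            for field, value in row.items():
                if field == "iso3" or value in (None, ""):
                    continue  # skip the key itself and empty cells
                merged[code].setdefault(field, value)
    return dict(merged)


def deaths_per_million(record):
    """Return deaths per million inhabitants, or None if inputs are missing."""
    deaths = record.get("deaths")
    population = record.get("population")
    if deaths is None or not population:
        return None
    return deaths * 1_000_000 / population


# Placeholder rows with made-up codes and values, NOT real data.
policy_rows = [
    {"iso3": "aaa ", "bcg_policy": "current"},
    {"iso3": "BBB", "bcg_policy": ""},
]
covid_rows = [
    {"iso3": "AAA", "deaths": 50, "population": 1_000_000},
    {"iso3": "BBB", "deaths": 10, "population": None},
]

unified = unify_by_country(policy_rows, covid_rows)
for code, record in sorted(unified.items()):
    print(code, record, deaths_per_million(record))
```

Normalising the country code before merging matters in practice, because sources collected by different people rarely agree on casing or whitespace. Leaving missing values out, rather than filling them with zeros, keeps gaps visible to whoever analyses the data later.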
Elsevier and Estafet confirmed that they would contribute to the awards. We selected Kaggle for the hackathon because it has a huge community of data scientists and we can run it there for free. Finally, we finalised the text and published two hackathon tasks on Kaggle (https://www.kaggle.com/bcgvaccine/hackathon). The first is to augment the BCG Atlas data with the help of Natural Language Processing (NLP) and Text Mining (Question Answering, Machine Reading). The second is for data scientists to look for new insights that could inform the BCG-COVID-19 clinical trials.
The BCG Atlas contains vaccine-related information for approximately 180 countries. With the help of the data scientists who participated in the first task, the data grew by approximately 12%. The new data is about to be published on bcgatlas.org and will be useful not only for the second task of the hackathon but also for tuberculosis researchers. Although the first task is over, anyone who wants to submit more notebooks is more than welcome. Beyond that, anyone with Python knowledge could help automate the process further, so that in the future it can run as a fully automated pipeline with minimal manual interaction.
So far, 11 participants have submitted more than 100 notebooks to the hackathon. The deadline for the second task is the end of October, and every data scientist or researcher willing to work on it could play a key role in the research. Our hope is that they can help discover useful information that could be really beneficial for the BCG-COVID-19 clinical trials. If the BCG hypothesis looks promising, the analysis may show whether factors such as the strain of BCG, the age at which people were vaccinated, revaccination, or the time elapsed since vaccination are important. What motivated us to develop this project was the wish to use our skills to help analyse whether an existing, well-known and tested vaccine could reduce COVID-19 mortality. Hopefully this will inspire even more data scientists to join us in this research.