
Machine Learning Solutions to Infectious Disease Early Detection, Warning, Response and Control


Given the relatively sparse health-related data available in the Global South, AI offers a distinctive
solution by making use of a variety of novel or underexplored data sources for public health
surveillance, especially those not originally or intentionally designed to answer
epidemiological questions. For instance, with the rapid development of the Internet and Internet
of Things applications, ubiquitous social and device sensing capabilities are becoming a reality,
presenting significant surveillance potential. A variety of open data, external to traditional public
health surveillance systems, can be fruitfully exploited to enhance surveillance capabilities.

Statistical methods focus on conclusions at the macrolevel, whereas machine learning methods
enable customized inferences aimed at characterizing local patterns.

Case Study 1: Use of SVM and RF for an early warning system for mosquito-borne
infectious diseases
Support vector machine (SVM), gradient boosting machine, and random forest (RF) models were applied
to simulate the global distribution of Aedes aegypti and Aedes albopictus in order to combat
mosquito-borne infectious diseases such as Zika virus (ZIKV) disease, dengue, and chikungunya.

What is Support Vector Machine (SVM)?

The main idea behind SVM is to find a decision boundary, or hyperplane, that separates the data
into different classes or labels. SVM seeks the hyperplane that maximally separates the different
classes by maximizing the margin, which is the distance between the decision boundary and the
closest data points from each class.

SVMs are particularly useful when the data points are not linearly separable, which is common in
many real-world problems.

What is Random Forest (RF)?

A random forest is an ensemble machine learning model composed of multiple decision trees. The
"forest" is made up of a large number of individual decision trees, each of which is trained on a
random subset of the data. The idea behind a random forest is that each decision tree will make
predictions that are slightly different from the others, due to the random subset of data that it
was trained on. When the model is used to make predictions, the predictions from each of the
decision trees are combined (e.g., by taking the average or mode) to make a final prediction.

Its advantages over single decision trees are that it is less prone to over-fitting, is typically
more accurate, and can be used to estimate feature importance.
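The random forest idea described above can be illustrated with a minimal sketch. It is not the model used in the study: each "tree" here is a one-split decision stump, the data are a made-up one-dimensional toy set (a hypothetical temperature-like feature with absence/presence labels 0 and 1), and all function names are illustrative.

```python
import random
from collections import Counter

def train_stump(sample):
    """Fit a one-split decision tree (a 'stump') on (x, label) pairs."""
    xs = sorted({x for x, _ in sample})
    majority = Counter(y for _, y in sample).most_common(1)[0][0]
    # best = (training errors, threshold, label left of split, label right of split)
    best = (len(sample) + 1, None, majority, majority)
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = Counter(y for x, y in sample if x <= t).most_common(1)[0][0]
        right = Counter(y for x, y in sample if x > t).most_common(1)[0][0]
        errors = sum((left if x <= t else right) != y for x, y in sample)
        if errors < best[0]:
            best = (errors, t, left, right)
    return best[1:]

def predict_stump(stump, x):
    t, left, right = stump
    if t is None or x <= t:
        return left
    return right

def train_forest(data, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (random draw with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]

def predict_forest(forest, x):
    """Combine the individual predictions by taking the mode (majority vote)."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: low feature values -> absent (0), high values -> present (1).
data = [(float(v), 0) for v in range(0, 10)] + [(float(v), 1) for v in range(20, 30)]
forest = train_forest(data)
```

Calling `predict_forest(forest, 35.0)` returns the majority vote of the 25 stumps. A real random forest would also use deeper trees and a random subset of features at each split.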

Multidisciplinary datasets, such as occurrence records, social factors, and meteorological factors,
were quantified to train the models [1].

Climatic, environmental and social factors were all considered in the model, which reflects the
One Health approach emphasized within the grant. The data points measured within these three
categories, along with their sources, were:

Climatic conditions:

Climatic and related factors that contribute to vector-borne diseases (VBDs) include:

• Temperature: Many vectors, such as mosquitoes, require warm temperatures in order to
breed and thrive. Outbreaks of vector-borne diseases are often more common in tropical
and subtropical regions.
• Rainfall: Adequate rainfall is necessary for many vectors to breed. Mosquitoes, for example,
require standing water in which to lay their eggs. Areas that experience heavy rainfall or
flooding can be at a higher risk for vector-borne disease outbreaks.
• Humidity: High humidity can be beneficial for some vectors, as it can provide a favorable
environment for their development and reproduction.
• Altitude: Some vector-borne diseases are more common in low-lying areas, as the vectors
that transmit them are more common at lower altitudes.
• Urbanization: Urbanization and population growth can lead to an increase in vector-borne
diseases, as they can create more favorable conditions for vectors to breed and thrive.
Urbanization can also lead to an increase in travel and movement of people, which can
facilitate the spread of vector-borne diseases to new areas.

Precipitation [2]
No satellite yet exists that can reliably identify rainfall and accurately estimate the rainfall
rate in all circumstances. Satellites can see from above the clouds that we see from below, but
cloud presence is not a good indicator of rainfall. Not all clouds produce rain, and rainfall
intensity varies from place to place beneath those clouds that are generating rain. Using a
variety of sensors, it is possible to distinguish raining clouds from non-raining clouds by
estimating:

• Cloud-top temperatures: deep convective clouds have cold, high tops, so areas of
deep convection show up as low temperatures.
• Cloud thickness: rather than using the temperature of the cloud top as a proxy for the
intensity of deep convection, the amount of water and ice in the cloud can be
estimated by measuring the amount of scattered microwave radiation. These methods
offer a more accurate rainfall estimate, but have coarse spatial resolution and are
updated only twice a day.

[1] https://sci-hub.se/https://doi.org/10.1016/j.actatropica.2017.11.020
[2] https://idpjournal.biomedcentral.com/articles/10.1186/s40249-018-0501-9

Monitoring products that can provide the required precipitation datasets are:
• Global Precipitation Climatology Project (GPCP)
• Climate Prediction Center (CPC) Merged Analysis of Precipitation (CMAP)
• CPC MORPHing technique (CMORPH)
• Tropical Rainfall Measuring Mission (TRMM)
• Global Precipitation Measurement (GPM)
• Enhancing National Climate Services (ENACTS)
• Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS)

Temperature [3]

Temperature plays an important role in the growth of Aedes mosquitoes, affecting key physiological
processes in these vectors, including adult female survivorship and the length of the first
gonotrophic cycle.

An annual cumulative precipitation layer derived from a monthly mean precipitation dataset was
chosen as one of the input data layers in the study. Global climate datasets were obtained
from the WorldClim database version 2.0.

These datasets included monthly maximum temperature, monthly minimum temperature and
monthly mean precipitation during the 1970–2000 period.
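As a rough illustration of how an annual cumulative layer can be derived from monthly layers, the sketch below sums twelve monthly grids cell by cell. It assumes each monthly layer holds that month's long-term mean precipitation total in millimetres on the same grid; the grid values and the function name are hypothetical, and real rasters would normally be handled with a GIS or raster library rather than nested lists.

```python
def annual_cumulative(monthly_layers):
    """Sum twelve monthly grids cell by cell into one annual grid."""
    if len(monthly_layers) != 12:
        raise ValueError("expected one layer per calendar month")
    rows = len(monthly_layers[0])
    cols = len(monthly_layers[0][0])
    return [
        [sum(layer[r][c] for layer in monthly_layers) for c in range(cols)]
        for r in range(rows)
    ]

# Hypothetical 2 x 2 grid (mm): the top row varies by month, the bottom row is constant.
monthly = [[[m, 2 * m], [0, 10]] for m in range(1, 13)]
annual = annual_cumulative(monthly)
```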

Air temperature is commonly obtained from synoptic measurements at weather stations, taken at a
height of 2 m. However, weather stations are sparsely distributed in low- and middle-income
countries (LMICs).

Near-surface air temperature (Ta) is particularly useful for health applications, but it is not
straightforward to collect.

For temperature-based data, the following dataset is recommended:

The MODIS land-surface temperature (LST) product provides land-surface temperature estimates. More
detail here: https://idpjournal.biomedcentral.com/articles/10.1186/s40249-018-0501-9

Water bodies:

Using LANDSAT images at 30-m spatial resolution, it is possible to map small water bodies where
mosquitoes will breed and transmit diseases such as malaria, dengue fever, and chikungunya.

In the resulting images, water bodies can be mapped in blue, vegetation in green, and bare soils
in brown.

[3] https://idpjournal.biomedcentral.com/articles/10.1186/s40249-018-0501-9
Practitioners can access data on water bodies through the following sources:

• Terra MODIS middle-infrared, near-infrared, and red reflectances
• LANDSAT middle-infrared, near-infrared, and red reflectances
• Inundation fraction products, which are available for daily, 6-day, and 10-day periods for the
entire globe at 25-km spatial resolution

More details at: https://idpjournal.biomedcentral.com/articles/10.1186/s40249-018-0501-9

Environmental conditions:

There is evidence that the abundance of reproductive mosquitoes is closely related to vegetation
canopy greenness and relative humidity (Nihei et al., 2014; Thu et al., 1998). The vegetation canopy
can protect mosquito habitats from direct sunlight, and relative humidity reflects the moisture
content necessary for mosquito survival.

Radiometers can be used to monitor vegetation by sensing the radiation that it reflects and emits,
including infrared radiation. The amount of radiation measured depends on several factors,
including the water content and the health of the plants. From these measurements, radiometers
can provide information about the water content, productivity and health status of vegetation.

The Normalized Difference Vegetation Index (NDVI) quantifies vegetation by measuring the difference
between near-infrared light (which vegetation strongly reflects) and red light (which vegetation
absorbs).
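The standard NDVI formula is (NIR - Red) / (NIR + Red), which yields values between -1 and 1 for non-negative reflectances. The sketch below applies it to single reflectance values; the function name and the sample reflectances are illustrative, and returning 0.0 when both bands are zero is an assumption made here to avoid division by zero.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    total = nir + red
    if total == 0:
        return 0.0  # assumption: treat a cell with no signal as neutral
    return (nir - red) / total

# Hypothetical reflectances: dense vegetation reflects much more NIR than red.
dense_vegetation = ndvi(nir=0.5, red=0.1)
bare_surface = ndvi(nir=0.3, red=0.3)
```

Higher values indicate greener, denser vegetation; values near zero indicate little difference between the two bands.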

An Advanced Very High Resolution Radiometer (AVHRR) NDVI dataset developed by the Global
Inventory Modeling and Mapping Studies (GIMMS) group (http://glcf.umd.edu/) was used, with an
8 × 8 km spatial resolution and a 15-day temporal resolution.

Using this dataset, the mean annual NDVI layer was calculated.

A global mean annual relative humidity dataset, obtained from the NASA Surface Meteorology
and Solar Energy project (https://eosweb.larc.nasa.gov/), was converted from a shapefile to a
raster layer and was also used as input data.

Social factors:

The study used global urban region, population density and nighttime light layers to characterize
the temporal and geographic variation in human habitat.

The global urban region layer came from the Global Urban Heat Island dataset, downloaded from
the NASA Socioeconomic Data and Applications Center (SEDAC), which also provided the UN-Adjusted
Population Density dataset.

The nighttime light dataset was obtained from the NOAA Earth Observation Group.

There is a well-established link between international travel and trade routes and mosquito
expansion. Global datasets of human movement are often unavailable free of charge. However, a
1 × 1 km spatial resolution global urban accessibility dataset is freely available from the
European Commission Joint Research Centre website.

Technical flowchart of model:


Case Study 2: Assessing the risk of dengue transmission in Singapore with dengue, population,
entomological and environmental data [4]
The data points are illustrated in the framework below:

Data points used:

• Dengue cases
• Population density
• Breeding percentage
• Vegetation index, also known as the Normalized Difference Vegetation Index (NDVI)
• Dengue clusters: Dengue cases are clustered for vector operations purposes based on their
geographical and temporal proximity. A dengue cluster is formed when two or more cases
are located within a 150-meter radius and have onsets of illness within a 14-day period.
Dengue clusters are generated using the Geographical Information System (GIS), and
information such as transmission duration, serotypes detected and the number of dengue
cases is recorded for every cluster.

[4] https://sci-hub.se/https://doi.org/10.1371/journal.pntd.0006587
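The dengue cluster rule described in the list above (two or more cases within 150 meters and 14 days of each other) can be sketched as follows. This is an illustration, not the GIS procedure used in Singapore: it assumes case locations are already in projected coordinates measured in meters, treats both limits as inclusive, and chains linked cases into one cluster, which is an assumption about how overlapping pairs are merged. The case data are made up.

```python
from math import hypot

def dengue_clusters(cases, radius_m=150.0, window_days=14):
    """Group cases (x_m, y_m, onset_day) that are linked in space and time.

    Returns lists of case indices; only groups of two or more cases are kept.
    """
    parent = list(range(len(cases)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (xi, yi, di) in enumerate(cases):
        for j in range(i + 1, len(cases)):
            xj, yj, dj = cases[j]
            close = hypot(xi - xj, yi - yj) <= radius_m
            recent = abs(di - dj) <= window_days
            if close and recent:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(cases)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) >= 2)

# Hypothetical cases: three nearby cases within two weeks, one distant case,
# and one nearby case whose onset is too late to join the cluster.
cases = [(0, 0, 1), (100, 0, 5), (120, 50, 12), (1000, 1000, 3), (90, 10, 40)]
clusters = dengue_clusters(cases)
```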

Case Study 3: Neural networks and an online extreme learning machine (OLEM) estimated the
distribution of the kinds of water containers holding Aedes mosquito larvae in Recife, Brazil [5]

Data points used: nine years of environmental and entomological data, which were used to train
the OLEM model.

The model predicted the number of water containers contaminated with larvae. However, it only
disaggregates data to district level, not household level, and it did not consider factors such
as NBVE.

Case Study 4: A deep AlexNet model was trained on sea surface temperature images and
rainfall data by transfer learning to examine emerging spatiotemporal hotspots of dengue
fever at the township level in Taiwan
What is transfer learning?

Transfer learning is a research problem in machine learning that focuses on storing knowledge
gained while solving one problem and applying it to a different but related problem.

This transfer-learning-based method overcame the overfitting problem caused by the small dataset
and yielded a reported accuracy of 100% on an eightfold cross-validation test dataset.

Data points used: sea surface temperature data, rainfall data, temperature data

A climate-based model was used to predict outbreaks, thereby applying a One Health approach.

[5] https://sci-hub.se/https://doi.org/10.1145/3357729.3357738
ANALYZING DATA FROM CYBERSPACE USING MACHINE LEARNING
There are primarily four approaches:

• Keyword-based approaches: Most earlier studies rely on keyword analysis, such as word
occurrences and word frequency. These methods disregard context, grammar and even word
order, so they cannot sufficiently capture the complex linguistic characteristics of words.
• Learning-based approaches: These approaches, which require labeled data for training, have
been intensively studied during the past decade. Naive Bayes, k-nearest neighbors (KNN),
maximum entropy, and support vector machines (SVM) have been applied to many health
classification problems and achieved satisfactory results. However, learning-based approaches
are limited by the need to label training datasets, which requires experts to read the tweets
and ascertain the category to which they belong. For large-scale Twitter data, manually
labeling the training tweets is difficult.
• Lexicon-based approaches: The other direction is knowledge-based, also called the
dictionary method. It is considered to be a form of unsupervised learning.
• Word embedding based approach: Unlike the other approaches used in public health, this
approach uses word embedding based clustering. Word embedding has been one of the strongest
trends in Natural Language Processing. It learns a continuous vector representation of
words from their context words, and the vectors can represent the semantic information of
words. A tweet can be represented as a few vectors and divided into clusters of similar
words. According to similarity measures of all the clusters, the tweet can then be
classified as related or unrelated to a topic (e.g., influenza). The approach is unsupervised
and does not require annotated data.
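The limitation of keyword-based approaches noted in the list above can be seen in a minimal sketch. The keyword set, the example sentences, and the function name are all hypothetical; the point is only that counting keyword occurrences gives the same score to a genuine health report and to an unrelated sentence that happens to contain the same words.

```python
import re
from collections import Counter

# Hypothetical keyword list for illustration only.
KEYWORDS = {"flu", "fever", "dengue"}

def keyword_score(text):
    """Count keyword occurrences, ignoring context, grammar and word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return sum(counts[word] for word in KEYWORDS)

health_report = "I have a fever and the flu"
unrelated = "Bieber fever: I do not have the flu"
shuffled = "flu the and fever a have I"

scores = [keyword_score(t) for t in (health_report, unrelated, shuffled)]
```

All three sentences receive the same score, which is why the learning-based and word embedding based approaches try to take more of the surrounding language into account.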

Case Study 5: A social media-based early warning system for mosquito-borne disease in
India was proposed [6]
Data were collected from two sources: Twitter and news articles.

• The Twitter API was used to fetch data using relevant keywords and medical science terms.
The keyword collection methodology is based on Jain & Kumar and yields dynamic keywords
that are popular on a specific day and related to general trends and public sentiment.

Some of the trending keywords are: {#Dengue, #Chikungunya, #Zika, #DengueFever, #yellow fever,
#Dengue virus, #Zika virus, #Flu, #Swine Flu, #Fluvirus}

• The basic idea was that tweets are represented as random mixtures of words related to
symptoms, fear, prevention, and care. For this task, three main classes (symptoms, fear, and
prevention) were created using bags of words taken from multiple sources.
• The dataset was first divided into two basic classes, disease-related tweets and
irrelevant tweets, using Support Vector Machine (SVM) and Naive Bayes (NB) classifiers. Then,
fine-grained classification of the relevant tweets into the three main classes (symptoms,
fear, and prevention) was performed using SVM and NB.

[6] https://sci-hub.se/https://doi.org/10.1016/j.jocs.2017.07.003
What is Naïve Bayes?

Naive Bayes is a type of probabilistic machine learning algorithm based on Bayes' theorem, which is
a statistical rule that describes how to calculate conditional probabilities. Naive Bayes algorithms are
called "naive" because they make a strong assumption about the independence of the features,
which is not always true in real-world data.

Naive Bayes algorithms are typically used for classification tasks, where the goal is to assign a label
(or class) to a given input based on certain features.
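To make the independence assumption concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a generic textbook formulation rather than the classifier used in any of the case studies; the training sentences, labels, and function names are made up, and ignoring words never seen in training is a simplifying choice.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs is a list of (tokens, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_logp = None, -math.inf
    for label, n_docs in class_counts.items():
        logp = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok in vocab:  # words unseen in training are ignored
                # Each word contributes independently: the "naive" assumption.
                logp += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Hypothetical labeled tweets.
training = [
    ("fever headache dengue".split(), "relevant"),
    ("dengue outbreak fever".split(), "relevant"),
    ("football match tonight".split(), "irrelevant"),
    ("great match today".split(), "irrelevant"),
]
model = train_nb(training)
```

With this toy model, `predict_nb(model, "dengue fever today".split())` selects "relevant" because the two disease words outweigh the single word seen only in irrelevant tweets.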
Case Study 6: Using social media to observe movement of Internet users
DEFENDER is a software system developed in the United Kingdom that integrates Twitter and news
media for outbreak detection [7]. SVM and naive Bayes classifiers were used for disease-related
text classification. The DBSCAN algorithm was used to cluster the geographic space and observe
the movement behavior of Internet users [7].
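For readers unfamiliar with DBSCAN, the sketch below is a compact plain-Python version of the standard algorithm: points with at least `min_pts` neighbours within distance `eps` seed a cluster, clusters grow through such dense points, and isolated points are labeled as noise. It is not DEFENDER's implementation; the coordinates are hypothetical planar positions and the parameter values are chosen only for the example.

```python
from math import hypot

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if hypot(xi - xj, yi - yj) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:
                queue.extend(nearby)  # core point: keep growing the cluster
    return labels

# Hypothetical user locations: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

Unlike k-means, DBSCAN does not need the number of clusters in advance, which suits geographic data where the number of hotspots is unknown.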

[7] https://www.sciencedirect.com/science/article/pii/B9780128212592000223#bib6
Case Study 7: Word embedding for public health monitoring [8]

In the preprocessing stage, the system was able to identify the following characters within the tweet:

[8] https://sci-hub.se/10.1109/SECON.2017.7925400
Case Study 8: A model to detect the magnitude of unexpected changes in terms of usage with
spatiotemporal patterns from social media data streams [9]
This work has direct public health relevance and can be used for health events represented by
relatively infrequent terms.

However, applications in public health surveillance have typically relied on aggregating large static
datasets rather than implementing methods that can be used in real time surveillance, which has
limited the ability to operationalize the methods in public health practice.

The proposed event detection method's architecture:

1) The model developed an event detection framework to process tweets as they arrive in a
non-stop, continuous, and streaming fashion. Information about spatial and temporal features
extracted or estimated from the metadata of Twitter posts (tweets) is used directly in the
construction of a spatiotemporal lattice, which then stores other features that are
hierarchically aggregated to characterise events that occur over periods of time and are common
across multiple cities or countries.

2) The four main modules represent a novel combination or extension of existing methods. The
preprocessing module constructs a set of low-level features from incoming tweets by estimating the
locations of the users posting tweets where possible, and adjusting their timestamps to match a
local hour of the day. The feature construction module aggregates tweets to construct a set of
high-level features by computing aggregate statistics of signatures of selected terms or entire
tweets. The estimation module uses a regression model to estimate the expected output value
distribution with respect to spatiotemporal dimensions of tweets. The event detection module then
maps the features into a lattice structure, applies statistical tests to compare expected and
observed value distributions, and extracts a ranked list of events from the lattice based on their
severity or importance.

3) The generality of the approach is due to the use of a detection function that can take a range of
different forms but still work effectively within the framework to robustly detect events at multiple
spatial and temporal granularities. As a consequence, the approach can be used for applications in
which users may prespecify individual or multiple terms of interest, or embed a function that
transforms all tweets into a score to detect unspecified events.

4) The approach was illustrated for three different types of event detection. Experiments
demonstrate how the detection functions can generalise to detect events across a range of
application domains.

[9] https://sci-hub.se/10.1109/TBDATA.2019.2948594
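The general flow described in the numbered steps above (shift timestamps to local time, aggregate tweets into spatiotemporal cells, compare observed with expected counts, rank the results) can be sketched in a few lines. This is a loose illustration, not the paper's method: the cells are simply (city, local hour) pairs, the expected counts are supplied as a hypothetical dictionary rather than estimated by a regression model, and the score is an assumed Poisson-style standardized residual standing in for the paper's detection function and statistical tests.

```python
from collections import Counter
from math import sqrt

def build_lattice(tweets):
    """Count tweets per (city, local hour) cell.

    Each tweet is (city, utc_hour, utc_offset_hours); the offset shifts the
    timestamp to a local hour of the day.
    """
    return Counter((city, (utc_hour + offset) % 24)
                   for city, utc_hour, offset in tweets)

def rank_events(observed, expected, min_score=3.0):
    """Return cells whose observed count exceeds expectation, most severe first."""
    scored = []
    for cell, count in observed.items():
        mean = expected.get(cell, 0.0)
        if mean <= 0:
            continue  # no baseline for this cell, so it is not scored
        score = (count - mean) / sqrt(mean)  # assumed Poisson-style residual
        if score >= min_score:
            scored.append((score, cell))
    return [cell for score, cell in sorted(scored, reverse=True)]

# Hypothetical stream: tweets mentioning a term of interest, with made-up baselines.
tweets = ([("London", 9, 0)] * 30
          + [("Sydney", 22, 10)] * 12
          + [("Lagos", 12, 1)] * 17)
expected = {("London", 9): 9.0, ("Sydney", 8): 4.0, ("Lagos", 13): 16.0}
events = rank_events(build_lattice(tweets), expected)
```

In this toy example the London and Sydney cells are flagged, in that order, while the Lagos cell stays within its expected range.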
