Introduction
Safeguarding our body's well-being should stand as a top priority in our pursuit of good health.
To achieve this, it is crucial to understand how vulnerable we are to bacteria, including the
bacteria that cause tuberculosis. According to World Life Expectancy (2023), based on the latest
data from the WHO, tuberculosis claimed the lives of 127,335 individuals in Nigeria during
2020, constituting approximately 8.60% of all reported deaths. On a global scale, Nigeria holds
the sixth position with a death rate of 99.13 per 100,000 people, after adjusting for age-related
factors. According to the Centers for Disease Control and Prevention (2023), tuberculosis (TB)
claims 1.6 million lives annually and remains among the most lethal infectious diseases globally.
According to Bakula et al. (2022), Nigeria ranked among the top three countries in 2019, with an
estimated 440,000 new TB infections and 154,000 TB-related fatalities. Nigeria further stands out
with rates of approximately 11 and 23 per 100,000 individuals, among the most substantial
prevalence rates for MDR/RR TB and TB/HIV coinfection worldwide. According to Poonam (2023),
tuberculosis (TB) is a perilous infection that typically affects the lungs. It can also spread
to other bodily regions such as the brain and spine. The infection is caused by the bacterium
Mycobacterium tuberculosis. This specific bacterium is believed to have existed for over
three million years. Even ancient Greece and Rome had knowledge of this ailment. At the onset
of the 1900s, tuberculosis, formerly known as consumption, stood as the primary reason for
mortality in the US. Despite the effective management of tuberculosis (TB) today, over a million
people still succumb to it annually worldwide. Mycobacterium tuberculosis is a medically and
scientifically significant bacterium due to its role in causing tuberculosis and its impact on global
public health. Researchers and healthcare professionals continue to study and combat this bacterium.
Researchers use various approaches, including molecular biology, genomics, and bioinformatics,
to study the bacterium's genetics, drug resistance mechanisms, and interactions with the human
immune system. According to Leverage Edu (2023), bioinformatics finds its applications in medicine across a
spectrum, spanning from drug and gene exploration to safeguarding against diseases. The
individual's genetic makeup. This amalgamation of computer science, statistics, and biology
spreads its influence across diverse domains, encompassing agricultural progress and veterinary
investigation. In this study, our contribution will extend to advancing the health sector through
the utilization of Machine Learning and bioinformatics in the realm of drug discovery. The
fusion of bioinformatics and Artificial Intelligence (AI), particularly Machine Learning (ML),
holds great promise in revolutionizing the healthcare landscape. This synergistic combination
enables the extraction of meaningful insights from vast and complex biological data, ultimately
leading to more accurate diagnoses, effective treatments, and enhanced personalized care.
1.1 Aims and Objectives
This project aims to employ bioinformatics and machine learning techniques to expedite the
discovery of novel drug candidates targeting Mycobacterium tuberculosis, with the overall goal
This study aims to achieve a multifaceted set of objectives: delving into the existing literature on
thorough comparative analysis of a minimum of three regression algorithms to identify the most
suitable solution for aiding drug discovery; implementing the selected algorithm that best aligns
with the specific problem; and rigorously evaluating the model's performance and results.
In the world of medical research and treatments, tuberculosis (TB) caused by Mycobacterium
tuberculosis is still a big problem. Some types of TB bacteria have become resistant to drugs,
making it urgent to find new ways to treat it. The usual methods of finding new drugs are slow
and cost a lot, which means we don't have many good drugs available. That's why it's really
important to find new, smarter ways to use computers and data to discover drugs that can fight
Mycobacterium tuberculosis better. The TB bacteria can change a lot, and this makes it hard to
find drugs that work against all of them. Trying to find the right targets for drugs and making
compounds that can attack different types of bacteria is tough. When scientists try to make new
drugs the old way, it takes a lot of time and money to test a large number of compounds in the
lab. This is a big challenge because it costs a lot and takes a long time. The way potential drugs
interact with the TB bacteria is very complicated. It's like trying to figure out how puzzle pieces
fit together, and the bacteria have many different shapes. This makes it hard to make computer
models that can predict how the drugs and the bacteria will interact. To deal with these problems,
scientists are turning to computers and ML techniques to speed up the process of
finding new drugs that can work against different kinds of TB bacteria. By using computers to
make predictions about which compounds might be the best, scientists can save time and money.
The scope of this project is to utilize bioinformatics and machine learning (ML) techniques to
determine the effectiveness of drugs against Mycobacterium tuberculosis. The primary goal is to
enhance our understanding of the drug discovery process targeted at inhibiting the bacterium's
growth.
The significance of this study resides in its potential to reshape the course of TB drug discovery.
pressing requirement for improved TB treatments. This study bridges the gap between traditional
bioinformatics and machine learning paves the way for interdisciplinary collaboration and
we gain the ability to swiftly identify potential drugs with the right potency for treating
Tuberculosis. This study is of paramount importance given the urgency for enhanced TB
treatments, particularly against drug-resistant strains. The methods used can also help find
treatments for other diseases and make the whole process of finding new drugs better and
faster. Ultimately, it's about improving healthcare and finding cures more quickly.
To acquire insights into Mycobacterium tuberculosis, regression algorithms, and machine learning,
the existing literature will be reviewed. Regression methods will then undergo comparison to
identify the most suitable approach for drug discovery challenges. The dataset for constructing
the model will be sourced from publicly available online repositories like the ChEMBL database.
The chosen model will be implemented
using Python and its associated machine-learning libraries. An in-depth evaluation of the model's
performance will be conducted. Subsequently, the predictive model will be seamlessly integrated
for user interaction within a web application developed using Streamlit—a dedicated Python
package designed for crafting machine learning applications. The entire workflow, encompassing
methodologies, outcomes, and notable discoveries, will be meticulously documented in the final report.
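The comparison step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's implementation. The three toy regressors (a mean baseline, one-feature least squares, and k-nearest neighbours), the synthetic data, and every function name are assumptions made for the example. In practice the models would more likely come from a machine-learning library and the data from ChEMBL.

```python
import math
import random

def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Ordinary least squares for a single input feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k=3):
    """k-nearest neighbours: average the targets of the k closest points."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def cross_val_rmse(fit, xs, ys, n_folds=5):
    """Mean RMSE over n_folds held-out folds (strided split)."""
    idx = list(range(len(xs)))
    scores = []
    for f in range(n_folds):
        test = idx[f::n_folds]
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in test]
        scores.append(rmse([ys[i] for i in test], preds))
    return sum(scores) / n_folds

# Synthetic stand-in for (descriptor, potency) pairs; NOT real ChEMBL data.
rng = random.Random(0)
X = [rng.uniform(0, 10) for _ in range(100)]
y = [0.5 * x + 4.0 + rng.gauss(0, 0.3) for x in X]

results = {
    "mean baseline": cross_val_rmse(fit_mean, X, y),
    "linear regression": cross_val_rmse(fit_linear, X, y),
    "3-nearest neighbours": cross_val_rmse(fit_knn, X, y),
}
best = min(results, key=results.get)
for name, score in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: RMSE = {score:.3f}")
print("lowest cross-validated RMSE:", best)
```

Scoring every candidate on the same held-out folds keeps the comparison fair, and the mean baseline shows whether a model has learned anything beyond the average target value.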
MDR/RR TB: Multidrug-Resistant/Rifampicin-Resistant Tuberculosis, indicating
Bioinformatics: The field that brings together biology, computer science, and statistics to
manage and analyze biological data, often used in genomics and drug discovery.
Genomic Medicine: A medical domain employing genomic data to guide patient care,
Chapter Two
Literature review
2.0 Introduction
The field of bioinformatics encompasses the storage, retrieval, and analysis of extensive
biological data. It involves a diverse range of experts, including biologists, molecular life
helping manage the vast amounts of data scientists can now extract from living organisms. This
data can range from the simplicity of a single cell to the complexity of a person's immune
system. Bioinformatics is paving the way for future personalized medicine, enabling researchers
develop innovative biotechnologies, and refine legal and forensic tools. Enormous datasets
harbor patterns that might be challenging or even impossible for humans to manually uncover.
This capability is facilitated by advanced observation tools and increasingly powerful computers,
which work hand in hand to make this analysis feasible (PNNL 2023).
Bioinformatics has opened doors to research in previously uncharted domains through effective
data management and analysis (PNNL 2023). By harnessing bioinformatics tools, the value of
historical data is significantly amplified, enabling researchers to sift through extensive datasets
from various studies and unveil novel correlations. These tools facilitate the synthesis of
information collected globally, even from sources contributed by researchers with limited prior
familiarity. Moreover, these tools have the potential to enhance ongoing research endeavors.
their experimental designs. Data analysis aids in selecting targets for exploration and determining
the necessary sample sizes to achieve statistically significant findings. (Matti and Lloyd 2023)
Through the utilization of bioinformatics, the enigmas of life can be unraveled. This
interdisciplinary realm fuses biology, computer science, and statistics to delve into the mysteries
comprehensive outlook on biological processes. This process aids in pinpointing new targets for
drug development and refining disease diagnosis and treatment strategies. The intricate
interactions within living systems can be illuminated by scrutinizing vast biological datasets,
which would be a formidable task if done manually. The dynamic field of bioinformatics has the
potential to revolutionize healthcare and agriculture by unveiling the mysteries of life and
(Matti and Lloyd 2023) Bioinformatics finds application in various domains such as genomics,
proteomics, drug development, and personalized medicine. It plays a pivotal role in identifying
genes responsible for diseases, predicting potential adverse effects of therapies, and designing
new drugs with enhanced effectiveness and precision. Some of its application areas are:
Genomics
Proteomics
Drug Discovery
Personalized Medicine
Evolutionary Biology
Agriculture
Comparative Genomics
Structural Bioinformatics
Functional Genomics
Metagenomics
Systems Biology
Transcriptomics
Phylogenetics
Medical Informatics
(Ioannis 2022) Bioinformatics applications such as gene sequencing, genetic statistics, and gene
expression level measurements have significantly enhanced the dosage response, toxicity
profiles, and overall effectiveness of medications aimed at treating a range of genetic disorders.
The drug development journey, spanning from discovery to final approval, is both prolonged and
financially demanding. Bioinformatics aids in expediting and streamlining the target discovery
and validation phases, ultimately bolstering the efficiency and cost-effectiveness of the approval
process by increasing the number of successful drug candidates. (Matti and Lloyd 2023) Drug
discovery has historically been a challenging and intricate journey, often spanning years and
have brought about a fundamental shift in how we approach the quest for new medicines.
candidates and predict their effectiveness and safety, bioinformatics has revolutionized the
landscape of drug discovery. With cutting-edge tools like molecular docking and virtual
screening, researchers can now sift through vast volumes of biological data to uncover the
targets, paving the way for the development of novel medications with enhanced specificity and
efficacy.
2.3 Tuberculosis
According to Mayo Clinic (2023), tuberculosis (TB) is a dangerous disease primarily affecting
the lungs. It is caused by a specific type of bacteria. When an individual with tuberculosis
coughs, sneezes, or even talks, the disease can spread. This occurs through tiny droplets
containing the bacteria that are released into the air. Another person can inhale these droplets,
leading to the entry of the bacteria into their lungs. Tuberculosis can spread rapidly in places
where people gather or live in crowded conditions. The risk of contracting tuberculosis is notably
higher among those with conditions such as HIV/AIDS and other immune system disorders,
compared to healthy individuals. Treatment for TB involves the use of antibiotic medications.
However, there are strains of the bacteria that have become resistant to these treatments. (WHO
2023). It is estimated that approximately a quarter of the global population has been infected by
the TB bacteria. However, only around 5-10% of those who contract the infection will
subsequently exhibit symptoms and progress to the active disease stage. (Mary 2022) The
disease has existed for most of human history and has at times posed a significant threat. In fact,
tuberculosis can be traced back over 5,000 years to ancient Egypt. Moreover, references to TB
are found in the biblical books of Deuteronomy and Leviticus, using the Hebrew term
"schachepheth," while Hippocrates mentions it in his writings as "phthisis." It's likely that more
people have succumbed to M. tuberculosis than any other pathogen. Throughout the 18th and
19th centuries, tuberculosis was rampant in industrialized regions of Europe and North America,
causative agent of tuberculosis (TB). These bacteria are airborne and primarily target the lungs,
although they can also impact other areas of the body. TB is contagious, but its transmission is
not rapid. Generally, prolonged close contact with an infectious individual is required to contract
the disease. (WHO 2023) TB predominantly affects the respiratory system, particularly the lungs.
The transmission of this disease occurs when infected individuals cough, sneeze, or spit,
Mayo Clinic (2023) describes the stages and symptoms of tuberculosis, which helps in understanding
how the disease presents at each stage. A TB infection arises when tuberculosis (TB) bacteria
persist and multiply in the lungs.
a) Initial TB infection. During this stage, immune system cells identify and capture the
bacteria. While the immune system can effectively eliminate most of the bacteria, some
may remain and reproduce. In general, a primary infection doesn't exhibit noticeable
Mild fever.
Fatigue.
Coughing.
b) Latent tuberculosis, the stage that usually ensues after the primary infection, involves
immune system cells enclosing lung tissue containing TB bacteria with a protective
barrier. While the immune system successfully contains the bacteria, preventing further
damage, the bacteria remain present. During latent TB, no symptoms are apparent.
infection, allowing the illness to spread throughout the body, including the lungs. Active
TB can manifest shortly after the initial infection or may arise from a latent TB infection
that has persisted for months or years. Symptoms of active tuberculosis in the lungs typically
Coughing.
Chest pain.
Fever.
Chills.
Night sweats.
Weight loss.
Loss of appetite.
Fatigue.
affecting various parts of the body. Symptoms can vary based on the infected area.
Fever.
Chills.
Night sweats.
Weight loss.
Loss of appetite.
Fatigue.
Active TB disease in children can manifest with varying symptoms based on age:
Children aged 1 to 12 years: Younger children might experience persistent fever and weight loss.
Infants: Infants may not gain weight as expected. They might also show signs of swelling in the fluid
around the brain or spinal cord, such as lethargy, fussiness, vomiting, poor feeding, poor
reflexes, and a bulging soft spot on the skull. (Mayo Clinic 2023)
2.4 Life cycle of Mycobacterium tuberculosis
Lerm & Netea (2016) give a detailed breakdown of the life cycle of the bacterium.
Alveolar macrophages serve as the primary target cells for M. tuberculosis infection. When
inhaled particles carrying the infection enter the alveolar space, these cells encounter the
pathogen and engulf it. To counteract the microbicidal activities of these cells, M. tuberculosis
the phagosome with diminished antimicrobial capability, and subsequently, it enters the cytosol
through the membrane-damaging toxin known as early secreted antigenic target (ESAT)-6.
(Lerm & Netea, 2016) Within the cytosol, mycobacteria replicate efficiently before the primary
host cell is eventually killed, a process in which ESAT-6 plays a significant role. The dying cell-
infection site, further propagating the infection paradoxically. Interestingly, these cells can aid in
the propagation of the infection within the tissue. The initial formation of granulomas also relies
on ESAT-6 and is likely tied to the inflammation induced by toxin-triggered necrosis. As the
particularly CD4+ T helper type 1 (Th1) cells and CD8+ cytolytic T cells (CTLs), are attracted to
the developing granuloma, encircling the infected macrophages. Eventually, the inner core of the
granuloma undergoes necrosis, leading to its rupture. The life cycle of M. tuberculosis
culminates with the drainage of viable bacilli into the alveolar space, where coughing and
aerosol production facilitate the pathogen's dissemination to other individuals (Lerm & Netea,
2016).
(Flam, 2022) Machine learning algorithms provide significant benefits to the healthcare sector
by aiding in the interpretation of extensive volumes of healthcare data generated daily through
electronic health records. By utilizing machine learning techniques and algorithms, patterns and
insights within medical data can be discovered, surpassing the capabilities of manual
providers to adopt a predictive approach to precision medicine. This transition has the potential
2023), "Machine learning," an emerging domain within computer science, employs algorithms to
empower computers with the ability to learn and autonomously make decisions. The realm of
machine learning (ML) offers the potential to enhance various aspects of businesses and is
currently demonstrating remarkable achievements across multiple industries. With the surge in
data volumes reaching unprecedented levels, ML is playing an increasingly pivotal role in the
corporate landscape. For instance, it assists enterprises in extracting insights and value from their
data.
As stated by (Xia, 2017), the utilization of data-driven machine learning (ML) applications has
witnessed substantial growth in recent years, becoming an essential tool in the initial stages of
drug discovery. This rising trend is attributed to several factors, including the swift accumulation
of relevant experimental data from sources like DrugBank, ChEMBL, PDB, PubChem, and
every phase of the Small Molecule Drug Discovery and Development (SBDR) pipeline, along
with subsequent stages, can reap benefits from the integration of ML algorithms. These
algorithms can be applied to a range of tasks, encompassing drug screening, target screening,
prediction of target structures and binding sites, lead optimization, anticipation of drug-drug
understanding, several ML approaches, notably, aim to extract insights from existing data
In the field of bioinformatics, methods hold the potential to identify the underlying causes of
cancer in individual patients, offering the opportunity to tailor cancer therapy to a more
personalized level. This avenue holds the promise of developing novel and repurposed
medications targeting specific proteins, thereby selectively eliminating or incapacitating only the
affected cells. Additionally, bioinformatics plays a pivotal role in the domain of translational
drug discovery for infectious diseases. For instance, distinct gene expression patterns are
triggered within cells due to the presence of bacterial or viral infections. By comparing these
genetic profiles with those associated with other disorders and influenced by pharmaceutical
interventions, there exists the potential to repurpose existing drugs (Wooller et al., 2017).
1. Supervised Learning:
2. Unsupervised Learning:
Finding patterns and relationships in data without labeled outcomes is often used for
3. Semi-Supervised Learning:
Combining labeled and unlabeled data to enhance model accuracy, useful when obtaining
4. Reinforcement Learning:
(Gillis 2023) A technique for constructing artificial intelligence (AI), termed supervised
learning, entails training a computer system using labeled input data to predict a particular
output. The model is iteratively trained until it can discern the inherent relationships and patterns
connecting the input and output labels. This empowers the model to generate accurate
predictions when presented with previously unseen data. (IBM 2023) Supervised machine
learning distinguishes itself by the method it employs to train computers for accurately
classifying data or predicting outcomes using labeled datasets. The model adjusts its weights as
input data is fed into it, ensuring proper fitting during the cross-validation process. Applications
like segregating spam emails into separate folders from regular emails exemplify how supervised
outcomes. This training dataset comprises both appropriate inputs and corresponding outputs,
enabling the model's progressive improvement. The loss function serves as a metric for the
algorithm's accuracy, and iterations are executed until the error is sufficiently minimized. In the
context of data mining, supervised learning can be categorized into two main types: regression
and classification.
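The training loop described above, in which a loss function measures the error and iterations continue until that error is sufficiently small, can be illustrated with gradient descent on a one-feature linear model. This is a sketch under stated assumptions: the mean squared error loss, the learning rate, the stopping tolerance, the toy data, and the function names are all chosen for the example and are not taken from the cited sources.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on the labeled data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_linear(xs, ys, lr=0.05, tol=1e-10, max_iter=100_000):
    """Adjust the weights w and b by gradient descent until the loss stops improving."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = [mse(w, b, xs, ys)]
    for _ in range(max_iter):
        # Gradients of the MSE loss with respect to w and b.
        grad_w = 2 / n * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = 2 / n * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * grad_w
        b -= lr * grad_b
        history.append(mse(w, b, xs, ys))
        # Stop once an iteration reduces the loss by less than the tolerance.
        if history[-2] - history[-1] < tol:
            break
    return w, b, history

# Labeled training data generated from y = 2x + 1 (inputs with known outputs).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b, history = train_linear(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}, final loss = {history[-1]:.2e}")
```

The learned weights should approach the slope and intercept used to generate the data, and the recorded loss history makes the progressive improvement visible.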
When applying data mining techniques, supervised learning can be categorized into two main types, classification and regression:
Classification: This approach employs algorithms to accurately categorize test data into
distinct classes. It identifies specific entities within the dataset and strives to determine
trees, k-nearest neighbors, random forests, support vector machines (SVM), linear
When the output variable possesses a real or continuous value, regression is employed. In this
connection between two or more variables. For instance, examples include salary linked to
(MathWorks 2023) To forecast continuous outcomes, like the values of financial assets or
applications encompass algorithmic trading, virtual sensing, and electrical load prediction.
When dealing with a dataset spanning a range of values or if your response falls within the
domain of real numbers, such as temperature or the time until equipment failure, employ
regression methods. Presented below are some of the prevalent algorithms for conducting
regression.
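Because a regression model predicts real-valued outputs, its performance is usually summarized with error-based metrics rather than accuracy. The sketch below computes three common ones: mean absolute error, root mean squared error, and the coefficient of determination (R²). The function name and the sample values are assumptions made for illustration and are not results from this study.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Hypothetical true and predicted values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MAE and RMSE are expressed in the units of the target, while R² is unitless and compares the model against always predicting the mean.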
Fig 2.3 Regression analysis image
Yang et al. (2019) employed an extensive dataset spanning 16 countries and six
isolate sequences and corresponding drug susceptibility testing outcomes. Their introduced
model, DeepAMR (Deep Autoencoder for Multiple Drug Resistance Classification), and its
clustering variant, DeepAMR_cluster, were aimed at multi-drug classification and latent data
space clustering, respectively. The study showcased DeepAMR's superiority over baseline and
other models, achieving impressive mean AUROC scores (94.4% to 98.7%) for predicting
resistance to four primary drugs, MDR-TB, and PANS-TB. DeepAMR excelled in sensitivity as
well, with best rates seen for isoniazid (94.3%), ethambutol (91.5%), pyrazinamide (87.3%), and
MDR-TB (96.3%). However, some cases showed slightly lower sensitivity compared to baseline,
such as rifampicin and PANS-TB, with 0.7% and 1.9% reduction, respectively. The study also
detailed cross-resistance patterns, notably between INH and RIF, and examined multi-label vs.
single-label models, with DeepAMR's success attributed to abstract data use and non-linear
reduction. Although predicting resistance for specific drugs (e.g., EMB and PZA) posed
challenges, DeepAMR demonstrated significant improvement for these cases.
Nagamani & Sastry (2021) focused on addressing the challenges posed by drug-resistant strains of
Mycobacterium tuberculosis (M.tb) through the application of machine learning models and
computational drug repurposing. The authors highlight the urgent need for novel antitubercular
drugs due to the rapid evolution of drug-resistant M.tb strains. They point out that factors like
genetic mutations, the complex cell wall system of M.tb, and transporter systems contribute to
the ineffectiveness of many small molecules in arresting M.tb cell growth. The study's objective
was to overcome the permeability barriers of M.tb by developing machine learning models that
can distinguish between permeable and impermeable compounds. The authors utilized enzyme-
based (IC50) and cell-based (minimal inhibitory concentration) data to classify compounds based
on their permeability. The XGBoost machine learning model emerged as the top performer
compared to other algorithms like random forest, support vector machine, and naive Bayes.
Deelder et al. (2019) employ a machine learning-driven methodology to address the
authors worked with an extensive dataset of 16,688 M.tb isolates, each having undergone whole-
extensively drug-resistant profiles, underlining the significance of their study. The authors
predict drug resistance and identify potential associated mutations. This approach allowed the
creation of separate models for each drug, and they also considered the influence of "co-
occurrent resistance" markers, known to cause resistance to drugs other than the one under
and the area under the receiver operating characteristic curve, with DST outcomes serving as the
benchmark for evaluation. Notably, their models demonstrated particularly high accuracy in
predicting resistance to first-line drugs and several second-line drugs, with the area under the
receiver operating characteristic curve exceeding 96%. However, the performance was
comparatively lower for certain third-line drugs. The inclusion of co-occurrent resistance
markers notably enhanced the predictive capabilities of some drugs, leading to superior
Hadikurniawati et al. (2021) compare the performance of ML models using 10-fold
cross-validation. Contrary to previous research that favored certain models over others, this study
finds that the best-performing model is data-specific. This conclusion aligns with observations
from other studies. Nevertheless, the research achieves better results compared to recent studies
and notes that certain methods, particularly Logistic Regression and MD-WDNN, exhibit similar
performance levels. Additional parameter tuning is conducted using the scikit-learn library in
Python, reinforcing the data-specific nature of model performance. They further underscore the
successful application of ML techniques to predict MTB drug resistance based on DNA data.
With an impressive accuracy rate of up to 99% and high Area Under Curve (AUC) values, the
ML approach holds promise in tuberculosis drug resistance prediction. The study emphasizes the
data-specific nature of model performance and highlights the potential for slight improvements
through parameter tuning. Overall, the research contributes to the growing body of knowledge on
employing ML for tuberculosis drug resistance prediction, showcasing its potential as a valuable
tool in medical research and diagnostics.
Ye et al. (2021) present a research endeavor focused
tuberculosis (Mtb), a leading global cause of mortality. The emergence of extensively drug-
resistant TB has underscored the necessity for novel drug candidates. This study leverages
various machine learning (ML) algorithms, including support vector machine, random forest
(RF), extreme gradient boosting (XGBoost), and deep neural networks (DNN), to construct
classification models that distinguish Mtb inhibitors from non-inhibitors. The outcomes reveal
that the XGBoost model displays the most robust predictive performance. To enhance accuracy
further, two consensus strategies are employed by integrating predictions from multiple models.
The stacking model that combines predictions from RF, XGBoost, and DNN offers the highest
accuracy, with an area under the receiver operating characteristic curve (AUC) of 0.842 for the
10-fold cross-validated training set and 0.942 for the external test set. The authors also explore
the relationship between important molecular descriptors and bioactivities using the Shapley
additive explanations method.
Radchenko et al. (2023) establish a well-structured
foundation for their modeling approach, emphasizing the significance of a diverse dataset. They
draw upon publicly available data to create a dataset containing both target-based and cell-based
assay results. Their preprocessing methods and dataset preparation are thorough, reflecting the
complexities and challenges of working with large, heterogeneous datasets. The utilization of
fragmental descriptors and neural networks for modeling is well-justified, considering their prior
success in other QSAR and QSPR applications. The architecture of the neural network, with its
design for robustness and validation. The presentation of their modeling process is detailed and
enhanced predictivity. Comparisons with other models in the literature add to the paper's
credibility.
Deelder et al. (2022) address the growing concern of drug-resistant
emphasize the importance of incorporating whole genome sequencing and machine learning
techniques to predict drug resistance and identify genetic mutations associated with M.
approaches without tailoring them to the specific context of tuberculosis. To address these
specifically for tuberculosis. This approach focuses on extracting and analyzing genomic variants
across multiple studies to enhance genotypic profiling. The authors applied Treesist-TB to
predict drug resistance for well-known drugs like rifampicin, isoniazid, and ethambutol,
achieving predictive accuracy comparable to existing tools like TB-Profiler.
Hrizi et al. (2022) apply their techniques to computed tomography (CT) scans of TB patients, with a
division into training and testing sets. Feature extraction using the spatial gray-level dependence
method (SGLDM) is conducted. The results are presented in terms of hyper-parameter and
feature selection. The study employs Python, utilizing an RTX 2060 Graphics Card and 16 GB of
RAM. The ImageCLEF 2020 dataset is used, employing multi-label classification for lung
conditions. The metric of interest is accuracy. Experiments involve a range of machine learning
methods, with Sklearn as the toolkit for comparison. SVM hyper-parameter selection employs a
genetic algorithm, focusing on radial basis function (RBF) kernel performance. Performance is
compared with known classifiers like KNN, CART, NB, LDA, and RF. Focusing on tuberculosis
(TB), the study proposes an optimized machine learning-based model that extracts optimal
texture features from TB-related images and simultaneously fine-tunes classifier hyper-parameters.
The overarching objectives are to improve accuracy and reduce the number of
involves a genetic algorithm (GA) for feature selection followed by a support vector machine
(SVM) classifier. Experimental results, using the ImageCLEF 2020 dataset, demonstrate
improved accuracy and outperforming state-of-the-art methods through the enhanced approach.
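The idea of using a genetic algorithm to choose a feature subset can be sketched as follows. This is not the procedure of Hrizi et al. (2022): the population size, tournament selection, one-point crossover, mutation rate, and elitism are generic choices assumed for illustration. The fitness function is a hypothetical toy that rewards three "relevant" features and penalizes subset size. In a real pipeline the fitness would instead be something like the validation accuracy of a classifier trained on the selected features.

```python
import random

def genetic_feature_selection(fitness, n_features, pop_size=20,
                              generations=30, mutation_rate=0.05, seed=0):
    """Return the best 0/1 feature mask found by a simple genetic algorithm."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Pick two random individuals and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            parent1, parent2 = tournament(), tournament()
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = parent1[:cut] + parent2[cut:]
            # Bit-flip mutation: toggle each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best

# Hypothetical toy fitness: features 0, 3, and 5 are "relevant";
# every selected feature costs 0.1 to favour small subsets.
RELEVANT = {0, 3, 5}

def toy_fitness(mask):
    hits = sum(mask[i] for i in RELEVANT)
    return hits - 0.1 * sum(mask)

best_mask = genetic_feature_selection(toy_fitness, n_features=10)
selected = [i for i, bit in enumerate(best_mask) if bit]
print("selected features:", selected, "fitness:", round(toy_fitness(best_mask), 2))
```

Because the best mask is copied unchanged into each new generation, the best fitness never decreases from one generation to the next.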
Kuang et al. (2022) concisely introduce the research's motivation, its novel deep learning
approach, and the cohort used for AMR prediction. The results section effectively presents the
data analysis, feature selection, training, and validation processes. The comparison of model
performance with a rule-based method provides clear insights. They further delve into the results,
emphasizing the substantial increase in F1-score achieved by the best ML classifiers compared to
the rule-based Mykrobe predictor. The performance of the 1D CNN model is slightly superior to
training. The impact of feature selection on reducing resource demands is noted. The potential
for hyperparameter optimization and the inclusion of novel variants for improved model
performance is discussed. The importance of managing imbalanced classes and the consideration
of sensitivity and specificity in clinical settings are acknowledged. The potential extension of the
model to bacteria with plasmid-mediated resistance is examined, along with the importance of
diverse datasets in managing overfitting. The study's reliance on the F1-score metric is discussed,
along with the introduction of the G-mean metric. Automation of the entire process into a
flexible pipeline is highlighted, enabling easy adaptation and expansion of the models for other
antibiotics and bacteria. The overall focus on accurate AMR prediction is emphasized. Jamal et
al. (2020) present a computational framework that utilizes artificial intelligence (AI) and
machine learning (ML) to predict resistant and susceptible mutations in
Mycobacterium tuberculosis (M.tb) using high-throughput sequencing data. The authors focus on
specific genes related to drug resistance and utilize various ML algorithms to build prediction
models. The study includes dataset preparation, model evaluation, and the impact analysis of
predicted mutations on protein stability. They indicate the successful development of prediction
models for several genes associated with drug resistance, including rpoB, inhA, katG, pncA,
gyrA, and gyrB. The models exhibit good accuracy in predicting the susceptibility or resistance
of mutations, achieving approximately 70% accuracy on average in the training dataset. The
authors evaluate the models using non-redundant testing data, showcasing accuracy ranging from
66.66% to 100%. Performance varies among different genes, with artificial neural network
(ANN) models generally performing the best. Furthermore, the authors highlight the significance
of their approach in predicting drug resistance and classifying mutations. They emphasize the
importance of various features, such as changes in amino acid properties and stability
calculations, in accurately predicting mutation effects. The potential utility of the models for
Year
Diagnosis Based on an Optimized Machine Learning Model
machine (SVM), K-Nearest Neighbors (KNN), Classification and Regression Tree (CART), Naïve Bayes (NB), Linear Discriminant Analysis (LDA), and Random Forest (RF).
machine learning algorithms (SVM, KNN, CART, NB, LDA, and RF) showed that SVM (0.84) classifier was more accurate than the other classification algorithms, while KNN (0.82), LDA (0.82) and RF (0.81) performed better than CART (0.73) and NB (0.67).
some irrelevant characteristics that increase the likelihood that the learning models will be overfit, complex, and challenging to understand, leading to low efficiency and poor performance.
against Mycobacterium tuberculosis through machine learning
(SVM), Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Deep Neural Networks (DNN)
RDKitFP performs the best with AUC = 0.832. Overall, all the models perform well, with the AUC values all higher than 0.91. The stacking model outperforms the other four individual models, with an average AUC = 0.935 and ACC = 0.878 for the scaffold test set.
unbalanced. 2). For each scaffold, the inhibitors and noninhibitors are imbalanced.
respectively.
gyrB, respectively.
classification for first-line medications, and (ii) a small number of resistant isolates would easily lead to overfitting for such a complex model. 3). The permutation feature is unable to distinguish between feature relationships.
Sequencing Data
PZA (69.7%), RIF (88.8%) and INH (91.1%) had stronger GBT-CRM sensitivity. CIP (85.7%), OFL (81.0%), and MOX (53.3%) had the highest fluoroquinolone sensitivity. The injectables with the highest sensitivity were KAN (82.2%), AMK (80.5%), and CAP (74.6%).
such as bedaquiline, delamanid, and linezolid, as well as XDR-TB.
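The per-drug sensitivities above, and the F1-score and G-mean discussed for Kuang et al. (2022), all derive from the same confusion-matrix counts. The following Python sketch shows how they relate. The function name binary_metrics and the example labels are hypothetical and are not taken from any of the reviewed studies. G-mean is computed here as the geometric mean of sensitivity and specificity, which is the usual definition for imbalanced binary problems.

```python
import math


def binary_metrics(y_true, y_pred, positive=1):
    """Return sensitivity, specificity, precision, F1 and G-mean for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)  # recall on the positive (resistant) class
    specificity = ratio(tn, tn + fp)  # recall on the negative (susceptible) class
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical labels: 1 = resistant, 0 = susceptible.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because G-mean multiplies the two class-wise recalls, it drops sharply when either the resistant or the susceptible class is predicted poorly, which is why it is often reported alongside F1 when resistant isolates are rare.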
References
Applying Bioinformatics in Clinical Drug Discovery. (n.d.). Retrieved August 24, 2023, from
https://www.longdom.org/open-access/applying-bioinformatics-in-clinical-drug-discovery.pdf
articles/bioinformatics
Deelder, W., Christakoudi, S., Phelan, J., Benavente, E. D., Campino, S., McNerney, R., Palla,
tuberculosis drug resistance from whole genome sequencing data. Frontiers in Genetics,
10(SEP). https://doi.org/10.3389/fgene.2019.00922
Deelder, W., Napier, G., Campino, S., Palla, L., Phelan, J., & Clark, T. G. (2022). A modified
decision tree approach to improve the prediction and mutation discovery for drug
https://doi.org/10.1186/s12864-022-08291-4
Hadikurniawati, W., Anwar, M. T., Marlina, D., & Kusumo, H. (2021). Predicting tuberculosis
drug resistance using machine learning based on DNA sequencing data. Journal of
Hrizi, O., Gasmi, K., ben Ltaifa, I., Alshammari, H., Karamti, H., Krichen, M., ben Ammar, L.,
https://doi.org/10.1155/2022/8950243
Jamal, S., Khubaib, M., Gangwar, R., Grover, S., Grover, A., & Hasnain, S. E. (2020). Artificial
Intelligence and Machine learning based prediction of resistant and susceptible mutations
in Mycobacterium tuberculosis. Scientific Reports, 10(1). https://doi.org/10.1038/s41598-020-62368-2
Kuang, X., Wang, F., Hernandez, K. M., Zhang, Z., & Grossman, R. L. (2022). Accurate and
rapid prediction of tuberculosis drug resistance from genome sequence data using
https://doi.org/10.1038/s41598-022-06449-4
Nagamani, S., & Sastry, G. N. (2021). Mycobacterium tuberculosis cell wall permeability model
Radchenko, E. V., Antonyan, G. V., Ignatov, S. K., & Palyulin, V. A. (2023). Machine Learning
Romano, J. D., & Tatonetti, N. P. (2019). Informatics and computational methods in natural
442506. https://doi.org/10.3389/FGENE.2019.00368/BIBTEX
Tuberculosis (TB) | Cedars-Sinai. (n.d.). Retrieved August 24, 2023, from
https://www.cedars-sinai.org/health-library/diseases-and-conditions/t/tuberculosis-tb.html
Tuberculosis (TB): Symptoms, treatment, diagnosis, and more. (n.d.). Retrieved August 24,
sheets/detail/tuberculosis
Tuberculosis: Causes, Symptoms, Diagnosis & Treatment. (n.d.). Retrieved August 24, 2023,
from https://my.clevelandclinic.org/health/diseases/11301-tuberculosis
What is bioinformatics, and why is it important? (n.d.). Retrieved August 24, 2023, from
https://bioinformaticshome.com/blog/What_is_bioinformatics_why_%20important.html
What is bioinformatics? | Bioinformatics for the terrified. (n.d.). Retrieved August 24, 2023,
from https://www.ebi.ac.uk/training/online/courses/bioinformatics-terrified/what-bioinformatics/
Yang, Y., Walker, T. M., Walker, A. S., Wilson, D. J., Peto, T. E. A., Crook, D. W., Shamout,
F., Zhu, T., Clifton, D. A., Arandjelovic, I., Comas, I., Farhat, M. R., Gao, Q.,
Sintchenko, V., Soolingen, D., Hoosdally, S., Cruz, A. L. G., Carter, J., Grazian, C., …
https://doi.org/10.1093/bioinformatics/btz067
Ye, Q., Chai, X., Jiang, D., Yang, L., Shen, C., Zhang, X., Li, D., Cao, D., & Hou, T. (2021).