Chapter One

Introduction

1.0 Background of the study

Safeguarding our body's well-being should stand as a top priority in our pursuit of good health.

To achieve this, it's crucial to possess an understanding of how vulnerable we are to bacteria,

parasites, and disease-causing organisms. Mycobacterium tuberculosis is a pathogenic species of

bacteria that causes tuberculosis. According to World Life Expectancy (2023), based on the latest

data from the WHO, tuberculosis claimed the lives of 127,335 individuals in Nigeria during

2020, constituting approximately 8.60% of all reported deaths. On a global scale, Nigeria holds

the sixth position with a death rate of 99.13 per 100,000 people, after adjusting for age-related

factors. According to the Centers for Disease Control and Prevention (2023), tuberculosis (TB)

claims 1.6 million lives annually and remains among the most lethal infectious diseases globally.

According to Bakula et al. (2022), Nigeria ranked among the three leading countries in 2019, with

an estimated 440,000 new TB infections and 154,000 TB-related fatalities. Nigeria further stands

out with rates approaching 11 and 23 per 100,000 individuals, signifying one of the most

substantial prevalence rates for MDR/RR TB and TB/HIV coinfection worldwide. According to

Poonam (2023), tuberculosis (TB) is a dangerous infection that typically affects the lungs. It can

also spread to other bodily regions such as the brain and spine. The infection is caused by the

bacterium Mycobacterium tuberculosis. This specific bacterium is believed to have existed for over

three million years. Even ancient Greece and Rome had knowledge of this ailment. At the onset

of the 1900s, tuberculosis, formerly known as consumption, stood as the primary reason for

mortality in the US. Despite the effective management of tuberculosis (TB) today, over a million

people still succumb to it annually worldwide. Mycobacterium tuberculosis is a medically and

scientifically significant bacterium due to its role in causing tuberculosis and its impact on global

public health. Researchers and healthcare professionals continue to study and combat this

bacterium to improve diagnosis, treatment, and prevention strategies for tuberculosis.

Researchers use various approaches, including molecular biology, genomics, and bioinformatics,

to study the bacterium's genetics, drug resistance mechanisms, and interactions with the human

immune system. According to Leverage Edu (2023), bioinformatics finds its applications in medicine across a

spectrum, spanning from drug and gene exploration to safeguarding against diseases. The

contribution of bioinformatics researchers in pharmaceutical advancement, particularly in

studying infectious ailments, is paramount. Moreover, bioinformatics strides forward in shaping

personalized medicine, providing fresh perspectives into crafting medicines attuned to an

individual's genetic makeup. This amalgamation of computer science, statistics, and biology

spreads its influence across diverse domains, encompassing agricultural progress and veterinary

investigation. In this study, our contribution will extend to advancing the health sector through

the utilization of Machine Learning and bioinformatics in the realm of drug discovery. The

fusion of bioinformatics and Artificial Intelligence (AI), particularly Machine Learning (ML),

holds great promise in revolutionizing the healthcare landscape. This synergistic combination

enables the extraction of meaningful insights from vast and complex biological data, ultimately

leading to more accurate diagnoses, effective treatments, and enhanced personalized care.

1.1 Aims and Objectives

This project aims to employ bioinformatics and machine learning techniques to expedite the

discovery of novel drug candidates targeting Mycobacterium tuberculosis, with the overall goal

of contributing to more effective treatment strategies against tuberculosis.

The objectives of the study are to: review the existing literature on bioinformatics, machine

learning applications, and Mycobacterium tuberculosis; conduct a comparative analysis of a

minimum of three regression algorithms to identify the most suitable solution for aiding drug

discovery; implement the selected algorithm that best aligns with the specific problem; and

rigorously evaluate the model's performance and results.

1.2 Statement of the problem

Tuberculosis (TB), caused by Mycobacterium tuberculosis, remains a major problem in medical

research and treatment. Some strains of the bacterium have become resistant to existing drugs,

making new treatments an urgent need. Conventional drug discovery is slow and costly, because

large numbers of compounds must be tested in the laboratory, and as a result few effective drugs

are available. The bacterium is also highly variable, which makes it difficult to identify suitable

drug targets and to design compounds that act against all strains. In addition, the interactions

between candidate drugs and the bacterium are complex, which makes it hard to build

computational models that predict them accurately. To address these problems, researchers are

turning to computational and machine learning (ML) techniques to speed up the search for new

drugs that work against different strains of TB. By using computers to predict which compounds

are the most promising, scientists can save time and money.

1.3 Scope of the study

The scope of this project is to utilize bioinformatics and machine learning (ML) techniques to

determine the effectiveness of drugs against Mycobacterium tuberculosis. The primary goal is to

enhance our understanding of the drug discovery process targeted at inhibiting the bacterium's

growth.

1.4 Significance/Justification of the Study

The significance of this study resides in its potential to reshape the course of TB drug discovery.

By revolutionizing the identification of effective drug candidates, it directly addresses the

pressing requirement for improved TB treatments. This study bridges the gap between traditional

laboratory methods and contemporary computational approaches. The integration of

bioinformatics and machine learning paves the way for interdisciplinary collaboration and

stimulates innovation in drug discovery methodologies. By utilizing computational techniques,

we gain the ability to swiftly identify potential drugs with the right potency for treating

Tuberculosis. This study is of paramount importance given the urgency for enhanced TB

treatments, particularly against drug-resistant strains. The methods used can also help find

treatments for other diseases and make the whole process of finding new drugs better and

faster. Ultimately, it is about improving healthcare and finding cures more quickly.

1.5 Methodology Overview

To acquire insights into Mycobacterium tuberculosis, regression algorithms, machine learning, and

bioinformatics, a comprehensive literature review will be conducted. A minimum of three

regression methods will undergo comparison to identify the most suitable approach for drug

discovery challenges. The dataset for constructing the model will be sourced from publicly

available online repositories like the ChEMBL database. The chosen model will be implemented

using Python and its associated machine-learning libraries. An in-depth evaluation of the model's

performance will be conducted. Subsequently, the predictive model will be seamlessly integrated

for user interaction within a web application developed using Streamlit—a dedicated Python

package designed for crafting machine learning applications. The entire workflow, encompassing

methodologies, outcomes, and notable discoveries, will be meticulously documented in the final

report of the project.
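
The comparison step described above can be illustrated with a minimal sketch, which is not the project's actual pipeline. It assumes a single hypothetical numeric descriptor x and a synthetic activity value y in place of real ChEMBL data, and it uses three deliberately simple stand-in regressors (a mean baseline, a least-squares line, and k-nearest neighbours) written in plain Python. All function names are illustrative. Each model is scored by k-fold cross-validated root-mean-square error (RMSE), and the model with the lowest error would be the one carried forward.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Least-squares line y = a + b * x for a single descriptor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_knn(xs, ys, k=3):
    """k-nearest-neighbours regression: average the k closest training targets."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def cv_rmse(fit, xs, ys, k=5):
    """Train on k-1 folds, predict the held-out fold, and pool the squared errors."""
    sq_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        sq_errors += [(model(xs[i]) - ys[i]) ** 2 for i in fold]
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Synthetic stand-in data (assumption): a noisy linear relationship.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 1.5 * x + rng.gauss(0, 0.3) for x in xs]

models = {
    "mean baseline": fit_mean,
    "linear regression": fit_linear,
    "3-NN regression": fit_knn,
}
scores = {name: cv_rmse(fit, xs, ys) for name, fit in models.items()}
best = min(scores, key=scores.get)

for name, rmse in scores.items():
    print(f"{name}: RMSE = {rmse:.3f}")
print("lowest cross-validated error:", best)
```

In the project itself, the stand-in models would presumably be replaced by regressors from the chosen Python machine-learning libraries, while the evaluation loop would keep the same shape.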

1.6 Definition of Terms

• Mycobacterium tuberculosis: A type of bacterium that causes tuberculosis (TB), an

infectious disease primarily affecting the lungs.

• MDR/RR TB: Multidrug-Resistant/Rifampicin-Resistant Tuberculosis, indicating

strains of TB bacteria that have developed resistance to commonly used antibiotics.

• TB/HIV coinfection: The presence of both tuberculosis and human immunodeficiency

virus (HIV) infections in an individual.

• Bioinformatics: The field that brings together biology, computer science, and statistics to

manage and analyze biological data, often used in genomics and drug discovery.

• Personalized Medicine: Medical treatment customized for an individual based on their

genetic information, medical history, and other relevant factors.

• Genomic Medicine: A medical domain employing genomic data to guide patient care,

including disease diagnosis, treatment selection, and prevention strategies.

• Drug Repurposing: The procedure of identifying new applications for pre-existing

medications, often leveraging computational methods to discover alternative uses.

• Public Health Management: The practice of safeguarding and enhancing community

health through coordinated efforts of both public and private organizations.

• ChEMBL Database: A publicly available database containing information about

bioactive molecules, their properties, and their targets.

Chapter Two

Literature review

2.0 Introduction

The field of bioinformatics encompasses the storage, retrieval, and analysis of extensive

biological data. It involves a diverse range of experts, including biologists, molecular life

scientists, computer scientists, and mathematicians, working collaboratively in this highly

interdisciplinary domain (EMBL-EBI 2023). Computers play a vital role in bioinformatics,

helping manage the vast amounts of data scientists can now extract from living organisms. This

data can range from the simplicity of a single cell to the complexity of a person's immune

system. Bioinformatics is paving the way for future personalized medicine, enabling researchers

to decode the human genome, gain a comprehensive understanding of biological systems,

develop innovative biotechnologies, and refine legal and forensic tools. Enormous datasets

harbor patterns that might be challenging or even impossible for humans to manually uncover.

This capability is facilitated by advanced observation tools and increasingly powerful computers,

which work hand in hand to make this analysis feasible (PNNL 2023).

2.1 Importance of Bioinformatics

Bioinformatics has opened doors to research in previously uncharted domains through effective

data management and analysis. According to PNNL (2023), by harnessing bioinformatics tools, the value of

historical data is significantly amplified, enabling researchers to sift through extensive datasets

from various studies and unveil novel correlations. These tools facilitate the synthesis of

information collected globally, even from sources contributed by researchers with limited prior

familiarity. Moreover, these tools have the potential to enhance ongoing research endeavors.

Conducting in silico experiments is relatively straightforward, allowing researchers to fine-tune

their experimental designs. Data analysis aids in selecting targets for exploration and determining

the necessary sample sizes to achieve statistically significant findings. According to Matti and

Lloyd (2023), the enigmas of life can be unraveled through the utilization of bioinformatics. This

interdisciplinary realm fuses biology, computer science, and statistics to delve into the mysteries

of biological systems. By amalgamating data from diverse origins, bioinformatics offers a

comprehensive outlook on biological processes. This process aids in pinpointing new targets for

drug development and refining disease diagnosis and treatment strategies. The intricate

interactions within living systems can be illuminated by scrutinizing vast biological datasets,

which would be a formidable task if done manually. The dynamic field of bioinformatics has the

potential to revolutionize healthcare and agriculture by unveiling the mysteries of life and

uncovering innovative avenues to enhance human health and well-being.

2.1.2 Application areas of Bioinformatics

According to Matti and Lloyd (2023), bioinformatics finds application in various domains such as genomics,

proteomics, drug development, and personalized medicine. It plays a pivotal role in identifying

genes responsible for diseases, predicting potential adverse effects of therapies, and designing

new drugs with enhanced effectiveness and precision. Some of its application areas are:

• Genomics

• Proteomics

• Drug Discovery

• Personalized Medicine

• Evolutionary Biology

• Agriculture

• Comparative Genomics

• Structural Bioinformatics

• Functional Genomics

• Metagenomics

• Systems Biology

• Transcriptomics

• Phylogenetics

• Medical Informatics

2.2 Bioinformatics in drug discovery

According to Ioannis (2022), bioinformatics applications such as gene sequencing, genetic statistics, and gene

expression level measurements have significantly enhanced the dosage response, toxicity

profiles, and overall effectiveness of medications aimed at treating a range of genetic disorders.

The drug development journey, spanning from discovery to final approval, is both prolonged and

financially demanding. Bioinformatics aids in expediting and streamlining the target discovery

and validation phases, ultimately bolstering the efficiency and cost-effectiveness of the approval

process by increasing the number of successful drug candidates. According to Matti and Lloyd (2023), drug

discovery has historically been a challenging and intricate journey, often spanning years and

requiring investments of billions of dollars. However, recent advancements in bioinformatics

have brought about a fundamental shift in how we approach the quest for new medicines.

Through the utilization of state-of-the-art computational techniques to identify potential drug

candidates and predict their effectiveness and safety, bioinformatics has revolutionized the

landscape of drug discovery. With cutting-edge tools like molecular docking and virtual

screening, researchers can now sift through vast volumes of biological data to uncover the

intricate interactions between pharmaceuticals and biological systems. By amalgamating

information from diverse sources, bioinformatics provides a comprehensive view of therapeutic

targets, paving the way for the development of novel medications with enhanced specificity and

efficacy.

2.3 Tuberculosis

According to Mayo Clinic (2023), tuberculosis (TB) is a dangerous disease primarily affecting

the lungs. It is caused by a specific type of bacteria. When an individual with tuberculosis

coughs, sneezes, or even talks, the disease can spread. This occurs through tiny droplets

containing the bacteria that are released into the air. Another person can inhale these droplets,

leading to the entry of the bacteria into their lungs. Tuberculosis can spread rapidly in places

where people gather or live in crowded conditions. The risk of contracting tuberculosis is notably

higher among those with conditions such as HIV/AIDS and other immune system disorders,

compared to healthy individuals. Treatment for TB involves the use of antibiotic medications.

However, there are strains of the bacteria that have become resistant to these treatments.

According to WHO (2023), approximately a quarter of the global population is estimated to have been infected by

the TB bacteria. However, only around 5-10% of those who contract the infection will

subsequently exhibit symptoms and progress to the active disease stage. According to Mary (2022), the

disease has existed for most of human history and has at times posed a significant threat. In fact,

tuberculosis can be traced back over 5,000 years to ancient Egypt. Moreover, references to TB

are found in the biblical books of Deuteronomy and Leviticus, using the Hebrew term

"schachepheth," while Hippocrates mentions it in his writings as "phthisis." It's likely that more

people have succumbed to M. tuberculosis than any other pathogen. Throughout the 18th and

19th centuries, tuberculosis was rampant in industrialized regions of Europe and North America,

earning it the name "consumption."

2.3.1 Causes of Tuberculosis

According to Cleveland Clinic (2023), the bacterium Mycobacterium tuberculosis is the

causative agent of tuberculosis (TB). These bacteria are airborne and primarily target the lungs,

although they can also impact other areas of the body. TB is contagious, but its transmission is

not rapid. Generally, prolonged close contact with an infectious individual is required to contract

the disease. According to WHO (2023), TB predominantly affects the respiratory system, particularly the lungs.

The transmission of this disease occurs when infected individuals cough, sneeze, or spit,

releasing the bacteria into the air.

2.3.2 Stages and Symptoms of Tuberculosis

Mayo Clinic (2023) describes the stages and symptoms of TB, which gives a better

understanding of how the disease presents at each of its different

stages. A TB infection arises when tuberculosis (TB) bacteria persist and multiply in the lungs.

There are three stages of TB infection, each characterized by specific symptoms.

a) Primary TB infection. During this stage, immune system cells identify and capture the

bacteria. While the immune system can effectively eliminate most of the bacteria, some

may remain and reproduce. In general, a primary infection doesn't exhibit noticeable

symptoms. However, a few individuals might experience flu-like symptoms, including:

• Mild fever.

• Fatigue.

• Coughing.

b) Latent tuberculosis, the stage that usually ensues after the primary infection, involves

immune system cells enclosing lung tissue containing TB bacteria with a protective

barrier. While the immune system successfully contains the bacteria, preventing further

damage, the bacteria remain present. During latent TB, no symptoms are apparent.

c) Active tuberculosis develops when the immune system is unable to suppress an

infection, allowing the illness to spread throughout the body, including the lungs. Active

TB can manifest shortly after the initial infection or may arise from a latent TB infection

that has persisted for months or years. Symptoms of active TB in the lungs typically

begin mildly and worsen over several weeks, possibly including:

• Coughing.

• Coughing up blood or mucus.

• Chest pain.

• Discomfort while breathing or coughing.

• Fever.

• Chills.

• Night sweats.

• Weight loss.

• Loss of appetite.

• Fatigue.

• Overall feeling of unwellness.

d) Extrapulmonary tuberculosis refers to the spread of TB infection beyond the lungs,

affecting various parts of the body. Symptoms can vary based on the infected area.

Common signs may encompass:

• Fever.

• Chills.

• Night sweats.

• Weight loss.

• Loss of appetite.

• Fatigue.

• Overall feeling of unwellness.

• Pain near the site of infection.

Active TB disease in children can manifest with varying symptoms based on age:

Teenagers: Symptoms resemble those seen in adults.

Children aged 1 to 12 years: Younger children might experience persistent fever and weight loss.

Infants: Infants may not gain weight as expected. They might also show signs of swelling in the fluid

around the brain or spinal cord, such as lethargy, fussiness, vomiting, poor feeding, poor

reflexes, and a bulging soft spot on the head (Mayo Clinic 2023).

2.4 Life cycle of Mycobacterium tuberculosis

Lerm and Netea (2016) give a detailed account of the life cycle of the

bacterium and of its behaviour throughout that life cycle.

Fig 2.0 Lifecycle of Mycobacterium tuberculosis

Alveolar macrophages serve as the primary target cells for M. tuberculosis infection. When

inhaled particles carrying the infection enter the alveolar space, these cells encounter the

pathogen and engulf it. M. tuberculosis employs various mechanisms to counteract the

microbicidal activities of these cells, which include phagosomal acidification, activation of proteolytic

enzymes within acidified phagolysosomes, production of antimicrobial peptides, and generation

of reactive oxygen and nitrogen metabolites. Consequently, if the macrophage defense is

ineffective, M. tuberculosis establishes itself within an intracellular niche. Initially, it resides in

the phagosome with diminished antimicrobial capability, and subsequently, it enters the cytosol

through the action of the membrane-damaging toxin known as early secreted antigenic target (ESAT)-6

(Lerm & Netea, 2016). Within the cytosol, mycobacteria replicate efficiently before the primary

host cell is eventually killed, a process in which ESAT-6 plays a significant role. The dying cell-

induced inflammation prompts the recruitment of additional monocytes/macrophages to the

infection site. Paradoxically, these cells can aid in

the propagation of the infection within the tissue. The initial formation of granulomas also relies

on ESAT-6 and is likely tied to the inflammation induced by toxin-triggered necrosis. As the

inflammatory process advances, macrophage clusters termed granulomas emerge. Lymphocytes,

particularly CD4+ T helper type 1 (Th1) cells and CD8+ cytolytic T cells (CTLs), are attracted to

the developing granuloma, encircling the infected macrophages. Eventually, the inner core of the

granuloma undergoes necrosis, leading to its rupture. The life cycle of M. tuberculosis

culminates with the drainage of viable bacilli into the alveolar space, where coughing and

aerosol production facilitate the pathogen's dissemination to other individuals (Lerm & Netea,

2016).

2.5 Machine Learning

According to Flam (2022), machine learning algorithms provide significant benefits to the healthcare sector

by aiding in the interpretation of extensive volumes of healthcare data generated daily through

electronic health records. By utilizing machine learning techniques and algorithms, patterns and

insights within medical data can be discovered, surpassing the capabilities of manual

identification. The growing integration of machine learning in healthcare enables healthcare

providers to adopt a predictive approach to precision medicine. This transition has the potential

to establish a more cohesive system, characterized by improved treatment administration,

enhanced patient outcomes, and streamlined patient-centered operations. According to Predik

(2023), machine learning, an emerging domain within computer science, employs algorithms to

empower computers with the ability to learn and autonomously make decisions. The realm of

machine learning (ML) offers the potential to enhance various aspects of businesses and is

currently demonstrating remarkable achievements across multiple industries. With the surge in

data volumes reaching unprecedented levels, ML is playing an increasingly pivotal role in the

corporate landscape. For instance, it assists enterprises in extracting insights and value from their

data.

2.5.1 Machine learning in Drug discovery

As stated by Xia (2017), the utilization of data-driven machine learning (ML) applications has

witnessed substantial growth in recent years, becoming an essential tool in the initial stages of

drug discovery. This rising trend is attributed to several factors, including the swift accumulation

of relevant experimental data from sources like DrugBank, ChEMBL, PDB, PubChem, and

PDBbind. Furthermore, the advancement of contemporary ML techniques, libraries, and the

availability of cost-effective computing power have contributed to this phenomenon. Virtually

every phase of the Small Molecule Drug Discovery and Development (SBDR) pipeline, along

with subsequent stages, can reap benefits from the integration of ML algorithms. These

algorithms can be applied to a range of tasks, encompassing drug screening, target screening,

prediction of target structures and binding sites, lead optimization, anticipation of drug-drug

interactions, and projection of ADMET (Absorption, Distribution, Metabolism, Excretion,

Toxicity) properties. Notably, rather than directly computing properties from a physics-based

understanding, several ML approaches aim to extract insights from existing data

(Wooller et al., 2017).

Bioinformatics methods hold the potential to identify the underlying causes of

cancer in individual patients, offering the opportunity to tailor cancer therapy to a more

personalized level. This avenue holds the promise of developing novel and repurposed

medications targeting specific proteins, thereby selectively eliminating or incapacitating only the

affected cells. Additionally, bioinformatics plays a pivotal role in the domain of translational

drug discovery for infectious diseases. For instance, distinct gene expression patterns are

triggered within cells due to the presence of bacterial or viral infections. By comparing these

genetic profiles with those associated with other disorders and influenced by pharmaceutical

interventions, there exists the potential to repurpose existing drugs (Wooller et al., 2017).

2.5.2 Types of Machine Learning

Fig 2.1 Types of Machine Learning

1. Supervised Learning:

Training an algorithm using labeled data to make accurate predictions or classifications on

new, unseen data.

2. Unsupervised Learning:

Finding patterns and relationships in data without labeled outcomes; often used for

clustering or dimensionality reduction.

3. Semi-Supervised Learning:

Combining labeled and unlabeled data to enhance model accuracy, useful when obtaining

fully labeled datasets is challenging.

4. Reinforcement Learning:

Training algorithms to make decisions in an environment to maximize rewards,

learning through experimentation and feedback.

2.5.3 Supervised Machine Learning

Fig 2.2 Supervised Machine learning workflow

According to Gillis (2023), supervised learning is a technique for constructing artificial intelligence (AI) that

entails training a computer system using labeled input data to predict a particular

output. The model is iteratively trained until it can discern the inherent relationships and patterns

connecting the input and output labels. This empowers the model to generate accurate

predictions when presented with previously unseen data. According to IBM (2023), supervised machine

learning distinguishes itself by the method it employs to train computers for accurately

classifying data or predicting outcomes using labeled datasets. The model adjusts its weights as

input data is fed into it, ensuring proper fitting during the cross-validation process. Applications

like segregating spam emails into separate folders from regular emails exemplify how supervised

learning assists enterprises in finding scalable solutions to various real-world issues.

In supervised learning, a training set is employed to guide models in generating desired

outcomes. This training dataset comprises both appropriate inputs and corresponding outputs,

enabling the model's progressive improvement. The loss function serves as a metric for the

algorithm's accuracy, and iterations are executed until the error is sufficiently minimized. In the

context of data mining, supervised learning can be categorized into two main types: regression

and classification.

• Classification: This approach employs algorithms to accurately categorize test data into

distinct classes. It identifies specific entities within the dataset and strives to determine

their appropriate labels. Commonly used classification techniques encompass decision

trees, k-nearest neighbors, random forests, support vector machines (SVMs), and linear

classifiers (IBM 2023).

• Regression: Regression is employed to understand the relationship between dependent

and independent variables. It is often utilized to generate estimates, such as predicting a

company's sales revenue. Well-known regression algorithms include linear regression,

logistic regression, and polynomial regression.
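
The idea that a supervised model adjusts its weights until the loss is sufficiently small can be made concrete with a short sketch. It is an illustration under stated assumptions and not a description of any particular library: the model is a single-feature line y = w * x + b, the loss is the mean squared error, and the weights are updated by plain gradient descent. The learning rate, tolerance, and toy data are arbitrary choices made for the example.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line y = w * x + b on the labeled data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, lr=0.05, tol=1e-6, max_iter=10_000):
    """Repeat gradient-descent updates until the loss falls below tol."""
    w, b = 0.0, 0.0
    n = len(xs)
    steps = 0
    for steps in range(1, max_iter + 1):
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
        if mse(w, b, xs, ys) < tol:
            break
    return w, b, steps


# Toy labeled data (assumption): outputs generated exactly by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b, steps = train(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}, iterations = {steps}")
```

The loop stops as soon as the loss drops below the tolerance, which mirrors the statement that iterations are executed until the error is sufficiently minimized.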

2.5.4 Regression Algorithm

When the output variable possesses a real or continuous value, regression is employed. In this

scenario, a change in one variable corresponds to a change in another, as there exists a

connection between two or more variables. For instance, examples include salary linked to

employment history or weight influenced by height.

According to MathWorks (2023), regression techniques are used to forecast continuous outcomes, like the values of financial assets or

challenging-to-quantify physical attributes such as battery state-of-charge or grid load. Typical

applications encompass algorithmic trading, virtual sensing, and electrical load prediction.

When dealing with a dataset spanning a range of values or if your response falls within the

domain of real numbers, such as temperature or the time until equipment failure, employ

regression methods. Presented below are some of the prevalent algorithms for conducting

regression.

Fig 2.3 Regression analysis image

2.6 Related Works

Yang et al. (2019) employed an extensive dataset spanning 16 countries and six

continents, encompassing tuberculosis patients. This dataset incorporated whole-genome MTB

isolate sequences and corresponding drug susceptibility testing outcomes. Their introduced

model, DeepAMR (Deep Autoencoder for Multiple Drug Resistance Classification), and its

clustering variant, DeepAMR_cluster, were aimed at multi-drug classification and latent data

space clustering, respectively. The study showcased DeepAMR's superiority over baseline and

other models, achieving impressive mean AUROC scores (94.4% to 98.7%) for predicting

resistance to four primary drugs, MDR-TB, and PANS-TB. DeepAMR excelled in sensitivity as

well, with best rates seen for isoniazid (94.3%), ethambutol (91.5%), pyrazinamide (87.3%), and

MDR-TB (96.3%). However, some cases showed slightly lower sensitivity compared to baseline,

such as rifampicin and PANS-TB, with 0.7% and 1.9% reduction, respectively. The study also

detailed cross-resistance patterns, notably between INH and RIF, and examined multi-label vs.

single-label models, with DeepAMR's success attributed to abstract data use and non-linear

reduction. Although predicting resistance for specific drugs (e.g., EMB and PZA) posed

challenges, DeepAMR demonstrated significant improvement for these cases.

The study by Nagamani and Sastry (2021) focused on addressing the challenges posed by drug-resistant strains of

Mycobacterium tuberculosis (M.tb) through the application of machine learning models and

computational drug repurposing. The authors highlight the urgent need for novel antitubercular

drugs due to the rapid evolution of drug-resistant M.tb strains. They point out that factors like

genetic mutations, the complex cell wall system of M.tb, and transporter systems contribute to

the ineffectiveness of many small molecules in arresting M.tb cell growth. The study's objective

was to overcome the permeability barriers of M.tb by developing machine learning models that

can distinguish between permeable and impermeable compounds. The authors utilized enzyme-

based (IC50) and cell-based (minimal inhibitory concentration) data to classify compounds based

on their permeability. The XGBoost machine learning model emerged as the top performer

compared to other algorithms such as random forest, support vector machine, and naive Bayes.

Deelder et al. (2019) employed a machine learning-driven methodology to address the

critical challenge of drug resistance prediction in Mycobacterium tuberculosis (M.tb). The

authors worked with an extensive dataset of 16,688 M.tb isolates, each having undergone whole-

genome sequencing (WGS) and laboratory drug-susceptibility testing (DST) across 14

antituberculosis drugs. A substantial portion of the samples demonstrated multidrug-resistant and

extensively drug-resistant profiles, underlining the significance of their study. The authors

employed advanced non-parametric classification-tree and gradient-boosted-tree models to

predict drug resistance and identify potential associated mutations. This approach allowed the

creation of separate models for each drug, and they also considered the influence of "co-

occurrent resistance" markers, known to cause resistance to drugs other than the one under

consideration. Predictive performance was meticulously evaluated using sensitivity, specificity,

and the area under the receiver operating characteristic curve, with DST outcomes serving as the

benchmark for evaluation. Notably, their models demonstrated particularly high accuracy in

predicting resistance to first-line drugs and several second-line drugs, with the area under the

receiver operating characteristic curve exceeding 96%. However, the performance was

comparatively lower for certain third-line drugs. The inclusion of co-occurrent resistance

markers notably enhanced the predictive capabilities for some drugs, leading to superior

outcomes compared to similar models used in other large-scale studies.
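As a minimal illustration of the three evaluation measures named above, the following Python sketch computes sensitivity, specificity, and AUROC for a single drug, taking the DST result as the reference label. It is not the authors' code: the function names and the eight-isolate example are hypothetical, and a real analysis would normally rely on a tested library implementation.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity); 1 = resistant by DST, 0 = susceptible."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auroc(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """AUROC as the share of (resistant, susceptible) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical DST labels and model scores for one drug (not data from the study).
dst = [1, 1, 1, 0, 0, 0, 0, 1]
score = [0.92, 0.81, 0.35, 0.10, 0.45, 0.05, 0.20, 0.77]
pred = [1 if s >= 0.5 else 0 for s in score]  # 0.5 is an assumed decision threshold

sens, spec = sensitivity_specificity(dst, pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUROC={auroc(dst, score):.4f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas AUROC summarises ranking quality across all thresholds, which is one reason studies of this kind tend to report all three.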

Hadikurniawati et al. (2021) compare the performance of ML models using 10-fold

cross-validation. Contrary to previous research that favored certain models over others, this study

finds that the best-performing model is data-specific. This conclusion aligns with observations

from other studies. Nevertheless, the research achieves better results compared to recent studies

and notes that certain methods, particularly Logistic Regression and MD-WDNN, exhibit similar

performance levels. Additional parameter tuning is conducted using the scikit-learn library in

Python, reinforcing the data-specific nature of model performance. They further underscore the

successful application of ML techniques to predict MTB drug resistance based on DNA data.
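The 10-fold cross-validation procedure used for such comparisons can be sketched in plain Python as follows. This is an illustrative sketch only: the two "models" are hypothetical stand-ins rather than the classifiers compared in the study, and the toy mutation-marker data are synthetic. In practice, scikit-learn's `cross_val_score` and `GridSearchCV` provide the same loop together with parameter tuning.

```python
import random
from typing import Callable, List, Sequence

Row = Sequence[int]
Model = Callable[[Row], int]
Trainer = Callable[[List[Row], List[int]], Model]


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(train: Trainer, X: List[Row], y: List[int], k: int = 10) -> float:
    """Mean held-out accuracy over k folds; each fold is used once for testing."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        X_tr = [X[i] for i in range(len(X)) if i not in held_out]
        y_tr = [y[i] for i in range(len(X)) if i not in held_out]
        model = train(X_tr, y_tr)
        scores.append(sum(model(X[i]) == y[i] for i in fold) / len(fold))
    return sum(scores) / len(scores)


def train_majority(X: List[Row], y: List[int]) -> Model:
    """Baseline: always predict the most common training label."""
    label = int(sum(y) * 2 >= len(y))
    return lambda row: label


def train_single_marker(X: List[Row], y: List[int]) -> Model:
    """Predict from the one binary marker that best matches the training labels."""
    def acc(j: int) -> float:
        return sum(row[j] == t for row, t in zip(X, y)) / len(y)
    best = max(range(len(X[0])), key=acc)
    return lambda row: row[best]


# Synthetic toy data (an assumption): 3 binary markers, marker 0 determines resistance.
rng = random.Random(1)
X = [[rng.randint(0, 1) for _ in range(3)] for _ in range(100)]
y = [row[0] for row in X]

for name, trainer in [("majority", train_majority), ("single-marker", train_single_marker)]:
    print(name, round(cross_val_accuracy(trainer, X, y, k=10), 3))
```

On data generated differently the ranking of the two stand-in models could change, which is the sense in which the best-performing model is data-specific.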

With an impressive accuracy rate of up to 99% and high Area Under Curve (AUC) values, the

ML approach holds promise in tuberculosis drug resistance prediction. The study emphasizes the

data-specific nature of model performance and highlights the potential for slight improvements

through parameter tuning. Overall, the research contributes to the growing body of knowledge on

employing ML for tuberculosis drug resistance prediction, showcasing its potential as a valuable

tool in medical research and diagnostics.

Ye et al. (2021) present a study focused

on addressing the challenge of drug-resistant tuberculosis (TB) caused by Mycobacterium

tuberculosis (Mtb), a leading global cause of mortality. The emergence of extensively drug-

resistant TB has underscored the necessity for novel drug candidates. This study leverages

various machine learning (ML) algorithms, including support vector machine, random forest

(RF), extreme gradient boosting (XGBoost), and deep neural networks (DNN), to construct

classification models that distinguish Mtb inhibitors from non-inhibitors. The outcomes reveal

that the XGBoost model displays the most robust predictive performance. To enhance accuracy

further, two consensus strategies are employed by integrating predictions from multiple models.

The stacking model that combines predictions from RF, XGBoost, and DNN offers the highest

accuracy, with an area under the receiver operating characteristic curve (AUC) of 0.842 for the

10-fold cross-validated training set and 0.942 for the external test set. The authors also explore

the relationship between important molecular descriptors and bioactivities using the Shapley

additive explanations method.

Radchenko et al. (2023) establish a well-structured

foundation for their modeling approach, emphasizing the significance of a diverse dataset. They

draw upon publicly available data to create a dataset containing both target-based and cell-based

assay results. Their preprocessing methods and dataset preparation are thorough, reflecting the

complexities and challenges of working with large, heterogeneous datasets. The utilization of

fragmental descriptors and neural networks for modeling is well-justified, considering their prior

success in other QSAR and QSPR applications. The architecture of the neural network, with its

integration of feed-forward back-propagation and double cross-validation, demonstrates careful

design for robustness and validation. The presentation of their modeling process is detailed and

demonstrates a systematic exploration of hyperparameters, leading to a refined model with

enhanced predictivity. Comparisons with other models in the literature add to the paper's

credibility.

Deelder et al. (2022) address the growing concern of drug-resistant

Mycobacterium tuberculosis complicating the treatment and control of tuberculosis. They

emphasize the importance of incorporating whole genome sequencing and machine learning

techniques to predict drug resistance and identify the associated genetic mutations in M.

tuberculosis. However, they highlight the limitations of applying generic machine-learning

approaches without tailoring them to the specific context of tuberculosis. To address these

challenges, the authors introduce a novel machine-learning approach, Treesist-TB, designed

specifically for tuberculosis. This approach focuses on extracting and analyzing genomic variants

across multiple studies to enhance genotypic profiling. The authors applied Treesist-TB to

predict drug resistance for well-known drugs like rifampicin, isoniazid, and ethambutol,

achieving predictive accuracy comparable to existing tools like TB-Profiler.

Hrizi et al. (2022) apply their techniques to computed tomography (CT) scans of TB patients, with a

division into training and testing sets. Feature extraction using the spatial gray-level dependence

method (SGLDM) is conducted. The results are presented in terms of hyper-parameter and

feature selection. The study employs Python, utilizing an RTX 2060 Graphics Card and 16 GB of

RAM. The ImageCLEF 2020 dataset is used, employing multi-label classification for lung

conditions. The metric of interest is accuracy. Experiments involve a range of machine learning

methods, with scikit-learn as the toolkit for comparison. SVM hyper-parameter selection employs a

genetic algorithm, focusing on radial basis function (RBF) kernel performance. Performance is

compared with known classifiers like KNN, CART, NB, LDA, and RF. Focusing on tuberculosis

(TB), the study proposes an optimized machine learning-based model that extracts optimal

texture features from TB-related images and simultaneously fine-tunes classifier hyper-

parameters. The overarching objectives are to improve accuracy and reduce the number of

extracted characteristics, framed as a multitask optimization challenge. The proposed approach

involves a genetic algorithm (GA) for feature selection followed by a support vector machine

(SVM) classifier. Experimental results on the ImageCLEF 2020 dataset demonstrate

improved accuracy, with the enhanced approach outperforming state-of-the-art methods.
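The idea of evolving a feature subset with a genetic algorithm can be sketched as below. This is a generic GA written for illustration, not the implementation used in the study: the operators (truncation selection, one-point crossover, bit-flip mutation, elitism), all parameter values, and the toy fitness function are assumptions. In the setting described above, the fitness of a mask would instead be the accuracy of an SVM trained on the selected texture features.

```python
import random
from typing import Callable, List, Tuple

Mask = List[int]  # 1 = keep the feature, 0 = drop it


def ga_select(n_features: int, fitness: Callable[[Mask], float],
              pop_size: int = 20, generations: int = 30,
              mutation_rate: float = 0.1, seed: int = 0) -> Tuple[Mask, List[float]]:
    """Evolve binary feature masks; return the best mask and the best score per generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    history: List[float] = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]          # truncation selection
        children = [pop[0][:]]                  # elitism: keep the best mask unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy fitness (an assumption): features 0 and 3 are informative, and every
# selected feature costs 0.1, so small masks containing both score highest.
INFORMATIVE = {0, 3}


def toy_fitness(mask: Mask) -> float:
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


best_mask, history = ga_select(8, toy_fitness)
print(best_mask, history[-1])
```

The size penalty in the toy fitness mirrors the two stated objectives, higher accuracy and fewer extracted features, by folding them into a single score; other ways of combining the two objectives are equally possible.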

Kuang et al. (2022) concisely introduce the motivation for their research, its novel deep learning

approach, and the cohort used for AMR prediction. The results section effectively presents the

data analysis, feature selection, training, and validation processes. The comparison of model

performance with a rule-based method provides clear insights. They further delve into the results,

emphasizing the substantial increase in F1-score achieved by the best ML classifiers compared to

the rule-based Mykrobe predictor. The performance of the 1D CNN model is slightly superior to

traditional ML algorithms, despite its higher computational resource requirements during

training. The impact of feature selection on reducing resource demands is noted. The potential

for hyperparameter optimization and the inclusion of novel variants for improved model

performance is discussed. The importance of managing imbalanced classes and the consideration

of sensitivity and specificity in clinical settings is acknowledged. The potential extension of the

model to bacteria with plasmid-mediated resistance is examined, along with the importance of

diverse datasets in managing overfitting. The study's reliance on the F1-score metric is discussed,

along with the introduction of the G-mean metric. Automation of the entire process into a

flexible pipeline is highlighted, enabling easy adaptation and expansion of the models for other

antibiotics and bacteria. The overall focus on accurate AMR prediction is emphasized.

Jamal et al. (2020) present a computational framework that utilizes artificial intelligence (AI) and

machine learning (ML) methods to predict multi-drug resistance associated mutations in

Mycobacterium tuberculosis (M.tb) using high-throughput sequencing data. The authors focus on

specific genes related to drug resistance and utilize various ML algorithms to build prediction

models. The study includes dataset preparation, model evaluation, and the impact analysis of

predicted mutations on protein stability. They indicate the successful development of prediction

models for several genes associated with drug resistance, including rpoB, inhA, katG, pncA,

gyrA, and gyrB. The models exhibit good accuracy in predicting the susceptibility or resistance

of mutations, achieving approximately 70% accuracy on average in the training dataset. The

authors evaluate the models using non-redundant testing data, showcasing accuracy ranging from

66.66% to 100%. Performance varies among different genes, with artificial neural network

(ANN) models generally performing the best. Furthermore, the authors highlight the significance

of their approach in predicting drug resistance and classifying mutations. They emphasize the

importance of various features, such as changes in amino acid properties and stability

calculations, in accurately predicting mutation effects. The potential utility of the models for

clinical applications and the prediction of novel mutations is underscored.

2.7 Summary of the related works

| S/N | Author Name & Year | Topic | Techniques Used | Result | Limitation |
|---|---|---|---|---|---|
| 1. | Radchenko et al. (2023) | Machine Learning Prediction of Mycobacterial Cell Wall Permeability of Drugs and Drug-like Compounds | Artificial neural network (ANN) | The outcome indicates that ANN provides better accuracy (cross-validated balanced accuracy 0.768, sensitivity 0.768, specificity 0.769, area under ROC curve 0.911). | Some models failed to recognize many penetrating compounds because of dataset bias caused by imbalanced data, resulting in many false negatives. |
| 2. | Kuang et al. (2022) | Accurate and rapid prediction of tuberculosis drug resistance from genome sequence data using traditional machine learning algorithms and CNN | Logistic regression, Random forest and 1D CNN | In terms of F1-scores (81.1 to 93.8%, 93.7 to 96.2%, 93.1 to 94.8%, 95.9 to 97.2%, and 97.1 to 98.2% for ethambutol, rifampicin, pyrazinamide, isoniazid, and ofloxacin, respectively), 1D CNN models outperformed LR and RF. CNN had the highest accuracy (ranging from 90.0% to 96.2%). | 1). Data imbalance, with more susceptible isolates. 2). Due to the study's lack of hyperparameter optimization, the 1D CNN architecture performed less well than traditional ML methods (LR and RF). |
| 3. | Hrizi et al. (2022) | Tuberculosis Disease Diagnosis Based on an Optimized Machine Learning Model | Support vector machine (SVM), K-Nearest Neighbors (KNN), Classification And Regression Tree (CART), Naïve Bayes (NB), Linear Discriminant Analysis (LDA), and Random forest (RF). | The outcome of comparing six machine learning algorithms (SVM, KNN, CART, NB, LDA, and RF) showed that the SVM classifier (0.84) was more accurate than the other classification algorithms, while KNN (0.82), LDA (0.82) and RF (0.81) performed better than CART (0.73) and NB (0.67). | The dataset contains some irrelevant characteristics that increase the likelihood that the learning models will be overfit, complex, and challenging to understand, leading to low efficiency and poor performance. |
| 4. | Deelder et al. (2022) | A modified decision tree approach to improve the prediction and mutation discovery for drug resistance in Mycobacterium tuberculosis | Decision tree | The predictive accuracy of resistance from Treesist-TB was comparable to that of the TB-Profiler tool (RIF 97.5% vs. 97.6%; INH 96.8% vs. 96.5%; EMB 96.8% vs. 95.8%). | N/A |
| 5. | Ye et al. (2021) | Identification of active molecules against Mycobacterium tuberculosis through machine learning | Support vector machine (SVM), Random forest (RF), Extreme gradient boosting (XGBoost) and Deep neural networks (DNN) | The XGBoost model based on MorganFP and RDKitFP performs the best with AUC = 0.832. Overall, all the models perform well, with the AUC values all higher than 0.91. The stacking model outperforms the other four individual models, with an average AUC = 0.935 and ACC = 0.878 for the scaffold test set. | 1). The training dataset is unbalanced. 2). For each scaffold, the inhibitors and noninhibitors are imbalanced. |
| 6. | Nagamani and Sastry (2021) | Mycobacterium tuberculosis Cell Wall Permeability Model Generation Using Chemoinformatics and Machine Learning Approaches | XGBoost, Random forest (RF), Support vector machine (SVM), and Naïve Bayes (NB) | The accuracy values shown for random forest (RF), gradient boosting model (GBM), classification and regression model (CART), Glmnet, support vector machine (SVM), k-nearest neighbors (KNN), naive Bayes (NB), and logistic regression were 0.946, 0.939, 0.851, 0.927, 0.925, 0.925, 0.864, and 0.490, respectively. | N/A |
| 7. | Hadikurniawati et al. (2021) | Predicting tuberculosis drug resistance using machine learning based on DNA sequencing data | C4.5, Random Forest, and Logitboost. | With an average AUC of 0.979, MD-WDNN and logistic regression showed the best performance. | N/A |
| 8. | Jamal et al. (2020) | Artificial Intelligence and Machine learning based prediction of resistant and susceptible mutations in Mycobacterium tuberculosis | Naïve Bayes (NB), K-nearest neighbor (KNN), Support vector machine (SVM) and Artificial neural network (ANN) | Four ML algorithms, NB, kNN, SVM, and ANN, were used to create learnt model systems for genes linked with the first-line TB medicines rifampicin (rpoB), isoniazid (katG and inhA), pyrazinamide (pncA), and fluoroquinolones (gyrA and gyrB). The models were extremely accurate, with average accuracies of 88.86%, 85.22%, 88.0%, 87.30%, 78.88%, and 86.88% for rpoB, inhA, katG, pncA, gyrA, and gyrB, respectively. | N/A |
| 9. | Yang et al. (2019) | DeepAMR for predicting co-occurrent resistance of Mycobacterium tuberculosis | DeepAMR, Random forest (RF), Support vector machine (SVM), multi-label K-nearest neighbours (MLKNN) and Ensemble classification chains (ECC) | DeepAMR outperformed the baseline model and four machine learning models in predicting resistance to four first-line medicines, INH, EMB, PZA, and MDR-TB, with AUROCs of 97.7%, 96.8%, and 94.4%, respectively. The SVM has sensitivity of 92.6%, 85.6%, and 78.6%, with AUROC of 96.4%, 92.1%, and 89.5%, respectively. ECC outperformed MLKNN, with specificities of 99.0% and 96.3% for INH and EMB, respectively, and F1 scores of 78.2% and 72.7% for EMB and PZA. | 1). Each label's class is unbalanced, as are the labels' co-occurrence rates among various drugs. 2). This study only considered cross-resistance between four first-line medications, ignoring that of second-line medications because (i) inaccurate phenotyping for second-line medications would introduce significant error in the classification for first-line medications, and (ii) a small number of resistant isolates would easily lead to over-fitting for such a complex model. 3). The permutation feature is unable to distinguish between feature relationships. |
| 10. | Deelder et al. (2019) | Machine Learning Predicts Accurately Mycobacterium tuberculosis Drug Resistance From Whole Genome Sequencing Data | Non-parametric classification-tree and gradient-boosted-tree | Overall, the performance of the gradient-boosted tree models was superior to that of the classification tree models. In comparison to EMB (82.8%) and PZA (69.7%), RIF (88.8%) and INH (91.1%) had stronger GBT-CRM sensitivity. CIP (85.7%), OFL (81.0%), and MOX (53.3%) had the highest fluoroquinolone sensitivity. The injectables with the highest sensitivity were KAN (82.2%), AMK (80.5%), and CAP (74.6%). | There was insufficient phenotypic data to include newly developed and repurposed medicines such as bedaquiline, delamanid, and linezolid, as well as XDR-TB. |

References

Applying Bioinformatics in Clinical Drug Discovery. (n.d.). Retrieved August 24, 2023, from

https://www.longdom.org/open-access/applying-bioinformatics-in-clinical-drug-

discovery.pdf

Bioinformatics | PNNL. (n.d.). Retrieved August 24, 2023, from https://www.pnnl.gov/explainer-

articles/bioinformatics

Deelder, W., Christakoudi, S., Phelan, J., Benavente, E. D., Campino, S., McNerney, R., Palla,

L., & Clark, T. G. (2019). Machine learning predicts accurately Mycobacterium

tuberculosis drug resistance from whole genome sequencing data. Frontiers in Genetics,

10(SEP). https://doi.org/10.3389/fgene.2019.00922

Deelder, W., Napier, G., Campino, S., Palla, L., Phelan, J., & Clark, T. G. (2022). A modified

decision tree approach to improve the prediction and mutation discovery for drug

resistance in Mycobacterium tuberculosis. BMC Genomics, 23(1).

https://doi.org/10.1186/s12864-022-08291-4

Hadikurniawati, W., Anwar, M. T., Marlina, D., & Kusumo, H. (2021). Predicting tuberculosis

drug resistance using machine learning based on DNA sequencing data. Journal of

Physics: Conference Series, 1869(1). https://doi.org/10.1088/1742-6596/1869/1/012093

Hrizi, O., Gasmi, K., ben Ltaifa, I., Alshammari, H., Karamti, H., Krichen, M., ben Ammar, L.,

& Mahmood, M. A. (2022). Tuberculosis Disease Diagnosis Based on an Optimized

Machine Learning Model. Journal of Healthcare Engineering, 2022.

https://doi.org/10.1155/2022/8950243

Jamal, S., Khubaib, M., Gangwar, R., Grover, S., Grover, A., & Hasnain, S. E. (2020). Artificial

Intelligence and Machine learning based prediction of resistant and susceptible mutations

in Mycobacterium tuberculosis. Scientific Reports, 10(1). https://doi.org/10.1038/s41598-

020-62368-2

Kuang, X., Wang, F., Hernandez, K. M., Zhang, Z., & Grossman, R. L. (2022). Accurate and

rapid prediction of tuberculosis drug resistance from genome sequence data using

traditional machine learning algorithms and CNN. Scientific Reports, 12(1).

https://doi.org/10.1038/s41598-022-06449-4

Nagamani, S., & Sastry, G. N. (2021). Mycobacterium tuberculosis cell wall permeability model

generation using chemoinformatics and machine learning approaches. ACS Omega,

6(27), 17472–17482. https://doi.org/10.1021/acsomega.1c01865

Radchenko, E. V., Antonyan, G. V., Ignatov, S. K., & Palyulin, V. A. (2023). Machine Learning

Prediction of Mycobacterial Cell Wall Permeability of Drugs and Drug-like Compounds.

Molecules, 28(2). https://doi.org/10.3390/molecules28020633

Romano, J. D., & Tatonetti, N. P. (2019). Informatics and computational methods in natural

product drug discovery: A review and perspectives. Frontiers in Genetics, 10(APR),

442506. https://doi.org/10.3389/fgene.2019.00368

Tuberculosis (TB) | Cedars-Sinai. (n.d.). Retrieved August 24, 2023, from https://www.cedars-

sinai.org/health-library/diseases-and-conditions/t/tuberculosis-tb.html

Tuberculosis (TB): Symptoms, treatment, diagnosis, and more. (n.d.). Retrieved August 24,

2023, from https://www.medicalnewstoday.com/articles/8856#causes

Tuberculosis. (n.d.). Retrieved August 24, 2023, from https://www.who.int/news-room/fact-

sheets/detail/tuberculosis

Tuberculosis: Causes, Symptoms, Diagnosis & Treatment. (n.d.). Retrieved August 24, 2023,

from https://my.clevelandclinic.org/health/diseases/11301-tuberculosis

What is bioinformatics, and why is it important? (n.d.). Retrieved August 24, 2023, from

https://bioinformaticshome.com/blog/What_is_bioinformatics_why_%20important.html

What is bioinformatics? | Bioinformatics for the terrified. (n.d.). Retrieved August 24, 2023,

from https://www.ebi.ac.uk/training/online/courses/bioinformatics-terrified/what-

bioinformatics/

What Is Tuberculosis? Symptoms, Causes, Diagnosis, Treatment, and Prevention. (n.d.).

Retrieved August 24, 2023, from https://www.everydayhealth.com/tuberculosis/guide/

Yang, Y., Walker, T. M., Walker, A. S., Wilson, D. J., Peto, T. E. A., Crook, D. W., Shamout,

F., Zhu, T., Clifton, D. A., Arandjelovic, I., Comas, I., Farhat, M. R., Gao, Q.,

Sintchenko, V., Soolingen, D., Hoosdally, S., Cruz, A. L. G., Carter, J., Grazian, C., …

de Oliveira, R. S. (2019). DeepAMR for predicting co-occurrent resistance of

Mycobacterium tuberculosis. Bioinformatics, 35(18), 3240–3249.

https://doi.org/10.1093/bioinformatics/btz067

Ye, Q., Chai, X., Jiang, D., Yang, L., Shen, C., Zhang, X., Li, D., Cao, D., & Hou, T. (2021).

Identification of active molecules against Mycobacterium tuberculosis through machine

learning. Briefings in Bioinformatics, 22(5). https://doi.org/10.1093/bib/bbab068
