RMIT International University Vietnam

Subject Code COSC2753


Subject Name Machine Learning
Campus RMIT SGS
Title of Assignment Individual Assessment 1
Student Name & ID Nguyen Xuan Huy – s3877913
Nguyen Thien Bao
Lecturer Name Tran Nhat Quang
Assignment Due Date April 14th, 2022
Submission Date April 14th, 2022
Word Count
Page Number

"I declare that in submitting all work for this assessment I have read,
understood, and agree to the content and expectations of the Assessment
Declaration."

I. Exploratory data analysis (EDA):
1. Setting up the evaluation framework:
a. Understanding data:
Understanding the data is essential for establishing an evaluation
framework in machine learning, as it helps in selecting appropriate performance
metrics, setting up validation techniques, and addressing class imbalances or
biases (Dietterich, 1998). Thorough data comprehension supports robust model
evaluation and improves generalization performance in real-world
applications. For this project, the features Age, PR, and PRG appear relevant to
the Sepsis outcome because old age, low plasma glucose, and low blood
pressure may increase vulnerability to developing diseases and
symptoms (Maggio et al., 2013).
b. Categorizing and data inspection:

Figure 1. Data inspection

Figure 1 gives an overview of each variable's range and overall
distribution through its mean, minimum, and maximum. A data distribution
describes the frequency and patterns of data points in a dataset, and
understanding it is crucial in machine learning (Bengio et al., 2012). The range
is the difference between the highest and lowest values, while the
overall distribution provides insight into the data's shape, central tendency,
and spread. Comparing the mean with the median also indicates skewness: a
mean above the median suggests a right-skewed variable. For instance, the
average plasma glucose level is 120.15, with a standard deviation of 32.68.
The minimum value is 0, and the maximum is 198. This could suggest a wide
range of glucose levels in the dataset.
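The mean-versus-median check described above can be sketched as follows. This is a minimal illustration using only the Python standard library. The function name `skew_summary` and the sample values are hypothetical and are not taken from the project notebook or dataset. Pearson's second skewness coefficient is used here as one simple way to express the gap between mean and median.

```python
import statistics


def skew_summary(values):
    """Return mean, median, and Pearson's second skewness coefficient."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    # Positive when the mean lies above the median (right-skewed data)
    skew = 3 * (mean - median) / stdev
    return {"mean": mean, "median": median, "skew": skew}


# Hypothetical values, invented for illustration only
right_skewed = [1, 1, 2, 2, 3, 3, 4, 10, 17]
summary = skew_summary(right_skewed)
print(summary)
```

A positive `skew` value points to a long right tail, which is the pattern the report later describes for variables such as PRG and Age.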
Additionally, the Sepsis mean value is 0.35, which indicates that only
35% of the outcomes are positive and the remaining 65% are negative, so the
two classes are not balanced. Moreover, every column has the same count,
indicating that the dataset has no empty values. However, some columns
contain 0 values, which suggests that data is missing or was recorded
incorrectly. Identifying and handling these values before training can
improve the accuracy of the machine learning model.
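As a rough sketch of this inspection step, the snippet below counts zero readings per column and computes the share of positive outcomes. The rows are invented for illustration and only reuse the report's column names. Treating a zero in PL or PR as a missing reading is an assumption made here, not something the dataset documentation is quoted as saying.

```python
# Hypothetical rows that reuse the report's column names; values are invented
rows = [
    {"PRG": 6, "PL": 148, "PR": 72, "Sepsis": 1},
    {"PRG": 1, "PL": 85, "PR": 0, "Sepsis": 0},
    {"PRG": 8, "PL": 0, "PR": 64, "Sepsis": 1},
    {"PRG": 1, "PL": 89, "PR": 66, "Sepsis": 0},
]


def zero_counts(rows, columns):
    """Count how many rows hold a 0 in each of the given columns."""
    return {col: sum(1 for row in rows if row[col] == 0) for col in columns}


def positive_rate(rows, target="Sepsis"):
    """Share of positive outcomes; equals the mean of a 0/1 target column."""
    return sum(row[target] for row in rows) / len(rows)


# Assumption: a zero in these columns is a missing reading, not a real value
suspect_zeros = zero_counts(rows, ["PL", "PR"])
rate = positive_rate(rows)
print(suspect_zeros, rate)
```

Because the target is coded as 0 and 1, its mean is the positive share, which is why a Sepsis mean of 0.35 reads directly as 35% positive cases.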
c. Box plot and histogram chart:

Figure 2. Histogram and box plot chart
Histograms and box plots help identify data skewness, outliers, and
multimodal distributions, and can also guide the choice of data
normalization or transformation required for successful machine learning
(James et al., 2013). In Figure 2, the PRG, SK, TS, BD2, and Age values are
concentrated on the left side, which indicates positive skewness. A
transformation should therefore be considered to reduce this positive
skewness. On the other hand, PL, PR, and M11 appear to be more
symmetrically distributed, which tends to suit machine learning models
better. The box plots show the outliers and support the skewness
evaluation. In Figure 2, the white circles mark the outliers of each
variable. Identifying the outliers can improve the accuracy and reliability
of the results because extreme values can affect the performance of machine
learning models.
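The two ideas above, flagging outliers and reducing positive skew, can be illustrated with a short sketch. It assumes the common box-plot convention of marking points more than 1.5 times the interquartile range beyond the quartiles. The `log1p` transform is one possible choice, since the report does not state which transformation was applied. The `ages` values are hypothetical.

```python
import math
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical, right-skewed sample (not taken from the project dataset)
ages = [21, 22, 22, 23, 24, 25, 26, 28, 30, 33, 81]

outliers = iqr_outliers(ages)

# log1p compresses large values more than small ones, pulling in the tail
log_ages = [math.log1p(v) for v in ages]
print(outliers, max(log_ages) - min(log_ages))
```

Whether to drop, cap, or keep flagged points is a separate modelling decision. The sketch only identifies them.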
d. Correlation heat map:

Figure 3. Correlation heat map and data


Correlation maps play a critical role in machine learning as they identify
relationships between variables and patterns that may not be apparent from
the individual variable analysis. This helps in feature selection and engineering,
improving model performance (Tashman, 2000). In Figure 3, PL, M11, Age, PRG,
and BD2 show the strongest correlations with the Sepsis variable. A high
correlation between two variables indicates a strong linear relationship. If
the correlation coefficient is close to +1, an increase in one variable is
associated with an increase in the other. Therefore, based on the
correlation map, it is reasonable to exclude PR and SK, which correlate
least with Sepsis, from the machine learning model to support the accuracy
of the result.
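The coefficient behind each cell of a correlation heat map can be sketched as below. This is a plain-Python version of the Pearson correlation, shown only to make the arithmetic visible. The feature values are hypothetical, and ranking features by absolute correlation with the target is one simple selection heuristic, not necessarily the exact procedure used in the notebook.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values that reuse the report's column names
sepsis = [0, 0, 0, 1, 1, 1]
features = {
    "PL": [85, 90, 100, 140, 150, 160],
    "PR": [70, 64, 72, 66, 74, 68],
}

# Rank features by the strength of their linear relationship with the target
ranked = sorted(
    ((name, pearson_r(values, sepsis)) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
print(ranked)
```

Note that Pearson correlation only captures linear association, so a weakly correlated feature may still carry non-linear information.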

II. Selecting models:


1. Machine learning model:
Because the requirement is to predict the Sepsis result, defined as positive (1) or negative (0),
the task is a binary classification problem. Using the selected features, the project
constructs two different types of classification model and compares them.
a. Logistic regression:
Logistic regression is well suited to predicting symptoms due to its
simplicity, interpretability, and effectiveness. As Hosmer et al. note,
logistic regression is a widely used binary classification model. It assumes
a linear relationship between the features and the log-odds of the outcome.
Non-linear effects and interactions between features can still be modelled
by adding transformed or interaction terms, which makes it a versatile tool
for predicting the presence or absence of symptoms. Logistic regression is a
reliable and well-established method for predicting symptoms in clinical
practice (Hosmer et al., 2013).
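To make the model's form concrete, the sketch below shows how a fitted logistic regression turns feature values into a probability. The weights, bias, and feature values are hypothetical placeholders, not coefficients from the project. In practice they would be estimated from the training data.

```python
import math


def predict_proba(features, weights, bias):
    """Probability of the positive class under a logistic regression model."""
    # The linear score is the log-odds of the positive class
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def predict_label(features, weights, bias, threshold=0.5):
    """Classify as positive (1) when the probability reaches the threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0


# Hypothetical coefficients and a hypothetical, already scaled patient record
weights = [0.8, -0.3]
bias = -0.2
patient = [1.5, 0.5]

probability = predict_proba(patient, weights, bias)
label = predict_label(patient, weights, bias)
print(probability, label)
```

Each weight can be read as the change in log-odds per unit change in its feature, which is the source of the interpretability mentioned above.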
b. Decision Tree:
Decision trees are popular for predicting symptoms due to their ability
to handle non-linear relationships, ease of interpretability, and flexibility for
handling categorical and continuous features. Breiman claimed that decision
trees are a powerful tool for classification tasks involving complex interactions
between features, making them ideal for symptom prediction. Moreover,
decision trees can effectively handle missing data and outliers, making them
robust to real-world data challenges. The interpretability of decision trees also
makes them useful in medical settings, as they can provide clinicians with
valuable insights into the relationships between features and symptoms.
Decision trees are a reliable and well-established method for predicting
symptoms in clinical practice (Breiman et al., 1984).
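How a tree handles non-linear relationships comes down to repeatedly choosing threshold splits. The sketch below shows a single split chosen by Gini impurity, one commonly used criterion for CART-style trees. It is an illustration for one numeric feature with binary labels, and the values are hypothetical. A full tree would apply this search recursively across all features.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p**2 - (1.0 - p) ** 2


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best single split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical feature values and outcomes
pl_values = [85, 90, 100, 140, 150, 160]
outcomes = [0, 0, 0, 1, 1, 1]
split = best_threshold(pl_values, outcomes)
print(split)
```

The resulting rule, a feature compared against a threshold, is what makes a fitted tree readable as a sequence of simple decisions.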
2. Performance evaluation:
Using classification reports, confusion matrices, and accuracy scores is critical
in comparing the performance of logistic regression and decision tree machine
learning models. These metrics comprehensively evaluate the model's performance,
allowing for a detailed analysis of its strengths and weaknesses.
The classification report provides valuable insights into the models' precision,
recall, and f1-score, allowing for a comparison of their overall performance.
Additionally, the confusion matrix helps highlight the number of true positives, true
negatives, false positives, and false negatives, enabling a comparison of their ability to
predict positive and negative values accurately. Finally, the accuracy score provides an
overall assessment of the model's performance, helping to determine which model
provides more reliable predictions.
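The metrics described above all derive from the four confusion-matrix counts. The following sketch computes them by hand to show the relationships. The label lists are hypothetical, and the helper names `confusion_counts` and `metrics` are illustrative rather than taken from the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
scores = metrics(y_true, y_pred)
print(counts, scores)
```

With a 35% positive rate, accuracy alone can look acceptable even when many positive cases are missed, so recall and precision for the positive class are worth reading alongside it.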
Taken together, these metrics provide the basis for deciding which of the
two models, logistic regression or decision tree, to use for this task.
3. Choosing Model:
Based on the results in Figure 4, Logistic Regression was chosen for
developing the prediction model. Its performance was higher than that of the
Decision Tree, as indicated by the accuracy, precision, and confusion
matrix.

Figure 4. Performance result

Ultimate Judgment & Analysis: You must ultimately judge the “best” model you
would use and recommend in a real-world setting for this problem. It is up to you to
determine the criteria by which you evaluate your model and determine what it means
to be “the best model.” You need to provide evidence to support your ultimate
judgment and discuss the limitations of your approach/ultimate
model, if there are any, in the notebook as Markdown text.

Performance on the test set (Unseen data): You must use the model chosen in your
ultimate judgment to predict the target for unseen testing data (provided in test
data.csv). Your ultimate prediction will be evaluated, and the performance of all of
the ultimate judgments will be published.

REFERENCES
Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised
classification learning algorithms. Neural Computation, 10(7), 1895-1923.
Maggio, M., Guralnik, J. M., Longo, D. L., & Ferrucci, L. (2013). Interleukin-6 in aging
and chronic disease: A magnificent pathway. Journals of Gerontology Series A:
Biomedical Sciences and Medical Sciences, 61(6), 575-584.
Bengio, Y., Goodfellow, I. J., & Courville, A. (2012). Deep learning. Nature, 7553(1), 35-
65.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical
learning (Vol. 112). New York: Springer.
Tashman, L. J. (2000). Out-of-sample tests of forecasting accuracy: an analysis and
review. International Journal of Forecasting, 16(4), 437-450.
Hosmer, D.W., Lemeshow, S., and Sturdivant, R.X. (2013). Applied Logistic Regression,
3rd ed. Hoboken, NJ: Wiley.
