Assignment 1:
Intro to Machine Learning
Student details:
Nguyen Minh Tri (s3726096)
Course coordinator: Dr. Duy Dang-Pham
II. Definitions:
This section defines the terminology and techniques that relate to, or are used in, the
process of determining the most appropriate model:
- Independent variable: A variable representing a quantity that cannot be predicted
from the other variables. In more detail, an independent variable is ideally unrelated
(orthogonal) to all other variables in the dataset and can be used as an input for
machine learning algorithms. [1]
- Dependent variable: In contrast, a dependent variable can be calculated from the other
variables; it normally refers to the label (the target output value). [1]
- Correlation: This is a statistical value representing how strongly a pair of variables
is related. It ranges from -1.0 to +1.0. If the correlation is close to 0, either
there is no relationship between the variables or the relationship is not linear.
If the value is positive, higher-than-average values of one variable
tend to be paired with higher-than-average values of the other variable. If the correlation
is negative, higher-than-average values of one variable tend to be
paired with lower-than-average values of the other variable. [2]
- Min-max Scaling: This is a data pre-processing technique that shifts and rescales
values so that they end up ranging from 0 to 1. It does this by subtracting the min
value and dividing by the max minus the min. To implement this in Python, Scikit-Learn
provides a transformer called MinMaxScaler. [3]
- Training dataset: the data used for training the hypothesis [4]
- Validation dataset: the data used for “testing” and tuning hyperparameters of a
Machine Learning Algorithm [4]
- Testing dataset: the data used for evaluating and comparing final hypotheses [4]
- K-Fold Cross-validation: In this process, the data is divided into k partitions. In the first
iteration, one partition is assigned as the validation set and the other k-1 partitions form
the training set. We then evaluate the model and repeat this procedure, each time using a
different partition as the validation set. The final result is the average of all iterations’
results. [4] This process is automated by “cross_val_score”, provided by Scikit-Learn.
- Linear Regression: It is a linear approach to modelling the relationship between one
dependent variable and one or more independent variables. In more detail, its
hypothesis is a linear equation. [5]
- Polynomial Regression: It is a form of Linear Regression in which the relationship
between the target value and the independent variables is modeled as an nth degree
polynomial [6]. In other words, its hypothesis is a polynomial equation.
- Regularization: This is a technique that constrains or shrinks the coefficient
estimates towards zero in order to discourage learning an overly complex model and to avoid
overfitting. [7]
- Ridge Regression: Because regression models tend to overfit the training
dataset, Ridge Regression balances bias and variance by
adding a penalty (regularization) term to the loss function, which is shown below.
The regularization strength λ can be tuned to find the most generalizable set of coefficient
estimates. [1]
$$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta\left(x^{(i)}\right) - y^{(i)} \right)^2 + \lambda \sum_{v=1}^{m} \theta_v^2 \quad [1]$$
- Root Mean Squared Error (RMSE): This value measures the standard deviation of
the prediction errors. In other words, it measures how concentrated the data is
around the line of best fit. [8]
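The definitions of Min-max Scaling, K-Fold Cross-validation, the Ridge loss function and RMSE can be illustrated with a minimal NumPy sketch. The function names below are hypothetical helpers written for illustration only, not part of any library, and the linear hypothesis is assumed to be h(x) = x · θ with no separate intercept term.

```python
import numpy as np


def min_max_scale(column):
    """Rescale a 1-D array to the range [0, 1]: (x - min) / (max - min)."""
    column = np.asarray(column, dtype=float)
    span = column.max() - column.min()
    if span == 0:
        # A constant column cannot be rescaled; map it to zeros (assumption).
        return np.zeros_like(column)
    return (column - column.min()) / span


def rmse(y_true, y_pred):
    """Root Mean Squared Error between the targets and the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def ridge_cost(theta, X, y, lam):
    """J(theta) = (1/n) * sum((X @ theta - y)^2) + lam * sum(theta^2)."""
    theta = np.asarray(theta, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = X @ theta - y
    return float(np.mean(residuals ** 2) + lam * np.sum(theta ** 2))


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx), using each of the k partitions once for validation."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]
```

In practice, MinMaxScaler and cross_val_score from Scikit-Learn do the same work; the sketch only makes the arithmetic behind each definition explicit.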
III. Methodologies:
- Initially, by using the head() function in the pandas library, we can observe that the
independent attributes in the data frame have different units and ranges. Moreover, after
plotting the histogram of each column, it is clear that most of them have
skewed distributions. Therefore, a pre-processing technique called Min-Max Scaling
is applied instead of Standardization. However, this method may be
affected by the large number of outliers, which were found using box plots.
- A brute-force methodology is applied to determine the most appropriate model
for the given “Global Health Observatory data repository” dataset: various
Regression models are trained and their results are compared. The
Regression models tried include Linear Regression (with and without data scaling
and regularization), Polynomial Regression of degree 2, 3 and 4 (with and without
data scaling and regularization) and Random Forest (with and without
data scaling). The results to be compared are computed using K-fold Cross
Validation and are based on the Root Mean Squared Error metric.
- For the Ridge Regularisation and Random Forest models, the GridSearchCV object
provided by the Scikit-learn library is employed to find the best
hyperparameters.
- The final predictive model will be selected as the model that gives the lowest Root
Mean Squared Error and will be used to predict the Life Expectancy of the newborns.
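The comparison procedure described in this section could look roughly like the following sketch. It is not the code used for the assignment: the data is a synthetic stand-in (200 rows, 5 features), the candidate list is shortened, and the alpha grid values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for the real dataset (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 5))
y = 50 + 0.3 * X[:, 0] - 0.002 * X[:, 1] ** 2 + rng.normal(0, 1, size=200)


def cv_rmse(model, X, y, folds=5):
    """Average of the per-fold RMSE values obtained with cross_val_score."""
    scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")
    return float(np.sqrt(-scores).mean())


# A shortened list of candidates, with and without Min-Max scaling.
candidates = {
    "linear (raw)": LinearRegression(),
    "linear (scaled)": make_pipeline(MinMaxScaler(), LinearRegression()),
    "poly2 + ridge (scaled)": make_pipeline(
        MinMaxScaler(), PolynomialFeatures(degree=2), Ridge(alpha=1.0)
    ),
    "random forest (raw)": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = {name: cv_rmse(model, X, y) for name, model in candidates.items()}

# Tune the ridge penalty with a grid search (hypothetical grid values).
ridge_pipeline = make_pipeline(MinMaxScaler(), PolynomialFeatures(degree=2), Ridge())
search = GridSearchCV(
    ridge_pipeline,
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
best_alpha = search.best_params_["ridge__alpha"]
# best_score_ is the mean negative MSE over the folds, so this is sqrt(mean MSE).
best_rmse = float(np.sqrt(-search.best_score_))

# The model with the lowest cross-validated RMSE is the one to select.
best_model_name = min(results, key=results.get)
```

Placing MinMaxScaler inside the pipeline means the scaler is refitted on the training folds only in each iteration, so the validation fold does not leak into the scaling.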
V. Results:
- In terms of the Linear Regression models, this type of Regression model is so simple
that the Regularization and Min-Max Scaling techniques do not have much impact on it.
Based on the table above, the recorded average Root Mean Squared Errors are
stable at around 4.94.
- For Polynomial Regression models, the degree-2 model trained on
Min-Max scaled data with Regularization applied turns out to be the best among
models of this type (Root Mean Squared Error: 4.113455165247399, alpha = 1.2).
- Lastly, the Random Forest Regression trained on raw data with the tuned
hyperparameters bootstrap: False, max_features: 7, n_estimators: 50 is the best model
overall and is therefore selected; its Root Mean Squared Error is only 3.379681122.
VI. Conclusion:
- After evaluating all the mentioned Regression models, the Random Forest Regression
trained on raw data appears to be the best model based on the Root Mean Squared Error
metric. As a consequence, the Life Expectancy predictions of the newborns are made
using this model. Furthermore, because K-Fold Cross Validation was used,
the final decision based on the RMSE is more reliable and less prone to
overfitting. However, because this assignment does not allow feature
selection, the “accuracy” of this model could not be improved further. As noted at the
beginning, a large number of outliers can be seen in the box plots; these should be
removed or winsorized. Moreover, the heat map shows that some
independent variables seem to be strongly correlated with each other, which may
lead to the “multicollinearity” phenomenon.
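The two follow-up steps mentioned above, winsorizing outliers and spotting strongly correlated variables, can be sketched as follows. This is only an illustration: the helper names, the percentile bounds and the 0.9 threshold are assumptions, not values used in this assignment.

```python
import numpy as np


def winsorize(column, lower_pct=5, upper_pct=95):
    """Clip values outside the given percentiles to the percentile bounds."""
    column = np.asarray(column, dtype=float)
    low, high = np.percentile(column, [lower_pct, upper_pct])
    return np.clip(column, low, high)


def correlated_pairs(X, threshold=0.9):
    """Return (i, j, r) for column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    n_cols = corr.shape[0]
    return [
        (i, j, float(corr[i, j]))
        for i in range(n_cols)
        for j in range(i + 1, n_cols)
        if abs(corr[i, j]) > threshold
    ]
```

For example, winsorize([1, 2, 3, 4, 100], 0, 75) pulls the extreme value 100 down to the 75th percentile, and correlated_pairs lists the same pairs that would appear as the darkest off-diagonal cells of a correlation heat map.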
VII. References:
[1] D. D. Pham, “Logistic Regression and Regularisation”, COSC 2753 Machine Learning, 2020.
[2] Creative Research Systems, “Techniques in Determining Correlation”, 2016. [Online].
Available: https://www.surveysystem.com/correlation.htm. [Accessed: 10-Apr-2020].
[3] Aurelien Geron, Hands-On Machine Learning with Scikit-Learn & TensorFlow. Sebastopol, CA:
O’Reilly Media, Inc., 2017.
[4] D. D. Pham, “Evaluating Hypotheses and Bayesian Learning”, COSC 2753 Machine Learning,
2020.
[5] Wikipedia, “Linear Regression”, 2020. [Online]. Available:
https://en.wikipedia.org/wiki/Linear_regression. [Accessed: 10-Apr-2020].
[6] GeeksforGeeks, “Python | Implementation of Polynomial Regression”. [Online]. Available:
https://www.geeksforgeeks.org/python-implementation-of-polynomial-regression/. [Accessed:
10-Apr-2020].
[7] Towards Data Science, “Regularization in Machine Learning”. [Online]. Available:
https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a/. [Accessed:
10-Apr-2020].
[8] Statistics How To, “What is Root Mean Square Error (RMSE)?”. [Online]. Available:
https://www.statisticshowto.com/rmse/. [Accessed: 10-Apr-2020].