
School of Science & Technology

COSC2753 – Machine Learning

Assignment 1:
Intro to Machine Learning

Student details:
 Nguyen Minh Tri (s3726096)
Course coordinator: Dr. Duy Dang-Pham

Submission date: 10/04/2020


I. Introduction:
- Regression is a type of Supervised Learning in Machine Learning. It is utilized when
the output labels are not divided into classes or categories but take real, continuous
values. Regression models are indispensable: they provide a robust statistical method
that allows us to examine the relationships between two or more variables of interest.
Once the influence of one or more attributes on the output label is understood, these
models can be employed to predict and forecast future values.
- In this assignment, we are required to use regression approaches to find a predictive
model. Moreover, the process of determining the final model based on a specified
metric is also illustrated. Finally, the ultimate judgement of the “best” model is made
and this model is implemented to predict the life expectancy of newborns from the
given test dataset.

II. Definitions:
In this section, the terminologies and techniques related to or implemented in the
process of determining the most appropriate model are defined:
- Independent variable: A variable that represents a quantity which cannot be predicted
from the other variables. In more detail, an independent variable has an orthogonal
(uncorrelated) relationship with all other variables in the dataset and can be used as
an input for machine learning algorithms. [1]
- Dependent variable: In contrast, a dependent variable can be calculated from the other
variables; it normally refers to the label (the target output value). [1]
- Correlation: A statistical value representing how strongly a pair of variables is
related. It ranges from -1.0 to +1.0. A correlation close to 0 means that there is no
relationship between the variables, or that the relationship being examined is not
linear. A positive correlation means that higher-than-average values of one variable
tend to be paired with higher-than-average values of the other variable. A negative
correlation means that higher-than-average values of one variable tend to be paired
with lower-than-average values of the other variable. [2]
- Min-max Scaling: A data pre-processing technique that shifts and rescales values so
that they end up ranging from 0 to 1. It does this by subtracting the minimum value and
dividing by the maximum minus the minimum (see the formulas after this list). To
implement this in Python, Scikit-Learn provides a transformer called MinMaxScaler. [3]
- Training dataset: the data used for training the hypotheses [4]
- Validation dataset: the data used for “testing” and tuning hyperparameters of a
Machine Learning Algorithm [4]
- Testing dataset: the data used for evaluating and comparing final hypotheses [4]
- K-Fold Cross-validation: In this process, the data is divided into k partitions. In
the first iteration, one partition is assigned as the validation set and the other k-1
partitions form the training set. The model is then evaluated, and the procedure is
repeated with a different partition serving as the validation set each time. The final
result is the average of all iterations’ results. [4] This process is automated by
using the “cross_val_score” function provided by Scikit-Learn.
- Linear Regression: A linear approach to modelling the relationship between one
dependent variable and one or more independent variables. In other words, its
hypothesis is a linear equation. [5]
- Polynomial Regression: A form of Linear Regression in which the relationship between
the target value and the independent variables is modelled as an nth-degree
polynomial [6]. In other words, its hypothesis is a polynomial equation.
- Regularization: A technique that constrains or shrinks the coefficient estimates
towards zero to discourage learning an overly complex model and to avoid
overfitting. [7]
- Ridge Regression: While most Regression models tend to overfit the training dataset,
Ridge Regression balances bias and variance by adding a penalty/regularization term to
the loss function, as shown below. Ridge Regression can be tuned to find the most
generalizable set of coefficient estimates. [1]
$J(\theta) = \frac{1}{n}\sum_{i=1}^{n}\left(h_\theta(x^{i}) - y^{i}\right)^{2} + \lambda\sum_{v=1}^{m}\theta_{v}^{2}$ [1]
- Root Mean Squared Error (RMSE): This value measures the standard deviation of the
prediction errors. In other words, it measures how concentrated the data are around
the line of best fit (see the formula after this list). [8]
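For completeness, the Min-max Scaling and RMSE quantities defined above can be written
explicitly (standard formulations; here $x$ denotes a feature value, $y_i$ a true label
and $\hat{y}_i$ a prediction):

$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$ (Min-max Scaling)

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}}$ (Root Mean Squared Error)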

III. Methodologies:
- Initially, by using the head() function from the pandas library, we can observe that
the ranges of the independent attributes in the data frame are not equal. Moreover,
after plotting a histogram of each column, it is clear that most of them have skewed
distributions. Therefore, the pre-processing technique called Min-Max Scaling should
be implemented instead of Standardization. However, this method may suffer from the
large number of outliers, which are found using box-plot visualization.
- A brute-force methodology will be applied to determine the most appropriate model
for the given “Global Health Observatory data repository” dataset. Hence, various
Regression models will be trained and their results compared. The list of tried
Regression models includes Linear Regression (with and without data scaling and
regularization), Polynomial Regression (degrees 2, 3 and 4, with and without data
scaling and regularization) and Random Forest (with and without data scaling). The
results to be compared are computed by implementing the K-Fold Cross-Validation
technique and are based on the Root Mean Squared Error metric (see the sketch after
this list).
- For the Ridge Regularisation and Random Forest models, the GridSearchCV object
provided by the Scikit-learn library will be employed to tune and find the best
hyperparameters.
- The final predictive model will be selected as the model that gives the lowest Root
Mean Squared Error and will be used to predict the Life Expectancy of the newborns.
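A minimal sketch of how such a comparison can be set up with Scikit-learn (the
variables X and y are assumed to hold the training features and labels; the candidate
models shown are illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error, make_scorer
    from sklearn.model_selection import cross_val_score

    # Scikit-learn maximizes scores, so the MSE scorer must be negated
    rmse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

    candidates = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(),
    }

    for name, model in candidates.items():
        # 5-fold cross-validation; the returned scores are negated MSE values
        scores = cross_val_score(model, X, y, cv=5, scoring=rmse_scorer)
        print(f"{name}: average RMSE = {np.sqrt(-scores).mean():.3f}")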

IV. Code explanation:


The whole code is broken down into seven main parts: installing libraries; reading
data from files; visualizing data; splitting the data into independent and dependent
variables; pre-processing data; training, tuning and evaluating models; and selecting
the best model to predict life expectancy.
- Install libraries: the “pandas” and “numpy” libraries are imported to load and
process the data from CSV files. “matplotlib” and “seaborn” are used to visualize
data. Lastly, some sub-modules of the “sklearn” library are imported to pre-process
data and to train and evaluate models, as sketched below.
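A plausible set of imports for this pipeline (the exact sub-modules are assumptions
based on the steps described in this report):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Scikit-learn sub-modules used in the later steps
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score, GridSearchCV
    from sklearn.metrics import mean_squared_error, make_scorer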
- Reading file: two data frames are initialized. The first one, called “df”, stores
the data read from “train.csv” and is used for training. The other, called
“df_to_predict”, stores the data read from “test.csv” and is used for making
predictions, as sketched below.
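A minimal sketch, assuming both CSV files sit in the working directory:

    # Training data and the unlabelled data to predict on
    df = pd.read_csv("train.csv")
    df_to_predict = pd.read_csv("test.csv")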
- Data visualization: firstly, for continuous data, a figure containing a pair of
box-plot and histogram for each column is shown. It not only tells us the range of
the data and the shape of its distribution, but also reveals the existence of
outliers. For categorical data like “Country”, “Year” and “Status”, horizontal bar
charts are plotted so that we can examine the frequencies of the values. Finally, a
correlation matrix of all attributes in the training data frame is illustrated using
a heat map, as sketched below.
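A sketch of the heat-map step (the figure size and colour map are arbitrary choices):

    # Correlation matrix of the numeric attributes, drawn as a heat map
    corr = df.select_dtypes(include="number").corr()
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, cmap="coolwarm")
    plt.title("Correlation matrix of the training attributes")
    plt.show()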
- Data preparation: in this part, the dependent variable “Target_LifeExpectancy” is
simply separated from the remaining independent variables using the “.drop()” function
provided by the “pandas” library; they are then assigned to y and X respectively.
Moreover, we also turn the “df_to_predict” data frame into a numpy matrix called
“X_test”, which can be fed into a Regressor for further predictions, as sketched
below.
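A minimal sketch of this split (“.values” is one plausible way of obtaining a numpy
matrix):

    # Separate the label from the features
    y = df["Target_LifeExpectancy"]
    X = df.drop("Target_LifeExpectancy", axis=1)

    # Unlabelled data kept aside for the final predictions
    X_test = df_to_predict.values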
- Data pre-processing: as discussed above, because the columns have different ranges
and most of their distributions are skewed, the Min-Max Scaling technique is
implemented using the MinMaxScaler provided by Scikit-learn, as sketched below.
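A sketch of the scaling step; fitting the scaler on the training features only and
then transforming both datasets is a standard precaution assumed here:

    # Learn min/max from the training data, then transform both datasets
    scaler = MinMaxScaler()
    X_scaled = scaler.fit_transform(X)
    X_test_scaled = scaler.transform(X_test)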
- Evaluating Models and Tuning Coefficients:
In this part, various models are trained, including:
1. Linear Regression (without Min-Max Scaling, without Regularisation)
2. Linear Regression (with Min-Max Scaling, without Regularisation)
3. Linear Regression (without Min-Max Scaling, with Regularisation)
4. Linear Regression (with Min-Max Scaling, with Regularisation)
5. Polynomial Regression (degrees 2, 3, 4; without Min-Max Scaling, without Regularisation)
6. Polynomial Regression (degrees 2, 3, 4; with Min-Max Scaling, without Regularisation)
7. Polynomial Regression (degrees 2, 3, 4; without Min-Max Scaling, with Regularisation)
8. Polynomial Regression (degrees 2, 3, 4; with Min-Max Scaling, with Regularisation)
9. Random Forest Regression (without Min-Max Scaling)
10. Random Forest Regression (with Min-Max Scaling)
For models 1, 2, 5 and 6, the “cross_val_score” function from sklearn.model_selection
is implemented. This function automatically performs the K-Fold cross-validation
process (default K=5) and returns the per-fold scores, which are then averaged (the
metric is specified as Root Mean Squared Error).
The remaining models all apply Regularisation or Random Forest and therefore require
a list of hyperparameters to be tuned. For this, the GridSearchCV object is employed.
To use this object, we have to input the model type and a dictionary of
hyperparameters. When the “fit” function of this object is called, it exhaustively
searches over all combinations of the input hyperparameters and performs the K-Fold
cross-validation process. Finally, the best estimator is selected as the model with
the best score.
However, for the two functions “cross_val_score” and “GridSearchCV”, when defining a
custom scorer, the convention is that functions ending in “score” return a value to
be maximized. Therefore, when using the Root Mean Squared Error as the evaluation
metric, we have to set the “greater_is_better” parameter of the “make_scorer”
function to False, as we want to minimize it. As a result, the computed scores will
be negative, and we have to negate them before taking their square root. A sketch of
this tuning step is shown below.
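A minimal sketch of the tuning step, reusing the rmse_scorer defined earlier; the
parameter grid values are illustrative, not the exact ones used in the assignment:

    # Exhaustive search over Random Forest hyperparameters with 5-fold CV
    param_grid = {
        "bootstrap": [True, False],
        "max_features": [5, 7, 9],
        "n_estimators": [50, 100],
    }
    grid = GridSearchCV(RandomForestRegressor(), param_grid,
                        scoring=rmse_scorer, cv=5)
    grid.fit(X, y)

    # Scores are negated MSE values, so negate before the square root
    print(grid.best_params_, np.sqrt(-grid.best_score_))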
- Train best model for predictions: after manually selecting the best model from the
ten mentioned above, we refit this model with the best found hyperparameters on the
whole training dataset. Ultimately, this model is implemented to predict the Life
Expectancy of the newborns based on the independent variables in the “df_to_predict”
data frame, as sketched below.
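A sketch of the final refit-and-predict step, plugging in the tuned hyperparameters
reported in Section V:

    # Refit the winning model on the full training data
    best_model = RandomForestRegressor(bootstrap=False, max_features=7,
                                       n_estimators=50)
    best_model.fit(X, y)

    # Predict life expectancy for the unseen data
    predictions = best_model.predict(X_test)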

V. Results:

- In terms of the Linear Regression models, this type of model is so simple that the
Regularization and Min-Max Scaling techniques do not have much impact on it. Based on
the table above, the recorded average Root Mean Squared Errors remain stable at
around 4.94.
- For the Polynomial Regression models, the degree-2 model trained on Min-Max scaled
data with Regularization turns out to be the best among models of this type (Root
Mean Squared Error: 4.113455165247399, alpha = 1.2).
- Lastly, the Random Forest Regression trained on raw data with the tuned
hyperparameters bootstrap: False, max_features: 7, n_estimators: 50 is the best model
overall, with a Root Mean Squared Error of only 3.379681122.

VI. Conclusion:
- After evaluating all the mentioned Regression models, the Random Forest Regression
trained on raw data appears to be the best model based on the Root Mean Squared Error
metric. As a consequence, the Life Expectancy predictions of the newborns are made
using this model. Furthermore, since K-Fold Cross-Validation was implemented, the
final decision made on the basis of the RMSE is more reliable and less prone to
overfitting. However, due to the limitation of this assignment that no feature
selection is allowed, the “accuracy” of this model cannot be improved further. At the
beginning, a large number of outliers could be seen in the box-plots; these should
ideally be removed or winsorized. Moreover, using the heat map, we can see that some
independent variables have strong correlations with each other, which may lead to the
“multi-collinearity” phenomenon.

VII. References:
[1] D. D. Pham, “Logistic Regression and Regularisation”, COSC 2753 Machine Learning, 2020.
[2] Creative Research Systems, “Techniques in Determining Correlation”, 2016. [Online].
Available: https://www.surveysystem.com/correlation.htm. [Accessed: 10-Apr-2020].
[3] Aurelien Geron, Hands-On Machine Learning with Scikit-Learn & TensorFlow. O’Reilly Media,
Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472, 2017.
[4] D. D. Pham, “Evaluating Hypotheses and Bayesian Learning”, COSC 2753 Machine Learning, 2020.
[5] Wikipedia, “Linear Regression”, 2020. [Online]. Available:
https://en.wikipedia.org/wiki/Linear_regression. [Accessed: 10-Apr-2020].
[6] GeeksforGeeks, “Python | Implementation of Polynomial Regression”. [Online]. Available:
https://www.geeksforgeeks.org/python-implementation-of-polynomial-regression/. [Accessed:
10-Apr-2020].
[7] Towards Data Science, “Regularization in Machine Learning”. [Online]. Available:
https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a/. [Accessed:
10-Apr-2020].
[8] Statistics How To, “What is Root Mean Square Error (RMSE)?”. [Online]. Available:
https://www.statisticshowto.com/rmse/. [Accessed: 10-Apr-2020].
