COMP1801 - Copy 1

1. Executive summary
This report examines machine learning concepts through a real-world application. Customer
data from an e-commerce company is analysed to support business decisions such as setting
product prices and choosing advertising targets. Several types of machine learning
algorithms are discussed. A regression model is built and its performance is assessed with a
regression evaluation metric. For binary classification, a decision tree is implemented and
evaluated using accuracy and the confusion matrix. A neural network architecture is then
designed and trained on the same binary classification problem, and its performance is
compared with that of the decision tree. The k-means clustering algorithm is used to group
customers and analyse the relationships in the data. The report concludes by comparing the
methods used to predict customer salary.
2. Introduction to Machine learning
Machine learning is a data analysis technique that automates the building of analytical
models. It is a subfield of artificial intelligence founded on the idea that machines can
learn from data, identify patterns, and make decisions with little human intervention.
Machine learning is a key element of the rapidly expanding discipline of data science.
Algorithms are trained with statistical techniques to produce classifications or predictions,
revealing important insights in data mining projects. Ideally, the decisions made from these
insights influence key growth indicators in applications and businesses. As big data
continues to grow, data scientists will be in greater demand, because they are needed to
identify the most important business questions and the data required to answer them. ML
algorithms generally become more accurate as they are trained on more data, and can
consequently make better decisions. Take the example of a weather forecasting model: its
predictions tend to become more accurate as the data set expands. Machine learning
algorithms are also adept at handling multidimensional and multivariate data in dynamic or
uncertain environments.
3. Regression
Regression is a statistical method that is commonly used in fields such as finance and
investment. It determines the strength and character of the relationship between a dependent
variable and one or more independent variables. The most commonly used form is linear
regression, which is also referred to as simple regression or ordinary least squares.
Regression analysis uncovers associations among the variables observed in the data, but it
cannot by itself establish causation. It helps investment managers to value assets and to
understand the relationships between factors such as commodity prices and the stocks of
businesses that deal in those commodities (team, 2022).
The main purpose of regression in statistical analysis is to identify the relationships
among variables in the data, including both the magnitude of an association and its
statistical significance. It is also used to predict future outcomes from past observations.
Several types of regression model are available for making predictions. The choice of
technique depends on the number of independent variables, the shape of the regression line,
and the type of the dependent variable.
Linear regression model
This is the most widely used modelling technique. It represents a linear relationship
between the independent variable (X) and the dependent variable (Y) using a regression line
known as the line of best fit. The linear relationship is defined as
Y = c + mX + e
where c is the intercept, m is the slope, and e is the error term.
Linear regression may be simple (one independent variable) or multiple (several independent
variables).
Figure 4. Linear regression model (Yashwanth, 2020)
We have chosen the linear regression model, and the evaluation metric discussed for this
model is the mean squared error (MSE). Evaluation metrics measure how well the model
approximates the underlying relationship. The MSE is the average of the squared differences
between the predicted values and the actual values. It is easy to optimize because it is
differentiable and convex. Because the errors are squared, large errors are penalized more
heavily than small ones (Yashwanth, 2020).
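The definition of the mean squared error can be illustrated with a small worked example. This is a minimal sketch in plain Python; the salary figures are hypothetical and are not taken from the customer data.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical salaries in thousands (illustrative values only).
actual = [30.0, 45.0, 50.0, 80.0]
small_errors = [32.0, 43.0, 52.0, 78.0]     # every prediction is off by 2
one_large_error = [30.0, 45.0, 50.0, 72.0]  # a single prediction is off by 8

mse_small = mean_squared_error(actual, small_errors)
mse_large = mean_squared_error(actual, one_large_error)
print(mse_small, mse_large)
```

Both sets of predictions have the same total absolute error (8), but the set with one large error has an MSE four times higher, which shows how squaring penalizes large errors.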
Implementation of Regression Model

First, the customer data in the CSV file is read into the data frame customer_df using the
pandas read_csv() function.
import pandas as pd
customer_df = pd.read_csv("Comp1801.csv")
The data frame is pre-processed before the data is fitted to the regression model. Using
pandas methods, the data is checked for null and duplicate values. The regression line is
fitted between two variables, Site Spending and Salary. Before fitting the line, the
relationship between these two attributes is examined by plotting the points on a scatter
graph.
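The pre-processing checks described above can be sketched as follows. The report does not show the exact calls that were used, so the pandas methods, the small stand-in data frame, and the exact column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for customer_df; the real data comes from Comp1801.csv.
customer_df = pd.DataFrame({
    "Site Spending": [120.0, 80.0, None, 200.0, 80.0],
    "Salary": [40000.0, 31000.0, 52000.0, 75000.0, 31000.0],
})

null_counts = customer_df.isnull().sum()          # missing values per column
duplicate_count = customer_df.duplicated().sum()  # rows that repeat an earlier row

# Drop incomplete and repeated rows before modelling.
clean_df = customer_df.dropna().drop_duplicates()

# Pearson correlation between the two attributes.
correlation = clean_df["Site Spending"].corr(clean_df["Salary"])
print(null_counts, duplicate_count, correlation, sep="\n")
```

A correlation close to zero at this stage would already suggest that a straight line through these two attributes will explain little of the variation.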
The data is then assigned to the X and Y variables: X contains the Site Spending values and
Y contains the Salary values. The data is divided into training and testing sets, and a
linear regression model is created. The X and Y values of the training data are passed to the
model to fit the relationship between the two variables. After training, predictions are made
on the training data.
The following graph shows the line that best fits the training data.
The model is then tested by predicting the Y values for the X values of the testing data.
The following graph shows the fitted line against the testing data.
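The split, fit, and predict steps can be sketched with scikit-learn as follows. The report's own code is not shown, so the synthetic data, the 80/20 split, and the random seed are assumptions rather than the settings actually used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: salary loosely depends on site spending.
rng = np.random.default_rng(0)
site_spending = rng.uniform(0, 500, size=200)
salary = 30000 + 40 * site_spending + rng.normal(0, 5000, size=200)

X = site_spending.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = salary

# Assumed 80/20 split; the report does not state the proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

train_predictions = model.predict(X_train)  # points on the fitted line (training)
test_predictions = model.predict(X_test)    # points on the fitted line (testing)

print("slope m:", model.coef_[0])
print("intercept c:", model.intercept_)
```

The fitted `coef_` and `intercept_` correspond to m and c in the equation Y = c + mX + e.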
The metric used to evaluate this linear regression model is the R squared value.
The R squared value shows how well the regression model explains the given data. Here the
value is 0.14, which implies that only about 14 % of the variation in the data is explained
by the model.
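The link between the R squared value and the share of variation explained can be checked with a small calculation. This is a minimal sketch using the standard definition R squared = 1 - SS_res / SS_tot; the sample values are hypothetical.

```python
def r_squared(actual, predicted):
    """R squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0, 4.0]
perfect_fit = [1.0, 2.0, 3.0, 4.0]  # predictions match every value
mean_only = [2.5, 2.5, 2.5, 2.5]    # predictions ignore X and use the mean

r2_perfect = r_squared(actual, perfect_fit)
r2_mean = r_squared(actual, mean_only)

# An R squared of 0.14 corresponds to 14 % of the variation being explained.
explained_percent = round(0.14 * 100)
print(r2_perfect, r2_mean, explained_percent)
```

A model that always predicts the mean scores 0 and a perfect model scores 1, so a value of 0.14 sits much closer to the mean-only baseline.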
4. Binary classification
Binary classification algorithms assign each sample to one of two mutually exclusive
classes. This is the simplest form of classification in machine learning. The choice of
algorithm depends on runtime performance, accuracy, and the volume and quality of the data.
Several classification algorithms are available in machine learning, including the
following.
Bernoulli Naïve Bayes: a variation of Naïve Bayes used for binary classification problems.
Its advantages include speed, good performance on both small and large datasets, and easy
handling of irrelevant features.
Logistic regression: a simple and efficient machine learning classifier. It is easy to
implement and can be trained efficiently, but it can be affected by outliers.
K-nearest neighbors: a supervised machine learning algorithm used for both classification
and regression problems. Its performance degrades when the dataset is large and the number
of dimensions is high (Ortner, 2020).
Support vector machine: used for classification and regression tasks, and can provide high
accuracy with modest computation. Using a linear SVC can improve the quality of the
predictions and reduce overfitting.
Decision tree
If the dataset contains outliers, a robust algorithm that is not strongly affected by them
is needed. Decision tree algorithms are widely used for classification tasks and are
relatively insensitive to outliers. An advantage of the decision tree classifier is that very
little data preparation is required (Kharwal, 2021). This algorithm is preferred in this
analysis because decision trees are well suited to tabular data, the outputs are discrete,
and explanations for the decisions are required. Decision trees are also preferable in
situations where the sequence of conditions and actions is critical and where not every
condition is relevant to every action. In a business setting, decision trees can be used for
tasks such as determining market pay and deciding on the best strategies. A comprehensive
analysis can be created and the decision nodes identified. Specific values are assigned to
each outcome, and costs and benefits can be expressed in explicit monetary values. The
approach is also easy to use and versatile (Woodruff, 2019).
The evaluation metrics used for the classification problem include the confusion matrix.
Metrics such as precision and recall are also commonly used. A confusion matrix is a table
of predicted against actual values that measures performance on problems with two or more
output classes (Beheshti, 2022). Recall, precision, and accuracy can be computed from the
confusion matrix, and ROC curves are built from the same counts at different thresholds.
Hence it is preferred in this analysis.
Implementation of Decision Tree

A binary decision tree is a structure built on a sequential decision-making process.
Starting from the root, a feature is evaluated and one of two branches is chosen. This
operation is repeated until a leaf is reached, which typically represents the classification
target. Hence, I felt that a decision tree is suitable for binary classification.
The data frame is pre-processed before it is used for binary classification. Using pandas
methods, the data is checked for null and duplicate values. Because binary classification is
to be performed on the salary attribute, a method is implemented that divides salary into two
categories. The result is added to the existing data frame as a new column, “Salary Range”.
The number of records in each salary range is viewed using the count() method. The attribute
“Region” is then removed from the data frame because it is an insignificant feature. The
remaining string attributes are encoded using the label encoder.
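These preparation steps might look like the sketch below. The salary cut-off, the sample rows, and the “Education” column are assumptions, since the report does not state the threshold or list the columns at this point. The sketch uses value_counts() to count each category, which may differ from the exact call used in the report.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows; only "Salary", "Region" and "Salary Range" are named in the report.
customer_df = pd.DataFrame({
    "Salary": [25000, 64000, 48000, 90000],
    "Region": ["North", "South", "East", "West"],
    "Education": ["Bachelor", "Master", "Bachelor", "PhD"],
})

SALARY_THRESHOLD = 50000  # assumed cut-off; the report does not state one


def salary_range(salary):
    """Map a salary to one of two categories."""
    return "High" if salary >= SALARY_THRESHOLD else "Low"


customer_df["Salary Range"] = customer_df["Salary"].apply(salary_range)
range_counts = customer_df["Salary Range"].value_counts()  # records per category

# Remove the attribute judged insignificant.
customer_df = customer_df.drop(columns=["Region"])

# Encode the remaining string columns as integer labels (classes sorted alphabetically).
for column in ["Education", "Salary Range"]:
    customer_df[column] = LabelEncoder().fit_transform(customer_df[column])

print(range_counts)
print(customer_df)
```

Because the label encoder sorts class names alphabetically, "High" becomes 0 and "Low" becomes 1 in this sketch.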
The data is then divided into a set of input features and the target feature.
A heatmap is then used to visualize the correlation between the attributes.
The X and Y data are then divided into training and testing sets. A decision tree model is
created with entropy as the splitting criterion and trained on the input and target training
data. The testing data is then used to predict the target feature from the input features.
The accuracy metric is used to evaluate the decision tree model. The accuracy score is the
ratio of correctly classified samples to the total number of samples. The accuracy score for
this decision tree model is 99.67 %. The confusion matrix for the binary classification
shows 197 true positives, 102 true negatives, and 1 false positive.
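The training and evaluation steps can be sketched with scikit-learn as shown below. The synthetic features, the 70/30 split, and the random seed are assumptions; only the entropy criterion, the accuracy score, and the confusion matrix come from the report. The last lines check the reported figures, assuming no false negatives since none are reported.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Assumed 70/30 split; the report does not state the proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

# Rows are actual classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
matrix = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = matrix.ravel()
print(accuracy)
print(matrix)

# Check of the reported figures: 197 TP, 102 TN, 1 FP, and (assumed) 0 FN.
reported_accuracy = (197 + 102) / (197 + 102 + 1)
print(round(reported_accuracy * 100, 2))
```

Accuracy is the sum of the diagonal of the confusion matrix divided by the total number of test samples, which is how the reported counts give 99.67 %.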
5. Neural Networks
In a binary classification problem, a neural network assigns a prediction to the positive
class if the estimated probability is above a threshold value and to the negative class if
it is below the threshold. Two changes are needed to build a neural network for binary
classification. First, an activation function such as the sigmoid is added to the output
layer; it maps the output to a value between 0.0 and 1.0, which represents a probability.
Second, the loss function is changed to binary cross entropy, which is designed for binary
classifiers. The accuracy of the model can be tracked during training by setting the metric
to accuracy (Khan, 2019). Classification performance can be obtained for different types of
neural network, including the back propagation neural network, radial basis function neural
network, probabilistic neural network, general regression neural network, and complementary
neural network.
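The roles of the sigmoid, the threshold, and binary cross entropy can be made concrete with a short sketch in plain Python. The raw output values are hypothetical, and the 0.5 threshold (with ties assigned to the positive class) is an assumption, since the report does not state the threshold used.

```python
import math


def sigmoid(z):
    """Map a raw output to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross entropy between labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def classify(prob, threshold=0.5):
    """Positive class (1) at or above the threshold, otherwise negative class (0)."""
    return 1 if prob >= threshold else 0


raw_outputs = [-2.0, 0.0, 3.0]  # hypothetical outputs of the last layer
probabilities = [sigmoid(z) for z in raw_outputs]
predicted_classes = [classify(p) for p in probabilities]

loss_confident_right = binary_cross_entropy([1], [0.9])
loss_confident_wrong = binary_cross_entropy([1], [0.1])
print(probabilities, predicted_classes)
print(loss_confident_right, loss_confident_wrong)
```

The loss is small when the predicted probability agrees with the label and large when a confident prediction is wrong, which is what drives the network's weights during training.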
Artificial neural networks are loosely based on the neural structure of the brain, which
acquires its learning capabilities through experience. After learning from past data, the
network can generate outputs for new inputs. It is considered a powerful technique for
classification because it has several advantages. It can adapt to the data without prior
assumptions about the form of the underlying function. It is also considered a universal
function approximator (Wong, 2009), so an artificial neural network can approximate
functions to a high degree of accuracy. It is a non-linear model, which allows it to be
applied to complex real-world applications. The benefits of a neural network therefore
include an easy fit to non-linear datasets. Once a binary classifier is trained, predictions
are made by calling the predict method. The sigmoid activation function outputs a number
between 0 and 1, which represents the probability that the input belongs to the positive
class. A neural network performs binary classification by including a single neuron with the
sigmoid activation function in the output layer and specifying binary cross entropy as the
loss function. Binary classification problems can often be solved more effectively by using
deep learning models. In many cases a convolutional neural network (CNN) is used, which has
provided effective results in image recognition, processing, and classification
(Verma, 2022). The CNN model requires training data to learn its weights and to check its
performance. Each input is passed through several convolution layers with filters, followed
by pooling layers and fully connected layers, the last of which applies the sigmoid function.
Implementation of Neural Networks

The data frame is pre-processed before it is used to train the model. Using pandas methods,
the data is checked for null and duplicate values. This simple neural network model has three
dense layers. The first and second layers each have 16 units with the ReLU activation
function. The third layer is the output layer and uses the sigmoid function. The model is
configured with the Adam optimizer, binary cross entropy as the loss, and accuracy as the
evaluation metric.
After the neural network model is created, the model is trained with the training data using
the fit() method. Then the neural network model is tested with the testing data which contains
the input features and the model predicts the target feature.
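The report does not show the code for this network. As a rough stand-in, the same shape of model (two hidden layers of 16 ReLU units, a single sigmoid output, the Adam optimizer, and a cross-entropy loss) can be sketched with scikit-learn's MLPClassifier. This is a swapped-in library rather than the one used in the report, and the synthetic data, split, scaling step, and seed are all assumptions.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scaling is not mentioned in the report; it is added here as an assumption.
scaler = StandardScaler().fit(X_train)

model = MLPClassifier(
    hidden_layer_sizes=(16, 16),  # two hidden layers of 16 units each
    activation="relu",
    solver="adam",
    max_iter=500,
    random_state=0,
)
model.fit(scaler.transform(X_train), y_train)

y_pred = model.predict(scaler.transform(X_test))
accuracy = accuracy_score(y_test, y_pred)
print(model.out_activation_, accuracy)
```

For a two-class target, MLPClassifier uses a single logistic (sigmoid) output unit and minimizes the log loss, which matches the output layer and loss described above. Gradient-based training is usually sensitive to feature scale, which is why the inputs are standardized in this sketch.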
The performance metric used to evaluate the model is the accuracy score, the ratio of
correctly classified samples to the total number of samples. The accuracy score for the
neural network model is 66 %. The confusion matrix for this model shows 198 true positives
and 102 false negatives. The results show that the decision tree performs better than the
neural network model.
The hyperparameters used in this neural network model are the Adam optimizer and the binary
cross entropy loss.
6. Clustering
Clustering is an unsupervised learning method in which inferences are drawn from datasets
consisting of input data without labels. It can reveal meaningful structure, explanatory
underlying processes, and generative features. Clustering divides the data points into a
number of groups such that the points in one group are similar to each other and dissimilar
to the points in other groups. A cluster is basically a collection of objects grouped on the
basis of their similarity and dissimilarity. Clustering is important because it determines
the intrinsic grouping among unlabelled data (Scikit, 2022). There is no universal criterion
for a good clustering; it usually depends on the user's needs. To judge the similarity among
data points, a clustering algorithm must make some assumptions, and different assumptions
produce different clusters that are equally valid. Clustering methods can be based on
density, hierarchy, partition, or grids.
In density-based methods, clusters are dense regions of similar points, separated by regions
of lower density. These methods offer good accuracy and the ability to merge two clusters.
In hierarchical methods, the clusters form a tree structure, and new clusters are formed from
existing ones. Agglomerative and divisive are the two types of hierarchical clustering
method (McGregor, 2020).
K-means Clustering
In this work we prefer the k-means clustering algorithm because it is one of the simplest
unsupervised learning algorithms for solving clustering problems. The algorithm partitions n
observations into k clusters, where each observation belongs to the cluster with the nearest
mean, which serves as the cluster prototype (Scikit, 2022).
Figure 5. (Scikit, 2022)
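The idea of assigning each observation to the nearest mean can be illustrated with a bare-bones version of the algorithm. This is a minimal sketch in plain Python with hand-picked starting centroids and hypothetical points; library implementations choose the starting centroids automatically and handle many more cases.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids
    ]
    return distances.index(min(distances))


def k_means(points, centroids, iterations=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        labels = [nearest_centroid(point, centroids) for point in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]  # hypothetical data
initial_centroids = [(0.0, 0.0), (10.0, 10.0)]              # hand-picked start
labels, centroids = k_means(points, initial_centroids)
print(labels, centroids)
```

With these points the two lower-left observations form one cluster and the two upper-right observations form the other, and each final centroid is the mean of its members.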

This algorithm is widely used in marketing applications to discover and characterise
customer segments. In biology, it can be used to classify species of plants and animals. It
is also used in libraries to cluster books by topic and content.
Implementation of K-means Clustering
The data frame is pre-processed before it is used for clustering. Using pandas methods, the
data is checked for null and duplicate values. Exploratory data analysis is then performed
on the data. The distribution of the attribute Sex can be visualized in a bar chart as
follows:
The distribution of the attribute Education can be visualized in a bar chart as follows:
The distribution of the attribute Work type can be visualized in a bar chart as follows:
The string attributes in the data are then encoded using the label encoder. To find the
optimal number of clusters, the elbow method is used.
Based on the elbow plot, the optimal number of clusters for the given customer data is 4,
since beyond this value of K the curve flattens and runs almost parallel to the X-axis. A
k-means clustering model with the number of clusters set to 4 is then created and trained on
the customer data, and the data are clustered by the model.
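The elbow method and the final model can be sketched with scikit-learn as follows. The synthetic data (four well-separated groups), the range of K values tried, and the n_init and random_state settings are assumptions; the report only states that the elbow method led to 4 clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: four well-separated groups of 50 points each.
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
data = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centres])

# Elbow method: record the within-cluster sum of squares (inertia) for each K.
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(km.inertia_)

# Final model with the number of clusters read off the elbow plot.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(data)

print(inertias)
print(kmeans.cluster_centers_)
```

Plotting `inertias` against K gives the elbow curve: the inertia falls steeply up to the true number of groups and only slightly afterwards, which is the flattening described above.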
Based on the average value of each attribute in each cluster, it can be concluded that
customers with high salary, more site time, and more site spending belong to cluster 3.
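The per-cluster averages behind this conclusion can be computed with a group-by. This is a minimal sketch in which the rows, the cluster labels, and the exact column names are hypothetical and are not the report's actual results.

```python
import pandas as pd

# Hypothetical clustered data; values and labels are illustrative only.
clustered_df = pd.DataFrame({
    "Salary": [30000, 32000, 90000, 95000],
    "Site Time": [5.0, 6.0, 20.0, 22.0],
    "Site Spending": [50.0, 60.0, 400.0, 420.0],
    "Cluster": [0, 0, 3, 3],
})

# Average value of each attribute within each cluster.
cluster_profile = clustered_df.groupby("Cluster").mean()

# Cluster whose members have the highest average salary.
top_salary_cluster = cluster_profile["Salary"].idxmax()
print(cluster_profile)
print(top_salary_cluster)
```

Reading across a row of `cluster_profile` gives the typical customer of that cluster, which is how a segment such as high salary, more site time, and more site spending can be identified.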
7. Conclusion
Machine learning algorithms were used to carry out the analysis, and the various algorithms
were covered in detail. A regression model and its evaluation metric were explained in order
to analyse the model's performance. A decision tree was chosen for binary classification and
its implementation was described, with the confusion matrix used to establish the model's
accuracy. A neural network architecture was also constructed and trained to address the
binary classification problem, and its performance was compared with that of the decision
tree. The k-means clustering technique was used to analyse the relationships in the data.
The R squared value indicates how well the regression model describes the data. Here it is
0.14, which suggests that the model accounts for only about 14 % of the variation in the
data. The decision tree model has an accuracy of 99.67 %, and its confusion matrix shows 197
true positives, 102 true negatives, and only one false positive. The neural network model
has an accuracy of 66 %, and its confusion matrix shows 198 true positives and 102 false
negatives. The results show that the decision tree model outperforms the neural network
model. According to the average value of each attribute in each cluster, customers with high
salaries, more site time, and higher site spending belong to cluster 3. Based on these
results, the decision tree algorithm performs well compared with the other machine learning
models.