M2 Industrial Engineering
Abstract
The aim of this project is to use two machine learning methods: supervised and unsupervised learning. Machine learning allows experts to make well-informed decisions and identify new customers in the market. Many companies collect data, and some use this data to classify their clients. In this project, a supervised learning model known as the Bayes method will be applied to a car consumption dataset. It will try to identify the consumption class of each car based on its features: horsepower, number of cylinders, cubic inch displacement, weight, acceleration, and brand. Furthermore, an unsupervised model known as K-means clustering will be used; this method is not explicitly taught the MpG consumption label that each data entry belongs to. This project aims to introduce students to data mining, visualization, and classification. The methods will be used to identify the car consumption class in miles per gallon (15, 20, 25, 30, and so on).
Contents
Abstract
Introduction
Methods
Results
Discussion
Conclusion
Introduction
Today, huge sets of data can be analyzed by companies to give them marketing or business insights. It is important to collect data, but it is more important to use it. This is where machine learning can be helpful, because it relies on supervised or unsupervised learning to classify the data. In general, a machine learning model will start by identifying a hypothesis, create a cost function, minimize that cost function using gradient descent, and hopefully obtain an accurate prediction model. The type of data collected plays an important role. Some data will be labelled, which makes the task easier for the computer, while other data will be unlabeled and the computer will learn to classify it on its own. It must be noted that the computer will learn the model on a training set and will be evaluated on a smaller testing set to determine its accuracy.
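The training/testing split described above can be sketched in a few lines. This is a minimal illustration assuming scikit-learn; the feature values (cylinders, cubic inch) and MpG labels below are made-up placeholders, not the project's dataset.

```python
# Illustrative train/test split with scikit-learn; X and y are toy
# placeholders (cylinders, cubic inch) -> MpG class labels.
from sklearn.model_selection import train_test_split

X = [[4, 100], [6, 200], [8, 300], [4, 120], [6, 220], [8, 320]]
y = [30, 20, 15, 30, 20, 15]

# Hold out a smaller test set to evaluate the trained model on.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(len(X_train), len(X_test))  # 4 training rows, 2 test rows
```

The held-out test rows are never seen during training, which is what makes the reported accuracy an honest estimate.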
The K-means method will identify clusters and minimize the error margin to give accurate label predictions. This method starts without pre-defined labels but aims to teach the computer to predict and classify the information correctly. On the other hand, the Bayes method provides the computer with the labels and should, in theory, yield a better prediction model. In practice, if a dataset is provided with labels, then it is best to always use them. In our case, we are comparing the two approaches on the same dataset.
Methods
Excel will be used to fix some irregularities in the data. This is helpful to standardize the data and make sure that it all makes sense. The decisions taken here will be based on human interpretation before the machine learning step. Finally, the machine learning methods and graphs will be produced in Python.
Results
In table 1, we can see that all the Volkswagen brand names that were misspelled before were merged. The same was done for other irregularities. Furthermore, a stray “Hi” message found in the test set was removed. Finally, the brand models’ column was deleted because we assumed that brands like Ford will use similar engines and the same technology on all their models.
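The report does the cleaning in Excel, but the same steps can be sketched in pandas. The misspelling, the stray "Hi" row, and the column names below are illustrative stand-ins for the project's data.

```python
# Sketch of the cleaning steps described above, using pandas.
# Values and column names are illustrative, not the real dataset.
import pandas as pd

df = pd.DataFrame({
    "brand": ["volkswagen", "volkswagon", "ford", "Hi", "toyota"],
    "model": ["golf", "passat", "focus", None, "corolla"],
    "mpg":   [30, 25, 20, None, 30],
})

# Merge misspelled brand names into one canonical spelling.
df["brand"] = df["brand"].replace({"volkswagon": "volkswagen"})

# Drop stray rows such as the "Hi" message found in the test set.
df = df[df["brand"] != "Hi"]

# Drop the model column, assuming one brand shares engines across models.
df = df.drop(columns=["model"])
print(df["brand"].tolist())  # ['volkswagen', 'volkswagen', 'ford', 'toyota']
```

Doing these fixes in code rather than by hand makes them repeatable if the raw data is re-exported.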
Figure 1. Screenshot of data entry with the brands.
In figure 1, we can see our target value, the consumption, and its six features: cylinders, cubic inch displacement, horsepower, weight, acceleration, and brand. It will be difficult for the machine to deal with the brand strings, so they should be converted into numbers (treated as a categorical data feature).
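One simple way to perform this conversion is integer encoding, sketched below with pandas; the brand values are illustrative examples.

```python
# Encode brand strings as integer codes (categorical encoding).
import pandas as pd

brands = pd.Series(["ford", "toyota", "ford", "volkswagen"])
codes, uniques = pd.factorize(brands)
print(codes.tolist())    # [0, 1, 0, 2]
print(uniques.tolist())  # ['ford', 'toyota', 'volkswagen']
```

Each brand receives one integer, which is what the encoded column in figure 2 contains.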
Figure 2. Screenshot of data entry with the encoded brands.
In figure 2, we can see that the names of the brands were encoded in order to use this feature in the model.
Figure 3. Screenshot of the statistical analysis.
In figure 3, we can observe that the consumption data entries are not uniform; they could resemble a left-skewed bell-shaped distribution. In addition, we can also see the summary statistics computed for each feature.
In figure 4, we have negative and positive correlation values ranging from -1 to 1. In simple terms, a strong positive correlation means that the two parameters are positively proportional: if one increases, then the other also increases. On the other hand, a strong negative correlation means that if one parameter increases, then the other decreases. For example, if the number of cylinders increases, then the cubic inch displacement also increases, because the correlation is equal to 0.94.
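A correlation matrix like the one in figure 4 can be computed directly from the feature table; the sketch below uses pandas with toy values, not the project's dataset.

```python
# Pairwise Pearson correlations with pandas; toy values only.
import pandas as pd

df = pd.DataFrame({
    "cylinders":  [4, 4, 6, 8, 8],
    "cubic_inch": [97, 120, 225, 302, 350],
})
corr = df.corr()
# A value close to +1 means the two features rise together.
print(corr.loc["cylinders", "cubic_inch"])
```

Calling `df.corr()` on the full six-feature table yields the whole matrix shown in the figure.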
In figure 5, we can see that horsepower and cubic inch displacement have a positive correlation, shown by the near-linear trend. The colors show the consumption class, or label, associated with each data point. We can observe that higher miles-per-gallon values are concentrated where the displacement and the horsepower are low, and that the higher the cubic inch displacement and horsepower, the lower the miles per gallon.
Figure 6. All the graphical representations.
In figure 6, we can see all the graphs and the histograms. A correlation is basically the degree to which a pair of variables are linearly related, and correlation takes values between -1 and +1. In the figure above, the diagonal plots are perfect correlations of a variable against itself, and we will not concentrate on them. We can also see very high positive correlations and very high negative correlations in the plots. Cubic inch vs cylinders, cylinders vs horsepower, weight vs cylinders, cubic inch vs horsepower, cubic inch vs weight, and horsepower vs weight are some of the strongest positive correlations. Horsepower vs acceleration, and consumption vs cylinders, weight, and horsepower are some of the strongest negative correlations.
In figure 7, we can see that many boxplots are hard to read because they are not on the same scale. In addition, the variation of a feature with a wide range can appear larger than that of a feature with a small range. It can therefore be useful to standardize or normalize the data.
In figure 8, we can see that the x and y axes are normalized, with values ranging between 0 and 1. After normalizing the data using the min-max criterion, we plot another scatter plot of cubic inch displacement against horsepower. With the scaled data, the linearity of the distribution is clearer, and we can concretely establish from the points that the higher the cubic inch displacement and horsepower, the lower the miles per gallon of a car.
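The min-max normalization used here maps each feature to the range [0, 1]. A minimal sketch, assuming scikit-learn and using illustrative weight values:

```python
# Min-max scaling: (x - min) / (max - min) per feature.
from sklearn.preprocessing import MinMaxScaler

weights = [[1500.0], [2500.0], [3500.0]]  # toy car weights
scaled = MinMaxScaler().fit_transform(weights)
print(scaled.ravel().tolist())  # [0.0, 0.5, 1.0]
```

After scaling, features with very different units (weight in thousands, cylinders below ten) contribute on a comparable scale.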
Figure 9. Normalized boxplot graphs.
In figure 9, we can see that the box plots are visible and normalized. A boxplot, as seen in the figure, is mainly used to check the distribution of a feature and whether it contains any outliers. From the above boxplots, where every independent variable has been grouped with the consumption variable, we can note that even though there are a few values above and below some of the boxplots, there are no major outlier issues in the data.
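The outlier rule a boxplot applies (points beyond 1.5 times the interquartile range from the quartiles) can also be checked numerically; the values below are illustrative, not the project's data.

```python
# Flag outliers the way boxplot whiskers do: beyond 1.5 * IQR.
import numpy as np

x = np.array([15, 18, 20, 22, 25, 60])  # 60 is an obvious outlier
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [60]
```

Running this per feature is a quick numerical confirmation of what the boxplots show visually.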
Figure 10. PCA graph.
In figure 10, we can see the generated PCA graph, with each data entry colored depending on the consumption label it belongs to. Here we are trying to distinguish the structure of the different consumption values. From the results, it is not possible to see a clear separation between the consumption classes in the PCA plot. In figure 11, we can see the optimal PCA0 and PCA1 components, chosen to ensure the components contain the maximum amount of variance.
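A PCA projection like the one in figure 10 reduces the six features to two components. A minimal sketch, assuming scikit-learn; the data matrix below is a random stand-in for the car features.

```python
# Project a 6-feature dataset onto its two principal components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))   # 20 cars, 6 features (toy data)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape)                # (20, 2) -> the scatter-plot coordinates
```

The attribute `pca.explained_variance_ratio_` then reports how much of the total variance the two retained components capture.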
GAUSSIAN NAIVE BAYES:
The figures above were generated after running the Gaussian Naive Bayes model on the training dataset and predicting the fitted values on the test data. The figures are the best way of establishing whether there is a relationship between the actual values and the predicted values. We can see that there is a pattern and that the predicted values mirror what the real values portray.
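The fit-and-predict step described above can be sketched as follows, assuming scikit-learn; the toy features (cylinders, cubic inch) and MpG labels are illustrative, not the project data.

```python
# Minimal Gaussian Naive Bayes fit/predict sketch.
from sklearn.naive_bayes import GaussianNB

X_train = [[4, 90], [4, 95], [8, 300], [8, 310]]  # cylinders, cubic inch
y_train = [30, 30, 15, 15]                        # MpG class labels

model = GaussianNB().fit(X_train, y_train)
print(model.predict([[8, 305]]).tolist())  # [15]
```

The model learns a per-class Gaussian for each feature and predicts the class with the highest posterior probability.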
Figure 13. Screenshot of the confusion matrix.
After computing a confusion matrix that shows the true positives, the false positives, the false negatives, and the true negatives, this plot gives us a justification of the correctly predicted values and shows where errors occurred. The points that are not on the diagonal correspond to misclassified entries.
In figure 15, we can see the accuracy of the Gaussian Naive Bayes model, which is 51.02%, along with the precision and recall percentages. Consumption class 25 had the highest precision and recall, at 67%.
K-Means:
The figure above shows the predicted values from the K-means algorithm after running the model. Comparing the predicted values and the actual values, we can see a very large difference in the distributions. The predictions performed poorly. For example, where the model predicted 40 miles per gallon, the actual value is 15 miles per gallon (the green dots on the real consumption data).
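The K-means run can be sketched as below, assuming scikit-learn. Note that K-means only returns arbitrary cluster ids, not MpG values, which is one reason mapping its output to consumption labels is error-prone; the points are illustrative.

```python
# K-means groups unlabeled points; cluster ids are arbitrary.
from sklearn.cluster import KMeans

X = [[4, 90], [4, 95], [8, 300], [8, 310]]  # cylinders, cubic inch
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_.tolist()

# The two small engines share one cluster, the two large engines the
# other, but which id (0 or 1) each group gets is arbitrary.
print(labels[0] == labels[1], labels[2] == labels[3])
```

Any comparison against the true MpG labels therefore first needs a mapping from cluster ids to classes.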
Figure 17. Screenshot of the confusion matrix.
The accuracy for this model is 32.98%. For example, the first row represents the consumption of 10 MpG, and we can see that three 10 MpG data entries were wrongly predicted as 20 MpG labels.
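A confusion matrix and accuracy score like those reported for both models can be computed as sketched below, assuming scikit-learn; the true and predicted labels are made-up examples echoing the 10-vs-20 MpG error described above.

```python
# Confusion matrix: rows are true labels, columns are predictions.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [10, 10, 10, 20, 20, 25]
y_pred = [20, 20, 20, 20, 20, 25]  # three 10 MpG rows mispredicted as 20

cm = confusion_matrix(y_true, y_pred, labels=[10, 20, 25])
print(cm.tolist())                     # [[0, 3, 0], [0, 2, 0], [0, 0, 1]]
print(accuracy_score(y_true, y_pred))  # 0.5
```

Reading along a row shows where entries of one true class were sent, which is exactly how the 10 MpG row was interpreted above.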
Note: for both models, the training data as well as the test data have been normalized, and the same feature scaling was applied to both.
Discussion
The project aimed at comparing the following two methods: Bayes and K-means. Before teaching our machine the training data set, it was important to look at the data. Many intentional mistakes, which can represent real-life errors, were spotted and fixed to standardize the data. Furthermore, a column representing the car models was removed because we assumed that each car manufacturer will use the same technology and engines across its respective vehicles. Finally, all irregularities were removed from the training set and the test set.
Once the data was entered into Python, it was decided that the brands needed to be encoded to fit in our predictive model. It is difficult to use strings, so each car company received a number. In the end, we had small numbers below 100 and others above 1000. The data needed to be standardized in order to observe it clearly and to reduce computational power and time.
The training sets were used in Python to train the two models of interest: Bayes and K-Means. Bayes is a supervised model that is taught the labels of the cars: for each data entry, it was given the parameters associated with the MpG label. On the other hand, K-Means learned each data entry with its respective parameters without prior information about the label it belongs to. K-Means is clearly at a disadvantage against Bayes, but that is the purpose of the project. In the end, the Bayes prediction model had an accuracy of 51% compared to 32% for the K-Means model.
As a recommendation for further studies or analysis, the target variable could be further classified into high, medium, and low consumption, so that the models have a set of 3 outcomes to predict rather than the 8 consumption categories. This should help improve the accuracies even further.
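The suggested regrouping of the target into three classes can be sketched with pandas; the MpG values and bin edges below are illustrative choices, not taken from the project.

```python
# Bin the MpG target into three consumption classes.
import pandas as pd

mpg = pd.Series([10, 15, 20, 25, 30, 35, 40, 45])
classes = pd.cut(mpg, bins=[0, 18, 30, 50],
                 labels=["high", "medium", "low"])
# Low MpG means high fuel consumption, hence the reversed label order.
print(classes.tolist())
```

With only three target classes, each class gathers more training examples, which is why the models would be expected to score higher.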
Conclusion
Machine learning is a powerful tool used in industry and academia. It helps experts deduce useful insights and predict an output. In some cases, it can identify and classify data points such as potential clients on an e-commerce platform. These methods are widely used today by companies of all sizes.
The project aimed to compare two models and learn the difference between them. It is clear that if a dataset is provided with labels, then it is better to use this valuable data to build a better model. In practice, there is no reason to use the K-Means clustering method here, because we already have the MpG labels. This is why the Bayes model did a better job in this specific example. On the other
hand, it was interesting to observe how the K-Means model behaved and to think about its potential uses in cases where we would not have the associated labels. One recommendation would be to simplify the data even further in order to make the model more mistake-proof. It would be interesting to go back to the training set and record the exact engine models instead of the brand names, because in many cases companies use similar engines. For example, Stellantis is an automotive manufacturer that groups Fiat-Chrysler Automobiles and the PSA Group, so it is possible that they use the same technology; perhaps the technical engine name should be used in our case.