
École centrale de Nantes

M2 Industrial Engineering

Artificial Intelligence for Decision


Making in Industrial Engineering
MpG Consumption

BECHAALANI Christian & TAMAK Sundeep

Academic year 2021 – 2022


Instructor: DA-CUNHA Catherine
Due date: January 22, 2022

Abstract
The aim of this project is to use two machine learning methods: supervised and unsupervised learning. Machine learning allows experts to make well-informed decisions and identify new customers in the market. Many companies collect data, and some use this data to classify their clients. In this project, a supervised learning model known as Bayes' method will be used on a dataset of car consumption. It will try to identify the consumption class of a car based on its features: horsepower, number of cylinders, cubic inch displacement, weight, acceleration, and brand. Furthermore, an unsupervised model known as K-means clustering will be used; this method does not explicitly teach the computer the label, i.e. the MpG consumption class, that each data entry belongs to. This project aims to introduce the students to data mining, visualization, and classification. The methods will be used to identify the car consumption class: 15, 20, 25, 30, 35, 40, or 45.

Contents
Abstract ............................................................................................. 3
Introduction ....................................................................................... 4
Methods............................................................................................. 4
Results ............................................................................................... 5
Discussion ........................................................................................ 19
Conclusion........................................................................................ 20

Introduction
Today, huge sets of data can be analyzed by companies in order to give them marketing or business insights. It is important to collect data, but it is more important to use it. This is where machine learning can be helpful, because it relies on supervised or unsupervised learning to classify the data. In general, a machine learning model will start by identifying a hypothesis, create a cost function, minimize that cost function (for example with gradient descent), and hopefully obtain an accurate prediction model. The type of data collected plays an important role. Some data will be labelled and will make the task easier for the computer, while other data will be unlabelled and the computer will learn to classify it on its own. It must be noted that the computer will learn the model on a training set and will be evaluated on a smaller test set to determine its accuracy.

The K-means method will identify clusters and minimize the error margin to give accurate label predictions. This method starts without pre-defined labels but aims to teach the computer to predict and classify the information correctly. On the other hand, the Bayes method provides the computer with the labels and should, in theory, improve the prediction model. In practice, if a dataset is provided with labels, it is best to use them. In our case, we are comparing the two methods, so we will be testing and comparing their respective performance.

Methods
Excel will be used to fix some irregularities in the data. This is helpful to standardize the data and make sure that it all makes sense. The decisions taken here are based on human interpretation, before any machine learning. Finally, the machine learning methods and graphs will be applied and generated in Python.

Results

Table 1. Cleaned table of the Volkswagen cars.

Consumption Cylinders cubic_inch horsepower weight acceleration brand


20 4 121 76 2511 18 volkswagen
25 4 79 67 1963 15.5 volkswagen
25 4 90 71 2223 16.5 volkswagen
30 4 97 78 2190 14.1 volkswagen
45 4 90 48 2335 23.7 volkswagen
35 4 105 74 2190 14.2 volkswagen
45 4 97 52 2130 24.6 volkswagen
30 4 89 62 1845 15.3 volkswagen
30 4 90 70 1937 14 volkswagen
30 4 97 71 1825 12.2 volkswagen
30 4 97 78 1940 14.5 volkswagen
45 4 90 48 1985 21.5 volkswagen
35 4 105 74 1980 15.3 volkswagen
30 4 90 70 1937 14.2 volkswagen
30 4 89 71 1925 14 volkswagen
40 4 98 76 2144 14.7 volkswagen
45 4 90 48 2085 21.7 volkswagen
30 4 89 71 1990 14.9 volkswagen
25 4 97 46 1950 21 volkswagen

In table 1, we can see that all misspelled variants of the Volkswagen brand name were merged. The same was done for other irregularities. Furthermore, a stray "Hi" message in the test set was removed. Finally, the model column was deleted because we assumed that brands like Ford use similar engines and the same technology on all their models.

Figure 1. Screenshot of data entry with the brands.

In figure 1, we can see our target value, the consumption, and its six features: cylinders, cubic inch displacement, horsepower, weight, acceleration, and brand. It will be difficult for the machine to deal with the brand strings, so this categorical feature should be encoded into numbers.

Figure 2. Screenshot of data entry with the encoded brands.

In figure 2, we can see that the brand names were encoded in order to use this data feature in our numerical model.
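As an illustration, such an encoding could be done along these lines (a minimal sketch; the column name and sample values are our assumption, not the report's actual code):

```python
import pandas as pd

# Small sample frame standing in for the cleaned dataset (values illustrative).
df = pd.DataFrame({"brand": ["volkswagen", "ford", "volkswagen", "fiat"]})

# Map each distinct brand string to an integer code.
df["brand_code"] = df["brand"].astype("category").cat.codes

print(df)
```

Each brand receives one integer, so identical strings always map to the same code.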

Figure 3. Screenshot of the statistical analysis.

In figure 3, we can observe that the consumption data entries are not uniform; they could resemble a left-skewed bell-shaped distribution. In addition, we can also see the statistical analysis of the cylinders parameter.
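A statistical summary of this kind can be obtained in pandas with `describe()` (a sketch on an illustrative subset of values, not the full dataset):

```python
import pandas as pd

# Illustrative subset of the consumption and cylinders columns (values assumed).
df = pd.DataFrame({"consumption": [20, 25, 25, 30, 45, 35, 45, 30, 30, 30],
                   "cylinders": [4] * 10})

# Summary statistics comparable to the report's figure 3.
stats = df.describe()
print(stats.loc[["mean", "std", "min", "max"]])
```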

Figure 4. Screenshot of the correlation table.

In figure 4, we have negative and positive correlation values ranging from -1 to 1. In simple terms, a strong positive correlation means that the two parameters are positively proportional: if one increases, the other also increases. On the other hand, a strong negative correlation means that if one parameter increases, the other decreases. For example, if the number of cylinders increases then the cubic inch displacement also increases, because their correlation is equal to 0.94.
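A correlation table of this kind is one pandas call away (a sketch with toy values standing in for the real features):

```python
import pandas as pd

# Toy numeric columns standing in for the real features (values illustrative).
df = pd.DataFrame({
    "cylinders":  [4, 4, 6, 8, 8],
    "cubic_inch": [97, 90, 200, 302, 318],
})

# Pearson correlation matrix, as in the report's figure 4.
corr = df.corr()
print(corr)
```

The diagonal is always 1 (each variable against itself), matching the remark about figure 6 below.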

Figure 5. A graphical representation of horsepower vs. cubic inch.

In figure 5, we can see that horsepower and cubic inch displacement have a positive correlation, visible as a roughly linear trend. The colors show the consumption label associated with each data point. We can observe that the higher miles-per-gallon values are concentrated where displacement and horsepower are low, and that the higher the displacement and horsepower, the lower the miles per gallon.

Figure 6. All the graphical representations.

In figure 6, we can see all the graphs and the histograms. A correlation is, basically, the degree to which a pair of variables are linearly related, and it takes values between -1 and +1. In the figure above, the diagonal plots are perfect correlations of a variable against itself, and we will not concentrate on them. We can also see very high positive correlations and very high negative correlations in the plots. Cubic inch vs. cylinders, cylinders vs. horsepower, weight vs. cylinders, cubic inch vs. horsepower, cubic inch vs. weight, and horsepower vs. weight are some of the highly positively correlated pairs, with their plots tending to go up steadily. Horsepower vs. acceleration, and consumption vs. cylinders, weight, and horsepower, are some of the combinations that are highly negatively correlated.

Figure 7. Boxplot graphs.

In figure 7, we can see that many boxplots are hard to read because they are not on the same scale. In addition, a feature with a wide range can have a larger effect on the model than a feature with a small range. It is therefore useful to standardize/normalize the data.
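The min-max normalization used in the following figures can be sketched as below (the sample weights are illustrative, taken from table 1):

```python
import numpy as np

def min_max_scale(column):
    """Rescale a numeric column to the [0, 1] range."""
    col = np.asarray(column, dtype=float)
    return (col - col.min()) / (col.max() - col.min())

# A few Volkswagen weights from table 1, for illustration.
weights = [2511, 1963, 2223, 2190, 2335]
scaled = min_max_scale(weights)
print(scaled)
```

After scaling, every feature spans exactly [0, 1], so the boxplots of figure 9 become comparable.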

Figure 8. A graphical representation of the normalized parameters.

In figure 8, we can see that the x and y axes are normalized, with values ranging between 0 and 1. After normalizing the data using the min-max criterion, we plot another scatter plot of cubic inch displacement versus horsepower. With the scaled data, the linearity of the distribution is clearer, and we can establish from the points that the higher the displacement and horsepower, the lower the consumption in miles per gallon of a car.

Figure 9. Normalized boxplot graphs.

In figure 9, we can see that the box plots are now readable on a common, normalized scale. A boxplot, as seen in the figure, is mainly used to check the distribution of a feature and whether it contains any outliers. From the above boxplots, where every independent variable has been grouped with the consumption variable, we can note that even though a few values lie above and below some of the boxes, there is no major outlier issue in the data.

Figure 10. PCA graph.

In figure 10, we can see the generated PCA graph, with each data entry colored depending on the consumption label it belongs to. Here we are trying to distinguish the structure of the different consumption values. From the results, it is not possible to see a clear separation between the clusters in the PCA plot.
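Such a two-component projection can be produced with scikit-learn (a sketch using random stand-in data, since the real feature matrix is not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the six normalized features (30 rows, 6 columns).
rng = np.random.default_rng(0)
X = rng.random((30, 6))

# Project onto the two leading principal components (PCA0 and PCA1).
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print(X2.shape, pca.explained_variance_ratio_)
```

The explained-variance ratios quantify how much information the two retained components carry.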

Figure 11. Screenshot of the optimal PCA.

In figure 11, we can see the optimal PCA0 and PCA1, ensuring that the components contain the most relevant information with maximal variance.

GAUSSIAN NAIVE BAYES:

Figure 12. Screenshot of the Bayes model prediction.

The figures above were generated after running the Gaussian Naive Bayes model on the training dataset and predicting values on the test data. The figures are the best way of establishing whether there is a relationship between the actual and predicted values. We can see that there is a pattern: the predicted values mirror what the real values portray.
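The fit-and-predict step can be sketched as follows (synthetic data stands in for the car dataset, since the report's exact code and splits are not shown):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the car dataset: 6 features, a handful of classes.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train Gaussian Naive Bayes on the training split, predict on the test split.
model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.score(X_test, y_test))
```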

Figure 13. Screenshot of the confusion matrix.

Figure 14. Screenshot of the prediction relevance.

After computing a confusion matrix that showed us the true positives, false positives, false negatives, and true negatives, this plot gives us a view of the correctly predicted values and of where the errors occurred. The points that are not on the diagonal are the errors.

Figure 15. Screenshot of the accuracy indicators of the Bayes model.

In figure 15, we can see the accuracy of the Gaussian Naive Bayes model, which is 51.02%, as well as the precision and recall percentages. Consumption class 25 had the highest precision and recall, 67% and 53% respectively.
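The confusion matrix and the per-class precision/recall figures can be computed with scikit-learn (the label vectors below are illustrative, not the report's real predictions):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Illustrative true vs. predicted consumption labels.
y_true = [15, 20, 20, 25, 25, 25, 30, 30]
y_pred = [15, 20, 25, 25, 25, 30, 30, 30]

# Rows of the matrix are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, zero_division=0))
print(accuracy_score(y_true, y_pred))
```

Accuracy is simply the trace of the matrix (the diagonal of correct predictions) divided by the total count.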

K-Means:

Figure 16. Screenshot of the K-Means cluster model prediction.

The figure above shows the predicted values from the K-means algorithm after running the model. Comparing the predicted values with the actual values, we can see a very large difference in the distributions. The predictions performed poorly. For example, where the model predicted 40 miles per gallon, the real value is 15 miles per gallon (the green dots on the real consumption data).
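The clustering step can be sketched as below (random data stands in for the normalized feature matrix; note that K-means cluster indices are arbitrary and must still be matched to MpG classes before any accuracy can be computed, which is one reason the comparison is unfavorable):

```python
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for the normalized feature matrix (60 rows, 6 features).
rng = np.random.default_rng(1)
X = rng.random((60, 6))

# One cluster per consumption class listed in the abstract (7 classes).
km = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])
```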

Figure 17. Screenshot of the confusion matrix.

The accuracy for this model is 32.98%. For example, the first row represents the consumption of 10 MpG, and we can see that three 10 MpG data entries were wrongly predicted with the 20 MpG label.

Note: For both models, the training data as well as the test data were normalized, and the brand feature was encoded into integer values.

Discussion
The project aimed at comparing the following two methods: Bayes and K-means. Before teaching our machine the training data set, it was important to look at the data. Many intentional mistakes, representing real-life errors, were spotted and fixed to standardize the data. Furthermore, the column representing the car models was removed because we assumed that a car manufacturer will use the same technology and engines across its vehicles. Finally, all irregularities were removed from the training set and test set.

Once the data was entered into Python, it was decided that the brands needed to be encoded to fit into our predictive model. It is difficult to use strings, so each car company received a number. In the end, we had small numbers below 100 and others above 1000. This needed to be standardized in order to observe the data clearly and to reduce computational power and time.

The training sets were used in Python to teach the model the two methods of interest: Bayes and K-means. Bayes is a supervised model that is taught the labels of the cars: for each data entry, it learns the parameters associated with the MpG label. On the other hand, K-means learns each data entry with its respective parameters, without prior information about the label it belongs to. K-means is clearly at a disadvantage against Bayes, but that is the purpose of the project. In the end, the Bayes prediction model had an accuracy of 51% compared to 32% for the K-means model.

As a recommendation for further studies or analysis, the target variable could be further classified into high, medium, and low consumption, so that the models have a set of 3 outcomes to predict rather than the 8 consumption categories. This should help to improve the accuracies even further.

Conclusion
Machine learning is a powerful tool used in industry and academia. It helps experts deduce useful insights and predict an output. In some cases, it can identify and classify data points such as potential clients on an e-commerce platform. These methods are widely used today by companies such as YouTube, Amazon, and Netflix.

The project aimed to compare two models and learn the difference between them. It is clear that if a dataset is provided with labels, then it is better to use this valuable information to build a better model. In practice, there is no reason to use the K-means clustering method here, because we already have the MpG labels. This is why the Bayes model did a better job in this specific example. On the other hand, it was interesting to observe how the K-means model behaved and to think about its potential uses in cases where we would not have the associated labels. One recommendation would be to simplify the data even further in order to make the model more mistake-proof. It would be interesting to go back to the training set and determine the exact engine models instead of the brand names, because in many cases companies use similar engines. For example, Stellantis is an automotive manufacturing company that groups Fiat-Chrysler Automobiles and the PSA Group, so it is possible that they use the same technology; perhaps the technical engine name should be used in our case.
