
PROGRAM TITLE: Machine Learning

ASSIGNMENT NUMBER: Assignment

SUBMISSION DATE: 20/04/2022

DATE RECEIVED: 20/04/2022

TUTORIAL LECTURER: Nguyen Ngoc Tan

WORD COUNT: 9033

STUDENT NAME:

STUDENT ID:

MOBILE NUMBER:
Summative Feedback:

Internal verification:
A. INTRODUCTION

I work for BK, a software development company that produces client-server and web
applications. The company decided to expand its expertise in simulation software. Machine
learning is one of the disciplines that play an important role in this type of development. My
job is to research and investigate these new ventures and to train staff in preparation.

Contents
A. INTRODUCTION
B. CONTENTS
Part 3: Use Machine Learning To Determine Titanic Survivors
LO1 Analyse the theoretical foundation of machine learning to determine how an intelligent machine works
P1 Analyse the types of learning problems.
P2 Demonstrate the taxonomy of machine learning algorithms.
M1 Evaluate the category of machine learning algorithms with appropriate examples
D1 Critically evaluate why machine learning is essential to the design of intelligent machines.
LO2 Investigate the most popular and efficient machine learning algorithms used in industry
P3 Investigate a range of machine learning algorithms and how these algorithms solve the learning problems.
P4 Demonstrate the efficiency of these algorithms by implementing them using an appropriate programming language or machine learning tool.
M2 Analyse these algorithms using an appropriate example to determine their power.
LO3 Develop a machine learning application using an appropriate programming language or machine learning tool for solving a real-world problem
P5 Choose an appropriate learning problem and prepare the training and test data sets in order to implement a machine learning solution.
P6 Implement a machine learning solution with a suitable machine learning algorithm and demonstrate the outcome.
M3 Test the machine learning application using a range of test data and explain each stage of this activity.
D2 Critically evaluate the implemented learning solution and its effectiveness in meeting end user requirements.
LO4 Evaluate the outcome or the result of the application to determine the effectiveness of the learning algorithm used in the application
P7 Discuss whether the result is balanced, under-fitting or over-fitting.
P8 Analyse the result of the application to determine the effectiveness of the algorithm
M4 Evaluate the effectiveness of the learning algorithm used in the application.
C. REFERENCES:

B. CONTENTS

Part 3: Use Machine Learning To Determine Titanic Survivors

Our approach to this machine learning implementation will use the following steps:

1. Perform an exploratory data analysis to see which of the variables we might want to
include in our model

2. Examine the baseline model, which is based on a single variable (Sex) and yet achieves
an accuracy of about 77%. Any model we generate must score better than 0.77.

3. Create a decision tree model to see whether we can use multiple variables to yield a
higher score.

4. Create a model using AutoML tools.

5. Finally, we’ll compare the scores from each method, and analyze the efficacy of each
one.

1. How to Perform an Exploratory Data Analysis

The Titanic dataset provided by Kaggle is split into train and test files. The training file
contains a variable called Survived (indicating whether each passenger survived), which is our
target. After downloading the dataset, you can perform an automated Exploratory Data
Analysis (EDA) to get familiar with the available variables. We will rely on the pandas-
profiling library, as shown below:
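A minimal sketch of what that step might look like (the original code is not reproduced in this copy; the file and report names here are assumptions):

import pandas as pd
from pandas_profiling import ProfileReport

# Load the Kaggle training file and generate an automatic EDA report
train = pd.read_csv("train.csv")
profile = ProfileReport(train, title="Titanic EDA")
profile.to_file("titanic_eda.html")   # open the HTML file in a browser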
The report gives a general overview of the variables, including:

• Number of variables
• Missing values
• Cardinality
• Duplicate rows

For each numerical variable, you also get a histogram showing its distribution and how it
correlates with the other variables. The details provided for categorical variables include the
frequency of each category, as in the description of the Sex variable below:
Now that we know which variables are available, we can explore the data in detail in order to
find patterns that will help us define a useful model. Let’s start with plotting the relationship
between the Sex and Survived variables:
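One way to produce such a plot, sketched here with pandas and matplotlib (the column names follow the Kaggle file; the chart style in the original report may differ):

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")
# Survived is 0/1, so the group mean is the survival rate for each sex
train.groupby("Sex")["Survived"].mean().plot(kind="bar")
plt.ylabel("Survival rate")
plt.show()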

As you can see, more than 70% of female passengers survived, whereas less than 20% of their
male counterparts made it out alive. We can examine the ticket class (Pclass) versus
the Survived variable in the same way:
The contrast between the three classes is clear, as almost 60% of passengers with first-class
tickets survived. This could give us insight into the evacuation orders, or even tell us how the
lifeboats were loaded (with priority given to first-class passengers). We can also look at the
relationship between the port where the passengers embarked (S=Southampton, C=Cherbourg,
Q=Queenstown) and their survival:
Considering this set of variables, we can come up with various hypotheses about which of
them are more likely to be associated with survivors. For instance, a woman with a first-class
ticket who embarked at Cherbourg seems to have had a far greater chance of surviving than
a man with a third-class ticket who embarked at Southampton. Now, let's move on to our
models.

2. Examining the Baseline Model

The data for the competition includes a sample submission file that assumes all female
passengers survived. This is known as a baseline model, meaning it is the simplest model that
can be built from the data without requiring any deeper analysis beyond a quick check. In this
model, the proportion of female versus male survivors supports the hypothesis that gender is a
good predictor of survival.
The score for this baseline model is over 0.7, and any new model that we submit should have
a better score.

3. How to Create a Decision Tree Model

The first step in building a good model is to make sure we start with clean, workable data, so
we’ll need to work on the dataset a bit. Since Sex is important but only has two possible
values, we can transform M and F into numerical values using the scikit-learn preprocessing
class LabelEncoder, which assigns a unique integer to each label in the column of the DataFrame:
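A short sketch of that encoding step, assuming the training DataFrame is called train as in the earlier snippet:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Replace 'male'/'female' with integer codes (e.g. 0/1)
train["Sex"] = encoder.fit_transform(train["Sex"])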

Recall our hypothesis that a first-class woman from Cherbourg had a much better chance of
surviving than a third-class man from Southampton? Well, it can be modeled as a decision tree,
and you can train a classifier to make predictions based on this kind of analysis using
scikit-learn. The idea here is that the algorithm can infer a set of rules based on the features
passed as training data, and then apply those rules to make predictions when given new data:
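A sketch of what that training code might look like with scikit-learn (the feature choice follows the text below; filling missing Embarked values with "S" is an assumption about the original code):

from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Encode the remaining categorical feature and fill its few missing values
train["Embarked"] = LabelEncoder().fit_transform(train["Embarked"].fillna("S"))

features = ["Sex", "Pclass", "Embarked"]
X = train[features]
y = train["Survived"]

model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5)   # five train/test splits
print(scores)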

Here, we defined three features from the DataFrame to be used in training
the DecisionTreeClassifier instance:

• Sex

• Pclass

• Embarked

The cross_val_score function performs five iterations in which it selects some data for
training and some for testing. It then fits the DecisionTreeClassifier instance and
evaluates the results using the algorithm's default metric, which in this case is accuracy
(the number of correct predictions / total predictions made). The results were
better than the baseline model in every fold, so now we can train the model and predict
the results for the test dataset:
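Continuing the sketches above, the prediction and submission step could look roughly like this (column names follow the Kaggle files; the cast to int mirrors the note about the data type below):

test = pd.read_csv("test.csv")
# Apply the same transformations used on the training data
test["Sex"] = LabelEncoder().fit_transform(test["Sex"])
test["Embarked"] = LabelEncoder().fit_transform(test["Embarked"].fillna("S"))

model.fit(X, y)
predictions = model.predict(test[features]).astype(int)   # Kaggle expects integers

pd.DataFrame({"PassengerId": test["PassengerId"],
              "Survived": predictions}).to_csv("submission.csv", index=False)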
As you can see, our code loads the test dataset and performs the same transformations that we
used on the training data. It then makes predictions and saves the results in a CSV file,
keeping track of the data type of the prediction, since Kaggle evaluates as incorrect any
predictions with a different data type than the one used in training. The results are slightly
worse than they were with the single variable:
4. How to use AutoML Tools to Create a Model

We really need a deeper analysis to extract more information from the data. We also need to
experiment with the algorithms and the hyperparameters to properly tune the best
classification strategy. But that would be a lot of work, so instead let's give the automated
tooling a chance to see how much it can improve on our baseline model.

The team behind the MLBox project put together an analysis for the Titanic dataset that
includes full preprocessing, algorithm selection, hyperparameter tuning, training,
prediction, and even packaging the results for submission:
In the above code:

1. Step 1 simply uses a reader to load the training and test datasets.
2. Step 2 is the most involved, because it handles the selection process that drops unnecessary
variables and also deals with drifting variables. (A drifting variable changes its statistical
properties from the training dataset to the test dataset; see the linked reference for more
information.)

3. Step 3 optimizes the hyperparameters by setting a search space and fitting the selected
algorithm with the training data.

4. Step 4 performs the predictions and saves them in an mlbox.csv file.

5. Step 5 prepares the predictions for submission to Kaggle.

As you can see, the predictions made by the AutoML model were slightly better than the
baseline model. The lesson is clear: the automatic model was better parametrized, but it still
lacks the feature engineering that a human could contribute.

5. Conclusion: Kaggle’s Titanic Competition with ActivePython – a faster, simpler way to results
Kaggle's Titanic competition has been around for many years and currently has more than
160,000 entries! It's unlikely that our no-frills approach will give us winning results, but it
will get you much more familiar with the Kaggle platform, which is one of the best places for
learning machine learning.

That said, there is still much more you can do with the data provided by the Titanic dataset.
Our scores from the baseline model, the simple decision tree model, and the AutoML model are
acceptable, but they could be greatly improved by working with the features, algorithms, and
hyperparameters available in the Python machine learning libraries.

LO1 Analyse the theoretical foundation of machine learning to determine how an intelligent machine works
P1 Analyse the types of learning problems.
1. Linear Regression

Linear regression is perhaps one of the best-known and best understood algorithms in statistics
and machine learning.

Predictive modeling is primarily concerned with minimizing the model's errors, or making the
most accurate predictions possible, at the expense of explainability.

The representation of linear regression is an equation that describes the straight line that best
fits the relationship between the input variables (x) and the output variable (y), by finding
specific weightings for the input variables, called coefficients (B).

For example: y = B0 + B1 * x
We will predict y for a given input x, and the goal of the linear regression learning algorithm
is to find values for the coefficients B0 and B1.

Various techniques can be used to learn the linear regression model from data, such as a linear
algebra solution for ordinary least squares and gradient descent optimization.

2. Logistic Regression

Logistic regression is another algorithm borrowed by machine learning from the field of
statistics. This is the best method for binary classification problems (problems with two value
classes).

Logistic regression is like linear regression whose aim is to find values for the coefficients that
weight each input variable. Unlike linear regression, the output prediction is transformed using
a non-linear function called the logistic function.

The logistic function looks like a big S and will transform any value into the range 0-1. This is
useful because we can apply a rule to the output of the logistic function to snap values to 0 and
1 (e.g., IF less than 0.5 THEN output 0) and predict a class value.
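For illustration, a tiny sketch of that transformation and thresholding rule (the coefficient values here are made up):

import math

def logistic(z):
    # Squashes any real value into the range 0..1 (the S-shaped curve)
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = -1.5, 0.8           # example coefficients (assumed, for illustration only)
x = 2.0
p = logistic(b0 + b1 * x)    # predicted probability of class 1
prediction = 1 if p >= 0.5 else 0
print(p, prediction)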
Because of the way the model is learned, the predictions made by logistic regression can also
be used as the probability that a given data instance belongs to class 0 or class 1. This can be
useful for problems where you need to give more rationale for a prediction.

3. Linear Discriminant Analysis

Logistic regression is a classification algorithm traditionally limited to two-class classification
problems. If you have more than two classes, then the Linear Discriminant Analysis (LDA)
algorithm is the preferred linear classification technique.

The representation of LDA is quite simple. It consists of statistical properties of your data,
calculated for each class. For a single input variable, this includes:

1. The mean value for each class.

2. The variance calculated across all classes.


Predictions are made by computing a discriminant value for each class and predicting the class
with the largest value. The technique assumes that the data has a Gaussian distribution (bell
curve), so it is a good idea to remove outliers from your data first. It is a simple and powerful
method for classification predictive modeling problems.
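A brief scikit-learn sketch of LDA on a multi-class problem (the built-in Iris data is used here purely as an illustration):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)   # three classes, four numeric features
lda = LinearDiscriminantAnalysis()
print(cross_val_score(lda, X, y, cv=5).mean())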

4. Classification and regression trees

Decision trees are an important type of algorithm for predictive machine learning modeling.

The representation of the decision tree model is a binary tree. This is your binary tree from
algorithms and data structures, nothing too fancy. Each node represents a single input variable
(x) and a split point on that variable (assuming the variable is numeric).

The leaf nodes of the tree contain an output variable (y) that is used for prediction. Predictions
are made by traversing the branches of the tree until reaching a leaf node and giving the class
of values at that leaf node.

Trees can learn very quickly and can be used to make predictions very quickly. They are often
accurate for many types of problems and your data does not require any special preparation.

5. K – Nearest Neighbors
The KNN algorithm is very simple and very efficient. The representative model for KNN is
the entire training data. Simple isn't it?

Predictions are made for a new data point by searching through the entire training set for the K
most similar examples (the neighbors) and summarizing the output variable of those K examples.
For a regression problem, this can be the mean output variable; for a classification problem, it
can be the mode (the most common class value).

The simplest technique, if your attributes are all on the same scale (all in inches, for example),
is to use the Euclidean distance, a number you can calculate directly from the differences
between each input variable.

KNN may require a lot of memory or space to store all of the data, but it only performs
computation (or learns) when a prediction is needed, just in time. You can also update and
curate your training instances over time to keep predictions accurate.

P2 Demonstrate the taxonomy of machine learning algorithms.


Classification of Machine Learning
At a broad level, machine learning can be classified into three types:

1. Supervised learning
2. Unsupervised learning
3. Reinforcement learning

1) Supervised Learning

Supervised learning is a type of machine learning method in which we provide sample labeled
data to the machine learning system in order to train it, and on that basis, it predicts the output.

The system creates a model using labeled data to understand the datasets and learn about each
one. Once the training and processing are done, we test the model by providing sample data to
check whether it predicts the correct output.

The goal of supervised learning is to map input data to output data. Supervised learning is
based on supervision, in the same way that a student learns under the supervision of a teacher.
An example of supervised learning is spam filtering.

Supervised learning can be grouped further in two categories of algorithms:

o Classification
o Regression

2) Unsupervised Learning

Unsupervised learning is a learning method in which a machine learns without any supervision.

The training is provided to the machine with the set of data that has not been labeled, classified,
or categorized, and the algorithm needs to act on that data without any supervision. The goal
of unsupervised learning is to restructure the input data into new features or a group of objects
with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find useful
insights from the huge amount of data. It can be further classified into two categories of
algorithms:

o Clustering
o Association

3) Reinforcement Learning

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a
reward for each right action and a penalty for each wrong action. The agent learns
automatically from this feedback and improves its performance. In reinforcement learning,
automatically with these feedbacks and improves its performance. In reinforcement learning,
the agent interacts with the environment and explores it. The goal of an agent is to get the most
reward points, and hence, it improves its performance.

A robotic dog that automatically learns the movement of its limbs is an example of
reinforcement learning.

M1 Evaluate the category of machine learning algorithms with appropriate examples


Practice with real examples

We will practice right away with a dataset of flower attributes.
The input is a CSV file with 6 columns: the first column is the index, the middle 4 columns are
the values of each attribute, and the last column is the name of the flower.

Our requirement is to use this data to predict the name of a flower from its attribute values,
based on their similarity to the known examples.

First, I will import some necessary modules:
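The import section is not shown in this copy of the report; a plausible minimal version, based on the modules mentioned below (csv, math and numpy), would be:

import csv
import math
import numpy as np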

I will go through each function for everyone to follow

We will first read the input from the CSV file:


I will use the CSV module to format the data read in, then convert it to a matrix using numpy
for easy processing.

Some preprocessing operations include: delete the first row containing the header, delete the
first column (the ordinal index), and then use numpy.random's shuffle method to shuffle the
data. The reason for shuffling is that we will then take the last 50 rows as test data.
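A sketch of that loading and preprocessing function, following the description above (the file name flowers.csv and the details of the original code are assumptions):

def load_dataset(filename="flowers.csv"):
    with open(filename) as f:
        rows = list(csv.reader(f))
    data = np.array(rows)
    data = data[1:, 1:]          # drop the header row and the index column
    np.random.shuffle(data)      # shuffle so the test split is random
    train_set = data[:-50]       # everything except the last 50 rows
    test_set = data[-50:]        # the last 50 rows are kept for testing
    return train_set, test_set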

Next we will build the distance function:

It calculates the distance between two input points using the Euclidean formula: simply iterate
over all the attributes of each point, compute the sum of the squared differences of each
attribute, and finally return the square root of that sum. (If you find this difficult to follow,
you can review the formula above.)
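A sketch of that distance function, continuing the code above (four numeric attribute columns per point, with the label in the last column):

def calc_distance(point_a, point_b):
    # Euclidean distance over the four attribute columns
    total = 0.0
    for i in range(4):
        total += (float(point_a[i]) - float(point_b[i])) ** 2
    return math.sqrt(total)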

Next is the function to find the nearest k data points:


This function iterates through all the values in the training set, calculating the distance between
the input point and each point in the original data set. The result of this loop is a list of
dictionaries containing the name of the label (the name of the flower) and the distance to that
point. Next, we sort this list in ascending order by distance. Since in the end we only need the
names of the k closest flowers, we add a loop to build a list of labels in the same order. Finally,
we return the first k entries of the list (the smallest distances).
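A possible version of that function, following the steps just described:

def get_k_neighbors(train_set, test_point, k):
    distances = []
    for row in train_set:
        distances.append({"label": row[-1],
                          "distance": calc_distance(test_point, row)})
    distances.sort(key=lambda d: d["distance"])   # ascending by distance
    return [d["label"] for d in distances[:k]]    # the k closest labels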

And the last function is to find the most common flower among the k species found:

We have the list labels as a collection of labels, then cycle through those to find the one that
occurs most often.
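A simple sketch of that majority-vote step:

def most_common_label(labels):
    # Count how often each flower name occurs and return the most frequent one
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)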

Finally, we'll iterate over the values in the test set and run the lookup for each one:
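Putting the pieces together, the evaluation loop might look like this (k = 5 is an assumption; the report does not state the value used):

train_set, test_set = load_dataset()
k = 5
for test_point in test_set:
    neighbors = get_k_neighbors(train_set, test_point, k)
    predicted = most_common_label(neighbors)
    print("actual:", test_point[-1], "-> predicted:", predicted)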
And this is the result:

You can see the results are relatively accurate. And to calculate the accuracy, I can add a
counter variable like this:
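For example, a counter of correct predictions divided by the size of the test set, along these lines:

correct = 0
for test_point in test_set:
    neighbors = get_k_neighbors(train_set, test_point, k)
    if most_common_label(neighbors) == test_point[-1]:
        correct += 1
print("accuracy:", correct / len(test_set))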

The accuracy result will be ~0.9

Conclusion:
In this section, we have learned about the KNN algorithm at its most basic level. This example
should make it easier to approach the KNN algorithm.

D1 Critically evaluate why machine learning is essential to the design of intelligent machines.
Why is machine learning important?

The reason machine learning is important is that it provides businesses with insight into trends
in customer behavior and business models, and aids the development of new products. Many
leading companies today, such as Facebook, Google, and Uber, make machine learning a
central part of their operations. Machine learning has become a significant competitive
differentiator for many companies.

Perhaps one of the most famous examples of machine learning in action is Facebook's news
feed-powered recommendation engine.

Facebook uses machine learning to personalize how each member's feed is delivered. If a
member frequently stops to read a particular group's posts, the recommendation engine will
start showing more of that group's activity earlier in the feed.

Behind the scenes, the engine is trying to reinforce known patterns in members' online
behavior. If a member changes that behavior and doesn't read posts from that group in the
coming weeks, the news feed will adjust accordingly.

In addition to recommendation engines, other uses for machine learning include:

• Customer relationship management. CRM software can use machine learning
models to analyze emails and prompt sales team members to respond
to the most important messages first. More advanced systems can even suggest
potentially effective responses.
• Business intelligence. BI and analytics vendors use machine learning in their
software to identify potentially important data points, patterns of data points and
outliers.
• Human resource information system. The HRIS system can use a machine
learning model to filter through applications and identify the best candidates for
an open position.
• Self-driving cars. Machine learning algorithms could even make it possible for
a semi-autonomous vehicle to recognize a partially visible object and alert the
driver.
• Virtual assistant. Intelligent assistants typically combine supervised and
unsupervised machine learning models to interpret natural speech and provide
context.

LO2 Investigate the most popular and efficient machine learning algorithms used in industry
P3 Investigate a range of machine learning algorithms and how these algorithms solve the
learning problems.

Grouping machine learning algorithms

There are basically two ways to group the Machine Learning algorithms that you may
come across in the field.

• The first is to group ML algorithms by learning style.

• The second is to group ML algorithms by similarity in form or function.


In general, both approaches are useful; here, however, we will focus on grouping ML
algorithms by similarity.

1. Machine learning algorithms are grouped by learning style

Basically, there are different ways an algorithm can model a problem, based on its interaction
with the experience or environment, or whatever we want to call the input data.

This way of organizing machine learning algorithms is useful because it forces you to think
about the role of the input data and the model preparation process, and to select the approach
that best suits your problem in order to get the best results.

Let's take a look at three different learning styles in machine learning algorithms:

a. Supervised Learning

In this supervised algorithm, the input data is called training data and has a label or known
result like spam/non-spam or stock price at a time.

In it, a model is prepared through a training process in which it is required to make
predictions and is corrected when those predictions are wrong. The training process continues
until the model reaches the desired level of performance.

• Example problems are classification and regression.


• Example algorithms include logistic regression and Neural Network back-
propagation.

b. Unsupervised learning

In this unsupervised learning, the input data is unlabelled and there is no known outcome.

A model is prepared by inferring the structures present in the input data. This may be to
extract general rules, or it may be done through a mathematical process to systematically reduce redundancy.

• Example problems are clustering, dimensionality reduction, and association rule learning

• Example algorithms include Apriori and k-Means algorithms.

c. Semi-supervised learning
The input data is a mixture of labeled and unlabelled examples.

There is a desired prediction problem, but the model must learn the structures to organize the
data as well as make the predictions.

• Example problems are classification and regression.

• Example algorithms are extensions of other flexible methods that make assumptions about
how to model the unlabelled data.

2. Machine learning algorithms are grouped by similarity

ML algorithms are often grouped according to a similarity in their functionality.

For example, tree-based methods and neural network-inspired methods.

I think this is the most useful way to group machine learning algorithms, and it's the
approach we're going to use here.

We can handle such cases by listing an algorithm twice, or by subjectively choosing the group
that fits best. I like the latter approach of not duplicating algorithms, to keep things simple.

a. Regression Algorithm
Regression algorithms are concerned with modeling the relationship between variables, which
is iteratively refined using a measure of error in the predictions made by the model.

The most popular regression algorithms in Machine Learning are:

• Ordinary Least Squares Regression (OLSR)


• Linear Regression
• Logistic Regression
• Stepwise Regression
• Multivariate Adaptive Regression Splines (MARS)
• Locally Estimated Scatterplot Smoothing (LOESS)

b. Instance-based algorithm
Instance-based learning models a decision problem using the instances of training data that are
deemed important or required by the model.

Such methods typically build up a database of example data and compare new data to the
database using a similarity measure in order to find the best match and make a prediction. For
this reason, instance-based methods are also referred to as winner-take-all methods and
memory-based learning. The focus is put on the representation of the stored instances and the
similarity measures used between instances.

The most popular instance-based algorithms in Machine Learning are:

• k-Nearest Neighbor (kNN)


• Learning Vector Quantization (LVQ)
• Self-Organizing Map (SOM)
• Locally Weighted Learning (LWL)

c. Regularization algorithm

These are extensions made to other methods (typically regression methods) that penalize models
based on their complexity, favoring simpler models that are also better at generalizing.

I have listed regularization algorithms here because they are popular, powerful, and generally
simple modifications made to other methods.
The most popular Regularization algorithms in Machine Learning are:

• Ridge Regression
• Least Absolute Shrinkage and Selection Operator (LASSO)
• Elastic Net
• Least-Angle Regression (LARS)

d. Decision Tree Algorithm

The decision tree method builds a model of decisions based on the actual values of the
attributes in the data.

Decisions fork in the tree structure until a prediction is made for a given record. Decision
trees are trained on data for classification and regression problems. They are often fast
and accurate and are a big favorite in machine learning.

The most popular decision tree algorithms in Machine Learning are:

• Classification and Regression Tree (CART)


• Iterative Dichotomiser 3 (ID3)
• C4.5 and C5.0 (different versions of a powerful approach)
• Chi-squared Automatic Interaction Detection (CHAID)
• Decision Stump
• M5
• Conditional Decision Trees

P4 Demonstrate the efficiency of these algorithms by implementing them using an appropriate programming language or machine learning tool.
Linear Regression (Python Implementation)

Simple Linear Regression

Simple linear regression is an approach for predicting a response using a single feature.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function
that predicts the response value(y) as accurately as possible as a function of the feature or
independent variable(x).
Let us consider a dataset where we have a value of response y for every feature x:

For generality, we define:


x as feature vector, i.e x = [x_1, x_2, …., x_n],
y as response vector, i.e y = [y_1, y_2, …., y_n]
for n observations (in above example, n=10).
A scatter plot of the above dataset looks like:
Now, the task is to find a line that fits best in the above scatter plot so that we can predict
the response for any new feature values. (i.e a value of x not present in a dataset)
This line is called a regression line.
The equation of the regression line is represented as:

h(x_i) = b_0 + b_1 * x_i

Here,

• h(x_i) represents the predicted response value for ith observation.


• b_0 and b_1 are regression coefficients and represent y-intercept and slope of
regression line respectively.
To create our model, we must “learn” or estimate the values of regression coefficients b_0
and b_1. And once we’ve estimated these coefficients, we can use the model to predict
responses!
In this article, we are going to use the principle of Least Squares.
Now consider:

e_i = y_i - h(x_i) = y_i - (b_0 + b_1 * x_i)

Here, e_i is the residual error in the ith observation.
So, our aim is to minimize the total residual error.
We define the squared error or cost function, J, as:

J(b_0, b_1) = (1 / (2n)) * Σ e_i²
and our task is to find the value of b_0 and b_1 for which J(b_0,b_1) is minimum!
Without going into the mathematical details, we present the result here:

b_1 = SS_xy / SS_xx,   b_0 = mean(y) - b_1 * mean(x)

where SS_xy is the sum of cross-deviations of y and x:

SS_xy = Σ (x_i - mean(x)) * (y_i - mean(y))

and SS_xx is the sum of squared deviations of x:

SS_xx = Σ (x_i - mean(x))²

Code: Python implementation of above technique on our small dataset
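The code itself is not reproduced in this copy; a sketch consistent with the formulas above would be the following (the ten data points are illustrative, not necessarily the original dataset):

import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    n = np.size(x)
    m_x, m_y = np.mean(x), np.mean(y)
    # Equivalent computational forms of the deviation sums defined above
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return b_0, b_1

x = np.arange(10)                              # assumed example data, n = 10
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
b_0, b_1 = estimate_coef(x, y)
print("b_0 =", b_0, " b_1 =", b_1)

plt.scatter(x, y)
plt.plot(x, b_0 + b_1 * x, color="g")          # regression line
plt.xlabel("x")
plt.ylabel("y")
plt.show()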


Output:
And graph obtained looks like this:

M2 Analyse these algorithms using an appropriate example to determine their power.


Supervised Learning

Supervised learning is an approach that predicts the output (outcome) of a new data point (new
input) based on previously known (input, outcome) pairs. These pairs are also called (data,
label). Supervised learning is the most common group of algorithms in machine learning.

In supervised learning we have a set of input variables X = {x1, x2, …, xN} and a set of
corresponding labels Y = {y1, y2, …, yN}, where xi and yi are vectors. The pairs of known data
(xi, yi) ∈ X × Y are called the training data set. From this training data set, we need to create a
function that maps each element of the set X to a corresponding (approximate) element of the
set Y:

yi ≈ f(xi), ∀i = 1, 2, …, N

The goal is to approximate the function f well enough that, when we have a new data point x,
we can compute the corresponding label y = f(x).
Example: handwritten digit recognition. We have thousands of example images of each digit
written by many different people. We feed these images into the algorithm and show it the
label of each one, so that it learns a function mapping an image to a digit. When it then
receives a new image that the model has never seen, it can predict which digit the image contains.

The above example is very similar to the way people learn as children. We show a child the
alphabet and point out which letter is A and which is B. After being taught many times, the
child can recognize the letters A and B even in books they have never seen.

Similarly, there are many algorithms for detecting faces in an image. Facebook uses such
algorithms to identify faces in photos and suggest that users tag their friends.

LO3 Develop a machine learning application using an appropriate programming language or machine learning tool for solving a real-world problem
P5 Choose an appropriate learning problem and prepare the training and test data sets in order
to implement a machine learning solution.

Data Preparation Process


The more disciplined you are in your handling of data, the more consistent and better the results
you are likely to achieve. The process for getting data ready for a machine learning
algorithm can be summarized in three steps:

• Step 1: Select Data


• Step 2: Preprocess Data
• Step 3: Transform Data
You can follow this process in a linear manner, but it is very likely to be iterative with many
loops.

Step 1: Select Data

This step is concerned with selecting the subset of all available data that you will be working
with. There is always a strong desire to include all the data that is available, on the assumption
that the maxim “more is better” will hold.

You need to consider what data you actually need to address the question or problem you are
working on. Make some assumptions about the data you require and be careful to record those
assumptions so that you can test them later if needed.

Below are some questions to help you think through this process:

• What is the extent of the data you have available? For example through time, database
tables, connected systems. Ensure you have a clear picture of everything that you can
use.
• What data is not available that you wish you had available? For example data that is
not recorded or cannot be recorded. You may be able to derive or simulate this data.
• What data don’t you need to address the problem? Excluding data is almost always
easier than including data. Note down which data you excluded and why.
It is only in small problems, like competition or toy datasets, that the data has already been
selected for you.

Step 2: Preprocess Data

After you have selected the data, you need to consider how you are going to use it. This
preprocessing step is about getting the selected data into a form that you can work with.
Three common data preprocessing steps are formatting, cleaning and sampling:

• Formatting: The data you have selected may not be in a format that is suitable for you
to work with. The data may be in a relational database and you would like it in a flat
file, or the data may be in a proprietary file format and you would like it in a relational
database or a text file.
• Cleaning: Cleaning data is the removal or fixing of missing data. There may be data
instances that are incomplete and do not carry the data you believe you need to address
the problem. These instances may need to be removed. Additionally, there may be
sensitive information in some of the attributes and these attributes may need to be
anonymized or removed from the data entirely.
• Sampling: There may be far more selected data available than you need to work with.
More data can result in much longer running times for algorithms and larger
computational and memory requirements. You can take a smaller representative sample
of the selected data that may be much faster for exploring and prototyping solutions
before considering the whole dataset.
It is very likely that the machine learning tools you use on the data will influence the
preprocessing you will be required to perform.

Step 3: Transform Data

The final step is to transform the process data. The specific algorithm you are working with
and the knowledge of the problem domain will influence this step and you will very likely have
to revisit different transformations of your preprocessed data as you work on your problem.

Three common data transformations are scaling, attribute decompositions and attribute
aggregations. This step is also referred to as feature engineering.

• Scaling: The preprocessed data may contain attributes with a mixture of scales for
various quantities such as dollars, kilograms and sales volume. Many machine learning
methods like data attributes to have the same scale such as between 0 and 1 for the
smallest and largest value for a given feature. Consider any feature scaling you may
need to perform.
• Decomposition: There may be features that represent a complex concept that may be
more useful to a machine learning method when split into the constituent parts. An
example is a date that may have day and time components that in turn could be split out
further. Perhaps only the hour of day is relevant to the problem being solved. Consider
what feature decompositions you can perform.
• Aggregation: There may be features that can be aggregated into a single feature that
would be more meaningful to the problem you are trying to solve. For example, there
may be data instances for each time a customer logged into a system that could be
aggregated into a count of the number of logins, allowing the additional instances to be
discarded. Consider what types of feature aggregation you could perform.
You can spend a lot of time engineering features from your data and it can be very beneficial
to the performance of an algorithm. Start small and build on the skills you learn.
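As a small illustration of the scaling and decomposition ideas above, sketched with pandas and scikit-learn (the column names and values are invented for the example):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "price_usd": [12.0, 250.0, 99.5],
    "weight_kg": [0.2, 5.0, 1.1],
    "order_date": pd.to_datetime(["2022-01-03 09:15",
                                  "2022-01-04 17:40",
                                  "2022-01-05 08:05"]),
})

# Scaling: bring numeric attributes onto a common 0-1 range
df[["price_usd", "weight_kg"]] = MinMaxScaler().fit_transform(df[["price_usd", "weight_kg"]])

# Decomposition: split a date into components; perhaps only the hour matters
df["order_hour"] = df["order_date"].dt.hour

print(df)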

P6 Implement a machine learning solution with a suitable machine learning algorithm and
demonstrate the outcome.
How to Implement a Machine Learning Algorithm

Implementing a machine learning algorithm in code can teach you a lot about the algorithm
and how it works.

In this article, you will learn how to effectively implement machine learning algorithms and
how to maximize your learning from these projects.

Benefits of Implementing Machine Learning Algorithms

You can use the implementation of machine learning algorithms as a strategy for learning about
applied machine learning. You can also carve out a niche and skills in algorithm
implementation.

Algorithm Understanding

Implementing a machine learning algorithm will give you a deep and practical appreciation for
how the algorithm works. This knowledge can also help you to internalize the mathematical
description of the algorithm by thinking of the vectors and matrices as arrays and the
computational intuitions for the transformations on those structures.
Practical Skills

You are developing valuable skills when you implement machine learning algorithms by hand.
Skills such as mastery of the algorithm, skills that can help in the development of production
systems and skills that can be used for classical research in the field.

Three examples of skills you can develop are listed below:

• Mastery: Implementation of an algorithm is the first step towards mastering the
algorithm. You are forced to understand the algorithm intimately when you implement
it. You are also creating your own laboratory for tinkering to help you internalize the
computation it performs over time, such as by debugging and adding measures for
assessing the running process.
• Production Systems: Custom implementations of algorithms are typically required for
production systems because of the changes that need to be made to the algorithm for
efficiency and efficacy reasons. Better, faster, less resource-intensive results ultimately
can lead to lower costs and greater revenue in business, and implementing algorithms
by hand helps you develop the skills to deliver these solutions.
• Literature Review: When implementing an algorithm you are performing research.
You are forced to locate and read multiple canonical and formal descriptions of the
algorithm. You are also likely to locate and code review other implementations of the
algorithm to confirm your understandings. You are performing targeted research, and
learning how to read and make practical use of research publications.
Process

You can use the process outlined below.

1. Select programming language: Select the programming language you want to use for
the implementation. This decision may influence the APIs and standard libraries you
can use in your implementation.
2. Select Algorithm: Select the algorithm that you want to implement from scratch. Be
as specific as possible. This means not only the class, and type of algorithm, but also
go as far as selecting a specific description or implementation that you want to
implement.
3. Select Problem: Select a canonical problem or set of problems you can use to test and
validate your implementation of the algorithm. Machine learning algorithms do not
exist in isolation.
4. Research Algorithm: Locate papers, books, websites, libraries and any other
descriptions of the algorithm you can read and learn from. Although, you ideally want
to have one keystone description of the algorithm from which to work, you will want
to have multiple perspectives on the algorithm. This is useful because the multiple
perspectives will help you to internalize the algorithm description faster and overcome
roadblocks from any ambiguities or assumptions made in the description (there are
always ambiguities in algorithm descriptions).
5. Unit Test: Write unit tests for each function, even consider test driven development
from the beginning of the project so that you are forced to understand the purpose and
expectations of each unit of code before you implement them.
Extensions

Once you have implemented an algorithm you can explore making improvements to the
implementation. Some examples of improvements you could explore include:

• Experimentation: You can expose many of the micro-decisions you made in the
algorithms implementation as parameters and perform studies on variations of those
parameters. This can lead to new insights and disambiguation of algorithm
implementations that you can share and promote.
• Optimization: You can explore opportunities to make the implementation more
efficient by using tools, libraries, different languages, different data structures, patterns
and internal algorithms. Knowledge you have of algorithms and data structures for
classical computer science can be very beneficial in this type of work.
• Specialization: You may explore ways of making the algorithm more specific to a
problem. This can be required when creating production systems and is a valuable skill.
Making an algorithm more problem specific can also lead to increases in efficiency
(such as running time) and efficacy (such as accuracy or other performance measures).
• Generalization: Opportunities can be created by making a specific algorithm more
general. Programmers (like mathematicians) are uniquely skilled in abstraction and you
may be able to see how the algorithm could be applied to more general cases of a class
of problem or other problems entirely.
Limitations

You can learn a lot by implementing machine learning algorithms by hand, but there are also
some downsides to keep in mind.

• Redundancy: Many algorithms already have implementations, some very robust
implementations that have been used by hundreds or thousands of researchers and
practitioners around the world. Your implementation may be considered redundant, a
duplication of effort already invested by the community.
• Bugs: New code that has few users is more likely to have bugs, even with a skilled
programmer and unit tests. Using a standard library can reduce the likelihood of having
bugs in the algorithm implementation.
• Non-intuitive Leaps: Some algorithms rely on non-intuitive jumps in reasoning or
logic because of the sophisticated mathematics involved. It is possible that an
implementation which does not appreciate these leaps will be limited or even incorrect.
Example Projects

In this post I want to make some suggestions for intuitive algorithms from which you might
like to select your first machine learning algorithm to implement from scratch.

• Ordinary Least Squares Linear Regression: Use two-dimensional data sets and
model y from x. Print out the error for each iteration of the algorithm. Consider plotting
the line of best fit and predictions for each iteration of the algorithm to see how the
updates affect the model.
• k-Nearest Neighbor: Consider using two dimensional data sets with 2 classes even
ones that you create with graph paper so that you can plot them. Once you can plot and
make predictions, you can plot the relationships created for each prediction decision the
model makes.
• Perceptron: Considered the simplest artificial neural network model and very similar
to a regression model. You can track and graph the performance of the model as it learns
a dataset.
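As an illustration of the "implement it from scratch" idea, here is a minimal perceptron sketch (the data and learning-rate values are arbitrary, chosen only for demonstration):

# Perceptron update rule: w <- w + lr * (target - predicted) * x, repeated over the data
def predict(row, weights):
    activation = weights[0] + sum(w * x for w, x in zip(weights[1:], row[:-1]))
    return 1 if activation >= 0.0 else 0

def train_perceptron(data, lr=0.1, epochs=20):
    weights = [0.0] * len(data[0])            # bias + one weight per feature
    for epoch in range(epochs):
        errors = 0
        for row in data:
            error = row[-1] - predict(row, weights)
            errors += abs(error)
            weights[0] += lr * error          # update the bias term
            for i in range(len(row) - 1):
                weights[i + 1] += lr * error * row[i]
        print("epoch", epoch, "misclassified:", errors)  # track learning progress
    return weights

# Tiny linearly separable 2-D example: [x1, x2, class]
data = [[1.0, 1.0, 0], [2.0, 1.5, 0], [4.0, 4.5, 1], [5.0, 4.0, 1]]
print(train_perceptron(data))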
Summary

In this article, you learned the benefits of implementing machine learning algorithms by hand.
You have learned that you can understand an algorithm, improve, and develop valuable skills
by following this path.
You've learned a simple process that you can follow and customize when implementing
multiple algorithms from scratch, and you've learned about three algorithms that you can
choose as your first algorithm to implement from scratch.

M3 Test the machine learning application using a range of test data and explain each stage
of this activity.
7 steps to building a machine learning model.

Step 1. Understand the business problem (and define success)

To start, work with the owner of the project and make sure you understand the project's
objectives and requirements. The goal is to convert this knowledge into a suitable problem
definition for the machine learning project and devise a preliminary plan for achieving the
project's objectives. Key questions to answer include the following:

• What's the business objective that requires a cognitive solution?

• What parts of the solution are cognitive, and what aren't?

• Have all the necessary technical, business and deployment issues been addressed?

• What are the defined "success" criteria for the project?

• How can the project be staged in iterative sprints?

• Are there any special requirements for transparency, explainability or bias reduction?

• What are the ethical considerations?

• What are the acceptable parameters for accuracy, precision and confusion matrix
values?

• What are the expected inputs to the model and the expected outputs?

• What are the characteristics of the problem being solved? Is this a classification,
regression or clustering problem?
• What is the "heuristic" -- the quick-and-dirty approach to solving the problem that
doesn't require machine learning? How much better than the heuristic does the
model need to be?

• How will the benefits of the model be measured?

Setting specific, quantifiable goals will help realize measurable ROI from the machine learning
project instead of simply implementing it as a proof of concept that'll be tossed aside later.

In order for a machine learning project to go forward, you need to determine the feasibility of
the effort from a business, data and implementation standpoint.

Step 2. Understand and identify data

Once you have a firm understanding of the business requirements and receive approval for the
plan, you can start to build a machine learning model, right? Wrong. Establishing the business
case doesn't mean you have the data needed to create the machine learning model.

A machine learning model is built by learning and generalizing from training data, then
applying that acquired knowledge to new data it has never seen before to make predictions and
fulfill its purpose.
The focus should be on data identification, initial collection, requirements, quality
identification, insights and potentially interesting aspects that are worth further investigation.
Here are some key questions to consider:

• Where are the sources of the data that's needed for training the model?

• What quantity of data is needed for the machine learning project?

• What is the current quantity and quality of training data?

• How are the test set data and training set data being split?

• For supervised learning tasks, is there a way to label that data?

• Can pre-trained models be used?

• Where is the operational and training data located?

• Are there special needs for accessing real-time data on edge devices or in more
difficult-to-reach places?

Answering these important questions helps you get a handle on the quantity and quality of data
as well as understand the type of data that's needed to make the model work.

In addition, you need to know how the model will operate on real-world data. For example,
will the model be used offline, operate in batch mode on data that's fed in and processed
asynchronously, or be used in real time, operating with high-performance requirements to
provide instant results? This information will also determine the sort of data needed and data
access requirements.

During this phase of the AI project, it's also important to know if any differences exist between
real-world data and training data as well as test data and training data, and what approach you
will take to validate and evaluate the model for performance.
The above chart outlines different kinds of data and sources needed for machine learning
projects.

Step 3. Collect and prepare data

Procedures during the data preparation, collection and cleansing process include the following:

• Collect data from the various sources.

• Standardize formats across different data sources.

• Replace incorrect data.

• Enhance and augment data.

• Add more dimensions with pre-calculated amounts and aggregate information as needed.

• Enhance data with third-party data.

• "Multiply" image-based data sets if they aren't sufficient enough for training.

• Remove extraneous information and duplicates.

• Remove irrelevant data from training to improve results.

• Reduce noise and remove ambiguity.


• Consider anonymizing data.

• Normalize or standardize data to get it into formatted ranges.

• Sample data from large data sets.

• Select features that identify the most important dimensions and, if necessary, reduce
dimensions using a variety of techniques.

• Split data into training, test and validation sets.

Data preparation and cleansing tasks can take a substantial amount of time. Surveys of machine
learning developers and data scientists show that the data collection and preparation steps can
take up to 80% of a machine learning project's time. As the saying goes, "garbage in, garbage
out." Since machine learning models need to learn from data, the amount of time spent on
prepping and cleansing is well worth it.

The above chart is an overview of the training and inference pipelines used in developing and
updating machine learning models.

Step 4. Determine the model's features and train it


This phase requires model technique selection and application, model training, model
hyperparameter setting and adjustment, model validation, ensemble model development and
testing, algorithm selection, and model optimization. To accomplish all that, the following
actions are required:

• Select the right algorithm based on the learning objective and data requirements.

• Configure and tune hyperparameters for optimal performance and determine a
method of iteration to attain the best hyperparameters.

• Identify the features that provide the best results.

• Determine whether model explainability or interpretability is required.

• Develop ensemble models for improved performance.

• Test different model versions for performance.

• Identify requirements for the model's operation and deployment.

The resulting model can then be evaluated to determine whether it meets the business and
operational requirements.
In machine learning, an algorithm is the formula or set of instructions to follow to record
experience and improve learning over time. Depending on what type of machine learning
approach you are doing, different algorithms perform better than others.

Step 5. Evaluate the model's performance and establish benchmarks

From an AI perspective, evaluation includes model metric evaluation, confusion matrix
calculations, KPIs, model performance metrics, model quality measurements and
a final determination of whether the model can meet the established business goals. During the
model evaluation process, you should do the following:

• Evaluate the models using a validation data set.

• Determine confusion matrix values for classification problems.

• Identify methods for k-fold cross-validation if that approach is used.

• Further tune hyperparameters for optimal performance.


• Compare the machine learning model to the baseline model or heuristic.

Model evaluation can be considered the quality assurance of machine learning. Adequately
evaluating model performance against metrics and requirements determines how the model
will work in the real world.

Understanding the concepts of bias and variance helps you find the sweet spot for optimizing
the performance of your machine learning models.

Step 6. Put the model in operation and make sure it works well

When you're confident that the machine learning model can work in the real world, it's time to
see how it actually operates in the real world -- also known as "operationalizing" the model:

• Deploy the model with a means to continually measure and monitor its
performance.
• Develop a baseline or benchmark against which future iterations of the model can
be measured.

• Continuously iterate on different aspects of the model to improve overall performance.

Model operationalization might include deployment scenarios in a cloud environment, at the
edge, in an on-premises or closed environment, or within a closed, controlled group.

Depending on the requirements, model operationalization can range from simply generating a
report to a more complex, multi-endpoint deployment.

Successful AI projects iterate models to ensure the models continue to provide valuable,
reliable and desirable results in the real world.

Step 7. Iterate and adjust the model

Real-world data changes in unexpected ways, all of which might create new requirements for
deploying the model onto different endpoints or in new systems. The end may just be a new
beginning, so it's best to determine the following:
• the next requirements for the model's functionality;

• expansion of model training to encompass greater capabilities;

• improvements in model performance and accuracy;

• improvements in model operational performance;

• operational requirements for different deployments; and

• solutions to "model drift" or "data drift," which can cause changes in performance
due to changes in real-world data.

The surefire way to achieve success in machine learning model building is to continuously look
for improvements and better ways to meet evolving business requirements.

D2 Critically evaluate the implemented learning solution and its effectiveness in meeting end
user requirements.
Evaluate algorithm complexity

1. Method of evaluation by theory

In competitive programming, algorithmic complexity is evaluated using theoretical methods.
In this method, we are interested in the size of the input data, usually a number n. The
relationship between this size and the number of calculations needed to find the result of a
problem is called the algorithmic complexity (rather than a specific time such as 1, 2, or 10
seconds). We use the function T(n) to represent the execution time of the algorithm with input
data of size n.

The order of magnitude of T(n) is expressed with a function O(f(n)), where T(n) and f(n) are two non-negative functions. If an algorithm has an execution time T(n) = O(f(n)), we say that the algorithm runs in time of order f(n).

2. Rules for evaluating algorithm execution time


To evaluate the algorithm's execution time, we start from single instructions in the program, then move to structured statements and more complex blocks of instructions, and finally combine them to obtain the execution time of the entire program. Specifically, we have the following rules:
• Single instructions (declaration, assignment, input and output, arithmetic operations, ...): time O(1).
• Command blocks: assume a block of statements S1, S2, ..., Sm whose execution times are O(f1(n)), O(f2(n)), ..., O(fm(n)); then the execution time of the whole block is O(max(f1(n), f2(n), ..., fm(n))).
• Branching statement: for an if/else statement, the execution time is the time to evaluate the condition plus the larger of the execution times of the two branches.
• Loop statement: assuming the execution time of the body of the loop is O(f(n)) and the maximum number of iterations of the loop is g(n), the execution time of the whole loop is O(f(n)·g(n)). This applies to all for, while, and do...while loops.
• After evaluating the execution time of every instruction in the program, the execution time of the entire program is determined by the statement with the largest execution time.
3. Example analysis
Example: Analyze the execution time of the following program segment:
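(The original program listing appears only as an image in the source; the following Python fragment is a hypothetical reconstruction whose statement numbering matches the analysis below.)

n = int(input())              # (1) O(1)
s = 0                         # (2) O(1)
p = 1                         # (3) O(1)
for i in range(1, n + 1):     # (4) loop executed n times
    s += i                    # (5) O(1) body
print(s)                      # (6) O(1)
for i in range(1, n + 1):     # (7) loop executed n times
    p *= 2                    # (8) O(1) body
print(p)                      # (9) O(1)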

The execution time of the above program depends on the number n. We analyze it in detail:
• Instructions (1), (2), (3), (5), (6), (8) and (9) all run in O(1) time.
• The for loop (4) performs n iterations and its body, statement (5), runs in O(1) time, so the whole loop takes O(n). The same applies to loop (7).

So the execution time of the whole algorithm is:

max(O(1), O(n)) = O(n)

LO4 Evaluate the outcome or the result of the application to determine the
effectiveness of the learning algorithm used in the application
P7 Discuss whether the result is balanced, under-fitting or over-fitting.
Tactics To Combat Imbalanced Training Data

1) Can You Collect More Data?

You might think it’s silly, but collecting more data is almost always overlooked.

Can you collect more data? Take a second and think about whether you are able to gather more
data on your problem.

More examples of the minority classes may be useful later when we look at resampling your dataset.

2) Try Changing Your Performance Metric

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that
it is misleading.

There are metrics that have been designed to tell you a more truthful story when working with
imbalanced classes.

You should consider the following performance metrics that can provide more insight into
model accuracy than traditional classifier accuracy:

• Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
• Precision: A measure of a classifier's exactness.
• Recall: A measure of a classifier's completeness.
• F1 Score (or F-score): A weighted average of precision and recall.
I would also advise you to take a look at the following:

• Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
• ROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity, and models can be chosen based on the balance thresholds of these values.
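As a concrete illustration of these metrics, a small scikit-learn sketch on an imbalanced toy data set might look like this (all names and numbers are illustrative only):

# Reporting imbalance-aware metrics on a toy imbalanced data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             cohen_kappa_score, roc_auc_score)

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))           # breakdown of correct/incorrect predictions
print(classification_report(y_test, y_pred))      # precision, recall, F1 per class
print("kappa:", cohen_kappa_score(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))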
3) Try Resampling Your Dataset

You can change the dataset that you use to build your predictive model to have more balanced
data.

This change is called sampling your dataset and there are two main methods that you can use
to even-up the classes:

1. You can add copies of instances from the under-represented class called over-sampling
(or more formally sampling with replacement), or
2. You can delete instances from the over-represented class, called under-sampling.
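A minimal sketch of over-sampling with replacement, using only scikit-learn utilities on a hypothetical imbalanced data set, could look like this (under-sampling works the same way, drawing fewer majority examples with replace=False):

# Random over-sampling of the minority class with replacement.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_min, y_min = X[y == 1], y[y == 1]          # minority class
X_maj, y_maj = X[y == 0], y[y == 0]          # majority class

# Draw minority examples with replacement until the classes are the same size.
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y_bal))                    # now roughly balanced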
4) Try Generating Synthetic Samples

A simple way to generate synthetic samples is to randomly sample the attributes from instances
in the minority class.
You could sample them empirically within your dataset or you could use a method like Naive
Bayes that can sample each attribute independently when run in reverse. You will have more
and different data, but the non-linear relationships between the attributes may not be preserved.

These approaches are often very easy to implement and fast to run. They are an excellent
starting point.
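One widely used systematic way to generate synthetic minority samples is SMOTE; a brief sketch is shown below, assuming the third-party imbalanced-learn (imblearn) package is installed:

# Synthetic minority over-sampling (SMOTE) on a toy imbalanced data set.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))              # classes are now balanced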

5) Try Different Algorithms

As always, I strongly advise you not to use your favorite algorithm on every problem. You should at least be spot-checking a variety of different types of algorithms on a given problem.

For more on spot-checking algorithms, see my post “Why you should be Spot-Checking
Algorithms on your Machine Learning Problems”.
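A simple spot-checking loop might compare a few different model families with cross-validation, for example (model choices and data are illustrative only):

# Spot-checking several classifier families with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "naive bayes": GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.3f}")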

6) Try Penalized Models

Penalized classification imposes an additional cost on the model for making classification
mistakes on the minority class during training. These penalties can bias the model to pay more
attention to the minority class.

Often the handling of class penalties or weights is specific to the learning algorithm. There are penalized versions of algorithms such as penalized-SVM and penalized-LDA.
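In scikit-learn, one readily available form of this penalty is the class_weight parameter, which makes minority-class mistakes cost more during training; a brief illustrative sketch:

# Class weighting as a simple penalized model: minority errors are penalized more.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# class_weight='balanced' scales penalties inversely to class frequencies.
weighted_svm = SVC(kernel="rbf", class_weight="balanced")
print("weighted SVM F1:",
      cross_val_score(weighted_svm, X, y, cv=5, scoring="f1").mean())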

7) Try a Different Perspective

There are fields of study dedicated to imbalanced datasets. They have their own algorithms,
measures and terminology.

Taking a look and thinking about your problem from these perspectives can sometimes shake loose some ideas.

Two perspectives you might consider are anomaly detection and change detection. Anomaly detection treats the minority class as the outlier class, which might help you think of new ways to separate and classify samples.

Change detection is similar to anomaly detection, except that rather than looking for an anomaly it looks for a change or difference. This might be a change in the behavior of a user as observed by usage patterns or bank transactions.
Both of these shifts take a more real-time stance to the classification problem that might give
you some new ways of thinking about your problem and maybe some more techniques to try.

8) Try Getting Creative

Really climb inside your problem and think about how to break it down into smaller problems
that are more tractable.

For example:

Decompose your larger class into a smaller number of other classes…

…use a One Class Classifier… (e.g. treat it like outlier detection; see the sketch after this list)

…resampling the unbalanced training set into not one balanced set, but several. Running an
ensemble of classifiers on these sets could produce a much better result than one classifier
alone.

These are just a few of some interesting and creative ideas you could try.
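As a sketch of the one-class idea mentioned above, an outlier detector such as scikit-learn's IsolationForest can be trained on the majority class only (names and numbers here are illustrative):

# "One class" view: fit an outlier detector on the majority class and treat
# anything it flags as unusual as a candidate minority-class sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)

detector = IsolationForest(random_state=0).fit(X[y == 0])   # train on majority only
pred = detector.predict(X)        # +1 = inlier (majority), -1 = outlier (candidate minority)
print("flagged as outliers:", (pred == -1).sum())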

For more ideas, check out these comments on the reddit post “Classification when 80% of my
training set is of one class“.

P8 Analyse the result of the application to determine the effectiveness of the algorithm
Analyze the efficiency of the algorithm

1. Criteria for evaluating an algorithm

Usually, when solving a problem, we tend to choose the "best" solution. But what counts as "good"? In mathematics, a "good" solution might be one that is short and concise, or one that relies on easy-to-understand knowledge. For algorithms in Informatics, quality is judged by the following two criteria:

• The algorithm is simple, easy to understand, and easy to implement.
• Algorithm efficiency: based on two factors, the execution time of the algorithm (also known as algorithmic complexity) and the amount of memory required to store data. However, now that computers have very large storage capacities, the factor we need to pay more attention to is algorithmic complexity.
2. The Necessity of Efficient Algorithms
As technology develops, the amount of data to be processed grows ever larger; of course, the computing power of computers is also growing. But modern hardware is no reason to ignore the importance of an efficient algorithm. To clarify this point, I would like to quote an example from the specialized informatics textbook (volume 1) about an algorithm for checking whether a number is prime.
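(The original listing is shown only as an image in the source; the following Python sketch is a plausible reconstruction of that naive check.)

# Naive primality test: try every candidate divisor from 2 to N - 1.
def is_prime_naive(N):
    if N < 2:
        return False
    for d in range(2, N):        # about N - 2 iterations in the worst case
        if N % d == 0:
            return False
    return True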

The above is the simplest implementation of the primality checking algorithm. It needs about N − 2 checks in the loop. Suppose we need to test a number of about 25 digits, and we have a supercomputer that can perform a hundred trillion (10^14) calculations per second; then the total time needed for the check is:

10^25 / (10^14 × 60 × 60 × 24 × 365) ≈ 3170 years!

However, if we are observant, we can make the following remark: if a number N has a divisor x with x ≤ √N, then it also has the divisor N/x ≥ √N. Therefore, instead of trying every value from 2 to N − 1, we only need to try divisors from 2 to √N to know whether N has any divisor, as in the following sketch:
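(Again, a hypothetical Python reconstruction; the original listing is an image in the source.)

# Improved primality test: only candidate divisors up to sqrt(N) are tried.
import math

def is_prime_sqrt(N):
    if N < 2:
        return False
    for d in range(2, math.isqrt(N) + 1):   # at most about sqrt(N) iterations
        if N % d == 0:
            return False
    return True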
Following this method, for the same integer of about 25 digits, the checking time is reduced to:

√(10^25) / 10^14 ≈ 0.03 seconds!

M4 Evaluate the effectiveness of the learning algorithm used in the application.

