
Module - 1

Introduction:
Terminologies in machine learning, Applications, Types of machine learning: supervised,
unsupervised, semi-supervised learning, Reinforcement Learning.
Features: Types of Data (Qualitative and Quantitative), Scales of Measurement (Nominal, Ordinal,
Interval, Ratio), Concept of Feature, Feature construction, Feature Selection and Transformation,
Curse of Dimensionality. Linear discriminant Analysis (LDA).

Definition of Machine Learning


Machine Learning is the field of study that gives computers the capability to learn without being
explicitly programmed. ML is one of the most exciting technologies one could come across. As is
evident from the name, it gives the computer the quality that makes it more similar to humans:
the ability to learn. Machine learning is actively being used today, perhaps in many more
places than one would expect.

Terminologies in machine learning

Labels

A label is the thing we're predicting—the y variable in simple linear regression. The label could
be the future price of wheat, the kind of animal shown in a picture, the meaning of an audio clip,
or just about anything.

Features

A feature is an input variable—the x variable in simple linear regression. A simple machine
learning project might use a single feature, while a more sophisticated machine learning project
could use millions of features, specified as:

x1, x2, ..., xN

In the spam detector example, the features could include the following:

 words in the email text
 sender's address
 time of day the email was sent
 email contains the phrase "one weird trick"

Examples

An example is a particular instance of data, x. (We put x in boldface to indicate that it is a vector.)
We break examples into two categories:

 labeled examples
 unlabeled examples

A labeled example includes both feature(s) and the label. That is:

labeled examples: {features, label}: (x, y)

Use labeled examples to train the model. In our spam detector example, the labeled examples
would be individual emails that users have explicitly marked as "spam" or "not spam."

An unlabeled example contains features but not the label. That is:

unlabeled examples: {features, ?}: (x, ?)

Once we've trained our model with labeled examples, we use that model to predict the label on
unlabeled examples. In the spam detector, unlabeled examples are new emails that humans haven't
yet labeled.

Models

A model defines the relationship between features and label. For example, a spam detection model
might associate certain features strongly with "spam". Let's highlight two phases of a model's life:

 Training means creating or learning the model. That is, you show the model labeled examples
and enable the model to gradually learn the relationships between features and label.
 Inference means applying the trained model to unlabeled examples. That is, you use the trained
model to make useful predictions (y'). For example, during inference, you can
predict medianHouseValue for new unlabeled examples.
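
The two phases can be illustrated with a short scikit-learn sketch. The numbers below are
invented purely for illustration and are not a real housing dataset:

# a minimal sketch of training vs. inference (invented toy data)
from sklearn.linear_model import LinearRegression

# training: labeled examples {features, label}: (x, y)
X_train = [[2.5], [3.8], [5.1], [7.2]]        # feature: median income (hypothetical)
y_train = [120000, 180000, 250000, 360000]    # label: medianHouseValue (hypothetical)

model = LinearRegression()
model.fit(X_train, y_train)    # the model learns the feature-label relationship

# inference: predict y' for an unlabeled example {features, ?}: (x, ?)
X_new = [[4.4]]
print(model.predict(X_new))    # predicted medianHouseValue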

Regression vs. classification

A regression model predicts continuous values. For example, regression models make predictions
that answer questions like the following:

 What is the value of a house in California?
 What is the probability that a user will click on this ad?

A classification model predicts discrete values. For example, classification models make
predictions that answer questions like the following:
 Is a given email message spam or not spam?
 Is this an image of a dog, a cat, or a hamster?
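
The contrast can be sketched in a few lines of scikit-learn; the data points here are invented
for illustration only:

# regression: continuous output (e.g., a house value in thousands)
from sklearn.linear_model import LinearRegression, LogisticRegression

reg = LinearRegression().fit([[1000], [1500], [2000]], [200.0, 290.0, 410.0])
print(reg.predict([[1800]]))    # prints a continuous number

# classification: discrete output (1 = spam, 0 = not spam)
clf = LogisticRegression().fit([[0], [1], [5], [9]], [0, 0, 1, 1])
print(clf.predict([[7]]))       # prints a discrete class label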

Applications of Machine learning

1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. A popular use case of image recognition
and face detection is automatic friend-tagging suggestions:
Facebook provides a feature of automatic friend-tagging suggestions. Whenever we upload a photo
with our Facebook friends, we automatically get tagging suggestions with names, and the
technology behind this is machine learning's face detection and recognition algorithm.

It is based on the Facebook project named "Deep Face," which is responsible for face recognition
and person identification in the picture.

2. Speech Recognition

While using Google, we get an option of "Search by voice"; this comes under speech recognition,
a popular application of machine learning.

Speech recognition is the process of converting voice instructions into text, and it is also known as
"speech to text" or "computer speech recognition". At present, machine learning algorithms
are widely used in various speech recognition applications. Google Assistant, Siri, Cortana,
and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, in two ways:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day

Everyone who uses Google Maps is helping to make the app better: it takes information from
users and sends it back to its database to improve performance.

4. Product recommendations:

Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendations to the user. Whenever we search for a
product on Amazon, we start getting advertisements for the same product while surfing the
internet in the same browser, and this is because of machine learning.

Google understands the user's interests using various machine learning algorithms and suggests
products accordingly.

Similarly, when we use Netflix, we find recommendations for series, movies,
etc., and this is also done with the help of machine learning.

5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars. Machine learning
plays a significant role in self-driving cars. Tesla, one of the most popular car manufacturers, is
working on self-driving cars, using machine learning methods to train the car models to
detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is automatically filtered as important, normal, or spam.
Important mail arrives in our inbox with the important symbol, and spam emails land in our
spam box; the technology behind this is machine learning. Below are some spam filters used
by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms, such as the Multi-Layer Perceptron, Decision Tree, and Naïve
Bayes classifiers, are used for email spam filtering and malware detection.
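
As a rough sketch of how such a filter can be built with a Naïve Bayes classifier (the four
example emails below are invented; this is not Gmail's actual system):

# a minimal Naive Bayes spam-filter sketch (invented toy emails)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "meeting at noon", "one weird trick", "lunch tomorrow"]
labels = [1, 0, 1, 0]                 # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(emails)         # word counts become the features

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free money trick"])))   # likely spam (1)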

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri. As
the name suggests, they help us find information using our voice instructions. These
assistants can help us in various ways just through voice instructions, such as playing music,
calling someone, opening an email, scheduling an appointment, etc.

These virtual assistants use machine learning algorithms as an important component.

These assistants record our voice instructions, send them to a server in the cloud, decode them
using ML algorithms, and act accordingly.

8. Online Fraud Detection:

Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, fraud can take place in various ways,
such as fake accounts, fake IDs, and money stolen in the middle of a transaction. To detect this,
a feed-forward neural network helps by checking whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values
become the input for the next round. Each genuine transaction follows a specific pattern that
changes for a fraudulent transaction; the network detects this and makes our online transactions
more secure.

9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a risk
of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural
network is used for predicting stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With it, medical technology
is growing very fast and is able to build 3D models that can predict the exact position of lesions in
the brain.

It helps in finding brain tumors and other brain-related diseases easily.

11. Automatic Language Translation:

Nowadays, if we visit a new place and are not aware of the language, it is not a problem at
all: machine learning helps us by converting text into languages we know.
Google's GNMT (Google Neural Machine Translation) provides this feature; it is a neural
machine translation system that translates text into our familiar language, known as automatic
translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm,
which is used together with image recognition to translate text from one language to another
language.

Types of Machine Learning


As we have learned earlier, the focus of the field of machine learning is "learning the way a human
brain learns". There are many types of machine learning that you may encounter as a general
machine learning enthusiast.

Some types of learning describe whole subfields of study composed of many different types of
algorithms in themselves such as “supervised learning.”

There are mainly 4 types of learning that you must be familiar with as a machine learning
practitioner, namely:

Learning Problems
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning

Hybrid Learning Problems
4. Semi-Supervised Learning

Now that we broadly know the types of Machine Learning Algorithms, let us try and understand
them better one after the other.

1. Supervised Machine Learning

As you must have understood from the name, supervised machine learning is based on supervision
of the learning process of the machines. In the supervised learning technique, we
train the machines using a labeled dataset, and on the basis of this training, the machine
predicts the output. Here, labeled data means that some of the inputs that we feed
to the algorithm are already mapped to the output.

To put it simply, in this type of machine learning, we teach the machine using training
data and then expect it to predict the outcomes on test data.

Let’s understand the working of supervised learning with an example. Suppose we have an
input dataset of cat and dog images. As the first step, we will train the machine to
understand the images: we teach the machine to classify the images based on
features such as the shape and size of the tail, the shape of the eyes, color, height, and so on.

After successful completion of training, we input the picture of a cat and ask the machine to
identify the object and predict the output on the basis of its training. Since the machine is
well trained, it will check all the classifying features of the object, such as height, shape, color,
eyes, ears, and tail, and ascertain that it’s a cat. So, it will classify the image in the Cat category.

Through this process, the machine identifies the objects in Supervised Learning.
Categories of Supervised Machine Learning

On the basis of the problem at hand, we can classify supervised learning into
two categories:

a. Classification:

We use classification algorithms to solve classification problems, in which the output variable
is categorical. These categories can be of many types, such as Yes or No, Male or Female, Red or
Blue, and so on. Classification algorithms predict the categories present in the dataset on the
basis of the training data.

Some of the widely used classification algorithms are:

 Random Forest Algorithm
 Decision Tree Algorithm
 Logistic Regression Algorithm
 Support Vector Machine Algorithm
b. Regression:

We use regression algorithms to solve regression problems, in which we model the relationship
between input variables and a continuous output variable. These algorithms are used to predict
continuous quantities, such as market trends, weather, and so on.

Some of the popular regression algorithms are: Simple Linear Regression, Multivariate
Regression, Decision Tree Regression, and Lasso Regression.
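
As a small illustration of this train-then-predict workflow (the data is synthetic, generated by
scikit-learn purely for illustration):

# sketch of the supervised workflow: train on labeled data, test on unseen data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learn from labeled examples
print(accuracy_score(y_test, model.predict(X_test)))     # evaluate on held-out data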

Advantages of Supervised Learning

 Since supervised learning works with the labeled dataset, we can have an exact idea about the
classification of objects and other variables that we feed in as input.
 These algorithms are helpful in predicting the output on the basis of prior experience and
trained data.
Disadvantages of Supervised Learning

 These algorithms cannot solve complex tasks without sufficient labeled data.
 They may predict the wrong output if the test data differs from the training data or the training
data contains noise.
 Training requires a lot of computational time and a large amount of labeled data.
Applications of Supervised Learning

a. Image Segmentation:
We make use of Supervised Learning algorithms in image segmentation. In this process, we
perform image classification on different image data with predefined labels in the dataset.
b. Medical Diagnosis:

We can also observe the usage of supervised algorithms in the medical field for diagnosis
purposes. This is achieved by using medical images and past data labeled with disease
conditions to accurately predict ailments. With such a process, the machine can identify a
disease for new patients with the help of the previous data history of other patients.

c. Fraud Detection:

We use Supervised Learning classification algorithms for identifying fraud transactions, fraud
customers, and other financial frauds. We achieve this by using historic data to identify the patterns
that can lead to possible frauds before they can occur.

d. Spam detection:
For spam detection and filtering, classification algorithms are quite effective and reliable.
These algorithms classify an email as spam or not spam and send the spam emails
to the spam folder.

e. Speech Recognition

Supervised learning algorithms are also used in speech recognition. We can train an
algorithm with voice data and perform various identifications using it, such as
voice-activated passwords, voice commands, and so on.

2. Unsupervised Machine Learning

Unsupervised machine learning differs from the supervised learning technique in many
ways. As you would have understood from the name, there is no need for supervision in this type
of machine learning: we train the machine using an unlabeled dataset, and the machine
predicts the output without any supervision.

In unsupervised learning, we train the models with data that is neither classified nor labeled,
and the model acts on that data without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorize the unsorted dataset
according to the similarities, patterns, and differences that it recognizes in the dataset. We
instruct the machines to recognize the hidden patterns in the input dataset and analyze the
results.

Types of Unsupervised Learning

a. Clustering:

We make use of the clustering technique when we want to find the inherent groups in the
data. It is a technique for grouping objects into clusters: objects with the most similarities
are placed in one group and have few or no similarities with the objects of other groups.

Some of the popular algorithms used for clustering and related unsupervised tasks are: the
K-Means clustering algorithm, the Mean-shift algorithm, DBSCAN, and, for dimensionality
reduction, Principal Component Analysis and Independent Component Analysis.
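
A minimal K-Means sketch (the six points below are invented so that two groups are obvious):

# group unlabeled points into two clusters
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],        # one apparent group
              [10, 2], [10, 4], [10, 0]])    # another apparent group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)             # cluster assignment for each point
print(kmeans.cluster_centers_)    # learned cluster centres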

b. Association

Association rule learning is a type of unsupervised learning technique that looks for interesting
relations among variables within a large input dataset. The algorithm aims to
find dependencies of one data item on another and map those variables accordingly so
that maximum value can be extracted from the unsorted large dataset.

Some of the popular association rule learning algorithms are the Apriori algorithm, Eclat, and
the FP-Growth algorithm.
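
The support/confidence idea underlying these algorithms can be sketched by hand; the four
transactions below are invented for illustration:

# hand-rolled sketch of support and confidence for one association rule
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# rule: bread -> milk; confidence = support({bread, milk}) / support({bread})
s_both = support({"bread", "milk"})
s_bread = support({"bread"})
print(f"support={s_both:.2f}, confidence={s_both / s_bread:.2f}")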

Advantages of Unsupervised Learning

 We can use these algorithms for more complicated tasks than supervised ones, because
these algorithms work on unlabeled datasets and do not require labeled data.
 Unsupervised algorithms are preferable for many tasks, as obtaining an unlabeled dataset is
easier than obtaining a labeled dataset for training.
Disadvantages of Unsupervised Learning

 An unsupervised algorithm may produce less accurate outputs, as the dataset is not labeled
and we have not trained the algorithm with the exact output beforehand.
 Working with unsupervised learning is more difficult than other types, as it works
with an unlabeled dataset that does not map precisely to an output.

Applications of Unsupervised Learning

a. Network Analysis:

We use Unsupervised learning for identifying plagiarism and copyright in document network
analysis of text data for scholarly articles and prevent copyright frauds.
b. Recommendation Systems:

Recommendation systems extensively make use of unsupervised learning techniques to build
recommendation applications for web applications and e-commerce websites.

c. Anomaly Detection:

Anomaly detection is a widely used application of unsupervised learning, which can identify
unusual data points within a large input dataset. We extensively use it to discover fraudulent
transactions.

d. Singular Value Decomposition:

We use Singular Value Decomposition, commonly known as SVD, to extract certain types of
information from a database; for example, extracting information about each user located in a
certain geographical area.

3. Semi-Supervised Learning

Semi-supervised learning is yet another type of machine learning that lies between, or
is a hybrid of, supervised and unsupervised machine learning. It occupies the intermediate
ground between supervised and unsupervised learning algorithms and uses a
combination of labeled and unlabeled datasets during the training period.

Although semi-supervised learning is the middle ground between supervised and unsupervised
learning, and we use it on data that contains a few labeled examples, the data mostly comprises
unlabeled examples. Labeled datasets are quite expensive, but for corporate purposes a few
labeled examples may still be required. Semi-supervised learning thus differs from both supervised
and unsupervised learning, which are based on the presence and absence of labels, respectively.

In order to overcome the drawbacks of supervised and unsupervised learning algorithms,
the concept of semi-supervised learning was introduced. Semi-supervised learning aims to
use all the available data effectively, rather than only the labeled data as in supervised learning.
Initially, the model clusters similar data together using an unsupervised learning algorithm.
This then helps to put labels on the unlabeled data, turning it into labeled data. We perform
this step because labeled data is quite expensive compared to unlabeled data.
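
A minimal sketch of this idea, using scikit-learn's LabelPropagation (the six points are
invented; unlabeled examples are marked with -1):

# propagate labels from a few labeled points to the unlabeled ones
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [0.9], [8.0], [8.3], [7.9]])
y = np.array([0, -1, -1, 1, -1, -1])        # only two points carry labels

model = LabelPropagation(kernel='knn', n_neighbors=2).fit(X, y)
print(model.transduction_)                  # inferred labels for all six points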

Advantages of semi-supervised learning

 The algorithm is simple and easy to understand.
 It is highly efficient in predicting the output on the basis of the input data.
 It overcomes the drawbacks of Supervised and Unsupervised Learning algorithms.
Disadvantages of semi-supervised learning

 Iteration results may not be stable and outputs may vary significantly.
 We cannot apply these algorithms to network-level data due to its complexities.
 The accuracy rate for this type of Learning is low.

4. Reinforcement Learning

Reinforcement learning operates on a feedback-based process, in which an Artificial Intelligence
agent automatically explores its surroundings by trial and error. It takes actions, learns from
experience, and improves its performance. The algorithm rewards the agent for each good
action and punishes it for each bad action. Hence, the reinforcement learning agent aims to
maximize the rewards and minimize the punishments.

In reinforcement learning, the algorithm does not require any labeled data like supervised learning,
and agents solely learn from their experiences.

The reinforcement learning process is similar to the learning process of a human being; for
example, a child learns various things through day-to-day life experiences. A simple
way to understand reinforcement learning is playing a game, where the game mimics the
environment, the moves of the agent at each step define the states of the game, and the goal of
the agent is to get a high score at the end of the game. The agent receives feedback in terms of
the rewards and punishments it collects along the way.
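This reward-and-punishment loop can be sketched with tabular Q-learning; the 5-state corridor
environment below is invented purely for illustration:

# tiny Q-learning sketch: the agent starts in state 0, state 4 is the goal
import numpy as np

n_states, n_actions = 5, 2              # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))     # value table learned by trial and error
alpha, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(100):
    s = 0
    while s != 4:
        # explore with probability epsilon (or when Q expresses no preference)
        if rng.random() < epsilon or Q[s, 0] == Q[s, 1]:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == 4 else 0.0              # reward for the good action
        # Q-learning update: nudge Q towards reward + discounted future value
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))    # learned policy: states 0-3 should prefer "right" (1)
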
Classification of Reinforcement Learning

1. Positive Reinforcement Learning

Positive reinforcement learning means increasing the tendency of the required behavior to
occur again by adding a positive stimulus. It strengthens the agent's behavior and leaves a
positive impact.

2. Negative Reinforcement Learning:

Negative reinforcement learning works in exactly the opposite way to positive
reinforcement learning. It tends to increase the tendency of the specific behavior to occur
again by avoiding the negative condition that would lead to punishment.

Use cases of Reinforcement Learning

a. Video Games:

Reinforcement learning algorithms are extensively used in gaming applications and can achieve
superhuman performance in theme-based gaming. Some popular systems that make use of
this type of algorithm are AlphaGo and AlphaGo Zero.

b. Resource Management:

The “Resource Management with Deep Reinforcement Learning” paper demonstrates how to use
this type of learning to teach computers resource scheduling: jobs waiting for resources are
scheduled so as to minimize average job slowdown.

c. Robotics:

This type of learning is extensively used in robotics applications. We make use of robots
in industrial and manufacturing areas, and we can make these robots more capable with
reinforcement learning. Different industries keep a keen eye on building intelligent
robots using AI and machine learning technology.

d. Text Mining:

Text mining, to date, is one of the greatest applications of NLP. Companies such as Salesforce
have explored implementing NLP tasks with the help of reinforcement learning.

Advantages of Reinforcement Learning

 This type of learning helps us solve complex real-world problems that are difficult to
solve with conventional techniques.
 The learning model of reinforcement learning is similar to the learning process of human
beings; therefore, we can expect highly accurate results.
 This type of learning helps us achieve long-term results.
Disadvantages of Reinforcement Learning

 We generally do not prefer these algorithms for simple problems.
 These algorithms require huge amounts of data and high computational power.
 Too much reinforcement learning can lead to an overload of states, which can weaken the
results and defeat the purpose of deploying them.

Features:
Types of Data
There are two types of data:

1. Qualitative
2. Quantitative
As we go deeper into these two types, we will encounter more classifications.

Let’s start by defining what these terms actually represent.

When we mention qualitative data, you should be able to tell just by looking at the word that
qualitative relates to “quality”. It is attribute data because it includes values that express a
quality or state. This type of data cannot be counted or quantified; gender, feelings, and so on
are some examples.

Quantitative data, on the other hand, is concerned with quantity. This type of data is numerical
in nature, which means you can count or measure it. Income, age, and so on are a few examples.

Let us clarify this with a more general example. Consider feelings: you cannot quantify someone’s
emotions. Because it is impossible to quantify how sad or happy someone is, we consider feelings
to be qualitative data, since they express a quality. However, if you were to count how many
people are sad or how many are happy, you could call that quantitative data derived
from qualitative data.

Types of Qualitative Data

When we delve deeper into qualitative data, we can further divide it into two types.

The following are the types:

Nominal Variable

Nominal variables are categories that cannot be ranked or ordered into a hierarchy.

Flowers, for example, cannot be given a hierarchical ranking: you cannot say that a lily is superior
to a rose. The same is true for colors, gender, race, country, and so on. Such variables are
termed nominal.
On such variables, you can apply four types of statistical tests:

 McNemar
 Fisher’s Exact
 Cochran Q
 Chi-Square
Ordinal Variable

Ordinal variables, on the other hand, can be ordered and ranked. An ordinal variable also
possesses all of the properties of a nominal variable; the ability to measure it and rank it in
hierarchical order is a plus.

However, we should not treat the ranks as true numbers, because it is still qualitative data!

So, if I were to create a Google Form to collect feedback on, say, a webinar I hosted on Zoom,
I would insert the following question: “How informative did you find the webinar?”
I’d also offer the following marking options:

 A little informative
 Exceptionally Informative
 Not at all informative.
These options, which I just listed, are examples of ordinal variables.

On such variables, you can apply four types of statistical tests:

 Kruskal-Wallis 1-way
 Wilcoxon signed-rank
 Wilcoxon rank-sum
 Friedman 2-way ANOVA
It is important to note that both nominal and ordinal variables are non-parametric. The
only difference between them is the ability of ordinal variables to rank information by position.

Types of Quantitative Data

Quantitative data can be classified into 2 types:

Interval

With interval scales, we know both the order and the exact difference between values. For
example, the difference between 20 and 10 degrees has the same magnitude as the difference
between 30 and 20 degrees. Interval scales, however, have no true zero point: 0 °C does not
mean “no temperature”.
Ratio

In contrast, ratio data is interval data with a natural zero point. This basically means that negative
values cannot exist in ratio data. Height measured in centimeters, meters, inches, and so on
is an example.

Quantitative data can also be:

 Discrete: You can count discrete variables within a limited amount of time. You
can, for example, count the money in your bank account or the number of times you drank
a glass of water in a day. Essentially, if a variable is discrete, you will be able to count
it, though it may sometimes take longer to count completely. Discrete variables can only
take certain values, which do not include fractional forms.

 Continuous: Continuous variables, on the other hand, are “continuous” in nature, as the
name implies, and you can never completely count them because they form an ongoing
process. For example, if I asked you how many times you drank a glass of water in this
century, you would be unable to say, because the century has not yet passed. Continuous
variables can take any value within a certain range of values.

Advantages and Disadvantages of Qualitative and Quantitative Data Research

Advantages

Qualitative research:

 Efficient and Effective: Since the majority of information is obtained from each respondent,
a smaller sample size is used to collect information. This not only lowers research expenditure
but also delivers speedier results.
 Flexible: It is adaptable in nature because there is no set structure for how a researcher should
conduct the research. This enables researchers to concentrate on any area from which they
believe the statistics should be mined.
 Creativity: Qualitative research allows respondents to provide honest input, and their candid
opinion is what is called creative in this context, which ultimately becomes an asset to the
company.

Quantitative research:

 Higher Sample Size: A higher sample size can be reached through this method of research,
which makes it easier for a researcher to arrive at a conclusion.
 Straightforward Process: It is a simple procedure to put in place. It is both effective and
efficient, since it does not need the identification of variables in order to obtain results.
 Anonymous: Since quantitative information is collected anonymously from individuals, they
are able to provide honest criticism without fear of their names being disclosed.

Disadvantages

Qualitative research:

 Not Measurable: The figures from qualitative research cannot be measured or graphically
represented.
 Difficulty in Replicating Results: Since qualitative research is based on individual
perspectives, the findings cannot be truly replicated.
 False Conclusions: Since the findings from qualitative research cannot be repeated and a
smaller sample size is used to collect information, there is no guarantee that the conclusions
derived from the statistics will apply to an entire population.

Quantitative research:

 Ignorant: This technique is ignorant of the motives individuals have while providing
feedback. It is primarily concerned with obtaining answers to questions so that a specific
hypothesis can be validated or invalidated.
 Incomplete Information: You won’t be able to go through each answer individually using
this technique. So even if you start to wonder why someone consented to a particular question,
you will never know why they agreed. A respondent’s individual input is regarded as essential
since it exposes their opinions on a certain topic, and this cannot be gathered via quantitative
research.
 Expensive: This approach is far more expensive than qualitative research, and even low-cost
quantitative research tools, such as online surveys, are not always trustworthy.

What is a feature?

Generally, all machine learning algorithms take input data to generate output. The input data
is in tabular form, consisting of rows (instances or observations) and columns (variables or
attributes), and these attributes are often known as features. For example, an image is an instance
in computer vision, but a line in the image could be a feature. Similarly, in NLP, a document can
be an observation, and the word count could be a feature. So, we can say a feature is an attribute
that impacts a problem or is useful for the problem.

What is Feature Engineering?

Feature engineering is the pre-processing step of machine learning that extracts features
from raw data. It helps represent the underlying problem to predictive models in a better way,
which, as a result, improves the accuracy of the model on unseen data. The predictive model
contains predictor variables and an outcome variable, and the feature engineering process
selects the most useful predictor variables for the model.

Since 2016, automated feature engineering has also been used in different machine learning
software to automatically extract features from raw data. Feature engineering in ML mainly
contains four processes: Feature Creation, Transformations, Feature Extraction, and Feature
Selection.

These processes are described as below:

1. Feature Creation: Feature creation is finding the most useful variables to be used in a
predictive model. The process is subjective, and it requires human creativity and
intervention. New features are created by combining existing features using operations
such as addition, subtraction, and ratios, and these new features offer great flexibility.
2. Transformations: The transformation step of feature engineering involves adjusting the
predictor variable to improve the accuracy and performance of the model. For example, it
ensures that the model is flexible to take input of the variety of data; it ensures that all the
variables are on the same scale, making the model easier to understand. It improves the
model's accuracy and ensures that all the features are within the acceptable range to avoid
any computational error.
3. Feature Extraction: Feature extraction is an automated feature engineering process that
generates new variables by extracting them from the raw data. The main aim of this step is
to reduce the volume of data so that it can be easily used and managed for data modelling.
Feature extraction methods include cluster analysis, text analytics, edge detection
algorithms, and principal components analysis (PCA).
4. Feature Selection: While developing a machine learning model, only a few of the variables
in the dataset are useful for building the model; the rest of the features are either redundant or
irrelevant. If we input the dataset with all these redundant and irrelevant features, it may
negatively impact the overall performance and accuracy of the model. Hence, it
is very important to identify and select the most appropriate features from the data and
remove the irrelevant or less important features, which is done with the help of feature
selection in machine learning. "Feature selection is a way of selecting the subset of the
most relevant features from the original feature set by removing the redundant,
irrelevant, or noisy features."

Below are some benefits of using feature selection in machine learning:

o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that researchers can easily interpret it.
o It reduces the training time.
o It reduces overfitting, hence enhancing generalization.
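
A minimal feature-selection sketch with scikit-learn's SelectKBest, keeping the two features
most related to the label (Iris is used here only as a convenient stand-in dataset):

# keep the k features with the strongest statistical relation to the label
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)    # (150, 4) -> (150, 2)
print(selector.get_support())            # boolean mask of the kept features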

Need for Feature Engineering in Machine Learning

In machine learning, the performance of the model depends on data pre-processing and data
handling. If we create a model without pre-processing or data handling, it may not give
good accuracy; whereas if we apply feature engineering to the same model, the accuracy of
the model is enhanced. Hence, feature engineering in machine learning improves the model's
performance. Below are some points that explain the need for feature engineering:

o Better features mean flexibility.
In machine learning, we always try to choose the optimal model to get good results.
However, sometimes even after choosing a suboptimal model we can still get better
predictions, and this is because of better features. Flexibility in features enables you to
select less complex models, which are faster to run and easier to understand and maintain,
which is always desirable.
o Better features mean simpler models.
If we input well-engineered features to our model, then even after selecting less-than-optimal
parameters we can have good outcomes. After feature engineering, it is not necessary to work
as hard at picking the right model with the most optimized parameters. If we have good
features, we can better represent the complete data and use it to best characterize the given
problem.
o Better features mean better results.
As already discussed, in machine learning the output depends on the data we provide. So, to
obtain better results, we must use better features.

Steps in Feature Engineering

The steps of feature engineering may vary as per different data scientists and ML engineers.
However, there are some common steps that are involved in most machine learning algorithms,
and these steps are as follows:

o Data Preparation: The first step is data preparation. In this step, raw data acquired from
different sources is prepared and put into a suitable format so that it can be used in the
ML model. Data preparation may involve cleaning, delivery, data augmentation, fusion,
ingestion, or loading.
o Exploratory Analysis: Exploratory analysis, or exploratory data analysis (EDA), is an
important step of feature engineering, mainly used by data scientists. This step involves
analyzing and investigating the dataset and summarizing its main characteristics.
Different data visualization techniques are used to better understand the manipulation of
data sources, to find the most appropriate statistical technique for data analysis, and to
select the best features for the data.
o Benchmark: Benchmarking is the process of setting a standard baseline for accuracy
against which all the variables are compared. The benchmarking process is used to improve
the predictability of the model and reduce the error rate.

Feature Engineering Techniques

Some of the popular feature engineering techniques include:

1. Imputation

Feature engineering deals with inappropriate data, missing values, human error, general
errors, insufficient data sources, etc. Missing values within the dataset greatly affect the
performance of the algorithm, and the "imputation" technique is used to deal with them.
Imputation is responsible for handling irregularities within the dataset.

One option is to remove rows or columns that have a large percentage of missing values. But
to maintain the data size, it is often necessary to impute the missing data instead, which can be
done as follows:

o For numerical data imputation, a default value can be imputed in a column, or missing
values can be filled with the mean or median of the column.
o For categorical data imputation, missing values can be replaced with the most frequently
occurring value in the column.
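
Both strategies can be sketched with scikit-learn's SimpleImputer; the small table below is
invented for illustration:

# fill numerical gaps with the mean, categorical gaps with the most frequent value
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "city": ["Pune", "Delhi", np.nan, "Delhi"]})

df["age"] = SimpleImputer(strategy="mean").fit_transform(df[["age"]]).ravel()
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()
print(df)    # NaNs replaced by the column mean / most frequent value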

2. Handling Outliers

Outliers are deviated values or data points that lie so far from the other data points that
they badly affect the performance of the model. Outliers can be handled with this feature
engineering technique, which first identifies the outliers and then removes them.

Standard deviation can be used to identify outliers. Each value lies at some distance from
the average; if a value lies farther away than a chosen threshold (for example, a few standard
deviations), it can be considered an outlier. The z-score can also be used to detect outliers.
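
A short sketch of the standard-deviation (z-score) approach on invented data:

# flag values whose z-score exceeds a chosen threshold
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95])    # 95 sits far from the rest
z = (values - values.mean()) / values.std()

threshold = 2.0                        # a common, tunable cut-off
print(values[np.abs(z) > threshold])   # -> [95]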

3. Log transform

Logarithm transformation, or log transform, is one of the commonly used mathematical techniques
in machine learning. The log transform helps in handling skewed data, making the distribution
closer to normal after transformation. It also reduces the effect of outliers on the data: because
magnitude differences are normalized, the model becomes more robust.

Note: Log transformation is only applicable to positive values; otherwise, it will give an error.
To avoid this, we can add 1 to the data before transformation, which ensures the transformed
input is positive.
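
A minimal sketch of the transform with the "+1" shift mentioned in the note, applied to an
invented right-skewed column:

# log(1 + x) compresses large values and is safe for zeros
import numpy as np

income = np.array([0, 500, 1_000, 10_000, 1_000_000])
log_income = np.log1p(income)      # equivalent to adding 1, then taking the log

print(np.round(log_income, 2))     # the skew is strongly compressed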

4. Binning

In machine learning, overfitting is one of the main issues that degrade the performance of the
model; it occurs due to a greater number of parameters and noisy data. One of
the popular feature engineering techniques, "binning", can be used to smooth out the noisy data.
This process involves segmenting different features into bins.
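
A short binning sketch with pandas (the ages and bin edges are invented for illustration):

# segment a numeric feature into coarse, less noisy bins
import pandas as pd

ages = pd.Series([3, 17, 25, 31, 46, 62, 79])
bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
              labels=["child", "young", "middle-aged", "senior"])
print(bins)    # each age mapped to its bin label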

5. Feature Split

As the name suggests, feature split is the process of splitting a feature into two or more
parts to make new features. This technique helps the algorithms better
understand and learn the patterns in the dataset.

The feature splitting process enables the new features to be clustered and binned, which results in
extracting useful information and improving the performance of the data models.
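
A minimal feature-split sketch with pandas (the names are invented for illustration):

# split one raw string feature into two new features
import pandas as pd

df = pd.DataFrame({"full_name": ["Ada Lovelace", "Alan Turing"]})
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", expand=True)
print(df)    # the single column has become two usable features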

6. One hot encoding

One-hot encoding is a popular encoding technique in machine learning. It converts categorical
data into a form that machine learning algorithms can easily understand and use to make good
predictions. It enables grouping of categorical data without losing any information.
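
A one-line sketch with pandas (the colors are invented for illustration):

# each category becomes its own binary column
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})
print(pd.get_dummies(df, columns=["color"]))   # color_blue, color_green, color_red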

Curse of Dimensionality
Curse of Dimensionality refers to a set of problems that arise when working with high-dimensional
data. The dimension of a dataset corresponds to the number of attributes/features
that exist in the dataset. A dataset with a large number of attributes, generally of the order of a
hundred or more, is referred to as high-dimensional data. Some of the difficulties that come
with high-dimensional data manifest while analyzing or visualizing the data to identify
patterns, and some manifest while training machine learning models. The difficulties related
to training machine learning models on high-dimensional data are referred to as the ‘Curse
of Dimensionality’.

In machine learning, even a marginal increase in dimensionality requires a large increase in
the volume of data in order to maintain the same level of performance. The curse of
dimensionality is the by-product of a phenomenon that appears with high-dimensional data.
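
A small numerical sketch of this phenomenon, using randomly generated points purely for
illustration:

# as dimensionality grows, nearest and farthest points become almost equally far
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                      # 500 random points in d dimensions
    dist = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point
    print(f"d={d:5d}  min/max distance ratio = {dist.min() / dist.max():.3f}")

As d grows, the printed ratio approaches 1: distances lose contrast, which is one reason
distance-based methods struggle in high dimensions.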

Linear Discriminant Analysis


Linear Discriminant Analysis, also called Normal Discriminant Analysis or Discriminant Function
Analysis, is a dimensionality reduction technique that is commonly used for supervised
classification problems. It is used for modelling differences between groups, i.e., separating two
or more classes, and it projects features from a higher-dimensional space into a lower-dimensional
space. For example, suppose we have two classes that we need to separate efficiently. Classes can
have multiple features, and using only a single feature to classify them may result in some
overlap, so we keep increasing the number of features for proper classification.

Example:
Suppose we have two sets of data points belonging to two different classes that we want to
classify. When the data points are plotted on a 2D plane, there may be no straight line that can
separate the two classes of data points completely. In this case, LDA (Linear Discriminant
Analysis) is used, which reduces the 2D data to a 1D axis in order to maximize the separability
between the two classes.

Here, Linear Discriminant Analysis uses both axes (X and Y) to create a new axis and
projects the data onto that new axis in a way that maximizes the separation of the two
categories, thereby reducing the 2D data to 1D.

Two criteria are used by LDA to create the new axis:

1. Maximize the distance between the means of the two classes.
2. Minimize the variation within each class.
The new axis is generated such that it maximizes the distance between the means of the two
classes while minimizing the variation within each class. In simple terms, this newly generated
axis increases the separation between the data points of the two classes. After generating this
new axis using the above criteria, all the data points of the classes are projected onto it.
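
For reference, these two criteria are combined in the standard Fisher formulation: LDA chooses
the projection direction w that maximizes

J(w) = (m1 - m2)^2 / (s1^2 + s2^2)

where m1 and m2 are the class means after projection onto w, and s1^2 and s2^2 are the
within-class scatters after projection. The maximizing direction is w ∝ SW^(-1)(μ1 - μ2),
where SW is the within-class scatter matrix and μ1, μ2 are the class means in the original
feature space.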

But Linear Discriminant Analysis fails when the means of the distributions are shared, as it
becomes impossible for LDA to find a new axis that makes both classes linearly separable.
In such cases, we use non-linear discriminant analysis.
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of variance (or
covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs are used
such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the estimate of
the variance (actually covariance), moderating the influence of different variables on LDA.
Implementation
 In this implementation, we will perform linear discriminant analysis using the Scikit-learn
library on the Iris dataset.
# necessary imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# read the Iris dataset from the UCI repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
cls = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=cls)

# divide the dataset into features and target variable
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

# preprocess the dataset and divide it into train and test sets
sc = StandardScaler()
X = sc.fit_transform(X)     # standardize the features
le = LabelEncoder()
y = le.fit_transform(y)     # encode class names as integers
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# apply Linear Discriminant Analysis (3 classes -> at most 2 components)
lda = LinearDiscriminantAnalysis(n_components=2)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)

# plot the scatterplot of the two LDA components
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
            cmap='rainbow', alpha=0.7, edgecolors='b')
plt.show()

# classify the LDA-transformed data using a random forest classifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# print the accuracy and confusion matrix
print('Accuracy : ' + str(accuracy_score(y_test, y_pred)))
conf_m = confusion_matrix(y_test, y_pred)
print(conf_m)
