
Machine Learning

Syllabus:
Unit 1- Introduction to Machine Learning
Basics of Statistics, Introduction of Machine learning, Examples of Machine Learning Problems,
Learning versus Designing, Training versus Testing, Characteristics of Machine learning tasks,
Predictive and descriptive tasks, database and data processing for ML.
Features: Feature types, Feature Construction and Transformation, Feature Selection.
Unit 2- Flavors of Machine Learning
Definition of learning systems, Types: Supervised, Unsupervised, Semi Supervised, Reinforcement
learning with examples, Introduction to Deep Learning, Deep learning vs Machine Learning.
Unit 3- Classification and Regression
Classification: Binary Classification- Assessing Classification performance, Class probability
Estimation- Assessing class probability Estimates, Multiclass Classification.
Regression: Assessing performance of Regression- Error measures, Overfitting- Catalysts for
Overfitting, Case study of Polynomial Regression.
Theory of Generalization: Effective number of hypothesis, Bounding the Growth function, VC
Dimensions, Regularization theory.
Unit 4- Neural Networks
Introduction, Neural Network Elements, Basic Perceptron, Feed Forward Network, Back Propagation
Algorithm, Introduction to Artificial Neural Network.
Unit 5- Machine Learning Models
Linear Models: Least Squares method, Multivariate Linear Regression, Regularized Regression,
Using Least Square regression for Classification.
Logic Based and Algebraic Models: Distance Based Models: Neighbours and Examples, Nearest
Neighbours Classification,
Rule Based Models: Rule learning for subgroup discovery, Association rule mining,
Tree Based Models: Decision Trees
Probabilistic Models: Normal Distribution and Its Geometric Interpretations, Naïve Bayes Classifier,
Discriminative learning with Maximum likelihood
Unit 6- Applications of Machine Learning
Email Spam and Malware Filtering, Image recognition, Speech Recognition, Traffic Prediction, Self-
driving Cars, Virtual Personal Assistant, Medical Diagnosis.
Unit No 01 Introduction to Machine Learning
Before actually defining machine learning, we should first understand the meaning of the two words
machine and learning; then we can understand what machine learning is.
 Learning: The ability to improve behavior based on experience is called learning.
 Machine: A mechanically, electrically or electronically operated device for performing a task
is a machine.
 Machine Learning:
Machine learning explores algorithms that learn from data and build models, and those models are
used for prediction, decision making, and solving tasks.
Definition: A computer program is said to learn from experience E (data) with respect to some class
of tasks T (prediction, classification, etc.) and performance measure P if its performance on tasks in T,
as measured by P, improves with experience E.
Machine Learning is a subset of artificial intelligence that focuses mainly on machines learning from
experience and making predictions based on that experience.
It enables computers or machines to make data-driven decisions rather than being explicitly
programmed to carry out a certain task. These programs or algorithms are designed in a way that
they learn and improve over time when they are exposed to new data.

Figure 1.1 Flow of Machine Learning


Machine Learning algorithm is trained using a training data set to create a model. When new input
data is introduced to the ML algorithm, it makes a prediction on the basis of the model.
The prediction is evaluated for accuracy and if the accuracy is acceptable, the Machine Learning
algorithm is deployed. If the accuracy is not acceptable, the Machine Learning algorithm is trained
again and again with an augmented training data set.
Figure 1.2 Difference between Traditional Programming and Machine Learning
For creating a learner
1. Choose the training experience (features of the domain)
2. Choose the target function (that is to be learned)
3. Choose how to represent the target function (class of the functions/Hypothesis language)
4. Choose a learning algorithm to infer the target function
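As an illustration of the flow in Figure 1.1, the following sketch (not part of the original notes) uses scikit-learn and its built-in iris dataset; the 0.9 accuracy threshold is an arbitrary assumption standing in for "acceptable accuracy".

```python
# A minimal sketch of the train -> evaluate -> (re)train loop, using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # training experience (data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000)                # hypothesis class + learning algorithm
model.fit(X_train, y_train)                              # learn the target function from data

accuracy = accuracy_score(y_test, model.predict(X_test)) # performance measure P on task T
if accuracy >= 0.9:                                      # arbitrary acceptance threshold
    print(f"Deploy model (accuracy = {accuracy:.2f})")
else:
    print("Accuracy not acceptable - retrain with more/augmented data")
```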
 Statistics and Machine Learning
The following points explain why a machine learning practitioner should deepen their understanding of statistics.
1. Statistics in Data Preparation
Statistical methods are required in the preparation of train and test data for your machine learning
model.
This includes techniques for:
 Outlier detection.
 Missing value imputation.
 Data sampling.
 Data scaling.
 Variable encoding and much more.
A basic understanding of data distributions, descriptive statistics, and data visualization is required to
help you identify the methods to choose when performing these tasks.
2. Statistics in Model Evaluation
Statistical methods are required when evaluating the skill of a machine learning model on data not
seen during training.
This includes techniques for:
 Data sampling
 Data Resampling
 Experimental design
Resampling techniques such as k-fold cross-validation are often well understood by machine learning
practitioners, but the rationale for why this method is required is not.
3. Statistics in Model Selection
Statistical methods are required when selecting a final model or model configuration to use for a
predictive modeling problem.
These include techniques for:
 Checking for a significant difference between results.
 Quantifying the size of the difference between results.
This might include the use of statistical hypothesis tests.
4. Statistics in Model Presentation
Statistical methods are required when presenting the skill of a final model to stakeholders.
This includes techniques for:
 Summarizing the expected skill of the model on average.
 Quantifying the expected variability of the skill of the model in practice.
This might include estimation statistics such as confidence intervals.
5. Statistics in Prediction
Statistical methods are required when making a prediction with a finalized model on new data.
This includes techniques for:
 Quantifying the expected variability for the prediction.
This might include estimation statistics such as prediction intervals.
 Introduction to Statistics
Statistics is a subfield of mathematics. It refers to a collection of methods for working with data and
using data to answer questions.
Because the field comprises a grab bag of methods for working with data, it can seem large and
amorphous to beginners. It can be hard to see the line between methods that belong to statistics and
methods that belong to other fields of study.
When it comes to the statistical tools that we use in practice, it can be helpful to divide the field of
statistics into two large groups of methods: descriptive statistics for summarizing data, and inferential
statistics for drawing conclusions from samples of data.
 Descriptive Statistics: Descriptive statistics refer to methods for summarizing raw observations into
information that we can understand and share.
 Inferential Statistics: Inferential statistics is a fancy name for methods that aid in quantifying
properties of the domain or population from a smaller set of obtained observations called a sample.

 Gaussian distribution and Descriptive Stats


In this lesson, you will discover the Gaussian distribution for data and how to calculate simple
descriptive statistics.
A sample of data is a snapshot from a broader population of all possible observations that could be
taken from a domain or generated by a process.
Interestingly, many observations fit a common pattern or distribution called the normal distribution, or
more formally, the Gaussian distribution. It is the bell-shaped distribution that you may be familiar
with.
A lot is known about the Gaussian distribution, and as such, there are whole sub-fields of statistics
and statistical methods that can be used with Gaussian data.
Any Gaussian distribution, and in turn any data sample drawn from a Gaussian distribution can be
summarized with just two parameters:
 Mean. The central tendency or most likely value in the distribution (the top of the bell).
 Variance. The average squared difference of the observations from the mean value in the distribution (the
spread).
The units of the mean are the same as the units of the distribution, although the units of the variance
are squared, and therefore harder to interpret. A popular alternative to the variance parameter is
the standard deviation, which is simply the square root of the variance, returning the units to be the
same as those of the distribution.
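A minimal sketch of these descriptive statistics, assuming NumPy is available; the sample is synthetic data drawn from a Gaussian distribution with an arbitrarily chosen mean of 50 and standard deviation of 5.

```python
# Descriptive statistics of a sample drawn from a Gaussian distribution (illustrative values).
import numpy as np

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=50.0, scale=5.0, size=1000)   # mean 50, standard deviation 5

print("mean:", sample.mean())                  # central tendency (top of the bell)
print("variance:", sample.var(ddof=1))         # average squared deviation (squared units)
print("std deviation:", sample.std(ddof=1))    # square root of variance (original units)
```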

 Correlation between Variables


Variables in a dataset may be related for lots of reasons.
It can be useful in data analysis and modeling to better understand the relationships between variables.
The statistical relationship between two variables is referred to as their correlation.
A correlation could be positive, meaning both variables move in the same direction, or negative,
meaning that when one variable's value increases, the other variable's value decreases.
 Positive Correlation: Both variables change in the same direction.
 Neutral Correlation: No relationship in the change of the variables.
 Negative Correlation: Variables change in opposite directions.
The performance of some algorithms can deteriorate if two or more variables are tightly related, called
multicollinearity. An example is linear regression, where one of the offending correlated variables
should be removed in order to improve the skill of the model.
We can quantify the relationship between samples of two variables using a statistical method called
Pearson's correlation coefficient, named for the developer of the method, Karl Pearson.
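A small sketch of computing Pearson's correlation coefficient, assuming NumPy and SciPy; the two variables are synthetic and constructed to be positively correlated.

```python
# Pearson's correlation coefficient between two related variables (synthetic data).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(seed=2)
x = rng.normal(100, 20, 500)
y = 0.5 * x + rng.normal(0, 10, 500)   # y is positively correlated with x by construction

r, p_value = pearsonr(x, y)
print(f"Pearson's r = {r:.3f}, p-value = {p_value:.3g}")   # r near +1 => strong positive correlation
```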
 Statistical Hypothesis Tests
Data must be interpreted in order to add meaning. We can interpret data by assuming a specific
structure in our outcome and using statistical methods to confirm or reject the assumption.
The assumption is called a hypothesis and the statistical tests used for this purpose are called
statistical hypothesis tests.
The assumption of a statistical test is called the null hypothesis, or hypothesis zero (H0 for short). It is
often called the default assumption, or the assumption that nothing has changed. A violation of the
test's assumption is often called the first hypothesis, hypothesis one, or H1 for short.
 Hypothesis 0 (H0): Assumption of the test holds and is failed to be rejected.
 Hypothesis 1 (H1): Assumption of the test does not hold and is rejected at some level of significance.
We can interpret the result of a statistical hypothesis test using a p-value.
The p-value is the probability of observing data at least as extreme as that observed, assuming the null
hypothesis is true.
A large p-value means that the data is consistent with H0, the default assumption. A small value, such as below
5% (0.05), suggests that it is not likely and that we can reject H0 in favor of H1, or that something is
likely to be different (e.g. a significant result).
A widely used statistical hypothesis test is the Student's t-test for comparing the mean values from
two independent samples.
The default assumption is that there is no difference between the samples, whereas a rejection of this
assumption suggests some significant difference. The test assumes that both samples were drawn from
a Gaussian distribution and have the same variance.
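A hedged example of the Student's t-test using SciPy's ttest_ind on two synthetic Gaussian samples; the 0.05 significance level is the conventional choice mentioned above.

```python
# Student's t-test for the difference in means of two independent Gaussian samples.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=3)
sample1 = rng.normal(50, 5, 100)
sample2 = rng.normal(52, 5, 100)    # shifted mean, same variance

stat, p = ttest_ind(sample1, sample2)
alpha = 0.05
if p < alpha:
    print(f"p = {p:.4f}: reject H0 - the means are likely different")
else:
    print(f"p = {p:.4f}: fail to reject H0 - no significant difference detected")
```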
 Estimation Statistics
Statistical hypothesis tests can be used to indicate whether the difference between two samples is due
to random chance, but cannot comment on the size of the difference.
A group of methods referred to as "new statistics" are seeing increased use instead of or in addition to
p-values in order to quantify the magnitude of effects and the amount of uncertainty for estimated
values. This group of statistical methods is referred to as estimation statistics.
Estimation statistics is a term to describe three main classes of methods. The three main
classes of methods include:
 Effect Size: Methods for quantifying the size of an effect given a treatment or intervention.
 Interval Estimation: Methods for quantifying the amount of uncertainty in a value.
 Meta-Analysis: Methods for quantifying the findings across multiple similar studies.
Of the three, perhaps the most useful method in applied machine learning is interval estimation.
There are three main types of intervals. They are:
 Tolerance Interval: The bounds or coverage of a proportion of a distribution with a specific level of
confidence.
 Confidence Interval: The bounds on the estimate of a population parameter.
 Prediction Interval: The bounds on a single observation.
A simple way to calculate a confidence interval for a classification algorithm is to calculate the
binomial proportion confidence interval, which can provide an interval around a model's estimated
accuracy or error.
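A rough sketch of the binomial proportion confidence interval using the normal approximation; the counts (88 correct out of 100) and the 95% confidence level are made-up assumptions.

```python
# Binomial proportion confidence interval (normal approximation) around a model's accuracy.
import math

correct, n = 88, 100                 # hypothetical: 88 correct predictions out of 100
accuracy = correct / n
z = 1.96                             # ~95% confidence level
margin = z * math.sqrt(accuracy * (1 - accuracy) / n)

print(f"accuracy = {accuracy:.2f}, 95% CI = [{accuracy - margin:.3f}, {accuracy + margin:.3f}]")
```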
 Nonparametric Statistics
Nonparametric statistical methods may be used when your data does not come from a Gaussian distribution.
A large portion of the field of statistics and statistical methods is dedicated to data where the
distribution is known.
Data in which the distribution is unknown or cannot be easily identified is called nonparametric.
In the case where you are working with nonparametric data, specialized nonparametric statistical
methods can be used that discard all information about the distribution. As such, these methods are
often referred to as distribution-free methods.
Before a nonparametric statistical method can be applied, the data must be converted into a rank
format. As such, statistical methods that expect data in rank format are sometimes called rank
statistics, such as rank correlation and rank statistical hypothesis tests. Ranking data is exactly as its
name suggests.
The procedure is as follows:
 Sort all data in the sample in ascending order.
 Assign an integer rank from 1 to N for each unique value in the data sample.
A widely used nonparametric statistical hypothesis test for checking for a difference between two
independent samples is the Mann-Whitney U test, named for Henry Mann and Donald Whitney.
It is the nonparametric equivalent of the Student's t-test but does not assume that the data is drawn
from a Gaussian distribution.
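A short sketch of the Mann-Whitney U test with SciPy, using synthetic non-Gaussian (exponential) samples to mimic the distribution-free setting.

```python
# Mann-Whitney U test: nonparametric check for a difference between two independent samples.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(seed=4)
sample1 = rng.exponential(scale=1.0, size=80)   # clearly non-Gaussian data
sample2 = rng.exponential(scale=1.5, size=80)

stat, p = mannwhitneyu(sample1, sample2, alternative="two-sided")
print(f"U statistic = {stat:.1f}, p-value = {p:.4f}")   # small p suggests the samples differ
```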
 Introduction to Machine Learning
 Machine learning (ML) is a category of algorithms that allows software applications to
become more accurate in predicting outcomes without being explicitly programmed. The basic
premise of machine learning is to build algorithms that can receive input data and use
statistical analysis to predict an output while updating outputs as new data becomes available.
 Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning
generally is to understand the structure of data and fit that data into models that can be
understood and utilized by people.
 Machine learning algorithms allow for computers to train on data inputs and use statistical
analysis in order to output values that fall within a specific range. Because of this, machine
learning facilitates computers in building models from sample data in order to automate
decision-making processes based on data inputs.
 Any technology user today has benefitted from machine learning. Facial recognition
technology allows social media platforms to help users tag and share photos of friends.
Optical character recognition (OCR) technology converts images of text into movable type.
Recommendation engines, powered by machine learning, suggest what movies or television
shows to watch next based on user preferences. Self-driving cars that rely on machine
learning to navigate may soon be available to consumers.
 Learning Versus Designing
 The learning is about acquiring skills or knowledge from experience. Most
commonly, this means synthesizing useful concepts from historical data.
 To deliver the best results, learning algorithms need vast amounts of detailed data,
clean of any confounding factors or built-in biases.
 Design helps machine learning gather better data.
 Designers can help create user experiences that eliminate noise in data, leading to
more accurate and efficient ML-powered applications.
 Design helps set expectations and build trust with the users.
 These design details build trust and understanding among users. Trust is a vital
component in how ML achieves its goals.
 Training versus Testing
 Training data and test data are two important concepts in machine learning.
 In a dataset, a training set is used to build up a model, while a test (or
validation) set is used to validate the model built. Data points in the training set are
excluded from the test (validation) set. Usually, a dataset is either divided into a training
set and a validation set (some people use 'test set' instead) in each iteration, or divided into a
training set, a validation set and a test set in each iteration.
 In Machine Learning, we basically try to create a model to predict unseen data. So, we
use the training data to fit the model and the testing data to evaluate it. The generated model
is used to predict the unknown results, which form the test set. As we know, the
dataset is divided into train and test sets in order to check accuracy and precision by
training the model on one part and testing it on the other.
1. Training Set: Here, you have the complete training dataset. You can extract features and train to fit a
model and so on.
2. Validation Set: This is crucial to choose the right parameters for your estimator. We can divide the
training set into a train set and validation set. Based on the validation test results, the model can be
trained (for instance, changing parameters, classifiers). This will help us get the most optimized model.
3. Testing Set: Here, once the model is obtained from the training set, you can use it to make predictions
on the test data.
Figure 1.3 Training vs. Testing
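A minimal sketch of splitting a dataset into training, validation, and test sets with scikit-learn; the 60/20/20 proportions are an arbitrary but common choice.

```python
# Splitting a dataset into training, validation, and test sets (60/20/20 split).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve out the test set, then split the remainder into train and validation sets.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), "train /", len(X_val), "validation /", len(X_test), "test samples")
```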
 Characteristics of Machine Learning
In order to understand the actual power of machine learning, you have to consider the characteristics of
this technology. There are lots of examples that echo the characteristics of machine learning in today's
data-rich world. Here are seven key characteristics of machine learning for which companies should
prefer it over other technologies.
1. The ability to perform automated data visualization
2. Automation at its best
3. Customer engagement like never before
4. The ability to take efficiency to the next level when merged with IoT
5. The ability to change the mortgage market
6. Accurate data analysis
7. Business intelligence at its best
 Predictive and descriptive tasks
1. Descriptive task: Insight into the past
Descriptive tasks do exactly what the name implies: they "describe", or summarize, raw data and
make it something that is interpretable by humans. They are analytics that describe the past. The past
refers to any point of time at which an event has occurred, whether it is one minute ago or one year ago.
Descriptive analytics are useful because they allow us to learn from past behaviors, and understand
how they might influence future outcomes.
The vast majority of the statistics we use fall into this category. (Think basic arithmetic like sums,
averages, percent changes.) Usually, the underlying data is a count, or aggregate of a filtered column
of data to which basic math is applied. For all practical purposes, there are an infinite number of these
statistics. Descriptive statistics are useful to show things like total stock in inventory, average dollars
spent per customer and year-over-year change in sales. Common examples of descriptive analytics
are reports that provide historical insights regarding the company's production, financials, operations,
sales, finance, inventory and customers. Use Descriptive Analytics when you need to understand at an
aggregate level what is going on in your company, and when you want to summarize and describe
different aspects of your business.
2. Predictive task: Understanding the Future
Predictive analytics has its roots in the ability to "predict" what might happen. These analytics are
about understanding the future. Predictive analytics provides companies with actionable insights
based on data. Predictive analytics provides estimates about the likelihood of a future outcome. It is
important to remember that no statistical algorithm can "predict" the future with 100% certainty.
Companies use these statistics to forecast what might happen in the future. This is because the
foundation of predictive analytics is based on probabilities.
These statistics try to take the data that you have, and fill in the missing data with best guesses. They
combine historical data found in ERP, CRM, HR and POS systems to identify patterns in the data and
apply statistical models and algorithms to capture relationships between various data sets. Companies
use predictive statistics and analytics any time they want to look into the future. Predictive analytics
can be used throughout the organization, from forecasting customer behavior and purchasing patterns
to identifying trends in sales activities. They also help forecast demand for inputs from the supply
chain, operations and inventory.
One common application most people are familiar with is the use of predictive analytics to produce a
credit score. These scores are used by financial services to determine the probability of customers
making future credit payments on time. Typical business uses include understanding how sales might
close at the end of the year, predicting what items customers will purchase together, or forecasting
inventory levels based upon a myriad of variables.
Use Predictive Analytics any time you need to know something about the future, or fill in the
information that you do not have.
 Database for Machine Learning
One of the most critical components in machine learning projects is the database management
system. With the help of this system, large amounts of data can be sorted and one can gain
meaningful insights from them.
1. Apache Cassandra
It is an open-source and highly scalable NoSQL database management system that is designed to
manage massive amounts of data in a faster manner. This popular database is being used by
GitHub, Netflix, Instagram, Reddit, among others. Cassandra has Hadoop integration, with
MapReduce support.
Advantages:
 Fault Tolerance: In Cassandra, the data is automatically replicated to multiple nodes for fault-
tolerance. Also, failed nodes can be replaced with no downtime
 Elastic Scalability: Cassandra is designed with both read and write throughput, which
increases linearly as new machines are added.
2. Couchbase
It is an open-source, distributed, NoSQL document-oriented engagement database. It exposes a
fast key-value store with managed cache for sub-millisecond data operations, purpose-built
indexers for fast queries and a powerful query engine for executing SQL-like queries.
Advantages:
 Unified Programming Interface: The Couchbase Data Platform provides simple, uniform and
powerful application development APIs across multiple programming languages, connectors,
and tools that make building applications simple and accelerates time to market for
applications.
 Big data and SQL Integrations: Couchbase Data platform includes built-in Big Data and SQL
integration which allows a user to leverage tools, processing capacity, and data wherever it
may reside.
 Container and Cloud Deployments: Couchbase supports all cloud platforms as well as a
variety of container and virtualization technologies.
3. DynamoDB
Amazon DynamoDB is a fully managed, multi-region, durable database with built-in security,
backup and restore, and in-memory caching for internet-scale applications. This accessible
database has been used by Lyft, Airbnb, Toyota, Samsung, among others. DynamoDB offers
encryption at rest, which eliminates the operational burden and complexity involved in protecting
sensitive data.
Advantages:
 High Availability and Durability: DynamoDB automatically spreads the data and traffic for
the tables over a sufficient number of servers to handle the throughput and storage
requirements while maintaining consistent as well as fast performance.
 Performance at Scale: DynamoDB provides consistent as well as single-digit millisecond
response times at any scale. The DynamoDB global tables replicate the data across multiple
AWS regions in order to provide fast and local access to data for globally distributed
applications.
4. Elasticsearch
It is built on Apache Lucene and is a distributed, open-source search and analytics engine for
all types of data including textual, numerical, geospatial, structured and unstructured data.
Elasticsearch is the central component of the Elastic Stack which is a set of open-source tools
for data ingestion, enrichment, storage, analysis, and visualization.
Advantages:
 Extensive Number of Features: Besides speed, scalability and resiliency, Elasticsearch has
several built-in features such as data rollups and index lifecycle management which make
storing and searching data efficient.
 Faster in Manner: Elasticsearch excels at full-text search and it is well-suited for time-
sensitive use cases such as security analytics, infrastructure monitoring, etc.
5. MLDB
The Machine Learning Database (MLDB) is an open-source system for solving big data machine
learning problems, from data collection and storage through analysis and the training of machine
learning models to the deployment of real-time prediction endpoints. In MLDB, machine learning
models are applied using Functions, which are parameterized by the output of training
Procedures, which run over Datasets containing training data.
Advantages:
 Easy to Use: MLDB provides a comprehensive implementation of the SQL SELECT
statement, treating datasets as tables, with rows as relations. This makes the database system
easy to learn and use for data analysts familiar with existing Relational Database Management
Systems (RDBMS).
6. Microsoft SQL Server
Written in C and C++, Microsoft SQL Server is a relational database management system
(RDBMS). This database helps in gaining insights from all the data by querying across relational,
non-relational, structured as well as unstructured data.
Advantages:
 Flexible: One can use the language and platform of choice with open source support.
 Manage Big Data Environment: With SQL Server, one can manage a big data environment
more easily with Big Data Clusters. It provides vital elements of a data lake, such as the Hadoop
Distributed File System (HDFS), Apache Spark and analytics tools, which are deeply
integrated with SQL Server and fully supported by Microsoft.
7. MySQL
Written in C and C++, MySQL is one of the most popular open-source relational database
management systems (RDBMS) powered by Oracle. It has been used by successful organizations
such as Facebook, Twitter, YouTube, among others.
Advantages:
 Security and Scalability: This database management system includes data security layers that
protect sensitive data and it offers scalability to handle large amounts of data.
 Backup Software: mysqldump is a logical backup tool included with both community and
enterprise editions of MySQL. It supports backing up from all storage engines.
8. MongoDB
MongoDB is a general-purpose, document-based, distributed database which is built for advanced
application developers. Since this is a document database, it mainly stores data in JSON-like
documents. It provides support for aggregations and other modern use-cases such as geo-based
search, graph search, and text search.
Advantages:
 Data Store Flexibility: MongoDB stores data in flexible, JSON-like documents which means
fields can vary from document to document and data structure can be changed over time.
 Distributed Database: MongoDB is a distributed database at its core, which is why high
availability, horizontal scaling, and geographic distribution are built in and easy to use.
9. PostgreSQL
PostgreSQL is a powerful, open-source object-relational database system which uses and extends
the SQL language combined with many features that safely store and scale the most complicated
data workloads. This database management system aims to help developers build applications,
administrators to protect data integrity, build fault-tolerant environments and much more.
Advantages:
 Security: PostgreSQL has a robust access-control system as well as column and row-level
security.
 Extensibility: This system has foreign data wrappers which connect to other databases or
streams with a standard SQL interface.
10. Redis
Redis is an open-source, in-memory data structure store which is used as a database, cache and
message broker. It supports data structures such as strings, sorted sets with range queries, bitmaps,
hyperloglogs, geospatial indexes, etc. The database has built-in replication, Lua scripting, LRU
eviction, transactions and different levels of on-disk persistence.
Advantages:
 Automatic Failover: In Redis Sentinel, a failover process can be started where a replica is
promoted to master and the other additional replicas can be reconfigured to use the new
master.
 Redis-ML: Redis-ML is a Redis module which implements several machine learning models
as built-in Redis data types. It is simple to load and deploy trained models from any platform
(such as Apache Spark and scikit-learn) in a production environment.
 Data Processing for ML
Data Processing is the task of converting data from a given form to a much more usable and
desired form i.e. making it more meaningful and informative. Using Machine Learning algorithms,
mathematical modeling, and statistical knowledge, this entire process can be automated. The
output of this complete process can be in any desired form like graphs, videos, charts, tables,
images, and many more, depending on the task we are performing and the requirements of the
machine. This might seem to be simple but when it comes to massive organizations like Twitter,
Facebook, Administrative bodies like Parliament, UNESCO, and health sector organizations, this
entire process needs to be performed in a very structured manner. So, the steps to perform are as
follows:

Figure 1.4 Data Processing for Machine Learning


 Collection:
The most crucial step when starting with ML is to have data of good quality and accuracy. Data
can be collected from any authenticated source like data.gov.in, Kaggle or UCI dataset repository.
For example, while preparing for a competitive exam, students study from the best study material
that they can access so that they learn the best and obtain the best results. In the same way, high-
quality and accurate data will make the learning process of the model easier and better, and at the
time of testing, the model will yield state-of-the-art results. A huge amount of capital, time and
resources is consumed in collecting data. Organizations or researchers have to decide what kind
of data they need to execute their tasks or research. Example: working on a Facial Expression
Recognizer needs numerous images having a variety of human expressions. Good data ensures
that the results of the model are valid and can be trusted.
 Preparation:
The collected data can be in a raw form which can't be directly fed to the machine. So, this is a
process of collecting datasets from different sources, analyzing these datasets and then
constructing a new dataset for further processing and exploration. This preparation can be
performed either manually or with an automatic approach. Data can also be prepared in numeric
form, which speeds up the model's learning.
Example: An image can be converted to a matrix of N X N dimensions; the value of each cell will
indicate the image pixel.
 Input:
The prepared data may still be in a form that is not machine-readable, so to convert this
data to a readable form, some conversion algorithms are needed. For this task to be executed,
high computation and accuracy are needed. Example: Data can be collected through sources like
MNIST Digit data (images), Twitter comments, audio files, video clips.
 Processing:
This is the stage where algorithms and ML techniques are required to perform the instructions
provided over a large volume of data with accuracy and optimal computation.
 Output:
In this stage, results are procured by the machine in a meaningful manner which can be inferred
easily by the user. Output can be in the form of reports, graphs, videos, etc.
 Storage:
This is the final step in which the obtained output, the data model, and all the useful
information are saved for future use.
 Feature Types in machine learning
Often the individual observations are analyzed into a set of quantifiable properties which are called
features. The four types of features are:
1. Categorical (e.g. "A", "B", "AB", or "O" for blood type)
2. Ordinal (e.g. "large", "medium", "small")
3. Integer-valued (e.g. the number of words in a text)
4. Real-valued (e.g. height)
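The sketch below illustrates how these four feature types might be represented and encoded with pandas; the DataFrame values are made up.

```python
# Representing the four feature types in a small, made-up pandas DataFrame.
import pandas as pd

df = pd.DataFrame({
    "blood_type": ["A", "B", "AB", "O"],                   # categorical
    "shirt_size": ["small", "medium", "large", "medium"],  # ordinal
    "word_count": [120, 87, 200, 54],                      # integer-valued
    "height_cm": [172.5, 165.0, 180.2, 158.7],             # real-valued
})

# Categorical features have no order, so one-hot encode them;
# ordinal features keep their order via an explicit mapping.
df = pd.get_dummies(df, columns=["blood_type"])
df["shirt_size"] = df["shirt_size"].map({"small": 0, "medium": 1, "large": 2})
print(df)
```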

 Feature construction in machine learning


Feature construction means constructing additional features from existing data. These features are
usually distributed in multiple related tables. Feature construction requires extracting relevant
information from the data and storing it in a single table, which is then used to train machine learning
models. This requires us to spend a lot of time studying the actual data samples and thinking about the
potential form of the problem and the data structure, so that the constructed features can be better applied to
the prediction model.
Feature construction requires strong insight and analysis capabilities, and requires us to find
physically meaningful features in the original data. For tabular data, feature construction means
mixing or combining features to get new features, or constructing new features by decomposing or
segmenting features; for text data, it means designing problem-specific text indexes; for
image data, it means automatically filtering the data to obtain the relevant structure.
Feature construction is a very time-consuming process, because each new feature usually requires a
few steps to construct, especially when using information from multiple tables. We can divide the
operation of feature construction into two categories: "transformation" and "aggregation".
A "transformation" acts on a single table, constructing new features from one or more columns.
"Aggregation" is implemented across tables and uses one-to-many associations to group observations
and then calculate statistics.
Many machine learning competitions directly provide the training set (features + class labels), so we can
"transform" the given features to construct more features. In actual work, many times we do not have
ready-made features; we need to perform the "aggregation" operation to construct the features
required by the model from multiple original data tables.
For example, each record in a user behavior data table is a single browsing or click action by
a user. We need to construct the user's behavior features through the "aggregation" operation
(such as the length of the user's last browsing session or the number of clicks in the user's last
login), then use the "transformation" operation to construct more features, and finally use these features
to train the model.
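A small pandas sketch of the "aggregation" and "transformation" operations described above; the user behavior table and the feature names (n_events, n_clicks, avg_duration, click_rate) are hypothetical.

```python
# "Aggregation": constructing per-user features from a (made-up) user behavior table.
import pandas as pd

behavior = pd.DataFrame({
    "user_id":    [1, 1, 1, 2, 2, 3],
    "action":     ["view", "click", "view", "view", "click", "view"],
    "duration_s": [30, 5, 42, 12, 8, 60],
})

user_features = behavior.groupby("user_id").agg(
    n_events=("action", "count"),
    n_clicks=("action", lambda a: (a == "click").sum()),
    avg_duration=("duration_s", "mean"),
)

# A "transformation" on the aggregated table then creates yet another feature.
user_features["click_rate"] = user_features["n_clicks"] / user_features["n_events"]
print(user_features)
```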
 Feature transformation in machine learning
Feature transformation is the process of modifying your data while keeping the information. These
modifications make the data easier for machine learning algorithms to understand, which delivers better
results.
A proper feature transformation can bring a significant improvement to your model. Sometimes
feature transformation is the only way to gain a better score, so it is a crucial point how you represent
your data and feed it to a target model.
Feature transformation (FT) refers to a family of algorithms that create new features using the existing
features. These new features may not have the same interpretation as the original features, but they
may have more discriminatory power in a different space than the original space. This can also be
used for feature reduction. FT may happen in many ways, by simple/linear combinations of original
features or using non-linear functions. Some common techniques for FT are:
 Scaling or normalizing features within a range, say between 0 to 1
 Principal Component Analysis and its variants
 Random Projection
 Neural Networks
 SVM also transforms features internally
 Transforming categorical features to numerical
Here we will look at some feature transformations that are common during preprocessing. They range
from simple scaling and centering to some of the most complicated procedures in machine learning.
Perhaps the most common feature transformation is one that is seldom thought of as a feature
transformation at all. It is very common to scale and center the features you are working with.
Centering of a real-valued variable is done by subtracting its sample mean from all values. The
sample mean is calculated as
$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
where N is the number of values in the sample. Note that the sample mean is generally denoted by the bar
over the variable, whereas the population mean is normally denoted by the Greek letter $\mu$ (mu).
Scaling of a real-valued variable is done by dividing all its values by its sample standard deviation.
The sample standard deviation is calculated as
$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$
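A minimal sketch of centering and scaling, done both by hand with NumPy and with scikit-learn's StandardScaler; the feature values are made up.

```python
# Centering and scaling (standardization) of a real-valued feature, by hand and with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([12.0, 15.0, 9.0, 21.0, 18.0]).reshape(-1, 1)

manual = (x - x.mean()) / x.std()                    # subtract the mean, divide by the std deviation
sklearn_scaled = StandardScaler().fit_transform(x)   # same idea, as a reusable transformer

print(manual.ravel())
print(sklearn_scaled.ravel())
```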

Features contain information about the target. More features do not mean more information. Irrelevant
features and redundant features may lead to wrong conclusions, especially when there is a limited
training set and limited computation resources. This leads to the curse of dimensionality. To reduce this
curse, feature reduction is implemented. There are two types of feature reduction: one is feature
extraction and the other is feature selection. Feature extraction and feature selection methods are used to
either improve or maintain classification accuracy and to simplify classifier complexity.
 Feature Selection in machine learning
It is the method of reducing data dimension while doing predictive analysis. One major reason is
that machine learning follows the rule of "garbage in, garbage out", and that is why one needs to be
very concerned about the data that is being fed to the model.
We will discuss various kinds of feature selection techniques in machine learning and why they play
an important role in machine learning tasks.
The feature selection techniques simplify machine learning models in order to make them easier to
interpret by researchers. They mainly eliminate the effects of the curse of dimensionality. Besides,
these techniques reduce the problem of overfitting by enhancing generalization in the model. Thus
they help in better understanding of the data, improve prediction performance, and reduce the computational
time as well as the space required to run the algorithm.
1. Filter Method

This method uses a variable ranking technique in order to select the variables for ordering and here,
the selection of features is independent of the classifier used. The ranking indicates how useful
and important each feature is expected to be for classification. It basically selects subsets of
variables as a pre-processing step, independently of the chosen predictor. In filtering, the ranking
method can be applied before classification to filter out the less relevant features. It carries out the
feature selection task as a pre-processing step which involves no induction algorithm. Some examples
of filter methods are mentioned below, followed by a short code sketch:
 Chi-Square Test: In general terms, this method is used to test the independence of two events.
If a dataset is given for two events, we can get the observed count and the expected count, and
this test measures how much the two counts deviate from each other.
 Variance Threshold: This approach of feature selection removes all features whose variance
does not meet some threshold. Generally, it removes all the zero-variance features, which
means all the features that have the same value in all samples.
 Information Gain: Information gain or IG measures how much information a feature gives
about the class. Thus, we can determine which attribute in a given set of training features is the
most meaningful for discriminating between the classes to be learned.
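As a rough illustration of filter methods, the sketch below applies a variance threshold and a chi-square ranking with scikit-learn on the iris dataset; the threshold of 0.2 and k=2 are arbitrary assumptions.

```python
# Filter-style feature selection: variance threshold and a chi-square ranking (iris data).
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2

X, y = load_iris(return_X_y=True)

X_var = VarianceThreshold(threshold=0.2).fit_transform(X)       # drop near-constant features
X_chi2 = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)  # keep the 2 best-ranked features

print("original:", X.shape, "after variance filter:", X_var.shape, "after chi2:", X_chi2.shape)
```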
2. Wrapper Method

Fig: Wrapper Approach to feature subset selection

This method utilizes the learning machine of interest as a black box to score subsets of variables
according to their predictive power. In the figure above, for supervised machine learning, the
induction algorithm is presented with a set of training instances, where each instance is described by a
vector of feature values and a class label. The induction algorithm, which is also considered the
black box, is used to induce a classifier which is useful in classifying. In the wrapper approach, the
feature subset selection algorithm exists as a wrapper around the induction algorithm. One of the main
drawbacks of this technique is the large amount of computation required to obtain the feature subset. Some
examples of Wrapper Methods are mentioned below, followed by a short code sketch:
 Genetic Algorithms: This algorithm can be used to find a subset of features. CHCGA is a
modified version of this algorithm which converges faster and renders a more effective search
by maintaining diversity and avoiding stagnation of the population.
 Recursive Feature Elimination: RFE is a feature selection method which fits a model and
removes the weakest feature until the specified number of features is reached. Here, the
features are ranked by the model's coefficients or feature importance attributes.
 Sequential Feature Selection: This naive algorithm starts with a null set and then, in the first
step, adds the feature which yields the highest value for the objective function; from the
second step onwards the remaining features are added individually to the current subset and
the new subset is evaluated. This process is repeated until the required number of
features is added.
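A hedged sketch of the wrapper approach using scikit-learn's Recursive Feature Elimination around a logistic regression "black box"; the breast cancer dataset and the choice of 5 features are illustrative assumptions.

```python
# Wrapper-style selection: Recursive Feature Elimination around a logistic regression "black box".
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

estimator = LogisticRegression(max_iter=5000)
rfe = RFE(estimator, n_features_to_select=5)   # repeatedly drop the weakest feature
rfe.fit(X, y)

selected = [name for name, keep in zip(load_breast_cancer().feature_names, rfe.support_) if keep]
print("selected features:", selected)
```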
3. Embedded Method
This method tries to combine the efficiency of both the previous methods and performs the selection
of variables during the training process; it is usually specific to given learning machines. This method
basically learns which features contribute the most to the accuracy of the model.
Some examples of Embedded Methods are mentioned below:
 L1 Regularization Technique such as LASSO: Least Absolute Shrinkage and Selection
Operator (LASSO) is a linear model which estimates sparse coefficients and is useful in some
contexts due to its tendency to prefer solutions with fewer non-zero parameter values.
 Ridge Regression (L2 Regularization): L2 Regularization is also known as Ridge
Regression or Tikhonov Regularization; it solves a regression model where the loss
function is the linear least squares function and the regularization term is the L2 norm.
 Elastic Net: This linear regression model is trained with both L1 and L2 regularizers, which
allows it to learn a sparse model where few of the weights are non-zero (like Lasso) while
maintaining the regularization properties of Ridge.
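A minimal sketch of embedded selection with LASSO in scikit-learn; the diabetes dataset and alpha=1.0 are illustrative assumptions.

```python
# Embedded selection: LASSO drives some coefficients to exactly zero during training.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X = StandardScaler().fit_transform(data.data)
y = data.target

lasso = Lasso(alpha=1.0).fit(X, y)   # alpha controls the strength of the L1 penalty
kept = np.array(data.feature_names)[lasso.coef_ != 0]
print("features kept by LASSO:", list(kept))
```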

 Examples of Machine learning problems


Machine learning algorithms are typically used in areas where the solution requires continuous
improvement post-deployment. Adaptable machine learning solutions are incredibly dynamic and
are adopted by companies across verticals.
1. Identifying Spam
Spam identification is one of the most basic applications of machine learning. Most of our email
inboxes also have an unsolicited, bulk, or spam inbox, where our email provider automatically
filters unwanted spam emails.
But how do they know that the email is spam?
They use a trained Machine Learning model to identify all the spam emails based on common
characteristics such as the email content, subject, and sender.
If you look at your email inbox carefully, you will realize that it is not very hard to pick out spam
emails because they look very different from real emails. Machine learning techniques used
nowadays can automatically filter these spam emails in a very successful way.
Spam detection is one of the best and most common problems solved by Machine Learning.
Neural networks employ content-based filtering to classify unwanted emails as spam. These
neural networks are loosely inspired by the brain, with the ability to identify spam emails and
messages.
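The notes mention neural networks for content-based filtering; as a simpler, hedged illustration of the same idea, the sketch below uses a bag-of-words representation with a Naive Bayes classifier, and the example emails and labels are made up.

```python
# A tiny, illustrative spam filter: bag-of-words features + a Naive Bayes classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win a free prize now", "lowest price meds", "meeting at 10 am", "please review the report"]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = not spam (made-up examples)

spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)

print(spam_filter.predict(["free prize meeting"]))   # predicts spam (1) or not spam (0)
```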
2. Making Product Recommendations
Recommender systems are one of the most characteristic and ubiquitous machine learning use
cases in day-to-day life. These systems are used everywhere by search engines, e-commerce
websites (Amazon), entertainment platforms (Google Play, Netflix), and multiple web & mobile
apps.
Prominent online retailers like Amazon and eBay often show a list of recommended products
individually for each of their consumers. These recommendations are typically based on
behavioral data and parameters such as previous purchases, item views, page views, clicks, form
fill-ins, item details (price, category), contextual data (location, language, device),
and browsing history.
These recommender systems allow businesses to drive more traffic, increase customer
engagement, reduce churn rate, deliver relevant content and boost profits. All such recommended
products are based on a machine learning model's analysis of customers' behavioral data. It is an
excellent way for online retailers to offer extra value and enjoy various upselling opportunities
using machine learning
3. Customer Segmentation
Customer segmentation, churn prediction and customer lifetime value (LTV) prediction are the
main challenges faced by any marketer. Businesses have a huge amount of marketing relevant
data from various sources such as email campaigns, website visitors and lead data.
Using data mining and machine learning, an accurate prediction for individual marketing offers
and incentives can be achieved. Using ML, savvy marketers can eliminate guesswork involved in
data-driven marketing.
For example, given the pattern of behavior of a user during a trial period and the past behaviors of
all users, the chances of conversion to the paid version can be predicted. A model of this
decision problem would allow a program to trigger customer interventions to persuade the
customer to convert early or better engage in the trial.
4. Image & Video Recognition
Advances in deep learning (a subset of machine learning) have stimulated rapid progress in image
& video recognition techniques over the past few years. They are used for multiple areas,
including object detection, face recognition, text detection, visual search, logo and landmark
detection, and image composition.
Since machines are good at processing images, Machine Learning algorithms can train Deep
Learning frameworks to recognize and classify images in the dataset with much more accuracy
than humans.
Similar to image recognition, companies such as Shutterstock, eBay, Salesforce, Amazon,
and Facebook use Machine Learning for video recognition where videos are broken down frame
by frame and classified as individual digital images.
5. Fraudulent Transactions
Fraudulent banking transactions are quite a common occurrence today. However, it is not feasible
(in terms of cost involved and efficiency) to investigate every transaction for fraud, translating to
a poor customer service experience.
Machine learning in finance can automatically build highly accurate predictive
models to identify and prioritize all kinds of possible fraudulent activities. Businesses can then
create a data-based queue and investigate the high priority incidents.
It allows you to deploy resources in an area where you will see the greatest return on your
investigative investment. Further, it also helps you optimize customer satisfaction by protecting
their accounts and not challenging valid transactions. Such fraud detection using machine
learning can help banks and financial organizations save money on disputes/chargebacks as one
can train Machine Learning models to flag transactions that appear fraudulent based on specific
characteristics.
6. Demand Forecasting
The concept of demand forecasting is used in multiple industries, from retail and e-commerce to
manufacturing and transportation. It feeds historical data to Machine Learning algorithms and
models to predict the number of products, services, power, and more.
It allows businesses to efficiently collect and process data from the entire supply chain, reducing
overheads and increasing efficiency.
ML-powered demand forecasting is very accurate, rapid, and transparent. Businesses can generate
meaningful insights from a constant stream of supply/demand data and adapt to changes
accordingly.
7. Virtual Personal Assistant
From Alexa and Google Assistant to Cortana and Siri, we have multiple virtual personal assistants
that respond to our voice instructions to find accurate information, call someone, open an
email, schedule an appointment, and more.
These virtual assistants use Machine Learning algorithms for recording our voice instructions,
sending them over the server to a cloud, followed by decoding those using Machine Learning
algorithms and acting accordingly.
8. Sentiment Analysis
Sentiment analysis is one of the beneficial and real-time machine learning applications that help
determine the emotion or opinion of the speaker or the writer.
For instance, if you've written a review, email, or any other form of a document, a sentiment
analyzer will be able to assess the actual thought and tone of the text. This sentiment analysis
application can be used to analyze decision-making applications, review-based websites, and
more.
9. Customer Service Automation
Managing an increasing number of online customer interactions has become a pain point for most
businesses. It is because they simply don't have the customer support staff available to deal with
the sheer number of inquiries they receive daily.
Machine learning algorithms have made it possible and super easy for chatbots and other similar
automated systems to fill this gap. This application of machine learning enables companies to
automate routine and low priority tasks, freeing up their employees to manage more high-level
customer service tasks.
Further, Machine Learning technology can access the data, interpret behaviors and recognize
patterns easily. This could also be used for customer support systems that can work identically to a
real human being and solve all of the customers' unique queries. The Machine Learning models
behind these voice assistants are trained on human languages and variations in the human voice
because they have to efficiently translate the voice to words and then make an on-topic and intelligent
response.
If implemented the right way, problems solved by machine learning can streamline the entire
process of customer issue resolution and offer much-needed assistance along with enhanced
customer satisfaction.
Unit No 02 Flavors of Machine Learning

 What is learning system?


A learning system is essentially a collection of artifacts that are 'brought together', in an
appropriate way, in order to create an environment that will facilitate various types
of learning process. Learning systems can take a variety of different forms - for example, a book,
a mobile phone, a computer, an online forum, a school and a university. Most learning systems will
provide various types of learning resource and descriptions of procedures for using these to
achieve particular learning outcomes. They will also embed various strategies for assessing the
levels and quality of the achievement of their users.
 Types of Machine learning

Figure 2.1 Types of Machine Learning


Figure 2.2 A Tree of Machine Learning

As with any method, there are different ways to train machine learning algorithms, each with their
own advantages and disadvantages. To understand the pros and cons of each type of machine
learning, we must first look at what kind of data they ingest. In ML, there are two kinds of data —
labeled data and unlabeled data.
Labeled data has both the input and output parameters in a completely machine-readable pattern, but
requires a lot of human labor to label the data, to begin with. Unlabeled data only has one or none of
the parameters in a machine-readable form. This negates the need for human labor but requires more
complex solutions. There are also some types of machine learning algorithms that are used in very
specific use-cases, but three main methods are used today.
1. Supervised Machine Learning
Supervised learning is one of the most basic types of machine learning. In this type, the machine
learning algorithm is trained on labeled data. Even though the data needs to be labeled accurately for
this method to work, supervised learning is extremely powerful when used in the right circumstances.
In supervised learning, the ML algorithm is given a small training dataset to work with. This training
dataset is a smaller part of the bigger dataset and serves to give the algorithm a basic idea of the
problem, solution, and data points to be dealt with. The training dataset is also very similar to the final
dataset in its characteristics and provides the algorithm with the labeled parameters required for the
problem.
The algorithm then finds relationships between the parameters given, essentially establishing a cause
and effect relationship between the variables in the dataset. At the end of the training,
the algorithm has an idea of how the data works and the relationship between the input and the output.
This solution is then deployed for use with the final dataset, which it learns from in the same way as
the training dataset. This means that supervised machine learning algorithms will continue to improve
even after being deployed, discovering new patterns and relationships as it trains itself on new data.
Supervised learning is commonly used in real world applications, such as face and speech
recognition, products or movie recommendations, and sales forecasting.
In supervised learning, learning data comes with description, labels, targets or desired outputs and
the objective is to find a general rule that maps inputs to outputs. This kind of learning data is
called labeled data. The learned rule is then used to label new data with unknown outputs.
Supervised learning involves building a machine learning model that is based on labeled samples.
For example, if we build a system to estimate the price of a plot of land or a house based on various
features, such as size, location, and so on, we first need to create a database and label it. We need to
teach the algorithm what features correspond to what prices. Based on this data, the algorithm will
learn how to calculate the price of real estate using the values of the input features.
Supervised learning deals with learning a function from available training data. Here, a learning
algorithm analyzes the training data and produces a derived function that can be used for mapping
new examples.
Supervised learning can be further classified into two types - Regression and Classification.
Regression trains on and predicts a continuous-valued response, for example predicting real estate
prices. When output Y is discrete valued, it is classification and when Y is continuous, then it is
Regression. Classification attempts to find the appropriate class label, such as analyzing
positive/negative sentiment, male and female persons, benign and malignant tumors, secure and
unsecure loans etc.
a. Regression
Regression algorithms are used if there is a relationship between the input variable and the output
variable. It is used for the prediction of continuous variables, such as Weather forecasting, Market
Trends, etc. Below are some popular Regression algorithms which come under supervised learning:
o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression
o Logistic Regression
b. Classification
Classification algorithms are used when the output variable is categorical, which means the output
belongs to a set of classes such as Yes-No, Male-Female, True-False, etc.
o Decision Trees
o Random Forest
o Support vector Machines
o Neural network
o Naïve Bayes
Common examples of supervised learning include classifying e-mails into spam and not-spam
categories, labeling web pages based on their content, and voice recognition.
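A brief sketch contrasting the two supervised flavors with scikit-learn; the diabetes and iris datasets and the specific models are illustrative choices, not the only options.

```python
# Supervised learning in both flavors: a regression model and a classification model.
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Regression: predict a continuous target.
X_r, y_r = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_r, y_r)
print("predicted continuous value:", reg.predict(X_r[:1]))

# Classification: predict a discrete class label.
X_c, y_c = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0).fit(X_c, y_c)
print("predicted class label:", clf.predict(X_c[:1]))
```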
2. Unsupervised Machine Learning
Unsupervised machine learning holds the advantage of being able to work with unlabeled data. This
means that human labor is not required to make the dataset machine-readable, allowing much larger
datasets to be worked on by the program.
In supervised learning, the labels allow the algorithm to find the exact nature of the relationship
between any two data points. However, unsupervised learning does not have labels to work off of,
resulting in the creation of hidden structures. Relationships between data points are perceived by the
algorithm in an abstract manner, with no input required from human beings.
The creation of these hidden structures is what makes unsupervised learning algorithms versatile.
Instead of a defined and set problem statement, unsupervised learning algorithms can adapt to the data
by dynamically changing hidden structures. This offers more post-deployment development than
supervised learning algorithms.
Unsupervised learning is used to detect anomalies, outliers, such as fraud or defective equipment, or
to group customers with similar behaviors for a sales campaign. It is the opposite of supervised
learning. There is no labeled data here.
When learning data contains only some indications without any description or labels, it is up to the
coder or to the algorithm to find the structure of the underlying data, to discover hidden patterns, or
to determine how to describe the data. This kind of learning data is called unlabeled data.
Suppose that we have a number of data points, and we want to classify them into several groups. We
may not exactly know what the criteria of classification would be. So, an unsupervised learning
algorithm tries to classify the given dataset into a certain number of groups in an optimum way.
Unsupervised learning algorithms are extremely powerful tools for analyzing data and for identifying
patterns and trends. They are most commonly used for clustering similar input into logical groups.

a. Clustering:
Clustering is a method of grouping objects into clusters such that objects with the most similarities
remain in the same group and have few or no similarities with the objects of other groups. Cluster
analysis finds the commonalities between the data objects and categorizes them according to the
presence and absence of those commonalities.
b. Association:
An association rule is an unsupervised learning method which is used for finding relationships
between variables in a large database. It determines the sets of items that occur together in the
dataset. Association rules make marketing strategy more effective: for example, people who buy
item X (say, bread) also tend to purchase item Y (butter or jam). A typical example of association
rule mining is Market Basket Analysis. Some popular unsupervised learning algorithms are listed below:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
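As an illustrative sketch (assuming scikit-learn is installed), the following groups made-up two-dimensional points into two clusters with K-means; no labels are supplied to the algorithm.

```python
# A minimal K-means clustering sketch with scikit-learn (assumed installed);
# the 2-D points below are made-up toy data.
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],   # one natural group of points
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.7]]   # another natural group of points

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)             # cluster index for every point (no labels were given)

print(labels)                  # e.g. [0 0 0 1 1 1] -- a grouping discovered from the data alone
print(kmeans.cluster_centers_) # coordinates of the two cluster centres
```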

Sr.No | Supervised Learning | Unsupervised Learning
1 | Supervised learning algorithms are trained using labeled data. | Unsupervised learning algorithms are trained using unlabeled data.
2 | The supervised learning model takes direct feedback to check whether it is predicting the correct output or not. | The unsupervised learning model does not take any feedback.
3 | The supervised learning model predicts the output. | The unsupervised learning model finds the hidden patterns in data.
4 | In supervised learning, input data is provided to the model along with the output. | In unsupervised learning, only input data is provided to the model.
5 | The goal of supervised learning is to train the model so that it can predict the output when it is given new data. | The goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
6 | Supervised learning needs supervision to train the model. | Unsupervised learning does not need any supervision to train the model.
7 | Supervised learning can be categorized into Classification and Regression problems. | Unsupervised learning can be classified into Clustering and Association problems.
8 | Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs. | Unsupervised learning can be used for cases where we have only input data and no corresponding output data.
9 | A supervised learning model produces an accurate result. | An unsupervised learning model may give a less accurate result compared to supervised learning.
10 | Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output. | Unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from his experiences.
11 | It includes various algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc. | It includes various algorithms such as Clustering, KNN, and the Apriori algorithm.
3. Semi supervised Machine Learning
The most basic disadvantage of any Supervised Learning algorithm is that the dataset has to be
hand-labeled either by a Machine Learning Engineer or a Data Scientist. This is a very costly
process, especially when dealing with large volumes of data. The most basic disadvantage of
any Unsupervised Learning is that its application spectrum is limited.
To counter these disadvantages, the concept of Semi-Supervised Learning was introduced.
It is partly supervised and partly unsupervised. If some learning samples are labeled but some others
are not, then it is semi-supervised learning. It makes use of a small amount of labeled data together
with a large amount of unlabeled data during training. Semi-supervised learning is applied in
cases where it is expensive to acquire a fully labeled dataset but practical to label a small
subset. For example, it often requires skilled experts to label certain remote sensing images, and lots
of field experiments to locate oil at a particular location, while acquiring unlabeled data is relatively
easy.

Intuitively, one may imagine the three types of learning algorithms as Supervised learning where a
student is under the supervision of a teacher at both home and school, Unsupervised learning
where a student has to figure out a concept himself and Semi-Supervised learning where a teacher
teaches a few concepts in class and gives questions as homework which are based on similar
concepts.
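The following is a minimal, illustrative self-training sketch of the semi-supervised idea (the data, confidence threshold and number of rounds are arbitrary choices, not from the source): train on the few labeled points, then add the model's most confident predictions on unlabeled points as "pseudo-labels" and retrain.

```python
# A minimal self-training sketch of semi-supervised learning (illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.2]])  # few labeled points
y_labeled = np.array([0, 0, 1, 1])
X_unlabeled = rng.normal(loc=1.5, scale=1.6, size=(50, 2))               # many unlabeled points

model = LogisticRegression().fit(X_labeled, y_labeled)

for _ in range(3):                                    # a few self-training rounds
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) > 0.9               # keep only confident predictions
    if not confident.any():
        break
    pseudo_X = X_unlabeled[confident]
    pseudo_y = proba[confident].argmax(axis=1)        # pseudo-labels for unlabeled data
    model = LogisticRegression().fit(
        np.vstack([X_labeled, pseudo_X]),
        np.concatenate([y_labeled, pseudo_y]),
    )

print(model.predict([[0.1, 0.2], [3.0, 3.0]]))        # typically [0 1]
```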
4. Reinforcement Machine Learning
Reinforcement learning directly takes inspiration from how human beings learn from data in their
lives. It features an algorithm that improves upon itself and learns from new situations using a trial-
and-error method. Favorable outputs are encouraged or "reinforced", and non-favorable outputs are
discouraged or "punished".
Based on the psychological concept of conditioning, reinforcement learning works by putting the
algorithm in a work environment with an interpreter and a reward system. In every iteration of the
algorithm, the output result is given to the interpreter, which decides whether the outcome is favorable
or not.
In case of the program finding the correct solution, the interpreter reinforces the solution by providing
a reward to the algorithm. If the outcome is not favorable, the algorithm is forced to reiterate until it
finds a better result. In most cases, the reward system is directly tied to the effectiveness of the result.

In typical reinforcement learning use-cases, such as finding the shortest route between two points on a
map, the solution is not an absolute value. Instead, it takes on a score of effectiveness, expressed in a
percentage value. The higher this percentage value is, the more reward is given to the algorithm.
Thus, the program is trained to give the best possible solution for the best possible reward.
Here learning data gives feedback so that the system adjusts to dynamic conditions in order to achieve
a certain objective. The system evaluates its performance based on the feedback responses and reacts
accordingly. The best-known instances include self-driving cars and the Go-playing program AlphaGo.
Reinforcement learning is also used in games where the outcome may be decided only at the end of the game.
There are two important learning models in reinforcement learning:
 Markov Decision Process
 Q learning
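A minimal tabular Q-learning sketch is given below; the tiny chain environment, reward scheme and hyperparameters are made up purely for illustration and are not part of the source material.

```python
# A tiny tabular Q-learning sketch (toy chain environment invented for illustration):
# the agent starts at state 0 and earns a reward of 1 only when it reaches state 4.
import random

random.seed(0)
n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
alpha, gamma, epsilon = 0.5, 0.9, 0.3

Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    """Toy environment dynamics: move along the chain; reward only at the right end."""
    next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward, next_state == n_states - 1

for episode in range(500):
    state = 0
    for _ in range(100):              # cap the episode length
        # epsilon-greedy: mostly exploit the current Q-table, sometimes explore
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = Q[state].index(max(Q[state]))
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state
        if done:
            break

print([q.index(max(q)) for q in Q])   # greedy action per state; typically 1 ("right") for states 0-3
```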

 Deep Learning
Deep learning is a branch of machine learning which is completely based on artificial neural
networks; since a neural network mimics the human brain, deep learning is also a kind of
mimic of the human brain. In deep learning, we don't need to explicitly program everything. A formal
definition of deep learning is as follows:
Deep learning is a particular kind of machine learning that achieves great power and flexibility by
learning to represent the world as a nested hierarchy of concepts, with each concept defined in
relation to simpler concepts, and more abstract representations computed in terms of less abstract
ones.
The human brain contains approximately 100 billion neurons, and each neuron is connected to
thousands of its neighbours.
The question here is how we recreate these neurons in a computer. So, we create an artificial
structure called an artificial neural net where we have nodes or neurons. We have some neurons
for input value and some for output value and in between, there may be lots of neurons
interconnected in the hidden layer.

Figure2.3Relationship between Deep learning, Machine Learning and Artificial Intelligence


Generally speaking, deep learning is a machine learning method that takes in an input X and uses it to
predict an output Y. As an example, given the stock prices of the past week as input, a deep
learning algorithm will try to predict the stock price of the next day. Given a large dataset of input and
output pairs, a deep learning algorithm will try to minimize the difference between its prediction and the
expected output. By doing this, it tries to learn the association/pattern between the given inputs and outputs,
which in turn allows the deep learning model to generalize to inputs that it hasn't seen before.

Figure 2.4 Different layers of Neural Network


A neural network is composed of input, hidden, and output layers, all of which are composed of
"nodes". Input layers take in a numerical representation of data (e.g. images as pixel values), output
layers output predictions, while hidden layers carry out most of the computation.

Information is passed between network layers through a weighted sum of the inputs followed by an
activation function. The major points to keep note of here are the tunable weight and bias parameters,
represented by w and b respectively. These are essential to the actual "learning" process of a deep
learning algorithm.
After the neural network passes its inputs all the way to its outputs, the network evaluates how good its
prediction was (relative to the expected output) through something called a loss function. As an
example, the "Mean Squared Error" loss function is

MSE = (1/n) Σ (Ŷ − Y)²

Ŷ (Y hat) represents the prediction, while Y represents the expected output. A mean is used when batches of
inputs and outputs are processed simultaneously (n represents the sample count).
The goal of the network is ultimately to minimize this loss by adjusting the weights and biases of the
network. Using something called "back propagation" through gradient descent, the network
backtracks through all its layers to update the weights and biases of every node in the opposite
direction of the gradient of the loss function; in other words, every iteration of back propagation should
result in a smaller loss than before. Without going into the proof, the continuous updates of the weights
and biases of the network ultimately turn it into a precise function approximator, one that models the
relationship between inputs and expected outputs. The "deep" part of deep learning refers to creating
deep neural networks, i.e. neural networks with a large number of layers; with the addition
of more weights and biases, the neural network improves its ability to approximate more complex
functions.
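The following NumPy sketch (illustrative only; the data and learning rate are made up, and only a single linear neuron is used) ties together the forward pass, the Mean Squared Error loss and the gradient-descent update described above.

```python
# A minimal NumPy sketch of the forward pass / loss / gradient-descent loop for a
# single linear neuron trained on made-up data (no deep-learning framework assumed).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # 100 samples, 2 input features
true_w, true_b = np.array([2.0, -3.0]), 0.5
Y = X @ true_w + true_b                       # target outputs the neuron should learn

w, b = np.zeros(2), 0.0                       # tunable weight and bias parameters
lr = 0.1                                      # learning rate

for epoch in range(200):
    Y_hat = X @ w + b                         # forward pass: weighted sum plus bias
    error = Y_hat - Y
    loss = np.mean(error ** 2)                # Mean Squared Error
    grad_w = 2 * X.T @ error / len(X)         # gradient of the loss w.r.t. the weights
    grad_b = 2 * error.mean()                 # gradient of the loss w.r.t. the bias
    w -= lr * grad_w                          # update opposite to the gradient direction
    b -= lr * grad_b

print(loss)                                   # close to 0 after training
print(w, b)                                   # close to the true parameters [2, -3] and 0.5
```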
 Deep Learning versus Machine Learning
Sr.No | Machine Learning | Deep Learning
1 | Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned. | Deep learning structures algorithms in layers to create an "artificial neural network" that can learn and make intelligent decisions on its own.
2 | Works on a small amount of data with reasonable accuracy. | Works on a large amount of data.
3 | Can run on low-end machines. | Heavily dependent on high-end machines.
4 | Divides the task into sub-tasks, solves them individually and finally combines the results. | Solves the problem end to end.
5 | Takes less time to train. | Takes a longer time to train.
6 | Trains on a CPU. | Trains on a GPU for proper training.
7 | Testing time may increase. | Takes less time to test the data.
8 | Machine learning is about computers being able to think and act with less human intervention. | Deep learning is about computers learning to think using structures modeled on the human brain.
9 | Machine learning requires less computing power but more ongoing human intervention. | Deep learning requires more computing power but typically needs less ongoing human intervention.
10 | Machine learning can't easily analyze images, videos, and unstructured data. | Deep learning can analyze images, videos, and unstructured data easily.
11 | Machine learning programs tend to be less complex than deep learning algorithms and can often run on conventional computers. | Deep learning systems require far more powerful hardware and resources.
12 | Machine learning systems can be set up and operated quickly but may be limited in the power of their results. | Deep learning systems take more time to set up but can generate results instantaneously (although the quality is likely to improve over time as more data becomes available).
13 | Machine learning tends to require structured data and uses traditional algorithms like linear regression. | Deep learning employs neural networks and is built to accommodate large volumes of unstructured data.
14 | Machine learning is already in use in your email inbox, bank, and doctor's office. | Deep learning technology enables more complex and autonomous programs, like self-driving cars or robots that perform advanced surgery.
15 | The output is in numerical form, for classification and scoring applications. | The output can be in any form, including free-form elements such as free text and sound.
16 | Limited hyperparameter tuning capability. | Can be tuned in various ways.
17 | Machine learning requires less data than deep learning to function properly. | Deep learning requires much more data than a traditional machine learning algorithm to function properly, due to its complex multilayer structure.

Figure 2.5 Differences between Machine Learning and Deep Learning


Unit No 03 Classification and Regression
 Classification
Classification is a task that requires the use of machine learning algorithms that learn how to assign a
class label to examples from the problem domain. An easy-to-understand example is classifying
emails as "spam" or "not spam".
In machine learning, classification, as the name suggests, classifies data into different
parts/classes/groups. It is used to predict the class to which the input data belongs.
For example, if we take a dataset of scores of a cricketer in the past few matches, along with
average, strike rate, not-outs etc., we can classify him as "in form" or "out of form".
Classification is the process of assigning new input variables (X) to the class they most likely belong
to, based on a classification model constructed from previously labeled training data.
Data with labels is used to train a classifier so that it can perform well on data without labels (not
yet labeled); this process of repeatedly classifying examples of previously known classes is what trains
the machine. Classification applies when the classes are discrete; when the target variable is
continuous, the task is regression rather than classification.

Figure3.1 Types of classification


 Binary Classification
Binary classification refers to predicting one of two classes and multi-class classification involves
predicting one of more than two classes.
Examples include:
 Email spam detection (spam or not).
 Churn prediction (churn or not).
 Conversion prediction (buy or not).
Typically, binary classification tasks involve one class that is the normal state and another class that is
the abnormal state. For example, "not spam" is the normal state and "spam" is the abnormal state.
Another example is "cancer not detected" as the normal state of a task that involves a medical test and
"cancer detected" as the abnormal state.
The class for the normal state is assigned the class label 0 and the class with the abnormal state is
assigned the class label 1. Popular algorithms that can be used for binary classification include:
 Logistic Regression
 k-Nearest Neighbors
 Decision Trees
 Support Vector Machine
 Naive Bayes
Terms related to binary classification
1. PRECISION
Precision in binary classification (Yes/No) refers to a model's ability to correctly interpret positive
observations. In other words, how often does a positive value forecast turn out to be correct? We may
manipulate this metric by only returning positive for the single observation in which we have the most
confidence.
2. RECALL
The recall is also known as sensitivity. In binary classification (Yes/No), recall is used to measure
how "sensitive" the classifier is to detecting positive cases. To put it another way, how many of the real
positive cases did we "catch" in our sample? We may manipulate this metric by simply classifying
every observation as positive.
3. F1 SCORE
The F1 score can be thought of as a weighted average of precision and recall, with the best value
being 1 and the worst being 0. Precision and recall also make an equal contribution to the F1 ranking.
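A short sketch of these three metrics computed from raw counts is shown below; the TP/FP/FN numbers are made up for illustration.

```python
# Precision, recall and the F1 score from raw counts (made-up TP/FP/FN values).
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)                          # of all predicted positives, how many were right
    recall = tp / (tp + fn)                             # of all actual positives, how many did we catch
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))         # (0.8, 0.666..., 0.727...)
```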
 Multiclass Classification
Multi-class classification is the task of classifying elements into one of several classes. Unlike binary
classification, it is not restricted to two classes.
Examples of multi-class classification are
 classification of news in different categories,
 classifying books according to the subject,
 Classifying students according to their streams etc.
In these, there are different classes for the response variable to be classified in and thus according to
the name, it is a Multi-class classification.
Can a classification possess both binary and multi-class?
Let us suppose we have to do sentiment analysis of a person: if the classes are just "positive" and
"negative", then it is a binary classification problem. But if the classes are "sadness", "happiness",
"disgust" and "depressed", then it is called a multi-class classification problem.
Figure 3.2 Figure showing Binary and Multiclass Classifications in Machine Learning

Table 3.1 Difference between Binary and Multiclass Classification in Machine Learning

Sr.No | Parameters | Binary classification | Multi-class classification
1 | No. of classes | It is a classification into two groups, i.e. it classifies objects into at most two classes. | There can be any number of classes, i.e. it classifies the object into more than two classes.
2 | Algorithms used | The most popular algorithms used for binary classification are Logistic Regression, k-Nearest Neighbors, Decision Trees, Support Vector Machine and Naive Bayes. | Popular algorithms that can be used for multi-class classification include k-Nearest Neighbors, Decision Trees, Naive Bayes, Random Forest and Gradient Boosting.
3 | Examples | Examples of binary classification include email spam detection (spam or not), churn prediction (churn or not) and conversion prediction (buy or not). | Examples of multi-class classification include face classification, plant species classification and optical character recognition.

 Assessing Classification performance


Many learning algorithms have been proposed. It is often valuable to assess the efficacy of an
algorithm. In many cases, such assessment is relative, that is, evaluating which of several alternative
algorithms is best suited to a specific application.
People even end up creating metrics that suit the application. In this section, we will see some of the
most common metrics used in a classification setting.
The most commonly used Performance metrics for classification problem are as follows,
 Accuracy
 Confusion Matrix
 Precision, Recall, and F1 score
 ROC AUC
 Log-loss

1. Accuracy
Accuracy is the simple ratio of the number of correctly classified points to the total number of points:
Accuracy = (number of correct predictions) / (total number of predictions)
Accuracy is simple to calculate but has its own disadvantages.
Limitations of accuracy
 If the data set is highly imbalanced and the model classifies all the data points as the majority
class, the accuracy will still be high. This makes accuracy an unreliable performance
metric for imbalanced data.
 Accuracy tells us nothing about the predicted probabilities of the model, so we cannot measure
how confident the model is in its predictions.
2. Confusion Matrix
Confusion Matrix is a summary of predicted results in specific table layout that allows visualization of
the performance measure of the machine learning model for a binary classification problem (2 classes)
or multi-class classification problem (more than 2 classes).

Confusion matrix of a binary classification


 TP means True Positive. It can be interpreted as the model predicted positive class and it is True
 FP means False Positive. It can be interpreted as the model predicted positive class but it is False
 FN means False Negative. It can be interpreted as the model predicted negative class but it is
False
 TN means True Negative. It can be interpreted as the model predicted negative class and it is True
 To get an appropriate example in a real-world problem, consider a diagnostic test that seeks to
determine whether a person has a certain disease. A false positive in this case occurs when the
person tests positive but does not actually have the disease. A false negative, on the other hand,
occurs when the person tests negative, suggesting they are healthy when they actually do have
the disease.
 For a multi-class classification problem with 'c' class labels, the confusion matrix will be a
(c × c) matrix.
Advantages of a confusion matrix:
 The confusion matrix provides detailed results of the classification.
 Derivates of the confusion matrix are widely used.
 Visual inspection of results can be enhanced by using a heat map.
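A small sketch using scikit-learn's confusion_matrix (library assumed installed; the labels and predictions are made up) is shown below.

```python
# A small confusion-matrix sketch with scikit-learn (assumed installed);
# y_true and y_pred are made-up values for a binary problem (1 = positive class).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Rows are the true classes, columns the predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))   # [[4 1]
                                          #  [1 4]] for the data above
```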
3. Precision, Recall and F1 score
Precision is the fraction of correctly classified positive instances out of all instances that were predicted
as positive, while recall is the fraction of correctly classified positive instances out of all instances that
are actually positive. Precision and recall are given as follows:
Precision = TP / (TP + FP)        Recall = TP / (TP + FN)
Precision helps us understand how useful the results are. Recall helps us understand how complete the
results are.
To balance precision and recall in a single number, the F1 score is used. The F1 score is the harmonic
mean of precision and recall. It is given as
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The F-score is often used in the field of information retrieval for measuring search, document
classification, and query classification performance.
The F-score has been widely used in the natural language processing literature, such as the evaluation
of named entity recognition and word segmentation.
4. Log Loss
Logarithmic loss (or log loss) measures the performance of a classification model where the prediction
is a probability value between 0 and 1. Log loss increases as the predicted probability diverges from the
actual label. Log loss is a widely used metric for Kaggle competitions.
Log Loss = −(1/N) Σ [ yᵢ log(pᵢ) + (1 − yᵢ) log(1 − pᵢ) ]
Here N is the total number of data points in the data set, yᵢ is the actual value of y, and pᵢ is the
predicted probability of yᵢ belonging to the positive class.
The lower the log-loss value, the better the predictions of the model.
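A quick sketch with scikit-learn's log_loss (library assumed installed; the probabilities are made-up predictions of the positive class) shows how confident, correct probabilities give a lower value.

```python
# A quick log-loss sketch with scikit-learn (assumed installed).
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]
p_good = [0.9, 0.1, 0.8, 0.2]    # confident and correct -> low log loss
p_bad  = [0.4, 0.6, 0.5, 0.7]    # hesitant / wrong      -> higher log loss

print(log_loss(y_true, p_good))  # about 0.16
print(log_loss(y_true, p_bad))   # about 0.93
```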
5. ROC AUC
A Receiver Operating Characteristic (ROC) curve is created by plotting the True Positive Rate (TPR)
against the False Positive Rate (FPR) at various classification threshold settings. Equivalently, the ROC
curve plots the cumulative distribution function of the true positives on the y-axis against the
cumulative distribution function of the false positives on the x-axis.
The area under the ROC curve (ROC AUC) is a single-valued metric used for evaluating the
performance.
The higher the AUC, the better the performance of the model in distinguishing between the classes.
In general, an AUC of 0.5 suggests no discrimination, a value between 0.5 and 0.7 is acceptable, and
anything above 0.7 indicates a good model. For medical diagnosis models, however, an AUC of 0.95 or
more is usually expected.
 ROC curves are widely used to compare and evaluate different classification algorithms.
 ROC curve is widely used when the dataset is imbalanced.
 ROC curves are also used in verification of forecasts in meteorology
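A small sketch with scikit-learn's roc_auc_score (library assumed installed; the scores are made-up predicted probabilities of the positive class):

```python
# A small ROC AUC sketch with scikit-learn (assumed installed).
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# AUC is the probability that a randomly chosen positive example is scored
# higher than a randomly chosen negative example.
print(roc_auc_score(y_true, y_score))   # about 0.89 for the data above
```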
 Class Probability Estimation
For Classification problems in machine learning we often want to know how likely the instance belongs
to the class rather than which class it will belong to. So in many cases we would like to use the
estimated class probability for decision making.
For a variety of applications, machine learning algorithms are required to construct models that
minimize the total loss associated with the decisions, rather than the number of errors. One of the
most efficient approaches to building models that are sensitive to non-uniform costs of errors is to
first estimate the class probabilities of the unseen instances and then to make the decision based on
both the computed probabilities and the loss function.
Example: Consider a scenario where we have to detect credit fraud. The manager of the fraud control
department wants to know not only which cases are likely to be fraudulent but also the cases where the
credit risk is highest, i.e. the accounts where the company's expected monetary loss is largest. Here, we
must know the class probability of fraud for each particular case.
Roughly, we would like
(i) The probability estimates to be well calibrated, meaning that if you take 100 cases whose class
membership probability is estimated to be 0.2, then about 20 of them will actually belong to the class.
(ii) The probability estimates to be discriminative, meaning that they should give different probability
estimates for different examples. If, say, every example received the class probability 0.5 simply because
50% of the population is fraudulent (the base rate), the estimates would not discriminate at all; we need
estimates that move above or below the base rate for individual cases.
 Regression
Regression analysis consists of a set of machine learning methods that allow us to predict a
continuous outcome variable (y) based on the value of one or multiple predictor variables (x).
Briefly, the goal of regression model is to build a mathematical equation that defines y as a function
of the x variables. Next, this equation can be used to predict the outcome (y) on the basis of new
values of the predictor variables (x).
Assessing performance of Regression
Model evaluation is very important in data science. It helps you to understand the performance of your
model and makes it easy to present your model to other people. There are many different evaluation
metrics out there but only some of them are suitable to be used for regression.
There are 3 main metrics for model evaluation in regression:
1. R Square/Adjusted R Square
2. Mean Square Error (MSE)/Root Mean Square Error (RMSE)
3. Mean Absolute Error (MAE)

1. R Square / Adjusted R Square

R Square measures how much of the variability in the dependent variable can be explained by the model.
It is the square of the correlation coefficient (R), which is why it is called R Square.

R² = 1 − (sum of squared prediction errors) / (total sum of squares, where the mean is used as the prediction)

The R Square value lies between 0 and 1, and a bigger value indicates a better fit between prediction and
actual value.
R Square is a good measure of how well the model fits the dependent variable. However, it does not
take the overfitting problem into consideration. If your regression model has many independent
variables, the model may be too complicated: it can fit the training data very well but perform badly on
testing data. That is why Adjusted R Square is used; it penalizes additional independent variables added
to the model and adjusts the metric to prevent overfitting.
2. Mean Square Error (MSE) / Root Mean Square Error (RMSE)
While R Square is a relative measure of how well the model fits the dependent variable, Mean Square
Error is an absolute measure of the goodness of fit.

MSE = (1/n) Σ (yᵢ − ŷᵢ)²

MSE is calculated as the sum of the squares of the prediction errors (actual output minus predicted
output) divided by the number of data points. It gives an absolute number indicating how far the
predicted results deviate from the actual values. You cannot interpret many insights from a single
value, but it gives a real number to compare against other model results and helps you select the best
regression model.
Root Mean Square Error (RMSE) is the square root of MSE. It is used more commonly than MSE
because, firstly, the MSE value can sometimes be too large to compare easily; secondly, since MSE is
computed from squared errors, taking the square root brings the metric back to the same scale as the
prediction error and makes it easier to interpret.
3. Mean Absolute Error (MAE)
Mean Absolute Error (MAE) is similar to Mean Square Error (MSE); however, instead of summing the
squares of the errors as MSE does, MAE sums the absolute values of the errors.

MAE = (1/n) Σ |yᵢ − ŷᵢ|

Compared to MSE or RMSE, MAE is a more direct representation of the error terms: MSE penalizes
large prediction errors more heavily by squaring them, while MAE treats all errors the same. R
Square/Adjusted R Square is better for explaining the model to other people, because the number can be
read as the percentage of output variability explained. MSE, RMSE, or MAE are better used to
compare performance between different regression models.
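A short sketch computing these regression metrics with scikit-learn (assumed installed; the actual and predicted values are made up):

```python
# Computing R^2, MSE, RMSE and MAE with scikit-learn (assumed installed).
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 5.4, 2.0, 7.5]

mse = mean_squared_error(y_true, y_pred)
print("R^2 :", r2_score(y_true, y_pred))
print("MSE :", mse)
print("RMSE:", mse ** 0.5)
print("MAE :", mean_absolute_error(y_true, y_pred))
```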
 Over fitting
A hypothesis h is said to overfit the training data if there exists another hypothesis h′
such that h′ has a larger error than h on the training data but a smaller error than h on the test data.
Overfitting refers to a model that models the training data too well. Overfitting happens when a model
learns the detail and noise in the training data to the extent that it negatively impacts the performance
of the model on new data. This means that the noise or random fluctuations in the training data are
picked up and learned as concepts by the model. The problem is that these concepts do not apply to
new data and negatively impact the model's ability to generalize.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when
learning a target function. As such, many nonparametric machine learning algorithms also include
parameters or techniques to limit and constrain how much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible and
is subject to overfitting training data. This problem can be addressed by pruning a tree after it has
learned in order to remove some of the detail it has picked up
A solution to avoid overfitting is using a linear algorithm if we have linear data or using the
parameters like the maximal depth if we are using decision trees.
In a nutshell, Overfitting – High variance and low bias
Techniques to reduce the overfitting
1. Increase training data.
2. Reduce model complexity
3. Early stopping during the training phase (have an eye over the loss over the training period as
soon as loss begins to increase stop training)
4. Ridge Regularization and Lasso Regularization
5. Use dropout for neural networks to tackle overfitting.
 Case study of Polynomial Regression
Polynomial regression is a special case of linear regression; the main idea lies in how you construct
your features. Looking at multivariate regression with two variables x1 and x2, linear regression
will look like this:
y = a1 * x1 + a2 * x2
Now suppose we want a polynomial regression (let's make a 2-degree polynomial). We create a
few additional features: x1*x2, x1^2 and x2^2. So we get the following "linear regression":
y = a1 * x1 + a2 * x2 + a3 * x1*x2 + a4 * x1^2 + a5 * x2^2
A polynomial term: a quadratic (squared) or cubic (cubed) term turns a linear regression model into a
curve. But because it is the data X that is squared or cubed, not the Beta coefficient, it still qualifies
as a linear model. This makes it a nice, straightforward way to model curves without having to model
complicated nonlinear models. One common pattern within machine learning is to use linear models
trained on nonlinear functions of the data. This approach maintains the generally fast performance of
linear methods, while allowing them to fit a much wider range of data.
For example, a simple linear regression can be extended by constructing polynomial features from the
coefficients. In the standard linear regression case, you might have a model that looks like this for
two-dimensional data:
ŷ(w, x) = w0 + w1*x1 + w2*x2
If we want to fit a paraboloid to the data instead of a plane, we can combine the features in second-
order polynomials, so that the model looks like this:
ŷ(w, x) = w0 + w1*x1 + w2*x2 + w3*x1*x2 + w4*x1^2 + w5*x2^2
The (sometimes surprising) observation is that this is still a linear model: to see this, imagine creating
a new variable
z = [x1, x2, x1*x2, x1^2, x2^2]
With this re-labeling of the data, our problem can be written
ŷ(w, z) = w0 + w1*z1 + w2*z2 + w3*z3 + w4*z4 + w5*z5
We see that the resulting polynomial regression is in the same class of linear models we'd considered
above (i.e. the model is linear in w) and can be solved by the same techniques.
By considering, linear fits within a higher-dimensional space built with these basis functions, the
model has the flexibility to fit a much broader range of data.
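A brief scikit-learn sketch of this idea (library assumed installed; the noisy quadratic data is made up) is shown below: polynomial feature expansion followed by ordinary linear regression.

```python
# Polynomial regression as "linear regression on expanded features" with scikit-learn
# (assumed installed); the data is a made-up noisy quadratic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.3, size=40)

# PolynomialFeatures builds [1, x, x^2]; LinearRegression then fits a model that is
# linear in the weights w but quadratic in the original input x.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)

print(model.predict([[2.0]]))   # close to 1 + 2*2 + 0.5*4 = 7
```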
 Theory of Generalization
Generalization refers to your model's ability to adapt properly to new, previously unseen data, drawn
from the same distribution as the one used to create the model. When we train a machine learning
model, we don't just want it to learn to model the training data. We want it to generalize to data it
hasn't seen before. Fortunately, there's a very convenient way to measure an algorithm's
generalization performance: we measure its performance on a held-out test set, consisting of examples
it hasn't seen before. If an algorithm works well on the training set but fails to generalize, we say it is
overfitting.
There's an easy way to measure a network's generalization performance. We simply partition our data
into three subsets:
A training set, a set of training examples the network is trained on.
A validation set, which is used to tune hyperparameters such as the number of hidden units, or the
learning rate
A test set, which is used to measure the generalization performance
The losses on these subsets are called the training, validation, and test loss, respectively. Hopefully it's
clear why we need separate training and test sets: if we train on the test data, we have no idea whether
the network is correctly generalizing or whether it's simply memorizing the training data.
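A minimal sketch of such a three-way split with scikit-learn's train_test_split (library assumed installed; the 60/20/20 ratio is an arbitrary choice):

```python
# A minimal train/validation/test split sketch with scikit-learn (assumed installed).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First carve out 20% as the test set, then split the rest into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20
```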
 Effective number of Hypothesis
A hypothesis is an explanation for something.
It is a provisional idea, an educated guess that requires some evaluation.
A good hypothesis is testable; it can be either true or false.
In science, a hypothesis must be falsifiable, meaning that there exists a test whose outcome could
mean that the hypothesis is not true. The hypothesis must also be framed before the outcome of the
test is known.
A good hypothesis fits the evidence and can be used to make predictions about new observations or
new situations. The hypothesis that best fits the evidence and can be used to make predictions is
called a theory, or is part of a theory.
Hypothesis in Machine Learning
An example of a model that approximates the target function and performs mappings of inputs to
outputs is called a hypothesis in machine learning. The choice of algorithm (e.g. neural network) and
the configuration of the algorithm (e.g. network topology and hyperparameters) define the space of
possible hypothesis that the model may represent.
A common notation is used where lowercase-h (h) represents a given specific hypothesis and
uppercase-h (H) represents the hypothesis space that is being searched.
 h (hypothesis): A single hypothesis, e.g. an instance or specific candidate model that maps inputs to
outputs and can be evaluated and used to make predictions.
 H (hypothesis set): A space of possible hypotheses for mapping inputs to outputs that can be searched,
often constrained by the choice of the framing of the problem, the choice of model and the choice of
model configuration
The choice of algorithm and algorithm configuration involves choosing a hypothesis space that is
believed to contain a hypothesis that is a good or best approximation for the target function.
A hypothesis in machine learning:
1. Covers the available evidence: the training dataset.
2. Is falsifiable (kind of): a test harness is devised beforehand and used to estimate performance and
compare it to a baseline model to see if it is skillful or not.
3. Can be used in new situations: make predictions on new data.
 Bounding the Growth Function
Growth Function
Rademacher complexity can be bounded in terms of the growth function. For any hypothesis h ∈ H
and a sample S = {x1, ..., xm} ⊆ X, we denote h|S = (h(x1), ..., h(xm)) ∈ Y^m.
Dichotomy: Given a hypothesis set H, a dichotomy of a set S is one of the possible ways of labeling
the points of S using a hypothesis in H.
Growth Function: For a hypothesis set H, the growth function ΠH : N → N is defined as

ΠH(m) = max over samples {x1, ..., xm} ⊆ X of |{ (h(x1), ..., h(xm)) : h ∈ H }|

The following is true for the growth function:
(a) It is the maximum number of distinct ways in which m points can be classified using hypotheses
in H.
(b) It is the maximum number of dichotomies for m points using hypotheses in H.
(c) It is a measure of richness of the hypothesis set H.
(d) It is a purely combinatorial measure, and unlike Rademacher complexity, it doesn‘t depend on the
unknown distribution D.
Bounding the Growth function
Let‘s define the quantity B(N,k) that counts the maximum number of possible combinations on N
points with k being a breakpoint (B(3,2) = 4 as an example)
B(N,k) corresponds to the number of rows in the following table:
Table of possible combinations for N points with a k break point

Let α be the count of rows in the S1 group. We also divide the group S2 into S2+ where xN is a ―+‖
and S2- where xN is ―-‖ and each of them have β rows. This means that:
B(N,k) = α + 2β : (0).
Our purpose in the following steps is to find a recursive bound on B(N,k) (a bound defined in terms of
B at different values of N and k).
For this purpose, we‘ll start by trying to estimate α + β which is the number of rows in the table
without point xN and the group S2-. The result is a sub-table where all rows are different since the
rows in S1 are inherently different without xN, and the rows in S2+ are different from the ones in S1
because if that were not the case, the duplicate version of that row in S1 would get its "uniqueness" from
xN, forcing it to leave S1 and join S2 (just as we saw in the simple case example).
Furthermore, since in the bigger table (N points) there are no k points that have all possible
combinations, it is impossible to find all possible combinations in the smaller table (N-1 points). This
implies that k is a break point for the smaller table too.
This gives us: α + β ≤ B(N-1, k) : (1).
The next step now is to find an estimation of β by studying the group S2 only and without the xN point:
Because the rows in S2+ are different from the ones in S2- only thanks to xN, when we remove xN,
S2+ becomes the same as S2-.

Consequently, we will only focus on S2+.


For this smaller version of the original table, suppose that k-1 were not a break point; then we could find
k-1 points that exhibit all possible combinations. If that were the case, adding xN back in both forms,
"-" and "+", would give a table containing all possible combinations on k points, which is
impossible since k is a break point. Therefore, we conclude that k-1 is in fact a break point for
S2+. And since B(N-1, k-1) is the maximum number of combinations on N-1 points with a break point
of k-1, we conclude that β ≤ B(N-1, k-1) : (2).
(1) + (2) results in:
B(N,k) = α + 2β ≤ B(N-1, k) + B(N-1, k-1) (*)
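Unrolling this recursion yields the standard polynomial bound on B(N,k), usually called the Sauer–Shelah lemma; the source does not derive it, but the result is stated below for completeness (using the VC dimension dVC introduced in the next subsection):

```latex
B(N,k) \;\le\; \sum_{i=0}^{k-1} \binom{N}{i},
\qquad\text{and hence}\qquad
\Pi_H(m) \;\le\; \sum_{i=0}^{d_{VC}(H)} \binom{m}{i} \;=\; O\!\left(m^{\,d_{VC}(H)}\right).
```

In other words, as soon as a break point exists, the growth function is bounded by a polynomial in the number of points rather than growing like 2^m.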
 VC Dimension
We now derive an upper bound for the growth function ΠH (m), for all m ∈ N. For proving the
polynomial bound, we define a new combinatorial quantity, the VC dimension. The VC (Vapnik-
Chervonenkis) dimension is a single parameter that characterizes the growth function:
The VC dimension of a hypothesis set H, denoted by dVC(H), is the largest value of m for which
ΠH(m) = 2^m. If ΠH(m) = 2^m for all m, then dVC(H) = ∞.
To illustrate this definition, we will now take a second look at the examples for the growth function to
learn their VC-dimension. To find a lower bound we have to simply find a set S that can be shattered
by H. To give an upper bound, we need to prove that no set S of d + 1 points exists, that can be
shattered by H, which is usually more difficult.
The growth function for positive rays is ΠH(m) = m + 1. Only for m = 0, 1 do we have ΠH(m) = 2^m,
therefore dVC(H) = 1.
The growth function for positive intervals is ΠH(m) = (1/2)m^2 + (1/2)m + 1. We have ΠH(m) = 2^m
for m = 0, 1, 2, which yields dVC(H) = 2.
We have seen that by arranging convex sets in the right way, sets of every size can be shattered.
Therefore ΠH(m) = 2^m for all m and dVC(H) = ∞.
 Regularization Theory
Regularization is one of the basic and most important concepts in the world of Machine Learning.
Regularize means to make things regular or acceptable. Regularizations are techniques used to reduce
the error by fitting a function appropriately on the given training set and avoid overfitting.
Consider a training dataset comprising independent variables X (x1, x2, ..., xn) and the
corresponding target variables t (t1, t2, ..., tn). The values of X are drawn uniformly from [0, 1].
The target dataset t is obtained by substituting the values of X into the function sin(2πx) and then
adding some Gaussian noise.
Now, our goal is to find patterns in this underlying dataset and generalize from it to predict the
corresponding target value for some new values of x. The problem here is that our target dataset is
afflicted with some random noise, so it will be difficult to recover the underlying function sin(2πx)
from the training data. How do we solve this?
Let's try fitting a polynomial on the given data.

It should be noted that the given polynomial function is a non-linear function of x but a linear
function of w. We train our data on this function to determine the values of w that make the
function minimize the error in predicting the target values.
The error function used in this case is the mean squared error.
In order to minimize the error, calculus is used: the derivative of E(w) is equated with 0 to get the
value of w which results in the minimum value of the error function. E(w) is quadratic in w, so its
derivative is linear in w and hence yields only a single value of w. Let that be
denoted by w*.
So now we can get the correct value of w, but the issue is what degree of polynomial to choose.
Polynomials of any degree can be fitted to the training data, but how do we decide the best choice with
minimum complexity?
Furthermore, if we see the Taylor expansion of sine series then any higher order polynomial can be
used to determine the correct value of the function.
The function has trained itself to get the correct target values for all the noise induced data points and
thus has failed to predict the correct pattern. This function may give zero error for training set but will
give huge errors in predicting the correct target values for test dataset.
To avoid this condition, regularization is used. Regularization is a technique for tuning the
function by adding an additional penalty term to the error function. The additional term controls the
excessively fluctuating function so that the coefficients do not take extreme values. This technique of
keeping a check on or reducing the magnitude of the coefficients is called a shrinkage method, or weight
decay in the case of neural networks.
Overfitting can also be controlled by increasing the size of training dataset.
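As a rough illustration of the discussion above (scikit-learn assumed installed; the sample size, polynomial degree and penalty strength are arbitrary choices), the sketch below fits a degree-9 polynomial to noisy sin(2πx) data with and without an L2 (ridge) penalty.

```python
# Fitting noisy sin(2*pi*x) data with a high-degree polynomial, with and without an
# L2 (ridge) penalty, using scikit-learn (assumed installed); all settings are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=(10, 1))                        # few points -> easy to overfit
t_train = np.sin(2 * np.pi * x_train[:, 0]) + rng.normal(scale=0.2, size=10)
x_test = np.linspace(0, 1, 100).reshape(-1, 1)
t_test = np.sin(2 * np.pi * x_test[:, 0])

degree = 9
plain = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_train, t_train)
ridge = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-3)).fit(x_train, t_train)

# The penalized model usually tracks sin(2*pi*x) far better on unseen points.
print("unregularized test MSE:", np.mean((plain.predict(x_test) - t_test) ** 2))
print("ridge test MSE:        ", np.mean((ridge.predict(x_test) - t_test) ** 2))
```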
Unit No 04 Neural Network
 Introduction to Neural Network
The Neuron is the basic unit of neural network.
A neuron takes inputs, does some math with them, and produces one output. Here‘s what a 2-input
neuron looks like:

Figure4.1 Single Neuron


3 things are happening here. First, each input is multiplied by a weight:

Next, all the weighted inputs are added together with a bias b

Finally, the sum is passed through an activation function:

The activation function is used to turn an unbounded input into an output that has a nice, predictable
form. A commonly used activation function is the sigmoid function.

Figure 4.2 Activation Function: Sigmoid function


The sigmoid function only outputs numbers in the range (0, 1). You can think of it as compressing
(−∞,+∞) to (0,1) — big negative numbers become ~0, and big positive numbers become ~1.
Assume we have a 2-input neuron that uses the sigmoid activation function and has the following
parameters:
w = [0, 1]
b = 4
w = [0, 1] is just a way of writing w1 = 0, w2 = 1 in vector form. Now, let's give the neuron an input of
x = [2, 3]. We'll use the dot product to write things more concisely:

The neuron outputs 0.999 given the inputs x = [2, 3]. That's it! This process of passing inputs forward
to get an output is known as feed forward.
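The worked example above can be reproduced with a few lines of NumPy (assumed installed):

```python
# Reproducing the worked example: a 2-input neuron with w = [0, 1], b = 4 and a
# sigmoid activation, applied to the input x = [2, 3].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.0, 1.0])
b = 4.0
x = np.array([2.0, 3.0])

z = np.dot(w, x) + b        # weighted sum plus bias: 0*2 + 1*3 + 4 = 7
print(sigmoid(z))           # about 0.999 -- the feed forward output
```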
Combining Neurons into a Neural Network
A neural network is nothing more than a bunch of neurons connected together. Here‘s what a simple
neural network might look like:

Figure 4.3 Simple Neural Network


This network has 2 inputs, a hidden layer with 2 neurons (h1 and h2), and an output layer with 1
neuron (o1). Notice that the inputs for o1 are the outputs from h1 and h2 — that‘s what makes this a
network.
A hidden layer is any layer between the input (first) layer and output (last) layer. There can be multiple
hidden layers!
A neural network can have any number of layers with any number of neurons in those layers. The basic
idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at
the end.
 Components of Neural network
1. Neuron
The building block of a neural network is the single neuron.
The input to the neuron is x, which has a weight w associated with it. The weight is the intrinsic
parameter, the parameter the model has control over in order to get a better fit for the output. When we
pass an input into a neuron, we multiply it by its weight, giving us x * w. The second element of the
input is called the bias. The bias is determined solely by the value b, since the value of its input node is
fixed at 1. The bias adds a constant offset to our model, which helps it generalize and gives our
model the flexibility to adapt to different unseen inputs when using testing data. The combination of
the bias and input produces our output y, giving us the formula w*x + b = y. This should look familiar
as a modification of the equation of a straight line, y = mx + c. Neural networks are made up of tens,
hundreds or even many thousands of interconnected neurons, each of which runs its own regression.
It's essentially regression on steroids.
2. Multiple Inputs
We will expect to see many more inputs that are combined to estimate the output. This is achieved in a
similar way as the neuron with one input.
The formula for the above equation will read x0 * w0 + x1 * w1 + x2 * w2 + b = y.
3. Layers
Neural networks organize neurons into layers. A layer in which every neuron is connected to every
other neuron in its next layer is called a dense layer. Through this increasing complexity, neural
networks are able to transform data and infer relationships in a variety of complex ways. As we add
more layers and nodes to our network, this complexity increases.
4. Activation Layer
Currently our model is only good for predicting linear relationships in our data. There's no benefit to
running this neural network as opposed to a series of regressions. Neural networks provide a solution
to this in two ways. The first is the ability to add more layers to our network between the input and
output, known as hidden layers. Each of these hidden layers will have a predefined number of nodes,
and this added complexity starts to separate the neural network from its regression counterpart. The
second way that neural networks add complexity is through the introduction of
an activation function at every node that isn't an input or output. An activation function is a function
that transforms our input data using a non-linear method. Sigmoid and ReLU are the most commonly
used activation functions. Without an activation function, neural networks can only learn linear
relationships. Fitting an object as simple as an x² curve would not be possible without the
introduction of an activation function. So the role of a neuron in a hidden layer is to take the sum of the
products of the inputs and their weights and pass this value through an activation function. This value
is then passed as the input to the next neuron, be it another hidden neuron or the output.

5. Optimizing weights
When a Neural Network is initialized, its weights are randomly assigned. The power of the neural
network comes from its access to a huge amount of control over the data, through the adjusting of these
weights. The network iteratively adjusts weights and measures performance, continuing this procedure
until the predictions are sufficiently accurate or another stopping criterion is reached.
The accuracy of our predictions is determined by a loss function. Also known as a cost function, this
function will compare the model output with the actual outputs and determine how bad our model is in
estimating our dataset. Essentially we provide the model a function that it aims to minimize and it does
this through the incremental tweaking of weights.
A common metric for a loss function is Mean Absolute Error, MAE. This measures the sum of the
absolute vertical differences between the estimates and their actual values.
The job of finding the best set of weights is conducted by the optimiser. In neural networks, the
optimization method used is stochastic gradient descent.
Every time period, or epoch, the stochastic gradient descent algorithm will repeat a certain set of steps
in order to find the best weights.
1. Start with some initial value for the weights
2. Keep updating weights that we know will reduce the cost function
3. Stop when we have reached the minimum error on our dataset
6. Overfitting and underfitting
Overfitting and Underfitting are two of the most important concepts of machine learning, because they
can help give you an idea of whether your ML algorithm is capable of its true purpose, being unleashed
to the world and encountering new unseen data.
Mathematically, overfitting is defined as the situation where the accuracy on your training data is
greater than the accuracy on your testing data. Underfitting is generally defined as poor performance on
both the training and testing side.

Figure4.4 An Example of Neural Network


A typical neural network looks like the one shown in Figure 4.4.
 Basic Perceptron
A perceptron is a single-layer neural network, while a multi-layer perceptron is called a neural network.
The perceptron is a linear (binary) classifier used in supervised learning; it helps to classify the
given input data.
The Perceptron consists of 4 parts.
1. Input values or One input layer
2. Weights and Bias
3. Net sum
4. Activation Function
The Neural Networks work the same way as the Perceptron. So, if you want to know how neural
network works, learn how Perceptron works.

Figure 4.5 Perceptron


The Perceptron works on these simple steps
a. All the inputs x are multiplied with their weights w. Let‘s call it k.

Figure 4.6 Multiplying inputs with weights for 5 inputs

b. Add all the multiplied values and call them Weighted Sum.
Figure 4.7 Adding with Summation

c. Apply that weighted sum to the correct Activation Function.


For Example: Unit Step Activation Function.

Figure 4.8 Unit step activation function


Weights show the strength of the particular node.
A bias value allows you to shift the activation function curve up or down.

Figure 4.9 Mapping of Activation function


In short, the activation functions are used to map the input between the required values like (0, 1) or (-1, 1).
Perceptron is usually used to classify the data into two parts. Therefore, it is also known as a Linear
Binary Classifier.
Figure4.10 Linear Binary Classifier
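A minimal sketch of these perceptron steps is given below (the weights, bias and inputs are made-up numbers, and NumPy is assumed to be installed):

```python
# A minimal perceptron sketch following the steps above: weighted sum plus bias,
# then a unit step activation.
import numpy as np

def unit_step(z):
    return 1 if z >= 0 else 0          # maps the weighted sum to one of two classes

def perceptron(x, w, b):
    k = np.dot(w, x)                   # step a: multiply inputs by weights and sum them
    weighted_sum = k + b               # step b: add the bias
    return unit_step(weighted_sum)     # step c: apply the activation function

w = np.array([0.5, -0.6])
b = -0.1
print(perceptron(np.array([1.0, 0.2]), w, b))   # 1
print(perceptron(np.array([0.1, 1.0]), w, b))   # 0
```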
 Feed Forward Neural network
A feed forward neural network is a biologically inspired classification algorithm. It consists of a
(possibly large) number of simple neuron-like processing units, organized in layers. Every unit in a
layer is connected with all the units in the previous layer. These connections are not all equal: each
connection may have a different strength or weight. The weights on these connections encode the
knowledge of a network. Often the units in a neural network are also called nodes.
Data enters at the inputs and passes through the network, layer by layer, until it arrives at the outputs.
During normal operation, that is when it acts as a classifier, there is no feedback between layers. This
is why they are called feed forward neural networks.
In the following figure we see an example of a 2-layered network with, from top to bottom, an output
layer with 5 units and a hidden layer with 4 units. The network has 3 input units.

Figure4.11 Feed Forward Neural Network


The 3 inputs are shown as circles and these do not belong to any layer of the network (although the
inputs sometimes are considered as a virtual layer with layer number 0). Any layer that is not an
output layer is a hidden layer. This network therefore has 1 hidden layer and 1 output layer. The
figure also shows all the connections between the units in different layers. A layer only connects to
the previous layer.
The operation of this network can be divided into two phases
1. Learning Phase
2. Classification Phase
 Back Propagation Algorithm
Back propagation is a supervised learning algorithm, for training Multi-layer Perceptron (Artificial
Neural Networks).
While designing a neural network, in the beginning we initialize the weights with some random values
or any variable for that fact. It is not necessary that whatever weight values we have selected will be
correct, or that they fit our model best. Okay, fine, we have selected some weight values in the
beginning, but our model output is way different from our actual output, i.e. the error value is huge.
Now, how will you reduce the error?
Basically, what we need to do is somehow explain to the model that it must change the parameters
(weights) such that the error becomes minimum. Let's put it another way: we need to train our model.
One way to train our model is called back propagation. Consider the diagram below:

Figure 4.12 Back Propagation Method


Summarizing the steps:
 Calculate the error – How far is your model output from the actual output.
 Minimum Error – Check whether the error is minimized or not.
 Update the parameters – If the error is huge then, update the parameters (weights and biases).
After that again check the error. Repeat the process until the error becomes minimum.
 Model is ready to make a prediction – Once the error becomes minimum, you can feed some
inputs to your model and it will produce the output.
The Back propagation algorithm looks for the minimum value of the error function in weight
space using a technique called the delta rule or gradient descent. The weights that minimize the
error function are then considered to be a solution to the learning problem. Steps are
 We first initialize some random value for W and propagate forward.
 Then, we notice that there is some error. To reduce that error, we propagate backwards and
increase the value of W.
 After that, we notice that the error has increased, so we come to know that we can't keep
increasing the W value.
 So, we again propagate backwards and decrease the W value.
 Now, we notice that the error has reduced.
So, we are trying to get the value of weight such that the error becomes minimum. Basically, we
need to figure out whether we need to increase or decrease the weight value. Once we know that,
we keep on updating the weight value in that direction until the error becomes minimum. You might
reach a point where, if you further update the weight, the error will increase. At that time you
need to stop, and that is your final weight value. Consider the graph below:

Figure 4.13 Graph showing Global loss


We need to reach the 'Global Loss Minimum'.
This is nothing but Back propagation.
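To make the weight-update loop concrete, here is a minimal NumPy sketch of back propagation with gradient descent on a tiny 2-4-1 network. The XOR toy data, the layer sizes, the sigmoid activation and the learning rate are illustrative assumptions, not taken from the text above.

import numpy as np

rng = np.random.default_rng(0)

# Toy data: the XOR problem (4 samples, 2 inputs, 1 output)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Initialize the weights W with random values, as described above
W1 = rng.normal(size=(2, 4)); b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1)); b2 = np.zeros((1, 1))
lr = 0.5                                   # learning rate (arbitrary choice)

for epoch in range(5000):
    # Forward pass: data flows input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Calculate the error (mean squared error)
    error = np.mean((out - y) ** 2)

    # Backward pass: propagate the error and compute gradients (delta rule)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Update the parameters in the direction that reduces the error
    W2 -= lr * (h.T @ d_out); b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h);   b1 -= lr * d_h.sum(axis=0, keepdims=True)

print("final error:", error)   # should be close to the global loss minimum

Each iteration calculates the error, propagates it backwards, and nudges the weights in the direction that reduces it, which is exactly the loop summarized above.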
 Artificial Neural Network(ANN)
An artificial neural network (ANN) is the piece of a computing system designed to simulate the way
the human brain analyzes and processes information. It is the foundation of artificial intelligence (AI)
and solves problems that would prove impossible or difficult by human or statistical standards. ANNs
have self-learning capabilities that enable them to produce better results as more data becomes
available.
Understanding an Artificial Neural Network (ANN)
Artificial neural networks are built like the human brain, with neuron nodes interconnected like a web.
The human brain has hundreds of billions of cells called neurons. Each neuron has a cell body that is responsible for processing information, carrying signals towards it (inputs) and away from it (outputs).
An ANN has hundreds or thousands of artificial neurons called processing units, which are
interconnected by nodes. These processing units are made up of input and output units. The input
units receive various forms and structures of information based on an internal weighting system, and
the neural network attempts to learn about the information presented to produce one output report.
Just like humans need rules and guidelines to come up with a result or output, ANNs also use a set of
learning rules called backpropagation, an abbreviation for backward propagation of error, to perfect
their output results.
An ANN initially goes through a training phase where it learns to recognize patterns in data, whether
visually, aurally, or textually. During this supervised phase, the network compares its actual output
produced with what it was meant to produce—the desired output. The difference between both
outcomes is adjusted using backpropagation. This means that the network works backward, going
from the output unit to the input units to adjust the weight of its connections between the units until
the difference between the actual and desired outcome produces the lowest possible error.
During the training and supervisory stage, the ANN is taught what to look for and what its output
should be, using yes/no question types with binary numbers. For example, a bank that wants to detect
credit card fraud on time may have four input units fed with these questions: (1) Is the transaction in a different country from the user's resident country? (2) Is the website the card is being used at affiliated with companies or countries on the bank's watch list? (3) Is the transaction amount larger than $2,000? (4) Is the name on the transaction bill the same as the name of the cardholder?
The bank wants the "fraud detected" responses to be Yes Yes Yes No, which in binary format would be 1 1 1 0. If the network's actual output is 1 0 1 0, it adjusts its results until it delivers an output that
coincides with 1 1 1 0. After training, the computer system can alert the bank of pending fraudulent
transactions, saving the bank lots of money.
Advantages of artificial neural networks include:
 Parallel processing abilities mean the network can perform more than one job at a time.
 Information is stored on an entire network, not just a database.
 The ability to learn and model nonlinear, complex relationships helps model the real-life
relationships between input and output.
 Fault tolerance means the corruption of one or more cells of the ANN will not stop the generation
of output.
 Gradual corruption means the network will slowly degrade over time, instead of a problem
destroying the network instantly.
 The ability to produce output even with incomplete knowledge, with the loss of performance depending on how important the missing information is.
 No restrictions are placed on the input variables, such as how they should be distributed.
 Machine learning means the ANN can learn from events and make decisions based on the
observations.
 The ability to learn hidden relationships in the data without commanding any fixed relationship
means an ANN can better model highly volatile data and non-constant variance.
 The ability to generalize and infer unseen relationships on unseen data means ANNs can predict
the output of unseen data.
The disadvantages of ANNs include:
 The lack of rules for determining the proper network structure means the appropriate artificial
neural network architecture can only be found through trial and error and experience.
 The requirement of processors with parallel processing abilities makes neural networks hardware-
dependent.
 The network works with numerical information, therefore all problems must be translated into
numerical values before they can be presented to the ANN.
 The lack of explanation behind the solutions an ANN produces is one of its biggest disadvantages. The inability to explain the why or how behind a solution generates a lack of trust in the network.
Applications of Artificial Neural Networks
Image recognition was one of the first areas to which neural networks were successfully applied, but
the technology uses have expanded to many more areas, including:
 Chatbots
 Natural language processing, translation and language generation
 Stock market prediction
 Delivery driver route planning and optimization
 Drug discovery and development
Unit no 05 Machine Learning Models
 Linear Models
Linear models are relatively simple. In this case, the function is represented as a linear combination of
its inputs. Thus, if x1 and x2 are two scalars or vectors of the same dimension and a and b are arbitrary
scalars, then ax1 + bx2 represents a linear combination of x1 and x2. In the simplest case where f(x)
represents a straight line, we have an equation of the form f (x) = mx + c where c represents the
intercept and m represents the slope.
Linear models are parametric, which means that they have a fixed form with a small number of
numeric parameters that need to be learned from data. For example, in f (x) = mx + c, m and c are the
parameters that we are trying to learn from the data. This technique is different from tree or rule
models, where the structure of the model (e.g., which features to use in the tree, and where) is not
fixed in advance.
Linear models are stable, i.e., small variations in the training data have only a limited impact on the
learned model. In contrast, tree models tend to vary more with the training data, as the choice of a
different split at the root of the tree typically means that the rest of the tree is different as well. As a
result of having relatively few parameters, linear models have low variance and high bias. This
implies that linear models are less likely to overfit the training data than some other models. However,
they are more likely to underfit. For example, if we want to learn the boundaries between countries
based on labelled data, then linear models are not likely to give a good approximation.
1. Least Square Method
The "least squares" method is a form of mathematical regression analysis used to determine the line of
best fit for a set of data, providing a visual demonstration of the relationship between the data points.
Each point of data represents the relationship between a known independent variable and an unknown
dependent variable.
The least squares method provides the overall rationale for the placement of the line of best fit among the data points being studied. The most common application of this method, sometimes referred to as "linear" or "ordinary" least squares, aims to create a straight line that minimizes the sum of the squared errors, i.e. the squared residuals between each observed value and the value predicted by the model.
This method of regression analysis begins with a set of data points to be plotted on an x- and y-axis
graph. An analyst using the least squares method will generate a line of best fit that explains the
potential relationship between independent and dependent variables.
In regression analysis, dependent variables are illustrated on the vertical y-axis, while independent
variables are illustrated on the horizontal x-axis. These designations will form the equation for the line
of best fit, which is determined from the least squares method.
In contrast to a linear problem, a non-linear least squares problem has no closed solution and is
generally solved by iteration.
An example of the least squares method is an analyst who wishes to test the relationship between a company's stock returns and the returns of the index of which the stock is a component. In this
example, the analyst seeks to test the dependence of the stock returns on the index returns. To achieve
this, all of the returns are plotted on a chart. The index returns are then designated as the independent
variable, and the stock returns are the dependent variable. The line of best fit provides the analyst with
coefficients explaining the level of dependence.
The Line of Best Fit Equation
The line of best fit determined from the least squares method has an equation that tells the story of the
relationship between the data points. Line of best fit equations may be determined by computer
software models, which include a summary of outputs for analysis, where the coefficients and
summary outputs explain the dependence of the variables being tested.
Least Squares Regression Line
If the data shows a linear relationship between two variables, the line that best fits this linear relationship is known as the least squares regression line, which minimizes the vertical distances from the data points to the regression line. The term "least squares" is used because this line yields the smallest possible sum of squared errors.
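As a small illustration, the slope m and intercept c of the least squares line y = mx + c can be computed in closed form; the x and y values below are invented purely for demonstration.

import numpy as np

# Hypothetical data: e.g. index returns (x) vs. stock returns (y)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([0.7, 1.1, 1.8, 1.9, 2.6, 3.2])

# Closed-form least squares estimates of slope m and intercept c
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
c = y.mean() - m * x.mean()

residuals = y - (m * x + c)          # observed minus predicted values
print("slope:", m, "intercept:", c)
print("sum of squared errors:", (residuals ** 2).sum())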
2. Multivariate regression Method
a) Regression Analysis
Regression analysis is one of the most sought out methods used in data analysis. It follows a
supervised machine learning algorithm. Regression analysis is an important statistical method
that allows us to examine the relationship between two or more variables in the dataset.
Regression analysis is a way of mathematically sorting out which variables actually have an impact on the outcome.
Simple linear regression is a regression model that estimates the relationship between a dependent
variable and an independent variable using a straight line. On the other hand, multiple linear
regression estimates the relationship between two or more independent variables and one
dependent variable. The difference between these two models is the number of independent
variables.
Regression analysis is mainly used to understand the relationship between a dependent and an independent variable. In the real world, there are many situations in which several variables influence the outcome simultaneously, so a simple regression model that can work with only one independent variable is not sufficient. With these limitations in mind, we need a better model that fills the gaps of Simple and Multiple Linear Regression, and that model is Multivariate Regression.
b) Multivariate Regression
Multivariate Regression is a supervised machine learning algorithm involving multiple data
variables for analysis. A Multivariate regression is an extension of multiple regression with one
dependent variable and multiple independent variables. Based on the number of independent
variables, we try to predict the output.
Multivariate regression tries to find out a formula that can explain how factors in variables
respond simultaneously to changes in others.
Example
An agricultural scientist wants to predict the total crop yield expected for the summer. He collects details of the expected amount of rainfall, the fertilizers to be used, and the soil conditions. By building a multivariate regression model, the scientist can predict the crop yield. Along with the crop yield, the scientist also tries to understand the relationships among the variables.
The steps involved in multivariate regression analysis are given below (a small code sketch follows the list):
1) Feature selection-
The selection of features is an important step in multivariate regression. Feature selection is also known as variable selection. It is important to pick significant variables for better model building.
2) Normalizing Features-
We need to scale the features as it maintains general distribution and ratios in data. This will lead
to an efficient analysis. The value of each feature can also be changed.
3) Select Loss function and Hypothesis-
The loss function measures the error, that is, how far the hypothesis prediction deviates from the actual values. Here, the hypothesis is the value predicted from the features/variables.
4) Set Hypothesis Parameters-
The hypothesis parameter needs to be set in such a way that it reduces the loss function and
predicts well.
5) Minimize the Loss Function-
The loss function needs to be minimized by using a loss minimization algorithm on the dataset,
which will help in adjusting hypothesis parameters. After the loss is minimized, it can be used for
further action. Gradient descent is one of the algorithms commonly used for loss minimization.
6) Test the hypothesis function-
The hypothesis function needs to be checked on as well, as it is predicting values. Once this is
done, it has to be tested on test data.
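The steps above can be sketched in a few lines of NumPy; the synthetic "crop-yield" style data, the learning rate and the iteration count below are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(1)
# Step 1: three hypothetical features, e.g. rainfall, fertilizer, soil quality
X = rng.uniform(size=(100, 3))
y = 4 * X[:, 0] + 2 * X[:, 1] + 1 * X[:, 2] + rng.normal(0, 0.1, 100)

# Step 2: normalize the features
X = (X - X.mean(axis=0)) / X.std(axis=0)
X = np.c_[np.ones(len(X)), X]              # add an intercept column

# Steps 3-5: hypothesis X @ beta, squared-error loss, gradient descent
beta = np.zeros(X.shape[1])
lr = 0.1
for _ in range(2000):
    pred = X @ beta
    grad = X.T @ (pred - y) / len(y)       # gradient of the mean squared error
    beta -= lr * grad                      # minimize the loss function

print("learned coefficients:", beta)
# Step 6: in practice the hypothesis is then evaluated on held-out test data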
Advantages of Multivariate Regression
The most important advantage of Multivariate regression is it helps us to understand the
relationships among variables present in the dataset. This will further help in understanding the
correlation between dependent and independent variables. Multivariate linear regression is a
widely used machine learning algorithm.
Disadvantages of Multivariate Regression
 Multivariate techniques are somewhat complex and require a high level of mathematical calculation.
 The output of a multivariate regression model is sometimes not easy to interpret because of its loss and error outputs.
 This model does not have much scope for smaller datasets. Hence, the same cannot be
applied to them. The results are better for larger datasets.
3. Regularized regression Method
One of the most common problems every data science practitioner faces is overfitting. Avoiding overfitting can single-handedly improve our model's performance. Regularization helps in overcoming the problem of overfitting and also increases model interpretability.
Sometimes our machine learning model performs well on the training data but does not perform well on unseen or test data. This means the model has learned the noise in the training data and is not able to predict the output or target column for unseen data; such a model is called an overfitted model.
By noise we mean those data points in the dataset which do not really represent the true properties of the data, but appear only due to random chance. Regularization can be explained as follows:
 It is one of the most important concepts of machine learning. This technique prevents the
model from overfitting by adding extra information to it.
 It is a form of regression that shrinks the coefficient estimates towards zero. In other words,
this technique forces us not to learn a more complex or flexible model, to avoid the problem
of overfitting.
 Now, let us understand how the flexibility of a model is represented.
 For regression problems, the increase in flexibility of a model is represented by an increase in its coefficients, which are calculated from the regression line.
 In simple words, in the regularization technique we reduce the magnitude of the coefficients while keeping the same number of variables. This maintains accuracy as well as the generalization of the model.
Regularization works by adding a penalty or complexity term or shrinkage term with Residual Sum of
Squares (RSS) to the complex model.
Let us consider the simple linear regression equation:
Y ≈ β0 + β1X1 + β2X2 + … + βpXp
 Here Y represents the dependent feature or response (the learned relation), and X1, X2, …, Xp are the independent features or predictors for Y.
 β0, β1, …, βp represent the coefficient estimates for the different variables or predictors (X), which describe the weights or magnitudes attached to the features.
 In simple linear regression, our optimization function or loss function is known as the residual sum of squares (RSS).
 We choose the set of coefficients that minimizes the following loss function:
RSS = Σi=1..n ( yi − β0 − Σj=1..p βj xij )²
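As an illustration, scikit-learn's Ridge (L2 penalty) and Lasso (L1 penalty) estimators add such a shrinkage term to the RSS; the synthetic data and the alpha values below are arbitrary choices for this sketch.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, 50)   # only the first feature matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)           # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)           # L1 penalty can zero some out

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))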
4. Least Square regression for Classification


Regression and classification are fundamental topics in machine learning. To remind you, in
regression: the output variable takes continuous values, while in classification: the output variable
takes class labels.
To use the Least Squares Regression to solve a classification problem, a simple trick is used. The data
points of the first and second classes are extended by adding a new extra dimension. This produces an
augmented cloud of points in n+1 dimensional space, where n is the size of the original data space. In
that extra dimension, the data points belonging to the first and second classes take values of -1 and +1
respectively.
Then, samples of the augmented data (with the extra dimension) are fitted using Least Squares Regression. In this example implementation, the function to be fitted is chosen to be a polynomial function. The regression objective is to estimate the parameters of that polynomial such that it best fits the training data in a least-squares sense. You can easily change the order of the polynomial by setting the polynomial order variable. If it is set to 1, in the case of 2D data points, the fitting polynomial will represent a plane in 3D. If it is set to more than 1, it will allow curvatures and hence more complex data fitting.
To achieve classification, the classification decision boundary is simply the intersection between the
fitted polynomial surface and the surface where the extra dimension is constant at a value midway
between -1 and +1. The 1 and -1 in the previous sentence are equal to the values we have previously
set in the extra dimension for each class. If we set different values, it should be different.
Figure 5.1 The left plot shows data from two classes denoted by red crosses and blue circles, together with the decision boundary found by least squares (magenta curve) and by the logistic regression model (green curve). The right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.

For classification accuracy, we use the Minimum Correct Classification Rate (MCCR). MCCR is defined as the minimum of CCR1 and CCR2, where CCRn is the ratio of correctly classified test points in class n to the total number of test points in class n. The MCCR for the linear data set is zero using a polynomial of order 3. Both images in the figure show the classification decision boundary obtained from Least Squares Regression, as detailed above, in magenta. The decision boundary is good until some outlier data points are added to the blue class, as in the image on the right. The resulting classifier penalizes these outliers even though they are 'too correct' data points. The green curve is the decision boundary obtained by Logistic Regression. Its advantage is that it is robust to outliers and does not penalize the 'too correct' data points.
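Below is a minimal sketch of this trick using an order-1 (linear) fit rather than a higher-order polynomial; the two Gaussian point clouds and the -1/+1 targets are invented for illustration.

import numpy as np

rng = np.random.default_rng(2)
class0 = rng.normal([-1.0, -1.0], 0.5, size=(50, 2))   # extra-dimension target -1
class1 = rng.normal([+1.0, +1.0], 0.5, size=(50, 2))   # extra-dimension target +1
X = np.vstack([class0, class1])
t = np.r_[-np.ones(50), np.ones(50)]

A = np.c_[np.ones(len(X)), X]               # design matrix [1, x1, x2]
w, *_ = np.linalg.lstsq(A, t, rcond=None)   # least squares fit to the targets

# The decision boundary is where the fitted value crosses 0, midway between -1 and +1
pred = np.sign(A @ w)
print("training accuracy:", (pred == t).mean())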
1. Distance Based Models
Distance-based models are the second class of Geometric models. Like Linear models, distance-based
models are based on the geometry of data. As the name implies, distance-based models work on the
concept of distance. In the context of Machine learning, the concept of distance is not based on
merely the physical distance between two points. Instead, we could think of the distance between two points considering the mode of transport between them. Travelling between two cities by plane covers less physical distance than by train because the plane's path is unrestricted. Similarly, in chess,
the concept of distance depends on the piece used – for example, a Bishop can move
diagonally. Thus, depending on the entity and the mode of travel, the concept of distance can be
experienced differently. The distance metrics commonly used are Euclidean, Minkowski, Manhattan,
and Mahalanobis.
Figure5.2 Distance Metrics
Distance is applied through the concept of neighbors and exemplars. Neighbors are points in
proximity with respect to the distance measure expressed through exemplars. Exemplars are
either centroids that find a centre of mass according to a chosen distance metric or medoids that find
the most centrally located data point. The most commonly used centroid is the arithmetic mean, which
minimizes squared Euclidean distance to all other points.
Notes:
 The centroid represents the geometric centre of a plane figure, i.e., the arithmetic mean position of
all the points in the figure from the centroid point. This definition extends to any object in n-
dimensional space: its centroid is the mean position of all the points.
 Medoids are similar in concept to means or centroids. Medoids are most commonly used on data
when a mean or centroid cannot be defined. They are used in contexts where the centroid is not
representative of the dataset, such as in image data.
Examples of distance-based models include the nearest-neighbor models, which use the training data
as exemplars – for example, in classification. The K-means clustering algorithm also uses exemplars
to create clusters of similar data points.
2. Nearest Neighbors Classification
The principle behind nearest neighbor methods is to find a predefined number of training samples
closest in distance to the new point, and predict the label from these. The number of samples can be a
user-defined constant (k-nearest neighbor learning), or vary based on the local density of points
(radius-based neighbor learning). The distance can, in general, be any metric measure: standard
Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of their training data.
Despite its simplicity, nearest neighbors has been successful in a large number of classification and
regression problems, including handwritten digits and satellite image scenes. Being a non-parametric
method, it is often successful in classification situations where the decision boundary is very irregular.
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it
does not attempt to construct a general internal model, but simply stores instances of the training data.
Classification is computed from a simple majority vote of the nearest neighbors of each point: a query
point is assigned the data class which has the most representatives within the nearest neighbors of the
point. Two different nearest neighbors classifiers are available: KNeighborsClassifier implements learning based on the k nearest neighbors of each query point, where k is an integer value specified by the user, while RadiusNeighborsClassifier implements learning based on the number of neighbors within a fixed radius r of each training point, where r is a floating-point value specified by the user.
The k-neighbors classification in KNeighborsClassifier is the most commonly used technique. The
optimal choice of the value k is highly data-dependent: in general a larger k suppresses the effects of
noise, but makes the classification boundaries less distinct.
In cases where the data is not uniformly sampled, radius-based neighbors classification in RadiusNeighborsClassifier can be a better choice. The user specifies a fixed radius r, such that points in sparser neighborhoods use fewer nearest neighbors for the classification. For high-dimensional parameter spaces, this method becomes less effective due to the so-called "curse of dimensionality".
The basic nearest neighbors classification uses uniform weights: that is, the value assigned to a query
point is computed from a simple majority vote of the nearest neighbors. Under some circumstances, it
is better to weight the neighbors such that nearer neighbors contribute more to the fit. This can be
accomplished through the weights keyword. The default value, weights = 'uniform', assigns uniform
weights to each neighbor. weights = 'distance' assigns weights proportional to the inverse of the
distance from the query point. Alternatively, a user-defined function of the distance can be supplied to
compute the weights.
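A minimal scikit-learn sketch of k-nearest neighbors classification is shown below; the Iris data, k = 5 and the 'distance' weighting are arbitrary illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# weights='uniform' is the default; weights='distance' gives nearer
# neighbors a larger vote, as described above
knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))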

Figure 5.3 The left figure shows 3-class classification with weights='uniform' and the right figure with weights='distance'
 Association Rule Mining
Association rule mining finds interesting associations and relationships among large sets of data items. This rule shows how frequently an item set occurs in a transaction. A typical example is Market Basket Analysis.

Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people buy together frequently.
Given a set of transactions, we can find rules that will predict the occurrence of an item based on
the occurrences of other items in the transaction.

Before we start defining the rule, let us first see the basic definitions.

Support Count (σ) – Frequency of occurrence of an itemset.


Frequent Item set – An itemset whose support is greater than or equal to minsup threshold.
Association Rule – An implication expression of the form X -> Y, where X and Y are any 2
itemsets.
Rule Evaluation Metrics –
 Support(s) – The number of transactions that include items in both the {X} and {Y} parts of the rule, as a percentage of the total number of transactions. It is a measure of how frequently the collection of items occurs together across all transactions.
 Supp(X=>Y) = σ(X∪Y) ÷ N, where N is the total number of transactions – It is interpreted as the fraction of transactions that contain both X and Y.
 Confidence(c) – It is the ratio of the number of transactions that include all items in both {X} and {Y} to the number of transactions that include all items in {X}.
 Conf(X=>Y) = Supp(X∪Y) ÷ Supp(X) – It measures how often the items in Y appear in transactions that also contain the items in X.
 Lift(l) – The lift of the rule X=>Y is the confidence of the rule divided by the expected confidence, assuming that the itemsets X and Y are independent of each other. The expected confidence is simply the support of {Y}.
 Lift(X=>Y) = Conf(X=>Y) ÷ Supp(Y) – A lift value near 1 indicates that X and Y appear together about as often as expected, a value greater than 1 means they appear together more often than expected, and a value less than 1 means they appear together less often than expected. Greater lift values indicate a stronger association (a small code sketch of these metrics follows this list).
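The sketch below computes support, confidence and lift for the single made-up rule {bread} => {butter} over a handful of invented transactions.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]
X, Y = {"bread"}, {"butter"}
N = len(transactions)

supp_X = sum(X <= t for t in transactions) / N          # fraction containing X
supp_Y = sum(Y <= t for t in transactions) / N          # fraction containing Y
supp_XY = sum((X | Y) <= t for t in transactions) / N   # fraction containing both

confidence = supp_XY / supp_X
lift = confidence / supp_Y

print(f"support={supp_XY:.2f} confidence={confidence:.2f} lift={lift:.2f}")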
The Association rule is very useful in analyzing datasets. The data is collected using bar-code scanners in supermarkets. Such a database consists of a large number of transaction records which list all items bought by a customer in a single purchase. The manager can then know if certain groups of items are consistently purchased together and use this data for adjusting store layouts, cross-selling, and promotions based on these statistics.

 Tree based Models: Decision Tree


Tree-based models use a series of if-then rules to generate predictions from one or more decision
trees. All tree-based models can be used for either regression (predicting numerical values) or
classification (predicting categorical values). A decision tree model can be used to visually represent
the ―decisions‖, or if-then rules, that are used to generate predictions.
We‘ll go through each yes or no question, or decision node, in the tree and will move down the tree
accordingly, until we reach our final predictions.
There are essentially two key components to building a decision tree model: determining which
features to split on and then deciding when to stop splitting.
When determining which features to split on, the goal is to select the feature that will produce the most homogenous resulting datasets. The simplest and most commonly used way of doing this is to minimize entropy, a measure of the randomness within a dataset, and maximize information gain, the reduction in entropy that results from splitting on a given feature. We split on the feature that results in the highest information gain, and then recompute entropy and information gain for the resulting output datasets. For numerical features, we first sort the feature values in ascending order, then test each value as a threshold point and calculate the information gain of that split. The value with the highest information gain is then compared with other potential splits, and whichever has the highest information gain is used at that node. A tree can split on any numerical feature multiple times at different value thresholds, which enables decision tree models to handle non-linear relationships quite well.
The second decision we need to make is when to stop splitting the tree. We can split until each final
node has very few data points, but that will likely result in overfitting, or building a model that is too
specific to the dataset it was trained on. This is problematic because, while it may make good
predictions for that one dataset, it may not generalize well to new data, which is really our larger goal.
To combat this, we can remove sections that have little predictive power, a technique referred to
as pruning. Some of the most common pruning methods include setting a maximum tree depth or
minimum number of samples per leaf, or final node.
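A minimal scikit-learn sketch of an entropy-based decision tree with simple pruning controls (maximum depth and minimum samples per leaf) is given below; the breast-cancer dataset and the particular parameter values are illustrative assumptions.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion='entropy' selects splits by information gain; max_depth and
# min_samples_leaf are the pruning controls mentioned above
tree = DecisionTreeClassifier(criterion='entropy', max_depth=3,
                              min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))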

Advantages:
 Straightforward interpretation
 Good at handling complex, non-linear relationships
Disadvantages:
 Predictions tend to be weak, as singular decision tree models are prone to overfitting
 Unstable, as a slight change in the input dataset can greatly impact the final results
Applications of Decision Tree Machine Learning Algorithm
1. Decision trees are among the popular machine learning algorithms that find great use in
finance for option pricing.
2. Remote sensing is an application area for pattern recognition based on decision trees.
3. Decision tree algorithms are used by banks to classify loan applicants by their probability of
defaulting payments.
4. Gerber Products, a popular baby product company, used decision tree machine learning
algorithm to decide whether they should continue using the plastic PVC (Poly Vinyl
Chloride) in their products.
5. Rush University Medical Centre has developed a tool named Guardian that uses a decision
tree machine learning algorithm to identify at-risk patients and disease trends.

 Probabilistic Models
A probabilistic method or model is based on the theory of probability or the fact that randomness
plays a role in predicting future events.
Probabilistic models incorporate random variables and probability distributions into the model of an
event or phenomenon. While a deterministic model gives a single possible outcome for an event, a
probabilistic model gives a probability distribution as a solution. These models take into account the
fact that we can rarely know everything about a situation. There's nearly always an element of randomness to take into account. For example, life insurance is based on the fact that we know with certainty that we will die, but we don't know when. These models can be part deterministic and part
random or wholly random.
 Normal Distribution and Its Geometric Interpretations
Normal Distribution is an important concept in statistics and the backbone of Machine Learning. A data scientist needs to know about the Normal Distribution when working with linear models (which perform well if the data is normally distributed).
As discovered by Carl Friedrich Gauss, Normal Distribution/Gaussian Distribution is a continuous
probability distribution. It has a bell-shaped curve that is symmetrical from the mean point to both
halves of the curve.
The following formula gives the PDF (Probability Density Function) of the normal distribution:
f(x) = (1 / (σ √(2π))) · exp( -(x - μ)² / (2σ²) )
The mean of the normal distribution is μ (mu) and the standard deviation is σ (sigma).
The PDF gives you the relative likelihood of a continuous random variable taking a particular value. For the normal distribution, it is the bell-shaped curve.
The CDF (Cumulative Distribution Function) is the integral of the PDF. For the normal distribution it is written as Φ(z), which is the probability that a normally distributed random variable is less than the value z.
Figure 5.4 a) Normal Probability Density function b) Normal Cumulative Distribution
function

Figure 5.5 CDF of a: Φ(a)


In general, the mean μ gives you the central value (where the normal distribution PDF is at its peak), while the standard deviation (std or σ) gives you the spread around the mean. If the std is large, the sample has a large spread, while if the std is small the sample is distributed very closely around the mean.
Example: fishes in a fish tank
1) If the standard deviation is zero, all fish are exactly the same size, equal to μ. The spread of the distribution is nothing but a straight vertical line.
2) When we increase the standard deviation, the size of the fish varies around the mean. A few of them become bigger, while a few shrink.
3) When we further increase the std, the variance in fish size also increases, and the spread of the distribution becomes almost flat.
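A small numeric check of the PDF and CDF using scipy.stats.norm is given below; the mean and standard deviation are arbitrary values chosen for illustration.

from scipy.stats import norm

mu, sigma = 10.0, 2.0                 # e.g. mean fish size and its spread
dist = norm(loc=mu, scale=sigma)

print("pdf at the mean:", dist.pdf(mu))          # peak of the bell curve
print("P(size < 12):", dist.cdf(12.0))           # Phi((12 - mu) / sigma)
print("P(8 < size < 12):", dist.cdf(12.0) - dist.cdf(8.0))  # about 68% within one std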

 Naïve Bayes Classifier


A classifier is a machine learning model that is used to discriminate different objects based on certain
features.
A Naive Bayes classifier is a probabilistic machine learning model that is used for classification tasks. The crux of the classifier is based on the Bayes theorem.
Bayes Theorem:
P(A/B) = [ P(B/A) × P(A) ] ÷ P(B)
Using Bayes theorem, we can find the probability of A happening, given that B has occurred.
Here, B is the evidence and A is the hypothesis. The assumption made here is that the
predictors/features are independent. That is presence of one particular feature does not affect the other.
Hence it is called naive.
P(A/B) - Posterior probability of the class (A, target) given the predictor (B, attributes)
P(B/A) - Likelihood, which is the probability of the predictor given the class
P(A) - Prior probability of the class
P(B) - Prior probability of the predictor
Algorithm
1. Convert the data set into a frequency table
2. Create Likelihood table by finding the probabilities
3. Now, use Naive Bayesian equation to calculate the posterior probability for each class. The class
with the highest posterior probability is the outcome of prediction.
Naive Bayes algorithms are mostly used in sentiment analysis, spam filtering, recommendation systems, etc. They are fast and easy to implement, but their biggest disadvantage is the requirement that the predictors be independent. In most real-life cases the predictors are dependent, and this hinders the performance of the classifier.
Pros:
 It is easy and fast to predict the class of a test data set. It also performs well in multi-class prediction.
 When the assumption of independence holds, a Naive Bayes classifier performs better compared to other models like logistic regression, and you need less training data.
 It performs well with categorical input variables compared to numerical variable(s). For numerical variables, a normal distribution is assumed (bell curve, which is a strong assumption).

Cons:
 If a categorical variable has a category in the test data set which was not observed in the training data set, the model will assign it a 0 (zero) probability and will be unable to make a prediction. This is often known as "Zero Frequency". To solve this, we can use a smoothing technique; one of the simplest smoothing techniques is called Laplace estimation.
 On the other hand, naive Bayes is also known to be a bad estimator, so the probability outputs from predict_proba are not to be taken too seriously.
 Another limitation of Naive Bayes is the assumption of independent predictors. In real life, it
is almost impossible that we get a set of predictors which are completely independent.
When to use the Naive Bayes Classifier algorithm?
1. If you have a moderate or large training data set.
2. If the instances have several attributes.
3. Given the classification parameter, attributes that describe the instances should be
conditionally independent.
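A minimal scikit-learn sketch of a Naive Bayes classifier is shown below; GaussianNB assumes normally distributed numerical features (the strong assumption noted in the pros above), and the wine dataset is just an illustrative choice.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()        # assumes a normal distribution per feature and class
nb.fit(X_train, y_train)
print("test accuracy:", nb.score(X_test, y_test))
print("class probabilities for one sample:", nb.predict_proba(X_test[:1]))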
Unit No 06 Applications of Machine Learning
Machine learning can be used for many different applications. Some of these applications are explained here.
1) Email Spam and Malware Filtering
Of more than 300 billion emails sent every day, at least half are spam. Email providers have the huge
task of filtering out the spam and making sure their users receive the messages that matter.
Spam detection is messy. The line between spam and non-spam messages is fuzzy, and the criteria
change over time. From various efforts to automate spam detection, machine learning has so far
proven to be the most effective and the favored approach by email providers. Although we still see
spammy emails, a quick look at the junk folder will show how much spam gets weeded out of our
inboxes every day thanks to machine learning algorithms.
Machine learning algorithms use statistical models to classify data. In the case of spam detection, a
trained machine learning model must be able to determine whether the sequence of words found in an
email are closer to those found in spam emails or safe ones.
Different machine learning algorithms can detect spam, but one that has gained appeal is the "naïve Bayes" algorithm. As the name implies, naïve Bayes is based on "Bayes' theorem," which describes the probability of an event based on prior knowledge.

In the case of spam detection, things get a bit more complicated. Our target variable is whether a given email is "spam" or "not spam" (also called "ham"). The features are the words or word combinations found in the email's body. In a nutshell, we want to calculate the probability that an email message is spam based on its text.
The catch here is that our features are not necessarily independent. For instance, consider the terms "grilled," "cheese," and "sandwich." They can have separate meanings depending on whether they appear successively or in different parts of the message. Another example is the words "not" and "interesting." In this case, the meaning can be completely different depending on where they appear in the message. But even though feature independence is complicated in text data, the naïve Bayes classifier has proven to be efficient in natural language processing tasks if you configure it properly.
Spam detection is a supervised machine learning problem. This means you must provide your
machine learning model with a set of examples of spam and ham messages and let it find the relevant
patterns that separate the two different categories.
Most email providers have their own vast data sets of labeled emails. For instance, every time you flag an email as spam in your Gmail account, you're providing Google with training data for its machine learning algorithms.
Therefore, one of the key steps in developing a spam-detector machine learning model is preparing
the data for statistical processing. Before training your naïve Bayes classifier, the corpus of spam and
ham emails must go through certain steps.
We can remove words that appear in both spam and ham emails and don't help in telling the difference between the two classes. These are called "stop words" and include terms such as the, for, is, to, and some. We can also use other techniques such as "stemming" and "lemmatization," which transform words to their base forms. Stemming and lemmatization can help further simplify our machine learning model. When you train your machine learning model on the training data set, each term is assigned a weight based on how many times it appears in spam and ham emails. For instance, if "win big money prize" is one of your features and only appears in spam emails, then it will be given a larger probability of being spam. If "important meeting" is only mentioned in ham emails, then its inclusion in an email will increase the probability of that email being classified as not spam. Once you have processed the data and assigned the weights to the features, your machine learning model is ready to filter spam. When a new email comes in, the text is tokenized and run against the Bayes formula. Each term in the message body is multiplied by its weight, and the sum of the weights determines the probability that the email is spam. (In reality, the calculation is a bit more complicated, but to keep things simple, we will stick to the sum of weights.)
Simple as it sounds, the naïve Bayes machine learning algorithm has proven to be effective for many
text classification tasks, including spam detection. Like other machine learning algorithms, naïve
Bayes does not understand the context of language and relies on statistical relations between words to
determine whether a piece of text belongs to a certain class. This means that, for instance, a naïve Bayes spam detector can be fooled into overlooking a spam email if the sender just adds some non-spam words at the end of the message or replaces spammy terms with other closely related words.
Naïve Bayes is not the only machine learning algorithm that can detect spam. Other popular
algorithms include recurrent neural networks (RNN) and transformers, which are efficient at
processing sequential data like email and text messages.
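A toy sketch of such a spam filter, using a bag-of-words representation and multinomial naïve Bayes, is shown below; the six short messages and their labels are entirely made up, stop_words='english' drops common stop words, and alpha=1.0 applies Laplace smoothing.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training corpus: 1 = spam, 0 = ham
texts = [
    "win big money prize now", "cheap loans click here",
    "limited offer win prize", "important meeting tomorrow",
    "project report attached", "lunch at noon with the team",
]
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(CountVectorizer(stop_words='english'),
                      MultinomialNB(alpha=1.0))
model.fit(texts, labels)

print(model.predict(["win a money prize today", "meeting about the report"]))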
2) Image Recognition by Machine Learning
Image recognition refers to technologies that identify places, logos, people, objects, buildings, and
several other variables in images. Users are sharing vast amounts of data through apps, social
networks, and websites. Additionally, mobile phones equipped with cameras are leading to the
creation of limitless digital images and videos. The large volume of digital data is being used by
companies to deliver better and smarter services to the people accessing it.
Image recognition is a part of computer vision and a process to identify and detect an object or
attribute in a digital video or image. Computer vision is a broader term which includes methods of
gathering, processing and analyzing data from the real world. The data is high-dimensional and
produces numerical or symbolic information in the form of decisions.
The major steps in the image recognition process are gathering and organizing data, building a predictive model, and using it to recognize images. The human eye perceives an image as a set of signals which are processed by the visual cortex in the brain. This results in a vivid experience of a scene, associated with concepts and objects recorded in one's memory. Image recognition tries to mimic this process.
Computer perceives an image as either a raster or a vector image. Raster images are a sequence of
pixels with discrete numerical values for colors while vector images are a set of color-annotated
polygons. To analyze images the geometric encoding is transformed into constructs depicting physical
features and objects. These constructs can then be logically analyzed by the computer. Organizing
data involves classification and feature extraction. The first step in image classification is to simplify
the image by extracting the important information and leaving out the rest. For example, in the image below, if you want to extract the cat from the background you will notice a significant variation in RGB pixel values.

Figure 6.1 Image Recognition by Machine Learning


However, by running an edge detector on the image we can simplify it. You can still easily discern the circular shape of the face and eyes in these edge images, so we can conclude that edge detection retains the essential information while throwing away non-essential information. Some well-known feature descriptor techniques are Haar-like features introduced by Viola and Jones, Histogram of Oriented Gradients (HOG), Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Feature (SURF), etc. A classification algorithm then takes this feature vector as input and outputs a class label (e.g. cat or background/no-cat). Before a classification algorithm can do its magic, we need to train it by showing it thousands of cat and non-cat images. The general principle in machine learning algorithms is to treat feature vectors as points in a higher-dimensional space. The algorithm then tries to find planes or surfaces (contours) that separate the higher-dimensional space in such a way that all examples from a particular class lie on one side of the plane or surface.
There are numerous algorithms for image classification, such as bag-of-words, support vector machines (SVM), face landmark estimation (for face recognition), K-nearest neighbors (KNN), logistic regression, etc.
While the above two steps take up most of the effort, the final step of recognizing images is relatively easy. The image data, both training and test, are organized. Training data is kept separate from test data, which also means we remove duplicates (or near duplicates) between them. This data is fed into the model to recognize images. We have to find the image of a cat in our database of known images which has the closest measurements to our test image. All we need to do is train a classifier that can take the measurements from a new test image and tell us the closest match with a cat. Running this classifier takes milliseconds. The result of the classifier is 'Cat' or 'Non-cat'.
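A minimal sketch of this pipeline on the scikit-learn digits dataset is shown below; the SVM kernel and gamma value are arbitrary illustrative choices, and any of the classifiers listed above could be swapped in.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Each 8x8 digit image is flattened into a 64-dimensional feature vector,
# i.e. a point in a higher-dimensional space; the SVM then looks for
# surfaces that separate the classes
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel='rbf', gamma=0.001)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))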
3) Speech Recognition by Machine Learning
It's no secret that the science of speech recognition has come a long way since IBM introduced its first speech recognition machine in 1962. As the technology has evolved, speech recognition has become increasingly embedded in our everyday lives with voice-driven applications like Amazon's Alexa, Apple's Siri, Microsoft's Cortana, or the many voice-responsive features of Google. From our phones, computers, watches and even our refrigerators, each new voice-interactive device that we bring into our lives deepens our dependence on artificial intelligence (AI) and machine learning.
Machine learning, a subset of artificial intelligence, refers to systems that can learn by themselves. It involves teaching a computer to recognize patterns, rather than programming it with specific rules. The training process involves feeding large amounts of data to the algorithm and allowing it to learn from that data and identify patterns. In the early days, programmers would have to write code for every object they wanted to recognize (e.g. human vs. dog); now one system can recognize both by showing it many examples of each. As a result, these systems continue to get smarter over time without human intervention. Rev's automatic transcription is powered by automated speech recognition (ASR) and natural language processing (NLP). ASR is the conversion of spoken words to text, while NLP is the processing of the text to derive its meaning. Since humans often speak in colloquialisms, abbreviations, and acronyms, it takes extensive computer analysis of natural language to produce accurate transcription.
Teaching a machine to learn to read a spoken language as humans do is something that has not yet been perfected. Listening to and understanding what a person says is much more than hearing the words the person speaks. As humans, we also read the person's eyes, their facial expressions, body language, and the tones and inflections in their voice. Another nuance of speech is the human tendency to shorten certain words (e.g. "I don't know" becomes "dunno"); we have said abbreviated words for so long that we do not pronounce them as precisely as when we learned them. This human disposition poses yet another challenge for machine learning in speech recognition.
Machines are learning to "listen" to accents, emotions and inflections, but there is still quite a way to go. As the technology becomes more sophisticated and more data is used by specific algorithms, those challenges are quickly being overcome.

The technology to support voice-powered interfaces is incredibly powerful. With the


advancements in artificial intelligence and the copious amounts of speech data that can be easily
mined for machine learning purposes, it would not be surprising if it becomes the next dominant
user interface.
Various algorithms and computation techniques are used to recognize speech into text and improve
the accuracy of transcription. Below are some of the most commonly used methods:
1. Natural Language Processing (NLP):
2. Hidden Markov Models (HMM)
3. N-grams
4. Neural Networks
5. Speaker Diarization (SD)

4) Traffic Prediction by Machine Learning


Road networks are divided into "Supersegments" consisting of multiple adjacent segments of road that share significant traffic volume. Currently, the Google Maps traffic prediction system consists of the following components: (1) a route analyzer that processes terabytes of traffic information to construct Supersegments, and (2) a novel Graph Neural Network model, which is optimized with multiple objectives and predicts the travel time for each Supersegment.
The biggest challenge to solve when creating a machine learning system to estimate travel times using
Supersegments is an architectural one. How do we represent dynamically sized examples of
connected segments with arbitrary accuracy in such a way that a single model can achieve success?
Our initial proof of concept began with a straight-forward approach that used the existing traffic
system as much as possible, specifically the existing segmentation of road-networks and the
associated real-time data pipeline. This meant that a Supersegment covered a set of road segments,
where each segment has a specific length and corresponding speed features. At first we trained a
single fully connected neural network model for every Supersegment. These initial results were
promising, and demonstrated the potential in using neural networks for predicting travel time.
However, given the dynamic sizes of the Supersegments, we required a separately trained neural
network model for each one. To deploy this at scale, we would have to train millions of these models,
which would have posed a considerable infrastructure challenge. This led us to look into models that
could handle variable length sequences, such as Recurrent Neural Networks (RNNs). However,
incorporating further structure from the road network proved difficult. Instead, we decided to use
Graph Neural Networks. In modeling traffic, we're interested in how cars flow through a network of
roads, and Graph Neural Networks can model network dynamics and information propagation.
Our model treats the local road network as a graph, where each route segment corresponds to a node
and edges exist between segments that are consecutive on the same road or connected through an
intersection. In a Graph Neural Network, a message passing algorithm is executed where the
messages and their effect on edge and node states are learned by neural networks. From this
viewpoint, our Supersegments are road subgraphs, which were sampled at random in proportion to
traffic density. A single model can therefore be trained using these sampled subgraphs, and can be
deployed at scale.
Graph Neural Networks extend the learning bias imposed by Convolutional Neural Networks and Recurrent Neural Networks by generalizing the concept of "proximity", allowing us to have arbitrarily complex connections to handle not only traffic ahead of or behind us, but also along adjacent and intersecting roads. In a Graph Neural Network, adjacent nodes pass messages to each other. By keeping this structure, we impose a locality bias where nodes will find it easier to rely on adjacent nodes (this only requires one message-passing step). These mechanisms allow Graph Neural Networks to capitalize on the connectivity structure of the road network more effectively.
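A highly simplified sketch of one message-passing step on a tiny, made-up road graph is shown below; real systems learn the message and update functions with neural networks, whereas here they are fixed linear maps purely for illustration.

import numpy as np

# Adjacency matrix for 4 connected road segments (invented topology)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Node states: e.g. [current speed, segment length] for each segment
H = np.array([[30.0, 1.2], [10.0, 0.8], [50.0, 2.0], [45.0, 1.5]])

W_msg, W_self = np.eye(2) * 0.5, np.eye(2)   # stand-ins for learned weights

messages = A @ H @ W_msg                 # each node sums its neighbors' messages
H_new = np.tanh(H @ W_self + messages)   # updated node states
print(H_new)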
5) Self-driving Cars
The development of self-driving cars is one of the most trendy and popular directions in the world of
AI and machine learning. Automotive Artificial Intelligence is rapidly displacing human drivers by
enabling self-driving cars that use sensors to gather data about their surroundings. But how do self-
driving cars interpret that data? This is the biggest use case of machine learning in automotive.
Driverless cars can identify objects, interpret situations, and make decisions based on object detection and object classification algorithms. They do this by detecting objects, classifying them, and interpreting what they are. Comprehensive data annotation services are used to help train the machine learning algorithms to make the right decisions when navigating the roads. Machine learning is accomplished through a fusion of many algorithms that overlap to minimize failure and ensure safety. These algorithms interpret road signs, identify lanes, and recognize crossroads. The
three major sensors used by self-driving cars work together as the human eyes and brain. These
sensors are cameras, radar, and lidar. Together, they give the car a clear view of its environment. They
help the car to identify the location, speed, and 3D shapes of objects that are close to it. Additionally,
self-driving cars are now being built with inertial measurement units that monitor and control both
acceleration and location.
Self-driving cars have a number of cameras at every angle for a perfect view of their surroundings.
While some cameras have a broader field of view of about 120 degrees, others have a narrower view
for long-distance vision. Fish-eye cameras provide extensive visuals for parking purposes.
Radar detectors augment the efforts of camera sensors at night or whenever visibility is poor. They
send pulses of radio waves to locate an object and send back signals about the speed and location of
that object.
Lidar sensors calculate distance through pulsed lasers, empowering driverless cars with 3D visuals of their surroundings and adding richer information about shape and depth.
LiDAR is one of the most important technologies used in the development of self-driving vehicles.
Basically, it is a device that sends out pulses of light that bounce off an object and returns back to the
LiDAR sensor which determines its distance. The LiDAR produces 3D Point Cloud which is a digital
representation of the way the car sees the physical world.
We talked about some of the ways AI-powered vehicles see the physical world, but how are they able
to identify things like street signs, other cars, road markings and many other things encountered on the
road? This is where data annotation plays a crucial role. This is when all of the raw training data is
prepared through various annotation methods that allow the AI-system to understand what it needs to
learn. For the automotive sector, the most common data annotation methods include 3D Point Cloud
annotation, video labeling, full scene segmentation and many others.
The quality of the data annotation is very important since it will ultimately determine the accuracy and the ability of the vehicle to navigate its surroundings; let's also not forget that people's lives are at stake here. After all, one of the major goals of self-driving cars is increased safety, since 94% of serious crashes are the result of human error. The goal here is to reduce the human factor in driving and make the car as accurate and safe as possible.
6) Virtual Personal Assistant
Virtual Personal Assistants (VPA) are software programs meant to interact with an end user in a natural way, to answer questions, follow a conversation and accomplish different tasks. Two kinds of inputs are usually possible for a VPA: a voice interface (such as Apple Siri) or a text interface (Google Assistant). The key point is that the end user is supposed to be able to talk to the VPA using natural language, that is, as they would to another human being, rather than having to use specific sets of commands or a computer language. Since the launch of Siri by Apple in 2011, the offering of virtual personal assistants has developed rapidly to provide a more generic user interface. This user interface can be accessed from the user's device (smartphone or specific device) to perform actions, control objects, answer questions and even make recommendations on its own.
The recent progress in machine learning that has made virtual personal assistants possible comes from
a specific approach: neural networks, and more specifically deep learning. This approach benefits the
many different tasks at the core of natural language processing. The tasks involved in processing
natural language for a VPA can be grouped into four main categories, described below.
1. Speech-to-text
For voice-based input, like that handled by a VPA such as Siri, the first step is converting speech
into actionable data. This "speech-to-text" step (also called speech recognition) is of paramount
importance: if the input is not correctly recognized, all following steps are useless, and even an
error on a single word is likely to result in an inaccurate answer. Speech-to-text technologies have
been in development for several decades, but advances in AI have enabled important improvements in
recent years.
Figure 6.2 Speech to Text
2. Syntax and semantic processing
Once a sequence of spoken words is successfully converted to a text form, numerous and very
complex tasks remain. Syntax analysis (or parsing) is used to analyze and identify the structure of the
sentence, based on knowledge of grammar. Semantic analysis is used to reach a partial representation
of the meaning of the sentence, based on the knowledge of the meaning of words. Pragmatic analysis
is used to reach a final representation of the meaning of the sentence, based on information about the
context.
3. Question Answering
For the vast majority of applications related to natural language, an answer (oral and/or written) is
given back after a query from the user. Using a correctly identified sentence and its
meaning/intention, systems thus have to succeed in finding the correct answer and formulating it.
Question answering deals with information retrieval (using information on the Internet, or in an
application) and generating a correct sentence, before the last step of speech synthesis.
4. Text to Speech
Text-to-speech (TTS) is the last step in a VPA interaction which uses audio: the text/answer has
already been determined, and the synthesis is the only remaining step. Speech synthesis is not the
most complex task, and is already mastered by all involved players. However, one major possibility of
improvement is to succeed in generating human-sounding speech. Indeed, current text-to-speech
systems are largely based on concatenative TTS and have a robotic-like sound as a result.
Figure 6.3 Text to Speech
A virtual personal assistant works in real time, giving the required output almost instantaneously.
When we issue a command through the microphone, the speech is first processed and converted to text;
keywords are then extracted from the text and checked against the modules stored on the local drive.
If the keywords match one of the modules, that module is executed; if they do not match any module,
the assistant asks the user to try again or says it did not understand the request. A minimal sketch
of this keyword-matching step is given below.
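The following is a minimal sketch of only this keyword-to-module dispatch step, with hypothetical module names; real assistants use trained intent classifiers rather than exact keyword lookup.

# Hypothetical keyword-to-module matcher illustrating the dispatch step described above.
MODULES = {
    "weather": lambda: "Fetching today's weather forecast...",
    "alarm": lambda: "Setting an alarm for you...",
    "music": lambda: "Playing your favourite playlist...",
}

def handle_command(transcribed_text):
    """Dispatch a speech-to-text transcript to the first matching module."""
    for word in transcribed_text.lower().split():
        if word in MODULES:
            return MODULES[word]()  # execute the matched module
    return "Sorry, I didn't understand. Please try again."

print(handle_command("What is the weather like today"))
print(handle_command("Tell me a joke"))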
6) Machine Learning in Medical Diagnosis
Diagnosing diseases precisely is the cornerstone of healthcare. When physicians make bad judgment
calls, the result can be life-altering complications and prolonged patient recovery.
We will take a closer look at the ways machine learning is used for medical diagnosis:
1. Oncology
In oncology, the importance of detecting a malignant tumor on time is vital. This is why the accuracy
and precision of the diagnosis are crucial in this field.
Machine learning helps oncologists detect the disease at its earliest stages.
2. Pathology
The need to process large datasets also makes pathology a natural fit for artificial intelligence.
Here are the most promising ways of using machine learning to support medical diagnosis in this field:
 Improving the precision of blood and culture analysis using automated tissue and cell
quantification.
 Mapping disease cells and flagging areas of interest on a medical slide.
 Creating tumor staging paradigms.
 Improving healthcare professionals' productivity by increasing the speed of profile scanning.
3. Dermatology
In dermatology, artificial intelligence is used to improve clinical decision-making and ensure the
accuracy of skin disease diagnoses. Physicians hope that machine learning implementation in this
field will reduce the number of unnecessary biopsies dermatologists have to put patients through.
There are plenty of functional machine learning implementations in Dermatology, namely:
 An algorithm that separates melanomas from benign skin lesions with higher precision than
that of a human.
 Tools that track the development and changes in skin moles, helping detect pathological
conditions at the earliest stages.
 Algorithms that pinpoint biological markers for acne, nail fungus, and seborrheic dermatitis.
4. Genetics and Genomics
Recently, artificial intelligence has helped geneticists progress significantly in the transcription of
human genes. Machine learning and AI technologies are key players in preventive genetics. Scientists
increasingly rely on algorithms to determine how drugs, chemicals, and environmental factors
influence the human genome.
5. Mental Health
Artificial intelligence can have a groundbreaking impact on mental health research and the efficiency
of medical diagnosis through machine learning. The top applications of innovative technologies in the
field are:
 Personalized cognitive behavior therapy (CBT) fueled by chatbots and virtual therapists.
 Mental disease prevention by creating machine learning tools that help high-risk groups avoid
social isolation.
 Identifying groups with a high risk of suicide and providing them with support and assistance.
 Early detection of mental disorders using machine learning and data science: diagnosing
clinical depression, bipolar disorder, anxiety, and more.
6. Critical care
Artificial intelligence has the potential to reduce the length of an average ICU stay by predicting
early-onset sepsis and adjusting ventilator and other equipment settings according to a patient's
conditions.
Using artificial intelligence helps doctors avoid poor judgment calls, such as premature extubation or
prolonged intubation, that are strongly linked to higher ICU mortality rates.
In addition, machine learning in ICU can help physicians identify high-risk patients to make sure no
early deterioration sign is left unnoticed.
7. Eye care
The diagnosis of ophthalmological conditions has a lot of room for machine learning optimization.
Some of the latest innovations that these healthcare centers have adopted are:
 AI-driven vision screening programs that help provide a point-of-care medical diagnosis
based on machine learning for Ophthalmological conditions.
 Identifying Diabetic Retinopathy and providing physicians with treatment insights by
analyzing patient data (in 2018, the FDA approved the first among these machine learning
scanners for clinical use).
 Early-stage diagnosis of Macular Degeneration with the help of deep learning algorithms.
 High-precision glaucoma and cataract screening
8. Diabetes
Over the last decade, the range of machine learning application examples for diagnosing and treating
diabetes has grown exponentially:
 Using support vector machine modeling and building neural networks for pre-diabetes screening.
 Creating tools for managing personalized insulin delivery, as well as artificial pancreas
systems.
 Predicting treatable complications in diabetes patients to improve the quality of their lives.
 Identifying genetic and other biomarkers for diabetes.
9. Public Health
Artificial intelligence allows healthcare professionals to increase the scale of medical diagnoses using
machine learning and shift from analyzing individual cases to monitoring communities and predicting
disease outbreaks.
Machine Learning Algorithms
In this section, the top 10 machine learning algorithms are explained in detail.
1. K-Means Clustering
K-means is a popularly used unsupervised machine learning algorithm for cluster analysis. K-Means
is a non-deterministic and iterative method, grounded in the geometry of an n-dimensional feature
space: each data instance is visualized as a point in n dimensions, where the n dimensions are the n
features.
The algorithm operates on a given data set through a pre-defined number of clusters, k. The output of
the K-Means algorithm is k clusters, with the input data partitioned among them. For instance,
consider K-Means clustering for Wikipedia search results. The search term "Jaguar" on Wikipedia will
return all pages containing the word Jaguar, which can refer to Jaguar the car, Jaguar the Mac OS
version, or the jaguar as an animal. The K-Means clustering algorithm can then group web pages that
talk about similar concepts: all pages about the jaguar as an animal go into one cluster, pages about
Jaguar as a car into another, and so on.
For any new incoming data point, the point is assigned according to its proximity to the existing
cluster centroids. Data points inside a cluster exhibit similar characteristics, while different clusters
have different properties. The basic example of clustering would be the grouping of the same kind of
customers in a certain class for any kind of marketing campaign. It is also a useful algorithm for
document clustering.
The steps followed in the k means algorithm are as follows -
1. Specify the number of clusters as k
2. Randomly select k data points and assign them to the clusters
3. Cluster centroid will be calculated subsequently
4. Keep iterating over the following sub-steps until the centroids stop changing:
i) Compute the sum of squared distances between each centroid and the data points
ii) Assign each data point to its closest centroid
iii) Recompute each centroid as the average of all the data points assigned to its cluster
We can find a suitable number of clusters k by plotting the sum of squared distances against k; the
point where the curve stops decreasing sharply (the "elbow") indicates a good value of k.
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Label encoding of the target variable
5. Creating the feature set and target label variables
6. Feature scaling
7. Principal Component Analysis (PCA) to reduce the dimensionality of the data to 2 dimensions
8. Scatter-plot visualization of the 2 principal components
9. K-Means clustering with 2 clusters signifying the 2 classes
10. Performance analysis of the K-Means clustering and performance visualization using the decision
boundary on the scatter plot (a minimal sketch of this workflow follows)
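Below is a minimal sketch of steps 1-10 using scikit-learn; the file name data.csv and the column names id and target are placeholders, not part of any specific dataset.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv("data.csv")                          # 2. read the dataset (placeholder file)
df = df.drop(columns=["id"])                          # 3. drop unwanted columns
y = LabelEncoder().fit_transform(df["target"])        # 4. label-encode the target
X = df.drop(columns=["target"])                       # 5. feature set
X_scaled = StandardScaler().fit_transform(X)          # 6. feature scaling
X_2d = PCA(n_components=2).fit_transform(X_scaled)    # 7. reduce to two principal components

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)   # 9. two clusters for the two classes
labels = kmeans.fit_predict(X_2d)

# 10. rough performance check of the clustering against the known labels
print("Adjusted Rand Index:", adjusted_rand_score(y, labels))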
Advantages of using K-Means Clustering ML Algorithm
1. In case of globular clusters, K-Means produces tighter clusters than hierarchical clustering.
2. Given a smaller value of K, K-Means clustering computes faster than hierarchical clustering
for a large number of variables.
Applications of K-Means Clustering
The K-Means clustering algorithm is used by search engines such as Yahoo and Google to cluster web
pages by similarity and identify the 'relevance rate' of search results, which helps reduce
computation time for users.
2. Artificial Neural Network (ANN)
The learning in such a network can be either supervised or unsupervised. The human brain is a highly
complex, non-linear, parallel computer that organizes its structural constituents, the neurons, into
intricate interconnections. Take the simple example of face recognition: whenever we meet a person we
know, we can easily recall their name, where they work, or their relationship with us. We may know
thousands of people, yet the human brain recognizes a familiar face immediately.
Now suppose a computer is asked to perform this task instead. It is not an easy computation for the
machine, because it does not know the person. You have to show the computer images of different
people: if you know 10,000 people, you have to feed all 10,000 photographs into the computer.
Now, whenever you meet a person, you capture
an image of the person and feed it to the computer. The computer matches this photograph with all the
10,000 photographs that you have already fed into the database. At the end of all the computations-it
gives the result with the photograph that best resembles the person. This could take several hours or
more depending on the number of images present in the database. The complexity of the task will
increase with the increase in the number of images in the database. However, a human brain can
recognize it instantly.
Can a computer recognize a face this quickly? Is the computational capability of humans fundamentally
different from that of computers? A silicon IC switches on the order of 10⁻⁹ seconds (nanoseconds),
whereas a biological neuron fires on the order of 10⁻³ seconds (milliseconds), roughly six orders of
magnitude slower. The puzzling question, then, is how the human brain can still respond faster than a
computer on such tasks. Typically, there are
10 billion neurons with approximately 60 trillion interconnections inside the human brain but still, it
processes faster than the computer. This is because the network of neurons in the human brain is
massively parallel.
The question now is whether the massively parallel nature of the human brain can be mimicked with
computer software. It is not easy, because we cannot realistically assemble and interconnect that many
processing units. What can be done, within limits, is to interconnect a network of processors. Instead
of modeling the structure of the human brain in its entirety, only a very small part of the brain is
mimicked to perform a very specific task. We can
make neurons but they will be different from the biological neuron of the human brain. This can be
achieved using Artificial Neural Networks. By artificial we inherently mean something that is
different from biological neurons. ANNs are, in essence, simulated brains that can be programmed the
way we want. By defining rules that mimic the behavior of the human brain, data scientists can solve
real-world problems that could never have been tackled before.
How do Artificial Neural Network algorithms work?
An artificial neural network is a computational model inspired by the brain: a network of
interconnected neurons. This interconnected structure is used to make predictions for both regression
and classification problems. An ANN consists of several layers: the input layer, one or more hidden
layers, and the output layer. The hidden layers are where the mathematics of the neural network takes
place: weighted sums and biases are computed there and passed through activation functions, which
shape the output in a structured and bounded manner. ANNs are mainly used for solving non-linear
problems such as handwriting recognition and the traveling salesman problem; they involve complex
mathematical calculations and are highly compute-intensive in nature.
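The following is a minimal NumPy sketch of one forward pass through a single hidden layer, with arbitrary random weights; it only illustrates the weighted-sum, bias, and activation arithmetic described above, not a trained network.

import numpy as np

def sigmoid(z):
    # squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 3 inputs -> 4 hidden neurons -> 1 output (weights chosen at random).
rng = np.random.default_rng(0)
W_hidden, b_hidden = rng.normal(size=(4, 3)), np.zeros(4)
W_out, b_out = rng.normal(size=(1, 4)), np.zeros(1)

x = np.array([0.5, -1.2, 3.0])                 # one input example
hidden = sigmoid(W_hidden @ x + b_hidden)      # weighted sum + bias + activation in the hidden layer
output = sigmoid(W_out @ hidden + b_out)       # same operation at the output layer
print("network output:", output)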
Imagine you are walking on a walkway and you see a pillar (assume that you have never seen a pillar
before). You walk into the pillar and hit it. Now, the next time whenever you see a pillar you stay a
few meters away from the pillar and continue walking on the side. This time your shoulder hits the
pillar and you are hurt again. Again when you see the pillar you ensure that you don‘t hit it but this
time on your path you hit a letter-box (assuming that you have never seen a letter-box before). You
walk into it and the whole process repeats. This is how an artificial neural network works: it is
given many examples and tries to produce the right answer. Whenever it is wrong, an error is
calculated and propagated backward through the network, adjusting the weights at the synapses (the
weighted connections between neurons); this is back-propagation. Thus, an ANN requires a great many
examples to learn from, and for real-world applications these can number in the millions or billions.
Figure 1. Single-layer neural network, also called a perceptron
Why use Artificial Neural Networks?
1. ANNs are built from interconnected non-linear neurons, so these machine learning algorithms
can exploit non-linearity in a distributed manner.
2. They can adapt their free parameters to changes in the surrounding environment.
3. They learn from their mistakes and make better decisions through backpropagation.
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Preprocessing
5. To avoid over-fitting, we will divide our dataset into training and test splits. The training data will
be used to train the neural network and the test data will be used to evaluate the performance of the
neural network.
6. Feature scaling
7. Training and Predictions
8. Evaluating the algorithm
Advantages of Using Artificial Neural Networks
 Easy to explain to professionals who do not want to dig into the mathematics of complex
machine learning algorithms. If you are trying to sell a model to an organization, which would
you rather pitch: an Artificial Neural Network (ANN) or a Support Vector Machine (SVM)? The
answer is usually ANN, because you can explain that it works loosely like the neurons in the
brain.
 They are easy to conceptualize.
 They have the ability to identify all probable interactions between predictor variables.
 They have the ability to subtly identify complex non-linear relationships that exist between
independent and dependent variables.
 It is relatively easy to add prior knowledge to the model.
Disadvantages of Using Artificial Neural Networks
 It is very difficult to reverse engineer artificial neural networks. If your ANN learns that
the image of a dog is actually a cat, it is very difficult to determine "why". All that can be
done is to continue tweaking or training the ANN further.
 Artificial Neural Network algorithms are not probabilistic meaning if the output of the
algorithm is a continuous number it is difficult to translate it into a probability.
 They are not magic wands and cannot be applied to every kind of machine learning problem.
 Artificial Neural Networks in native implementation are not highly effective at practical
problem-solving. However, this can be improved with the use of deep learning techniques.
 Multi-layered artificial neural network algorithms are hard to train and require tuning a lot of
parameters.
Applications of Artificial Neural Networks
 Financial Institutions use Artificial Neural Networks machine learning algorithms to enhance
the performance in evaluating loan applications, bond rating, target marketing, credit scoring.
They are also used to identify instances of fraud in credit card transactions.
 Buzzfeed uses artificial neural network algorithms for image recognition to organize and
search videos or photos.
 Many bomb detectors at US airports use artificial neural networks to analyze airborne trace
elements and identify the presence of explosive chemicals.
 Google uses Artificial Neural Networks for Speech Recognition, Image Recognition, and
other pattern recognition (handwriting recognition) applications. ANN‘s are used at Google to
sniff out spam and for many other applications.
 Artificial Neural Networks find great applications in robotic factories for adjusting
temperature settings, controlling machinery, and diagnosing malfunctions.
3. K-Nearest Neighbour (KNN)
K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on supervised
learning technique.
The K-NN algorithm assumes similarity between the new case and the available cases and puts the new
case into the category most similar to the existing categories.
The K-NN algorithm stores all the available data and classifies a new data point based on similarity.
This means that when new data appears, it can easily be assigned to a well-suited category using the
K-NN algorithm.
K-NN algorithm can be used for Regression as well as for Classification but mostly it is used for the
Classification problems.
K-NN is a non-parametric algorithm, which means it does not make any assumption on underlying
data.
It is also called a lazy learner algorithm because it does not learn from the training set immediately
instead it stores the dataset and at the time of classification, it performs an action on the dataset.
KNN algorithm at the training phase just stores the dataset and when it gets new data, then it classifies
that data into a category that is much similar to the new data.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we
want to know whether it is a cat or a dog. We can use the KNN algorithm for this identification, as it
works on a similarity measure. Our KNN model will find the features of the new image most similar to
those of the cat and dog images and, based on the most similar features, place it in either the cat or
the dog category.
Why do we need KNN?
Suppose there are two categories, Category A and Category B, and we have a new data point x1; which
category should it be placed in? To solve this type of problem, we need the K-NN algorithm, which can
easily identify the category or class of a particular data point.
How does KNN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each category.
o Step-5: Assign the new data point to the category with the greatest number of neighbors among the K.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category. Consider the below
image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance
between two points (x1, y1) and (x2, y2), familiar from geometry, is calculated as:
d = √((x2 − x1)² + (y2 − y1)²)
o By calculating the Euclidean distances we find the nearest neighbors: three nearest neighbors in
category A and two in category B.
o Since the majority of the five nearest neighbors belong to category A, the new data point is
assigned to category A.
How to select the value of K in KNN algorithm?
o There is no particular way to determine the best value for "K", so we need to try some values
to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of outliers in
the model.
o Large values for K can smooth out noise, but they may blur the boundaries between classes.
o Small value of K
1. Captures fine structure of the problem space better
2. May be necessary for small training set
o Large value of K
1. Less sensitive to noise (particularly class noise)
2. Better probability estimates for discrete class
3. Larger training set allows you to use large value of K
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Preprocessing
5. Fitting the KNN algorithm to the training set
6. Predicting the test results
7. Visualizing the test set results
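A minimal sketch of steps 5-7 using scikit-learn follows; the Iris dataset is used only as a stand-in so the example is runnable, and accuracy is reported instead of a plot.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # K = 5, Euclidean distance
knn.fit(X_train, y_train)        # training phase: the lazy learner simply stores the data
y_pred = knn.predict(X_test)     # classification phase: distances are computed here
print("accuracy:", accuracy_score(y_test, y_pred))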
Advantages of KNN algorithm
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN algorithm
o The value of K always needs to be chosen, which can sometimes be tricky.
o The computation cost is high because of calculating the distance between the data points for
all the training samples.
4. Decision Tree
Decision Tree is a supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent the
decision rules and each leaf node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and Leaf Node. Decision nodes
are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those
decisions and do not contain any further branches.
The decisions or the test are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision based on
given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which expands on
further branches and constructs a tree-like structure.
In order to build a tree, we use the CART algorithm, which stands for Classification and Regression
Tree algorithm.
A decision tree simply asks a question and, based on the answer (Yes/No), further splits into subtrees.
The general structure of a decision tree includes the following elements:
1. Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
2. Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
3. Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
4. Branch/Sub Tree: A tree formed by splitting the tree.
5. Pruning: Pruning is the process of removing the unwanted branches from the tree.
6. Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
How does the decision tree algorithm work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and
moves further. It continues the process until it reaches a leaf node of the tree. The complete process
can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure (ASM).
o Step-3: Divide the S into subsets that contains possible values for the best attributes.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in step 3.
Continue this process until a stage is reached where the nodes cannot be classified further; such
final nodes are called leaf nodes.
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Preprocessing
5. Fitting the Decision Tree algorithm to the training set
6. Predicting the test results
7. Visualizing the test set results
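A minimal sketch of steps 5-7 follows, again using a built-in dataset as a stand-in; the split criterion is set explicitly so the attribute selection measure is visible, and the learned splits are printed as text instead of a plot.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 'entropy' corresponds to information gain; 'gini' is the CART default.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print(export_text(tree))   # textual view of the learned decision rules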
Advantages of the decision tree algorithm
o It is simple to understand, as it follows the same process a human follows while making a decision
in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
Disadvantages of the decision tree algorithm
o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
5. Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on the
concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex
problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy of that
dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree
and based on the majority votes of predictions, and it predicts the final output.
The greater number of trees in the forest leads to higher accuracy and prevents the problem of
overfitting.
To better understand the Random Forest algorithm, you should first be familiar with the Decision Tree
algorithm.
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that
some decision trees may predict the correct output, while others may not. But together, all the trees
predict the correct output. Therefore, below are two assumptions for a better Random forest classifier:
o There should be some actual values in the feature variable of the dataset so that the classifier
can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Why use Random Forest algorithm?
Below are some points that explain why we should use the Random Forest algorithm:
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy and runs efficiently even on large datasets.
o It can also maintain accuracy when a large proportion of data is missing.
How does Random Forest algorithm work?
Random Forest works in two phases: the first is to create the forest by combining N decision trees,
and the second is to make a prediction with each tree created in the first phase.
The working process can be explained in the following steps:
Step-1: Select random K data points from the training set.
Step-2: Build the decision trees associated with the selected data points (Subsets).
Step-3: Choose the number N for decision trees that you want to build.
Step-4: Repeat Step 1 & 2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data points
to the category that wins the majority votes.
The working of the algorithm can be better understood through the following example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given to the
Random forest classifier. The dataset is divided into subsets and given to each decision tree. During
the training phase, each decision tree produces a prediction result, and when a new data point occurs,
then based on the majority of results, the Random Forest classifier predicts the final decision.
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Preprocessing
5. Fitting the Random Forest algorithm to the training set
6. Predicting the test results
7. Visualizing the test set results
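The sketch below shows how the forest combines the votes of its individual trees; the built-in Iris dataset is a stand-in, and only five trees are used so the individual votes are easy to inspect.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators is the number N of decision trees in the forest.
forest = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_train, y_train)

# Each fitted tree votes on the first test sample; the forest reports the majority class.
votes = np.array([tree.predict(X_test[:1])[0] for tree in forest.estimators_])
print("individual tree votes:", votes)
print("forest prediction:", forest.predict(X_test[:1])[0])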
Advantages of Random Forest algorithm
o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.
Disadvantages of Random Forest algorithm
o Although random forest can be used for both classification and regression tasks, it is less well
suited to regression tasks.
Applications of Random Forest algorithm
There are four main sectors where Random Forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
6. Support Vector Machine (SVM)
Support Vector Machine or SVM is one of the most popular supervised learning algorithms, which
is used for Classification as well as Regression problems. However, primarily, it is used for
Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-
dimensional space into classes so that we can easily put the new data point in the correct category in
the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are
called support vectors, and hence the algorithm is termed a Support Vector Machine. Picture two
different categories that are separated by a decision boundary, or hyperplane.
Example: SVM can be understood with the example we used for the KNN classifier. Suppose we see a
strange cat that also has some features of a dog; if we want a model that can accurately identify
whether it is a cat or a dog, such a model can be created using the SVM algorithm. We first train the
model with many images of cats and dogs so that it can learn their distinguishing features, and then
we test it on this strange creature. The support vector machine draws a decision boundary between the
two classes (cat and dog) and chooses the extreme cases (the support vectors); on the basis of the
support vectors, it classifies the creature as a cat.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two
classes with a single straight line, the data is termed linearly separable, and the classifier used is
called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be
classified with a straight line, the data is termed non-linear, and the classifier used is called a
Non-linear SVM classifier.
Hyperplane and Support Vectors in SVM Algorithm
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the data
points. This best boundary is known as the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features in the dataset: if there are 2
features, the hyperplane is a straight line, and if there are 3 features, the hyperplane is a
2-dimensional plane.
We always create the hyperplane with the maximum margin, that is, the maximum distance to the nearest
data points of each class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and affect its position are termed
support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a
dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a
classifier that can classify any pair (x1, x2) of coordinates as either green or blue.
Since this is a 2-D space, the two classes can be separated with just a straight line, but there can
be multiple lines that separate them.
Hence, the SVM algorithm finds the best line or decision boundary; this best boundary is called the
hyperplane. The SVM algorithm finds the points of each class closest to the boundary; these points are
called support vectors. The distance between the support vectors and the hyperplane is called the
margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is
called the optimal hyperplane.
Non Linear SVM:
If the data are linearly arranged, we can separate them with a straight line, but for non-linear data
we cannot draw a single straight line. To separate such data points, we need to add one more
dimension. For linear data we used two dimensions, x and y, so for non-linear data we add a third
dimension z, calculated as:
z = x² + y²
By adding the third dimension, the two classes become separable, and SVM can divide the dataset with a
plane. Since we are now in 3-D space, the decision boundary looks like a plane parallel to the x-y
plane; converting it back to 2-D space at z = 1, it becomes a circle. Hence we obtain a circular
boundary of radius 1 for this non-linear data.
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Preprocessing
5. Fitting the SVM algorithm to the training set
6. Predicting the test results
7. Visualizing the test set results
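A minimal sketch follows, contrasting a linear SVM with a kernel SVM on points arranged in two concentric circles, mirroring the z = x² + y² idea above; make_circles is used only as a stand-in dataset.

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric classes: not separable by a straight line in 2-D.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf").fit(X_train, y_train)   # the kernel adds the extra dimension implicitly

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:", rbf_svm.score(X_test, y_test))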
7. Logistic Regression
Logistic regression is one of the most popular Machine Learning algorithms, which comes under the
supervised learning technique. It is used for predicting the categorical dependent variable using a
given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but instead
of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
Logistic regression is quite similar to linear regression, except in how it is used: linear regression
is used for solving regression problems, whereas logistic regression is used for solving
classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which
predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something such as whether the cells
are cancerous or not, a mouse is obese or not based on its weight, etc.
Logistic Regression is a significant machine learning algorithm because it has the ability to provide
probabilities and classify new data using continuous and discrete datasets.
Logistic Regression can be used to classify the observations using different types of data and can
easily determine the most effective variables used for the classification.
Logistic regression uses the predictive-modeling machinery of regression, which is why it is called
regression, but it is used to classify samples and therefore falls under classification algorithms.
Logistic Function: Sigmoid
o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The value of the logistic regression must be between 0 and 1, which cannot go beyond this
limit, so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function
or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between 0 and 1:
probabilities above the threshold tend to be mapped to 1, and probabilities below the threshold to 0.
Assumption of Logistic Regression
o The dependent variable must be categorical in nature.
o The independent variable should not have multi-collinearity.
Logistic Regression Equation
The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:
o We know the equation of a straight line can be written as:
y = b0 + b1x1 + b2x2 + ... + bnxn
o In logistic regression, y can only be between 0 and 1, so we divide the above equation by (1 − y):
y / (1 − y), which is 0 for y = 0 and infinity for y = 1
o But we need a range from −infinity to +infinity, so we take the logarithm, and the equation becomes:
log[y / (1 − y)] = b0 + b1x1 + b2x2 + ... + bnxn
The above equation is the final equation for logistic regression.
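A minimal NumPy sketch of the sigmoid mapping and the threshold rule described above follows, with arbitrary coefficients chosen only for illustration.

import numpy as np

def sigmoid(z):
    # maps any real value (the log-odds) into a probability between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -4.0, 1.5                     # illustrative coefficients: log-odds = b0 + b1*x
x = np.array([1.0, 2.0, 3.0, 4.0])

prob = sigmoid(b0 + b1 * x)            # probabilities between 0 and 1
label = (prob >= 0.5).astype(int)      # threshold at 0.5
print("probabilities:", prob)
print("predicted classes:", label)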
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Preprocessing
5. Fitting the Logistic Regression to the training set
6. Predicting the test results
7. Visualizing the test set results
8. Naive Bayes
Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and
used for solving classification problems.
It is mainly used in text classification that includes a high-dimensional training dataset.
Naïve Bayes Classifier is one of the simple and most effective Classification algorithms which help in
building the fast machine learning models that can make quick predictions.
It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis, and
classifying articles.
Why it is called Naive Bayes?
The Naïve Bayes algorithm is composed of two words, Naïve and Bayes, which can be described as:
o Naïve: It is called naïve because it assumes that the occurrence of one feature is independent of
the occurrence of the others. For example, if a fruit is identified on the basis of color, shape, and
taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually
contributes to identifying it as an apple, without depending on the other features.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' theorem
o Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a
hypothesis given prior knowledge. It depends on conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = [P(B|A) · P(A)] / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis
is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naive Bayes classifier
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this
dataset, we need to decide whether we should play on a particular day according to the weather
conditions. To solve this problem, we follow the steps below (a small numerical sketch follows the list):
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
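The following is a minimal numerical sketch of these three steps on a toy weather table; the (weather, play) pairs below are invented only for illustration.

from collections import Counter

# Toy dataset of (weather, play) pairs, invented for illustration.
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "No"), ("Overcast", "Yes"), ("Sunny", "Yes"),
        ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rainy", "No")]

class_counts = Counter(play for _, play in data)   # 1. frequency table of the classes
joint_counts = Counter(data)                       # 1. frequency table of (weather, class) pairs

def posterior(weather, play):
    # 2. likelihood P(weather | play) and prior P(play); 3. Bayes' theorem (unnormalised)
    likelihood = joint_counts[(weather, play)] / class_counts[play]
    prior = class_counts[play] / len(data)
    return likelihood * prior

scores = {play: posterior("Sunny", play) for play in class_counts}
print(scores, "->", max(scores, key=scores.get))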
Execution of the Algorithm
1. Importing necessary libraries
2. Reading the dataset
3. Dropping unwanted columns
4. Preprocessing
5. Fitting the Naive Bayes to the training set
6. Predicting the test results
7. Visualizing the test set results
Advantages of Naive Bayes Algorithm
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naive Bayes Algorithm
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naive Bayes Algorithm
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
9. Principal Component Analysis (PCA)
Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the observations
of correlated features into a set of linearly uncorrelated features with the help of orthogonal
transformation. These new transformed features are called the Principal Components. It is one of the
popular tools that are used for exploratory data analysis and predictive modeling. It is a technique to
draw out strong patterns from a dataset by projecting it onto the directions of greatest variance.
PCA generally tries to find a lower-dimensional surface onto which to project the high-dimensional
data. PCA works by considering the variance of each attribute, because attributes with high variance
tend to show a good split between the classes; this is how it reduces the dimensionality. Some
real-world applications of
PCA are image processing, movie recommendation system, optimizing the power allocation in
various communication channels. It is a feature extraction technique, so it contains the important
variables and drops the least important variable.
The PCA algorithm is based on some mathematical concepts such as:
o Variance and Covariance
o Eigenvalues and Eigenvectors
Some common terms used in PCA algorithm:
o Dimensionality: It is the number of features or variables present in the given dataset. More
easily, it is the number of columns present in the dataset.
o Correlation: It signifies that how strongly two variables are related to each other. Such as if
one changes, the other variable also gets changed. The correlation value ranges from -1 to +1.
Here, -1 occurs if variables are inversely proportional to each other, and +1 indicates that
variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the
correlation between the pair of variables is zero.
o Eigenvectors: Given a square matrix A and a non-zero vector v, v is an eigenvector of A if Av is a
scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is
called the Covariance Matrix.
Principal Components in PCA
As described above, the transformed new features or the output of PCA are the Principal Components.
The number of these PCs are either equal to or less than the original features present in the dataset.
Some properties of these principal components are given below:
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases going from 1 to n: the first principal component has the
most importance and the nth the least.
Steps for PCA algorithm
1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is
the training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset into a structure. Such as we will represent the two-
dimensional matrix of independent variable X. Here each row corresponds to the data items,
and the column corresponds to the Features. The number of columns is the dimensions of the
dataset.
3. Standardizing the data
In this step, we will standardize our dataset. Such as in a particular column, the features with
high variance are more important compared to the features with lower variance.
If the importance of features is independent of the variance of the feature, then we will divide
each data item in a column with the standard deviation of the column. Here we will name the
matrix as Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After
transpose, we will multiply it by Z. The output matrix will be the Covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors of the resulting covariance matrix. The
eigenvectors of the covariance matrix are the directions of the axes carrying the most information,
and the corresponding eigenvalues measure how much variance lies along each direction.
6. Sorting the Eigenvectors
In this step, we take all the eigenvalues and sort them in decreasing order, from largest to smallest,
and simultaneously sort the eigenvectors accordingly into a matrix P. The resulting sorted matrix is
named P*.
7. Calculating the new features Or Principal Components
Here we calculate the new features by multiplying Z by the P* matrix. In the resulting matrix Z*, each
observation is a linear combination of the original features, and the columns of Z* are independent of
each other.
8. Removing less important features from the new dataset
Now that the new feature set has been obtained, we decide what to keep and what to remove: only the
relevant or important features are kept in the new dataset, and the unimportant features are removed
(a NumPy sketch of these steps follows the list).
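A minimal NumPy sketch of steps 3-8, using random data as a stand-in; the covariance/eigen-decomposition route shown here is one common way to implement these steps.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # stand-in dataset: 100 rows, 5 features

Z = (X - X.mean(axis=0)) / X.std(axis=0)         # 3. standardize the data
cov = np.cov(Z, rowvar=False)                    # 4. covariance matrix of Z
eigvals, eigvecs = np.linalg.eigh(cov)           # 5. eigenvalues and eigenvectors

order = np.argsort(eigvals)[::-1]                # 6. sort eigenvectors by decreasing eigenvalue
P_star = eigvecs[:, order]

Z_star = Z @ P_star                              # 7. new features (principal components)
Z_reduced = Z_star[:, :2]                        # 8. keep only the two most important components
print(Z_reduced.shape)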
Applications of Principal Component Analysis
o PCA is mainly used as the dimensionality reduction technique in various AI applications
such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns if data has high dimensions. Some fields where
PCA is used are Finance, data mining, Psychology, etc.
10. Linear Regression
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm models a linear relationship between a dependent variable (y) and one
or more independent variables (x), hence the name linear regression. Since it models a linear
relationship, it describes how the value of the dependent variable changes with the value of the
independent variable. The linear regression model provides a sloped straight line representing the
relationship between the variables.
Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε
Here,
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical dependent variable,
then such a Linear Regression algorithm is called Simple Linear Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a numerical dependent
variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line
A linear line showing the relationship between the dependent and independent variables is called
a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis and independent variable increases on X-
axis, then such a relationship is termed as a Positive linear relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis and independent variable increases on the
X-axis, then such a relationship is called a negative linear relationship.
Finding the best fit line:
When working with linear regression, our main goal is to find the best-fit line, which means the error
between the predicted values and the actual values should be minimized; the best-fit line has the
least error. Different values of the weights or line coefficients (a0, a1) give different regression
lines, so we need to calculate the best values of a0 and a1; to do this, we use a cost function.
Cost function-
o Different values of the weights or line coefficients (a0, a1) give different regression lines, and
the cost function is used to estimate the coefficient values for the best-fit line.
o Cost function optimizes the regression coefficients or weights. It measures how a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the
input variable to the output variable. This mapping function is also known as Hypothesis
function.
For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the
squared errors between the predicted values and the actual values. For the linear equation above, the
MSE can be calculated as:
MSE = (1/N) Σ (yi − (a1xi + a0))²
Where,
N = total number of observations
yi = actual value
(a1xi + a0) = predicted value
Residuals: The distance between an actual value and the corresponding predicted value is called a
residual. If the observed points are far from the regression line, the residuals will be high and so
will the cost function. If the scatter points are close to the regression line, the residuals will be
small and hence so will the cost function.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost function.
o A regression model uses gradient descent to update the coefficients of the line by reducing the
cost function.
o This is done by randomly selecting initial coefficient values and then iteratively updating them to
reach the minimum of the cost function, as sketched below.
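A minimal NumPy sketch of gradient descent minimizing the MSE for the line y = a0 + a1x, on synthetic data generated only for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=100)   # synthetic data: y = 3 + 2x + noise

a0, a1, lr = 0.0, 0.0, 0.01                            # start from zero coefficients
for _ in range(5000):
    y_pred = a0 + a1 * x
    error = y_pred - y
    # gradients of MSE = (1/N) * sum((y_pred - y)^2) with respect to a0 and a1
    grad_a0 = 2.0 * error.mean()
    grad_a1 = 2.0 * (error * x).mean()
    a0 -= lr * grad_a0                                 # step downhill on the cost surface
    a1 -= lr * grad_a1

print("intercept a0 ~", round(a0, 2), " slope a1 ~", round(a1, 2))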
Model Performance:
The Goodness of fit determines how the line of regression fits the set of observations. The process of
finding the best model out of various models is called optimization. It can be achieved by below
method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent variables
on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted and actual values and
hence represents a good model.
o It is also called a coefficient of determination, or coefficient of multiple determinations for
multiple regressions.
o It can be calculated as: R-squared = Explained variation / Total variation
Assumptions of Linear Regression
Below are some important assumptions of Linear Regression. These are some formal checks while
building a Linear Regression model, which ensures to get the best possible result from the given
dataset.
o Linear relationship between the features and target:
Linear regression assumes the linear relationship between the dependent and independent
variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to multicollinearity,
it may be difficult to find the true relationship between the predictors and the target variable, or
to determine which predictor variable is affecting the target and which is not. So the model assumes
either little or no multicollinearity between the features or independent variables.
o Homoscedasticity Assumption:
Homoscedasticity is a situation when the error term is the same for all the values of
independent variables. With homoscedasticity, there should be no clear pattern distribution of
data in the scatter plot.
o Normal distribution of error terms:
Linear regression assumes that the error term should follow the normal distribution pattern. If
error terms are not normally distributed, then confidence intervals will become either too wide
or too narrow, which may cause difficulties in finding coefficients.
This can be checked using a q-q plot: if the plot shows a straight line without deviation, the errors
are normally distributed.
o No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is any correlation
in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually
occurs when there is a dependency between residual errors.