
An Introduction to Data Analytics

CAMI16 : Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications
Course Objective
• To understand the data analytics approaches
• To become familiar with techniques for data analytics
• To apply Statistical modelling techniques for decision
making problems
• To use simple Machine Learning Techniques to
enhance data analytics



Course Outcomes
After the completion of the course, students will be able
to
• Use Statistical principles to infer knowledge from the
data
• Apply various data analytics techniques for informed
decision making
• Adopt basic Machine Learning Technique to analyze the
data



Syllabus
• Introduction: Data analytics - data collection, integration, management, modelling, analysis, visualization, prediction and informed decision making. General Linear Regression Model, estimation for β, error estimation, residual analysis.

• Tests of significance - ANOVA, 't' test; Forward, Backward, Sequential, Stepwise, and all-possible-subsets selection; Dummy Regression, Logistic Regression, and Multicollinearity.

• Discriminant Analysis - Two-group problem, Variable contribution, Violation of assumptions, Discrete and Logistic Discrimination, The k-group problem, multiple groups, Interpretation of Multiple-group Discriminant Analysis solutions.



Syllabus (contd..)
• Principal Component Analysis-Extracting Principal
Components, Graphing of Principal Components, Some
sampling Distribution results, Component scores, Large
sample Inferences, Monitoring Quality with principal
Components.

• Factor Analysis - Orthogonal Factor Model, Communalities, Factor Solutions and rotation. Machine learning: supervised learning (rules, trees, forests, nearest neighbour, regression) - optimization (gradient descent and variants) - unsupervised learning.



You might have learned many different methodologies but
choosing the right methodology is important.



What is wrong with this?
The real threat is a lack of fundamental understanding of:
• Why use a technique?
• How to use it correctly?



Data
• Data are recorded measurements
• Measurement is a standard process used to assign numbers to particular attributes or characteristics of a variable
• Major forms of data:
• Numerical or Quantitative
• Categorical or Qualitative



Why is data important for organizations?

• Data can help the organizations in


• Making better decisions
• Evaluating performance
• Understanding consumers' needs
• Understanding market behaviour/trends



Data Analytics
• A systematic computational approach to transforming data into insights for better decision making
• It is used for the discovery, interpretation, and
communication of meaningful patterns in data.
• Applications
• Marketing optimization
• Credit risk analysis
• Development of new medicines
• Fraud prevention
• Cyber physical systems
• …



Data Analytic Process

Define → Measure → Analyse → Improve → Control (DMAIC)

• Define: ask the right question, define the target
• Measure: collect valid data, improve data quality
• Analyse: analyse the data, develop solutions
• Improve: implement solutions, optimize efficiency
• Control: assess solutions, create a framework



Types of analytics
(Chart: the four types of analytics, ordered by complexity and by value added to the company)

• Descriptive: What is happening?
• Diagnostic: Why is it happening?
• Predictive: What is likely to happen?
• Prescriptive: What should I do?



Descriptive analytics
• It is the conventional form of Business Intelligence and
data analysis
• Provides a summary view of facts and figures in an understandable format
• Converts and presents raw data in an understandable format
• Examples
• Reports
• Dashboards
• Data queries
• Data visualization



Diagnostic Analytics
• Dissects the data to answer the question "Why did it happen?"
• Provides the root cause of why something happened
• Anomaly detection
• Identifies hidden relations in data



Predictive analytics
• Forecasts the trends using historical data and current
events
• Predicts the probability of an event happening in future
• Predicts when an event is likely to happen
• In general, various co-dependent variables are studied and analyzed to forecast a trend





Prescriptive analytics
• Set of techniques to indicate the best course of action
• It tells what decision to make to optimize the outcome
• The goal of prescriptive analytics is to enable
• Quality improvements
• Service enhancements
• Cost reductions
• Increasing productivity



Why is data analytics important?



Data analytics is everywhere





Data Analytics in Real World!



Business



Watson playing Jeopardy!



eHarmony



Applications
• Netflix – Movie Recommendation
• Facebook – Analysis of Diversity of people and their
habits, Friends suggestion
• Walmart – Product recommendation
• Sports – To study opponents' playing behaviour
• Pharmaceutical companies – To study combinations of medicines for clinical trials



Application Areas
• Business analytics
• Business logistics, including supply chain optimization
• Finance
• Health, wellness, & biomedicine
• Bioinformatics
• Natural sciences
• Information economy / Social media and social network
analysis
• Smart cities
• Education and electronic teaching
• Energy, sustainability and climate



Thank you!



Introduction-II
CAMI16 : Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications
Buzzwords



Buzzwords (cont…)
• Data analysis is the detailed study or examination of data in order
to understand more about it
• Answers the question, “What happened?”
• Data analytics is the systematic computational analysis of data
• Uses advanced machine learning and statistical tools to predict what
is most likely to happen.
• Data analyst is not directly involved in decision making
• Big data analytics is the process of examining large data sets
containing a variety of data types
• Discovers some knowledge from big data
• Identifies interesting patterns
• Data science is an umbrella term
• Incorporates all the underlying data operations, statistical models as
well as mathematical analysis
• Data scientist is directly involved in decision making



Data Analyst Skills

• Statistics
• Data Cleaning & Data Manipulation
• Data Visualization
• Machine Learning



Statistics

• Statistics is a branch of mathematics dealing with data collection and organization, analysis, and interpretation
• Used to find trends and changes
• Analysts read the data through statistical measures to arrive at a conclusion
https://www.lynda.com/Excel-tutorials/Excel-Statistics-Essential-Training-1/5026557-2.html



Data Cleaning and Data Manipulation
• Data Cleaning is the process of detecting and correcting corrupt or inaccurate records in a database
• Data Manipulation is the process of changing the data
to make it more organized and easy to read.

https://www.springboard.com/blog/data-cleaning/



Data Visualization

• Representation of data in the form of charts, diagrams, etc.


• Drill-down refers to the process of viewing data at a level of
increased detail, while roll-up refers to the process of viewing data
with decreasing detail.

https://www.tehrantimes.com/news/438777/Iran-develops-first-integrated-health-data-visualization-system



Machine Learning

Traditional programming:  Input (Data) + Program → Output (Data)

Machine learning:  Input (Data) + Output (Data) → Program



Data
CAMI16 : Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications
An Illustration
• Assume that a medical researcher sent you an email related to a project you wanted to work on:

Hi,
I have attached the data file that I mentioned in my previous email. Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields. I don't have time to provide any more information about the data since I'm going out of town for a couple of days, but hopefully that won't slow you down too much. And if you don't mind, could we meet when I get back to discuss your preliminary results? I might invite a few other members of my team.
Thanks and see you in a couple of days.



012 232 33.5 0 10.7
020 121 16.9 2 210.1
027 165 24.0 0 427.6
. . . . .
. . . . .
. . . . .

Total 1000 records/ data points/ samples



Conversation between Data Analyst and
Statistician
• So, you got the data for all the patients?
• Yes. I haven’t had much time for analysis, but I do have a few
interesting results.
• Amazing. There were so many data issues with this set of patients
that I couldn’t do much.
• Oh? I didn’t hear about any possible problems.
• Well, first there is field 5, the variable we want to predict. It's common knowledge among people who analyse this type of data that results are better if you work with the log of the values. Was it mentioned to you?
• Interesting. Were there any other problems?
• Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed that.



Conversation between Data Analyst and
Statistician
• Yes, but these fields were only weak predictors of field 5.
• Anyway, given all those problems, I'm surprised you were able to accomplish anything.
• True, but my results are really quite good. Field 1 is a very strong predictor of field 5. I'm surprised that this wasn't noticed before.
• What? Field 1 is just an identification number.
• Nonetheless, my results speak for themselves.
• Oh, no! I just remembered. We assigned ID numbers after we sorted the records based on field 5. There is a strong connection, but it's meaningless. Sorry.

Moral: Know your data


*An extreme situation



Data
• Data set is a collection of data objects
• record, data point, vector, pattern, event, case, sample,
observation, entity
• Data objects are described by a number of attributes
that capture the basic characteristics of an object
• variable, characteristic, field, feature, dimension
• In general, there are many types of data that can be
used to measure the properties of an entity.
• Numerical or Quantitative (Discrete/Continuous)
• Categorical or Qualitative (Discrete)



General Characteristics of Datasets
• Dimensionality
• Number of attributes
• Curse of dimensionality
• Difficulties associated with analysing high dimensional data
• Dimensionality reduction
• Sparsity
• Very low number of non-zero attribute values
• Benefit: low computational time and storage
• Resolution
• Too fine, a pattern may not be visible
• Too coarse, a pattern may disappear
• E.g. variations in atmospheric pressure on a scale of hours vs. months (storms may be detected or missed)



Attribute
• Property of a data object that varies from one object to
another
• Properties of numbers describe attributes

#   Property          Operations    Type
1   Distinctiveness   =, ≠          Categorical (Qualitative): Nominal
2   Order             <, ≤, >, ≥    Categorical (Qualitative): Ordinal
3   Addition          +, −          Numerical (Quantitative): Interval
4   Multiplication    ×, ÷          Numerical (Quantitative): Ratio



Nominal Scale
• A variable that takes a value among a set of mutually
exclusive codes that have no logical order is known as
a nominal variable.

• Gender { M, F} or { 1, 0 } Used letters or numbers

• Blood groups {A , B , AB , O } Used string

• Rhesus (Rh) factors {+ , - } Used symbols



Nominal Scale
• The nominal scale is used to label data categorization
using a consistent naming convention
• The labels can be numbers, letters, strings, enumerated
constants or other keyboard symbols
• Nominal data thus form "categories" of a set of data
• The number of categories should be two (binary) or more (ternary, etc.), but countably finite



Nominal Scale
• Nominal data may be numerical in form, but the numerical values have no mathematical interpretation.
• For example, 10 prisoners may be numbered 100, 101, …, 110, but 100 + 110 = 210 is meaningless; the numbers are simply labels.

• Two labels may be identical ( = ) or dissimilar ( ≠ ).

• These labels do not have any ordering among themselves.


• For example, we cannot say blood group B is better or worse
than group A.

• Labels (from two different attributes) can be combined to


give another nominal variable.
• For example, blood group with Rh factor ( A+ , A- , AB+, etc.)



Binary Scale
• A nominal variable with exactly two mutually exclusive
categories that have no logical order is known as binary
variable
Switch: {ON, OFF}
Attendance: {True, False}
Entry: {Yes, No}
etc.
• A Binary variable is a special case of a nominal variable
that takes only two possible values.



Symmetric and Asymmetric Binary
Scale
• Different binary variables may have unequal importance

• If two choices of a binary variable have equal


importance, then it is called symmetric binary variable.
• Example: Gender = {male, female}

• If the two choices of a binary variable have unequal


importance, it is called asymmetric binary variable.
• Example: Student Course Opted= {Y, N}



Operation on Nominal Variables
• Summary statistics applicable to nominal data are
mode, contingency correlation, etc.
• Arithmetic (+, -, *, /) and logical operations (<, >, ≠ etc.)
are not permitted
• The allowed operations are: accessing (read, check,
etc.) and re-coding (into another non-overlapping
symbol set, that is, one-to-one mapping) etc.
• Nominal data can be visualized using line charts, bar
charts or pie charts etc.
• Two or more nominal variables can be combined to generate another nominal variable.
• Example: Gender (M,F) x Marital status (S, M, D, W)



Ordinal Scale
• Ordered nominal data are known as ordinal data and
the variable that generates it is called ordinal variable.
• Example: Shirt size = { S, M, L, XL, XXL}

• The values assumed by an ordinal variable can be


ordered among themselves as each pair of values can
be compared literally or using relational operators ( < , ≤
, > , ≥ ).



Operation on Ordinal Data
• Usually relational operators can be used on ordinal data.
• Summary measures mode and median can be used on
ordinal data.
• Ordinal data can be ranked (numerically, alphabetically, etc.)
Hence, we can find any of the percentiles measures of
ordinal data.
• Calculations based on order are permitted (such as count,
min, max, etc.).
• Spearman’s R can be used as a measure of the strength of
association between two sets of ordinal data.
• Numerical variable can be transformed into ordinal variable
and vice-versa, but with a loss of information.
• For example, Age [1, … 100] = [young, middle-aged, old]



Interval Scale
• Interval-scale variables are continuous measurements
of a roughly linear scale.
• Example: weight, height, latitude, longitude, weather,
temperature, calendar dates, etc.
• Interval data have well-defined intervals between values.
• Interval data are measured on a numeric scale (with positive, zero, and negative values).
• Interval data have a zero point (origin). However, the origin does not imply a true absence of the measured characteristic.
• For example, temperature in Celsius and Fahrenheit; 0° does not mean an absence of temperature, that is, no heat!



Operation on Interval Data
• We can perform addition on interval data.
• For example: date1 + x-days = date2
• Subtraction can also be performed.
• For example: current date – date of birth = age
• Negation (changing the sign) and multiplication by a
constant are permitted.
• All operations on ordinal data defined are also valid
here.
• Linear (e.g. cx + d ) or Affine transformations are
permissible.
• Other one-to-one non-linear transformation (e.g., log,
exp, sin, etc.) can also be applied.



Operation on Interval Data
• Interval data can be transformed to nominal or ordinal
scale, but with loss of information.

• Interval data can be graphed using histogram,


frequency polygon, etc.



Ratio Scale
• Interval data with a clear definition of “zero” are called
ratio data.
• Example: Temperature in Kelvin scale, Intensity of earth-quake
on Richter scale, Sound intensity in Decibel, cost of an article,
population of a country, etc.
• All ratio data are interval data but the reverse is not
true.
• In ratio scale, both differences between data values and
ratios (of non-zero) data pairs are meaningful.
• Ratio data may be in linear or non-linear scale.
• Both interval and ratio data can be stored in same data
type (i.e., integer, float, double, etc.)



Operation on Ratio Data
• All arithmetic operations on interval data are applicable
to ratio data.
• In addition, multiplication, division, etc. are allowed.
• Any linear transformation of the form (ax + b)/c is permitted.



Type of Datasets
• Record based
• Transactional Data (shopping)
• Data Matrix (relational data)
• Sparse Data Matrix (course selection)
• Graph based
• Linked web pages
• Ordered
• Sequence Data (genetic encoding)
• Time Series Data (temperature)



Thank You!



Data Exploring
CAMI16 : Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications
Data Exploration
• Preliminary investigation of the data in order to better
understand its specific characteristics
• Helps in selecting the appropriate pre-processing and
data analysis techniques
• Approaches
• Statistics
• Visualization



Statistics
• “Statistics is concerned with scientific method for
collecting, organizing, summarising, presenting and
analysing data as well as drawing valid conclusions and
making reasonable decisions on the basis of such
analysis.”

• Helps in
• The planning of operations
• The setting up of standards



Misuse of Statistics
• Data Source is not given
• Defective Data
• Unrepresentative Sample
• Inadequate Sample
• Unfair Comparisons



Descriptive Statistics
• Quantities such as mean and standard deviation

• Captures different characteristics of a large set of


values
• E.g. Average household income, fraction of college dropout
students in last 10 years

• E.g. Study the height of students in a class involves


• Recording the heights of all the students
• Max., Min., Median, Mean, Mode



Measures of Central Tendency
• Mean

  $\mathrm{mean}(x) = \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

• Median (data needs to be sorted)

  $\mathrm{median}(x) = \begin{cases} x_{(i+1)}, & \text{if } n \text{ is odd, i.e., } n = 2i+1 \\ \frac{1}{2}\left(x_{(i)} + x_{(i+1)}\right), & \text{if } n \text{ is even, i.e., } n = 2i \end{cases}$

• Mode
  • Selects the most common value



Measures of Central Tendency
Data (x): 3, 4, 3, 1, 2, 3, 9, 5, 6, 7, 4, 8;  n = 12

Mean:  $\bar{x} = \frac{3 + 4 + \cdots + 8}{12} = 4.583$

Median:  sorted data: 1, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9;  $\mathrm{median}(x) = \frac{1}{2}(4 + 4) = 4$

Mode:  $\mathrm{mode}(x) = 3$  (item frequencies: 1→1, 2→1, 3→3, 4→2, 5→1, 6→1, 7→1, 8→1, 9→1)
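The values above can be checked with a minimal Python sketch (not part of the slides), using only the standard library:

from statistics import mean, median, mode

x = [3, 4, 3, 1, 2, 3, 9, 5, 6, 7, 4, 8]

print(mean(x))    # 4.5833... -> the 4.583 above
print(median(x))  # 4.0, the average of the two middle values (4 and 4)
print(mode(x))    # 3, the most frequent value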



Measures of Central Tendency
• If outliers are not important, use the median
  • Outliers have a low impact on the median
• If outliers are important, use the mean
  • E.g. average income

Person             P1  P2  P3  P4  P5  P6  P7
Income (Million)    1   1   1   2   2   3  11

Mean = 3:  every person could make 3M
Median = 2:  the poorer half of the population makes 2M or less



Measures of Central Tendency
• Example: lose Rs. 1 every day on 99% of the days, but on 1% of the days gain Rs. 1M
  • -1, -1, -1, …, -1, 1000000, -1, -1, …, -1, -1, 1000000, -1, -1

• Median = -1
• Mean = ((-1) + (-1) + … + (-1) + 1000000)/100 = some positive number



Measures of Central Tendency
• Garbage can placement on streets
• 40% people voted for garbage can at every 25th meter
• 45% people voted for garbage can at every 75th meter
• 15% people voted between 1 and 100 meter (except 25 and
75)

Mode = 75 (most popular preference)



Measures of Dispersion/ Spread
• How does the data deviate from the central value (or any other value)?
• Range
  • How spread apart the values in the data set are
  • Computed as (max − min)
• Inter Quartile Range
  • Measure of variability based on dividing the dataset into quartiles
  • High value – high dispersion
  • Low value – low dispersion
• Sample Standard Deviation
  • Deviation of each data point from the mean

  $SD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$



Interquartile Range - Calculation
Step 1: Order the data from least to greatest
Step 2: Identify the extremes
Step 3: Find the median of the dataset
Step 4: Find Q3 i.e. median of the Upper half of the data
Step 5: Find Q1 i.e. median of the Lower half of the data
Step 6: Find IQR = Q3 – Q1
Ex. 1: 19,25,16,20,34,7,31,12,22,29,16
Ex. 2: 65,65,70,75,80,82,85,90,95,100



Measures of Dispersion/ Spread
Data (x): 3, 4, 3, 1, 2, 3, 9, 5, 6, 7, 4, 8;  n = 12

Range:
max − min = 9 − 1 = 8
High dispersion, as min and max deviate strongly from the mean (4.583)

Inter Quartile Range:
3rd quartile − 1st quartile = 75th percentile − 25th percentile = 6.5 − 3 = 3.5

Sample Standard Deviation:
$\sqrt{\frac{1}{11}\left[(3 - 4.583)^2 + (4 - 4.583)^2 + \cdots + (8 - 4.583)^2\right]} \approx 2.47$
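A minimal Python sketch (not part of the slides) reproduces these measures; numpy is an assumed dependency, and numpy's default quartile interpolation differs slightly from the median-of-halves method used above:

import numpy as np

x = np.array([3, 4, 3, 1, 2, 3, 9, 5, 6, 7, 4, 8])

data_range = x.max() - x.min()          # 9 - 1 = 8
q1, q3 = np.percentile(x, [25, 75])     # quartiles (linear interpolation)
iqr = q3 - q1                           # 3.25 here; the median-of-halves method gives 3.5
sd = x.std(ddof=1)                      # sample SD, divides by n - 1

print(data_range, iqr, round(sd, 2))    # 8  3.25  2.47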



Inferential statistics
• Generalizes from a large dataset and applies probability to draw a conclusion
• Used to infer data parameters based on the statistical
model using sample data
• Expand the model to get results for the entire
population.
• E.g. Hypothesis Testing



Descriptive Statistics Vs. Inferential
Statistics

https://www.selecthub.com/business-intelligence/statistical-software/



Thank You!



Linear Regression
CAMI16 : Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications
Regression
• Engineering and Science applications explore the relationship
among variables

• Regression analysis is a statistical model that is very useful for


such problems

• Regression: the process of going back to an earlier state



Model
• Mathematical representation of a phenomenon i.e. the
representation of a relationship

𝐷𝑜𝑠𝑎𝑔𝑒 𝑜𝑓 𝑚𝑒𝑑𝑖𝑐𝑖𝑛𝑒 = 𝑓(𝑎𝑔𝑒, 𝑏𝑙𝑜𝑜𝑑 𝑝𝑟𝑒𝑠𝑠𝑢𝑟𝑒, 𝑜𝑥𝑦𝑔𝑒𝑛 𝑙𝑒𝑣𝑒𝑙)

𝑦 = 𝑓 𝑥1 , 𝑥2 , 𝑥3

𝑦 = 3𝑥1 + 7𝑥2 + 2𝑥3



Model Components

Model
• Variables: input variables, output variables
• Parameters: linear, non-linear



• Good model incorporates all salient features of phenomenon

• Bad model does not incorporate all salient features of


phenomenon

• How can you obtain good model?



• Collect a sample of data

• Sample – Fraction of population (data points)

• Sample should be representative in nature i.e. all salient


features of population should be present in sampled data



Model Parameters

y = mx + c

Variables: (x, y).  Parameters: (m, c), where the slope m = tan θ and c is the intercept.

S1. Knowledge of (x, y) completely describes the model. ✗
S2. Knowledge of (m, c) completely describes the model. ✓



Modeling is finding the parameters of a model which are UNKNOWN

Regression Analysis



Regression Analysis
𝑦 = 𝑚𝑥 + 𝑐

In general, variables are represented using Latin letters x, y, z, etc. and parameters are represented using Greek letters α, β, γ, etc.

With above mentioned convention 𝑦 = 𝑚𝑥 + 𝑐 becomes


𝑦 = 𝛽0 + 𝛽1 𝑥

In this case, model is known if 𝛽0 and 𝛽1 are known.



Regression Analysis
General model for k input variables

𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + ⋯ + 𝛽𝑘 𝑥𝑘

where 𝛽0 , 𝛽1 , ⋯ , 𝛽𝑘 are model parameters

More general form


𝑦 = 𝛽0 𝑥0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + ⋯ + 𝛽𝑘 𝑥𝑘
𝑥0 = 1



Linear Regression
• A model is said to be linear when it is linear in its parameters.

• Identify the linear model(s):

  ✓ y = β₀ + β₁x:    ∂y/∂β₀ = 1,  ∂y/∂β₁ = x
  ✓ y = β₀ + β₁x²:   ∂y/∂β₀ = 1,  ∂y/∂β₁ = x²
  ✗ y = β₀ + β₁²x:   ∂y/∂β₀ = 1,  ∂y/∂β₁ = 2β₁x

If ∂y/∂(parameter) is independent of the parameters, then the model is linear.



Non-linear model to linear model
y = β₀ x^β₁

Taking logs:  log y = log β₀ + β₁ log x

i.e.  y* = β₀* + β₁x*,  with y* = log y, β₀* = log β₀, x* = log x

The updated model is linear in the parameters β₀* and β₁ for the input variable x*.



Simple Linear Regression
• Consider one variable
𝑦 = 𝛽0 + 𝛽1 𝑥

• y – output variable/ study variable/ response variable/ dependent variable


• x – input variable/ explanatory variable/ regressor/ independent variable

• Objective: Find the values of parameters



Modeling

yᵢ = β₀ + β₁xᵢ;  i = 1, 2, ⋯, n

This model does not represent the true phenomenon exactly: the observed points (x⁽¹⁾, y⁽¹⁾), …, (x⁽ⁿ⁾, y⁽ⁿ⁾) deviate from the line y = β₀ + β₁x by random amounts ε₁, ε₂, …, εₙ, so

yᵢ = β₀ + β₁xᵢ + εᵢ;  i = 1, 2, ⋯, n

where εᵢ is the random error.



Least Square Estimation

yᵢ = β₀ + β₁xᵢ + εᵢ

How to compute the total error?

a) $\sum_{i=1}^{n} \varepsilon_i$ : unsuitable, since positive and negative errors cancel out
b) $\sum_{i=1}^{n} \varepsilon_i^2$ : least square estimation
c) $\sum_{i=1}^{n} |\varepsilon_i|$ : least absolute error estimation



Least Square Estimation (cont...)
$SSE = \sum_{i=1}^{n} \varepsilon_i^2$, where $\varepsilon_i = y_i - \beta_0 - \beta_1 x_i$ (from $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$), so

$SSE = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

$\frac{\partial\, SSE}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)$

$\frac{\partial\, SSE}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i$



Least Square Estimation (cont...)

(Plots: SSE as a function of β₀ and as a function of β₁; both are convex curves with minima at β₀* and β₁*.)



Least Square Estimation (cont...)

Setting the first derivative to zero:

$-2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$

$\sum_{i=1}^{n} y_i - n\beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0$

Dividing by n, with $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ and $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$:

$\bar{y} - \beta_0 - \beta_1 \bar{x} = 0 \quad\Rightarrow\quad \beta_0 = \bar{y} - \beta_1 \bar{x}$



Least Square Estimation (cont...)

Setting the second derivative to zero:

$-2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0$

$\sum_{i=1}^{n} x_i y_i - \beta_0 \sum_{i=1}^{n} x_i - \beta_1 \sum_{i=1}^{n} x_i^2 = 0$

Substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$:

$\sum_{i=1}^{n} x_i y_i - (\bar{y} - \beta_1 \bar{x}) \sum_{i=1}^{n} x_i - \beta_1 \sum_{i=1}^{n} x_i^2 = 0$



Least Square Estimation (cont...)
Expanding, and using $\sum_{i=1}^{n} x_i = n\bar{x}$ and $\sum_{i=1}^{n} y_i = n\bar{y}$:

$\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y} = \beta_1 \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)$



Least Square Estimation (cont...)
Hence

$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$,  and  $\beta_0 = \bar{y} - \beta_1 \bar{x}$.
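As a sketch (not part of the slides), the closed-form estimates can be computed directly in Python; the data are taken from practice question 6 later in this material:

import numpy as np

x = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 12], dtype=float)
y = np.array([7, 9, 10, 13, 15, 18, 19, 24, 25, 29], dtype=float)

n = len(x)
beta1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)   # approx. 1.71 and 2.30: fitted line y = 1.71 + 2.30x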



Thank you!



Linear Regression
(Gradient Descent)
CAMI16 : Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications

The slides use the content from Machine Learning course on Coursera.
https://www.coursera.org/learn/machine-learning/home/
Housing Prices (Trichy, TN)

(Scatter plot: Price (₹, in 100,000s), 0 to 500, against Size (feet²), 0 to 3000.)

Supervised Learning: given the "right answer" for each example in the data.
Regression Problem: predict a real-valued output.



Training set of housing prices (Trichy, TN):

Size in feet² (x)   Price (₹) in 100,000's (y)
2104                460
1416                232
1534                315
852                 178
…                   …

Notation:
m = number of training examples
x's = "input" variable / features
y's = "output" variable / "target" variable



Training Set → Learning Algorithm → h
Size of house → h → Estimated price

How do we represent h?

Linear regression with one variable.
Univariate linear regression.



Training set (as above): Size in feet² (x), Price (₹) in 100,000's (y).

Hypothesis:  hβ(x) = β₀ + β₁x

β₀, β₁: parameters

How to choose the β's?
(Plots: three example lines for different choices of the parameters β₀ and β₁.)



Idea: choose β₀, β₁ so that hβ(x) is close to y for our training examples (x, y).



Simplified (setting β₀ = 0)

Hypothesis:  hβ(x) = β₁x

Parameters:  β₁

Cost Function:  $J(\beta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\beta(x^{(i)}) - y^{(i)}\right)^2$

Goal:  $\min_{\beta_1} J(\beta_1)$



(Plots: left, for fixed β₁, hβ(x) as a function of x; right, J(β₁) as a function of the parameter β₁.)









Hypothesis:  hβ(x) = β₀ + β₁x

Parameters:  β₀, β₁

Cost Function:  $J(\beta_0, \beta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\beta(x^{(i)}) - y^{(i)}\right)^2$

Goal:  $\min_{\beta_0, \beta_1} J(\beta_0, \beta_1)$



(Plots: left, hβ(x) over the housing data, Price (₹, in 100,000s) vs. Size in feet² (x); right, J(β₀, β₁) as a function of the two parameters.)





Gradient Descent



Have some function J(β₀, β₁)

Want $\min_{\beta_0, \beta_1} J(\beta_0, \beta_1)$

Outline:
• Start with some β₀, β₁
• Keep changing β₀, β₁ to reduce J(β₀, β₁) until we hopefully end up at a minimum



J(0,1)

1
0



J(0,1)

1
0

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli 22


Gradient descent algorithm:

repeat until convergence {
  βⱼ := βⱼ − α ∂J(β₀, β₁)/∂βⱼ    (for j = 0 and j = 1)
}

Correct (simultaneous update): compute the updates for β₀ and β₁ from the current values, then assign both together. Incorrect: updating β₀ first and then using the new β₀ when updating β₁.



If α is too small, gradient descent
can be slow.

If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.



At a local optimum, the slope ∂J/∂βⱼ is zero at the current value of βⱼ, so the update βⱼ := βⱼ − α · 0 leaves βⱼ unchanged.



Gradient descent can converge to a local
minimum, even with the learning rate α fixed.

As we approach a local
minimum, gradient
descent will automatically
take smaller steps. So, no
need to decrease α over
time.



Gradient descent algorithm applied to the linear regression model



Gradient descent algorithm (for linear regression):

repeat until convergence {
  β₀ := β₀ − α (1/m) Σᵢ₌₁ᵐ (hβ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
  β₁ := β₁ − α (1/m) Σᵢ₌₁ᵐ (hβ(x⁽ⁱ⁾) − y⁽ⁱ⁾) x⁽ⁱ⁾
}

update β₀ and β₁ simultaneously



J(0,1)

1
0



(Sequence of plots: at each gradient descent step, the fitted line hβ(x) over the data (left) moves with the parameters, which descend toward the minimum of J(β₀, β₁) on the contour plot (right).)


“Batch” Gradient Descent

"Batch": each step of gradient descent uses all the training examples.
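A minimal Python sketch (not part of the slides) of batch gradient descent for the univariate model; the data are the four training examples above, and feature scaling is an added assumption to speed up convergence:

import numpy as np

x = np.array([2104, 1416, 1534, 852], dtype=float)  # size in feet^2
y = np.array([460, 232, 315, 178], dtype=float)     # price in 100,000s

x = (x - x.mean()) / x.std()        # scale the feature for stable convergence
m = len(x)
b0, b1, alpha = 0.0, 0.0, 0.1

for _ in range(1000):
    err = b0 + b1 * x - y           # h(x_i) - y_i for ALL m examples ("batch")
    g0 = err.mean()                 # (1/m) * sum of the errors
    g1 = (err * x).mean()           # (1/m) * sum of the errors times x
    b0, b1 = b0 - alpha * g0, b1 - alpha * g1   # simultaneous update

print(b0, b1)                       # parameters of the fitted line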



Thank You!



Regression Analysis: Goodness of Fit
Thursday, 10 September, 2020 02:00 PM

Regression Assumptions
Wednesday, 16 September, 2020 03:00 PM

Linear Regression (Output Explanation)
Thursday, 17 September, 2020 02:18 PM



Test of Significance
Tuesday, 15 September, 2020 02:35 PM



ANOVA (Analysis Of Variance)
Tuesday, 22 September, 2020 01:57 PM

Multiple Linear Regression
Thursday, 24 September, 2020 02:01 PM

Aspects of Multiple Linear
Regression
CAMI16 : Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications
Multiple Linear Regression Aspects
• Polynomial Regression Models
• Categorical Regressors and Indicator Variables
• Selection of Variables and Model Building
• Multicollinearity



Polynomial Regression Models
• Form of linear regression where relationship between
independent variable x and dependent variable y is
modelled as an nth degree polynomial.
• Polynomial regression models are widely used when the response is curvilinear
• General Model
𝒀 = 𝐗𝜷 + 𝝐

• Second degree polynomial in one variable


𝑦 = 𝛽0 + 𝛽1 𝑥 + 𝛽11 𝑥 2 + 𝜖

• Second degree polynomial in two variables

  y = β₀ + β₁x₁ + β₂x₂ + β₁₁x₁² + β₂₂x₂² + β₁₂x₁x₂ + ε



Polynomial Regression Model
x   20    25    30    35    40    50    60    65    70    75    80    90
y   1.81  1.70  1.65  1.55  1.48  1.40  1.30  1.26  1.24  1.21  1.20  1.18

Model:  y = β₀ + β₁x + β₁₁x² + ε

(Scatter plot of y against x shows a curvilinear, decreasing trend.)

The design matrix X has rows [1, xᵢ, xᵢ²]; solving the normal equations X′Xβ̂ = X′y gives

ŷ = 2.19826629 − 0.02252236x + 0.00012507x²


Polynomial Regression Model

Prefer the lowest-degree model that fits the data adequately.

The extra sum of squares due to β₁₁ answers the question: can we drop the quadratic term from the model?



Categorical Regressors and Indicator Variables

• So far, regression models considered quantitative


variables (measured on a numerical scale)
• Sometimes, categorical or qualitative variables are
incorporated in a regression model
• The usual approach is to use indicator variables or
dummy variables
• For instance, suppose that one of the variables in a regression model is the operator who is associated with each observation:

  x = 0 if the observation is from operator 1;  x = 1 if the observation is from operator 2



Categorical Regressors and Indicator Variables

𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 𝑥2 + 𝜖
If x2=0
𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝜖

If x2=1
𝑦 = 𝛽0 + 𝛽1 𝑥1 + 𝛽2 . 1 + 𝜖
𝑦 = (𝛽0 +𝛽2 ) + 𝛽1 𝑥1 + 𝜖

𝑦ො = 14.27620 + 0.14115𝑥1 − 13.28020𝑥2





Selection of Variables and Model Building

• Selection of the set of regressor variables to be used in


the model is critical
• Previous experience or underlying theoretical
considerations can help the analyst specify the set of
regressor variables to use in a particular situation.
• Variable selection refers to screening the candidate variables to obtain a regression model that contains the "best" subset of regressor variables.
• We would also like the model to use as few regressor variables as possible.
• The compromise between these conflicting objectives is often called finding the "best" regression equation.



Selection of Variables and Model Building
(Fits: underfitting (high bias), appropriate, overfitting (high variance).)



Selection of Variables and Model Building
(Fits: underfitting (high bias), appropriate, overfitting (high variance).)

Variance refers to the error due to a complex model trying to fit the data. High variance means the model passes through most of the data points, and it results in over-fitting the data.

Bias is the difference between the model's predictions and the correct values. High bias gives a large error on training as well as testing data (underfitting). It is recommended that an algorithm be low-bias to avoid the problem of underfitting.





All Possible Regressions

• Fit all the regression equations involving one candidate


variable, all regression equations involving two
candidate variables, and so on
• Then these equations are evaluated according to some
suitable criteria to select the “best” regression model
• If there are K candidate regressors, there are 2ᴷ total equations to be examined.
  • For example, if K = 4, there are 2⁴ = 16 possible regression equations; while if K = 10, there are 2¹⁰ = 1024 possible regression equations
• Hence, the number of equations to be examined
increases rapidly as the number of candidate variables
increases



All Possible Regressions
• Several criteria may be used for evaluating and comparing
the different regression models obtained.
• A commonly used criterion is based on the value of R² or the value of the adjusted R², R²_adj.
• Continue to increase the number of variables in the model until the increase in R² or R²_adj is small.
  • Often, R²_adj will stabilize and actually begin to decrease as the number of variables in the model increases.
  • Usually, the model that maximizes R²_adj is considered to be a good candidate for the best regression equation.
• Another criterion is PRESS (Prediction Error Sum of Squares):

  $PRESS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
• Models that have small values of PRESS are preferred



Stepwise Regression
• The most widely used variable selection technique
• The procedure iteratively constructs a sequence of
regression models by adding or removing variables at
each step.
• The criterion for adding or removing a variable at any
step is usually expressed in terms of a partial F-test.
• Let fin be the value of the F-random variable for adding
a variable to the model, and let fout be the value of the
F-random variable for removing a variable from the
model.



Stepwise Regression
• Stepwise regression begins by forming a one-variable
model using the regressor variable that has the highest
correlation with the response variable Y.
• This will also be the regressor producing the largest F-
statistic.
• For example, suppose that at first step, x1 is selected.
• At the second step, the remaining K-1 candidate
variables are examined.
• The variable xⱼ for which the partial F-statistic is a maximum is added to the equation, provided that fⱼ > f_in; the partial F-statistic uses the mean square error of the model containing both x₁ and xⱼ.



Stepwise Regression
• Suppose that this procedure indicates that x2 should be
added to the model.
• Now the stepwise regression algorithm determines
whether the variable x1 added at the first step should be
removed
• If the calculated value f1 < fout, the variable x1 is
removed; otherwise it is retained



Stepwise Regression
• In general, at each step
• The set of remaining candidate regressors is examined
• The regressor with the largest partial F-statistic is entered,
provided that the observed value of f exceeds fin.
• Then the partial F-statistic for each regressor in the model is
calculated, and the regressor with the smallest observed value
of F is deleted if the observed f < fout.
• The procedure continues until no other regressors can
be added to or removed from the model



Forward Selection
• The forward selection procedure is a variation of stepwise
regression
• It is based on the principle that regressors should be added to the
model one at a time until there are no remaining candidate
regressors that produce a significant increase in the regression
sum of squares
• That is, variables are added one at a time as long as their partial F-
value exceeds fin
• Forward selection is a simplification of stepwise regression that
omits the partial F-test for deleting variables from the model that
have been added at previous steps
• This is a potential weakness of forward selection; that is, the
procedure does not explore the effect that adding a regressor at
the current step has on regressor variables added at earlier steps.
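As a sketch (not part of the slides), a greedy forward-selection loop can be written in a few lines of Python; here the gain in R² stands in for the partial F-test, and min_gain is an assumed threshold playing the role of f_in:

import numpy as np

def r_squared(X, y):
    # R^2 of an ordinary least-squares fit of y on X (with an intercept)
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid @ resid / np.sum((y - y.mean())**2)

def forward_select(X, y, min_gain=0.01):
    # add one regressor at a time while R^2 improves by at least min_gain
    selected, remaining, best_r2 = [], list(range(X.shape[1])), 0.0
    while remaining:
        gains = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        j_best = max(gains, key=gains.get)
        if gains[j_best] - best_r2 < min_gain:
            break                    # no candidate gives a significant increase
        selected.append(j_best)
        remaining.remove(j_best)
        best_r2 = gains[j_best]
    return selected, best_r2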



Backward Selection
• The backward elimination algorithm begins with all K
candidate regressors in the model.
• Then the regressor with the smallest partial F-statistic is
deleted if this F-statistic is insignificant, that is, if f < fout.
• Next, the model with K-1 regressors is fit, and the next
regressor for potential elimination is found.
• The algorithm terminates when no further regressor can
be deleted.



Multicollinearity
• In regression, multicollinearity refers to the extent to
which independent variables are correlated.
• Multicollinearity exists when:
• One independent variable is correlated with another
independent variable.
• One independent variable is correlated with a linear
combination of two or more independent variables.

R²ⱼ is the coefficient of multiple determination obtained from regressing xⱼ on the other k−1 regressor variables.
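A standard diagnostic built from R²ⱼ is the variance inflation factor, VIFⱼ = 1/(1 − R²ⱼ). A minimal numpy sketch (not part of the slides):

import numpy as np

def vif(X):
    # variance inflation factor for each column of the regressor matrix X
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        X1 = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(X1, X[:, j], rcond=None)
        resid = X[:, j] - X1 @ beta
        r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean())**2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)  # large values (rule of thumb: > 10) signal multicollinearity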



Thank You!



Logistic Regression
CAMI16 : Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications
Classification

Email: Spam / Not Spam?


Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?

0: “Negative Class” (e.g., benign tumor)


1: “Positive Class” (e.g., malignant tumor)
(Plot: malignant (1 = Yes, 0 = No) against tumor size.)

Linear regression is not a good choice for classification.

Threshold classifier output hβ(x) at 0.5:
• If hβ(x) ≥ 0.5, predict "y = 1"
• If hβ(x) < 0.5, predict "y = 0"

Goal:  0 ≤ hβ(x) ≤ 1



Classification:  y = 0 or 1

With linear regression, hβ(x) can be > 1 or < 0.

Logistic regression keeps 0 ≤ hβ(x) ≤ 1 by mapping the whole real line (−∞, +∞) into (0, 1).



Classification
Goal:  0 ≤ hβ(x) ≤ 1

The linear form hβ(x) = β₀ + β₁x is unbounded, so pass βᵀx through the sigmoid function:

$h_\beta(x) = g(\beta^T x)$,  where  $g(z) = \frac{1}{1 + e^{-z}}$

so  $h_\beta(x) = \frac{1}{1 + e^{-\beta^T x}}$



Interpretation of Hypothesis Output

hβ(x) = estimated probability that y = 1 on input x

Example: if  x = [x₀, x₁] = [1, tumorSize]  and  hβ(x) = 0.7, tell the patient that there is a 70% chance of the tumor being malignant.

hβ(x) = P(y = 1 | x; β),  "the probability that y = 1, given x, parameterized by β"

P(y = 0 | x; β) + P(y = 1 | x; β) = 1
P(y = 0 | x; β) = 1 − P(y = 1 | x; β)



Training set:  m examples  (x⁽¹⁾, y⁽¹⁾), …, (x⁽ᵐ⁾, y⁽ᵐ⁾)

$h_\beta(x) = \frac{1}{1 + e^{-\beta^T x}}$

How to choose the parameters β?



Cost function
Linear regression:  $J(\beta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\beta(x^{(i)}) - y^{(i)}\right)^2$

$Cost\left(h_\beta(x^{(i)}), y^{(i)}\right) = \frac{1}{2}\left(h_\beta(x^{(i)}) - y^{(i)}\right)^2$

(Plots: with the sigmoid hypothesis this J(β) is non-convex, with many local optima; we want a convex J(β).)



Logistic regression cost function
$Cost(h_\beta(x), y) = \begin{cases} -\log(h_\beta(x)), & \text{if } y = 1 \\ -\log(1 - h_\beta(x)), & \text{if } y = 0 \end{cases}$

(Plot for y = 1: cost = 0 if hβ(x) = 1, but as hβ(x) → 0, cost → ∞.)

This captures the intuition that if y = 1 but hβ(x) = 0 (the model predicts P(y = 1 | x; β) = 0), we penalize the learning algorithm by a very large cost.



Logistic regression cost function

$Cost(h_\beta(x), y) = \begin{cases} -\log(h_\beta(x)), & \text{if } y = 1 \\ -\log(1 - h_\beta(x)), & \text{if } y = 0 \end{cases}$

(Plot for y = 0: cost = 0 if hβ(x) = 0, but as hβ(x) → 1, cost → ∞.)



Logistic regression cost function
$J(\beta) = \frac{1}{m}\sum_{i=1}^{m} Cost\left(h_\beta(x^{(i)}), y^{(i)}\right)$

$Cost(h_\beta(x), y) = \begin{cases} -\log(h_\beta(x)), & \text{if } y = 1 \\ -\log(1 - h_\beta(x)), & \text{if } y = 0 \end{cases}$

The two cases combine into a single expression:

$Cost(h_\beta(x), y) = -y \log(h_\beta(x)) - (1 - y)\log(1 - h_\beta(x))$

If y = 1:  Cost = −log(hβ(x));  if y = 0:  Cost = −log(1 − hβ(x))



Logistic regression cost function
$J(\beta) = \frac{1}{m}\sum_{i=1}^{m} Cost\left(h_\beta(x^{(i)}), y^{(i)}\right) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\beta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\beta(x^{(i)})\right) \right]$

To fit the parameters β:  $\min_\beta J(\beta)$

To make a prediction given a new x, output  $h_\beta(x) = \frac{1}{1 + e^{-\beta^T x}}$



Gradient Descent
$J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\beta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\beta(x^{(i)})\right) \right]$

Want $\min_\beta J(\beta)$:

Repeat {
  $\beta_j := \beta_j - \alpha \frac{\partial}{\partial \beta_j} J(\beta)$
}
(simultaneously update all βⱼ)



Gradient Descent
$J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)} \log h_\beta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\beta(x^{(i)})\right) \right]$

Want $\min_\beta J(\beta)$:

Repeat {
  $\beta_j := \beta_j - \alpha \sum_{i=1}^{m}\left(h_\beta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
}
(simultaneously update all βⱼ)

The algorithm looks identical to linear regression!
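A minimal Python sketch (not part of the slides) of batch gradient descent for logistic regression; a 1/m factor is included in the update, which only rescales the learning rate relative to the rule above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, iters=5000):
    # X: (m, n) features; y: (m,) labels in {0, 1}; returns beta of length n + 1
    m = len(y)
    X1 = np.column_stack([np.ones(m), X])        # prepend x0 = 1
    beta = np.zeros(X1.shape[1])
    for _ in range(iters):
        h = sigmoid(X1 @ beta)                   # h_beta(x) for all m examples
        beta -= (alpha / m) * (X1.T @ (h - y))   # simultaneous update of all beta_j
    return beta

# toy usage: y flips from 0 to 1 around x = 0
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
beta = fit_logistic(X, y)
print(sigmoid(np.array([1.0, 1.5]) @ beta))      # P(y = 1 | x = 1.5), close to 1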



Multiclass classification

Email foldering/tagging: Work, Friends, Family, Hobby

Medical diagnosis: Not ill, Cold, Flu

Weather: Sunny, Cloudy, Rain, Snow



Binary classification vs. multi-class classification:

(Plots: two classes vs. three classes of points in the (x₁, x₂) plane.)



One-vs-all (one-vs-rest):

(Plots: the three-class problem in the (x₁, x₂) plane is split into three binary sub-problems: Class 1 vs. rest, Class 2 vs. rest, Class 3 vs. rest.)

hβ⁽ⁱ⁾(x) = P(y = i | x; β)
One-vs-all

Train a logistic regression classifier hβ⁽ⁱ⁾(x) for each class i to predict the probability that y = i.

On a new input x, to make a prediction, pick the class i that maximizes  $\max_i h_\beta^{(i)}(x)$
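Building on the fit_logistic() and sigmoid() helpers sketched earlier (assumed to be in scope), one-vs-all is a short wrapper:

import numpy as np

def fit_one_vs_all(X, y, classes):
    # train one logistic classifier per class; returns {class: beta}
    return {c: fit_logistic(X, (y == c).astype(float)) for c in classes}

def predict(models, x):
    # pick the class whose classifier reports the highest probability
    x1 = np.concatenate([[1.0], x])
    return max(models, key=lambda c: sigmoid(x1 @ models[c]))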



Thank
You
jitendra@nitt.edu

https://imjitendra.wordpress.com/

https://www.linkedin.com/in/dr-jitendra/



Discrimination Analysis
CAMI16: Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications



Introduction
• Suppose we are given a learning set of multivariate observations
(i.e., input values in 𝑅𝑛 ), and suppose each observation is known
to have come from one of K predefined classes having similar
characteristics.
• These classes may be identified, for example
• species of plants
• levels of credit worthiness of customers
• presence or absence of a specific medical condition
• different types of tumors
• whether an email message is spam or non-spam
• To distinguish the known classes from each other, we associate a
unique class label (or output value) with each class; the
observations are then described as labeled observations



Problem
• A drug to cure disease
• The drug suits some patients
• The drug reacts badly in other patients

• How to decide the suitability of drug for a patient?



2 Genes



3 Genes



3 Genes

What if the number of genes is larger, say 1,000 and beyond?

Reducing the number of genes may help.



2-D to 1-D



Discriminant Analysis
• Discriminant function analysis is used to determine
which continuous variables discriminate between two or
more naturally occurring groups



Assumptions
• Normal Distribution
• It is assumed that the data (for the variables) represent a
sample from a multivariate normal distribution
• Homogeneity of Variances
  • Very sensitive to heterogeneity of variance-covariance matrices
• Outliers
• Highly sensitive to the outliers
• Non-multicollinearity
• Low multicollinearity is favourable



Thank You!



Classification
Thursday, 15 October, 2020 01:56 PM

CAMI16 : Data Analytics
(Practice Questions)

1. A company manufacturing automobile tyres finds that tyre-life is normally distributed


with a mean of 40,000 km and standard deviation of 3,000 km. It is believed that a
change in the production process will result in a better product and the company has
developed a new tyre. A sample of 100 new tyres has been selected. The company has
found that the mean life of these new tyres is 40,900 km. Can it be concluded that the
new tyre is significantly better than the old one, using the significance level of 0.01?

2. A company is engaged in the packaging of superior quality tea in jars of 500 gm each.
The company is of the view that as long as jars contain 500 gm of tea, the process is in
control. The standard deviation is 50 gm. A sample of 225 jars is taken at random and
the sample average is found to be 510 gm. Has the process gone out of control?

3. A company manufacturing light bulbs is using two different processes A and B. The life of
light bulbs of process A has a normal distribution with mean µ1 and standard deviation
σ1 . Similarly, for process B, it is µ2 and σ2 . The data pertaining to the two process are
as follows:

Sample A Sample B
n1 = 16 n2 = 21
x̄1 = 1200hr x̄2 = 1300hr
σ1 = 60hr σ2 = 50hr

Verify that the variability of the two processes is the same. (Hint: Use F -statistic)

4. Examine the claim of a battery producer that the batteries will last for 100 days, given
that a sample study about their life, of the batteries on 200 batteries, showed mean life
of 90 days with a standard deviation of 15 days. Assume normal distribution, and test at
5% level of significance.

5. A company has appointed four salesmen, SA , SB , SC and SD , and observed their sales
in three seasons - summer, winter and monsoon. The figures (in Rs lakh) are given in the
following table:

Seasons SA SB SC SD Season Totals


Summer 36 36 21 35 128
Winter 28 29 31 32 120
Monsoon 26 28 29 29 112
Totals 90 93 81 96 360

Using 5% level of significance, perform an analysis of variance on the above data and
interpret the results.

6. Find out the regression equation using least squares estimation on below mentioned data:

X 2 3 4 5 6 7 8 9 10 12
Y 7 9 10 13 15 18 19 24 25 29

Principal Component Analysis
(PCA)
CAMI16: Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Data Compression

Reduce data from 2D to 1D
[Figure: the same quantity measured in inches (x1) and in cm (x2);
the points lie near a line, so each can be projected to a single
feature z1]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli




Data Compression

Reduce data from 3D to 2D

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Data Visualization

Country      GDP          Per capita GDP   Human         Life         Poverty Index   Mean household
             (trillions   (thousands of    Development   expectancy   (Gini as        income
             of US$)      intl. $)         Index                      percentage)     (thousands of US$)   …
Canada       1.577        39.17            0.908         80.7         32.6            67.293               …
China        5.878        7.54             0.687         73           46.9            10.22                …
India        1.632        3.41             0.547         64.7         36.8            0.735                …
Russia       1.48         19.84            0.755         65.5         39.9            0.72                 …
Singapore    0.223        56.69            0.866         80           42.5            67.1                 …
USA          14.527       46.86            0.91          78.3         40.8            84.3                 …
…            …            …                …             …            …               …
[resources from en.wikipedia.org]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Data Visualization

Country      z1     z2
Canada       1.6    1.2
China        1.7    0.3
India        1.6    0.2
Russia       1.4    0.5
Singapore    0.5    1.7
USA          2      1.5
…            …      …
(z1, z2: the original features reduced to two dimensions for plotting)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Data Visualization

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Component Analysis (PCA) problem formulation

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Component Analysis (PCA) problem formulation

Reduce from 2-dimensions to 1-dimension: Find a direction (a vector
u(1) ∈ ℝⁿ) onto which to project the data so as to minimize the
projection error.
Reduce from n-dimensions to k-dimensions: Find k vectors u(1), …, u(k)
onto which to project the data, so as to minimize the projection error.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


PCA is not linear regression
[Figure: linear regression minimizes vertical errors to a line, while
PCA minimizes orthogonal projection errors onto a direction]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli




Data preprocessing

Training set: x(1), x(2), …, x(m)
Preprocessing (feature scaling/mean normalization):
    μj = (1/m) Σi=1..m xj(i)
    Replace each xj(i) with xj(i) − μj.

If different features are on different scales (e.g., size of house,
number of bedrooms), scale features to have a comparable range of
values, e.g., divide each xj(i) − μj by sj, the standard deviation
(or range) of feature j.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Component Analysis (PCA) algorithm

Reduce data from 2D to 1D Reduce data from 3D to 2D

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Component Analysis (PCA) algorithm
Reduce data from n-dimensions to k-dimensions
Compute the “covariance matrix”:
    Sigma = (1/m) Σi=1..m (x(i)) (x(i))ᵀ

Compute the “eigenvectors” of matrix Sigma:

[U,S,V] = svd(Sigma);

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Component Analysis (PCA) algorithm summary

After mean normalization (ensure every feature has
zero mean) and optionally feature scaling:

Sigma = (1/m) * X' * X;

[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce' * x;

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Reconstruction from compressed representation

𝑥𝑎𝑝𝑝𝑟𝑜𝑥 = 𝑈𝑟𝑒𝑑𝑢𝑐𝑒 𝑧

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
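The Octave fragments above can be mirrored in Python; the following NumPy
sketch (synthetic data, rows assumed to be examples) runs the whole
pipeline, projection and reconstruction included:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # m x n data matrix, rows are examples
X = X - X.mean(axis=0)             # mean normalization

Sigma = (X.T @ X) / X.shape[0]     # covariance matrix (n x n)
U, S, Vt = np.linalg.svd(Sigma)    # columns of U: principal directions

k = 2
U_reduce = U[:, :k]
Z = X @ U_reduce                   # z = Ureduce' * x for every example
X_approx = Z @ U_reduce.T          # reconstruction from the compressed z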


Choosing k (number of principal components)

Average squared projection error:  (1/m) Σi=1..m ‖x(i) − x(i)approx‖²

Total variation in the data:       (1/m) Σi=1..m ‖x(i)‖²

Typically, choose k to be the smallest value so that

    [ (1/m) Σi ‖x(i) − x(i)approx‖² ] / [ (1/m) Σi ‖x(i)‖² ] ≤ 0.01   (1%)

“99% of variance is retained”

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Choosing k (number of principal components)
Algorithm: [U,S,V] = svd(Sigma)
Try PCA with k = 1, 2, 3, …
Compute Ureduce, z(1), …, z(m), x(1)approx, …, x(m)approx
Check if the ratio of average squared projection error to total
variation is ≤ 0.01.

Equivalently, using the diagonal matrix S from the svd:
Pick the smallest value of k for which
    Σi=1..k Sii / Σi=1..n Sii ≥ 0.99
(99% of variance retained)


Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
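Continuing the NumPy sketch above (the array S holds the diagonal entries
Sii returned by the svd), the smallest k retaining 99% of the variance can
be read off directly:

retained = np.cumsum(S) / np.sum(S)        # variance retained for k = 1, 2, ...
k = int(np.argmax(retained >= 0.99)) + 1   # smallest k with >= 99% retained
print(k, retained)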
Supervised learning speedup

Extract inputs: from the labelled training set (x(1), y(1)), …, (x(m), y(m)),
take the unlabeled dataset x(1), …, x(m) and run PCA to get z(1), …, z(m).

New training set: (z(1), y(1)), …, (z(m), y(m))

Note: the mapping x(i) → z(i) should be defined by running PCA
only on the training set. The same mapping can then be applied to the
examples in the cross-validation and test sets.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Application of PCA

- Compression
- Reduce memory/disk needed to store data
- Speed up learning algorithm

- Visualization

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Thank You!

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Component Analysis
(PCA-II)
CAMI16: Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Component Analysis
• Explains the variance-covariance structure of a set of
variables through a few linear combinations of these
variables
• Although p components are required to reproduce the total
system variability, often much of this variability can be
accounted for by a small number k of the principal
components.
• If so, there is (almost) as much information in the k
components as there is in the original p variables. The k
principal components can then replace the initial p
variables, and the original data set, consisting of n
measurements on p variables, is reduced to a data set
consisting of n measurements on k principal
components.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Components
• Principal components are particular linear combinations
of the p random variables
• These linear combinations represent the selection of a
new coordinate system obtained by rotating the original
system
• The new axes represent the directions with maximum
variability

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Components
Let 𝐗′ = [X1, X2, ⋯, Xp] have the covariance matrix 𝚺 with eigenvalues
λ1 ≥ λ2 ≥ ⋯ ≥ λp ≥ 0.
Consider the linear combinations
    Y1 = 𝐚′1𝐗 = a11X1 + a12X2 + ⋯ + a1pXp
    Y2 = 𝐚′2𝐗 = a21X1 + a22X2 + ⋯ + a2pXp
    ⋮
    Yp = 𝐚′p𝐗 = ap1X1 + ap2X2 + ⋯ + appXp

Then
    Cov(Yi, Yk) = 𝐚′i 𝚺 𝐚k,    i, k = 1, 2, …, p
(if i = k, Cov(Yi, Yi) = Var(Yi))
The principal components are those uncorrelated linear combinations
Y1, Y2, …, Yp whose variances are as large as possible.
The first principal component is the linear combination with maximum
variance. That is, it maximizes Var(Yi) = 𝐚′i 𝚺 𝐚i.
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
Principal Components

First principal component = Linear combination 𝐚′𝟏 𝐗 that maximizes


Var 𝐚′𝟏 𝐗 subject to 𝒂′𝟏 𝒂𝟏 = 𝟏

Second principal component = Linear combination 𝐚′𝟐 𝐗 that maximizes


Var 𝐚′𝟐 𝐗 subject to 𝒂′𝟐 𝒂𝟐 = 𝟏 and
Cov 𝐚′𝟏 𝐗, 𝐚′𝟐 𝐗 = 0

At the ith step


ith principal component = Linear combination 𝐚′𝒊 𝐗 that maximizes
Var 𝐚′𝒊 𝐗 subject to 𝒂′𝒊 𝒂𝒊 = 𝟏 and
Cov 𝐚′𝒊 𝐗, 𝐚′𝒌 𝐗 = 0 for 𝑘 < 𝑖

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Components
Let 𝚺 be the covariance matrix associated with the
random vector 𝐗 ′ = X1 , X2 , ⋯ , Xp . Let 𝚺 have the
eigenvalue-eigenvector pairs 𝜆1 , 𝒆1 , 𝜆2 , 𝒆2 , … , 𝜆𝑝 , 𝒆𝑝
where λ1 ≥ λ2 ≥ ⋯ ≥ λp ≥ 0. Then the ith principal
component is given by
Yi = 𝒆′𝒊 𝐗 = 𝑒i1 X1 + 𝑒i2 X2 + ⋯ + 𝑒ip X p , 𝑖 = 1,2, … , 𝑝

With these choices


Var(Yi) = 𝒆′i 𝚺 𝒆i = λi,    i = 1, 2, …, p
Cov(Yi, Yk) = 𝒆′i 𝚺 𝒆k = 0,    i ≠ k

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Principal Components

Let 𝐗 ′ = X1 , X2 , ⋯ , Xp have covariance matrix 𝚺, with


eigenvalue-eigenvector pairs 𝜆1 , 𝒆1 , 𝜆2 , 𝒆2 , … , 𝜆𝑝 , 𝒆𝑝
where λ1 ≥ λ2 ≥ ⋯ ≥ λp ≥ 0. Let 𝑌1 = 𝒆′𝟏 𝐗, 𝑌2 = 𝒆′𝟐 𝐗, …,
𝑌𝑝 = 𝒆′𝒑 𝐗 be the principal components. Then

σ11 + σ22 + ⋯ + σpp = λ1 + λ2 + ⋯ + λp

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
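This identity is easy to verify numerically; in the sketch below (a
hypothetical 3 × 3 covariance matrix), the trace of Σ equals the sum of
the eigenvalues, and the eigenvalues are the variances of the principal
components:

import numpy as np

Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])       # hypothetical covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: for symmetric matrices
print(np.trace(Sigma), eigvals.sum())     # both equal sigma11 + ... + sigma_pp
print(eigvals[::-1])                      # Var(Y1) >= Var(Y2) >= ... (the lambda_i)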


Graphing the Principal Components
• Plots of the principal components can reveal suspect
observations, as well as provide a check on the
assumption of normality.
• Since the principal components are linear combinations
of the original variables, it is not unreasonable to expect
them to be nearly normal.
• To help check the normal assumption, construct scatter
diagrams of pairs of first few principal components.
Also, make Q-Q plots from the sample values
generated by each principal component.
• Construct scatter diagrams and Q-Q plots for the last
few principal components. These help identify suspect
observations

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Large Sample Inferences
• Eigenvalues specify the variances and eigenvectors
determine the directions of maximum variability
• Most of the total variance can be explained in fewer
than p dimensions, when the first few eigenvalues are
much larger than the rest
• Decisions regarding the quality of the principal
component approximation must be made on the basis
of the eigenvalue-eigenvector pairs (𝜆መ 𝑖 , 𝒆ො 𝑖 )
• Because of sampling variation, these eigenvalues and
eigenvectors will differ from their underlying population
counterparts

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Large Sample Inferences
• The observations 𝑿1, 𝑿2, …, 𝑿n are a random sample
from a normal population
• Assume that the unknown eigenvalues of 𝚺 are distinct and
positive, so that λ1 > λ2 > ⋯ > λp > 0
• For large n, the λ̂i are independently distributed, each with an
approximate N(λi, 2λi²/n) distribution
• A large-sample 100(1 − α)% confidence interval for λi is
provided by

    λ̂i / (1 + z(α/2)√(2/n)) ≤ λi ≤ λ̂i / (1 − z(α/2)√(2/n))

where z(α/2) is the upper 100(α/2)th percentile of a
standard normal distribution

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Thank You!

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Bayes’ Classifier
CAMI16: Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications
Bayes’ Rule
• Bayes’ Theorem is a way of finding a probability when
we know certain other probabilities.

P(A|B) = P(A) P(B|A) / P(B)

Which tells us:


how often A happens given that B happens, written P(A|B),
When we know:
how often B happens given that A happens, written P(B|A)
and how likely A is on its own, written P(A)
and how likely B is on its own, written P(B)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Addition Rule

P(A) = m/n

P(A ∪ B) = P(A) + P(B)                 (mutually exclusive events)

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)      (in general)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli




Example
Company PRODUCTION CHEMICAL MECHANICAL Total

TCS 22 28 18 68

L&T 34 25 30 89

IBM 19 32 21 72

Total 75 85 69 229

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Conditional Probability
• Probability of occurrence of event B given that event A
has already occurred

P(B|A) = P(A ∩ B) / P(A)
[Venn diagram: overlapping events A and B]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Multiplication Rule

P(A ∩ B) = P(B|A) P(A) = P(A|B) P(B)

hence Bayes' rule:
P(A|B) = P(B|A) P(A) / P(B)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


[Figure: sample space partitioned into A1, A2, A3, with event B
overlapping each: the setting for the law of total probability]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Example
• P(Fire) means how often there is fire
• P(Smoke) means how often we see smoke
• P(Fire|Smoke) means how often there is fire when we
can see smoke
• P(Smoke|Fire) means how often we can see smoke
when there is fire

P(Fire) = 1%, P(Smoke) = 10%, P(Smoke|Fire)=90%

P(Fire|Smoke) = P(Fire) P(Smoke|Fire) / P(Smoke)
              = (0.01 × 0.90) / 0.10 = 9%
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
Example
• You are planning a picnic today
• but the morning is cloudy
• Oh no! 50% of all rainy days start off cloudy!
• But cloudy mornings are common (about 40% of days
start cloudy)
• And this is usually a dry month (only 3 of 30 days tend
to be rainy, or 10%)

What is the chance of rain during the day?

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Example
P(Rain|Cloud) = ?

P(Rain|Cloud) = P(Rain) P(Cloud|Rain) / P(Cloud)

P(Rain) = 10%,  P(Cloud) = 40%,  P(Cloud|Rain) = 50%

P(Rain|Cloud) = (0.1 × 0.5) / 0.4 = 0.125

A 12.5% chance of rain. Not too bad; you may have your picnic.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Example

          Blue   notBlue
Man         5      35      40        P(Man) = 40/100 = 0.4
Woman      20      40      60        P(Blue) = 25/100 = 0.25
Total      25      75     100        P(Blue|Man) = 5/40 = 0.125

P(Man|Blue) = ?
P(Man|Blue) = P(Man) P(Blue|Man) / P(Blue)
            = (0.4 × 0.125) / 0.25 = 0.2

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Question 1
• In a School, 60% of the boys play football and 36% of
the boys play ice hockey. Given that 40% of those that
play football also play ice hockey, what percent of those
that play ice hockey also play football?

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Question 2
• 40% of the girls like music and 24% of the girls like
dance. Given that 30% of those that like music also like
dance, what percent of those that like dance also like
music?

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Question 3
• In a factory, machine X produces 60% of the daily
output and machine Y produces 40% of the daily output.
2% of machine X's output is defective, and 1.5% of
machine Y's output is defective.
One day, an item was inspected at random and found to
be defective. What is the probability that it was
produced by machine X?

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
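As a quick check on Question 3, a few lines of Python apply the law of
total probability and Bayes' rule to the numbers given in the question:

p_x, p_y = 0.60, 0.40                  # shares of daily output
p_def_x, p_def_y = 0.02, 0.015         # defective rates per machine
p_def = p_x * p_def_x + p_y * p_def_y  # total probability of a defect
print(p_x * p_def_x / p_def)           # P(X | defective) = 0.012/0.018 = 2/3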


Naïve Bayes Algorithm
• The Naïve Bayes algorithm is a machine learning
algorithm for classification problems. It is primarily used
for text classification, which involves high-dimensional
training data sets.
• It makes the assumption that the occurrence of a certain
feature/attribute is independent of the occurrence of
other attributes.
• Spam filtrations
• Sentimental analysis
• News article classification

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


• In a classification problem, there are multiple attributes and
classes
• The main aim in the Naïve Bayes algorithm is to calculate
the conditional probability that an object with attributes
𝐴 = (𝑎1, 𝑎2, …, 𝑎𝑛) belongs to a particular class 𝑣:
    P(v|A) = P(A|v) P(v) / P(A)
With the independence assumption,
    P(v|a1, a2, …, an) = [P(a1|v) P(a2|v) ⋯ P(an|v) P(v)] / [P(a1) P(a2) ⋯ P(an)]
so, up to the constant denominator,
    P(v|a1, a2, …, an) ∝ P(v) ∏i P(ai|v)
and we predict
    v = argmaxv P(v) ∏i P(ai|v)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


DAY   OUTLOOK    TEMPERATURE   HUMIDITY   WINDY   PLAY GOLF
0     Rainy      Hot           High       False   No
1     Rainy      Hot           High       True    No
2     Overcast   Hot           High       False   Yes
3     Sunny      Mild          High       False   Yes
4     Sunny      Cool          Normal     False   Yes
5     Sunny      Cool          Normal     True    No
6     Overcast   Cool          Normal     True    Yes
7     Rainy      Mild          High       False   No
8     Rainy      Cool          Normal     False   Yes
9     Sunny      Mild          Normal     False   Yes
10    Rainy      Mild          Normal     True    Yes
11    Overcast   Mild          High       True    Yes
12    Overcast   Hot           Normal     False   Yes
13    Sunny      Mild          High       True    No
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


From the data set we build one frequency table per attribute (the
lecture builds these up slide by slide; the completed tables are
shown below):

Outlook    Yes  No  P(Yes)  P(No)        Temp    Yes  No  P(Yes)  P(No)
Sunny       3    2   3/9     2/5         Hot      2    2   2/9     2/5
Overcast    4    0   4/9     0/5         Mild     4    2   4/9     2/5
Rainy       2    3   2/9     3/5         Cool     3    1   3/9     1/5
Total       9    5   100%    100%        Total    9    5   100%    100%

Humidity   Yes  No  P(Yes)  P(No)        Windy   Yes  No  P(Yes)  P(No)
High        3    4   3/9     4/5         True     3    3   3/9     3/5
Normal      6    1   6/9     1/5         False    6    2   6/9     2/5
Total       9    5   100%    100%        Total    9    5   100%    100%

Play   Count   P(Yes) or P(No)
Yes      9          9/14
No       5          5/14
Total   14          100%

Today = (Sunny, Hot, Normal, False)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Using the frequency tables above, classify
Today = (Sunny, Hot, Normal, False):


P(Yes|Today) = P(Sunny|Yes) P(Hot|Yes) P(Normal|Yes) P(False|Yes) P(Yes) / P(Today)

P(No|Today) = P(Sunny|No) P(Hot|No) P(Normal|No) P(False|No) P(No) / P(Today)

Dropping the common denominator P(Today):

P(Yes|Today) ∝ (3/9)(2/9)(6/9)(6/9)(9/14) ≈ 0.0212
P(No|Today)  ∝ (2/5)(2/5)(1/5)(2/5)(5/14) ≈ 0.0046

Since 0.0212 > 0.0046, predict PLAY GOLF = Yes
(normalizing: P(Yes|Today) ≈ 0.0212 / (0.0212 + 0.0046) ≈ 0.82).

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
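A minimal Python sketch of this computation, with the probabilities read
off the frequency tables above:

p_yes, p_no = 9/14, 5/14
like_yes = (3/9) * (2/9) * (6/9) * (6/9)   # P(Sunny,Hot,Normal,False | Yes)
like_no = (2/5) * (2/5) * (1/5) * (2/5)    # P(Sunny,Hot,Normal,False | No)

score_yes = p_yes * like_yes               # ~0.0212
score_no = p_no * like_no                  # ~0.0046
print("Play golf" if score_yes > score_no else "Don't play")
print(score_yes / (score_yes + score_no))  # normalized P(Yes | Today), ~0.82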


Thank You!

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Eigenvalues and Eigenvectors
Wednesday, 28 October, 2020 03:01 PM

EigenValues and EigenVectors Page 1


EigenValues and EigenVectors Page 2
EigenValues and EigenVectors Page 3
Machine Learning
CAMI16: Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


How do you know?

I think it's going to rain today!


Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
HUMAN: learns from mistakes
MACHINE: learns from data

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
Why Machine Learning is getting popular?

Computing Power Availability


Excessive Data Availability

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


What do we mean by learning?
• Given
• a data set D,
• a task T, and
• a performance measure M,
a computer system is said to learn from D to perform
the task T if after learning the system’s performance on
T improves as measured by M.
• In other words, the learned model helps the system to
perform T better as compared to no learning.
Herbert Simon: “Learning is any process by
which a system improves performance from
experience.”

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


What is Machine Learning?

Definition:
“changes in [a] system that ... enable [it] to do the same
task or tasks drawn from the same population more
efficiently and more effectively the next time.'' (Simon
1983)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Why Machine Learning?
• For some kinds of problems we are just not able to write
down the rules
  • e.g. image & speech recognition, language translation, sales
    forecasting

[Diagram: Problem → RULES? → Code; for these problems we cannot
hand-write the rules]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Traditional Computing vs Machine Learning

Traditional Computing:   Data + Program → Output

Machine Learning:        Data + Output → Program
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
Types of Machine Learning

Supervised Learning
• Labelled data
• Direct feedback
• Classification
• Regression

Unsupervised Learning
• Unlabelled data
• Association
• Clustering

Reinforcement Learning
• Reward-based learning
• Machine learns how to act in an environment
• Robotics

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Supervised Learning

Regression Classification

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Regression
Notation:
m = number of training examples
x’s = “input” variable / features
y’s = “output” variable / “target” variable

Housing Prices (Trichy, TN)
Size in feet² (x)    Price (₹) in 100,000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

[Scatter plot: Price (₹, in 100,000s) versus Size (feet²)]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Training Set (size of house, price) → Learning Algorithm → Model (h)

Size of house → Model (h) → Estimated price

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Modelling

hβ(x) = β0 + β1x

Identify β0 and β1 so that hβ(x) is close to y on the training data.

[Plots: three example lines over the housing data:
β0 = 1.5, β1 = 0 (horizontal line); β0 = 0, β1 = 0.5; β0 = 1, β1 = 0.5]
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
How to define closeness?

hβ(xi) = β0 + β1xi,    i = 1, 2, ⋯, m
εi = hβ(xi) − yi,      i = 1, 2, ⋯, m

How to compute the total error?
 a) Σi=1..m εi
 b) Σi=1..m εi²
(option (a) lets positive and negative errors cancel, so we use (b))

Cost function:  J(β0, β1) = (1/2m) Σi=1..m (hβ(xi) − yi)²

Goal:  minimize J(β0, β1) over β0, β1

[Figure: data points (x(i), y(i)), the fitted line y = β0 + β1x, and
the residuals ε1, ε2, …, εm]
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
[Figure: left, hβ(x) as a function of x for fixed β1 ∈ {0, 0.5, 1};
right, the corresponding values of the cost J(β1) as a function of the
parameter β1, minimized at β1 = 1]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Parameter Learning

Have some function J(β0, β1); want to minimize J(β0, β1) over β0, β1.

Outline:
• Start with some β0, β1
• Keep changing β0, β1 to reduce J(β0, β1)
  until we hopefully end up at a minimum

Gradient descent:
repeat until convergence {
    β0 := β0 − α (1/m) Σi=1..m (hβ(xi) − yi)
    β1 := β1 − α (1/m) Σi=1..m (hβ(xi) − yi) xi
}
(update β0 and β1 simultaneously; α is the learning rate)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
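The update rule above translates directly into code; below is a minimal
NumPy sketch with hypothetical training data (x, y) and an assumed
learning rate α = 0.05:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])        # hypothetical training data
b0, b1, alpha = 0.0, 0.0, 0.05

for _ in range(2000):                     # repeat until (approximate) convergence
    err = (b0 + b1 * x) - y               # h(x_i) - y_i for all i at once
    b0, b1 = b0 - alpha * err.mean(), b1 - alpha * (err * x).mean()  # simultaneous update

print(b0, b1)                             # close to the least-squares fit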


[Figure: surface and contour plots of the cost J(β0, β1) over the
(β0, β1) plane]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
[Figure series, repeated over several slides: each step of gradient
descent shows, on the left, the current fit hβ(x) on the housing data
(Price (₹) in 100,000's vs size) for the fixed β0, β1, and, on the
right, the corresponding point on the contour plot of J(β0, β1) moving
toward the minimum]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Classification

Email: Spam / Not Spam?


Online Transactions: Fraudulent (Yes / No)?
Tumor: Malignant / Benign?

0: “Negative Class” (e.g., benign tumor)


1: “Positive Class” (e.g., malignant tumor)
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
[Figure: Malignant? (Yes = 1 / No = 0) versus Tumor Size, with a fitted
line]

Linear Regression is not a good choice for classification.

Threshold classifier output hβ(xi) at 0.5:
  If hβ(xi) ≥ 0.5, predict “y = 1”
  If hβ(xi) < 0.5, predict “y = 0”

Goal: 0 ≤ hβ(xi) ≤ 1
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Classification

Goal: 0 ≤ hβ(xi) ≤ 1

hβ(xi) = g(βᵀxi), where g is the sigmoid function
    g(z) = 1 / (1 + e⁻ᶻ)

hβ(x) = P(y = 1 | x; β)
“probability that y = 1, given x, parameterized by β”

Example: hβ(xi) = 0.7 → tell the patient there is a 70% chance of the
tumor being malignant.

P(y = 0 | x; β) + P(y = 1 | x; β) = 1
P(y = 0 | x; β) = 1 − P(y = 1 | x; β)
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
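A small sketch of the sigmoid hypothesis; the β values here are
hypothetical, chosen only to illustrate the 0.5 threshold on the tumor
example:

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))        # sigmoid: output always in (0, 1)

beta0, beta1 = -4.0, 1.5                   # hypothetical parameters
tumor_size = 3.2
h = g(beta0 + beta1 * tumor_size)          # P(y = 1 | x; beta)
print(h, "malignant" if h >= 0.5 else "benign")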


Binary classification vs Multi-class classification
[Figure: two-class scatter in the (x1, x2) plane vs three-class scatter]

One-vs-all (one-vs-rest):
Train a separate binary classifier for each class i, treating class i
as positive and the rest as negative:
    hβ(i)(x) = P(y = i | x; β),    i = 1, 2, 3
To classify a new input x, pick the class i with the largest hβ(i)(x).
[Figure: the three resulting decision boundaries]
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
Unsupervised Learning

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Reinforcement Learning
Defines how software agents should take actions in an environment

[Diagram: agent → Action → environment; environment → State, Reward → agent]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
Reinforcement Learning Process
Two main components
1. Agent
2. Environment


Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Reward Maximization

[Figure: the agent interacts with an opponent and chooses actions so as
to maximize its cumulative reward]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Markov Decision Process
• The following parameters are used to attain a solution
• Actions (A)
• States (S)
• Reward (R)
• Policy (π)
• Value (V)

[Diagram: agent-environment loop with Action and State, Reward]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Q-Learning Algorithm

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Q-Learning

[Figure: rooms R0-R7 connected by doors; R7 is the goal. Moving into R7
gives reward 100, every other allowed move gives reward 0]

Initial Q table (one row and column per state, all zeros):

    Q = 8 × 8 matrix of zeros

Reward matrix R (−1 = no door, 0 = door, 100 = reaching the goal R7):

          R0   R1   R2   R3   R4   R5   R6   R7
    R0 [  −1   −1   −1   −1    0   −1   −1   −1 ]
    R1 [  −1   −1   −1    0   −1   −1   −1  100 ]
    R2 [  −1   −1   −1    0   −1    0   −1   −1 ]
    R3 [  −1    0    0   −1    0   −1   −1   −1 ]
    R4 [   0   −1   −1    0   −1   −1    0   −1 ]
    R5 [  −1   −1    0   −1   −1   −1   −1   −1 ]
    R6 [  −1   −1   −1   −1    0   −1   −1  100 ]
    R7 [  −1    0   −1   −1   −1   −1    0   −1 ]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
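Below is a minimal Q-learning sketch on this room graph; the discount
factor γ = 0.8 and the episode count are assumptions, and the update used
is the standard Q(s, a) = R(s, a) + γ · max Q(next state, ·):

import numpy as np

R = np.array([
    [-1, -1, -1, -1,  0, -1, -1,  -1],
    [-1, -1, -1,  0, -1, -1, -1, 100],
    [-1, -1, -1,  0, -1,  0, -1,  -1],
    [-1,  0,  0, -1,  0, -1, -1,  -1],
    [ 0, -1, -1,  0, -1, -1,  0,  -1],
    [-1, -1,  0, -1, -1, -1, -1,  -1],
    [-1, -1, -1, -1,  0, -1, -1, 100],
    [-1,  0, -1, -1, -1, -1,  0,  -1],
], dtype=float)
Q = np.zeros_like(R)
gamma, goal = 0.8, 7
rng = np.random.default_rng(0)

for _ in range(1000):                        # training episodes
    s = int(rng.integers(0, 8))              # random start state
    while s != goal:
        actions = np.flatnonzero(R[s] >= 0)  # moves allowed from state s
        a = int(rng.choice(actions))         # explore randomly
        Q[s, a] = R[s, a] + gamma * Q[a].max()
        s = a

print((100 * Q / Q.max()).round())           # normalized Q table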


Machine Learning Model Development Process

Step 1: Data Collection
Step 2: Data Preparation
Step 3: Feature Extraction
Step 4: Model Training
Step 5: Model Testing
Step 6: Model Evaluation

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


[Image: applications of machine learning]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
Thank You

jitendra@nitt.edu

https://imjitendra.wordpress.com/

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Decision Trees
What makes a loan risky?

I want to buy a new house! → Loan Application
The application is judged on: Credit ★★★★, Income ★★★,
Term ★★★★★, Personal Info ★★★

Credit history explained
  Did I pay previous loans on time?
  Example: excellent, good, or fair

Income
  What’s my income?
  Example: $80K per year

Loan terms
  How soon do I need to pay the loan?
  Example: 3 years, 5 years, …

Personal information
  Age, reason for the loan, marital status, …
  Example: Home loan for a married couple
Intelligent application

Intelligent loan application review system:
Loan Applications → review → Safe ✓ / Risky ✘

Classifier review:
Input: xi (loan application) → Classifier MODEL → Output: ŷi
(predicted class: ŷi = +1 for Safe, ŷi = −1 for Risky)
Decision Tree: Intuitions
What does a decision tree represent?

Start
└─ Credit?
   ├─ excellent → Safe
   ├─ fair → Term?
   │          ├─ 3 years → Risky
   │          └─ 5 years → Safe
   └─ poor → Income?
              ├─ high → Term?
              │          ├─ 3 years → Risky
              │          └─ 5 years → Safe
              └─ low → Risky

Reading the tree: 3-year loans with fair credit history are risky;
3-year loans with high income & poor credit history are risky.
Scoring a loan application
xi = (Credit = poor, Income = high, Term = 5 years)

Traverse the tree: Credit = poor → Income = high → Term = 5 years
→ ŷi = Safe


Decision tree model

T(xi) = traverse the decision tree above: start at the root, follow
the branch matching each feature of the loan application xi, and
output the leaf’s prediction ŷi.
Decision tree learning task
Training: data (x, y) → feature extraction h(x) → ML model → ŷ.
The ML algorithm learns T(x) by optimizing a quality metric on the
training data.
Learn decision tree from data?

Credit      Term    Income   y
excellent   3 yrs   high     safe
fair        5 yrs   low      risky
fair        3 yrs   high     safe
poor        5 yrs   high     risky
excellent   3 yrs   low      risky
fair        5 yrs   low      safe
poor        3 yrs   high     risky
poor        5 yrs   low      safe
fair        3 yrs   high     safe

The goal: learn a tree like the one above from these examples.
Decision tree learning problem

Training data: N observations (xi, yi), as in the table above.
Optimize a quality metric on the training data to pick T(X).
Quality metric: Classification error

• Error measures the fraction of mistakes:
    Error = (# incorrect predictions) / (# examples)
  - Best possible value: 0.0
  - Worst possible value: 1.0
Find the tree with lowest classification error

Given the training table, choose T(X) to minimize classification error
on the training data.
How do we find the best tree?

Exponentially large number of possible trees makes decision tree
learning hard! (NP-hard problem)
[Figure: many candidate trees T1(X), T2(X), …, T6(X)]

Credit Term Income y


excellent 3 yrs high safe
fair 5 yrs low risky Approximately
minimize
fair
poor
3 yrs
5 yrs
high
high
safe
risky
classification error
on training data
T(X)
excellent 3 yrs low risky
fair 5 yrs low safe
poor 3 yrs high risky
poor 5 yrs low safe
fair 3 yrs high safe
Greedy decision tree learning: Algorithm outline

Step 1: Start with an empty tree
(all data at the root; keep a histogram of y values, i.e. the counts
of Safe and Risky, over all points in the dataset)
Step 2: Split on a feature
(all data) → split/partition the data on Credit →
subsets with Credit = excellent, Credit = fair, Credit = poor
Feature split explained
[Figure: the data points after partitioning on Credit, with the subset
where Credit = excellent highlighted]
Step 3: Making predictions
In the Credit = excellent branch, all examples are Safe loans, so
predict Safe.
Step 4: Recursion
Nothing more to do for Credit = excellent (predict Safe); build a tree
from the data points with Credit = fair, and another from the data
points with Credit = poor.
Greedy decision tree learning

• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data   (Problem 1: feature split selection)
• For each split of the tree:
  • Step 3: If nothing more to do, make predictions   (Problem 2: stopping condition)
  • Step 4: Otherwise, go to Step 2 & continue (recurse) on this split

Feature split learning = Decision stump learning
Start with the data: assume N = 40 examples, 3 features (the loan
table as before), with 22 Safe and 18 Risky loans in total.

Compact visual notation, the root node:
    Root: 22 Safe | 18 Risky    (N = 40 examples)
Decision stump: single-level tree, split on Credit

    Root: 22 Safe | 18 Risky
    Credit?
    ├─ excellent: 9 Safe | 0 Risky
    ├─ fair:      9 Safe | 4 Risky
    └─ poor:      4 Safe | 14 Risky

(each intermediate node holds the subset of data with that Credit value)
Making predictions with a decision stump
For each intermediate node, set ŷ = majority value:
excellent → Safe, fair → Safe, poor → Risky
How do we learn a decision stump? Find the “best” feature to split on!

Choice 1: Split on Credit               Choice 2: Split on Term
Root: 22 Safe | 18 Risky                Root: 22 Safe | 18 Risky
excellent: 9|0, fair: 9|4, poor: 4|14   3 years: 16|4, 5 years: 6|14
How do we measure effectiveness of a split?
Idea: calculate the classification error of the decision stump:
    Error = (# mistakes) / (# data points)
Calculating classification error
• Step 1: ŷ = class of majority of data in node
• Step 2: Calculate classification error of predicting ŷ for this data

At the root (ŷ = majority class = Safe): 22 correct, 18 mistakes
    Error = 18/40 = 0.45

Tree        Classification error
(root)      0.45
Choice 1: Split on credit history. Does a split on Credit reduce the
classification error below 0.45?

Step 1: For each intermediate node, set ŷ = majority value:
excellent (9|0) → Safe: 0 mistakes
fair (9|4) → Safe: 4 mistakes
poor (4|14) → Risky: 4 mistakes

    Error = (0 + 4 + 4) / 40 = 8/40 = 0.2

Tree              Classification error
(root)            0.45
Split on credit   0.2
Choice 2: Split on Term

3 years (16|4) → Safe: 4 mistakes
5 years (6|14) → Risky: 6 mistakes

    Error = (4 + 6) / 40 = 10/40 = 0.25

Tree              Classification error
(root)            0.45
Split on credit   0.2
Split on term     0.25
Choice 1 vs Choice 2

Tree                 Classification error
(root)               0.45
split on credit      0.2
split on loan term   0.25

WINNER: Choice 1, split on Credit (error 0.2 < 0.25)
Feature split selection algorithm

• Given a subset of data M (a node in a tree)
• For each feature hi(x):
  1. Split the data of M according to feature hi(x)
  2. Compute the classification error of the split
• Choose the feature h*(x) with the lowest classification error
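A minimal Python sketch of this selection step; the label counts below
reproduce the 40-loan summary from the slides (9|0, 9|4, 4|14 for Credit):

def split_error(values, labels):
    # error when splitting on this feature and predicting the majority
    # class inside every branch
    mistakes = 0
    for v in set(values):
        branch = [lab for val, lab in zip(values, labels) if val == v]
        mistakes += len(branch) - max(branch.count(c) for c in set(branch))
    return mistakes / len(labels)

credit = ['excellent'] * 9 + ['fair'] * 13 + ['poor'] * 18
labels = ['safe'] * 9 + ['safe'] * 9 + ['risky'] * 4 + ['safe'] * 4 + ['risky'] * 14
print(split_error(credit, labels))   # 8/40 = 0.2, matching the slide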
Greedy decision tree learning

• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data
  (pick the feature split leading to the lowest classification error)
• For each split of the tree:
  • Step 3: If nothing more to do, make predictions
  • Step 4: Otherwise, go to Step 2 & continue (recurse) on this split
Decision Tree Learning: Recursion & Stopping conditions

Learn decision tree from data? Starting from the loan table, we grow
the full tree (shown earlier) by recursively learning decision stumps
on subsets of the data.
We’ve learned a decision stump, what next?

Root: 22 Safe | 18 Risky → Credit? → excellent: 9|0, fair: 9|4, poor: 4|14

All data points in the excellent branch are Safe, so there is nothing
else to do with this subset of data: it becomes a leaf node predicting
Safe.
Tree learning = Recursive stump learning
For the other branches, build a decision stump with the subset of data
where Credit = fair, and another with the subset where Credit = poor.
Second level

fair (9|4) → Term?      3 years: 0|4 → Risky;   5 years: 9|0 → Safe
poor (4|14) → Income?   high: 4|5 → build another stump on these points;
                        low: 0|9 → Risky
Final decision tree

Start
└─ Credit?  (22|18)
   ├─ excellent (9|0) → Safe
   ├─ fair (9|4) → Term?
   │          ├─ 3 years (0|4) → Risky
   │          └─ 5 years (9|0) → Safe
   └─ poor (4|14) → Income?
              ├─ high (4|5) → Term?
              │          ├─ 3 years (0|2) → Risky
              │          └─ 5 years (4|3) → Safe
              └─ low (0|9) → Risky
Simple greedy decision tree learning:

• Pick the best feature to split on
• Learn a decision stump with this split
• For each leaf of the decision stump, recurse

When do we stop?
Stopping condition 1: All data agrees on y
If all data points in a node have the same y value (as in the leaves
of the final tree above), there is nothing more to do.

Stopping condition 2: Already split on all features
If a branch has already split on all possible features, there is
nothing left to split on: make it a leaf.
Greedy decision tree learning

• Step 1: Start with an empty tree
• Step 2: Select a feature to split the data
  (pick the feature split leading to the lowest classification error)
• For each split of the tree:
  • Step 3: If nothing more to do (stopping conditions 1 & 2),
    make predictions
  • Step 4: Otherwise, go to Step 2 & continue (recurse) on this split
Predictions with decision trees

T(xi) = traverse the learned decision tree with the loan application
xi, from the root down to a leaf, and output the leaf’s prediction ŷi.
Traversing a decision tree
xi = (Credit = poor, Income = high, Term = 5 years)
→ Credit = poor → Income = high → Term = 5 years → ŷi = Safe


Decision tree prediction algorithm

predict(tree_node, input)
• If current tree_node is a leaf:
    return majority class of data points in leaf
• else:
    next_node = child node of tree_node whose feature value
                agrees with input
    return predict(next_node, input)
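A runnable Python version of this algorithm; the nested-dict encoding of
the loan decision tree is a hypothetical representation chosen for the
sketch:

def predict(tree_node, x):
    if not isinstance(tree_node, dict):           # leaf: majority class stored directly
        return tree_node
    child = tree_node[x[tree_node['feature']]]    # child agreeing with the input
    return predict(child, x)

tree = {'feature': 'Credit',
        'excellent': 'Safe',
        'fair': {'feature': 'Term', '3 years': 'Risky', '5 years': 'Safe'},
        'poor': {'feature': 'Income',
                 'high': {'feature': 'Term', '3 years': 'Risky', '5 years': 'Safe'},
                 'low': 'Risky'}}

print(predict(tree, {'Credit': 'poor', 'Income': 'high', 'Term': '5 years'}))  # Safe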
Multiclass classification
Multiclass prediction

Input: xi (loan application) → Classifier MODEL → Output: ŷi
(predicted class ∈ {Safe, Risky, Danger})
Multiclass decision stump

N = 40, 1 feature, 3 classes (Safe / Risky / Danger)

Credit      y          Root: 18 Safe | 12 Risky | 10 Danger
excellent   safe       Credit?
fair        risky      ├─ excellent: 9 | 2 | 1 → Safe
fair        safe       ├─ fair:      6 | 9 | 2 → Risky
poor        danger     └─ poor:      3 | 1 | 7 → Danger
excellent   risky
fair        safe
poor        danger
poor        safe
fair        safe
…           …
Decision tree learning: Real-valued features
How do we use real-valued inputs?

Income Credit Term y


$105 K excellent 3 yrs Safe
$112 K good 5 yrs Risky
$73 K fair 3 yrs Safe
$69 K excellent 5 yrs Safe
$217 K excellent 3 yrs Risky
$120 K good 5 yrs Safe
$64 K fair 3 yrs Risky
$340 K excellent 5 yrs Safe
$60 K good 3 yrs Risky
Split on each numeric value?

Danger: each branch (Income = $30K, $31.4K, $39.5K, $61.1K, $91.3K, …)
may contain only one data point per node, so we can’t trust the
predictions (overfitting).
Alternative: Threshold split

Root: 22 Safe | 18 Risky, split on the feature Income:
    Income < $60K:    8 Safe | 13 Risky
    Income >= $60K:  14 Safe | 5 Risky

Each subset (e.g., Income >= $60K) keeps many data points →
lower chance of overfitting.
Threshold splits in 1-D

[Figure: Safe/Risky points along the Income axis from $10K to $120K;
the threshold split is the line Income = $60K, separating
Income < $60K from Income >= $60K]
Visualizing the threshold split (two features: Age, Income)

[Figure: scatter of points in the (Age, Income) plane; the threshold
split is the vertical line Age = 38]

Split on Age >= 38: predict Risky for age < 38 (left region) and
Safe for age >= 38 (right region).
Depth 2: Split on Income >= $60K

[Figure: within the Age >= 38 region, the horizontal line
Income = $60K adds a second split]

Each split partitions the 2-D space into regions:
  Age < 38;  Age >= 38 & Income >= $60K;  Age >= 38 & Income < $60K
Summary of decision trees
What you can do now:

• Define a decision tree classifier
• Interpret the output of a decision tree
• Learn a decision tree classifier using a greedy algorithm
• Traverse a decision tree to make predictions
  - Majority class predictions
  - Probability predictions
  - Multiclass classification
Clustering
CAMI16: Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications
Supervised Learning

Training set: {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))}

Unsupervised Learning

Training set: {x(1), x(2), …, x(m)}   (no labels y)
K-means algorithm
Input:
- K (number of clusters)
- Training set {x(1), x(2), …, x(m)}, x(i) ∈ ℝⁿ
  (drop the x0 = 1 convention)

Randomly initialize K cluster centroids μ1, μ2, …, μK ∈ ℝⁿ
Repeat {
    for i = 1 to m
        c(i) := index (from 1 to K) of the cluster centroid
                closest to x(i)
    for k = 1 to K
        μk := average (mean) of the points assigned to cluster k
}
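A minimal NumPy sketch of this loop; the two-blob data set, K = 2, and
the fixed iteration count are assumptions made for the example:

import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
K = 2
mu = X[rng.choice(len(X), K, replace=False)]   # random initialization

for _ in range(10):
    # cluster assignment step: c_i = index of the closest centroid
    c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
    # move centroid step: mu_k = mean of the points assigned to cluster k
    mu = np.array([X[c == k].mean(axis=0) for k in range(K)])

print(mu)   # one centroid near (0, 0), the other near (5, 5)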
K-means for non-separated clusters
T-shirt sizing: [Figure: Weight vs Height scatter; even without
well-separated clusters, K-means can partition the population into
size groups]
K-means optimization objective
c(i) = index of cluster (1, 2, …, K) to which example x(i) is currently
       assigned
μk = cluster centroid k (μk ∈ ℝⁿ)
μc(i) = cluster centroid of the cluster to which example x(i) has been
        assigned

Optimization objective:
    J(c(1), …, c(m), μ1, …, μK) = (1/m) Σi=1..m ‖x(i) − μc(i)‖²
    minimize J over c(1), …, c(m), μ1, …, μK
Random initialization
Should have K < m.

Randomly pick K training examples.

Set μ1, …, μK equal to these K examples.
Local optima: with unlucky initializations, K-means can converge to a
poor clustering, so use multiple random initializations:

For i = 1 to 100 {
    Randomly initialize K-means.
    Run K-means. Get c(1), …, c(m), μ1, …, μK.
    Compute the cost function (distortion) J(c(1), …, c(m), μ1, …, μK)
}
Pick the clustering that gave the lowest cost J.
Right value of K?
Choosing the value of K
Elbow method: plot the cost function J against the number of clusters
K (1, 2, …, 8) and pick K at the “elbow”, where the decrease in J
flattens out; often the elbow is not clear-cut.
Choosing the value of K
Sometimes, you’re running K-means to get clusters to use for some
later/downstream purpose. Evaluate K-means based on a metric for
how well it performs for that later purpose.

E.g. T-shirt sizing: [Figures: Weight vs Height clustered with a
smaller K (sizes S, M, L) and with a larger K (more size groups)]
Thank You!
Random Forest
CAMI16: Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Loan Application

I want to buy a new house! → Loan Application, judged on:
Credit ★★★★, Income ★★★, Term ★★★★★, Personal Info ★★★

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Decision Tree

T(xi) = traverse the decision tree (as in the earlier lecture) from
the root to a leaf to obtain the prediction ŷi for a loan application xi.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Decision Tree
• Non-linear classifier
• Easy to use
• Easy to interpret
• Susceptible to overfitting

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Ensemble Learning

Original training data D
Step 1: Create multiple data sets D1, D2, …, Dt−1, Dt
Step 2: Build multiple classifiers C1, C2, …, Ct−1, Ct
Step 3: Combine the classifiers into C*
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Bootstrapping

Resampling of the observed dataset: create multiple datasets (each of
equal size to the observed dataset), each of which is obtained by
random sampling with replacement from the original dataset.
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli 6


Random Forests
• Random forests (RF) are a combination of tree predictors
• Each tree depends on the values of a random vector sampled
independently
• The generalization error depends on the strength of the
individual trees and the correlation between them

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Random Forest Classifier

Training data: N examples × M features.
1. Create bootstrap samples from the training data.
2. Construct a decision tree from each bootstrap sample.
3. To classify a new example, run it down every tree and take the
   majority vote.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


The Random Forests Algorithm
Given a training set D:
For i = 1 to k do:
    Build subset Di by sampling with replacement from D
    Learn tree Ti from Di
        At each node: choose the best split from a random subset of
        m features
        Each tree grows to the largest extent; no pruning

For prediction (combining the k trees):
    Regression: average all k predictions from all k trees
    Classification: majority vote among all k trees

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
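A minimal scikit-learn sketch of this algorithm; the synthetic data set
stands in for the loan examples, and the hyperparameter values are
illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rf = RandomForestClassifier(
    n_estimators=100,      # k trees, each grown on a bootstrap sample
    max_features='sqrt',   # random subset of m = sqrt(M) features per split
    random_state=0,
)
rf.fit(X, y)
print(rf.predict(X[:5]))   # classification: majority vote across the trees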



Why are we considering a random sample of m


predictors instead of all M predictors for splitting?

• Suppose that we have a very strong predictor in the data set
along with a number of other moderately strong predictors;
then in the collection of bagged trees, most or all of them will
use the very strong predictor for the first split!

• All bagged trees will look similar. Hence all the predictions
from the bagged trees will be highly correlated

• Averaging many highly correlated quantities does not lead to


a large variance reduction, and thus random forests
decorrelates the bagged trees leading to more reduction in
variance

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Features of Random Forests
• Random forests require less training time.
• They can also be used for regression.
• One-vs-all works well in most multi-class classification cases.
• It is unexcelled in accuracy among current algorithms.
• It runs efficiently on large databases.
• It has methods for balancing error in class-population-unbalanced
  data sets.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Thank You

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


k Nearest Neighbours (kNN)
CAMI16: Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Instance-based Classification

• Similar instances have similar


classification
• No clear separation between
the three phases (training,
testing, and usage) of
classification
• It is a lazy classifier, as
opposed to eager classifier

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Eager vs Lazy Classification

Eager:
• Model is computed before classification
• Model is independent of the test instance
• Test instance is not included in the training data
• Avoids too much work at classification time
• Model is not accurate for each instance

Lazy:
• Model is computed during classification
• Model is dependent on the test instance
• Test instance is included in the training data
• High accuracy for models at each instance level

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


k-Nearest Neighbour

Learning by analogy
Tell me who your friends are and I’ll tell you who you are

• An instance is assigned to the most common class


among the instance similar to it
• How to measure similarity between instances?
• How to choose the most common class?

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


How does it work?

Initialization, define k

Compute distance

Sort the distances

Take k nearest neighbours

Assign the majority class label

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
Comparing Objects
• Problem: measure similarity between instances
• different types of data
• Numbers
• Text
• Images
• Geolocation
• Booleans
• Solution: Convert all features of the instances into
numerical values
• Represent instances as vectors of features in an n
dimensional space

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


How to measure distance?
• Euclidean distance
    D(X, Y) = √( Σi=1..n (xi − yi)² )

• Manhattan distance
    D(X, Y) = Σi=1..n |xi − yi|

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


How to choose k?
• Classification is sensitive to the correct selection of k

• Small k?
  • Captures fine structures
  • Influenced by noise
• Larger k?
  • Less precise, higher bias

• A common rule of thumb: k ≈ √n

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Example

C1 = [(1,7), (1,12), (2,7), (2,9), (2,11), (3,6), (3,10), (3.5,8)];
C2 = [(2.5,9), (3.5,3), (5,3), (6,1), (3,2), (4,2), (5.5,4), (7,2)];

[Figures: the two classes plotted, and a query point classified by its
nearest neighbours]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Example
Query instance: (3, 7), k = 3

x1   x2   y      Euclidean distance to (3,7)     Rank   In 3-NN?
7    7    Bad    √((7−3)² + (7−7)²) = 4           3      Yes
7    4    Bad    √((7−3)² + (4−7)²) = 5           4      No
3    4    Good   √((3−3)² + (4−7)²) = 3           2      Yes
5    7    Good   √((5−3)² + (7−7)²) = 2           1      Yes

The majority of the 3 nearest neighbours (Good, Good, Bad) indicates
GOOD.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
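The same computation in a minimal NumPy sketch (3-NN with Euclidean
distance on the four training points above):

import numpy as np

X = np.array([[7, 7], [7, 4], [3, 4], [5, 7]], dtype=float)
y = np.array(['Bad', 'Bad', 'Good', 'Good'])
query, k = np.array([3.0, 7.0]), 3

dist = np.sqrt(((X - query) ** 2).sum(axis=1))   # Euclidean distances: 4, 5, 3, 2
nearest = np.argsort(dist)[:k]                   # indices of the k smallest
labels, counts = np.unique(y[nearest], return_counts=True)
print(labels[np.argmax(counts)])                 # majority vote: 'Good'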


Pros and Cons
• Pros
• Simple to implement and use
• Robust to noisy data by averaging k-nearest neighbours
• kNN classification is based solely on local information
• Cons
• O(n) for each instance to be classified
• More expensive to classify a new instance than with a model
• High memory storage required as compared to other
supervised learning algorithms.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Applications
• Banking system
  • kNN can be used in a banking system to predict whether an individual is fit for
    loan approval: does that individual have characteristics similar to those of
    defaulters?
• Calculating credit ratings
  • kNN algorithms can be used to find an individual's credit rating by comparing
    with persons having similar traits.
• Politics
  • With the help of kNN algorithms, we can classify a potential voter into classes
    like "Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'", or "Will
    Vote for Party 'BJP'".

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Thank You

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Artificial Neural Network
CAMI16: Data Analytics

Dr. Jitendra Kumar


Department of Computer Applications

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


What is this?
You see this:

But the camera sees this:

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Computer Vision: Car detection

(Training images are labelled "Cars" and "Not a car".)

Testing: what is this?
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
(Figure: each raw training image is passed to the learning algorithm and plotted by the
intensities of two chosen pixels, pixel 1 and pixel 2; the Cars and "Non"-Cars examples
fall into separable regions of this two-dimensional space.)

A 50 x 50 pixel image yields 2500 pixel-intensity features (7500 if RGB), giving the
feature vector

  [pixel 1 intensity, pixel 2 intensity, …, pixel 2500 intensity]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Neurons in the brain

• The brain consists of a


densely interconnected
set of nerve cells, or basic
information-processing
units, called neurons.
• The human brain
incorporates nearly 10
billion neurons and 60
trillion connections,
synapses, between them.

[Credit: US National Institutes of Health, National Institute on Aging]

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Biological Neuron vs Artificial Neuron

Biological Neural Network    Artificial Neural Network
Soma                         Neuron
Dendrite                     Input
Axon                         Output
Synapses                     Weight

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Artificial Neural Network
▪ Our brain can be considered as a highly complex, non-linear and parallel
information-processing system.
▪ Information is stored and processed in a neural network simultaneously
throughout the whole network, rather than at specific locations. In other
words, in neural networks, both data and its processing are global rather
than local.
▪ An artificial neural network consists of a number of very simple processors,
also called neurons, which are analogous to the biological neurons in the
brain.
▪ The neurons are connected by weighted links passing signals from one
neuron to another.
▪ The output signal is transmitted through the neuron’s outgoing connection.
The outgoing connection splits into a number of branches that transmit the
same signal.
▪ The outgoing branches terminate at the incoming connections of other
neurons in the network.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


The neuron as a simple computing
element (Diagram of a neuron)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli




Can a neuron learn from task?
• In 1958, Frank Rosenblatt introduced a training
algorithm that provided the first procedure for training a
simple ANN: a perceptron.
• The perceptron is the simplest form of a neural network.
It consists of a single neuron with adjustable synaptic
weights and a hard limiter with a threshold (bias).

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


The Perceptron

• The operation of Rosenblatt's perceptron is based on the McCulloch and Pitts
  neuron model: a linear combiner followed by a hard limiter.
• The weighted sum of the inputs is applied to the hard limiter, which produces an
  output equal to +1 if its input is positive and -1 if it is negative.
• The aim of the perceptron is to classify inputs $x_1, x_2, \ldots, x_n$ into one
  of two classes, say A1 and A2.
• In the case of an elementary perceptron, the n-dimensional space is divided by a
  hyperplane into two decision regions. The hyperplane is defined by the linearly
  separable function:

  $\sum_{i=1}^{n} x_i w_i - \theta = 0$

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Linear separability in the perceptron

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Negation
$x_1 \in \{0,1\}$,  $y =$ NOT $x_1$

A single neuron with bias weight +10 and input weight -20:

  $y = g(10 - 20 x_1)$

x1   y
0    1
1    0
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Logical OR
$x_1, x_2 \in \{0,1\}$,  $y = x_1$ OR $x_2$

Bias weight -10, input weights 20 and 20:

  $y = g(-10 + 20 x_1 + 20 x_2)$

x1   x2   y
0    0    0
0    1    1
1    0    1
1    1    1
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
Logical AND
$x_1, x_2 \in \{0,1\}$,  $y = x_1$ AND $x_2$

Bias weight -30, input weights 20 and 20:

  $y = g(-30 + 20 x_1 + 20 x_2)$

x1   x2   y
0    0    0
0    1    0
1    0    0
1    1    1
Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
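The three gates can be checked numerically with a sigmoid neuron and the weights shown
on the slides; this is an illustrative sketch, not library code:

import math

def g(z):
    # sigmoid: close to 0 for large negative z, close to 1 for large positive z
    return 1 / (1 + math.exp(-z))

def neuron(bias, weights, inputs):
    # weighted sum of the inputs plus the bias, passed through the activation
    return g(bias + sum(w * x for w, x in zip(weights, inputs)))

NOT = lambda x1: neuron(10, [-20], [x1])                # Negation
OR  = lambda x1, x2: neuron(-10, [20, 20], [x1, x2])    # Logical OR
AND = lambda x1, x2: neuron(-30, [20, 20], [x1, x2])    # Logical AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(OR(x1, x2)), round(AND(x1, x2)))  # reproduces both truth tables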
How does the perceptron learn its
classification tasks?
• This is done by making small adjustments in the
weights to reduce the difference between the predicted
and desired outputs of the perceptron.
• The initial weights are randomly assigned, usually in the
range [-0.5, 0.5], and then updated to obtain the output
consistent with the training examples.
• If at iteration p the predicted output is Y(p) and the desired output is Yd(p),
  then the error is given by:

  $e(p) = Y_d(p) - Y(p)$

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


The perceptron learning rule

• The rule:

  $w_i(p+1) = w_i(p) + \alpha \cdot x_i(p) \cdot e(p)$,  where p = 1, 2, 3, …

• $\alpha$ is the learning rate, a positive constant less than unity.

• The perceptron learning rule was first proposed by Rosenblatt in 1960. Using this
  rule we can derive the perceptron training algorithm for classification tasks.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Perceptron’s training algorithm
• Step 1: Initialisation
  • Set initial weights $w_1, w_2, \ldots, w_n$ and threshold (bias) $\theta$ to
    random numbers in the range [-0.5, 0.5].
  • If the error, e(p), is positive, we need to increase the perceptron output Y(p);
    if it is negative, we need to decrease Y(p).

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Perceptron’s training algorithm
• Step 2: Activation
  • Activate the perceptron by applying inputs $x_1(p), x_2(p), \ldots, x_n(p)$ and
    desired output $Y_d(p)$.
  • Calculate the actual output at iteration p = 1:

    $Y(p) = \mathrm{step}\left[\sum_{i=1}^{n} x_i(p)\, w_i(p) - \theta\right]$

    where n is the number of perceptron inputs, and step is a step activation
    function.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Perceptron’s training algorithm
• Step 3: Weight training
  • Update the weights of the perceptron:

    $w_i(p+1) = w_i(p) + \Delta w_i(p)$

    where $\Delta w_i(p)$ is the weight correction at iteration p, computed by the
    delta rule:

    $\Delta w_i(p) = \alpha \cdot x_i(p) \cdot e(p)$

• Step 4: Iteration
  • Increase iteration p by one, go back to Step 2, and repeat the process until
    convergence.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
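A compact sketch of Steps 1-4, here learning the logical AND operation. The step
activation, the delta rule and the [-0.5, 0.5] initial range follow the slides;
treating the threshold as a trainable weight on a constant input of -1 is a common
variant added here so the sketch converges from any initialisation:

import random

def step(z):
    # hard limiter: 1 if its input is non-negative, else 0 (0/1 outputs to
    # match the truth tables above)
    return 1 if z >= 0 else 0

def train_perceptron(data, alpha=0.1, max_epochs=1000):
    # Step 1: initialise weights and threshold randomly in [-0.5, 0.5]
    n = len(data[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n)]
    theta = random.uniform(-0.5, 0.5)
    for _ in range(max_epochs):
        errors = 0
        for x, y_d in data:
            # Step 2: activation  Y(p) = step[ sum_i x_i(p) w_i(p) - theta ]
            y = step(sum(xi * wi for xi, wi in zip(x, w)) - theta)
            e = y_d - y                      # e(p) = Yd(p) - Y(p)
            errors += abs(e)
            # Step 3: delta rule  delta_w_i(p) = alpha * x_i(p) * e(p)
            w = [wi + alpha * xi * e for wi, xi in zip(w, x)]
            theta -= alpha * e               # threshold as weight on constant input -1
        # Step 4: iterate until convergence (an epoch with no errors)
        if errors == 0:
            break
    return w, theta

AND_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, theta = train_perceptron(AND_data)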


Logical AND

(slide figure not reproduced)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Two-dimensional plots of basic logical
operations

A perceptron can learn the basic operations like AND, OR, and NOT,
but it cannot learn more complex functions such as XOR

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Multilayer Perceptron
• A multilayer perceptron is a feedforward neural network
with one or more hidden layers

• The network consists of an input layer of source neurons, at least one middle or
hidden layer of computational neurons, and an output layer of computational neurons

• The input signals are propagated in a forward direction
on a layer-by-layer basis

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Multilayer Perceptron

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Multilayer Perceptron
$a_i^{(j)}$ = "activation" of unit $i$ in layer $j$
$w^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to
layer $j+1$

$a_1^{(2)} = g(w_{10}^{(1)} x_0 + w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3)$
$a_2^{(2)} = g(w_{20}^{(1)} x_0 + w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3)$
$a_3^{(2)} = g(w_{30}^{(1)} x_0 + w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3)$
$Y = a_1^{(3)} = g(w_{10}^{(2)} a_0^{(2)} + w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)})$

If the network has $S_j$ units in layer $j$ and $S_{j+1}$ units in layer $j+1$, then
$w^{(j)}$ will be of dimension $S_{j+1} \times (S_j + 1)$

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
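The dimension rule is easy to sanity-check with NumPy shapes; the layer sizes below
match the slide's 3-3-1 network, and this is purely an illustration:

import numpy as np

s = [3, 3, 1]                        # units in layers 1 (input), 2 (hidden), 3 (output)
w1 = np.zeros((s[1], s[0] + 1))      # maps layer 1 -> 2; the +1 column multiplies x0 = 1
w2 = np.zeros((s[2], s[1] + 1))      # maps layer 2 -> 3; the +1 multiplies a0(2) = 1
print(w1.shape, w2.shape)            # (3, 4) (1, 4), i.e. S_{j+1} x (S_j + 1)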


Forward propagation: Vectorized implementation
$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}$,   $z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}$

$z^{(2)} = w^{(1)} x$
$a^{(2)} = g(z^{(2)})$
Add $a_0^{(2)} = 1$
$z^{(3)} = w^{(2)} a^{(2)}$
$Y = a^{(3)} = g(z^{(3)})$

This is the vectorised form of the element-wise equations on the previous slide, e.g.
$a_1^{(2)} = g(w_{10}^{(1)} x_0 + w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3)$.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Non-linear classification example:
XOR/XNOR
$x_1, x_2$ are binary (0 or 1); the target is $y = x_1$ XOR $x_2$ (or its complement,
XNOR).

(Figures: scatter plots in the $x_1$–$x_2$ plane showing that the two classes cannot
be separated by a single straight line.)

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Putting it together (𝒙𝟏 XNOR 𝒙𝟐 )
Hidden layer (reusing the gate weights above):

  $a_1^{(2)} = g(-30 + 20 x_1 + 20 x_2)$        ($x_1$ AND $x_2$)
  $a_2^{(2)} = g(10 - 20 x_1 - 20 x_2)$         ((NOT $x_1$) AND (NOT $x_2$))

Output layer:

  $y = g(-10 + 20 a_1^{(2)} + 20 a_2^{(2)})$    ($a_1^{(2)}$ OR $a_2^{(2)}$)

x1   x2   a1(2)   a2(2)   y
0    0    0       1       1
0    1    0       0       0
1    0    0       0       0
1    1    1       0       1

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
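Combining the vectorised forward propagation with the weights above gives a working
XNOR network; a small NumPy sketch (variable names are illustrative):

import numpy as np

def g(z):
    return 1 / (1 + np.exp(-z))     # sigmoid, applied element-wise

w1 = np.array([[-30, 20, 20],       # hidden unit 1: x1 AND x2
               [ 10, -20, -20]])    # hidden unit 2: (NOT x1) AND (NOT x2)
w2 = np.array([[-10, 20, 20]])      # output unit: a1(2) OR a2(2)

def forward(x1, x2):
    x = np.array([1, x1, x2])           # prepend x0 = 1
    a2 = g(w1 @ x)                      # z(2) = w(1) x,  a(2) = g(z(2))
    a2 = np.concatenate(([1], a2))      # add a0(2) = 1
    return g(w2 @ a2)[0]                # Y = a(3) = g(z(3))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(forward(x1, x2)))   # 1, 0, 0, 1 -> x1 XNOR x2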


Learning
• The desired output is unavailable at the hidden layer
• Hidden-layer neurons cannot be observed through the
input-output behaviour
• Typically, commercial neural network applications are built using three or four
layers (one or two hidden layers). Each layer may contain 10 to 1,000 neurons.
• Experimental neural network applications can have five or six layers (three or
four hidden layers), and each layer may have millions of neurons.

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Backpropagation of error

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli


Backpropagation Algorithm
1. Select a pattern from the training set and present it to the network
2. Compute activations and signals of input, hidden and output neurons
3. Compute the error over the output neurons by comparing the generated outputs with
   the desired outputs
4. Use the error computed in Step 3 to calculate the change in the hidden-to-output
   layer weights and the change in the input-to-hidden layer weights, such that a
   global error measure gets reduced
5. Update all weights of the network in accordance with the changes computed in
   Step 4:
   Hidden-to-output layer weights:  $w_{hj}(p+1) = w_{hj}(p) + \Delta w_{hj}(p)$
   Input-to-hidden layer weights:   $w_{ih}(p+1) = w_{ih}(p) + \Delta w_{ih}(p)$
6. Repeat Steps 1 through 5 until the global error falls below a predefined threshold

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
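A minimal sketch of Steps 1-6 for a network with one hidden layer, trained on XOR with
sigmoid units and squared error. Pattern-by-pattern updates follow the algorithm above;
the layer sizes, learning rate, random seed and stopping threshold are illustrative
choices:

import numpy as np

rng = np.random.default_rng(0)
g = lambda z: 1 / (1 + np.exp(-z))      # sigmoid
g_prime = lambda a: a * (1 - a)         # its derivative, written in terms of a = g(z)

# XOR patterns; the leading 1 in each input is the bias term
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
Y = np.array([0.0, 1.0, 1.0, 0.0])

w_ih = rng.uniform(-0.5, 0.5, (2, 3))   # input -> hidden (2 hidden units)
w_ho = rng.uniform(-0.5, 0.5, (1, 3))   # hidden -> output

alpha = 0.5
for epoch in range(20000):
    global_error = 0.0
    for x, y_d in zip(X, Y):
        # Steps 1-2: present a pattern, compute activations forward
        a_h = g(w_ih @ x)
        a_hb = np.concatenate(([1.0], a_h))     # hidden layer with bias unit
        y = g(w_ho @ a_hb)[0]
        # Step 3: output error
        e = y_d - y
        global_error += e ** 2
        # Step 4: backpropagate the error to get the weight changes
        delta_o = e * g_prime(y)
        delta_h = g_prime(a_h) * w_ho[0, 1:] * delta_o
        # Step 5: update hidden->output and input->hidden weights
        w_ho += alpha * delta_o * a_hb
        w_ih += alpha * np.outer(delta_h, x)
    # Step 6: stop once the global error falls below a predefined threshold
    if global_error < 1e-3:
        break

With two hidden units XOR is learnable, though the number of epochs needed depends on
the random initialisation.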


Thank You!

Dr. Jitendra Kumar National Institute of Technology Tiruchirappalli
