(Figure: types of analytics questions, "Why is it happening?", "What is likely to happen?", "What should I do?", plotted by complexity against value added to the company.)
• Statistics is a branch of mathematics dealing with data collection, organization, analysis, and interpretation
• It is used to find trends and changes in data
• Analysts read the data through statistical measures to arrive at a conclusion
https://www.lynda.com/Excel-tutorials/Excel-Statistics-Essential-Training-1/5026557-2.html
https://www.springboard.com/blog/data-cleaning/
https://www.tehrantimes.com/news/438777/Iran-develops-first-integrated-health-data-visualization-system
Traditional programming: Input (Data) + Program → Output (Data)
Machine learning: Input (Data) + Output (Data) → Program
• Helps in the planning of operations and the setting up of standards
• Mode: selects the most common value

Data item:  1  2  3  4  5  6  7  8  9
Frequency:  1  1  3  2  1  1  1  1  1

Mean: x̄ = (1 + 2 + 3 + 3 + 3 + 4 + 4 + 5 + 6 + 7 + 8 + 9) / 12 = 55/12 ≈ 4.583
Mode: x = 3
Person:           P1  P2  P3  P4  P5  P6  P7
Income (Million):  1   1   1   2   2   3  11

Mean = 3: every person could make 3M.
Median = 2: the poorer half of the population makes 2M or less.

The mean is sensitive to outliers. With 99 values of −1 and one value of 1,000,000:
• Median = −1
• Mean = ((−1) + (−1) + ⋯ + (−1) + 1000000)/100 = some positive number
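A quick numeric check of both examples, a minimal sketch using Python's statistics module (the data values come from the tables above):

```python
import statistics

incomes = [1, 1, 1, 2, 2, 3, 11]       # P1..P7, in millions
print(statistics.mean(incomes))         # 3   -> "every person could make 3M"
print(statistics.median(incomes))       # 2   -> poorer half makes 2M or less

skewed = [-1] * 99 + [1_000_000]        # 99 values of -1 and one huge outlier
print(statistics.median(skewed))        # -1
print(statistics.mean(skewed))          # 9999.01, "some positive number"
```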
https://www.selecthub.com/business-intelligence/statistical-software/
A model relates variables through parameters:
y = f(x1, x2, x3)
For example, in the model y = mx + c, x and y are the variables, while m and c are the parameters.
Regression Analysis
y = β0 + β1 x1 + β2 x2 + ⋯ + βk xk
If ∂y/∂(parameters) is independent of the parameters, then the model is linear.
y* = β0* + β1 x*

y_i = β0 + β1 x_i; i = 1, 2, ⋯, n

This model does not represent the true phenomenon exactly, so a random error term is added:
y_i = β0 + β1 x_i + ε_i; i = 1, 2, ⋯, n
where ε_i is the random error, the vertical distance between the observed point (x_i, y_i) and the line y = β0 + β1 x.
(Figure: the data points (x_1, y_1), (x_2, y_2), …, (x_n, y_n) scattered around the line, with errors ε_1, ε_2, …, ε_n.)
y_i = β0 + β1 x_i + ε_i

Least squares: choose β0 and β1 to minimize the sum of squared errors
SSE = Σ_{i=1}^n (y_i − β0 − β1 x_i)²
(Figure: SSE as a function of β0 and of β1, minimized at β0* and β1*.)

Setting ∂SSE/∂β0 = 0:
−2 Σ_{i=1}^n (y_i − β0 − β1 x_i) = 0
Σ y_i − nβ0 − β1 Σ x_i = 0
Dividing by n, with ȳ = (1/n) Σ y_i and x̄ = (1/n) Σ x_i:
ȳ − β0 − β1 x̄ = 0, so β0 = ȳ − β1 x̄

Setting ∂SSE/∂β1 = 0:
−2 Σ_{i=1}^n x_i (y_i − β0 − β1 x_i) = 0
Σ x_i y_i − β0 Σ x_i − β1 Σ x_i² = 0
Substituting β0 = ȳ − β1 x̄ and rearranging:
Σ x_i y_i − ȳ × n × (1/n) Σ x_i = β1 ( −x̄ × n × (1/n) Σ x_i + Σ x_i² )

β1 = ( Σ_{i=1}^n x_i y_i − n x̄ ȳ ) / ( Σ_{i=1}^n x_i² − n x̄² )
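A minimal sketch of these closed-form estimates in Python with NumPy (the helper name is illustrative; the example reuses the data from practice question 6 later in this document):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (beta0, beta1) minimizing the sum of squared errors."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    beta1 = (np.sum(x * y) - n * x_bar * y_bar) / (np.sum(x**2) - n * x_bar**2)
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Example with the data from practice question 6:
x = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 12])
y = np.array([7, 9, 10, 13, 15, 18, 19, 24, 25, 29])
print(fit_simple_ols(x, y))  # approximately (1.71, 2.30)
```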
These slides use content from the Machine Learning course on Coursera:
https://www.coursera.org/learn/machine-learning/home/
(Figure: Housing prices (Trichy, TN), price plotted against size (feet²).)

Training set → Learning Algorithm → hypothesis h
h maps the size of a house to an estimated price.
β's: the parameters. How to choose the β's?
(Figure: three example hypothesis lines for different parameter choices.)

Parameters: β0, β1
Cost Function: J(β0, β1) = (1/2m) Σ_{i=1}^m (h_β(x_i) − y_i)²
Goal: minimize J(β0, β1) over β0, β1
(Figure: for several values of β1, the hypothesis h_β(x) plotted against x (left) and the corresponding cost J(β1) plotted against β1 (right).)
(Figure: housing data, price (₹, in 100,000s) against size in feet² (x).)
Want: minimize J(β0, β1) over β0, β1.

Outline:
• Start with some β0, β1
• Keep changing β0, β1 to reduce J(β0, β1), until we hopefully end up at a minimum

Gradient descent, repeat until convergence:
βj := βj − α ∂J(β0, β1)/∂βj (for j = 0 and j = 1)
Update β0 and β1 simultaneously: compute both new values from the current β0, β1, then assign.

As we approach a local minimum, gradient descent will automatically take smaller steps. So, no need to decrease α over time.
Test of Significance
Example: fitting a quadratic model by least squares.

x:  30    35    40    50    60    65    70    75    80    90
y:  1.65  1.55  1.48  1.40  1.30  1.26  1.24  1.21  1.20  1.18

(Figure: y plotted against x, with the fitted curve.)

y is the vector of responses and X has rows [1, x_i, x_i²]; β = [β0, β1, β2]' solves the normal equations
X'Xβ = X'y

ŷ = 2.19826629 − 0.02252236x + 0.00012507x²
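A sketch of solving these normal equations with NumPy, using the slide's data (np.linalg.solve is used rather than an explicit matrix inverse):

```python
import numpy as np

x = np.array([30, 35, 40, 50, 60, 65, 70, 75, 80, 90], dtype=float)
y = np.array([1.65, 1.55, 1.48, 1.40, 1.30, 1.26, 1.24, 1.21, 1.20, 1.18])

X = np.column_stack([np.ones_like(x), x, x**2])   # rows [1, x_i, x_i^2]
beta = np.linalg.solve(X.T @ X, X.T @ y)          # X'X beta = X'y
print(beta)  # approximately [2.198, -0.0225, 0.000125], matching the slide
```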
Indicator (dummy) variables: y = β0 + β1 x1 + β2 x2 + ε, where x2 ∈ {0, 1}.
If x2 = 0: y = β0 + β1 x1 + ε
If x2 = 1: y = β0 + β1 x1 + β2·1 + ε = (β0 + β2) + β1 x1 + ε
so β2 shifts the intercept between the two groups.
𝑅𝑗2 is the coefficient of multiple determination resulting from regressing xj on the other k-1
regressor variables
(Figure: malignant? ((No) 0 vs (Yes) 1) plotted against tumor size.)

h_β(x) = g(β^T x), where g is the sigmoid (logistic) function
g(z) = 1 / (1 + e^(−z))
h_β(x) = 1 / (1 + e^(−β^T x))

Example: if x = [x0; x1] = [1; tumorSize] and h_β(x) = 0.7, the model estimates a 70% probability that the tumor is malignant.
Training set with m examples, h_β(x) = 1 / (1 + e^(−β^T x)).
If we reuse the squared-error cost
Cost(h_β(x^(i)), y^(i)) = ½ (h_β(x^(i)) − y^(i))²
then J(β) is non-convex, with many local optima; we need a cost that makes J(β) convex.
(Figure: a non-convex J(β) against β versus a convex J(β).)
Cost(h_β(x), y) = −log(h_β(x)) if y = 1; −log(1 − h_β(x)) if y = 0

If y = 1 (plot of −log(z) for z = h_β(x) ∈ [0, 1]):
Cost = 0 if y = 1 and h_β(x) = 1, but as h_β(x) → 0, Cost → ∞.
This captures the intuition that if h_β(x) = 0 (predict P(y = 1 | x; β) = 0) but actually y = 1, we penalize the learning algorithm by a very large cost.

If y = 0 (plot of −log(1 − h_β(x))): symmetrically, Cost = 0 at h_β(x) = 0 and Cost → ∞ as h_β(x) → 1.

Overall cost over the training set:
J(β) = −(1/m) Σ_{i=1}^m [ y^(i) log h_β(x^(i)) + (1 − y^(i)) log(1 − h_β(x^(i))) ]
Want min_β J(β). Gradient descent, repeat:
βj := βj − α ∂J(β)/∂βj

Working out the derivative, repeat:
βj := βj − α Σ_{i=1}^m (h_β(x^(i)) − y^(i)) x_j^(i)
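A minimal sketch of this update in Python (X is assumed to be an m × n design matrix whose first column is all ones, y a 0/1 label vector; the 1/m factor is folded into the step here):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, iters=1000):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ beta)                       # h_beta(x^(i)) for all i
        beta -= (alpha / len(y)) * (X.T @ (h - y))  # sum_i (h - y^(i)) x_j^(i)
    return beta
```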
Multiclass classification
(Figure: binary versus multiclass data in the (x1, x2) plane; one-vs-all reduces the multiclass problem to one binary boundary per class.)

Class 1, Class 2, Class 3:
h_β^(i)(x) = P(y = i | x; β)

One-vs-all: train a logistic regression classifier h_β^(i)(x) for each class i to predict the probability that y = i.
https://imjitendra.wordpress.com/
https://www.linkedin.com/in/dr-jitendra/
CAMI16: Data Analytics (Practice Questions)
2. A company is engaged in the packaging of superior quality tea in jars of 500 gm each.
The company is of the view that as long as jars contain 500 gm of tea, the process is in
control. The standard deviation is 50 gm. A sample of 225 jars is taken at random and
the sample average is found to be 510 gm. Has the process gone out of control?
3. A company manufacturing light bulbs is using two different processes, A and B. The life of light bulbs from process A has a normal distribution with mean µ1 and standard deviation σ1; similarly, for process B, it is µ2 and σ2. The data pertaining to the two processes are as follows:
Sample A Sample B
n1 = 16 n2 = 21
x̄1 = 1200hr x̄2 = 1300hr
σ1 = 60hr σ2 = 50hr
Verify that the variability of the two processes is the same. (Hint: use the F-statistic.)
4. Examine the claim of a battery producer that the batteries will last for 100 days, given that a sample study of 200 batteries showed a mean life of 90 days with a standard deviation of 15 days. Assume a normal distribution and test at the 5% level of significance.
5. A company has appointed four salesmen, S_A, S_B, S_C, and S_D, and observed their sales in three seasons: summer, winter, and monsoon. The figures (in Rs lakh) are given in the following table:
Using 5% level of significance, perform an analysis of variance on the above data and
interpret the results.
6. Find the regression equation using least-squares estimation for the data below:
X 2 3 4 5 6 7 8 9 10 12
Y 7 9 10 13 15 18 19 24 25 29
Principal Component Analysis
(PCA)
CAMI16: Data Analytics
(Figure: data compression, reducing 2D measurements (cm) to 1D.)
Country   | GDP (trillions of US$) | Per-capita GDP (thousands of intl. $) | Human Development Index | Life expectancy | Poverty Index (Gini as percentage) | Mean household income (thousands of US$) | …
Canada    | 1.577  | 39.17 | 0.908 | 80.7 | 32.6 | 67.293 | …
China     | 5.878  | 7.54  | 0.687 | 73   | 46.9 | 10.22  | …
India     | 1.632  | 3.41  | 0.547 | 64.7 | 36.8 | 0.735  | …
Russia    | 1.48   | 19.84 | 0.755 | 65.5 | 39.9 | 0.72   | …
Singapore | 0.223  | 56.69 | 0.866 | 80   | 42.5 | 67.1   | …
USA       | 14.527 | 46.86 | 0.91  | 78.3 | 40.8 | 84.3   | …
[resources from en.wikipedia.org]
Reduced to two features per country (call them z1 and z2):

Country   | z1  | z2
Canada    | 1.6 | 1.2
China     | 1.7 | 0.3
India     | 1.6 | 0.2
Russia    | 1.4 | 0.5
Singapore | 0.5 | 1.7
USA       | 2   | 1.5
…         | …   | …
Training set: x^(1), x^(2), …, x^(m)
Preprocessing (feature scaling/mean normalization): compute μ_j = (1/m) Σ_i x_j^(i) and replace each x_j^(i) with x_j^(i) − μ_j; scale features to comparable ranges if needed.

Compute the covariance matrix:
Sigma = (1/m) Σ_{i=1}^m x^(i) (x^(i))'

[U,S,V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce'*x;

Reconstruction from the compressed representation: x_approx = Ureduce · z ≈ x.

Choosing k: check if the retained variance
( Σ_{i=1}^k S_ii ) / ( Σ_{i=1}^n S_ii ) ≥ 0.99
so that at most 1% of the variance is lost.

Extract inputs: from the unlabeled dataset x^(1), …, x^(m) ∈ ℝ^n, obtain z^(1), …, z^(m) ∈ ℝ^k.

Uses:
- Compression
  - Reduce memory/disk needed to store data
  - Speed up learning algorithm
- Visualization
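A NumPy sketch of the same steps (the slide's snippets use Octave's svd; names here are illustrative and X is a placeholder m × n data matrix):

```python
import numpy as np

def pca(X, k):
    X = X - X.mean(axis=0)            # mean normalization
    Sigma = (X.T @ X) / len(X)        # covariance matrix, n x n
    U, S, Vt = np.linalg.svd(Sigma)
    U_reduce = U[:, :k]               # Ureduce = U(:, 1:k)
    Z = X @ U_reduce                  # z = Ureduce' * x for every example
    X_approx = Z @ U_reduce.T         # reconstruction x_approx
    retained = S[:k].sum() / S.sum()  # variance retained; want >= 0.99
    return Z, X_approx, retained
```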
Bayes' theorem: P(A|B) = P(A) P(B|A) / P(B)
P(A) = m/n (favorable outcomes over total outcomes)
For mutually exclusive events: P(A ∪ B) = P(A) + P(B)
In general: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Company |    |    |    | Total
TCS     | 22 | 28 | 18 | 68
L&T     | 34 | 25 | 30 | 89
IBM     | 19 | 32 | 21 | 72
Total   | 75 | 85 | 69 | 229
Conditional probability: P(B|A) = P(A ∩ B) / P(A)
Since P(A ∩ B) = P(B|A) P(A) = P(A|B) P(B), it follows that
P(A|B) = P(B|A) P(A) / P(B)

Example:
P(Fire|Smoke) = P(Fire) P(Smoke|Fire) / P(Smoke) = 9%?
Example
• You are planning a picnic today
• but the morning is cloudy
• Oh no! 50% of all rainy days start off cloudy!
• But cloudy mornings are common (about 40% of days
start cloudy)
• And this is usually a dry month (only 3 of 30 days tend
to be rainy, or 10%)
P(Rain|Cloud) = P(Rain) P(Cloud|Rain) / P(Cloud)
P(Rain|Cloud) = (0.1 × 0.5) / 0.4 = 0.125
A 12.5% chance of rain. Not too bad; you may have your picnic.
      | Blue | Not blue | Total
Man   | 5    | 35       | 40
Woman | 20   | 40       | 60
Total | 25   | 75       | 100

P(Man) = 40/100 = 0.4
P(Blue) = 25/100 = 0.25
P(Blue|Man) = 5/40 = 0.125

P(Man|Blue) = ?
P(Man|Blue) = P(Man) P(Blue|Man) / P(Blue) = (0.4 × 0.125) / 0.25 = 0.2
Example: predicting Play from weather data.

Outlook  | Yes | No | P(Yes) | P(No)
Sunny    | 3   | 2  | 3/9    | 2/5
Overcast | 4   | 0  | 4/9    | 0/5
Rainy    | 2   | 3  | 2/9    | 3/5
Total    | 9   | 5  | 100%   | 100%

Temp  | Yes | No | P(Yes) | P(No)
Hot   | 2   | 2  | 2/9    | 2/5
Mild  | 4   | 2  | 4/9    | 2/5
Cold  | 3   | 1  | 3/9    | 1/5
Total | 9   | 5  | 100%   | 100%

Play  | Count | P(Yes) or P(No)
Yes   | 9     | 9/14
No    | 5     | 5/14
Total | 14    | 100%
P(Yes|Today) ∝ P(Today's outlook | Yes) × P(Today's temp | Yes) × P(Yes)
P(No|Today) ∝ P(Today's outlook | No) × P(Today's temp | No) × P(No)
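A small Python sketch of this computation; the query day (Outlook = Sunny, Temp = Mild) is an illustrative assumption, not from the slide:

```python
p_yes, p_no = 9 / 14, 5 / 14
p_sunny_yes, p_sunny_no = 3 / 9, 2 / 5         # from the Outlook table
p_mild_yes, p_mild_no = 4 / 9, 2 / 5           # from the Temp table

score_yes = p_sunny_yes * p_mild_yes * p_yes   # proportional to P(Yes | Today)
score_no = p_sunny_no * p_mild_no * p_no       # proportional to P(No | Today)
print("Play" if score_yes > score_no else "Don't play")  # -> Play
```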
Machine learning: a machine that learns from data and learns from mistakes.

Definition: "changes in [a] system that ... enable [it] to do the same task or tasks drawn from the same population more efficiently and more effectively the next time." (Simon, 1983)
Traditional computing: Data + Program (the rules) → Output
Machine learning: Data + Output → Program (the learned rules)
Types of Machine Learning
Regression vs Classification
Example (regression): housing prices (Trichy, TN).

Size (feet²) | Price (₹, in 100,000s)
1416         | 232
1534         | 315
…            | …

(Figure: price (₹, in 100,000s) plotted against size (feet²).)
Training set → Learning Algorithm → Model (h)
h maps the size of a house to an estimated price.

h_β(x) = β0 + β1 x

Identify β0 and β1 so that h_β(x) is close to y.
(Figure: the housing data with a candidate line, price (₹, in 100,000s) against size (feet²).)
(Figure: three example lines: β0 = 1.5, β1 = 0; β0 = 0, β1 = 0.5; β0 = 1, β1 = 0.5.)
How to define closeness?

h_β(x_i) = β0 + β1 x_i; i = 1, 2, ⋯, m
ε_i = h_β(x_i) − y_i; i = 1, 2, ⋯, m
(Figure: each ε_i is the vertical distance between the line y = β0 + β1 x and the point (x_i, y_i).)

How to compute the total error?
a) Σ_{i=1}^m ε_i
b) Σ_{i=1}^m ε_i²
Option (a) lets positive and negative errors cancel, so option (b), the sum of squared errors, is used.

Cost function: J(β0, β1) = (1/2m) Σ_{i=1}^m (h_β(x_i) − y_i)²
Goal: min over β0, β1 of J(β0, β1)
(Figure: left, h_β(x) for fixed β1, a function of x, drawn for β1 = 1, β1 = 0.5, and β1 = 0; right, the cost J(β1) as a function of the parameter β1.)
Repeat {
  β1 := β1 − α (1/m) Σ_{i=1}^m (h_β(x_i) − y_i) x_i
}

Outline:
• Start with some β0, β1
• Keep changing β0, β1 to reduce J(β0, β1), until we hopefully end up at a minimum
(Figure: successive gradient-descent steps in the (β0, β1) plane, moving down the contours of J(β0, β1) toward a minimum.)
Classification example: malignant? ((No) 0 vs (Yes) 1) against tumor size.
Goal: 0 ≤ h_β(x_i) ≤ 1, so that an output such as h_β(x_i) = 0.7 can be read as a probability.
(Figure: two-class and multiclass data in the (x1, x2) plane; one-vs-all fits one binary boundary per class.)

Class 1, Class 2, Class 3:
h_β^(i)(x) = P(y = i | x; β), i = 1, 2, 3
Unsupervised Learning: find structure in unlabeled data.

Reinforcement Learning: an agent takes actions in an environment (possibly facing an opponent) and receives the resulting state and a reward as feedback.
Example: eight rooms/states R0 through R7; the agent moves between connected rooms. Allowed moves have reward 0, moves into the goal room have reward 100, and −1 marks moves that are not possible.

(Figure: the room graph R0 to R7, with edges labeled 0 and edges into the goal labeled 100.)

Q is initialized as an 8 × 8 matrix of zeros. The reward matrix R (rows are the current state, columns the next state) is:

R0: −1  −1  −1  −1   0  −1  −1   −1
R1: −1  −1  −1   0  −1  −1  −1  100
R2: −1  −1  −1   0  −1   0  −1   −1
R3: −1   0   0  −1   0  −1  −1   −1
R4:  0  −1  −1   0  −1  −1   0   −1
R5: −1  −1   0  −1  −1  −1  −1   −1
R6: −1  −1  −1  −1   0  −1  −1  100
R7: −1   0  −1  −1  −1  −1   0   −1
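A minimal Q-learning sketch for this example, assuming the standard update Q(s, a) = R(s, a) + γ·max Q(s', ·) with room 7 as the goal; the discount factor γ = 0.8 and the episode count are assumptions, not from the slides:

```python
import random
import numpy as np

R = np.array([
    [-1, -1, -1, -1,  0, -1, -1,  -1],   # R0
    [-1, -1, -1,  0, -1, -1, -1, 100],   # R1
    [-1, -1, -1,  0, -1,  0, -1,  -1],   # R2
    [-1,  0,  0, -1,  0, -1, -1,  -1],   # R3
    [ 0, -1, -1,  0, -1, -1,  0,  -1],   # R4
    [-1, -1,  0, -1, -1, -1, -1,  -1],   # R5
    [-1, -1, -1, -1,  0, -1, -1, 100],   # R6
    [-1,  0, -1, -1, -1, -1,  0,  -1],   # R7
])
Q = np.zeros_like(R, dtype=float)
gamma = 0.8                               # assumed discount factor

for _ in range(500):                      # training episodes
    s = random.randrange(8)               # start in a random room
    while s != 7:                         # walk until the goal room
        a = int(random.choice(np.where(R[s] >= 0)[0]))  # a valid next room
        Q[s, a] = R[s, a] + gamma * Q[a].max()          # Q-learning update
        s = a
```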
Machine learning pipeline:
Step 1: Data Collection
Step 2: Data Preparation
Step 3: Feature Extraction
Step 4: Model Training
Step 5: Model Testing
Step 6: Model Evaluation
jitendra@nitt.edu
https://imjitendra.wordpress.com/
I want to buy a new house!

A loan application is scored on several features, each with an importance rating:
• Credit History ★★★★
• Income ★★★
• Term ★★★★★
• Personal Info ★★★

Credit history: the applicant's record of repaying past credit.
Income: what's my income? Example: $80K per year.
Loan terms: how soon do I need to pay the loan? Example: 3 years, 5 years, …
Personal information: age, reason for the loan, marital status, … Example: home loan for a married couple.

Intelligent application: classify loan applications as Safe ✓ or Risky ✘.
Classifier review: input x_i → classifier → output ŷ, the predicted class, with ŷ_i = +1 (Safe) or ŷ_i = −1 (Risky).
Decision Tree: Intuitions

What does a decision tree represent?

Start
└─ Credit?
   ├─ excellent → Safe
   ├─ fair → Term?
   │    ├─ 3 years → Risky
   │    └─ 5 years → Safe
   └─ poor → Income?
        ├─ high → Term?
        │    ├─ 3 years → Risky
        │    └─ 5 years → Safe
        └─ low → Risky

For example, this tree says that 3-year loans with high income and poor credit history are risky.
Scoring a loan application: x_i = (Credit = poor, Income = high, Term = 5 years).
Traverse the tree: Credit = poor → Income?; Income = high → Term?; Term = 5 years → ŷ_i = Safe.
Decision tree learning task

(Figure: training pipeline; features x go through feature extraction h(x) into the ML model, which outputs ŷ; the ML algorithm adjusts the tree T(x) using a quality metric that compares ŷ against the true labels y.)

How do we learn a decision tree from data?
Decision tree learning problem: find the tree that best separates Safe from Risky loans; a branch where all the data agree (e.g., all Safe) needs nothing more done to it.

Compact visual notation: each node shows its counts of safe and risky loans; the root node holds all the data, N = 40 examples.
Decision stump: single-level tree. Loan status at the root: all the data, with Safe and Risky counts. Splitting on Credit divides the data among the intermediate nodes excellent, fair, and poor.

Making predictions with a decision stump (split on Credit?, or alternatively Term?): each intermediate node predicts the majority class of its data.

Error = # mistakes / # data points
Calculating classification error
• Step 1: ŷ = class of the majority of data in the node
• Step 2: calculate the classification error of predicting ŷ for this data

Loan status at the root: 22 Safe, 18 Risky. Predicting the majority class ŷ = Safe gives 22 correct and 18 mistakes:
Error = 18/40 = 0.45

Tree   | Classification error
(root) | 0.45
Choice 1: Split on credit history
Step 1: for each intermediate node, set ŷ = majority value.
Credit?: excellent (9 Safe, 0 Risky → Safe), fair (9, 4 → Safe), poor (4, 14 → Risky)
Mistakes: 0 + 4 + 4 = 8, so Error = 8/40 = 0.2

Choice 2: Split on term
Term?: 3 years (16 Safe, 4 Risky → Safe, 4 mistakes), 5 years (6, 14 → Risky, 6 mistakes)
Error = (4 + 6)/40 = 0.25

Tree            | Classification error
(root)          | 0.45
Split on credit | 0.2
Split on term   | 0.25
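A small Python check of these numbers (the helper name is illustrative); each node predicts its majority class, so its mistakes are the minority count:

```python
def split_error(nodes):
    """nodes: list of (num_safe, num_risky) pairs, one per child node."""
    mistakes = sum(min(s, r) for s, r in nodes)
    total = sum(s + r for s, r in nodes)
    return mistakes / total

print(split_error([(22, 18)]))                 # root: 0.45
print(split_error([(9, 0), (9, 4), (4, 14)]))  # split on credit: 0.2
print(split_error([(16, 4), (6, 14)]))         # split on term: 0.25
```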
Choice 1 vs Choice 2: splitting on credit (error 0.2) beats splitting on term (error 0.25), so the root is split on Credit.

Recurse: build a decision stump with the subset of data where Credit = fair, and another with the subset of data where Credit = poor.
Second level:

Credit?
├─ excellent (9 Safe, 0 Risky) → Safe
├─ fair (9, 4) → Term?
│    ├─ 3 years (0, 4) → Risky
│    └─ 5 years (9, 0) → Safe
└─ poor (4, 14) → Income?
     ├─ high (4, 5) → Term?
     │    ├─ 3 years (0, 2) → Risky
     │    └─ 5 years (4, 3) → Safe
     └─ low (0, 9) → Risky
Simple greedy decision tree learning
When do we stop???
Stopping condition 1: all data agrees on y. When all data points in a node have the same y value (for example the excellent node with 9 Safe, 0 Risky, or the low-income node with 0 Safe, 9 Risky), there is nothing more to do: the node becomes a leaf.
Stopping condition 2: already split on all features. When a branch has already split on every possible feature, there is nothing more to do, even if the node is still impure (for example the 5-years node under poor credit and high income, with 4 Safe, 3 Risky).
Greedy decision tree learning
(Figure: the ML algorithm produces the tree T(x) by optimizing the quality metric against the labels y.)
Decision tree model: a loan application x_i enters at the root and follows the Credit?/Income?/Term? splits down to a leaf, whose label (Safe or Risky) is the prediction ŷ_i.
Traversing a decision tree: for x_i = (Credit = poor, Income = high, Term = 5 years), start at the root, take the poor branch to Income?, the high branch to Term?, and the 5-years branch to a Safe leaf.
predict(tree_node, x), as runnable Python:

from dataclasses import dataclass, field

@dataclass
class TreeNode:
    feature: str = None                           # splitting feature; None for a leaf
    children: dict = field(default_factory=dict)  # branch value -> child node
    majority_class: str = None                    # prediction stored at a leaf

def predict(tree_node, x):
    # If the current tree_node is a leaf, return the majority class
    # of the data points in that leaf.
    if tree_node.feature is None:
        return tree_node.majority_class
    # Otherwise, descend into the child node whose branch value
    # agrees with the input, and recurse.
    next_node = tree_node.children[x[tree_node.feature]]
    return predict(next_node, x)
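Hypothetical usage, wiring up the loan tree from these slides (the same Term stump is reused under fair credit and under poor credit with high income):

```python
safe = TreeNode(majority_class="Safe")
risky = TreeNode(majority_class="Risky")
term = TreeNode("Term", {"3 years": risky, "5 years": safe})
income = TreeNode("Income", {"high": term, "low": risky})
root = TreeNode("Credit", {"excellent": safe, "fair": term, "poor": income})

x_i = {"Credit": "poor", "Income": "high", "Term": "5 years"}
print(predict(root, x_i))  # -> Safe
```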
Multiclass classification

Multiclass prediction: input x_i → classifier model → output ŷ_i, the predicted class, now one of Safe, Risky, or Danger.
Multiclass decision stump: N = 40, 1 feature (Credit), 3 classes.

Training rows (Credit → y): excellent → safe, fair → risky, fair → safe, poor → danger, excellent → risky, fair → safe, poor → danger, poor → safe, fair → safe, …

Loan status at the root: 18 Safe, 12 Risky, 10 Danger.
Split on Credit?:
excellent: 9 Safe, 2 Risky, 1 Danger
fair:      6 Safe, 9 Risky, 2 Danger
poor:      3 Safe, 1 Risky, 7 Danger
Decision tree learning: real-valued features

How do we use real-valued inputs? A numeric feature such as Income (values ranging from $10K to $120K) is split with a threshold.
Visualizing the threshold split: the threshold split is the line Age = 38.
(Figure: Income ($0K to $80K) against Age; a vertical line at Age = 38 splits the data.)
Split on Age >= 38.
(Figure: the same plane split at Age = 38, with one side predicting Safe.)
Depth 2: split on Income >= $60K.
(Figure: within the Age >= 38 region, a second split at Income = $60K.)
Each split partitions the 2-D space into regions:
• Age < 38
• Age >= 38, Income >= $60K
• Age >= 38, Income < $60K
Summary of decision trees
What you can do now
Unsupervised Learning
Training set: {x^(1), x^(2), …, x^(m)} (no labels y)
K-means algorithm

Input:
- K (number of clusters)
- Training set {x^(1), x^(2), …, x^(m)}, x^(i) ∈ ℝ^n (drop the x_0 = 1 convention)

Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ ℝ^n
Repeat {
  for i = 1 to m:
    c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
  for k = 1 to K:
    μ_k := average (mean) of the points assigned to cluster k
}
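A minimal NumPy sketch of this loop (X is a placeholder m × n data matrix, and the fixed iteration count stands in for a convergence check):

```python
import numpy as np

def kmeans(X, K, iters=100):
    rng = np.random.default_rng(0)
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)  # random init
    for _ in range(iters):
        # Cluster assignment step: c_i = index of the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Move centroid step: mu_k = mean of the points assigned to k.
        for k in range(K):
            if (c == k).any():
                mu[k] = X[c == k].mean(axis=0)
    return c, mu
```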
K-means for non-separated clusters: e.g., T-shirt sizing.
(Figure: customer weight against height; even without well-separated clusters, K-means partitions the customers into size groups.)
K-means optimization objective
c^(i) = index of the cluster (1, 2, …, K) to which example x^(i) is currently assigned
μ_k = cluster centroid k (μ_k ∈ ℝ^n)
μ_{c^(i)} = cluster centroid of the cluster to which example x^(i) has been assigned

Optimization objective:
J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) Σ_{i=1}^m ‖x^(i) − μ_{c^(i)}‖²
minimized over c^(1), …, c^(m) and μ_1, …, μ_K
Random initialization: should have K < m. Randomly pick K training examples and set μ_1, …, μ_K equal to these K examples.

Choosing K: one option is to plot the cost function J against the number of clusters K = 1, 2, …, 8 and look for an elbow. E.g., for T-shirt sizing on the weight/height data, K can also be chosen for the downstream purpose (a small number of sizes versus a larger one).
Thank You!
Random Forest
CAMI16: Data Analytics
Recap: a loan application is scored on Credit (★★★★), Income (★★★), Term (★★★★★), and Personal Info (★★★); a single decision tree maps the application through Credit?/Income?/Term? splits to ŷ_i ∈ {Safe, Risky}.
Step 1: Create multiple data sets D_1, D_2, …, D_{t−1}, D_t
Step 2: Build multiple classifiers C_1, C_2, …, C_{t−1}, C_t
Step 3: Combine the classifiers into C*
Bootstrapping: resampling the observed dataset to produce new datasets (each of equal size to the observed dataset), each obtained by random sampling with replacement from the original dataset.
(Figure: the training data (N examples, M features) is bootstrapped into several datasets of the same shape; a tree is trained on each, and the forest takes the majority vote of their predictions.)
For prediction:
Regression: average all k predictions from all k trees
Classification: majority vote among all k trees

• All bagged trees will look similar; hence all the predictions from the bagged trees will be highly correlated.
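A short scikit-learn sketch; random forests address the correlation issue above by also considering a random subset of features at each split (the dataset here is synthetic and purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)  # k = 100 trees
model.fit(X, y)
print(model.predict(X[:5]))  # majority vote across the 100 trees
```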
Eager:
• Model is computed before classification
• Model is independent of the test instance
• Test instance is not included in the training data
• Avoids too much work at classification time
• Model is not accurate for each instance

Lazy:
• Model is computed during classification
• Model is dependent on the test instance
• Test instance is included in the training data
• High accuracy at each instance level
Learning by analogy: "Tell me who your friends are and I'll tell you who you are."

Step 1: Initialization: define k.
Step 2: Compute the distance from the query point to every training point, e.g.
• Euclidean distance: D(X, Y) = sqrt( Σ_{i=1}^n (x_i − y_i)² )
• Manhattan distance: D(X, Y) = Σ_{i=1}^n |x_i − y_i|
Step 3: The majority class among the k nearest neighbours indicates the prediction (e.g., GOOD).

Choosing k:
• Small k? Captures fine structures, but is influenced by noise.
• Larger k? Less precise, higher bias.
• A common rule of thumb: k = √n.
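A minimal k-NN sketch in Python using the Euclidean distance above (the function name and default k are illustrative):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                    # k closest training points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                  # majority vote
```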
Computer vision example: car detection.

(Figure: a learning algorithm is trained on raw images of cars and "non"-cars; plotting two pixel intensities, pixel 1 against pixel 2, shows the two classes as point clouds.)

For 50 × 50 pixel images there are 2500 pixels (7500 if RGB), so the feature vector x = [pixel 1 intensity, pixel 2 intensity, …] is very large.
Logical NOT: y = g(10 − 20x1):
x1 | y
0  | 1
1  | 0

Logical OR: x1, x2 ∈ {0, 1}, y = x1 OR x2, computed by y = g(−10 + 20x1 + 20x2):
x1 x2 | y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 1
Logical AND: x1, x2 ∈ {0, 1}, y = x1 AND x2, computed by y = g(−30 + 20x1 + 20x2):
x1 x2 | y
0  0  | 0
0  1  | 0
1  0  | 0
1  1  | 1
How does the perceptron learn its
classification tasks?
• This is done by making small adjustments in the
weights to reduce the difference between the predicted
and desired outputs of the perceptron.
• The initial weights are randomly assigned, usually in the
range [-0.5, 0.5], and then updated to obtain the output
consistent with the training examples.
• If at iteration p the predicted output is Y(p) and the desired output is Y_d(p), then the error is given by:
e(p) = Y_d(p) − Y(p), where p = 1, 2, 3, …

The perceptron output is
Y(p) = step( Σ_{i=1}^n x_i(p) × w_i(p) − θ )
where n is the number of the perceptron inputs, and step is a step activation function.
• Step 4: Iteration
• Increase iteration p by one, go back to Step 2, and repeat the process until convergence.
A perceptron can learn basic operations like AND, OR, and NOT, but it cannot learn more complex functions such as XOR.
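A sketch of this learning rule in Python: step activation, weights initialized in [−0.5, 0.5], and updates proportional to the error e(p). The learning rate and the delta-rule update w_i := w_i + α·e·x_i are assumptions in the spirit of the text; the task here is AND (on XOR this loop would never converge):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_d = np.array([0, 0, 0, 1])                 # desired outputs: AND
rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=2)           # initial weights in [-0.5, 0.5]
theta = rng.uniform(-0.5, 0.5)               # threshold
alpha = 0.1                                  # assumed learning rate

for p in range(100):                         # iterations p = 1, 2, 3, ...
    errors = 0
    for x, yd in zip(X, y_d):
        y = int(x @ w - theta >= 0)          # Y(p) = step(sum x_i w_i - theta)
        e = yd - y                           # e(p) = Yd(p) - Y(p)
        w = w + alpha * e * x                # weight update
        theta = theta - alpha * e            # threshold update
        errors += abs(e)
    if errors == 0:                          # outputs match all examples
        break
```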
A multi-layer network overcomes this limitation. Forward propagation computes z^(2) = W^(1)x, applies the activation, and so on layer by layer.

Example: computing x1 XNOR x2 with one hidden layer:
• hidden unit a1 = x1 AND x2: weights (−30, 20, 20)
• hidden unit a2 = (NOT x1) AND (NOT x2): weights (10, −20, −20)
• output y = a1 OR a2: weights (−10, 20, 20)

x1 x2 | a1 a2 | y
0  0  | 0  1  | 1
0  1  | 0  0  | 0
1  0  | 0  0  | 0
1  1  | 1  0  | 1
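A small Python check of this construction with sigmoid units; it prints the XNOR truth table above:

```python
import numpy as np

def g(z):
    return 1 / (1 + np.exp(-z))

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = g(-30 + 20 * x1 + 20 * x2)   # x1 AND x2
        a2 = g(10 - 20 * x1 - 20 * x2)    # (NOT x1) AND (NOT x2)
        y = g(-10 + 20 * a1 + 20 * a2)    # a1 OR a2
        print(x1, x2, round(y))           # x1, x2, x1 XNOR x2
```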