Unit 2

Department of Computer Science and Engineering (CSE)
UNIVERSITY INSTITUTE OF
ENGINEERING
COMPUTER SCIENCE & ENGINEERING
Bachelor of Engineering
Statistical Method Using R-(20 SMT-460)
Prepared By: Mr. Munish Kumar(E16513)
Topic: Central Tendency

DISCOVER . LEARN . EMPOWER
University Institute of Engineering (UIE)

Statistical Method Using R-20 SMT-460
Unit-2 Discrete and continuous distribution, Regression

Regression:-> Regression analysis, its properties, various methods of
performing regression analysis, and numerical problems based on it.
Discrete distributions:-> Bernoulli distribution, binomial
distribution, Poisson distribution.
Continuous distributions:-> Uniform distribution, exponential
distribution, normal distribution, properties of the normal distribution,
area under the normal curve.

What is Correlation?
• Correlation (connection) is a statistical measure of the
relationship between two variables.
Examples:
1. As sunlight increases, temperature goes up.
2. If the price is higher, demand will be lower.

Correlation

1. Positive correlation:-> An increase in one variable is accompanied by an increase
in the other. For example, there is a positive correlation between smoking and alcohol
use: as alcohol use increases, so does smoking.
2. Negative correlation:-> As one variable increases, the other decreases, and vice
versa. Example: if the price is higher, demand will be lower.
3. Linear correlation:-> The ratio of change between the two variables remains constant.
Example: marks w.r.t. the topper of the class.
4. Curvilinear correlation:-> The ratio of change between the two variables does not
remain constant. Example: student strength in class.
5. Simple correlation:-> The relationship between two variables only. Example: sunlight
and temperature.
6. Partial correlation:-> The relationship between two variables while the effect of one
or more other variables is held constant. Example: temperature and yield, with rainfall
held constant.
7. Multiple correlation:-> The relationship between one variable and two or more other
variables taken together. Example: yield with temperature and rainfall together.
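
The sign of the correlation can be checked numerically. The sketch below computes the Pearson correlation coefficient for two small data sets; the numbers are hypothetical illustrations of the sunlight/temperature and price/demand examples, not data from the slides, and the function name `pearson_r` is our own.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data (not taken from the slides).
sunlight_hours = [2, 4, 6, 8, 10]
temperature = [18, 21, 25, 28, 33]
price = [10, 20, 30, 40, 50]
demand = [95, 80, 62, 45, 30]

r_positive = pearson_r(sunlight_hours, temperature)  # expected to be > 0
r_negative = pearson_r(price, demand)                # expected to be < 0
print(round(r_positive, 3), round(r_negative, 3))
```

A value near +1 indicates a strong positive correlation and a value near -1 a strong negative correlation.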

What is Regression?
Regression analysis measures the nature and extent of the relationship between
two or more variables, which enables us to make predictions. Regression analysis
is a mathematical measure of the average relationship between two or more
variables in terms of the original units of the data.
Correlation tells us how strongly the variables are related.
Regression tells us how the variables are related, i.e. the form of the relationship.
If X = a + b and Y = X + c, then clearly if the value of X changes, the value of Y
will change accordingly.

The term "regression" literally means "stepping back towards the average".
It was first used by the British biometrician Sir Francis Galton (1822-1911)
in connection with the inheritance of stature.
• In regression analysis there are two types of variables.
Dependent variable (regressed or explained variable):-> The variable
whose value is influenced or is to be predicted. The dependent variable
is denoted by "y".
Independent variable (regressor, predictor or explanatory
variable):-> The variable which influences the values or is used for prediction.
Independent variables are denoted by "x".

Properties of Regression
• The correlation coefficient is the geometric mean of the regression
coefficients: r = ±√(bxy · byx), where r takes the common sign of the two coefficients.
• If one of the regression coefficients is greater than unity, the other must
be less than unity.
• The arithmetic mean of the regression coefficients is greater than or equal to
the correlation coefficient r, provided r > 0.
• Regression coefficients are independent of a change of origin but not
of scale.
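
These properties can be checked on a small data set. The sketch below assumes the five-point data X = 1, 2, 3, 4, 5 and Y = 2, 5, 3, 8, 7 used in this unit's worked example; the helper name `regression_coefficients` is our own.

```python
import math

def regression_coefficients(xs, ys):
    """Return (byx, bxy) computed from the raw-sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    byx = numerator / (n * sum_x2 - sum_x ** 2)
    bxy = numerator / (n * sum_y2 - sum_y ** 2)
    return byx, bxy

xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]
byx, bxy = regression_coefficients(xs, ys)

# Property 1: r is the geometric mean of the coefficients (with their sign).
r = math.copysign(math.sqrt(byx * bxy), byx)

# Property 3: the arithmetic mean of the coefficients is at least r (r > 0).
arithmetic_mean = (byx + bxy) / 2

# Property 4: shifting the origin leaves the coefficients unchanged,
# while rescaling X changes them.
shifted = regression_coefficients([x + 10 for x in xs], [y - 3 for y in ys])
scaled = regression_coefficients([2 * x for x in xs], ys)

print(byx, bxy, round(r, 4), arithmetic_mean)
print(shifted, scaled)
```

Here byx is greater than 1 while bxy is less than 1, which also agrees with the second property.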

Types of Linear Regression
1. Simple linear regression

Simple linear regression models the relationship between a dependent variable
(output) and a single independent variable (input). Primarily, this regression type
describes the following:
• The strength of the relationship between the given variables.
Example: The relationship between pollution levels and rising
temperatures.
• The value of the dependent variable for a given value of the independent
variable.
Example: The pollution level at a specific temperature.


2. Multiple linear regression
Multiple linear regression establishes the relationship between independent
variables (two or more) and the corresponding dependent variable. Here, the
independent variables can be either continuous or categorical. This regression
type helps foresee trends, determine future values, and predict the impacts of
changes.
Example: Consider the task of calculating blood pressure. In this case, height,
weight, and amount of exercise can be considered independent variables. Here,
we can use multiple linear regression to analyze the relationship between the
three independent variables and one dependent variable, as all the variables
considered are quantitative.

3. Logistic Regression
Logistic regression, also referred to as the logit model, is applicable in cases
where there is one dependent variable and one or more independent variables. The
fundamental difference between multiple and logistic regression is that the
target variable in the logistic approach is discrete (binary or an ordinal value).
That is, the dependent variable is finite or categorical: either P or Q (binary
regression) or a range of limited options P, Q, R, or S.
Example: One can estimate the likelihood that a visitor accepts an offer on your
website (dependent variable). For analysis purposes, you can look at various visitor
characteristics such as the sites they came from, the count of visits to your site, and
their activity on your site (independent variables). This helps identify which
visitors are more likely to accept the offer. As a result, it allows you to make
better decisions on whether to promote the offer on your site or not.


4. Ordinal Regression
Ordinal regression involves one ordinal dependent variable and one or more
independent variables, which can be either ordinal or nominal. It models the
relationship between a dependent variable with multiple ordered levels and one
or more independent variables.
Example: Consider a survey where the respondents are supposed to answer
'agree' or 'disagree'. In some cases, such responses are of little help because one
cannot derive a definitive conclusion, which complicates generalizing the results.
However, you can observe a natural order in the categories by adding levels to the
responses, such as strongly agree, agree, disagree, and strongly disagree. Ordinal
regression thus helps in predicting a dependent variable that has multiple 'ordered'
categories using independent variables.


5. Multinomial logistic regression
Multinomial logistic regression (MLR) is performed when the dependent variable
is nominal with more than two levels. It specifies the relationship between one
dependent nominal variable and one or more continuous-level (interval, ratio, or
dichotomous) independent variables. Here, the nominal variable refers to a
variable with no intrinsic ordering.
Example: Multinomial logit can be used to model the program choices made by
school students. The program choices, in this case, refer to a vocational program,
sports program, and academic program. The choice of type of program can be
predicted by considering a variety of attributes, such as how well the students
can read and write on the subjects given, gender, and awards received by them.

Lines of Regression
If the variables in a bivariate distribution are related, the points in the scatter
diagram will cluster round some curve, called the "curve of regression".

If the curve is a straight line, it is called the line of regression and there is said to
be linear regression between the variables; otherwise the regression is said to be
curvilinear.

The line of regression is the line which gives the best estimate of the value of one
variable for any specific value of the other variable. Thus the line of regression is
the line of "best fit" and is obtained by the principle of least squares.

Regression Equation using Normal Equations: Examples
Question 1: Calculate the regression equation of X on Y for the following data by
the least squares method.
Solution:
X on Y => $X = a + bY$
$\sum X = Na + b\sum Y$ ... (1)
$\sum XY = a\sum Y + b\sum Y^2$ ... (2)


Question 2: Calculate the regression equation of Y on X for the following data by
the least squares method, and estimate Y when X = 10.
Solution:
Y on X => $Y = a + bX$
$\sum Y = Na + b\sum X$ ... (1)
$\sum XY = a\sum X + b\sum X^2$ ... (2)
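
The two normal equations form a 2×2 linear system in a and b. The sketch below solves it by Cramer's rule. Since the data tables for these questions are not reproduced here, it assumes the five-point data X = 1, 2, 3, 4, 5 and Y = 2, 5, 3, 8, 7 from this unit's worked example; the function name `solve_normal_equations` is our own.

```python
from fractions import Fraction

def solve_normal_equations(us, vs):
    """Fit v = a + b*u by solving the two normal equations:
         sum(v)   = N*a      + b*sum(u)
         sum(u*v) = a*sum(u) + b*sum(u^2)
    Inputs are assumed to be integers so that Fraction stays exact."""
    n = len(us)
    sum_u, sum_v = sum(us), sum(vs)
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    det = n * sum_u2 - sum_u ** 2          # determinant of the system
    a = Fraction(sum_v * sum_u2 - sum_u * sum_uv, det)
    b = Fraction(n * sum_uv - sum_u * sum_v, det)
    return a, b

# Assumed data (taken from the unit's worked example).
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]

a_yx, b_yx = solve_normal_equations(xs, ys)   # Y on X: Y = a + bX
a_xy, b_xy = solve_normal_equations(ys, xs)   # X on Y: X = a + bY
y_at_10 = a_yx + b_yx * 10                    # estimate of Y when X = 10

print(f"Y = {a_yx} + {b_yx} X")
print(f"X = {a_xy} + {b_xy} Y")
print(f"Estimated Y at X = 10: {float(y_at_10)}")
```

For X on Y the roles of the two variables are simply swapped, which is why the same function is called with the arguments reversed.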


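Regression Equation using Regression Coefficients

Calculate the regression equations of X on Y and Y on X for the following data:
X: 1, 2, 3, 4, 5
Y: 2, 5, 3, 8, 7
Solution: X on Y is calculated as $(X - \bar{X}) = b_{xy}(Y - \bar{Y})$

Step 1: Calculate the means of X and Y:

Mean of X: $\bar{X} = (1+2+3+4+5)/5 = 15/5 = 3$
Mean of Y: $\bar{Y} = (2+5+3+8+7)/5 = 25/5 = 5$
Step 2: $b_{xy} = \dfrac{N\sum XY - \sum X \sum Y}{N\sum Y^2 - (\sum Y)^2}$
$\sum X = 15,\ \sum Y = 25,\ \sum XY = 88,\ \sum Y^2 = 151,\ (\sum Y)^2 = 25^2 = 625$
$b_{xy} = \dfrac{5(88) - 15 \times 25}{5(151) - 625} = \dfrac{440 - 375}{755 - 625} = \dfrac{65}{130} = 0.5$
Step 3: $(X - \bar{X}) = b_{xy}(Y - \bar{Y})$ =>
$(X - 3) = 0.5(Y - 5) = 0.5Y - 2.5$
=> $X = 0.5Y - 2.5 + 3$
$X = 0.5Y + 0.5$

Y on X is calculated as $(Y - \bar{Y}) = b_{yx}(X - \bar{X})$
$b_{yx} = \dfrac{N\sum XY - \sum X \sum Y}{N\sum X^2 - (\sum X)^2}$
$\sum X = 15,\ \sum Y = 25,\ \sum XY = 88,\ \sum X^2 = 55,\ (\sum X)^2 = 15^2 = 225$
$N = 5,\ \bar{X} = 3,\ \bar{Y} = 5$
$b_{yx} = \dfrac{5(88) - 15 \times 25}{5(55) - 15^2} = \dfrac{440 - 375}{275 - 225} = \dfrac{65}{50} = 1.3$
$(Y - \bar{Y}) = b_{yx}(X - \bar{X})$
$(Y - 5) = 1.3(X - 3) = 1.3X - 3.9$
=> $Y = 1.3X - 3.9 + 5$
$Y = 1.3X + 1.1$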

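Method 2: Using Deviations from the Actual Means

Question: Calculate the regression equations of X on Y and Y on X for the following
data.
X on Y is calculated as $(X - \bar{X}) = b_{xy}(Y - \bar{Y})$
$b_{xy} = \dfrac{\sum xy}{\sum y^2}$, where $x = X - \bar{X}$ and $y = Y - \bar{Y}$
$\bar{X} = \dfrac{\sum X}{N} = \dfrac{42}{6} = 7$
$\bar{Y} = \dfrac{\sum Y}{N} = \dfrac{30}{6} = 5$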
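
The deviation method can be sketched in a few lines. The six-point data for this question is not reproduced here, so the sketch below instead assumes the five-point data X = 1, 2, 3, 4, 5 and Y = 2, 5, 3, 8, 7 from the earlier worked example. It gives the same coefficients as the raw-sum formulas (bxy = 0.5, byx = 1.3).

```python
from fractions import Fraction

# Assumed data (from the earlier worked example, not this question's table).
xs = [1, 2, 3, 4, 5]
ys = [2, 5, 3, 8, 7]

n = len(xs)
mean_x = Fraction(sum(xs), n)
mean_y = Fraction(sum(ys), n)

# Deviations from the actual means: x = X - mean(X), y = Y - mean(Y).
dx = [x - mean_x for x in xs]
dy = [y - mean_y for y in ys]

sum_dxdy = sum(u * v for u, v in zip(dx, dy))
sum_dx2 = sum(u * u for u in dx)
sum_dy2 = sum(v * v for v in dy)

bxy = sum_dxdy / sum_dy2   # regression coefficient of X on Y
byx = sum_dxdy / sum_dx2   # regression coefficient of Y on X

print(f"X - {mean_x} = {bxy} (Y - {mean_y})")
print(f"Y - {mean_y} = {byx} (X - {mean_x})")
```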



Probability Distribution
A probability distribution is a function that gives the relative likelihood
of occurrence of all possible outcomes of an experiment. Two important
functions are used to describe a probability distribution:
• Probability density function (continuous case) or probability mass function (discrete case)
• Cumulative distribution function

A die is tossed once. If the random variable X is the number of even
numbers obtained, find the probability distribution of X.
Solution:-> X = number of even numbers
Sample space = {1, 2, 3, 4, 5, 6}
P(E) = P(X = 1) = 3/6 = ½
P(O) = P(X = 0) = 3/6 = ½

Suppose two coins are tossed.
Sample space = {HH, TT, HT, TH}
Random variable: number of heads
0 heads = ¼
1 head = ½
2 heads = ¼
The table or method that describes the probabilities of the values of a random
variable is called a probability distribution.
Random variables are used to quantify the outcomes of a random occurrence and
can therefore take on many values. Based on this, a probability distribution can
be classified as a discrete probability distribution or a continuous probability
distribution. Random variables are required to be measurable and are typically
real numbers. A discrete probability distribution takes only countable values
(typically whole numbers), whereas a continuous probability distribution can take
any value in an interval, including fractions.

Find the probability distribution of the random variable "number of heads"
when two coins are tossed.
• X = number of heads
• Sample space = {HH, TH, HT, TT}
• Random variable: number of heads
• 0 heads = ¼
• 1 head = ½
• 2 heads = ¼
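
The same table can be produced by enumerating the sample space. The following is a minimal sketch, assuming fair coins so that every outcome is equally likely; the function name `heads_distribution` is our own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def heads_distribution(n_coins):
    """Probability distribution of the number of heads for n fair coins."""
    outcomes = list(product("HT", repeat=n_coins))      # the sample space
    counts = Counter(outcome.count("H") for outcome in outcomes)
    total = len(outcomes)
    return {heads: Fraction(count, total)
            for heads, count in sorted(counts.items())}

dist = heads_distribution(2)
for heads, prob in dist.items():
    print(f"P(X = {heads}) = {prob}")
```

The probabilities sum to 1, as they must for any probability distribution.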

Types of Random Variable
As discussed in the introduction, there are two types of random variables:
Discrete Random Variable:-> A discrete random variable is one which
may take on only a countable number of distinct values such as
0, 1, 2, 3, 4, ... Discrete random variables are usually (but not necessarily)
counts. If a random variable can take only a finite number of distinct
values, then it must be discrete.
Continuous Random Variable:-> A continuous random variable is one
which can take any value in an interval, and hence an uncountably infinite
number of possible values. Continuous random variables are usually
measurements. Examples include height, weight, the amount of sugar in an
orange, and the time required to run a mile.


DISCRETE DISTRIBUTIONS:
Discrete distributions have a finite or countably infinite number of different
possible outcomes.
Characteristics of Discrete Distribution
• We can add up individual values to find the probability of an interval
• Discrete distributions can be expressed with a graph, piece-wise function
or table
• In discrete distributions, the graph consists of bars lined up one after the
other
• Expected values might not be achievable
• For an integer-valued Y, P(Y ≤ y) = P(Y < y + 1)
[Figure: bar graph of a discrete distribution]

Binomial Distribution
The binomial distribution gives the probability of 'r' successes in 'n'
independent trials of an experiment, given a success probability 'p' for each
trial.
Step 1:-> Probability of success = p and probability of failure = q
p = 1 - q, q = 1 - p
Step 2:-> $P(r; n, p) = {}^{n}C_{r}\, p^{r} q^{n-r}$

where,
n = the number of trials
r = 0, 1, 2, 3, ..., n (the number of successes)
p = probability of success in a single trial
q = probability of failure in a single trial (= 1 - p)

Step 3:-> Binomial distribution = $(q + p)^{n}$
Step 4:-> Mean, μ = np
Variance, σ² = npq
Standard deviation, σ = √(npq)

Determine the binomial distribution whose mean is 9 and standard
deviation is 3/2.
Solution:
• Mean, μ = np = 9 ... (1)
• Standard deviation, σ = √(npq) = 3/2, so npq = 9/4 ... (2)
• Divide equation (2) by equation (1):
=> npq / np = (9/4) / 9 => q = ¼
• p = 1 - q = 1 - ¼ = ¾
• np = 9 => n × ¾ = 9 => n = 9 × 4/3 = 12
• Binomial distribution = $(q + p)^{n} = \left(\tfrac{1}{4} + \tfrac{3}{4}\right)^{12}$
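
The steps above can be sketched as a small function that recovers n, p and q from a given mean and standard deviation. The function name `binomial_parameters` is our own; exact fractions are used so the result matches the hand calculation.

```python
from fractions import Fraction

def binomial_parameters(mean, std_dev):
    """Recover (n, p, q) of a binomial distribution from mean = np
    and standard deviation = sqrt(npq)."""
    variance = std_dev ** 2        # npq
    q = variance / mean            # (npq) / (np)
    p = 1 - q
    n = mean / p                   # from np = mean
    return n, p, q

n, p, q = binomial_parameters(Fraction(9), Fraction(3, 2))
print(f"n = {n}, p = {p}, q = {q}")
```

A valid binomial distribution requires 0 < q < 1 and a whole-number n, so these values are worth checking whenever the inputs change.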

The probability of a man hitting a target is ¼. He fires 7 times. What is the probability of hitting
the target at least twice?
• Solution:-> p = ¼, q = 1 - ¼ = ¾, n = 7
• P(X ≥ 2) = P(X=2) + P(X=3) + P(X=4) + P(X=5) + P(X=6) + P(X=7)
• P(X ≥ 2) = 1 - [P(X=0) + P(X=1)]
• $P(r; n, p) = {}^{n}C_{r}\, p^{r} q^{n-r}$
• $P(X=0) + P(X=1) = {}^{7}C_{0}\, p^{0} q^{7} + {}^{7}C_{1}\, p^{1} q^{6}$
• ${}^{7}C_{0} = 1$ and ${}^{7}C_{1} = 7$
• $P(X \ge 2) = 1 - [q^{7} + 7pq^{6}] = 1 - q^{6}[q + 7p] = 1 - (3/4)^{6}(3/4 + 7/4)$
• $= 1 - 7290/16384 = 9094/16384 = 4547/8192$
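
The result can be verified with exact arithmetic. The following is a minimal sketch; the function name `binomial_pmf` is our own and `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ Binomial(n, p): nCr * p^r * q^(n-r)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

p = Fraction(1, 4)   # probability of hitting the target
n = 7                # number of shots

# Complement rule: P(X >= 2) = 1 - [P(X = 0) + P(X = 1)]
p_at_least_two = 1 - (binomial_pmf(0, n, p) + binomial_pmf(1, n, p))

# Direct sum over r = 2..7, as a cross-check.
p_direct = sum(binomial_pmf(r, n, p) for r in range(2, n + 1))

print(p_at_least_two, float(p_at_least_two))
```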

Example: If a coin is tossed 5 times, using the binomial distribution find the
probability of:
(a) Exactly 2 heads
(b) At least 4 heads.

Example: A fair coin is tossed 10 times. What are the probabilities of
getting exactly 6 heads and at least 6 heads?
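
Both coin examples can be worked with the binomial formula. The sketch below assumes a fair coin (p = ½) in both cases; the function name `binomial_pmf` is our own.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ Binomial(n, p)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

half = Fraction(1, 2)  # assumed: the coin is fair

# Coin tossed 5 times
exactly_2_of_5 = binomial_pmf(2, 5, half)
at_least_4_of_5 = sum(binomial_pmf(r, 5, half) for r in (4, 5))

# Coin tossed 10 times
exactly_6_of_10 = binomial_pmf(6, 10, half)
at_least_6_of_10 = sum(binomial_pmf(r, 10, half) for r in range(6, 11))

print(exactly_2_of_5, at_least_4_of_5)      # 5/16 3/16
print(exactly_6_of_10, at_least_6_of_10)    # 105/512 193/512
```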



Bernoulli Distribution
A Bernoulli variable has only two values: success and failure. If we know
the probability of success, p, then the probability of failure is 1 - p. A
pass-or-fail exam can be modeled by a Bernoulli distribution.
A binomial distribution with n = 1 is a Bernoulli distribution.
The 3 conditions for a Bernoulli trial are:
1. Each trial has only two possible outcomes: True/False, Yes/No,
Success/Failure, etc.
2. The trials are independent. They do not influence each other.
3. The probabilities of success and failure do not change. They remain the
same for all trials.
The expected value of a Bernoulli distribution is the probability of success,
p: E[X] = p.
The variance of a Bernoulli distribution is p(1 - p).

Probability Mass Function for Bernoulli Distribution

Cumulative Distribution Function for Bernoulli Distribution
Mean of Bernoulli Distribution:-> E[X] = p
The variance can be defined as the difference between the mean of
X² and the square of the mean of X:
Var[X] = E[X²] - (E[X])²
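
The formulas for the probability mass function and the cumulative distribution function are not reproduced in the text above. For reference, the standard forms are sketched below, assuming X = 1 denotes success and X = 0 denotes failure; the last two lines apply the variance definition to obtain p(1 - p).

```latex
% Probability mass function
\[ P(X = x) = p^{x}(1 - p)^{1 - x}, \qquad x \in \{0, 1\} \]

% Cumulative distribution function
\[ F(x) = \begin{cases}
      0,     & x < 0 \\
      1 - p, & 0 \le x < 1 \\
      1,     & x \ge 1
   \end{cases} \]

% Variance from the definition Var[X] = E[X^2] - (E[X])^2
\[ E[X^{2}] = 0^{2}(1 - p) + 1^{2} \cdot p = p \]
\[ \mathrm{Var}[X] = E[X^{2}] - (E[X])^{2} = p - p^{2} = p(1 - p) \]
```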






Poisson Distribution Probability
Poisson Distribution (as a limiting case of the Binomial Distribution). The
Poisson distribution was discovered by the French mathematician and
physicist Siméon Denis Poisson (1781-1840), who published it in 1837.
The Poisson distribution is a limiting case of the binomial
distribution under the following conditions:

• n, the number of trials, is very large (n → ∞)
• p, the probability of success in each trial, is very small (p → 0)
• λ = np remains finite
$f(x) = P(X = x) = \dfrac{e^{-\lambda}\lambda^{x}}{x!}$
where
x = 0, 1, 2, 3, ...
e is Euler's number (e ≈ 2.718)
λ is the average rate (the expected value); the variance also equals λ, and λ > 0



Five coins are tossed 3200 times. What is the probability of getting 5 heads exactly two times?
Solution:
• n = 3200
• p = probability of five heads in one toss of five coins = (½)⁵ = 1/32
• λ = np = 3200 × 1/32 = 100
• $f(x) = P(X = x) = \dfrac{e^{-\lambda}\lambda^{x}}{x!}$
• $P(X = 2) = \dfrac{e^{-100} \cdot 100^{2}}{2!} = 5000\, e^{-100}$
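
The calculation can be reproduced numerically. The following is a minimal sketch; the function name `poisson_pmf` is our own.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam): e^(-lam) * lam^x / x!"""
    return math.exp(-lam) * lam ** x / math.factorial(x)

n = 3200            # number of tosses of the five coins
p = 0.5 ** 5        # probability that all five coins show heads = 1/32
lam = n * p         # Poisson parameter, lambda = np

prob_two = poisson_pmf(2, lam)
print(lam, prob_two)
```

The probability is extremely small because five heads is expected about 100 times in 3200 tosses, so seeing it only twice is very unlikely.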





Normal Distribution
• The Normal Distribution, also called the Gaussian Distribution, is the most significant
continuous probability distribution. It is sometimes also called the bell curve.
• The Normal Distribution is defined by the probability density function of a continuous
random variable. Let f(x) be the probability density function and X the random variable.
Integrating f over the interval (x, x + dx) gives the probability that X takes a value
between x and x + dx.
• $f(x) \ge 0 \ \ \forall\, x \in (-\infty, +\infty)$
• $\int_{-\infty}^{+\infty} f(x)\, dx = 1$

Normal Distribution Formula

The probability density function of the normal or Gaussian distribution is given by:
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}$
• where,
• x is the variable
• μ is the mean
• σ is the standard deviation

Normal Distribution Properties

Some of the important properties of the normal distribution are listed below:
• In a normal distribution, the mean, median and mode are equal (i.e., Mean =
Median = Mode).
• The total area under the curve is equal to 1.
• The normal curve is symmetric about the center (the mean).
• Exactly half of the values lie to the right of the center and exactly
half of the values lie to the left of the center.
• The normal distribution is completely defined by its mean and standard deviation.
• The normal distribution curve has only one peak (i.e., it is unimodal).
• The curve approaches the x-axis as it extends farther away from the mean,
but it never touches it.


Normal Distribution Standard Deviation
A normal distribution can have any positive standard deviation. The mean determines the
line of symmetry of the graph, whereas the standard deviation shows how far the data are
spread out.
If the standard deviation is smaller, the data are closer to each other and the graph
becomes narrower.
If the standard deviation is larger, the data are more dispersed and the graph becomes wider.
Standard deviations are used to subdivide the area under the normal curve. Each
subdivided section gives the percentage of the data that falls into that region of the graph.



The random variable X is normally distributed with mean 9 and standard deviation 3.
Find the probabilities P(X ≥ 15), P(X < 15) and P(0 < X < 9).
Answer:
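
The three probabilities can be checked numerically. The sketch below assumes the mean is 9 and the standard deviation is 3, and builds the normal cumulative distribution function from the error function `math.erf` instead of a printed z-table; the function name `normal_cdf` is our own.

```python
import math

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma), via the error function."""
    z = (x - mu) / sigma                       # standardise: z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu, sigma = 9.0, 3.0

p_lt_15 = normal_cdf(15, mu, sigma)            # z = 2
p_ge_15 = 1.0 - p_lt_15
p_0_to_9 = normal_cdf(9, mu, sigma) - normal_cdf(0, mu, sigma)   # z from -3 to 0

print(round(p_ge_15, 4), round(p_lt_15, 4), round(p_0_to_9, 4))
```

In terms of the standard normal variable Z = (X - 9)/3, these are P(Z ≥ 2), P(Z < 2) and P(-3 < Z < 0), approximately 0.0228, 0.9772 and 0.4987.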








Rectangular or Uniform Distribution
A uniform distribution is a continuous probability distribution for
events that are equally likely to occur. It is defined by two
parameters, a and b, where a is the minimum value and b is the
maximum value, and is generally denoted U(a, b).
A continuous random variable X follows U(a, b), where a and b are
constants with a < b, when its probability density function is:
f(x) = 1/(b - a) for a ≤ x ≤ b, and f(x) = 0 otherwise,
where,
• a is the minimum value
• b is the maximum value
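
Because the density is constant, the probability of any sub-interval is its length divided by (b - a). The sketch below illustrates this; the function names and the choice U(0, 10) are hypothetical, chosen only for illustration.

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): 1/(b - a) inside [a, b], 0 outside."""
    if b <= a:
        raise ValueError("require a < b")
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X ~ U(a, b): overlap length / (b - a)."""
    if b <= a:
        raise ValueError("require a < b")
    overlap = min(hi, b) - max(lo, a)   # length of [lo, hi] inside [a, b]
    return max(overlap, 0.0) / (b - a)

# Hypothetical example: X ~ U(0, 10)
a, b = 0.0, 10.0
density = uniform_pdf(4.0, a, b)
prob_2_to_5 = uniform_prob(2.0, 5.0, a, b)
print(density, prob_2_to_5)
```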

















Exponential Distribution
The exponential distribution describes the waiting time between events in a
process in which events occur continuously and independently at a constant
average rate. It is often called the memoryless distribution because the past
has no effect on future probabilities: the time already waited does not change
the distribution of the remaining waiting time.
The exponential distribution is commonly used to model time: the time between
arrivals, the time until a component fails, the time until a patient dies. For
example, the time of the first arrival in a Poisson process follows an exponential
distribution.
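
The memoryless property can be illustrated numerically. The sketch below uses the standard density f(x) = λe^(-λx) for x ≥ 0, which is not shown in the text above, and checks that P(X > s + t | X > s) = P(X > t). The rate 0.5 and the times s = 2, t = 3 are hypothetical values chosen for illustration.

```python
import math

def exponential_pdf(x, rate):
    """Density of the exponential distribution: rate * e^(-rate * x), x >= 0."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0

def exponential_survival(x, rate):
    """P(X > x) = e^(-rate * x) for x >= 0 (and 1 for x < 0)."""
    return math.exp(-rate * x) if x >= 0 else 1.0

rate = 0.5        # hypothetical: on average one event every 2 time units
s, t = 2.0, 3.0   # hypothetical waiting times

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = exponential_survival(s + t, rate) / exponential_survival(s, rate)
unconditional = exponential_survival(t, rate)

print(conditional, unconditional)
```

Both printed values agree, which is the sense in which the time already waited carries no information about the remaining wait.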



















THANK YOU
