Statistics
What is Statistics
Continuous/Variable
Characteristics that may assume any value within their range of variation,
e.g. Height, Weight, Diameter etc.
Discrete/Attribute
Characteristics that assume only isolated values within their range of
variation, e.g. No. of complaints per month, Percentage absenteeism, No. of
injuries etc.
Types of Data - Qualitative
No number is inherently attached to the observation
Qualitative Data: It takes on Categorical values
Binary: Two-class categorical data of the present/absent type, with
states 1 and 0 respectively. It is also referred to as Boolean when the two
states correspond to TRUE and FALSE.
Interval and Ratio Measurement
• In Nominal and Ordinal measurement, the distance between attributes does not
have any meaning.
• In Interval measurement, however, the distance between attributes does have a
meaning. E.g. if we are measuring temperature on the Fahrenheit scale, the
distance between 30 and 40 is the same as the distance between 70 and 80.
• In such a situation it makes sense to compute an average on the interval scale.
However, it does not make sense to compute ratios on this scale, as 80 degrees
is not twice as hot as 40 degrees.
• In Ratio measurement there is always an absolute zero that is meaningful. This
means that one can construct a meaningful fraction (or ratio) with a ratio
variable.
• Weight and Height are ratio variables, and so are most count variables. E.g. the
no. of defects this week is twice that of last week.
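The point about ratios on an interval scale can be checked with a few lines of arithmetic. The sketch below assumes the standard Fahrenheit-to-Kelvin conversion (Kelvin has an absolute zero, so it is a ratio scale); the function name is illustrative only.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert Fahrenheit (interval scale) to Kelvin (ratio scale)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15

# Differences are meaningful on an interval scale:
# 30 -> 40 and 70 -> 80 are the same distance on either scale.
diff_low = fahrenheit_to_kelvin(40.0) - fahrenheit_to_kelvin(30.0)
diff_high = fahrenheit_to_kelvin(80.0) - fahrenheit_to_kelvin(70.0)

# Ratios are not: 80/40 = 2 in Fahrenheit, but on the absolute scale
# 80 degrees F is only about 8% "hotter" than 40 degrees F.
naive_ratio = 80.0 / 40.0
absolute_ratio = fahrenheit_to_kelvin(80.0) / fahrenheit_to_kelvin(40.0)

print(f"difference 30->40: {diff_low:.4f} K, 70->80: {diff_high:.4f} K")
print(f"naive ratio: {naive_ratio:.2f}, ratio on absolute scale: {absolute_ratio:.3f}")
```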
Descriptive Statistics
Average or Mean
• Sum of all data values divided by the number of data points
• A histogram balances when supported at the mean
• Not used when:
– Data contains a few extreme values widely different from the majority
– Terminal classes are open
Median
• Middle data value when data is ranked from min. to max.
• Median divides the area of the histogram in half
• Does not get affected by extreme values
• Preferred when the order of values is considered important
Mode
• Most common value, i.e. the value which has maximum frequency
• Mode is the value at the highest point on the histogram
• Does not get affected by extreme values
• Preferred when the most commonly occurring value appropriately represents the
group
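As a small illustration of how the three measures react to an extreme value, the sketch below uses Python's standard `statistics` module on made-up complaint counts (the data are hypothetical, chosen only to make the effect visible).

```python
import statistics

# Hypothetical data: complaints per month, then the same data with one extreme month.
complaints = [3, 4, 4, 4, 5, 6]
with_extreme = complaints + [40]

def summarize(data):
    """Return the three measures of central tendency for a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
    }

before = summarize(complaints)
after = summarize(with_extreme)

print("before:", before)
print("after: ", after)
```

In this example the mean moves from about 4.3 to about 9.4, while the median and mode both stay at 4.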
Why Standard Deviation is important
Random Experiment
Sample Space
• Continuous
e.g. S = R+ = {x | x > 0}
S = {x | 10 < x < 11}
Event
Simple (or Elementary) Event
Compound Event
Example:
Classical Probability
Probability Range
0<=P(A)<=1
Statistical Probability
Frequency and Probability
Basic rules of probability theory
P(B | A) = P(B)
• In such cases:
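The condition P(B | A) = P(B) is the usual definition of independent events, and for independent events P(A and B) = P(A) × P(B). The sketch below checks both on a hypothetical experiment (two fair dice), computing each probability the classical way as favourable outcomes over total outcomes. The events and helper names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair dice.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability: favourable outcomes / total outcomes."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))

def event_a(outcome):
    return outcome[0] % 2 == 0   # first die shows an even number

def event_b(outcome):
    return outcome[1] == 6       # second die shows a six

p_a = prob(event_a)
p_b = prob(event_b)
p_a_and_b = prob(lambda o: event_a(o) and event_b(o))
p_b_given_a = p_a_and_b / p_a    # conditional probability P(B | A)

print("P(A) =", p_a, " P(B) =", p_b)
print("P(B | A) =", p_b_given_a, " P(A and B) =", p_a_and_b)
```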
Probability Distribution Function
• Consider a discrete Random Variable X, which can take
values x1, x2, …., xn
• Not all these values are equally likely. Some are more
probable and some are less probable
• A distribution function of the random variable is any
function that describes how the probabilities are distributed
among the values of the variable
• The distribution of a discrete random variable X can be
represented as follows:
xi | x1  x2  ……  xn
pi | p1  p2  ……  pn
[Figure: P(x) plotted against X]
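A table of this kind can be built by enumerating a sample space. The sketch below uses a hypothetical random variable, X = number of heads in two fair coin tosses, and stores the distribution as a mapping from each value xi to its probability pi.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

# X = number of heads in each outcome; count how often each value occurs.
counts = Counter(outcome.count("H") for outcome in outcomes)

# Distribution table: value xi -> probability pi.
distribution = {x: Fraction(n, len(outcomes)) for x, n in sorted(counts.items())}

for x, p in distribution.items():
    print(f"x = {x}: p = {p}")
print("sum of probabilities =", sum(distribution.values()))
```

The values are not equally likely (one head is twice as probable as none or two), and the probabilities sum to 1.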
Probability Density Function (PDF)
• For a continuous random variable we work with a probability density
• The density of a substance is its mass per unit volume. For a non-homogeneous
substance we talk of local density
• In probability theory we also have a local density (probability near point x per
unit length)
• The probability density function is a function associated with a continuous
random variable X, which gives the probability density f(x) at point x
• f(x) >= 0 everywhere and the total area under the curve = 1
[Figure: PDF curve, f(x) plotted against X]
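The two properties of a density (non-negative everywhere, total area 1) can be checked numerically. The sketch below assumes a standard normal density from Python's `statistics.NormalDist` as the example f(x) and uses a simple trapezoid rule; the helper name and the integration range are illustrative choices.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def trapezoid_area(f, lo, hi, steps=10_000):
    """Approximate the area under f between lo and hi with the trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, steps):
        total += f(lo + i * width)
    return total * width

# Area under the curve over a range wide enough to hold almost all the mass.
total_area = trapezoid_area(std_normal.pdf, -8.0, 8.0)

# The density is never negative on a grid of sample points.
grid = [-8.0 + 0.1 * i for i in range(161)]
all_non_negative = all(std_normal.pdf(x) >= 0.0 for x in grid)

print(f"total area ~ {total_area:.6f}, non-negative: {all_non_negative}")
```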
Cumulative Distribution Function (CDF)
• Area under PDF corresponds to probabilities for the Random
Variable X
• The probability that X lies within the interval (a,b) equals area
bounded by x-axis, pdf curve, X=a and X=b
• For a continuous X, the probability at any single point a is equal to zero (the area at a point is zero)
[Figure: PDF f(x) and CDF F(x) plotted against x from -4 to 4; F(x) rises from 0.0 to 1.0]
• The cumulative distribution function of X returns the probability that
the random variable is less than or equal to the value x
F(x) = P(X <= x)
• This function is applicable to both Discrete and Continuous X
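A short sketch of how F(x) is used in both cases. For the continuous case it assumes a standard normal X via `statistics.NormalDist`; for the discrete case it assumes a fair six-sided die. Both examples and the helper name `discrete_cdf` are illustrative.

```python
from fractions import Fraction
from statistics import NormalDist

# Continuous X: the area between a and b equals F(b) - F(a).
std_normal = NormalDist()
a, b = -1.0, 2.0
prob_between = std_normal.cdf(b) - std_normal.cdf(a)

# Discrete X: F(x) is the sum of the probabilities of all values <= x.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def discrete_cdf(pmf, x):
    """F(x) = P(X <= x) for a discrete distribution given as value -> probability."""
    return sum(p for value, p in pmf.items() if value <= x)

prob_at_most_four = discrete_cdf(die_pmf, 4)

print(f"P({a} < X < {b}) = {prob_between:.4f}")
print("P(die <= 4) =", prob_at_most_four)
```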
Probability Distributions
We need to quantify and verify the conclusions drawn from descriptive
statistics, and to remove the subjectivity from their interpretation.
Observations vary from each other, but they form a pattern that, if
stable, can be described as a distribution.
The distribution
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
• Mean or Average (μ): Center / Location
• Standard deviation (σ): Variation
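The normal density can be written out directly from its definition, with the mean setting the location and the standard deviation setting the spread. The sketch below implements it by hand and compares it with `statistics.NormalDist.pdf`; the mean and standard deviation used are hypothetical values.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    variance = sigma ** 2
    exponent = -((x - mu) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)

# Hypothetical process: mean 50, standard deviation 5.
mu, sigma = 50.0, 5.0
reference = NormalDist(mu, sigma)

for x in (40.0, 50.0, 57.5):
    print(f"x = {x}: by hand {normal_pdf(x, mu, sigma):.6f}, library {reference.pdf(x):.6f}")
```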
Characterising the Normal Distribution
• Average ± 1σ: 68.27%
• Average ± 2σ: 95.45%
• Average ± 3σ: 99.73%
Whether you like it or not, about 99.73% of normally distributed observations
will lie within Average +/- 3 s.d.
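These coverage figures follow from the normal curve itself and can be reproduced with the standard normal CDF. A minimal sketch, assuming `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Fraction of observations within k standard deviations of the mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, fraction in coverage.items():
    print(f"within +/- {k} s.d.: {fraction:.2%}")
```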
Z- standard normal variate
A “Standard” Normal Distribution
mean = 0
st. dev. = 1
[Figure: standard normal curve with a Z scale from -3 to 3; a Z-value can fall anywhere on this scale]
• Z-value
– How many standard deviations
the value-of-interest is away
from the mean
$$Z = \frac{(\text{value of interest}) - \bar{X}}{S}$$
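A minimal sketch of the Z-value calculation, assuming the mean and standard deviation are estimated from a small sample. The heights and the function name are hypothetical.

```python
import statistics

# Hypothetical sample of heights in cm.
heights_cm = [160, 165, 170, 175, 180]

x_bar = statistics.mean(heights_cm)   # sample mean
s = statistics.stdev(heights_cm)      # sample standard deviation

def z_value(value, mean, sd):
    """How many standard deviations the value lies from the mean."""
    return (value - mean) / sd

z = z_value(180, x_bar, s)
print(f"mean = {x_bar}, s.d. = {s:.3f}, Z for 180 cm = {z:.3f}")
```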
Use of Standard Normal distribution
• If a list of numbers follows the normal curve, the percentage of
entries falling in a given interval can be estimated as follows:
– First convert the interval to standard units
– Find the corresponding area under the normal curve
• The procedure is called the normal approximation
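The two steps of the normal approximation can be sketched as follows. The mean, standard deviation and interval are hypothetical numbers, and `statistics.NormalDist` supplies the area under the standard normal curve.

```python
from statistics import NormalDist

# Hypothetical summary of a list of heights that roughly follows the normal curve.
average, sd = 170.0, 10.0
low, high = 160.0, 185.0

# Step 1: convert the interval end points to standard units.
z_low = (low - average) / sd     # -1.0
z_high = (high - average) / sd   #  1.5

# Step 2: find the corresponding area under the standard normal curve.
std_normal = NormalDist()
fraction = std_normal.cdf(z_high) - std_normal.cdf(z_low)

print(f"standard units: {z_low} to {z_high}")
print(f"estimated share of entries in the interval: {fraction:.1%}")
```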
Percentiles
• The average and SD can be used to summarize data following the normal curve.
• They are less satisfactory for other kinds of data, e.g. skewed data. To summarize
such data we use percentiles
• A percentile (or a centile) is a measure used in statistics indicating the value
below which a given percentage of observations in a group of observations fall.
• For example, the 20th percentile is the value (or score) below which 20% of the
observations may be found
• Percentile is each of the 100 equal groups into which a population can be
divided according to the distribution of values of a particular variable
– 50th percentile is the median
– Interquartile range equals: 75th percentile – 25th percentile
• A percentile is only used as a comparison score
• All histograms, whether or not they follow the normal curve, can be
summarized using percentiles
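For skewed data, the quartiles and interquartile range can be computed with `statistics.quantiles` from the Python standard library. The scores below are hypothetical, and the default `exclusive` method is assumed; other percentile conventions can give slightly different values on small samples.

```python
import statistics

# Hypothetical right-skewed scores: one value is far larger than the rest.
scores = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 40]

# n=4 returns the three cut points: the 25th, 50th and 75th percentiles.
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print(f"25th percentile = {q1}, 50th = {q2}, 75th = {q3}")
print(f"interquartile range = {iqr}")
print(f"median = {statistics.median(scores)}, mean = {statistics.mean(scores):.2f}")
```

Here the 50th percentile coincides with the median (8), while the single extreme score pulls the mean up to about 10.5.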
Quantile of a Distribution
• The a-th quantile of a cumulative distribution function F is the
point x_a such that F(x_a) = a
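In other words, the quantile is the inverse of the CDF. A minimal sketch, assuming a standard normal F and using `NormalDist.inv_cdf` from the Python standard library; the choice a = 0.975 is only an example.

```python
from statistics import NormalDist

std_normal = NormalDist()

a = 0.975
x_a = std_normal.inv_cdf(a)        # the point with F(x_a) = a
round_trip = std_normal.cdf(x_a)   # applying F recovers a

print(f"{a} quantile of the standard normal: x_a = {x_a:.4f}")
print(f"F(x_a) = {round_trip:.4f}")
```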
Thank You
Abhinav Srivastava