
Applied Statistical Methods (ASM)

“The true logic of this world is in the calculus of probabilities.”

James Clerk Maxwell


Data and Statistics
• Data consists of information coming from
observations, counts, measurements, or responses.

Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.

A population is the collection of all outcomes, responses, measurements, or counts that are of interest.

A sample is a subset of a population.

Populations & Samples
In a recent survey, 10000 DU students in Delhi
were asked whether they regularly studied for 6 hours.
350 of the students said yes. Identify the
population and the sample.

Responses of all DU students (population)

Responses of the students in the survey (sample)

Parameters & Statistics
A parameter is a numerical description of a population
characteristic.

A statistic is a numerical description of a sample characteristic.

Parameter → Population
Statistic → Sample

Parameters & Statistics
• Decide whether the numerical value describes a
population parameter or a sample statistic.

a.) A recent survey of a sample of 450 college research students reported that the average monthly income for students is Rs 30000.
Because the average of Rs 30000 is based on a sample, this is a sample statistic.
b.) The average monthly income for all students is Rs 30000.
Because the average of Rs 30000 is based on a population, this is a population parameter.
Branches of Statistics
The study of statistics has two major branches: descriptive
statistics and inferential statistics.
Statistics
– Descriptive statistics: involves the organization, summarization, and display of data.
– Inferential statistics: involves using a sample to draw conclusions about a population.
Descriptive and Inferential
Statistics
• In a study, volunteers who had less than 6 hours of sleep were
four times more likely to answer incorrectly on a science test than
were participants who had at least 8 hours of sleep. Decide
which part is the descriptive statistic and what conclusion might
be drawn using inferential statistics.

The statement “four times more likely to answer incorrectly” is a descriptive statistic. An
inference drawn from the sample is that all
individuals who sleep less than 6 hours are more
likely to answer science questions incorrectly than
individuals who sleep at least 8 hours.
Types of Data
Data sets can consist of two types of data: qualitative data
and quantitative data.
Data
– Qualitative data: consists of attributes, labels, or nonnumerical entries.
– Quantitative data: consists of numerical measurements or counts.
Designing a Statistical Study

1. Identify the variable(s) of interest (the focus)
and the population of the study.
2. Develop a detailed plan for collecting data. If
you use a sample, make sure the sample is
representative of the population.
3. Collect the data.
4. Describe the data.
5. Interpret the data and make decisions about
the population using inferential statistics.
6. Identify any possible errors.
Methods of Data Collection
In an observational study, a researcher observes and
measures characteristics of interest of part of a population.
In an experiment, a treatment is applied to part of a
population, and responses are observed.
A simulation is the use of a mathematical or physical model
to reproduce the conditions of a situation or process.
A survey is an investigation of one or more characteristics
of a population.
A census is a measurement of an entire population.

A sampling is a measurement of part of a population.


Central Tendency
• In general terms, central tendency is a
statistical measure that determines a
single value that accurately describes the
center of the distribution and represents
the entire distribution of scores.
• The goal of central tendency is to identify
the single value that is the best
representative for the entire set of data.
The Mean, the Median,
and the Mode
• It is essential that central tendency be
determined by an objective and well-defined
procedure so that others will understand exactly
how the "average" value was obtained and can
duplicate the process.
• No single procedure always produces a good,
representative value. Therefore, researchers
have developed three commonly used
techniques for measuring central tendency: the
mean, the median, and the mode.
Find the mean
• My 5 test scores for Calculus I are 95, 83,
92, 81, 75. What is the mean?
• ANSWER: sum up all the tests and divide
by the total number of tests.
• Test mean = (95+83+92+81+75)/5 = 85.2
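The calculation above can be checked with a few lines of Python. This is only a sketch of the "sum and divide by the count" rule; the variable names are illustrative and not part of the slides.

```python
# The five Calculus I test scores from the example.
scores = [95, 83, 92, 81, 75]

# Mean = sum of all scores divided by the number of scores.
mean_score = sum(scores) / len(scores)

print(f"Test mean = {mean_score:.1f}")
```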
Population Mean

$\mu = \frac{\sum X}{N} = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N}$

$\mu = \frac{24 + 13 + 19 + 26 + 11}{5} = \frac{93}{5} = 18.6$
Example with a range of data
• When you are given a range of data, you need to find midpoints.
• To find a midpoint, sum the two endpoints on the range and divide by 2.
• Example 14≤x<18. The midpoint is (14+18)/2=16.
• The total number of students is 5,542,000.

Age of males    Number of students
14≤x<18         94,000
18≤x<20         1,551,000
20≤x<22         1,420,000
22≤x<25         1,091,000
25≤x<30         865,000
30≤x<35         521,000
Total           5,542,000
Continuing the previous example
• What we need to do is find the midpoints of the
ranges and then multiply them by the frequencies,
so that we can compute the mean.
• The midpoints are 16, 19, 21, 23.5, 27.5, 32.5.
• The mean is
[16(94,000)+19(1,551,000)+21(1,420,000)+
23.5(1,091,000)+27.5(865,000)+32.5(521,000)]
/5,542,000 ≈ 22.94
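The grouped-data mean can be sketched in Python as a frequency-weighted average of the class midpoints. The tuple layout `(lower, upper, frequency)` is an illustrative choice, not something the slides prescribe.

```python
# Age classes of male students: (lower bound, upper bound, number of students).
groups = [
    (14, 18, 94_000),
    (18, 20, 1_551_000),
    (20, 22, 1_420_000),
    (22, 25, 1_091_000),
    (25, 30, 865_000),
    (30, 35, 521_000),
]

# Midpoint of each class = (lower + upper) / 2.
midpoints = [(lo + hi) / 2 for lo, hi, _ in groups]

# Total frequency and frequency-weighted sum of the midpoints.
total = sum(freq for _, _, freq in groups)
weighted_sum = sum(mid * freq for mid, (_, _, freq) in zip(midpoints, groups))

grouped_mean = weighted_sum / total
print(midpoints)
print(f"Grouped mean = {grouped_mean:.2f}")
```

Because each class is represented by its midpoint, the result is an approximation of the true mean of the underlying ages.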
The median.
• Here are a bunch of 10 point quizzes from
MATH F432:
• 9, 6, 7, 10, 9, 4, 9, 2, 9, 10, 7, 7, 5, 6, 7
• As you can see there are 15 data points.
• Now arrange the data points in order from
smallest to largest.
• 2, 4, 5, 6, 6, 7, 7, 7, 7, 9, 9, 9, 9, 10, 10
• Calculate the location of the median:
(15+1)/2=8. The eighth piece of data is the
median. Thus the median is 7.
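As a sketch, the same procedure (sort, then locate position (n+1)/2) can be written in Python. The even-length branch, which averages the two middle values, follows the usual convention; the function name is illustrative.

```python
def median(values):
    """Return the median: the middle value of the sorted data,
    or the average of the two middle values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: position (n + 1) / 2 in 1-based terms is index mid.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


quizzes = [9, 6, 7, 10, 9, 4, 9, 2, 9, 10, 7, 7, 5, 6, 7]
print(sorted(quizzes))
print("Median:", median(quizzes))
```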
BITS Pilani, Pilani Campus
The mode
• The mode is the most frequent number in a
collection of data.
• Example A: 3, 10, 8, 8, 7, 8, 10, 3, 3, 3
• The mode of the above example is 3, because 3
has a frequency of 4.
• Example B: 2, 5, 1, 5, 1, 2
• This example has no mode because 1, 2, and 5
have a frequency of 2.
• Example C: 5, 7, 9, 1, 7, 5, 0, 4
• This example has two modes 5 and 7. This is
said to be bimodal.
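The three examples can be reproduced with a small Python sketch. It adopts the convention used in Example B (when every value occurs equally often, the data set is reported as having no mode); the function name `modes` is illustrative.

```python
from collections import Counter


def modes(values):
    """Return the sorted list of modes, or an empty list when all
    distinct values share the same frequency (no mode)."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []  # every value is equally frequent: no mode
    return sorted(v for v, c in counts.items() if c == top)


example_a = [3, 10, 8, 8, 7, 8, 10, 3, 3, 3]
example_b = [2, 5, 1, 5, 1, 2]
example_c = [5, 7, 9, 1, 7, 5, 0, 4]

print(modes(example_a))  # one mode
print(modes(example_b))  # no mode
print(modes(example_c))  # bimodal
```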
Mode -- Example
• The mode is 44.
• There are more 44s than any other value.

35 41 44 45
37 41 44 46
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48
• Find the mean, median, and mode of the following data:

Score   Number of students
10      3
9       10
8       9
7       8
6       10
5       2

• Mean = [3(10)+10(9)+9(8)+8(7)+10(6)+2(5)]/42 = 7.57
• Median: find the location (42+1)/2=21.5. Use the 21st and 22nd values in the data set.
• The 21st and 22nd values are 8 and 8. Thus the median is (8+8)/2=8.
• The modes are 6 and 9 since they have frequency 10.
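The frequency-table calculation can be sketched in Python by expanding the table into individual scores. The dictionary `freq` (score mapped to number of students) is an illustrative representation of the table.

```python
# Score -> number of students, taken from the frequency table.
freq = {10: 3, 9: 10, 8: 9, 7: 8, 6: 10, 5: 2}

n = sum(freq.values())  # total number of students

# Mean: frequency-weighted average of the scores.
mean = sum(score * count for score, count in freq.items()) / n

# Median: expand to a sorted list; n is even here, so average the
# 21st and 22nd values (indices n//2 - 1 and n//2).
expanded = sorted(score for score, count in freq.items() for _ in range(count))
median = (expanded[n // 2 - 1] + expanded[n // 2]) / 2

# Modes: every score whose frequency equals the largest frequency.
top = max(freq.values())
modes = sorted(score for score, count in freq.items() if count == top)

print(n, round(mean, 2), median, modes)
```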
Measures of Dispersion
• Measures of dispersion are descriptive
statistics that describe how similar a set of
scores are to each other
– The more similar the scores are to each other, the
lower the measure of dispersion will be
– The less similar the scores are to each other, the
higher the measure of dispersion will be
– In general, the more spread out a distribution is,
the larger the measure of dispersion will be
Measures of Dispersion
• Which of the distributions of scores has the larger dispersion?
• The upper distribution has more dispersion because the scores are more spread out
• That is, they are less similar to each other

[Figure: two frequency histograms of scores 1 to 10, with frequency axes from 0 to 125.]
Measures of Dispersion
• There are three main measures of dispersion:
– The Range
– The Quartile
– Variance / Standard Deviation
The Range
• The range is defined as the difference
between the largest score in the set of data
and the smallest score in the set of data, $X_L - X_S$
• What is the range of the following data:
4 8 1 6 6 2 9 3 6 9
• The largest score ($X_L$) is 9; the smallest score
($X_S$) is 1; the range is $X_L - X_S = 9 - 1 = 8$
Range
• The difference between the largest and the smallest values in a set of data
• Simple to compute
• Ignores all data points except the two extremes
• Example: Range = Largest - Smallest = 48 - 35 = 13

35 41 44 45
37 41 44 46
37 43 44 46
39 43 44 46
40 43 44 46
40 43 45 48
Quartiles
Measures of location that divide a
group of data into four subgroups

• Q1: 25% of the data set is below the first quartile
• Q2: 50% of the data set is below the second quartile
• Q3: 75% of the data set is below the third quartile
Quartiles, continued
• Q1 is equal to the 25th percentile
• Q2 is located at the 50th percentile and equals the median
• Q3 is equal to the 75th percentile

Quartile values are not necessarily members of the data set
Quartiles

[Figure: Q1, Q2, and Q3 divide the ordered data into four parts, each containing 25% of the values.]


Quartiles: Example
• Ordered array: 106, 109, 114, 116, 121, 122, 125, 129

• Q1: $i = \frac{25}{100}(8) = 2$, so $Q_1 = \frac{109 + 114}{2} = 111.5$
• Q2: $i = \frac{50}{100}(8) = 4$, so $Q_2 = \frac{116 + 121}{2} = 118.5$
• Q3: $i = \frac{75}{100}(8) = 6$, so $Q_3 = \frac{122 + 125}{2} = 123.5$
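The location method used in this example can be sketched in Python. When i = (P/100)·n is a whole number, the percentile is the average of the i-th and (i+1)-th ordered values, as on the slide. The branch for a non-integer i (take the value at the next whole position) is an assumption based on a common textbook convention; the slides only show the whole-number case. The function name is illustrative.

```python
def percentile_by_location(ordered, p):
    """Percentile of already-sorted data for 0 < p < 100, using the
    location i = (p / 100) * n."""
    n = len(ordered)
    numerator = p * n  # i = numerator / 100, kept as integers
    if numerator % 100 == 0:
        i = numerator // 100
        # Whole-number location: average the i-th and (i+1)-th values (1-based).
        return (ordered[i - 1] + ordered[i]) / 2
    # Assumed convention: otherwise use the value at the next whole position.
    position = -(-numerator // 100)  # ceiling of i
    return ordered[position - 1]


data = [106, 109, 114, 116, 121, 122, 125, 129]
q1 = percentile_by_location(data, 25)
q2 = percentile_by_location(data, 50)
q3 = percentile_by_location(data, 75)
print(q1, q2, q3)
```

Library routines may use different interpolation rules, so their quartiles can differ slightly from these values.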
Deviation from the Mean
• Data set: 5, 9, 16, 17, 18
• Mean: $\mu = \frac{\sum X}{N} = \frac{65}{5} = 13$
• Deviations from the mean: -8, -4, 3, 4, 5

[Figure: number line from 0 to 20 marking the deviations -8, -4, +3, +4, +5 from the mean.]


Population Variance
• Average of the squared deviations from the arithmetic mean

$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$

$X$     $X - \mu$    $(X - \mu)^2$
5       -8           64
9       -4           16
16      +3           9
17      +4           16
18      +5           25
Sum     0            130
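The deviation table can be reproduced step by step in Python. This sketch computes the population variance (dividing by N, not N - 1) and its square root, the population standard deviation; the variable names are illustrative.

```python
import math

data = [5, 9, 16, 17, 18]
N = len(data)

mu = sum(data) / N                       # population mean
deviations = [x - mu for x in data]      # X - mu
squared = [d ** 2 for d in deviations]   # (X - mu)^2

variance = sum(squared) / N              # population variance
std_dev = math.sqrt(variance)            # population standard deviation

for x, d, s in zip(data, deviations, squared):
    print(f"{x:>3} {d:>+5.0f} {s:>5.0f}")
print(f"variance = {variance:.1f}, standard deviation = {std_dev:.1f}")
```

The standard library functions `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.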
Population Standard Deviation
• Square root of the variance

$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{130}{5} = 26.0$

$\sigma = \sqrt{\sigma^2} = \sqrt{26.0} \approx 5.1$

$X$     $X - \mu$    $(X - \mu)^2$
5       -8           64
9       -4           16
16      +3           9
17      +4           16
18      +5           25
Sum     0            130
Coefficient of Variation
• Ratio of the standard deviation to the mean,
expressed as a percentage
• Measurement of relative dispersion

$C.V. = \frac{\sigma}{\mu}(100)$

Coefficient of Variation

Data set 1: $\mu_1 = 29$, $\sigma_1 = 4.6$
Data set 2: $\mu_2 = 84$, $\sigma_2 = 10$

$C.V._1 = \frac{\sigma_1}{\mu_1}(100) = \frac{4.6}{29}(100) = 15.86$

$C.V._2 = \frac{\sigma_2}{\mu_2}(100) = \frac{10}{84}(100) = 11.90$
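A short Python sketch of the same comparison follows; the function name is illustrative. Although the second data set has the larger standard deviation, its dispersion relative to its mean is smaller.

```python
def coefficient_of_variation(sigma, mu):
    """Standard deviation as a percentage of the mean."""
    return sigma / mu * 100


cv1 = coefficient_of_variation(4.6, 29)   # data set 1
cv2 = coefficient_of_variation(10, 84)    # data set 2

print(f"C.V.1 = {cv1:.2f}%, C.V.2 = {cv2:.2f}%")
```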
Measures of Shape
• Skewness
– Absence of symmetry
– Extreme values on one side of a distribution
• Kurtosis
– Peakedness of a distribution
• Box and Whisker Plots
– Graphic display of a distribution
– Reveals skewness
Skewness

[Figure: three distribution curves: negatively skewed, symmetric (not skewed), and positively skewed.]
Skewness

[Figure: positions of the mean, median, and mode on negatively skewed, symmetric (not skewed), and positively skewed curves.]
Coefficient of Skewness
• Summary measure for skewness

$S = \frac{3(\mu - M_d)}{\sigma}$

• If S < 0, the distribution is negatively skewed
(skewed to the left).
• If S = 0, the distribution is symmetric (not skewed).
• If S > 0, the distribution is positively skewed
(skewed to the right).
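As a sketch, the coefficient can be computed in Python from the mean, median, and standard deviation. The three input sets below are the ones used in the worked example (means 23, 26, and 29; median 26; standard deviation 12.3); the function names are illustrative.

```python
def coefficient_of_skewness(mean, median, std_dev):
    """S = 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


def describe(s):
    """Map the sign of S to the interpretation given in the bullets."""
    if s < 0:
        return "negatively skewed"
    if s > 0:
        return "positively skewed"
    return "symmetric"


for mean in (23, 26, 29):
    s = coefficient_of_skewness(mean, 26, 12.3)
    print(f"mean = {mean}: S = {s:+.2f} ({describe(s)})")
```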
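Coefficient of Skewness

$\mu_1 = 23$, $M_{d1} = 26$, $\sigma_1 = 12.3$
$\mu_2 = 26$, $M_{d2} = 26$, $\sigma_2 = 12.3$
$\mu_3 = 29$, $M_{d3} = 26$, $\sigma_3 = 12.3$

$S_1 = \frac{3(\mu_1 - M_{d1})}{\sigma_1} = \frac{3(23 - 26)}{12.3} = -0.73$

$S_2 = \frac{3(\mu_2 - M_{d2})}{\sigma_2} = \frac{3(26 - 26)}{12.3} = 0$

$S_3 = \frac{3(\mu_3 - M_{d3})}{\sigma_3} = \frac{3(29 - 26)}{12.3} = +0.73$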
Kurtosis
• Peakedness of a distribution
– Leptokurtic: high and thin
– Mesokurtic: normal in shape
– Platykurtic: flat and spread out

[Figure: three curves labeled leptokurtic, mesokurtic, and platykurtic.]
Box and Whisker Plot

• Five specific values are used:


– Median, Q2
– First quartile, Q1
– Third quartile, Q3
– Minimum value in the data set
– Maximum value in the data set
Box and Whisker Plot, continued
• Inner Fences
– IQR = Q3 - Q1
– Lower inner fence = Q1 - 1.5 IQR
– Upper inner fence = Q3 + 1.5 IQR

• Outer Fences
– Lower outer fence = Q1 - 3.0 IQR
– Upper outer fence = Q3 + 3.0 IQR
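The fence formulas can be sketched in Python as follows. For illustration the quartiles Q1 = 111.5 and Q3 = 123.5 from the earlier quartile example are reused; the function name and the dictionary keys are illustrative.

```python
def fences(q1, q3):
    """Inner and outer fences of a box and whisker plot."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "lower_outer": q1 - 3.0 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }


result = fences(111.5, 123.5)
for name, value in result.items():
    print(f"{name}: {value}")
```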
Box and Whisker Plot

[Figure: box and whisker plot marking the minimum, Q1, Q2, Q3, and maximum.]
Skewness: Box and Whisker Plots, and
Coefficient of Skewness
[Figure: box and whisker plots for negatively skewed (S<0), symmetric (S=0), and positively skewed (S>0) distributions.]
The Clean Water Act requires that all waters meet specific pollution reduction goals
to ensure that water is fishable and swimmable. The estimated pollutant loads
(total nitrogen, kg N/day) in a watershed are as follows (in increasing order):

9.69, 13.16, 17.09, 18.12, 23.7, 24.07, 24.29, 26.43, 30.75, 31.54,
35.07, 36.99, 40.32, 42.51, 45.64, 48.22, 49.98, 50.06, 55.02, 57.00,
58.41, 61.31, 64.25, 65.24, 66.14, 67.68, 81.40, 90.80, 92.17, 92.42,
100.82, 101.94, 103.61, 106.28, 106.8, 108.69, 114.61, 120.86, 124.54,
143.27, 143.75, 149.64, 167.79, 182.5, 192.55, 193.53, 271.57, 292.61,
312.45, 352.09, 371.47, 444.68, 460.86, 563.92, 690.11, 826.54, 1529.35



Data Analysis

The data set consists of observations on shower-flow rate (L/min) for n=129
houses in Delhi.

4.6, 12.3, 7.1, 7.0, 4.0, 9.2, 6.7, 6.9, 11.5, 5.1, 11.2, 10.5, 14.3, 8.0, 8.8, 6.4,
5.1, 5.6, 9.6, 7.5, 7.5, 6.2, 5.8, 2.3, 3.4, 10.4, 9.8, 6.6, 3.7, 6.4, 8.3, 6.5, 7.6,
9.3, 9.2, 7.3, 5.0, 6.3, 13.8, 6.2, 5.4, 4.8, 7.5, 6.0, 6.9, 10.8, 7.5, 6.6, 5.0, 3.3,
7.6, 3.9, 11.9, 2.2, 15.0, 7.2, 6.1, 15.3, 18.9, 7.2, 5.4, 5.5, 4.3, 9.0, 12.7, 11.3,
7.4, 5.0, 3.5, 8.2, 8.4, 7.3, 10.3, 11.9, 6.0, 5.6, 9.5, 9.3, 10.4, 9.7, 5.1, 6.7,
10.2, 6.2, 8.4, 7.0, 4.8, 5.6, 10.5, 14.6, 10.8, 15.5, 7.5, 6.4, 3.4, 5.5, 6.6, 5.9,
15.0, 9.6, 7.8, 7.0, 6.9, 4.1, 3.6, 11.9, 3.7, 5.7, 6.8, 11.3, 9.3, 9.6, 10.4, 9.3,
6.9, 9.8, 9.1, 10.6, 4.5, 6.2, 8.3, 3.2, 4.9, 5.0, 6.0, 8.2, 6.3, 3.8, 6.0.



sum 994.3
mean 7.707751938
variance 9.393583318
standard deviation 3.064895319
skewness 0.885261602
kurtosis 3.918782138
Excess Kurtosis 0.918782138
range 16.7
Quartile 1 5.6
Quartile 2 7
Quartile 3 9.6
Quartile 4 18.9
Median 7
IQR 4
Mode 7.5



Bin Frequency
2.2 1
3.718181818 9
5.236363636 17
6.754545455 30
8.272727273 25
9.790909091 18
11.30909091 15
12.82727273 6
14.34545455 2
15.86363636 5
17.38181818 0
More 1
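The bin boundaries in this table appear to be equally spaced, starting at the minimum (2.2) with a step of range/11 = 16.7/11. The sketch below reproduces them under that assumption; the choice of 11 intervals is inferred from the spacing of the listed boundaries and is not stated in the source.

```python
minimum, maximum = 2.2, 18.9
n_intervals = 11  # assumption inferred from the spacing of the listed bins

width = (maximum - minimum) / n_intervals
edges = [minimum + k * width for k in range(n_intervals)]

# Frequencies copied from the table, including the final "More" bin.
frequencies = [1, 9, 17, 30, 25, 18, 15, 6, 2, 5, 0, 1]

for edge in edges:
    print(f"{edge:.9f}")
print("total observations:", sum(frequencies))
```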

Probability



Exp: Tossing an unbiased coin
• Simulation using a random number generator: rand()=x (random event)
• Rule 1. if x<0.5, Assume Tail (T)
• Rule 2. if x>=0.5, Assume Head (H)

Either Head or Tail will be the outcome.

Mutually exclusive events: $H \cap T = \emptyset$
Disjoint: $n(H \cap T) = 0$
n=10
Outcomes: H, H, H, T, H, T, T, T, H, H

$n(S) = 10$, where S is the sample space
$n(H) = 6$, $n(T) = 4$

$\frac{n(H)}{n(S)} = \frac{6}{10} = 0.6$

$\frac{n(T)}{n(S)} = \frac{4}{10} = 0.4$
n=100
H T H H T T H T T T
T H T T H T H H H T
T H H T H T H H H T
T H H H T T T T H H
H H H H T T T H H T
T T H H H T T T T T
H T H T T H H H T H
H H H T H H T H H T
H H T T H T H H H H
T H H H T T H T H T
n=100
$n(S) = 100$, where S is the sample space
$n(H) = 54$, $n(T) = 46$

$\frac{n(H)}{n(S)} = \frac{54}{100} = 0.54$

$\frac{n(T)}{n(S)} = \frac{46}{100} = 0.46$
n=100000
$n(S) = 100000$, where S is the sample space
$n(H) = 49868$, $n(T) = 50132$

$\frac{n(H)}{n(S)} = \frac{49868}{100000} = 0.49868$

$\frac{n(T)}{n(S)} = \frac{50132}{100000} = 0.50132$
n=10000000000
$n(S) = 10000000000$, where S is the sample space
$n(H) \approx 0.5\,n(S)$, $n(T) \approx 0.5\,n(S)$

$\frac{n(H)}{n(S)} \approx 0.5$

$\frac{n(T)}{n(S)} \approx 0.5$
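The experiment can be sketched with Python's `random` module, using the two rules from the first slide (x < 0.5 gives a tail, x >= 0.5 gives a head). The seed, the function name, and the chosen values of n are illustrative; the exact counts depend on the random number generator, so they will not match the slide's figures.

```python
import random


def toss_coin(n, seed=0):
    """Simulate n tosses of an unbiased coin; return (heads, tails)."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    heads = 0
    for _ in range(n):
        x = rng.random()       # uniform random number in [0, 1)
        if x >= 0.5:           # Rule 2: head
            heads += 1
        # Rule 1: x < 0.5 counts as a tail
    return heads, n - heads


for n in (10, 100, 100_000):
    h, t = toss_coin(n)
    print(f"n={n}: n(H)/n(S) = {h / n:.5f}, n(T)/n(S) = {t / n:.5f}")
```

As n grows, both relative frequencies settle near 0.5, which is the behaviour the slides illustrate.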
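Uniform Distribution
$n \to \infty$
$n(S) \to \infty$
$n(H) \to \infty$, $n(T) \to \infty$

Both ratios take the indeterminate form $\frac{\infty}{\infty}$, so they are evaluated as limits:

$\frac{n(H)}{n(S)} \to \lim_{n \to \infty} \frac{n(H)}{n(S)}$

$\frac{n(T)}{n(S)} \to \lim_{n \to \infty} \frac{n(T)}{n(S)}$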
Theoretical Probability

$P(H) = \lim_{n \to \infty} \frac{n(H)}{n(S)} = 0.5$

$P(T) = \lim_{n \to \infty} \frac{n(T)}{n(S)} = 0.5$
Mutually Exclusive Events
$H \cap T = \emptyset$
$n(H \cap T) = 0$

$P(H \cap T) = \frac{n(H \cap T)}{n(S)} = 0$

$P(H \cup T) = P(H) + P(T) = 1$
Axioms of Probability
• 1. $0 \le P(E) \le 1$
• 2. $P(S) = 1$
• 3. $P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$ for mutually exclusive events
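Equally Likely Outcome

$P(H) = 0.5$
$P(T) = 0.5$
$P(H) = P(T)$: equally likely (uniform distribution)
$P(E_i) = \frac{1}{N}$

$E \cap E^C = \emptyset$
$E \cup E^C = S$
$P(E^C) = 1 - P(E)$
[Figure: Venn diagram of the sample space S divided into E and its complement $E^C$.]

$E \subset F \Rightarrow P(E) \le P(F)$
[Figure: Venn diagram of E contained in F within the sample space S.]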
Probability as a measure of belief
• Belief in a proposition (f) can be measured in
terms of a number between 0 (impossible) and
1 (certain).
• That f has a probability between 0 and 1 does not
mean that it is true to some degree; it means
that we are ignorant of its truth value.
Probability as a measurement of
uncertainty
• Uncertainty : The lack of certainty, a state of limited
knowledge where it is impossible to exactly describe
the existing state, a future outcome, or more than
one possible outcome. ...
• Quantification of uncertainty in terms of probability.
• Uncertainty arises in partially observable and/or
stochastic environments, as well as due to ignorance,
indolence, or both.
Exp: Rolling two dice
Sample Space
Equally Likely
Non Uniform Distribution
Observations

• (i) The outcomes (1, 1), (2, 2), (3, 3), (4, 4), (5, 5) and (6, 6) are called doublets.

• (ii) The pairs (1, 2) and (2, 1) are different outcomes.
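The sample space for two dice can be enumerated in Python. The sketch below is one reading of the slide titles above: the 36 ordered outcomes are equally likely, while a derived quantity such as the sum of the two dice is not uniformly distributed. The variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# Each ordered outcome is equally likely.
p_outcome = Fraction(1, len(outcomes))

# Doublets: both dice show the same number.
doublets = [o for o in outcomes if o[0] == o[1]]

# Distribution of the sum: counts differ, so it is not uniform.
sum_counts = Counter(a + b for a, b in outcomes)
p_sum = {s: Fraction(c, len(outcomes)) for s, c in sorted(sum_counts.items())}

print(len(outcomes), p_outcome, doublets)
for s, p in p_sum.items():
    print(s, p)
```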
Question 1: Two dice are rolled. A is the event that
the sum of the numbers shown on the two dice is 5,
and B is the event that at least one of the dice
shows up a 3.
• Are the two events (i) mutually exclusive, (ii)
exhaustive?
Now, A = {(1, 4), (2, 3), (4, 1), (3, 2)}, and
B = {(3, 1), (3, 2), (3, 3), (3, 4), (3, 5), (3, 6), (1,3), (2,
3), (4, 3), (5, 3), (6, 3)}
(i) A ∩ B = {(2, 3), (3, 2)} ≠ ∅.
Hence, A and B are not mutually exclusive.
(ii) Also, A ∪ B ≠ S.
Therefore, A and B are not exhaustive events.
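The answer to Question 1 can be checked by enumerating the events as Python sets. This is a sketch; the set names mirror the events A and B defined in the question.

```python
from itertools import product

# Sample space: all ordered pairs from two dice.
S = set(product(range(1, 7), repeat=2))

A = {o for o in S if sum(o) == 5}   # sum of the two dice is 5
B = {o for o in S if 3 in o}        # at least one die shows a 3

mutually_exclusive = len(A & B) == 0  # no common outcomes?
exhaustive = (A | B) == S             # do A and B cover all of S?

print(sorted(A))
print(sorted(A & B))
print(mutually_exclusive, exhaustive)
```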
Question 2
• What is the probability that the total of two
dice will be greater than 8 ?
• What is the probability that the first die is a 6?
• What is the probability that the total of two
dice will be greater than 8 given that the first
die is a 6?
• What is the probability that the first die is a 6
given that the total of two dice will be greater
than 8 ?
Question 2
• What is the probability that the total of two dice
will be greater than 8 and the first die is a 6?
• What is the probability that the first die is a 6?
• What is the probability that the total of two dice
will be greater than 8 given that the first die is a
6?
• What is the probability that the first die is a 6
given that the total of two dice will be greater
than 8 ?
Solution
• A= Event that the total of two dice will be
greater than 8
• B= Event that the first die is a 6
• C= Event that the total of two dice will be
greater than 8 given that the first die is a 6
• D = Event that the first die is a 6 given that
the total of two dice will be greater than 8
$A \cap B$ = Event that the total of two dice will be
greater than 8 and the first die is a 6

$A$ = {(3,6), (4,5), (4,6), (5,4), (5,5), (5,6), (6,3), (6,4), (6,5), (6,6)}

$P(A) = \frac{10}{36} = \frac{5}{18}$

$A^C$ = {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
(3,1), (3,2), (3,3), (3,4), (3,5), (4,1), (4,2), (4,3), (4,4), (5,1), (5,2), (5,3), (6,1), (6,2)}

$P(A^C) = \frac{26}{36} = \frac{13}{18}$
$B$ = {(6,1), (6,2), (6,3), (6,4), (6,5), (6,6)}

$P(B) = \frac{6}{36} = \frac{1}{6}$

$B^C$ = {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
(3,1), (3,2), (3,3), (3,4), (3,5), (3,6), (4,1), (4,2), (4,3),
(4,4), (4,5), (4,6), (5,1), (5,2), (5,3), (5,4), (5,5), (5,6)}

$P(B^C) = \frac{30}{36} = \frac{5}{6}$
$A \cap B$ = {(6,3), (6,4), (6,5), (6,6)}
In A only = {(3,6), (4,5), (4,6), (5,4), (5,5), (5,6)}
In B only = {(6,1), (6,2)}
In neither A nor B = {(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (2,1), (2,2), (2,3), (2,4), (2,5), (2,6),
(3,1), (3,2), (3,3), (3,4), (3,5), (4,1), (4,2), (4,3), (4,4), (5,1), (5,2), (5,3)}

$P(A \cap B) = \frac{4}{36} = \frac{1}{9}$

$P(C) = P(A \mid B) = \frac{4}{6} = \frac{2}{3}$

$P(D) = P(B \mid A) = \frac{4}{10} = \frac{2}{5}$

$P(C^C) = P(A^C \mid B) = \frac{2}{6} = \frac{1}{3}$

$P(D^C) = P(B^C \mid A) = \frac{6}{10} = \frac{3}{5}$
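$P(A) = \frac{5}{18}$, $P(A^C) = \frac{13}{18}$
$P(B) = \frac{1}{6}$, $P(B^C) = \frac{5}{6}$
$P(A \mid B) = \frac{2}{3}$, $P(A^C \mid B) = \frac{1}{3}$
$P(B \mid A) = \frac{2}{5}$, $P(B^C \mid A) = \frac{3}{5}$
$P(A \cap B) = \frac{1}{9}$

$P(A)\,P(B \mid A) = \frac{1}{9}$, $P(B)\,P(A \mid B) = \frac{1}{9}$

$P(A \cap B) = P(A)\,P(B \mid A) = P(B)\,P(A \mid B)$

Conditional Probability

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$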
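These probabilities can be verified by counting outcomes in Python with exact fractions. This is a sketch: the helper `prob` and the predicate names are illustrative, and conditioning is implemented by restricting the sample space to the outcomes where the given event holds.

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes


def prob(event, given=None):
    """P(event), or P(event | given) when a conditioning event is supplied."""
    space = S if given is None else [o for o in S if given(o)]
    favourable = [o for o in space if event(o)]
    return Fraction(len(favourable), len(space))


def A(o):  # total of the two dice is greater than 8
    return o[0] + o[1] > 8


def B(o):  # first die is a 6
    return o[0] == 6


p_a = prob(A)
p_b = prob(B)
p_a_and_b = prob(lambda o: A(o) and B(o))
p_a_given_b = prob(A, given=B)
p_b_given_a = prob(B, given=A)

print(p_a, p_b, p_a_and_b, p_a_given_b, p_b_given_a)
print(p_a * p_b_given_a == p_b * p_a_given_b == p_a_and_b)
```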
Tree Diagram

Branch $P(A) = \frac{5}{18}$:
– $P(B \mid A) = \frac{2}{5}$
– $P(B^C \mid A) = \frac{3}{5}$

Branch $P(B) = \frac{1}{6}$:
– $P(A \mid B) = \frac{2}{3}$
– $P(A^C \mid B) = \frac{1}{3}$
Tree Diagram
Multiplication Rule

$P\left(\bigcap_{i=1}^{n} E_i\right) = P(E_1)\,P(E_2 \mid E_1)\,P(E_3 \mid E_1 \cap E_2) \cdots P\left(E_n \mid \bigcap_{i=1}^{n-1} E_i\right)$
Thanks
