
STAT – 835

PROBABILITY AND STATISTICS


Scope of the Course

To serve as a comprehensive introduction to the ‘probability concepts’ and ‘statistical
methods & applications’ most likely to be encountered and used by students in pursuit
of their careers in engineering.

2
Planned Curriculum

 Descriptive Statistics (6 hours)


◦ Populations, Samples, Processes
◦ Pictorial and Tabular Methods in Descriptive Statistics (Stem-
and-leaf, box plot, dot plots, histogram)
◦ Mean, Median, Quartiles, Percentiles, Trimmed mean
◦ Measures of Variability (variance, standard deviation)
 Probability (18 hours)
◦ Sample Spaces, Events
◦ Axioms, Interpretations and Properties of Probability
◦ Conditional Probability , Independence and Bayes’ Theorem
◦ Discrete and Continuous Random Variables
◦ Discrete and Continuous Probability Distributions

3
Planned Curriculum

 Hypotheses Tests ( 6 hours)


◦ Hypotheses and Test Procedures
◦ Test about a population Mean
◦ Inferences based on two samples (two-sample t test)

 Regression and Statistical Modeling (18 hours)


◦ Simple linear Regression Model
◦ Estimating Model Parameters and their inferences
◦ Correlation
◦ Diagnostics and Remedial Measures
◦ Nonlinear and Multiple Regression
◦ Software Learning (Excel, SPSS (PASW 18.0))

4
Miscellaneous Course Information
STAT– 835: Probability and Statistics
Time and Location: Thu 5:00 – 6:30 PM, Mon 6:45 – 8:15
Instructor: Dr. Kamran Ahmed
Contact : (051)9085-4153; Mob: 0301-5630831
Email: buet99@hotmail.com
Office Hours for Students: By appointment
Textbooks: Probability and Statistics for Engineering and the Sciences by Jay L. Devore (6th
Edition)
Exams: There will be two class tests (one hour each) and one final examination (3
hours). The final examination will be held during the final exam week and will cover the
entire course.
Home Work: Homework will be given after completion of a major topic (a total of 4/5
Homework Assignments)
Quiz and Attendance:
There will be 3/6 quizzes, including a couple of pop quizzes in class. Students are
expected to attend all classes. Poor attendance will affect the final grade.
Final Grade: Final grade will depend on the following components with the
proportions mentioned against each (subject to variation):
Homework (5%), Quiz (10%), Class Tests (30%), Term project (15%) and Final exam
(40%).
5

DESCRIPTIVE STATISTICS

6
Population and Sample and Processes
• Engineers and scientists are constantly exposed to collections of facts, or data

• Statistics provides methods for organizing and summarizing data and for drawing
conclusions based on data

• An investigation will typically focus on a well-defined collection of objects
constituting a population (e.g. all graduating students of a university)

• If the desired information is available for all objects in the population, we have
what is called a census
7
Population and Sample and Processes
(cont...)
• Usually a census is impractical or infeasible. Why?
Constraints on time, money and other scarce resources
• Instead, a subset of the population, called a sample, is selected in
some prescribed manner (e.g. 50 students selected at random
out of 500 graduates)
• In order to draw inferences/conclusions about a
population, certain characteristics of the objects in the
population are investigated (e.g. age, gender, GPA); each such
characteristic is a categorical or numerical variable
• A variable is any characteristic whose value may change
from one object to another
• A data set may be univariate, bivariate or multivariate

9
Univariate, Bivariate, and Multivariate
Data
 Depending on how many variables we are
measuring on the individuals or objects in our
sample, we will have one of the three following
types of data sets
◦ Univariate: Measurements made on only one variable
per observation.
◦ Bivariate: Measurements made on two variables per
observation.
◦ Multivariate: Measurements made on more than two
variables per observation.

10
Population and Sample
• Population: The entire collection of individuals or
measurement objects about which information is desired,
e.g. all 5-year-old children in Pakistan when the quantity of interest is their average height

• Sample: A subset of the population selected for study.
The primary objective is to obtain a subset of the population
whose center, spread and shape (data characteristics)
are as close as possible to those of the population; i.e., the sample should be
truly representative of the entire population. There are many
methods of sampling: random (simple or systematic)
sampling, stratified or cluster sampling, etc.

• Random Sample: A simple random sample of size n
from a population is a subset of n elements from that
population, chosen in such a way that
every possible subset of size n has the same chance of
being selected (so every unit of the population also has the same
chance of being included).
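
As a minimal sketch of the definition above, the snippet below draws a simple random sample of 50 from a hypothetical population of 500 graduate ID numbers. The ID list and the seed value are assumptions made for illustration only; `random.sample` draws without replacement.

```python
import random

# Hypothetical population: ID numbers 1..500 standing in for 500 graduates.
population = list(range(1, 501))

# Fixed (arbitrary) seed so the draw can be reproduced.
rng = random.Random(835)

# Draw 50 distinct units without replacement.
sample = rng.sample(population, k=50)

print(len(sample), len(set(sample)))
```

Re-running with a different seed gives a different, equally valid sample.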
11
Population and Sample (Cont…)
 Why do we need randomness in sampling?
It reduces the possibility of subjective bias (e.g.
selectivity bias).
The mean and variance of a random sample are unbiased
estimates of the population mean and variance,
respectively (using the divisor n - 1 for the sample variance).

12
Census and Inference
 Census: Complete enumeration of population units.

• Inference: We sample the population (in a manner that ensures
the sample correctly represents the population), take
measurements on our sample, and then infer (or generalize)
back to the population.

Example: We may want to know the average height of all
adults (over 18 years old) in Pakistan. Our population is then
all adults over 18 years of age. If we were to take a census, we would
measure every adult and then compute the average. By using
statistics, we can take a random sample of adults over 18
years of age, compute their average height, and then infer that
the average height of the total population is "close to" the
average height of our sample.

13
Parameter and Statistic
 Parameter: Any statistical characteristic of a
population. Population mean, population median,
population standard deviation are examples of
parameters.

• Statistic: Any statistical characteristic of a
sample. The sample mean, sample median and sample
standard deviation are some examples of
statistics.

• Statistical Methods: Describing a population
through a census, or making inferences from a sample
by estimating the value of a parameter using a
statistic.

14
Some Differences between Population and Sample

                                    POPULATION       SAMPLE
Size                                Large            Small
Size notation                       N                n
Easy to collect data?               No               Yes
Term used to describe its nature    A “parameter”    A “statistic”
                                    e.g., μ, σ       e.g., x̄, s

15
Some Differences between Population and Sample
(Cont’d)

                            POPULATION                               SAMPLE
Mean (notation)             μ                                        x̄
Std deviation (notation)    σ                                        s
Mean (formula)              \mu = \frac{\sum x}{N}                   \bar{x} = \frac{\sum x}{n}
Variance (formula)          \sigma^2 = \frac{\sum (x - \mu)^2}{N}    s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}

16
Statistics!
What is it? What does it involve?
 The art or science of making confident conclusions about the
attributes of a system or collection of systems

 Involves:
- taking a small sample from a larger set (Sampling)
- analyzing data from the small sample (Data analysis)
- testing the hypotheses to ascertain if true (Hypothesis
Testing)
- making conclusions about the larger set (Statistical
Inference)
- presenting your findings to an audience (Information
Delivery)

17
Prelude to Statistics

As engineers we may be required to answer some


questions about a population, which is the
collection of all elements in a system.

However, we’ll find it impractical to use the entire


population for the investigation.

We will have to settle for a small part of the
population, called a sample.

18
Some of the questions we may be required to answer as
civil engineers:

- What is the quality of aggregates at a certain quarry?
(Materials/Civil Engineering)

- What is the ratio of auto use to transit use?
(Transportation Planning)

19
- What is the strength of concrete being used in
constructing a certain structure?
(Construction/Materials Engineering)

- What is the quality of water produced by a water


treatment plant? (Environmental Engineering)

- What has been the long-term settlement of


high-rise buildings in a City? (Geotechnical)

20
- How many of the steel I-sections provided by a certain
supplier have a lower-than-specified strength?
(Structural Engineering)

- What is the quality of water in a water reservoir?


(Environmental Engineering)

21
Therefore:

… there’re countless instances in engineering

where

we’ll have to take only a small sample from a large


population of systems or system components

in order to

investigate an issue and provide needed answers.

22
Because we draw the sample from the population, the
sample is called a subset of the population (Recall
Set Theory)

The population is also referred to as the “Universe”, or


the “Sample Space”.

[Figure: the sample shown as a subset of the population]

23
Ideally, we seek a sample that is a miniature copy of
the population.

But there is no guarantee that we can achieve such a


sample.

This dilemma leads to 2 very important questions …

24
Important Questions …

1. Is our sample a good copy of the


population?
In other words, what quantitative means
can we use to determine whether our
sample is “close” enough to the
population?

2. What steps can we take to ensure that our


sample is a good miniature copy of the
population?

25
Every engineer involved in the statistical analysis of his/her system hopes
that his/her sample is a good representative of the population.

i.e., the engineer “prays” that the statistics of his/her


sample closely match the true parameters of the population.

Otherwise any conclusion he/she makes about the


sample does not reflect the entire population.

POPULATION           SAMPLE
Parameters: μ, σ     Statistics: x̄, s
26
Back to “Important Questions, #1”

Is our sample a good copy (close enough) of the
population?
We might compare the population parameters with the
sample statistics. However, the parameters of the
population are unknown, so how can we measure the
closeness of our sample to the population?

We use the concepts of Bias and Efficiency (to be
discussed under “Inferential Statistics”).

“Statistical Inference” helps to determine the
bias or efficiency of estimates, in order to
see how good our samples are.
27
Back to “Important Questions #2”

What steps can we take to ensure that our sample is a


good miniature copy of the population?

Answer: Sampling must be random (and representative).


i.e., all elements of the population should have an
equal chance of being picked in the sample

28
Methods of Random Sampling
There are 4 major ways in which sampling can be
carried out to ensure that the sample is random and yet
represents a true miniature copy of the population:

- Simple Random Sampling


- Systematic Random Sampling
- Stratified (or Clustered) Random Sampling
- Combos of the above

The choice of any specific sampling technique above


depends on
- the composition of the population
- the availability of sampling resources
29
Simple Random Sampling
This is just a simple selection of elements of the
population without regard to the nature of the
population.

Advantages: - Less effort in preparations for the survey


- Less effort for conduct of the survey
- Is best when all elements in the population
have similar characteristics

Disadvantage: May not be truly representative of the


population, if the population has diverse
characteristics.

30
Systematic Random Sampling

This sampling method is …

Systematic in time: sampling elements from the
population within specified time intervals, at the same
location (when data is sensitive to temporal changes),
e.g. traffic data collection during rush/peak hours and
off-peak hours.

Systematic in space: sampling elements from the
population at selected locations at the same time
(when data is sensitive to spatial variation), e.g.
counting the number of car accidents at various
segments/locations of a highway over a given time
period (cross-sectional data sample).
31
Stratified Random Sampling

This sampling method first divides the entire population


into different groups, or strata, on the basis of
certain characteristics of the population.

Next, a random sample is obtained within each stratum


to obtain the desired sample size.
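
The two steps above (divide into strata, then sample randomly within each stratum) can be sketched as follows. The strata names, their sizes, the helper `stratified_sample`, and the proportional allocation rule are all assumptions made for illustration; the slides do not prescribe a particular allocation.

```python
import random

# Hypothetical strata: stratum name -> list of unit IDs in that stratum.
strata = {
    "stratum_1": [f"s1-{i}" for i in range(600)],
    "stratum_2": [f"s2-{i}" for i in range(300)],
    "stratum_3": [f"s3-{i}" for i in range(100)],
}

def stratified_sample(strata, total_n, rng):
    """Draw a simple random sample inside each stratum, sized in
    proportion to the stratum's share of the population."""
    population_size = sum(len(units) for units in strata.values())
    result = {}
    for name, units in strata.items():
        # Proportional allocation; rounding can shift the total by a unit
        # or so when the shares are not whole numbers.
        n_h = round(total_n * len(units) / population_size)
        result[name] = rng.sample(units, n_h)
    return result

rng = random.Random(835)  # arbitrary seed for reproducibility
sample = stratified_sample(strata, total_n=50, rng=rng)
sizes = {name: len(units) for name, units in sample.items()}
print(sizes)  # {'stratum_1': 30, 'stratum_2': 15, 'stratum_3': 5}
```

Every stratum is guaranteed to appear in the sample, which is the advantage a plain simple random sample cannot promise.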

32
Stratified Random Sampling
MAIN POPULATION

SUB-POPULATION SUB-POPULATION SUB-POPULATION SUB-POPULATION


#1 #2 #3 #4

SAMPLE SAMPLE
SAMPLE SAMPLE

Sub-populations may be of same size or of different sizes


33
Stratified Random Sampling (continued)
A stratified sampling approach is most effective when
three conditions are met:
• Variability within strata is minimized
• Variability between strata is maximized
• The variables upon which the population is stratified
are strongly correlated with the desired dependent
variable.
Advantage:
Stratified random sampling ensures that each group in
the population is represented in the sample.
Is therefore ideal for populations having diverse groups.
Disadvantage:
Relatively more preparation time is needed to calculate
the proportion of each group in the population, and
therefore to determine its proportion in the sample.
34
Combinations of the 3 major methods of random
sampling.

Sampling schemes which are combination of the 3 methods can


also be used.

For example, you may decide to carry out a stratified and
systematic random sampling of your population.

Any Example?

Collection of data on different classes of vehicles (cars,


motor cycles, bus, etc.) during peak and off peak hours.

35
In Summary ...
- We can afford to take only a small sample
from a large population of systems or system
components in order to investigate the
population.

- Our sample must as much as possible reflect


the population from which it is drawn.

- Good sampling should be random, and


representative. Systematic and Stratified
sampling are useful to ensure that sample is
representative of the population.

- Only a good sample can result in accurate


inferences and predictions about the
population.
36
Population, Sample and Processes

• Probability (population → sample; logic based on known properties)
  ◦ Properties of the population under study are assumed to be known
  ◦ Deals with questions involving samples taken from the population
• Inferential Statistics (sample → population; logic based on observed instances)
  ◦ Statistics of the sample are known and are used to infer about the population
  ◦ Point estimation
  ◦ Hypothesis testing
  ◦ Estimation by confidence interval
• Any samples used should be representative of the target population
37
Using Statistics in Research
 Carrying out research means the collection and
collation of data. Statistics are a way of making use
of this data
◦ Descriptive Statistics: used to describe characteristics
of the sample
 Statistics describe samples
 Gives numerical and graphic procedures to summarize a collection
of data in a clear and understandable way
◦ Inferential Statistics: used to generalise/infer/predict
from our sample to our population
 Parameters describe populations
 Provides procedures to draw inferences about a population from a
sample

38
Introduction to Statistics

Types of Statistical Analysis

• Descriptive
  ◦ Graphical: Dot Plots, Scatter Plots, Box Plots, Stem-and-leaf Plots, Bar Charts/Histograms
  ◦ Non-graphical: Central Tendency, Dispersion/Variance, Range, Shape
• Inferential
  ◦ Point Estimation, Hypothesis Testing, Confidence Interval, Statistical Regression
39
Descriptive Statistics
◦ Statistical procedures used to summarise,
organise, and simplify data. This process should be
carried out in such a way that reflects overall
findings
 Raw data is made more manageable
 Raw data is presented in a logical form
 Patterns can be seen from organised data
 Frequency tables
 Graphical techniques
 Measures of Central Tendency
 Measures of Spread (variability)

40
Descriptive Measures
 Central Tendency measures. They are
computed to give a “center” around which the
measurements in the data are distributed.
(Average/Mean/Median)

 Variation or Variability measures. They


describe “data spread” or how far away the
measurements are from the center.

41
Measures of Central
Tendency/Measure of Location

 Mean:
Sum of all measurements divided by the number
of measurements.

 Median:
A number such that at most half of the
measurements are below it and at most half of the
measurements are above it.
 Mode:
The most frequent measurement in the data.

42
Mean
• Sum of the values divided by the number
of cases

\bar{y} = \frac{\sum y_i}{n}

43
Summation notation
• The y_i (y_1, y_2, …, y_n) are the n values of the
variable Y
• The sum of the values is then denoted as

\sum y_i = \sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n

44
Calculating the mean for high
temperatures
• Add values

Date      High Temperature
2-Jan     59
3-Jan     60
4-Jan     43
5-Jan     42
6-Jan     35
7-Jan     32
8-Jan     32
9-Jan     46
10-Jan    41
11-Jan    52
Sum       442

\sum y_i = 442

• Number of cases

n = 10

• Calculate mean

\bar{y} = \frac{\sum y_i}{n} = \frac{442}{10} = 44.2
Notice that every single observation enters into the computation
of the mean.
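
The same calculation can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the variable names are chosen here for illustration.

```python
# High temperatures for 2-Jan through 11-Jan, as listed in the table.
high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

total = sum(high_temps)   # add values
n = len(high_temps)       # number of cases
mean = total / n          # calculate mean

print(total, n, mean)  # 442 10 44.2
```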
45
Median
 A number such that half of the measurements
are below it and half of the measurements
are above it
 The median represents the middle of the ordered
sample data
 When the sample size is odd, the median is the
middle value
 When the sample size is even, the median is the
mean of the two middle values

46
Calculating the median for high
temperatures
Date      High Temperature
7-Jan     32
8-Jan     32
6-Jan     35
10-Jan    41
5-Jan     42   <=== Middle value
4-Jan     43   <=== Middle value
9-Jan     46
11-Jan    52
2-Jan     59
3-Jan     60

n = 10

\tilde{y} = \text{Median} = \frac{42 + 43}{2} = 42.5
47
Comparison of mean and median

 Mean
◦ Uses all of the data
◦ Affected by extreme high or low values (outliers)

 Median
◦ May not necessarily use all data
◦ Not affected by outliers
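
A small sketch of this comparison, using the high-temperature data from the earlier slides. The value 600 is a hypothetical recording error introduced here purely to illustrate an outlier; it is not part of the original data.

```python
from statistics import mean, median

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

# Hypothetical outlier: the 60 is mistakenly recorded as 600.
with_outlier = [600 if t == 60 else t for t in high_temps]

print(mean(high_temps), median(high_temps))      # 44.2 42.5
print(mean(with_outlier), median(with_outlier))  # 98.2 42.5
```

The mean more than doubles, while the median does not move because the two middle values are unchanged.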

48
Mode
 The most frequent measurement in the
data.
 Example:

 10,9,8,4,12,10,23,10,33,16,10,20,10

 10,9,8,4,12,10,23,10,33,16,10,20,10

 Mode of the sample is 10
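
As a quick check of the example, the frequencies can be counted with `collections.Counter` from the Python standard library (a sketch; the variable names are illustrative):

```python
from collections import Counter

data = [10, 9, 8, 4, 12, 10, 23, 10, 33, 16, 10, 20, 10]

counts = Counter(data)
# most_common(1) returns the (value, frequency) pair with the highest count.
mode_value, frequency = counts.most_common(1)[0]

print(mode_value, frequency)  # 10 5
```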


49
Other Measures of Location
 Percentiles
 Quartiles
 Trimmed Mean

50
Percentiles

 The pth percentile is a number such that at most


p% of the measurements are below it and at
most 100 – p percent of the data are above it.

 Example:
if in a certain data the 85th percentile is 340,
it means that 15% of the measurements in the
data are above 340. It also means that 85% of
the measurements are below 340

 The median is the 50th percentile


51
Quartiles
 Data can be divided into quartiles
 Quartiles divide data into four equal
groups
◦ Lower (first) quartile is 25th percentile
◦ Middle (second) quartile is 50th percentile and
is the median
◦ Upper (third) quartile is 75th percentile
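
A hedged sketch of computing quartiles for the high-temperature data with `statistics.quantiles`. Software packages use different interpolation conventions for percentiles, so the first and third quartiles may differ slightly from a hand calculation or from Excel/SPSS; the second quartile agrees with the median under the default method.

```python
from statistics import median, quantiles

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(high_temps, n=4)

print(q1, q2, q3)
print(q2 == median(high_temps))  # the second quartile is the median
```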

52
Trimmed Mean
 Mean is greatly influenced by outliers while median is
not.
 Extreme behavior of either type might be
undesirable.
 Consider alternative measures that are neither as
sensitive as mean nor as insensitive as median
 A trimmed mean is a compromise between
mean and median.
 A 10% trimmed mean, for example, would be
computed by eliminating the smallest 10% and the
largest 10% of the sample and then averaging what
is left over.
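
The 10% trimmed mean described above can be sketched as follows for the high-temperature data. The helper `trimmed_mean` is written here for illustration, and rounding the number of trimmed observations down is an assumption; other conventions (such as interpolating between trimming levels) exist when the sample size times the trimming proportion is not a whole number.

```python
def trimmed_mean(values, proportion):
    """Average after dropping `proportion` of the observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations dropped per end, rounded down
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

# Drops one 32 and the 60, then averages the remaining eight values.
print(trimmed_mean(high_temps, 0.10))  # 43.75
```

The result, 43.75, lies between the median (42.5) and the mean (44.2), which matches the "compromise" description.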
53
Central Tendency vs Variance
 Reporting a measure of center gives only partial
information about a data set or distribution.
 Different samples or populations may have
identical measures of center yet differ from one
another in other important ways.

54
Central Tendency vs Variance
 Figure shows dotplots of three samples with the
same mean and median, yet the extent of spread
about the center is different for all three samples.
 The first sample has the largest amount of
variability, the third has the smallest amount, and
the second is intermediate to the other two in this
respect.

55
Measures of variation
 Range
 Variance and standard deviation
 Interquartile range
 Primary measure of variability involve the
deviations from the mean

56
Range
• Range is the difference between the
maximum and minimum values in a data set
• A defect of the range is that it depends
on only the two most extreme
observations and disregards the positions
of the remaining n - 2 values.

57
Central Tendency vs Variance
 Samples 1 and 2 in Figure have identical ranges, yet
when we take into account the observations
between the two extremes, there is much less
variability or dispersion in the second sample
than in the first.

58
Calculating the range for high
temperatures

range = 60 – 32 = 28
59
Variance and Standard Deviation
• The variance s^2 is the sum of the squared deviations
from the mean divided by the number of cases minus 1

s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}

• The standard deviation s is the square root of the
variance

s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}

It is a measure of “spread”.
The larger the deviations (positive or negative), the larger the
variance.
The sum of all individual deviations (y_i - \bar{y}) is zero, i.e.

\sum (y_i - \bar{y}) = 0
60
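
The fact that the deviations sum to zero follows directly from the definition of the mean. A short derivation, using the same symbols as the mean formula (with n\bar{y} = \sum y_i):

```latex
\sum_{i=1}^{n} (y_i - \bar{y})
  = \sum_{i=1}^{n} y_i - n\bar{y}
  = n\bar{y} - n\bar{y}
  = 0
```

This is why the raw deviations cannot be averaged to measure spread, and squared deviations are used instead.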
Variance (for a sample)

 Steps:
◦ Compute each deviation
◦ Square each deviation
◦ Sum all the squares
◦ Divide by the data size (sample size) minus
one: n-1

61
Justification for “n-1”
 The value of population mean is almost never known, so
the sum of squared deviations about sample mean must
be used.
 But the sample values tend to be closer to their average
(sample mean) than to the population mean, so to compensate
for this the divisor n-1 is used rather than n.
 If we use a divisor “n” in the sample variance, then the
resulting quantity would tend to underestimate population
variance
• Dividing by the slightly smaller n - 1 corrects this
underestimation.
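
The underestimation can be seen exactly on a tiny example. The sketch below enumerates every possible sample of size 2 drawn with replacement from a hypothetical four-value population (both the population values and the with-replacement scheme are assumptions for illustration) and averages the two candidate variance formulas over all samples.

```python
from itertools import product
from statistics import fmean, pvariance

population = [2, 4, 6, 8]        # hypothetical tiny population
sigma2 = pvariance(population)   # population variance (divisor N)

def sum_sq_dev(sample):
    """Sum of squared deviations about the sample mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

n = 2
# All 16 equally likely ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = fmean(sum_sq_dev(s) / n for s in samples)
avg_div_n_minus_1 = fmean(sum_sq_dev(s) / (n - 1) for s in samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)
```

On average the divisor n gives 2.5, half of the true variance 5, while the divisor n - 1 averages to exactly 5.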

62
Standard deviation
• The unit for s is the same as the unit for each
of the observations.
 If, for example, the observations are fuel
efficiencies in miles per gallon, then we
might have s = 2.0 mpg.
 A rough interpretation of the sample
standard deviation is that it is the size of a
typical or representative deviation from
the sample mean within the given
sample.
63
Standard deviation

 If s = 2.0 mpg, then some data values are


closer than 2.0 mpg to mean, whereas others
are farther away

 2.0 mpg is a representative (or “standard”)


deviation from the mean fuel efficiency.

64
Calculating the variance and standard deviation
Date      High Temperature    Difference (X - mean)    Difference Squared
2-Jan     59                   14.80                   219.04
3-Jan     60                   15.80                   249.64
4-Jan     43                   -1.20                     1.44
5-Jan     42                   -2.20                     4.84
6-Jan     35                   -9.20                    84.64
7-Jan     32                  -12.20                   148.84
8-Jan     32                  -12.20                   148.84
9-Jan     46                    1.80                     3.24
10-Jan    41                   -3.20                    10.24
11-Jan    52                    7.80                    60.84
Sum       442                                          931.60
n         10
Mean      44.2

s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{931.60}{10 - 1} = 103.51

s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{103.51} = 10.2
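
The table can be reproduced in a few lines of Python. This is a sketch of the same steps (deviation, square, sum, divide by n - 1); the standard-library functions `statistics.variance` and `statistics.stdev` also use the n - 1 divisor and are included as a cross-check.

```python
from math import sqrt
from statistics import stdev, variance

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

n = len(high_temps)
mean = sum(high_temps) / n
squared_devs = [(y - mean) ** 2 for y in high_temps]  # square each deviation

s2 = sum(squared_devs) / (n - 1)  # sample variance
s = sqrt(s2)                      # sample standard deviation

print(round(sum(squared_devs), 2), round(s2, 2), round(s, 1))  # 931.6 103.51 10.2
print(round(variance(high_temps), 2), round(stdev(high_temps), 1))  # 103.51 10.2
```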
65
Calculating Variance and Standard Deviation

66
Coefficient of Variation (CV)
• Also called coefficient of dispersion

CV = \frac{\text{std dev}}{\text{mean}} = \frac{s}{\bar{y}}

• Measures the variation relative to the mean
• Values of this coefficient for several different
data sets can be compared to determine which
data set exhibits more or less variation
• Because the coefficient of variation is unitless,
you can use it instead of the standard deviation
to compare the spread of data sets that have
different units or different means.

67
Coefficient of Variation (CV)
Example:
• Comparing the variation in average volume of
concrete produced by large and small machines
• The mean concrete volume produced at one time
by the small machine is 1 ton, with a standard deviation
of 0.08 ton.
• The mean concrete volume produced at one time
by the large machine is 16 tons, with a standard
deviation of 0.4 ton.
• Although the standard deviation of the large
machine is five times that of the small machine,
their coefficients of
variation (CVs) support a different conclusion:
68
Coefficient of Variation (CV)
• CV (Large Machine)
= 100 * 0.4 ton / 16 tons = 2.5 %
• CV (Small Machine)
= 100 * 0.08 ton / 1 ton = 8 %
• The CV of the small machine is more than
three times that of the large
machine.

• Although the large machine has a greater
standard deviation, the small machine has
much more variability relative to its mean.
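
The comparison above can be sketched in Python. The helper name `coefficient_of_variation` is chosen here for illustration; it simply expresses the standard deviation as a percentage of the mean.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the CV as a percentage of the mean."""
    return 100 * std_dev / mean

cv_large = coefficient_of_variation(0.4, 16)   # large machine: s = 0.4 ton, mean = 16 tons
cv_small = coefficient_of_variation(0.08, 1)   # small machine: s = 0.08 ton, mean = 1 ton

print(round(cv_large, 2), round(cv_small, 2), round(cv_small / cv_large, 2))  # 2.5 8.0 3.2
```

The ratio of 3.2 is the "more than three times" figure quoted on the slide.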
69
