Descriptive Statistics Lecture 1

STAT – 835
PROBABILITY AND STATISTICS

Scope of the Course
To serve as a comprehensive introduction to the
‘probability concepts’ and ‘statistical
methods & applications’ most likely to be
encountered and used by students in pursuit
of their careers in engineering
2
STAT – 835 Probability and Statistics
Planned Curriculum
 Descriptive Statistics (6 hours)

◦ Populations, Samples, Processes
◦ Pictorial and Tabular Methods in Descriptive Statistics
(stem-and-leaf, box plot, dot plots, histogram)
◦ Mean, Median, Quartiles, Percentiles, Trimmed mean
◦ Measures of Variability (variance, standard deviation)
 Probability (18 hours)
◦ Sample Spaces, Events
◦ Axioms, Interpretations and Properties of Probability
◦ Conditional Probability , Independence and Bayes’ Theorem
◦ Discrete and Continuous Random Variables
◦ Discrete and Continuous Probability Distributions
3
Planned Curriculum
 Hypothesis Tests (6 hours)
◦ Hypotheses and Test Procedures
◦ Tests about a Population Mean
◦ Inferences Based on Two Samples (two-sample t test)
 Regression and Statistical Modeling (18 hours)

◦ Simple linear Regression Model
◦ Estimating Model Parameters and their inferences
◦ Correlation
◦ Diagnostics and Remedial Measures
◦ Nonlinear and Multiple Regression
◦ Software Learning (Excel, SPSS (PASW 18.0))
4
Miscellaneous Course Information
STAT– 835: Probability and Statistics
Time and Location: Thurs, 5:00 – 6:30 PM; Mon, 6:45 – 8:15
Instructor: Dr. Kamran Ahmed
Contact : (051)9085-4153; Mob: 0301-5630831
Email: buet99@hotmail.com
Office Hours for Students: By appointment
Textbooks: Probability and Statistics For Engineering and Sciences by Jay L. Devore (6th
Edition)
Exams: There will be two class tests (one hour each) and one final examination (3
hours). The final examination will be held during the final exam week and covers the
entire course.
Home Work: Homework will be given after completion of a major topic (a total of 4/5
homework assignments).
Quiz and Attendance:
There will be 3/6 quiz tests, including a couple of pop quizzes in class. Students are
expected to attend all classes. Poor attendance will affect the final grade of
students.
Final Grade: Final grade will depend on the following components with the
proportions mentioned against each (subject to variation):
Homework (5%), Quiz (10%), Class Tests (30%), Term project (15%) and Final exam
(40%).
5
DESCRIPTIVE STATISTICS
6
Population and Sample and Processes
 Engineers and scientists are constantly exposed to
collections of facts, or data
 Statistics provides methods for organizing and
summarizing data and for drawing conclusions
based on data
 An investigation will typically focus on a well-defined
collection of objects constituting a population (e.g. all
graduating students of a university)
 If the desired information is available for all objects in the
population, we have what is called a census
7
(cont...)
 A census is usually impractical or infeasible. Why?
Constraints on time, money and other scarce resources
 Instead, a subset of the population, called a sample, is selected in
some prescribed manner (e.g. 50 students selected at random
out of 500 graduates)
 In order to draw inferences/conclusions about a
population, certain characteristics of the objects in the
population are investigated (e.g. age, gender, GPA; each is
a categorical or numerical variable)
 A variable is any characteristic whose value may change
from one object to another
 A data set may be univariate, bivariate or multivariate
9
Univariate, Bivariate, and Multivariate
Data
 Depending on how many variables we are
measuring on the individuals or objects in our
sample, we will have one of the three following
types of data sets
◦ Univariate: Measurements made on only one variable
per observation.
◦ Bivariate: Measurements made on two variables per
observation.
◦ Multivariate: Measurements made on more than two
variables per observation.
10
Population and Sample
 Population: The entire collection of individuals or
measurement objects about which information is desired,
e.g. all 5-year-old children in Pakistan, when the quantity of
interest is their average height
 Sample: A subset of the population selected for study.
The primary objective is to create a subset of the population
whose center, spread and shape (data characteristics)
are as close as possible to those of the population; i.e., the
sample should be truly representative of the entire population.
There are many methods of sampling: random (simple or
systematic) sampling, stratified sampling, cluster sampling, etc.
 Random Sample: A simple random sample of size n
from a population is a subset of n elements from that
population, chosen in such a way that every possible subset
of n elements has the same chance of being selected.
Consequently, every unit of the population has the same
chance of being included.
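The definition above can be tried out in a few lines of Python. This is only a sketch: the population of 500 student ID numbers and the seed value are made up for illustration, echoing the earlier "50 students out of 500 graduates" example.

```python
import random

# Hypothetical population: ID numbers of 500 graduating students.
population = list(range(1, 501))

# Fixed (arbitrary) seed so that the draw can be repeated.
rng = random.Random(835)

# random.sample draws without replacement, and every subset of
# size 50 is equally likely to be chosen.
sample = rng.sample(population, k=50)

print(len(sample), sorted(sample)[:5])
```

Changing the seed gives a different, equally valid simple random sample.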
11
Population and Sample (Cont…)
 Why do we need randomness in sampling?
It reduces the possibility of subjective bias (e.g.
selectivity bias).
The mean and variance of a random sample are unbiased
estimates of the population mean and variance,
respectively.
12
Census and Inference
 Census: Complete enumeration of population units.
 Inference: We sample the population (in a manner that ensures
the sample correctly represents the population), then
take measurements on our sample and infer (or generalize)
back to the population.
Example: We may want to know the average height of all
adults (over 18 years old) in Pakistan. Our population is then
all adults over 18 years of age. If we were to take a census, we would
measure every adult and then compute the average. By using
statistics, we can take a random sample of adults over 18
years of age, measure their average height, and then infer that
the average height of the total population is “close to” the
average height of our sample.
13
Parameter and Statistic
 Parameter: Any statistical characteristic of a
population. Population mean, population median,
population standard deviation are examples of
parameters.
 Statistic: Any statistical characteristic of a
sample. Sample mean, sample median, sample
standard deviation are some examples of
statistics.
 Statistical Methods: Describing a population
through a census, or making inferences from a sample
by using a statistic to estimate the value of a
parameter.
14
Some Differences between Population and Sample
                                    POPULATION                    SAMPLE
Size                                Large                         Small
Size Notation                       N                             n
Easy to collect data?               No                            Yes
Term used to describe its nature    A “parameter” (e.g., μ, σ)    A “statistic” (e.g., x̄, s)
15
Some Differences between Population and Sample
(Cont’d)
                            POPULATION                               SAMPLE
Mean (notation)             μ                                        x̄
Std Deviation (notation)    σ                                        s
Mean (formula)              \mu = \frac{\sum x}{N}                   \bar{x} = \frac{\sum x}{n}
Variance (formula)          \sigma^2 = \frac{\sum (x - \mu)^2}{N}    s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}
16
Statistics!
What is it? What does it involve?
 The art or science of making confident conclusions about the
attributes of a system or collection of systems
 Involves:
- taking a small sample from a larger set (Sampling)
- analyzing data from the small sample (Data analysis)
- testing the hypotheses to ascertain if true (Hypothesis
Testing)
- making conclusions about the larger set (Statistical
Inference)
- presenting your findings to an audience (Information
Delivery)
17
Prelude to Statistics
As engineers we may be required to answer some
questions about a population, which is the
collection of all elements in a system.
However, we’ll find it impractical to use the entire
population for the investigation.
We will have to settle for a small part of the
population, called a sample.
18
Some of the questions we may be required to answer as
civil engineers:
- What is the quality of aggregates at a certain quarry?
(Materials/Civil Engineering)
- What is the ratio of auto use to transit use?
(Transportation Planning)
19
- What is the strength of concrete being used in
constructing a certain structure?
(Construction/Materials Engineering)
- What is the quality of water produced by a water
treatment plant? (Environmental Engineering)
- What has been the long-term settlement of
high-rise buildings in a city? (Geotechnical)
20
- How many of the steel I-sections provided by a certain
supplier have a lower-than-specified strength?
(Structural Engineering)
- What is the quality of water in a water reservoir?
(Environmental Engineering)
21
Therefore:
… there are countless instances in engineering
where we’ll have to take only a small sample from a large
population of systems or system components
in order to investigate an issue and provide needed answers.
22
Because we draw the sample from the population, the
sample is called a subset of the population (recall
Set Theory).
The population is also referred to as the “Universe”, or
the “Sample Space”.
Sample ⊂ Population
23
Ideally, we seek a sample that is a miniature copy of
the population.
But there is no guarantee that we can achieve such a
sample.
This dilemma leads to 2 very important questions …
24
Important Questions …
1. Is our sample a good copy of the
population?
In other words, what quantitative means
can we use to determine whether our
sample is “close” enough to the
population?
2. What steps can we take to ensure that our
sample is a good miniature copy of the
population?
25
Every engineer involved in the statistical analysis of his/her system hopes
that his/her sample is a good representative of the population,
i.e., the engineer “prays” that the statistics of his/her
sample closely match the true parameters of the population.
Otherwise, any conclusion he/she draws from the
sample does not reflect the entire population.
POPULATION (Parameters: μ, σ)    SAMPLE (Statistics: x̄, s)
26
Back to “Important Questions, #1”
Is our sample a good copy (close enough) of the
population?
We may compare the population parameters and the
sample statistics. However, the parameters of the
population are unknown, so how can we measure the
closeness of our sample to the population?
We use the concepts of Bias and Efficiency (to be
discussed under “Inferential Statistics”).
“Statistical Inference” helps to determine the
bias or efficiency of estimates, in order to
see how good our samples are.
27
Back to “Important Questions #2”
What steps can we take to ensure that our sample is a
good miniature copy of the population?
Answer: Sampling must be random (and representative),
i.e., all elements of the population should have an
equal chance of being picked in the sample
28
Methods of Random Sampling
There are 4 major ways in which sampling can be
carried out so that the sample is random and yet
a true miniature copy of the population:
- Simple Random Sampling
- Systematic Random Sampling
- Stratified (or Clustered) Random Sampling
- Combinations of the above
The choice of a specific sampling technique
depends on
- the composition of the population
- the availability of sampling resources
29
Simple Random Sampling
This is just a simple selection of elements of the
population without regard to the nature of the
population.
Advantages:
- Less effort in preparations for the survey
- Less effort for conduct of the survey
- Is best when all elements in the population
have similar characteristics
Disadvantage: May not be truly representative of the
population, if the population has diverse
characteristics.
30
Systematic Random Sampling
This sampling method is …
Systematic in time: sampling elements from the
population at specified time intervals, at the same
location (when data are sensitive to temporal changes),
e.g. traffic data collection during rush/peak hours and
off-peak hours.
Systematic in space: sampling elements from the
population at selected locations at the same time
(when data are sensitive to spatial variation), e.g.
counting the number of car accidents at various
segments/locations of a highway over a given time
period (a cross-sectional data sample).
31
Stratified Random Sampling
This sampling method first divides the entire population
into different groups, or strata, on the basis of
certain characteristics of the population.
Next, a random sample is obtained within each stratum
to obtain the desired sample size.
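The two steps above (divide into strata, then sample at random within each stratum) can be sketched in Python as follows. The vehicle-class strata, their sizes, and the function name `stratified_sample` are hypothetical, and proportional allocation is only one possible way to split the sample size across strata.

```python
import random

def stratified_sample(strata, total_n, seed=0):
    """Draw a simple random sample within each stratum, with sample
    sizes proportional to stratum sizes (proportional allocation)."""
    rng = random.Random(seed)
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        # Rounding may make the overall total differ slightly from total_n.
        n_h = round(total_n * len(units) / population_size)
        sample[name] = rng.sample(units, n_h)
    return sample

# Hypothetical strata: vehicle classes observed on a road.
strata = {
    "cars": [f"car-{i}" for i in range(600)],
    "motorcycles": [f"mc-{i}" for i in range(300)],
    "buses": [f"bus-{i}" for i in range(100)],
}

picked = stratified_sample(strata, total_n=50)
sizes = {name: len(units) for name, units in picked.items()}
print(sizes)  # {'cars': 30, 'motorcycles': 15, 'buses': 5}
```

Every stratum appears in the sample in the same proportion as in the population, which a single simple random sample of 50 would not guarantee.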
32
Stratified Random Sampling
MAIN POPULATION
SUB-POPULATION #1    SUB-POPULATION #2    SUB-POPULATION #3    SUB-POPULATION #4
SAMPLE               SAMPLE               SAMPLE               SAMPLE
Sub-populations may be of same size or of different sizes

33
Stratified Random Sampling (continued)
A stratified sampling approach is most effective when
three conditions are met:
 Variability within strata is minimized
 Variability between strata is maximized
 The variables upon which the population is stratified
are strongly correlated with the desired dependent
variable.
Advantage:
Stratified random sampling ensures that each group in
the population is represented in the sample.
It is therefore ideal for populations having diverse groups.
Disadvantage:
Relatively more preparation time is needed to calculate
the proportion of each group in the population, and
hence to determine its proportion in the
sample.
34
Combinations of the 3 major methods of random
sampling
Sampling schemes that combine the 3 methods can
also be used.
For example, you may decide to carry out a stratified and
systematic random sampling of your population.
Any example?
Collection of data on different classes of vehicles (cars,
motorcycles, buses, etc.) during peak and off-peak hours.
35
In Summary ...
- We can usually afford to take only a small sample
from a large population of systems or system
components in order to investigate the
population.
- Our sample must, as much as possible, reflect
the population from which it is drawn.
- Good sampling should be random and
representative. Systematic and stratified
sampling are useful to ensure that the sample is
representative of the population.
- Only a good sample can result in accurate
inferences and predictions about the
population.
36
Population, Sample and Processes
Probability: from Population to Sample (logic based on known properties)
• Properties of the population under study are assumed to be known
• Deals with questions involving samples taken from the population
Inferential Statistics: from Sample to Population (logic based on observed instances)
• Statistics of the sample are known and are used to infer about the population
• Point estimation
• Hypothesis testing
• Estimation by confidence interval
 Any samples used should be representative of
the target population
37
Using Statistics in Research
 Carrying out research means the collection and
collation of data. Statistics are a way of making use
of this data
◦ Descriptive Statistics: used to describe characteristics
of the sample
 Statistics describe samples
 Gives numerical and graphic procedures to summarize a collection
of data in a clear and understandable way
◦ Inferential Statistics: used to generalise/infer/predict
from our sample to our population
 Parameters describe populations
 Provides procedures to draw inferences about a population from a
sample
38
Introduction to Statistics
Types of Statistical Analysis
Descriptive
◦ Graphical: Dot Plots, Scatter Plots, Box Plots, Stem-and-leaf Plots,
Bar Charts/Histograms
◦ Non-graphical: Central Tendency, Dispersion/Variance, Range, Shape
Inferential
◦ Point Estimation, Hypothesis Testing, Confidence Interval,
Statistical Regression
39
Descriptive Statistics
◦ Statistical procedures used to summarise,
organise, and simplify data. This process should be
carried out in such a way that reflects overall
findings
 Raw data is made more manageable
 Raw data is presented in a logical form
 Patterns can be seen from organised data
 Frequency tables
 Graphical techniques
 Measures of Central Tendency
 Measures of Spread (variability)
40
Descriptive Measures
 Central Tendency measures. They are
computed to give a “center” around which the
measurements in the data are distributed.
(Average/Mean/Median)
 Variation or Variability measures. They

describe “data spread” or how far away the
measurements are from the center.
41
Measures of Central
Tendency/Measure of Location
 Mean:
Sum of all measurements divided by the number
of measurements.
 Median:
A number such that at most half of the
measurements are below it and at most half of the
measurements are above it.
 Mode:
The most frequent measurement in the data.
42
Mean
 Sum of the values divided by the number
of cases
\bar{y} = \frac{\sum y_i}{n}
43
Summation notation
 The yi (y1, y2, …, yn) are the n values of the
variable Y
 The sum of the values is then denoted as
yy y i
i 1
i  y1  y2    yn
44
Calculating the mean for high
temperatures
Date      High Temperature
2-Jan     59
3-Jan     60
4-Jan     43
5-Jan     42
6-Jan     35
7-Jan     32
8-Jan     32
9-Jan     46
10-Jan    41
11-Jan    52
Sum       442
 Add values: \sum y_i = 442
 Number of cases: n = 10
 Calculate mean: \bar{y} = \frac{\sum y_i}{n} = \frac{442}{10} = 44.2
Notice that every single observation enters into the computation
of the mean.
45
Median
 A number such that half of the measurements
are below it and half of the measurements
are above it
 The median represents the middle of the ordered
sample data
 When the sample size is odd, the median is the
middle value
 When the sample size is even, the median is the
mean of the two middle values
46
Calculating the median for high
temperatures
n = 10
Date      High Temperature
7-Jan     32
8-Jan     32
6-Jan     35
10-Jan    41
5-Jan     42    <=== Middle value
4-Jan     43    <=== Middle value
9-Jan     46
11-Jan    52
2-Jan     59
3-Jan     60
Median: \tilde{y} = \frac{42 + 43}{2} = 42.5
47
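The odd/even rule for the median can be written out directly. The sketch below is one straightforward way to do it (the function name `median` and the list name `high_temps` are chosen here for illustration), applied to the ten high temperatures from the table.

```python
def median(values):
    """Median of a non-empty list: the middle value of the sorted data,
    or the mean of the two middle values when n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# High temperatures for 2-Jan to 11-Jan, in date order.
high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

print(median(high_temps))      # 42.5 (n = 10, even)
print(median(high_temps[:9]))  # 42   (n = 9, odd: 11-Jan dropped)
```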
Comparison of mean and median
 Mean
◦ Uses all of the data
◦ Affected by extreme high or low values (outliers)
 Median
◦ May not necessarily use all data
◦ Not affected by outliers
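A quick way to see the difference is to corrupt one value and recompute both measures. In this sketch the outlier is a made-up data-entry error (60 recorded as 600) in the high-temperature data used earlier.

```python
from statistics import mean, median

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]
# Hypothetical data-entry error: 60 recorded as 600.
with_outlier = [59, 600, 43, 42, 35, 32, 32, 46, 41, 52]

print(mean(high_temps), median(high_temps))      # 44.2 42.5
print(mean(with_outlier), median(with_outlier))  # 98.2 42.5
```

The single bad value more than doubles the mean, while the median does not move.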
48
Mode
 The most frequent measurement in the
data.
 Example:
 10,9,8,4,12,10,23,10,33,16,10,20,10
 10,9,8,4,12,10,23,10,33,16,10,20,10
 Mode of the sample is 10
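As a small check of the example above, the frequencies can be counted with Python's standard library. The variable names are illustrative only.

```python
from collections import Counter

data = [10, 9, 8, 4, 12, 10, 23, 10, 33, 16, 10, 20, 10]

# Count how often each value occurs and take the most frequent one.
counts = Counter(data)
mode, frequency = counts.most_common(1)[0]

print(mode, frequency)  # 10 5
```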

49
Other Measures of Location
 Percentiles
 Quartiles
 Trimmed Mean
50
Percentiles
 The pth percentile is a number such that at most
p% of the measurements are below it and at
most (100 – p)% of the measurements are above it.
 Example:
if in a certain data set the 85th percentile is 340,
it means that at most 85% of the measurements are
below 340 and at most 15% of the measurements
are above 340
 The median is the 50th percentile

51
Quartiles
 Data can be divided into quartiles
 Quartiles divide data into four equal
groups
◦ Lower (first) quartile is 25th percentile
◦ Middle (second) quartile is 50th percentile and
is the median
◦ Upper (third) quartile is 75th percentile
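Quartiles can be computed with Python's standard library, as sketched below for the high-temperature data used earlier. Note that several conventions exist for interpolating between data points, so other software (or a textbook's "fourths") may give slightly different values for the first and third quartiles. The `method="inclusive"` choice here is just one of them.

```python
from statistics import quantiles

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

# n=4 asks for the three cut points that split the data into four groups.
# method="inclusive" interpolates linearly between the ordered data points.
q1, q2, q3 = quantiles(high_temps, n=4, method="inclusive")

print(q1, q2, q3)  # 36.5 42.5 50.5
```

The second quartile equals the median, 42.5, whichever convention is used.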
52
Trimmed Mean
 Mean is greatly influenced by outliers while median is
not.
 Extreme behavior of either type might be
undesirable.
 Consider alternative measures that are neither as
sensitive as mean nor as insensitive as median
 A trimmed mean is a compromise between
mean and median.
 A 10% trimmed mean, for example, would be
computed by eliminating the smallest 10% and the
largest 10% of the sample and then averaging what
is left over.
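The trimming procedure just described can be sketched as follows. The function name `trimmed_mean` is hypothetical, and the sketch assumes the trimming proportion times n is a whole number (as it is for 10% of 10 values). Other cases need an interpolation rule that is not covered here.

```python
def trimmed_mean(values, proportion):
    """Drop the smallest and largest `proportion` of the sorted values,
    then average the rest. Assumes proportion * n is a whole number."""
    ordered = sorted(values)
    k = round(proportion * len(ordered))  # number trimmed from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

print(trimmed_mean(high_temps, 0.10))  # 43.75 (drops one 32 and the 60)
```

For these data the 10% trimmed mean (43.75) lies between the median (42.5) and the mean (44.2).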
53
Central Tendency vs Variance
 Reporting a measure of center gives only partial
information about a data set or distribution.
 Different samples or populations may have
identical measures of center yet differ from one
another in other important ways.
54
 Figure shows dotplots of three samples with the
same mean and median, yet the extent of spread
about the center is different for all three samples.
 The first sample has the largest amount of
variability, the third has the smallest amount, and
the second is intermediate to the other two in this
respect.
55
Measures of variation
 Range
 Variance and standard deviation
 Interquartile range
 The primary measures of variability involve the
deviations from the mean
56
Range
 Range is the difference between the
maximum and minimum values in a data set
 A defect of the range is that it depends
on only the two most extreme
observations and disregards the positions
of the remaining n-2 values.
57
 Samples 1 and 2 in Figure have identical ranges, yet
when we take into account the observations
between the two extremes, there is much less
variability or dispersion in the second sample
than in the first.
58
Calculating the range for high
temperatures
range = 60 – 32 = 28
59
Variance and Standard Deviation
 The variance s^2 is the sum of the squared deviations
from the mean divided by the number of cases minus 1:
s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}
 The standard deviation s is the square root of the
variance:
s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}
It is a measure of “spread”
The larger the deviations (positive or negative), the larger the
variance
The sum of all individual deviations (y_i - \bar{y}) is zero, because \sum y_i = n\bar{y}:
\sum (y_i - \bar{y}) = \sum y_i - n\bar{y} = 0
60
Variance (for a sample)
 Steps:
◦ Compute each deviation
◦ Square each deviation
◦ Sum all the squares
◦ Divide by the data size (sample size) minus
one: n-1
61
Justification for “n-1”
 The value of population mean is almost never known, so
the sum of squared deviations about sample mean must
be used.
 But the sample values tend to be closer to their average
(sample mean) than to the population mean, so to compensate
for this the divisor n-1 is used rather than n.
 If we use a divisor “n” in the sample variance, then the
resulting quantity would tend to underestimate population
variance
 Dividing by the slightly smaller n - 1 corrects this
underestimation.
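The argument can be checked exactly on a toy case. The sketch below assumes a made-up population {1, 2, 3} and samples of size 2 drawn with replacement, so that all 9 ordered samples are equally likely. It averages the two candidate variance formulas over every possible sample, using exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # hypothetical tiny population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance

def sum_sq_dev(sample):
    """Sum of squared deviations about the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

n = 2
samples = list(product(population, repeat=n))  # all 9 equally likely samples

# Average of each estimator over all possible samples.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)  # 2/3 1/3 2/3
```

With divisor n the estimator averages 1/3, below the true variance 2/3. With divisor n - 1 it averages exactly 2/3.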
62
Standard deviation
 The unit for s is the same as the unit for each
of the observations.
 If, for example, the observations are fuel
efficiencies in miles per gallon, then we
might have s = 2.0 mpg.
 A rough interpretation of the sample
standard deviation is that it is the size of a
typical or representative deviation from
the sample mean within the given
sample.
63
Standard deviation
 If s = 2.0 mpg, then some data values are
closer than 2.0 mpg to the mean, whereas others
are farther away
 2.0 mpg is a representative (or “standard”)
deviation from the mean fuel efficiency.
64
Calculating the variance and standard deviation
Date      High Temperature    Difference (X - mean)    Difference Squared
2-Jan     59                   14.80                   219.04
3-Jan     60                   15.80                   249.64
4-Jan     43                   -1.20                     1.44
5-Jan     42                   -2.20                     4.84
6-Jan     35                   -9.20                    84.64
7-Jan     32                  -12.20                   148.84
8-Jan     32                  -12.20                   148.84
9-Jan     46                    1.80                     3.24
10-Jan    41                   -3.20                    10.24
11-Jan    52                    7.80                    60.84
Sum       442                                          931.60
n         10
Mean      44.2
s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{931.60}{10 - 1} = 103.51
s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{103.51} = 10.2
65
Calculating Variance and Standard Deviation
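The four steps (compute each deviation, square it, sum the squares, divide by n - 1) can be reproduced for the high-temperature data with a short Python sketch. The variable names are chosen here for illustration.

```python
from math import sqrt

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

n = len(high_temps)
mean = sum(high_temps) / n

deviations = [y - mean for y in high_temps]  # step 1: each deviation
sum_sq = sum(d ** 2 for d in deviations)     # steps 2-3: square and sum
variance = sum_sq / (n - 1)                  # step 4: divide by n - 1
std_dev = sqrt(variance)

print(round(sum_sq, 2), round(variance, 2), round(std_dev, 1))
# 931.6 103.51 10.2
```

The deviations themselves add up to zero (apart from floating-point rounding), which is why they are squared before summing.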
66
Coefficient of Variation (CV)
 Also called coefficient of dispersion
CV = \frac{\text{std dev}}{\text{mean}} = \frac{s}{\bar{y}}
 Measure the variation relative to mean
 Values of this coefficient for several different
data sets can be compared to determine which
data set exhibits more or less variation
 Because the coefficient of variation is unitless,
you can use it instead of the standard deviation
to compare the spread of data sets that have
different units or different means.
67
Example:
 Comparing the variation in average volume of
concrete produced by large and small machines
 The mean concrete volume produced at one time
by small machine is 1 ton with a standard deviation
of 0.08 ton.
 The mean concrete volume produced at one time
by large machine is 16 tons with a standard
deviation of 0.4 ton.
 Although the standard deviation of the large
machine is five times the standard
deviation of the small one, their coefficients of
variation (CVs) support a different conclusion:
68
 CV (Large Machine)
= 100 * 0.4 ton / 16 tons = 2.5 %
 CV(Small Machine)
= 100 * 0.08 ton / 1 ton = 8 %
 The CV of the small machine is more than
three times greater than that of the large
machine.
 Although the large machine has a greater

standard deviation, the small machine has
much more variability relative to its mean.
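The comparison above takes only a few lines to reproduce. This is a minimal sketch, with the CV expressed as a percentage as in the example. The function name is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

cv_large = coefficient_of_variation(0.4, 16)  # large machine, tons
cv_small = coefficient_of_variation(0.08, 1)  # small machine, tons

print(f"{cv_large:.1f}% {cv_small:.1f}%")  # 2.5% 8.0%
```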
69
