Descriptive Statistics Lecture 1
STAT – 835 Probability and Statistics
Planned Curriculum
Miscellaneous Course Information
STAT– 835: Probability and Statistics
Time and Location: Thu 5:00 – 6:30 PM, Mon 6:45 – 8:15
Instructor: Dr. Kamran Ahmed
Contact : (051)9085-4153; Mob: 0301-5630831
Email: buet99@hotmail.com
Office Hours for Students: By appointment
Textbook: Probability and Statistics for Engineering and the Sciences by Jay L. Devore (6th
Edition)
Exams: There will be two class tests (one hour each) and one final examination (3
hours). The final examination will be held during the final exam week and covers the
entire course.
Homework: Homework will be given after completion of each major topic (a total of 4 or 5
homework assignments)
Quiz and Attendance:
There will be 3 to 6 quizzes, including a couple of pop quizzes in class. Students are
expected to attend all classes. Poor attendance will affect the final grade of
students.
Final Grade: Final grade will depend on the following components with the
proportions mentioned against each (subject to variation):
Homework (5%), Quiz (10%), Class Tests (30%), Term project (15%) and Final exam
(40%).
DESCRIPTIVE STATISTICS
Population and Sample and Processes
Engineers and scientists are constantly exposed to
collections of facts, or data
A census is usually impractical or infeasible. Why?
Constraints on time, money and other scarce resources
Instead, a subset of the population, a sample, is selected in
some prescribed manner (e.g. 50 students selected at random
out of 500 graduates)
In order to draw inferences/conclusions about a
population, certain characteristics of the objects in the
population are investigated (e.g. age, gender, GPA); each
is a categorical or numerical variable
A variable is any characteristic whose value may change
from one object to another
Univariate, bivariate and multivariate data sets
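The "50 out of 500" example can be sketched in a few lines of Python. This is only an illustration: the graduate ID numbers and the seed are assumptions made here, not part of the course material.

```python
import random

# Hypothetical population: ID numbers 1..500 standing in for 500 graduates.
population = list(range(1, 501))

# A fixed seed (an arbitrary choice) makes the draw reproducible.
rng = random.Random(42)

# Simple random sample of n = 50, drawn without replacement.
sample = rng.sample(population, k=50)

print("population size N =", len(population))
print("sample size n =", len(sample))
print("first five sampled IDs:", sample[:5])
```

Because the draw is without replacement, no graduate can appear in the sample twice.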
Univariate, Bivariate, and Multivariate Data
Depending on how many variables we are
measuring on the individuals or objects in our
sample, we will have one of the following three
types of data sets:
◦ Univariate: Measurements made on only one variable
per observation.
◦ Bivariate: Measurements made on two variables per
observation.
◦ Multivariate: Measurements made on more than two
variables per observation.
Population and Sample
Population: The entire collection of individuals or
measurement objects about which information is desired
e.g. all 5-year-old children in Pakistan, when the average height of such children is of interest
Census and Inference
Census: Complete enumeration of population units.
Parameter and Statistic
Parameter: Any statistical characteristic of a
population. Population mean, population median,
population standard deviation are examples of
parameters.
Some Differences between Population and Sample
                         POPULATION         SAMPLE
Size                     Large              Small
Size notation            N                  n
Easy to collect data?    No                 Yes
Term used to describe    A “parameter”      A “statistic”
its nature               e.g., μ, σ         e.g., x̄, s
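                         POPULATION         SAMPLE
Mean (notation)          μ                  x̄
Std deviation            σ                  s
(notation)
Mean (formula)           $\mu = \frac{\sum x}{N}$          $\bar{x} = \frac{\sum x}{n}$
Variance (formula)       $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$          $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$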
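As a rough illustration of the two variance formulas, the sketch below treats the same numbers first as a complete population (divisor N) and then as a sample (divisor n - 1). The eight data values are hypothetical and chosen only because the arithmetic is easy to follow.

```python
from statistics import mean, pvariance, pstdev, variance, stdev

# Hypothetical measurements, used only to illustrate the formulas.
x = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating x as a complete population: squared deviations are divided by N.
mu = mean(x)
sigma2 = pvariance(x)
sigma = pstdev(x)

# Treating x as a sample from a larger population: divide by n - 1 instead.
x_bar = mean(x)
s2 = variance(x)
s = stdev(x)

print(f"population: mean={mu}, variance={sigma2}, std dev={sigma}")
print(f"sample:     mean={x_bar}, variance={s2:.3f}, std dev={s:.3f}")
```

The two means coincide; only the divisor in the variance changes, so the sample variance comes out slightly larger than the population variance for the same numbers.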
Statistics!
What is it? What does it involve?
The art or science of making confident conclusions about the
attributes of a system or collection of systems
Involves:
- taking a small sample from a larger set (Sampling)
- analyzing data from the small sample (Data analysis)
- testing the hypotheses to ascertain if true (Hypothesis
Testing)
- making conclusions about the larger set (Statistical
Inference)
- presenting your findings to an audience (Information
Delivery)
Prelude to Statistics
Some of the questions we may be required to answer as
civil engineers:
- What is the strength of concrete being used in
constructing a certain structure?
(Construction/Materials Engineering)
- How many of the steel I-sections provided by a certain
supplier have a lower-than-specified strength?
(Structural Engineering)
Therefore:
where
in order to
Because we draw the sample from the population, the
sample is called a subset of the population (Recall
Set Theory)
Sample
Population
Ideally, we seek a sample that is a miniature copy of
the population.
Important Questions …
Every engineer involved in the statistical analysis of his/her system hopes
that his/her sample is a good representative of the population.
POPULATION                SAMPLE
Parameters: μ, σ          Statistics: x̄, s
Back to “Important Questions, #1”
Methods of Random Sampling
There are 4 major ways of drawing a sample so
that it is random and yet remains a true
miniature copy of the population:
Systematic Random Sampling
Stratified Random Sampling
[Diagram: the MAIN POPULATION with four SAMPLE groups drawn from it]
Any example?
In Summary ...
- We can afford to take only a small sample
from a large population of systems or system
components in order to investigate the
population.
Sample
Population
Introduction to Statistics
Descriptive
  Graphical
    - Dot Plots
    - Scatter Plots
    - Box Plots
    - Stem-and-leaf Plots
    - Bar Charts/Histograms
  Non-graphical
    - Central Tendency
    - Dispersion/Variance
    - Range
    - Shape
Inferential
  - Point Estimation
  - Hypothesis Testing
  - Confidence Interval
  - Statistical Regression
Descriptive Statistics
◦ Statistical procedures used to summarise,
organise, and simplify data. This process should be
carried out in such a way that reflects overall
findings
Raw data is made more manageable
Raw data is presented in a logical form
Patterns can be seen from organised data
Frequency tables
Graphical techniques
Measures of Central Tendency
Measures of Spread (variability)
Descriptive Measures
Central Tendency measures. They are
computed to give a “center” around which the
measurements in the data are distributed.
(Average/Mean/Median)
Measures of Central Tendency/Measures of Location
Mean:
Sum of all measurements divided by the number
of measurements.
Median:
A number such that at most half of the
measurements are below it and at most half of the
measurements are above it.
Mode:
The most frequent measurement in the data.
Mean
Sum of the values divided by the number
of cases
$\bar{y} = \frac{\sum y_i}{n}$
Summation notation
The yi (y1, y2, …, yn) are the n values of the
variable Y
The sum of the values is then denoted as
$\sum y = \sum y_i = \sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$
Calculating the mean for high temperatures
Date      High Temperature
2-Jan     59
3-Jan     60
4-Jan     43
5-Jan     42
6-Jan     35
7-Jan     32
8-Jan     32
9-Jan     46
10-Jan    41
11-Jan    52
Sum       442
Add values: $\sum y_i = 442$
Number of cases: $n = 10$
Calculate mean: $\bar{y} = \frac{\sum y_i}{n} = \frac{442}{10} = 44.2$
Notice that every single observation enters into the computation
of the mean.
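The same calculation can be checked with a short Python sketch. The variable names are illustrative choices; the ten values are the high temperatures for 2-Jan through 11-Jan.

```python
# High temperatures for 2-Jan through 11-Jan, in date order.
high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

total = sum(high_temps)   # add values
n = len(high_temps)       # number of cases
y_bar = total / n         # mean = sum / n

print("sum  =", total)
print("n    =", n)
print("mean =", y_bar)
```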
Median
A number such that half of the measurements
are below it and half of the measurements
are above it
The median represents the middle of the ordered
sample data
When the sample size is odd, the median is the
middle value
When the sample size is even, the median is the
mean of the two middle values
Calculating the median for high temperatures
n = 10
Date      High Temperature
7-Jan     32
8-Jan     32
6-Jan     35
10-Jan    41
5-Jan     42   <=== Middle value
4-Jan     43   <=== Middle value
9-Jan     46
11-Jan    52
2-Jan     59
3-Jan     60
Median: $\tilde{y} = \frac{42 + 43}{2} = 42.5$
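A minimal sketch of the median rule (sort, then take the middle value or the mean of the two middle values), applied to the same ten temperatures. The variable names are illustrative.

```python
from statistics import median

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

ordered = sorted(high_temps)
n = len(ordered)
mid = n // 2

if n % 2 == 1:
    # Odd sample size: the single middle value.
    med = ordered[mid]
else:
    # Even sample size: mean of the two middle values.
    med = (ordered[mid - 1] + ordered[mid]) / 2

print("ordered data:", ordered)
print("median:", med)
print("agrees with statistics.median:", med == median(high_temps))
```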
Comparison of mean and median
Mean
◦ Uses all of the data
◦ Affected by extreme high or low values (outliers)
Median
◦ May not necessarily use all data
◦ Not affected by outliers
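To see the difference in sensitivity, the sketch below introduces one extreme value into the temperature data. The outlier is hypothetical: it assumes the 60 on 3-Jan was mistakenly recorded as 600.

```python
from statistics import mean, median

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

# Hypothetical recording error: 60 entered as 600.
with_outlier = [600 if t == 60 else t for t in high_temps]

mean_before, median_before = mean(high_temps), median(high_temps)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

The single bad value more than doubles the mean, while the median stays where it was because the two middle values are unchanged.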
Mode
The most frequent measurement in the
data.
Example:
10,9,8,4,12,10,23,10,33,16,10,20,10
The mode is 10, which occurs five times.
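Counting how often each value occurs gives the mode directly. The sketch below uses the example data from this slide; note that `most_common(1)` reports only one value, so data with several equally frequent values would need extra handling.

```python
from collections import Counter

data = [10, 9, 8, 4, 12, 10, 23, 10, 33, 16, 10, 20, 10]

# Tally the frequency of every distinct measurement.
counts = Counter(data)

# The most frequent value and the number of times it occurs.
mode_value, mode_count = counts.most_common(1)[0]

print("mode:", mode_value)
print("occurrences:", mode_count)
```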
Percentiles
Example:
If the 85th percentile of a data set is 340,
then 85% of the measurements in the data are
below 340 and 15% of the measurements are
above 340.
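The interpretation can be checked on a small example. The data below (the integers 1 to 100) are hypothetical, and the "inclusive" interpolation method is only one of several percentile conventions; other conventions give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical data set: 100 measurements with values 1, 2, ..., 100.
data = list(range(1, 101))

# quantiles(..., n=100) returns the 99 cut points between percentiles;
# index 84 is the 85th percentile.
p85 = quantiles(data, n=100, method="inclusive")[84]

share_below = sum(1 for v in data if v < p85) / len(data)
share_above = sum(1 for v in data if v > p85) / len(data)

print("85th percentile:", p85)
print("share of measurements below:", share_below)
print("share of measurements above:", share_above)
```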
Trimmed Mean
Mean is greatly influenced by outliers while median is
not.
Extreme behavior of either type might be
undesirable.
Consider alternative measures that are neither as
sensitive as mean nor as insensitive as median
A trimmed mean is a compromise between
mean and median.
A 10% trimmed mean, for example, would be
computed by eliminating the smallest 10% and the
largest 10% of the sample and then averaging what
is left over.
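A minimal sketch of a trimmed mean, applied to the ten high temperatures used in the mean and median examples. The function name and the rule of rounding the number of trimmed observations down are assumptions made for illustration; textbooks differ on how to handle percentages that do not give a whole number of observations.

```python
def trimmed_mean(values, percent):
    """Drop the smallest and largest `percent`% of values, then average the rest."""
    ordered = sorted(values)
    # Number of observations to drop from each end (rounded down).
    k = (len(ordered) * percent) // 100
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

# With n = 10, a 10% trim drops one value from each end (32 and 60).
tm10 = trimmed_mean(high_temps, 10)
print("10% trimmed mean:", tm10)
```

For these data the 10% trimmed mean lies between the median (42.5) and the mean (44.2), which is the compromise described above.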
Central Tendency vs Variance
Reporting a measure of center gives only partial
information about a data set or distribution.
Different samples or populations may have
identical measures of center yet differ from one
another in other important ways.
Central Tendency vs Variance
The figure shows dotplots of three samples with the
same mean and median, yet the extent of spread
about the center is different for all three samples.
The first sample has the largest amount of
variability, the third has the smallest amount, and
the second is intermediate to the other two in this
respect.
Measures of variation
Range
Variance and standard deviation
Interquartile range
The primary measures of variability involve
deviations from the mean
Range
The range is the difference between the
maximum and minimum values in a data set
A defect of the range is that it depends
on only the two most extreme
observations and disregards the positions
of the remaining n-2 values.
Central Tendency vs Variance
Samples 1 and 2 in the figure have identical ranges, yet
when we take into account the observations
between the two extremes, there is much less
variability or dispersion in the second sample
than in the first.
Calculating the range for high
temperatures
range = 60 – 32 = 28
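The same figure follows from a short Python check on the temperature data (variable names are illustrative):

```python
high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

# The range uses only the two most extreme observations.
largest = max(high_temps)
smallest = min(high_temps)
data_range = largest - smallest

print(f"range = {largest} - {smallest} = {data_range}")
```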
Variance and Standard Deviation
The variance $s^2$ is the sum of the squared deviations
from the mean divided by the number of cases minus 1
$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$
The standard deviation s is the square root of the
variance
$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$
It is a measure of “spread”
The larger the deviations (positive or negative) the larger the
variance
The sum of all individual deviations $y_i - \bar{y}$ is zero, i.e.
$\sum (y_i - \bar{y}) = 0$
Variance (for a sample)
Steps:
◦ Compute each deviation
◦ Square each deviation
◦ Sum all the squares
◦ Divide by the data size (sample size) minus
one: n-1
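The four steps can be sketched as follows, using the ten high temperatures from the mean example. Variable names are illustrative; the sketch also checks that the deviations themselves sum to zero (up to floating-point rounding).

```python
from math import sqrt

high_temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

n = len(high_temps)
y_bar = sum(high_temps) / n

# Step 1: compute each deviation from the mean.
deviations = [y - y_bar for y in high_temps]

# Step 2: square each deviation.
squared = [d ** 2 for d in deviations]

# Step 3: sum all the squares.
sum_sq = sum(squared)

# Step 4: divide by the sample size minus one.
s2 = sum_sq / (n - 1)
s = sqrt(s2)

print(f"deviations sum to zero: {abs(sum(deviations)) < 1e-9}")
print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"variance s^2 = {s2:.2f}")
print(f"standard deviation s = {s:.2f}")
```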
Justification for “n-1”
The value of population mean is almost never known, so
the sum of squared deviations about sample mean must
be used.
But the sample values tend to be closer to their average
(sample mean) than to the population mean, so to compensate
for this the divisor n-1 is used rather than n.
If we use a divisor “n” in the sample variance, then the
resulting quantity would tend to underestimate population
variance
Dividing by the slightly smaller n - 1 corrects this
underestimating.
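The underestimation can be made concrete with a small enumeration. This is a sketch under stated assumptions: the five-value population is hypothetical, samples of size 2 are drawn with replacement, and every possible sample is listed so that the long-run average of each estimator can be computed exactly.

```python
from itertools import product
from statistics import mean, pvariance

# Hypothetical population whose variance is known exactly.
population = [1, 2, 3, 4, 5]
sigma2 = pvariance(population)

n = 2  # sample size

def sum_sq_dev(sample):
    """Sum of squared deviations about the sample mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)

# Every possible sample of size n, drawn with replacement (25 in total).
samples = list(product(population, repeat=n))

avg_div_n = mean(sum_sq_dev(s) / n for s in samples)
avg_div_n_minus_1 = mean(sum_sq_dev(s) / (n - 1) for s in samples)

print("population variance:", sigma2)
print("average estimate with divisor n:", avg_div_n)
print("average estimate with divisor n - 1:", avg_div_n_minus_1)
```

Averaged over all samples, the divisor n gives a value below the population variance, while the divisor n - 1 recovers it.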
Standard deviation
The unit for s is the same as the unit for each
of the observations.
If, for example, the observations are fuel
efficiencies in miles per gallon, then we
might have s = 2.0 mpg.
A rough interpretation of the sample
standard deviation is that it is the size of a
typical or representative deviation from
the sample mean within the given
sample.
Calculating the variance and standard deviation
Date      High Temperature   Difference (y - mean)   Difference Squared
2-Jan     59                  14.80                   219.04
3-Jan     60                  15.80                   249.64
4-Jan     43                  -1.20                     1.44
5-Jan     42                  -2.20                     4.84
6-Jan     35                  -9.20                    84.64
7-Jan     32                 -12.20                   148.84
8-Jan     32                 -12.20                   148.84
9-Jan     46                   1.80                     3.24
10-Jan    41                  -3.20                    10.24
11-Jan    52                   7.80                    60.84
Sum       442                                         931.60
n         10
Mean      44.2
$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{931.60}{10 - 1} = 103.51$
$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{103.51} = 10.2$
Coefficient of Variation (CV)
Also called coefficient of dispersion
$CV = \frac{\text{std dev}}{\text{mean}} = \frac{s}{\bar{y}}$
Measures the variation relative to the mean
Values of this coefficient for several different
data sets can be compared to determine which
data set exhibits more or less variation
Because the coefficient of variation is unitless,
you can use it instead of the standard deviation
to compare the spread of data sets that have
different units or different means.
Coefficient of Variation (CV)
Example:
Comparing the variation in average volume of
concrete produced by large and small machines
The mean concrete volume produced at one time
by the small machine is 1 ton, with a standard deviation
of 0.08 ton.
The mean concrete volume produced at one time
by the large machine is 16 tons, with a standard
deviation of 0.4 ton.
Although the standard deviation of the large
machine is five times that of the small machine,
their coefficients of variation (CVs) support a
different conclusion:
Coefficient of Variation (CV)
CV (Large Machine)
= 100 * 0.4 ton / 16 tons = 2.5 %
CV(Small Machine)
= 100 * 0.08 ton / 1 ton = 8 %
The CV of the small machine is more than
three times greater than that of the large
machine.
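The comparison can be reproduced with a short sketch. The helper function name is an illustrative choice; the means and standard deviations are the ones quoted for the two machines, and the CV is expressed as a percentage, as on this slide.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: 100 * standard deviation / mean."""
    return 100 * std_dev / mean

# Large machine: mean 16 tons, standard deviation 0.4 ton.
cv_large = coefficient_of_variation(0.4, 16)

# Small machine: mean 1 ton, standard deviation 0.08 ton.
cv_small = coefficient_of_variation(0.08, 1)

ratio = cv_small / cv_large

print(f"CV (large machine) = {cv_large:.1f} %")
print(f"CV (small machine) = {cv_small:.1f} %")
print(f"small / large      = {ratio:.1f}")
```

Because the units cancel in the ratio, the same function works whether the volumes are recorded in tons or in any other unit.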