DA Lec3

10/1/2019
3-4. Descriptive statistics
[2] CHAP.2
[3] CHAP.3
Data Analysis 10/1/2019
Outline
 Sampling
 Graphical Summaries
 Summary Statistics
The basic idea
 Statistical methods of data analysis make inferences about
a population by studying a relatively small sample chosen from it.
 Descriptive statistics: used to report on populations and samples.
Sample vs. Population
[Figure: a population and a sample drawn from it]
Sampling
 A population is the entire collection of objects or outcomes about which
information is sought.
 A sample is a subset of a population, containing the objects or outcomes
that are actually observed.
 A simple random sample of size n is a sample chosen by a method in
which each collection of n population items is equally likely to comprise
the sample, just as in a lottery.
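The lottery idea can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the population of 1000 numbered items, the helper name `simple_random_sample`, and the seed are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw a simple random sample of size n, without replacement.

    random.sample gives every collection of n items the same chance of
    being chosen, which is the defining property of a simple random sample.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(list(population), n)

# Hypothetical population: items numbered 1..1000.
population = range(1, 1001)
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```

Taking the first 10 items of the list instead would be a sample of convenience, not a simple random sample.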
Sampling
 A sample of convenience is a sample that is not drawn by a well-defined
random method.
Ex: an engineer might construct a sample simply by taking 10 blocks off
the top of the pile.
Independence
 The items in a sample are independent if knowing the values of
some of the items does not help to predict the values of the others.
 Items in a simple random sample may be treated as independent
in many cases encountered in practice: when the population is very
large, sampling without replacement is nearly equivalent to
sampling with replacement. The exception occurs when the
population is finite, and the sample comprises a substantial
fraction (more than 5%) of the population.
Other Sampling Methods
 Weighted sampling: some items are given a greater chance of
being selected than others.
 Stratified random sampling: the population is divided up into
subpopulations, called strata, and a simple random sample is
drawn from each stratum.
 Cluster sampling: items are drawn from the population in groups, or
clusters. Useful when the population is too large or too spread out
for simple random sampling to be feasible.
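As a rough sketch of stratified random sampling, the code below splits a population into strata and draws a simple random sample from each one. The student records, the stratum labels, and the helper name `stratified_sample` are hypothetical and serve only to illustrate the idea.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, n_per_stratum, seed=None):
    """Draw a simple random sample of n_per_stratum items from each stratum.

    stratum_of maps an item to the name of its stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    # Sample within each stratum separately (capped at the stratum size).
    return {name: rng.sample(members, min(n_per_stratum, len(members)))
            for name, members in strata.items()}

# Hypothetical population: 300 students, each tagged with a major.
students = [(f"s{i:03d}", "engineering" if i % 3 == 0 else "science")
            for i in range(300)]
picked = stratified_sample(students, lambda s: s[1], 5, seed=1)
for major, group in picked.items():
    print(major, [name for name, _ in group])
```

Every stratum is guaranteed to appear in the result, which a single simple random sample of the whole population does not guarantee.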
Types of Experiments
 One-sample: only one population of interest, and a single sample is
drawn from it.
 Multisample: two or more populations of interest, and a sample is
drawn from each population.
 Factorial experiments: the populations are distinguished from one
another by the varying of one or more factors that may affect the
outcome. The goal is to determine how varying the levels of the
factors affects the outcome being measured.
Types of Data
 numerical or quantitative
(how much or how many)
 categorical or qualitative
Which data are numerical, and which data are categorical?

Controlled Experiments and Observational Studies
 Controlled experiments: designed to determine the effect of changing
one or more factors on the value of a response.
 Ex: run the process several times, changing the concentrations each time, and
compare the yields that result.
 Observational study: the experimenter cannot control the levels of the factors.
 The experimenter simply observes the levels of the factors as they are, without
having any control over them.
Exercises
 If you wanted to estimate the mean height of all the students at a
university, which one of the following sampling strategies would be best?
Why? Note that none of the methods is a true simple random sample.
 Measure the heights of 50 students found in the gym during basketball
intramurals.
 Measure the heights of all engineering majors.
 Measure the heights of the students selected by choosing the first name on
each page of the campus phone book.
Descriptive Statistics
An Illustration:
Which Group is Smarter?
Class A--IQs of 13 Students    Class B--IQs of 13 Students
102  115                       127  162
128  109                       131  103
131   89                        96  111
 98  106                        80  109
140  119                        93   87
 93   97                       120  105
110                            109
Each individual may be different. If you try to understand a group by remembering the qualities of
each member, you become overwhelmed and fail to understand the group.
Descriptive Statistics
Which group is smarter now?
Class A--Average IQ Class B--Average IQ
110.54 110.23
They’re roughly the same!
With a summary descriptive statistic, it is much easier to answer our question.
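The two averages can be reproduced from the 13 IQ scores listed for each class. The snippet below is a small sketch using Python's standard `statistics` module; the variable names are chosen for this example only.

```python
from statistics import mean

# IQ scores from the two class lists in the illustration.
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

mean_a = mean(class_a)
mean_b = mean(class_b)
print(f"Class A: {mean_a:.2f}, Class B: {mean_b:.2f}")
# prints "Class A: 110.54, Class B: 110.23"
```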
Types of descriptive statistics:
 Organize Data (graphical summaries)
 Tables
 Graphs
 Summarize Data (Summary Statistics)
 Central Tendency
 Variation
Types of descriptive statistics:

 Organize Data
 Tables
 Frequency Distributions
 Relative Frequency Distributions
 Graphs
 Bar Chart or Histogram
 Stem and Leaf Plot
 Frequency Polygon
Frequency Distribution and Relative Frequency Distribution
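Ex: a class with frequency 12 in a sample of 62 has relative frequency 12/62 = 0.1935.
With class width = 2, its density is 0.1935/2 = 0.0968.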
Source: [1] William Navidi: Statistics for Engineers and Scientists, McGrawHill, 4th Edition, 2015.
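The 12/62 and 0.1935/2 calculations follow a fixed recipe: relative frequency = frequency / sample size, and density = relative frequency / class width. The sketch below applies that recipe to a small made-up data set; the data, the class boundaries, and the helper name `frequency_table` are assumptions for illustration, not the table from [1].

```python
def frequency_table(data, edges):
    """Tabulate frequency, relative frequency, and density for each class.

    Class i covers edges[i] <= x < edges[i + 1].
    """
    n = len(data)
    rows = []
    for lo, hi in zip(edges, edges[1:]):
        freq = sum(1 for x in data if lo <= x < hi)
        rel = freq / n
        rows.append({
            "class": (lo, hi),
            "frequency": freq,
            "relative_frequency": rel,
            "density": rel / (hi - lo),  # relative frequency per unit width
        })
    return rows

# Hypothetical measurements and class boundaries.
data = [1.2, 1.9, 2.4, 3.1, 3.3, 3.8, 4.5, 5.0, 5.6, 7.9]
table = frequency_table(data, edges=[1, 3, 5, 9])
for row in table:
    print(row)
```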
Histogram
Unequal class widths
To construct a histogram:
 Draw a rectangle for each class. If the classes
all have the same width, the heights of the
rectangles may be set equal to the
frequencies, the relative frequencies, or the
densities. If the classes do not all have the same
width, the heights of the rectangles must be set
equal to the densities.
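A short sketch shows why densities are required when class widths differ: with density heights, the area of each rectangle equals the relative frequency of its class, however wide the class is. The frequencies and class boundaries below are hypothetical, as is the helper name `density_heights`.

```python
def density_heights(frequencies, edges):
    """Return (heights, areas) for histogram rectangles drawn as densities.

    density = relative frequency / class width, so each rectangle's area
    equals the relative frequency of its class.
    """
    n = sum(frequencies)
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    heights = [f / n / w for f, w in zip(frequencies, widths)]
    areas = [h * w for h, w in zip(heights, widths)]
    return heights, areas

# Hypothetical classes [0, 1), [1, 2), [2, 6): the last is four times wider.
frequencies = [10, 20, 10]
edges = [0, 1, 2, 6]
heights, areas = density_heights(frequencies, edges)
print(heights)  # [0.25, 0.5, 0.0625]
print(areas)    # [0.25, 0.5, 0.25]
```

The first and last classes hold the same number of items, so plotting raw frequencies would give them equal heights and make the wide class look four times as important as it is.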
Unimodal histogram: has only one peak, or mode.
[Figure: two unimodal histograms, one negatively skewed and one positively skewed]
A bimodal histogram
 has two clearly distinct modes
Stem and Leaf Plot
 Each item in the sample is divided into
two parts: a stem, consisting of the
leftmost one or two digits, and the leaf,
which consists of the next digit.
 Ex: 42, 45, 49  4 | 2 5 9
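The example can be generalized with a few lines of code. This is a minimal sketch that assumes non-negative whole numbers, taking the last digit as the leaf and the remaining leading digits as the stem; the helper name `stem_and_leaf` and the extra values are illustrative only.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative integers.

    The leaf is the last digit; the stem is everything before it.
    """
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(plot.items())]

print(stem_and_leaf([42, 45, 49]))  # ['4 | 2 5 9']

# A slightly larger, hypothetical sample.
for row in stem_and_leaf([42, 45, 49, 51, 38, 57, 45]):
    print(row)
```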
Dotplots
Exercises
 The weather in Los Angeles is dry most of the time,
but it can be quite rainy in the winter. The rainiest
month of the year is February. The following table
presents the rainfall in Los Angeles, in
inches, for each February from 1965 to 2006.
 0.2   3.7   1.2  13.7   1.5   0.2   1.7
 0.6   0.1   8.9   1.9   5.5   0.5   3.1
 3.1   8.9   8.0  12.7   4.1   0.3   2.6
 1.5   8.0   4.6   0.7   0.7   6.6   4.9
 0.1   4.4   3.2  11.0   7.9   0.0   1.3
 2.4   0.1   2.8   4.9   3.5   6.1   0.1
a. Construct a stem-and-leaf plot for these data.
b. Construct a histogram for these data.
c. Construct a dotplot for these data.
d. Construct a boxplot for these data. Does the
boxplot show any outliers?
Summarizing Data:
 Central Tendency (or Groups’ “Middle Values”)

 Mean
 Median
 Mode
 Variation (or Summary of Differences Within Groups)

 Range
 Interquartile Range
 Variance
 Standard Deviation
Mean
Mean
1. Means can be badly affected by outliers (data points with extreme
values unlike the rest)
2. Outliers can make the mean a bad measure of central tendency or
common experience
[Figure: Income in the U.S. The mean is pulled away from "All of Us" toward the outlier, Bill Gates.]
Median
The middle value when a variable’s values are ranked in order;
the point that divides a distribution into two equal halves.
When data are listed in order, the median is the point at which
50% of the cases are above and 50% below it.
The 50th percentile.
Median
1. The median is largely unaffected by outliers, making it a better measure of
central tendency than the mean, and a better description of the “typical person”,
when data are skewed.
[Figure: Income in the U.S. The median stays with "All of Us" and is not pulled toward the outlier, Bill Gates.]
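The contrast between mean and median can be checked numerically. The incomes below are invented for illustration (they are not real data), and one extreme value is appended to play the role of the outlier.

```python
from statistics import mean, median

# Hypothetical annual incomes for seven people.
incomes = [32_000, 41_000, 45_000, 52_000, 58_000, 63_000, 70_000]
# The same group plus one extreme outlier.
with_outlier = incomes + [10_000_000_000]

print(f"without outlier: mean={mean(incomes):,.0f}  median={median(incomes):,.0f}")
print(f"with outlier:    mean={mean(with_outlier):,.0f}  median={median(with_outlier):,.0f}")
```

Adding the outlier moves the median only from 52,000 to 55,000, while the mean jumps above one billion and no longer describes anyone in the group.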
Range
The spread, or the distance, between the lowest and highest values of a variable.
To get the range for a variable, you subtract its lowest value from its highest value.
Class A--IQs of 13 Students    Class B--IQs of 13 Students
102  115                       127  162
128  109                       131  103
131   89                        96  111
 98  106                        80  109
140  119                        93   87
 93   97                       120  105
110                            109
Class A Range = 140 - 89 = 51 Class B Range = 162 - 80 = 82
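The two ranges can be verified directly from the scores. This is a small sketch; the helper name `value_range` is chosen for the example.

```python
def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

print(value_range(class_a))  # 51
print(value_range(class_b))  # 82
```

Although the two classes have nearly the same mean, Class B is spread over a much wider range.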
Interquartile Range (IQR)
A quartile is a value that marks one of the divisions that break a series of values into four equal parts.
The median is a quartile and divides the cases in half.
The 25th percentile is a quartile that divides the first ¼ of cases from the latter ¾.
The 75th percentile is a quartile that divides the first ¾ of cases from the latter ¼.
The interquartile range is the distance between the 25th percentile and the 75th percentile. Below, what is the
interquartile range?
[Figure: a scale from 0 to 1000 marked at 250, 500, and 750; each of the four segments holds 25% of cases]
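For a scale like the one pictured, the interquartile range is 750 - 250 = 500. The sketch below reproduces this with `statistics.quantiles` on hypothetical data spread evenly from 0 to 1000. Textbooks and software use slightly different quartile conventions; the `"inclusive"` method used here is one of them and is an assumption of this example.

```python
from statistics import quantiles

# Hypothetical data: the whole numbers 0, 1, ..., 1000, evenly spread.
data = list(range(0, 1001))

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
print(q1, q2, q3)  # 250.0 500.0 750.0
print(iqr)         # 500.0
```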
Variance
A measure of the spread of the recorded values on a variable. A
measure of dispersion.
The larger the variance, the further the individual cases tend to be from the
mean.
The smaller the variance, the closer the individual scores tend to be to the
mean.
Variance
Coefficient of Variation
 The coefficient of variation is a relative measure of variability; it measures
the standard deviation relative to the mean.
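As a sketch of how these measures fit together, the code below computes the sample variance, the sample standard deviation, and the coefficient of variation (standard deviation divided by mean) for the two IQ classes used earlier. It assumes the usual sample formulas with an n - 1 divisor, which is what `statistics.variance` and `statistics.stdev` compute; the helper name `coefficient_of_variation` is introduced here for illustration.

```python
from statistics import mean, stdev, variance

def coefficient_of_variation(values):
    """Sample standard deviation relative to the sample mean."""
    return stdev(values) / mean(values)

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

for name, scores in [("Class A", class_a), ("Class B", class_b)]:
    s2 = variance(scores)  # sum of squared deviations / (n - 1)
    s = stdev(scores)      # square root of the sample variance
    cv = coefficient_of_variation(scores)
    print(f"{name}: variance={s2:.2f}  sd={s:.2f}  CV={cv:.3f}")
```

The two classes have almost the same mean, so the larger coefficient of variation for Class B reflects its larger spread.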
Exercises
Q1. A sample of 100 adult women was taken, and each was asked how many children she
had. The results were as follows:
a. Find the sample mean number of children.

b. Find the sample standard deviation of the number of children.
c. Find the sample median of the number of children.
d. What is the first quartile of the number of children?
e. What proportion of the women had more than the mean number of children?
f. For what proportion of the women was the number of children more than one standard
deviation greater than the mean?
g. For what proportion of the women was the number of children within one standard
deviation of the mean?
Exercises
Q2. A bowler’s scores for six games were 182, 168, 184, 190, 170, and 174. Using these
data as a sample, compute the following descriptive statistics.
a. Range
b. Variance
c. Standard deviation
d. Coefficient of variation
Q3. The Los Angeles Times regularly reports the air quality index for various areas of
Southern California. A sample of air quality index values for Pomona provided the
following data: 28, 42, 58, 48, 45, 55, 60, 49, and 50.
a. Compute the range and interquartile range.
b. Compute the sample variance and sample standard deviation.
c. A sample of air quality index readings for Anaheim provided a sample mean of
48.5, a sample variance of 136, and a sample standard deviation of 11.66. What
comparisons can you make between the air quality in Pomona and that in Anaheim
on the basis of these descriptive statistics?
Reading
 [2] 4
 [3] 7
