10/1/2019

3-4. Descriptive statistics

[2] CHAP.2
[3] CHAP.3

Data Analysis 10/1/2019

Outline

• Sampling
• Graphical Summaries
• Summary Statistics

The basic idea

• The basic idea of statistical data analysis is to make inferences about a population by studying a relatively small sample chosen from it.
• Descriptive statistics: methods for reporting on (summarizing) populations and samples.

Sample vs. Population

[Figure: a population and the sample drawn from it]

Sampling

• A population is the entire collection of objects or outcomes about which information is sought.
• A sample is a subset of a population, containing the objects or outcomes that are actually observed.
• A simple random sample of size n is a sample chosen by a method in which each collection of n population items is equally likely to comprise the sample, just as in a lottery.
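
The lottery idea can be sketched in Python. This is only an illustration, not part of the original slides: the population of 1000 numbered items, the function name simple_random_sample, and the seed are assumptions. The standard-library call random.sample draws without replacement, so every collection of n items is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct items so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # the seed is only for reproducibility (assumption)
    return rng.sample(population, n)

# Hypothetical population: items labeled 1..1000
population = list(range(1, 1001))
sample = simple_random_sample(population, n=10, seed=42)
print(sample)
```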

Sampling

• A sample of convenience is a sample that is not drawn by a well-defined random method.

Ex: an engineer who needs a sample of blocks from a pile might construct it simply by taking 10 blocks off the top of the pile.

Independence

• The items in a sample are independent if knowing the values of some of the items does not help to predict the values of the others. This holds when the population is very large, because sampling from it then behaves like sampling with replacement.
• Items in a simple random sample may be treated as independent in many cases encountered in practice. The exception occurs when the population is finite and the sample comprises a substantial fraction (more than 5%) of the population.

Other Sampling Methods

• Weighted sampling: some items are given a greater chance of being selected than others.
• Stratified random sampling: the population is divided into subpopulations, called strata, and a simple random sample is drawn from each stratum.
• Cluster sampling: items are drawn from the population in groups, or clusters. Useful when the population is too large for simple random sampling to be practical.
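
Stratified random sampling can be sketched as follows. The student records, the stratum labels, and the function name stratified_sample are hypothetical, chosen only to illustrate drawing a simple random sample inside each stratum.

```python
import random
from collections import defaultdict

def stratified_sample(items, stratum_of, n_per_stratum, seed=None):
    """Group items into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    # Simple random sample (without replacement) within each stratum
    return {name: rng.sample(members, min(n_per_stratum, len(members)))
            for name, members in strata.items()}

# Hypothetical population: (student_id, class_year) pairs
years = ["first-year", "second-year", "third-year"]
students = [(i, years[i % 3]) for i in range(300)]
samples = stratified_sample(students, stratum_of=lambda s: s[1],
                            n_per_stratum=5, seed=1)
for year, chosen in samples.items():
    print(year, [student_id for student_id, _ in chosen])
```

Taking the same number from every stratum is one possible design choice; sampling each stratum in proportion to its size is another.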

Types of Experiments

• One-sample: there is only one population of interest, and a single sample is drawn from it.
• Multisample: there are two or more populations of interest, and a sample is drawn from each population.
• Factorial experiments: the populations are distinguished from one another by the varying of one or more factors that may affect the outcome. The purpose is to determine how varying the levels of the factors affects the outcome being measured.

Types of Data

• Numerical or quantitative (how much or how many)
• Categorical or qualitative

Which data are numerical, and which data are categorical?


Controlled Experiments and Observational Studies

• Controlled experiments: designed to determine the effect of changing one or more factors on the value of a response.
  - Ex: run the process several times, changing the concentrations each time, and compare the yields that result.
• Observational study: the experimenter cannot control the levels of the factors.
  - The experimenter simply observes the levels of the factors as they are, without having any control over them.

Exercises

• If you wanted to estimate the mean height of all the students at a university, which one of the following sampling strategies would be best? Why? Note that none of the methods are true simple random samples.
  - Measure the heights of 50 students found in the gym during basketball intramurals.
  - Measure the heights of all engineering majors.
  - Measure the heights of the students selected by choosing the first name on each page of the campus phone book.


Descriptive Statistics
An Illustration: Which Group is Smarter?

Class A--IQs of 13 Students    Class B--IQs of 13 Students
102  115                       127  162
128  109                       131  103
131   89                        96  111
 98  106                        80  109
140  119                        93   87
 93   97                       120  105
110                            109

Each individual may be different. If you try to understand a group by remembering the qualities of
each member, you become overwhelmed and fail to understand the group.
Descriptive Statistics

Which group is smarter now?

Class A--Average IQ Class B--Average IQ

110.54 110.23

They’re roughly the same!

With a summary descriptive statistic, it is much easier to answer our question.
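
The two averages can be reproduced with a few lines of Python. The data are the IQ values from the illustration; the helper name mean is our own and simply applies the usual definition (sum of the values divided by their count).

```python
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def mean(values):
    """Sample mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)

print(f"Class A: {mean(class_a):.2f}")  # Class A: 110.54
print(f"Class B: {mean(class_b):.2f}")  # Class B: 110.23
```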

Types of descriptive statistics:

• Organize Data (graphical summaries)
  - Tables
  - Graphs
• Summarize Data (summary statistics)
  - Central Tendency
  - Variation

Descriptive Statistics

Types of descriptive statistics:

• Organize Data
  - Tables
    - Frequency Distributions
    - Relative Frequency Distributions
  - Graphs
    - Bar Chart or Histogram
    - Stem and Leaf Plot
    - Frequency Polygon


Frequency Distribution and Relative Frequency Distribution


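[Frequency table, annotated: relative frequency = frequency / sample size (e.g., 12/62 = 0.1935); density = relative frequency / class width (e.g., with class width = 2, 0.1935/2 ≈ 0.0968).]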

Source: [1] William Navidi: Statistics for Engineers and Scientists, McGrawHill, 4th Edition, 2015.


Histogram

Source: [1] William Navidi: Statistics for Engineers and Scientists, McGrawHill, 4th Edition, 2015.


[Figure: histogram with unequal class widths]
Source: [1] William Navidi: Statistics for Engineers and Scientists, McGrawHill, 4th Edition, 2015.

To construct a histogram:

• Draw a rectangle for each class. If the classes all have the same width, the heights of the rectangles may be set equal to the frequencies, the relative frequencies, or the densities. If the classes do not all have the same width, the heights of the rectangles must be set equal to the densities.
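
The three possible rectangle heights can be computed as in the sketch below. The data, the class boundaries, the function name histogram_table, and the left-closed class convention are all assumptions made for illustration; density is taken as relative frequency divided by class width.

```python
from bisect import bisect_right

def histogram_table(data, edges):
    """Return one (frequency, relative frequency, density) tuple per class.

    Class i covers edges[i] <= x < edges[i + 1]; this left-closed
    convention is an assumption made for the sketch.
    """
    counts = [0] * (len(edges) - 1)
    for x in data:
        if edges[0] <= x < edges[-1]:
            counts[bisect_right(edges, x) - 1] += 1
    n = len(data)
    rows = []
    for i, count in enumerate(counts):
        width = edges[i + 1] - edges[i]
        relative = count / n
        rows.append((count, relative, relative / width))  # density = rel. freq. / width
    return rows

# Hypothetical data with unequal class widths 2, 2, 4, 8
data = [1, 2, 2, 3, 5, 6, 9, 14]
edges = [0, 2, 4, 8, 16]
rows = histogram_table(data, edges)
for (low, high), (freq, rel, dens) in zip(zip(edges, edges[1:]), rows):
    print(f"[{low}, {high}): frequency={freq}, relative={rel:.3f}, density={dens:.4f}")
```

In this example the widest class, [8, 16), has the same frequency as [4, 8) but half its density. Plotting frequencies would make the wide class look as heavy as the narrow one, which is why unequal widths call for densities.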


Unimodal histogram: has only one peak, or mode.

[Figures: a negatively skewed histogram (longer left tail) and a positively skewed histogram (longer right tail)]

A bimodal histogram has two clearly distinct modes.

Stem and Leaf Plot

• Each item in the sample is divided into two parts: a stem, consisting of the leftmost one or two digits, and the leaf, which consists of the next digit.
• Ex: 42, 45, 49 → 4 | 2 5 9
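
A stem-and-leaf plot for small integer data can be sketched as below. The function name stem_and_leaf is our own, and the sketch assumes nonnegative integers whose last digit serves as the leaf; stems with no leaves are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return plot lines; stem = all digits but the last, leaf = last digit."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
            for stem, leaves in sorted(stems.items())]

for line in stem_and_leaf([42, 45, 49]):
    print(line)  # 4 | 2 5 9
```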


Dotplots

Exercises

• The weather in Los Angeles is dry most of the time, but it can be quite rainy in the winter. The rainiest month of the year is February. The following table presents the rainfall in Los Angeles, in inches, for each February from 1965 to 2006.

 0.2   3.7   1.2  13.7   1.5   0.2   1.7
 0.6   0.1   8.9   1.9   5.5   0.5   3.1
 3.1   8.9   8.0  12.7   4.1   0.3   2.6
 1.5   8.0   4.6   0.7   0.7   6.6   4.9
 0.1   4.4   3.2  11.0   7.9   0.0   1.3
 2.4   0.1   2.8   4.9   3.5   6.1   0.1

a. Construct a stem-and-leaf plot for these data.
b. Construct a histogram for these data.
c. Construct a dotplot for these data.
d. Construct a boxplot for these data. Does the boxplot show any outliers?

Descriptive Statistics

Summarizing Data:

• Central Tendency (or Groups’ “Middle Values”)
  - Mean
  - Median
  - Mode
• Variation (or Summary of Differences Within Groups)
  - Range
  - Interquartile Range
  - Variance
  - Standard Deviation


Mean

1. Means can be badly affected by outliers (data points with extreme values unlike the rest).
2. Outliers can make the mean a bad measure of central tendency or common experience.

[Figure: Income in the U.S., with labels "All of Us", "Bill Gates", "Mean", and "Outlier"]
Median

The middle value when a variable’s values are ranked in order; the point that divides a distribution into two equal halves.

When data are listed in order, the median is the point at which
50% of the cases are above and 50% below it.

The 50th percentile.
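
A minimal sketch of the computation, using the Class A and Class B IQ values from the earlier illustration. The function name median is our own; averaging the two middle values when the count is even is the usual convention, although the slide does not state it.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

print(median(class_a))       # 109
print(median(class_b))       # 109
print(median([1, 2, 3, 4]))  # 2.5
```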

Median

1. The median is largely unaffected by outliers, making it a better measure of central tendency than the mean when data are skewed; it better describes the “typical person”.

[Figure: labels "All of Us", "Bill Gates", and "outlier"]
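
The contrast between the two measures can be seen with a small made-up example. The income figures below are hypothetical and chosen only to show the effect of adding a single extreme value.

```python
from statistics import mean, median

# Hypothetical incomes, then the same list plus one extreme value
incomes = [30_000, 35_000, 40_000, 45_000, 50_000]
with_outlier = incomes + [1_000_000_000]

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

Here the mean jumps from 40,000 to roughly 166.7 million, while the median moves only from 40,000 to 42,500.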


Range

The spread, or the distance, between the lowest and highest values of a variable.

To get the range for a variable, you subtract its lowest value from its highest value.

Class A--IQs of 13 Students    Class B--IQs of 13 Students
102  115                       127  162
128  109                       131  103
131   89                        96  111
 98  106                        80  109
140  119                        93   87
 93   97                       120  105
110                            109

Class A Range = 140 - 89 = 51
Class B Range = 162 - 80 = 82
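
The same subtraction in Python, as a quick check. The function name value_range is our own (chosen to avoid shadowing the built-in range).

```python
class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

def value_range(values):
    """Range: highest value minus lowest value."""
    return max(values) - min(values)

print(value_range(class_a))  # 51
print(value_range(class_b))  # 82
```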
Interquartile Range (IQR)

A quartile is the value that marks one of the divisions that breaks a series of values into four equal parts.

The median is a quartile and divides the cases in half.

25th percentile is a quartile that divides the first ¼ of cases from the latter ¾.
75th percentile is a quartile that divides the first ¾ of cases from the latter ¼.

The interquartile range is the distance between the 25th percentile and the 75th percentile. Below, what is the interquartile range?

[Figure: number line from 0 to 1000 with marks at 0, 250, 500, 750, and 1000; each of the four intervals contains 25% of cases]
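
One way to compute the interquartile range in Python is sketched below, using the IQ values from the earlier illustration. Textbooks and software use slightly different rules for locating quartiles; the "inclusive" method of statistics.quantiles used here is one such choice, not necessarily the one intended in the course.

```python
from statistics import quantiles

def interquartile_range(values):
    """75th percentile minus 25th percentile (quartile rule: 'inclusive')."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

print(interquartile_range(class_a))  # 21.0
print(interquartile_range(class_b))  # 24.0
```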

Variance

A measure of the spread of the recorded values on a variable. A measure of dispersion.

The larger the variance, the further the individual cases are from the mean.

[Figure: distribution with the mean marked]

The smaller the variance, the closer the individual scores are to the mean.

[Figure: distribution with the mean marked]


Variance
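
The formula on this slide did not survive in the text, so the sketch below assumes the standard sample variance, s^2 = sum of (x_i - mean)^2 divided by (n - 1), with the sample standard deviation s as its square root. The function names are our own; the data are the IQ values from the earlier illustration.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

for name, data in (("Class A", class_a), ("Class B", class_b)):
    print(f"{name}: variance={sample_variance(data):.2f}, std={sample_std(data):.2f}")
```

Although the two classes have nearly the same mean, Class B has the larger variance, matching its larger range.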
Coefficient of Variation

• The coefficient of variation is a relative measure of variability; it measures the standard deviation relative to the mean.
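
A minimal sketch, assuming the common convention of expressing the coefficient of variation as a percentage, (sample standard deviation / mean) × 100. The slide's own formula is not in the text, so this convention is an assumption; the data are the IQ values from the earlier illustration.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean (assumed convention)."""
    return statistics.stdev(values) / statistics.mean(values) * 100

class_a = [102, 115, 128, 109, 131, 89, 98, 106, 140, 119, 93, 97, 110]
class_b = [127, 162, 131, 103, 96, 111, 80, 109, 93, 87, 120, 105, 109]

print(f"Class A: {coefficient_of_variation(class_a):.1f}%")
print(f"Class B: {coefficient_of_variation(class_b):.1f}%")
```

Because it is unit-free, the coefficient of variation allows comparing the variability of samples measured on different scales or with very different means.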

Exercises

Q1. A sample of 100 adult women was taken, and each was asked how many children she had. The results were as follows:

a. Find the sample mean number of children.
b. Find the sample standard deviation of the number of children.
c. Find the sample median of the number of children.
d. What is the first quartile of the number of children?
e. What proportion of the women had more than the mean number of children?
f. For what proportion of the women was the number of children more than one standard deviation greater than the mean?
g. For what proportion of the women was the number of children within one standard deviation of the mean?
Exercises

Q2. A bowler’s scores for six games were 182, 168, 184, 190, 170, and 174. Using these
data as a sample, compute the following descriptive statistics.
a. Range
b. Variance
c. Standard deviation
d. Coefficient of variation

Q3. The Los Angeles Times regularly reports the air quality index for various areas of
Southern California. A sample of air quality index values for Pomona provided the
following data: 28, 42, 58, 48, 45, 55, 60, 49, and 50.
a. Compute the range and interquartile range.
b. Compute the sample variance and sample standard deviation.
c. A sample of air quality index readings for Anaheim provided a sample mean of
48.5, a sample variance of 136, and a sample standard deviation of 11.66. What
comparisons can you make between the air quality in Pomona and that in Anaheim
on the basis of these descriptive statistics?

Reading

 [2] 4
 [3] 7
