STATISTICS NOTES
CHAPTER 1 - CHAPTER 3
Statistics: A branch of mathematics that deals with the
Collection
Organization
Presentation
Analysis
Interpretation
of information in order to draw conclusions or answer questions.
Population
Is the group to be studied
Includes all of the individuals in the group
Sample
Is a subset of the population
Is often used in analyses because getting access to the entire population is
impractical
Individual
A person or object that is a member of the population being studied.
Descriptive statistics consists of organizing and summarizing the information
collected. It consists of charts, tables, and numerical summaries. (CH 2 – CH 3)
Inferential statistics uses methods that generalize results obtained from a sample to the
population and measure their reliability. (CH 8-CH 9)
Example: How might you choose a random sample of 5 people from a group of 40
people?
3) Use a random number table, graphing calculator, or statistical software to randomly
generate n numbers where n is the desired sample size.
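As one way to carry out this step with statistical software, the sketch below numbers the 40 people 1 through 40 and asks Python's standard `random` module for 5 distinct numbers. The numbering scheme, the function name, and the fixed seed are assumptions made only for illustration.

```python
import random

def simple_random_sample(population_size, n, seed=None):
    """Return n distinct labels chosen at random from 1..population_size."""
    rng = random.Random(seed)               # fix the seed only to reproduce a draw
    labels = range(1, population_size + 1)  # assumed numbering of the individuals
    return sorted(rng.sample(labels, n))    # sampling without replacement

sample = simple_random_sample(40, 5, seed=1)
print(sample)
```

Every individual has the same chance of being chosen, and no one can be chosen twice.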
Activity: Use the following table and simple random sampling to obtain a sample of
size 10 from the class.
Variables are the characteristics of the individuals within the population. The suggested
approaches to analyzing problems vary by the type of variable.
qualitative or categorical: variables whose values are attributes or
characteristics and cannot be ordered, added, subtracted, etc. For instance:
o Gender
o Zip code
o Blood type
o States in the United States
quantitative: variables whose values are numeric and can be ordered,
added, and subtracted, such as:
o Temperature
o Height and weight
o Sales of a product
o Number of children in a family
o The distance that a particular model car can drive on a full tank of gas
o Heights of college students
Sometimes the variable is discrete but has so many close values that it could be
considered continuous. For example:
The number of DVDs rented per year at video stores
The number of ants in an ant colony
Example: Introduction to organizing Data
Given the following random sample of patients’ blood types, answer the following
questions.
(A) According to the data which blood type is the most common? …………………
Consider the ordered data. Are the questions easier to answer now?
(A) According to the data which blood type is the most common? ……………………
O O O O O O O O O O O O O O O
O O O O O O O O O O O O O O O
O O O O O O O O O O O O O O O
O O O O O O O O O O O O O O O
O O O O O O A A A A A A A A A
A A A A A A A A A A A A A A A
A A A A A A A A A A A A A A A
A A A A A A A A A A A A A A B
B B B B B B B B B B B B B B B
B B B AB AB AB AB AB AB AB AB AB AB AB AB
Data that is not organized is referred to as raw data.
Frequency Table
Blood Type   Frequency   Relative Frequency
A            53          35.3%
AB           12           8.0%
B            19          12.7%
O            66          44.0%
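The frequency table can be rebuilt from the ordered data with a short script. The sketch below is one possible approach, not the method used in the text; the function name is an assumption, and the list is generated from the counts (66 O, 53 A, 19 B, 12 AB) rather than typed out value by value.

```python
from collections import Counter

# Blood types from the ordered sample above: 66 O, 53 A, 19 B, 12 AB.
blood_types = ["O"] * 66 + ["A"] * 53 + ["B"] * 19 + ["AB"] * 12

def frequency_table(values):
    """Map each category to (frequency, relative frequency in percent)."""
    counts = Counter(values)
    total = len(values)
    return {cat: (count, round(100 * count / total, 1))
            for cat, count in sorted(counts.items())}

table = frequency_table(blood_types)
for category, (freq, rel) in table.items():
    print(f"{category:<4}{freq:>5}{rel:>8.1f}%")
```

The relative frequency is the frequency divided by the sample size (150 here), expressed as a percent.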
Bar Chart
Pie Chart
(A) According to the data which blood type is the most common?
“From the table we look for the _____________ count”
“From the bar graph we look for the _____________ bar”
“From the pie chart we look for the _____________ wedge”
(B) What percent of the sample has blood type B?
A bar graph is constructed by labeling each category of data on a horizontal axis and
the frequency or relative frequency of the category on the vertical axis. A rectangle of
equal width is drawn for each category whose height is equal to the category's
frequency or relative frequency.
Note: In a bar graph, the rectangles do not touch, to reinforce the idea that the data are
qualitative: the categories are separate and do not lie on a numerical scale.
Activity: Divide the class into 4 groups. Each group will investigate one variable.
Group Variable
1 Gender
2 Eye color
3 Resident/ Non-resident
4 Major
CH 2.3: Organizing Quantitative Data
A histogram is constructed by drawing rectangles for each class of data whose height is
the frequency, relative frequency, or percent of the class. The width of each rectangle
should be the same and they should touch each other.
30121112024222122024
total:
To summarize continuous data in a table, categories are created using intervals of
numbers called classes.
There are many perspectives about how to handle the upper class limit for continuous
data. In this text, the author chooses upper class limits that end in a trailing “.9”.
Degrees          Days
50°–59.9°        1
60°–69.9°        31
70°–79.9°        152
80°–89.9°        163
90°–99.9°        50
100°–109.9°      2
The lower class limit of a class is the smallest (possible) value within the class.
The upper class limit of a class is the largest (possible) value within the class.
For the first class the lower class limit is _______ the upper class limit is _______
For the second class the lower class limit is ______the upper class limit is _______
The class width is the difference between consecutive lower class limits.
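To make the definition of class width concrete, the sketch below takes the lower class limits from the temperature table above and subtracts consecutive ones. The helper function and the reading of 72.4 degrees are hypothetical, included only to show how a value is assigned to its class.

```python
# Lower class limits taken from the temperature table above.
lower_limits = [50, 60, 70, 80, 90, 100]

# Class width: difference between consecutive lower class limits.
widths = [b - a for a, b in zip(lower_limits, lower_limits[1:])]

def lower_limit_of_class(value, limits, width):
    """Return the lower class limit of the class containing value, or None."""
    for low in limits:
        if low <= value < low + width:
            return low
    return None

print(widths)
print(lower_limit_of_class(72.4, lower_limits, 10))  # 72.4 is a made-up reading
```

All of the differences come out equal, which is what it means for the classes to share one class width.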
The following data represent the number of people admitted to local hospitals in the last
year. The size of each hospital is defined by its number of beds.
When a stem-and-leaf plot is constructed, the quantitative data is organized into classes
in a unique table that has a bar graph appearance when completed.
Example: The following table displays the number of days to maturity for 24 short-
term investments.
Construct a stem-and-leaf diagram.
70 64 99 55 64 89 87 65
62 38 67 70 60 69 78 39
75 56 71 51 99 68 95 86
Step 1: Find the smallest item and largest item in the data. These will determine the
smallest and largest stems.
Step 2: Construct the column of stems. If the stems are split, write each stem 2 times.
Otherwise, write each stem 1 time. Draw a vertical line to the right of the stems.
Step 3: Write each leaf corresponding to the stems to the right of the vertical line. The
leaves are to line up vertically.
In an ordered stem-and-leaf diagram the leaves are written in ascending order.
Make the two stem-and-leaf plots ordered.
For split stems:
The stem’s 1st row is for the leaves 0, 1, 2, 3, and 4
The stem’s 2nd row is for the leaves 5, 6, 7, 8, and 9
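The three steps and the split-stem rule can be sketched in code. The example below is an illustration under stated assumptions (two-digit data, tens digit as the stem, ones digit as the leaf; the function name is hypothetical), applied to the 24 days-to-maturity values from the example above.

```python
from collections import defaultdict

# Days to maturity for the 24 short-term investments in the example above.
days = [70, 64, 99, 55, 64, 89, 87, 65,
        62, 38, 67, 70, 60, 69, 78, 39,
        75, 56, 71, 51, 99, 68, 95, 86]

def stem_and_leaf(data, split=False):
    """Return ordered rows (stem, leaves): tens digit as stem, ones digit as leaf."""
    rows = defaultdict(list)
    for value in sorted(data):                    # sorting puts leaves in ascending order
        stem, leaf = divmod(value, 10)
        half = 1 if (split and leaf >= 5) else 0  # 2nd row of a split stem holds leaves 5-9
        rows[(stem, half)].append(leaf)
    halves = (0, 1) if split else (0,)
    stems = range(min(data) // 10, max(data) // 10 + 1)  # smallest to largest stem
    return [(s, rows.get((s, h), [])) for s in stems for h in halves]

for stem, leaves in stem_and_leaf(days):
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Calling `stem_and_leaf(days, split=True)` writes each stem twice, as described for split stems.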
Example: Construct a stem and leaf diagram for the following data: Cholesterol levels
for 20 high-level patients. Make one graph with and one without splitting the stems.
Which do you like better?
210 209 212 208
The data is organized automatically into classes.
You can see the shape of the distribution as you create the stem-and-leaf diagram.
The raw data can be retrieved from the stem-and-leaf plot. However, once a
frequency histogram of continuous data is created, the raw data is lost.
Example: The following table is a stem-and-leaf plot. The stem represents the tens digit
and the leaf represents the ones digit.
 8 | 1 4 7
 9 | 2 2 3 3
10 | 3 4 5 8
11 | 0 1 9
(C) The lower class limit for the first row = _______
The lower class limit for the second row = ________
The class width = __________________________
[Figure: Dotplot for DAYS; horizontal axis DAYS, 40 to 100]
To construct a dotplot
Step 1 – Draw a horizontal line that displays all possible values
Step 2 – Record each observation by placing a dot over the appropriate value on the
horizontal axis
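The two steps can be mimicked in plain text. The sketch below is a rough illustration for whole-number data only: it turns the plot on its side (one line per possible value, one dot per observation), and the small data set and function name are made up for the example.

```python
from collections import Counter

def text_dotplot(data):
    """Return one line per possible value from min to max, one dot per observation."""
    counts = Counter(data)
    return [f"{value:>3} | {'.' * counts[value]}"
            for value in range(min(data), max(data) + 1)]

# Hypothetical observations, used only to illustrate the two steps.
observations = [3, 5, 5, 6, 6, 6, 8]
print("\n".join(text_dotplot(observations)))
```

Values with no observations still get a line, which corresponds to Step 1: the axis shows all possible values, not only those that occur.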
Exercise: Construct a dotplot for the ages of the students in this class:
Construct a parallel dotplot of ages of the students in a statistics class for 2011. They
are:
22, 42, 28, 30, 20, 18, 26, 49, 39, 23, 21, 42
What time do you think this class was offered?
CH 2.4: Distribution Shapes
CH 2.5: Misleading Graphs
Some charts have a vertical scale that is unclear
The scale is possibly not labeled
The zero point of the scale is unclear
In these graphs, the order of the sizes is accurate, but the relative comparisons can be
misleading
In this graph, it is unclear
Where the vertical scale begins (bottom of or top of the shirts)
What the scale increments are
Some charts have a vertical scale that is truncated (the vertical scale does not start at 0)
When the vertical scale starts at a higher number, the differences between the bars are
exaggerated
For some data, magnifying the differences is important
For some data, magnifying the differences is misleading
The two graphs show the same data … the difference seems larger on the first graph
The vertical scale is truncated on the first graph.
a. Construct a misleading time-series plot that indicates that the life expectancy has
risen sharply over time.
Or even more subtle:
● Some charts are made visually more attractive by using symbols and graphics
instead of plain bars and lines
● If one category has twice the frequency of another, that graphic is doubled in size
● If the graphic is three-dimensional, then doubling each dimension
increases the volume by a factor of eight, which is misleading
● The gazebo on the right is twice as large in each dimension as the one on the left
● However, it is much more than twice as large as the one on the left
A measure of center gives a general idea of the “size” of the data.
Consider two basketball teams, Team 1 and Team 2. The players’ heights are listed
below:
Mean of a Data Set: The mean of a data set is the average, the sum of the observations
divided by the number of observations.
Team 1: $\bar{x} = \frac{76+72+78+76+73}{5} = \frac{375}{5} = 75$
Team 2: ________________
Median of a Data Set: The median of the data set is the middle value. To calculate
this:
1. Sort data in increasing order.
2. Determine n = number of observations
3. Determine the observation in the middle of the data set:
a. If n is odd, the median is in position $\frac{n+1}{2}$.
b. If n is even, then $\frac{n+1}{2}$ is not an integer. The median is the average of the
two observations that fall in the positions on either side of $\frac{n+1}{2}$.
Height data: sorted
Team 1: 72, 73, 76, 76, 78
Median is the center value
Place of median: Since n = 5, $\frac{n+1}{2} = \frac{5+1}{2} = 3$
Median = 76
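The Team 1 calculations can be checked with Python's standard `statistics` module. This is only a cross-check of the hand computation; the helper `median_position` is a hypothetical name for the $\frac{n+1}{2}$ rule.

```python
from statistics import mean, median

team1 = [76, 72, 78, 76, 73]   # Team 1 heights from the example above

def median_position(n):
    """1-based position of the median in the sorted data: (n + 1) / 2."""
    return (n + 1) / 2

print(mean(team1))                   # sum of the observations / number of observations
print(median_position(len(team1)))   # a whole number when n is odd
print(median(team1))                 # statistics.median sorts the data internally
```

For an even n, `median_position` returns a value ending in .5, which signals that the two neighboring observations must be averaged.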
Team 2: 67, 72, 76, 76, 84
Example:
Team 1: Mode = 76
Team 2: Mode =
Team 3: Mode =
We presented 3 measures of center: mean, median and mode. Which makes sense if the
data is qualitative?
● Skewed left – the mean will usually be smaller than the median
● Skewed right – the mean will usually be larger than the median
Asking price of homes for sale in Lincoln, NE.
79,995 128,950 149,900 189,900
99,899 130,950 151,350 203,950
105,200 131,800 154,900 217,500
111,000 132,300 159,900 260,000
120,000 134,950 163,300 284,900
121,700 135,500 165,000 299,900
125,950 138,500 174,850 309,900
126,900 147,500 180,000 349,900
The mean asking price is $168,320 and the median asking price is $148,700. Because the
mean is larger than the median, the distribution appears to be skewed right.
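The two summary values can be reproduced from the 32 asking prices listed above. The sketch below is a quick check; comparing mean and median is only a rough indicator of skewness, not a proof.

```python
from statistics import mean, median

# Asking prices of homes for sale in Lincoln, NE, from the table above.
asking_prices = [
    79995, 99899, 105200, 111000, 120000, 121700, 125950, 126900,
    128950, 130950, 131800, 132300, 134950, 135500, 138500, 147500,
    149900, 151350, 154900, 159900, 163300, 165000, 174850, 180000,
    189900, 203950, 217500, 260000, 284900, 299900, 309900, 349900,
]

avg = mean(asking_prices)
mid = median(asking_prices)    # average of the 16th and 17th sorted values
print(round(avg), mid)
print("mean > median" if avg > mid else "mean <= median")
```

A few expensive homes pull the mean up while leaving the median almost untouched.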
CH 3.2: Measures of Variation
Range of a Data Set: difference between its maximum and minimum
Team 1: 72, 73, 76, 76, 78
Team 2: 67, 72, 76, 76, 84
Range = Max – Min
Team 1: Range = 78-72 = 6
Team 2: _______________
Sample Variance:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Height $x$    Deviation from Mean $x - \bar{x}$    Squared Deviation from Mean $(x - \bar{x})^2$
72            72 – 75 = -3                         9
73            73 – 75 = -2                         4
76            76 – 75 = 1                          1
76            76 – 75 = 1                          1
78            78 – 75 = 3                          9
Sum of squared deviations = 24
Sample variance for Team 1: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{24}{5 - 1} = \frac{24}{4} = 6$
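The same computation can be written out step by step in code. The sketch below follows the columns of the deviation table and then cross-checks the result against the standard library; the function name is an assumption.

```python
from statistics import variance

team1 = [72, 73, 76, 76, 78]   # Team 1 heights

def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    x_bar = sum(data) / len(data)
    squared_deviations = [(x - x_bar) ** 2 for x in data]
    return sum(squared_deviations) / (len(data) - 1)

print(sample_variance(team1))   # step-by-step version of the table above
print(variance(team1))          # standard-library cross-check
```

Dividing by n - 1 rather than n is what distinguishes the sample variance from the population variance.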
Population Variance
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Population Standard Deviation
$\sigma = \sqrt{\sigma^2}$
Exercise: Susan scored 60, 75, 80, and 65 on four statistics tests. Find the population
variance and the standard deviation of her test scores.
Suppose she receives 5 points of extra credit on each test. Find the variance and the standard
deviation of her test scores now.
Note: If every value in the dataset is increased or decreased by the same value, the
standard deviation remains unchanged. (Think about the definition of standard
deviation)
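The note can be illustrated numerically. The sketch below deliberately uses the Team 1 heights from earlier in these notes, treated as a population, rather than the exercise data, and adds 5 to every value.

```python
from statistics import pstdev, pvariance

team1 = [72, 73, 76, 76, 78]        # Team 1 heights used earlier in these notes
shifted = [x + 5 for x in team1]    # add the same constant to every value

print(pvariance(team1), pvariance(shifted))
print(pstdev(team1), pstdev(shifted))
```

Shifting every value also shifts the mean by the same amount, so each deviation $x - \mu$ is unchanged, and so are the variance and standard deviation.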
CH 3.3: Chebyshev’s Rule and Empirical Rule
The Empirical Rule
Example: The number of cars that pass a given intersection in a day is known to be
roughly bell shaped with a mean = 375 cars and sd = 25 cars. Interpret the empirical
rule for the number of cars passing the intersection on a given day.
1 sd interval: 350 – 400: 68% of data, 16% above 400, 16% below 350
2 sd interval: 325 – 425: 95% of the data, 2.5% above 425, 2.5% below 325
3 sd interval: 300 – 450: 99.7% of the data, 0.15% below 300, 0.15% above 450
What would you think if it was reported that 550 cars passed the intersection?
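The three intervals follow one pattern: mean plus or minus k standard deviations. The sketch below generates them for any bell-shaped data set; the function name is an assumption, and the percentages are the approximate ones stated by the empirical rule.

```python
def empirical_rule_intervals(mean, sd):
    """Return {k: (low, high, approx. percent inside)} for k = 1, 2, 3."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}   # approximate, bell-shaped data only
    return {k: (mean - k * sd, mean + k * sd, pct) for k, pct in coverage.items()}

for k, (low, high, pct) in empirical_rule_intervals(375, 25).items():
    tail = round((100 - pct) / 2, 2)   # leftover percent split evenly between the two tails
    print(f"{k} sd: {low} to {high}, about {pct}% inside, {tail}% in each tail")
```

A count far outside the 3 sd interval would be very unusual if the stated mean and standard deviation are right.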
Example:
If the average age of retirement for the entire population in a country is 64 years and the
distribution is bell shaped with a standard deviation of 3.5 years:
What is the approximate age range in which 95% of people retire?
CH 3.4: The Five Number Summary; Boxplots
Quartiles:
Quartiles divide data sets into fourths, or four equal parts.
The 1st quartile, denoted Q1, divides the bottom 25% of the data from the top 75%.
Therefore, the 1st quartile is equivalent to the 25th percentile.
The 2nd quartile divides the bottom 50% of the data from the top 50% of the data,
so that the 2nd quartile is equivalent to the 50th percentile, which is equivalent to
the median.
The 3rd quartile divides the bottom 75% of the data from the top 25% of the data,
so that the 3rd quartile is equivalent to the 75th percentile.
20, 24, 27, 28, 29, 30, 32
n= 7,
(n+1)/2 = (7+1)/2 = 4
Q1=28
The interquartile range (IQR) is the difference between the third and first quartiles
IQR = Q3 – Q1
The range of the middle 50% of the speed of cars traveling through the construction
zone is 10 miles per hour.
The Five Number Summary of a data set consists of the minimum, the quartiles, and the
maximum, in increasing order:
Min, Q1, Q2, Q3, Max = 20, 28, 32.5, 38, 40
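Quartile conventions differ between textbooks and software, so the sketch below does not rely on a library quartile routine. It assumes the method these notes appear to use (Q1 and Q3 are the medians of the lower and upper halves of the sorted data) and applies it to the 14 construction-zone speeds.

```python
from statistics import median

# The 14 construction-zone speeds (mph) used in this section of the notes.
speeds = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]

def five_number_summary(data):
    """Min, Q1, Q2, Q3, Max, with Q1 and Q3 as medians of the lower and upper halves."""
    xs = sorted(data)
    half = len(xs) // 2
    lower = xs[:half]      # values below the median position
    upper = xs[-half:]     # values above it (the middle value is excluded when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

summary = five_number_summary(speeds)
print(summary)
print("IQR =", summary[3] - summary[1])
```

Other software may report slightly different quartiles for the same data because it interpolates between observations.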
1. Determine the 5 number summary
2. Draw a horizontal axis on which the numbers obtained in Step 1 can be located.
Above the axis mark the quartiles and the Min and Max with vertical lines.
3. Connect the quartiles to make a box and then connect the box to the Min and Max
with a line.
Suppose a 15th car travels through the construction zone at 100 miles per hour. How
does this value impact the mean, median, standard deviation, range and interquartile
range?
Original Data: 20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40
Data with outlier: 20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40, 100
[Figure: Histogram of speed - original data; vertical axis Frequency, horizontal axis speed, 20 to 40]
[Figure: Histogram of speed - with 100 mph data point; vertical axis Frequency, horizontal axis speed, 20 to 100]
[Figure: Histograms of speed - original data and speed - with 100 mph data point, both on a horizontal axis from 20 to 100]
                      Without 15th car   With 15th car
Mean                  32.1 mph           36.7 mph
Median                32.5 mph           33 mph
Standard deviation    6.2 mph            18.5 mph
IQR                   10 mph             11 mph
Range                 20 mph             80 mph
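Most rows of this table can be reproduced with the standard library. The sketch below computes the mean, median, sample standard deviation, and range with and without the 15th car; the IQR is left out because it depends on the quartile convention. The function name is an assumption.

```python
from statistics import mean, median, stdev

original = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40]
with_outlier = original + [100]   # the 15th car

def summarize(data):
    """Rounded summary measures; stdev is the sample standard deviation."""
    return {
        "mean": round(mean(data), 1),
        "median": median(data),
        "sd": round(stdev(data), 1),
        "range": max(data) - min(data),
    }

print(summarize(original))
print(summarize(with_outlier))
```

The mean, standard deviation, and range move a great deal, while the median barely changes: the median is resistant to outliers.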
Q1:
Q2:
Q3:
IQR:
Use your eyes to identify outliers, not the upper and lower fence method.
[Figure: Histogram of TV TIMES; vertical axis Frequency, 0 to 9; horizontal axis TV TIMES, 10 to 70]
Observations that are below the lower limit or above the upper limit are potential
outliers.
Use the lower and upper limits to find potential outliers in the data above (TV watching data).
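The formulas for the limits are not reproduced in these notes. The sketch below assumes the common convention, lower limit = Q1 - 1.5 x IQR and upper limit = Q3 + 1.5 x IQR, with quartiles taken as medians of the lower and upper halves. Since the TV watching data are not listed here, it uses the speed data with the 100 mph car instead.

```python
from statistics import median

def outlier_limits(data, k=1.5):
    """Assumed convention: (Q1 - k*IQR, Q3 + k*IQR), quartiles as medians of halves."""
    xs = sorted(data)
    half = len(xs) // 2
    q1, q3 = median(xs[:half]), median(xs[-half:])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

speeds_with_outlier = [20, 24, 27, 28, 29, 30, 32, 33, 34, 36, 38, 39, 40, 40, 100]
low, high = outlier_limits(speeds_with_outlier)
flagged = [x for x in speeds_with_outlier if x < low or x > high]
print(low, high, flagged)
```

Values flagged this way are only potential outliers; whether they are errors or genuine extreme observations still has to be judged from context.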
Distribution shape and boxplot
        Average Temperature   Average Temperature
        Worcester             San Francisco
Jan     24                    48
Feb     25                    52
Mar     34                    54
Apr     44                    56
May     56                    58
Jun     64                    62
Jul     70                    64
Aug     68                    64
Sep     60                    65
Oct     50                    61
Nov     38                    55
Dec     27                    48
[Figure: Boxplot of Average Temperature Worcester, Average Temperature SFO; vertical axis Data, 20 to 70]
Compare the distribution of the average temperature in Worcester to the distribution of the
average temperature in SFO
1.
2.
3.
CH 3.5 Descriptive Measures for Populations
$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Z-scores and Percentiles
Measures of Position: Precise ways to describe the relative position of a data value
within the entire set of data:
The Z-score is the number of standard deviations a data value is away from the mean.
The population z-score is calculated using the population mean and population
standard deviation
$z = \frac{x - \mu}{\sigma}$
The sample z-score is calculated using the sample mean and sample standard deviation
$z = \frac{x - \bar{x}}{s}$
The z-score is unitless and has a mean of zero and a standard deviation of 1.
According to the Empirical rule, almost all Z scores are between -3 and 3.
Example:
If the mean of a population is 20, the standard deviation of the population is 6, and
the data value is 26, what is the z-score and what does it mean?
$z = \frac{x - \mu}{\sigma} = \frac{26 - 20}{6} = 1$
It means the data point is 1 standard deviation higher than the mean.
What if the data value was 14? Find the z-score. How many standard deviations
above or below the mean is the data point?
What if the data value was 20? Find the z-score. How many standard deviations
above or below the mean is the data point?
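The z-score formula translates directly into a small function. The sketch below uses hypothetical parameter names and reproduces only the worked case (data value 26); the two follow-up questions can be checked the same way.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd

print(z_score(26, 20, 6))   # the worked example above
```

A positive result means the data value is above the mean, a negative result means it is below, and zero means it equals the mean.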
Example: The mean height of males 20 years or older is 69.1 inches with a standard
deviation of 2.8 inches. The mean height of females 20 years or older is 63.7 inches
with a standard deviation of 2.7 inches.
The z-score will tell us how many standard deviations 132 is from the population mean,
100.
$z = \frac{x - \mu}{\sigma} = \frac{132 - 100}{16} = 2$
From the Empirical Rule, we know that about 95% of the observations are within 2 standard
deviations of the mean. Since the total percentage of observations is 100%, the remaining 5%
lies outside that interval, split evenly between the two tails (2.5% each).
The probability that a randomly selected person has an IQ greater than 132 is about 2.5%.