
Chapter 1

Data and Data Preparation


The Relevance of Statistics – Example 1
• Headline of newspaper states “What global warming?” after record
amounts of snow in the winter of 2010.

• Problem with Conclusion: It is incorrect to draw a conclusion based on
one year’s worth of data.

LO 1.1
The Relevance of Statistics – Example 2
• The CFO of Starbucks Corp. claims that business is picking up since
sales at stores open at least a year climbed 4% in the quarter ended
December 27, 2009.

• Problem with Conclusion: The CFO overstated the company’s financial
position by failing to mention that Starbucks closed more than 800
stores over the past few years.

LO 1.1
The Relevance of Statistics – Example 3
• Researchers found that infants who sleep with a nightlight are much
more likely to develop myopia later in life.

• Problem with Conclusion: This is an example of the
correlation-to-causation fallacy. Even if two variables are highly
correlated, one does not necessarily cause the other.

LO 1.1
Statistics
• Statistics is the methodology of extracting useful information from a
data set.
Keys to good statistical analysis
• Find the right data.
• Use the appropriate statistical tools.
• Clearly communicate the numerical information in written language.
Two Branches of Statistics

• Descriptive statistics: collecting, organizing, and presenting the data.
• Inferential statistics: drawing conclusions about a population based on
sample data from that population.
Population and Sample
• Population: consists of all items of interest.
• Sample: a subset of the population.
• A sample statistic is calculated from the sample data and is used to
make inferences about the unknown population parameter.
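
The distinction between a population parameter and a sample statistic can be illustrated with a small sketch. The population below is made up purely for illustration (random ages, with a fixed seed so the run is repeatable); it is not data from these slides.

```python
import random
from statistics import mean

# Hypothetical population: 10,000 made-up ages (illustration only).
random.seed(1)
population = [random.randint(18, 80) for _ in range(10_000)]

# Population parameter: computed from every item of interest.
population_mean = mean(population)

# Sample statistic: computed from a subset of the population and used
# to make an inference about the (normally unknown) parameter.
sample = random.sample(population, 200)
sample_mean = mean(sample)

print(f"Population parameter (mean age): {population_mean:.2f}")
print(f"Sample statistic (mean age):     {sample_mean:.2f}")
```

In practice the population mean is usually not available, which is why the sample mean is used as an estimate of it.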
Discussion
• It came as a big surprise when Apple’s touch screen iPhone 4,
considered by many to be the best smartphone ever, was found to
have a problem (The New York Times, June 24, 2010). Users complained
of weak reception, and sometimes even dropped calls, when they
cradled the phone in their hands in a particular way. A quick survey at a
local store found that 2% of iPhone 4 users experienced this reception
problem.
• Describe the relevant population.
• Does 2% denote the population parameter or the sample statistic?
Discussion

• Why do we need sampling (instead of using the population)?
Types of Data (1)
• Cross-sectional data
• Data collected by recording a characteristic of many subjects at the
same point in time, or without regard to differences in time.
• Subjects might include individuals, households, firms, industries,
regions, and countries.
Types of Data (2)
• Time series data
• Data collected by recording a characteristic of a subject over
several time periods.
• Data can include hourly, daily, weekly, monthly, quarterly, or
annual observations.
Types of Data (3)
• Structured data
• Data that has a well-defined length and format.
• Numbers, dates, strings of words, and so on.
• Unstructured data
• Data that does not conform to a predefined row-column format.
• Reports, emails, multimedia, and so on.
• Big data
• Massive volume of data that is difficult to manage, process, and
analyze using traditional tools.
Types of Variables
• Qualitative – gender, race, political affiliation
• Quantitative – test scores, age, weight
• Discrete: countable number of distinct values, e.g. number of
cars
• Continuous: uncountable number of values within an interval,
e.g. time, stock return
Discussion
• Which of the following variables are qualitative and which
are quantitative? If the variable is quantitative, then specify
whether the variable is discrete or continuous.
• Colors of cars in a mall parking lot.
• Time it takes each student to complete a final exam.
• The number of patrons who frequent a restaurant.
Scales of Measurement

• Qualitative Variables
- Nominal
- Ordinal
• Quantitative Variables
- Interval
- Ratio

LO 1.4
Scales of Measurement
• Nominal scale: data are simply categories for grouping the
data
• Ordinal scale: may be categorized and ranked with respect to
some characteristic or trait.
Example
Tweens Survey
• What is the scale of measurement of the radio station data?

Solution: These are nominal data


Example
Tweens Survey
• How are the data based on the ratings of the food quality similar to or
different from the radio station data?

Solution: These are ordinal since they can be both categorized and ranked.
Scales of Measurement
• The Interval Scale
• Data may be categorized and ranked with respect to some
characteristic or trait.
• Differences between interval values are meaningful. Thus the
arithmetic operations of addition and subtraction are meaningful.

LO 1.4
Scales of Measurement
• The Ratio Scale
• The strongest level of measurement.
• Ratio data may be categorized and ranked with respect to
some characteristic or trait.
• Differences between ratio values are meaningful, and a true zero
point exists, so ratios of values are also meaningful.
• Business Examples: Sales, Profits, and Inventory Levels

LO 1.4
Example
Tweens Survey
• How are the time data classified? In what ways do the time data differ
from ordinal data? What is a potential weakness of this measurement
scale?

• Solution: Clock time responses are on an interval scale. With this type of
data we can calculate meaningful differences; however, there is no
apparent zero point.
LO 1.4
Example
Tweens Survey
• What is the measurement scale of the money data? Why is it
considered the most sophisticated form of data?

• Solution: Since the tweens’ responses are in dollar amounts, this
is ratio-scaled data; ratio-scaled data has a natural zero point,
which allows the calculation of ratios.
LO 1.4
Chapter 2
Tabular and Graphical Methods
Summarizing Qualitative Data
• A frequency distribution for qualitative data groups data into
categories and records how many observations fall into each
category.
• Weather conditions in Seattle, WA during February 2010.

LO 2.1
Summarizing Qualitative Data
• Categories: Cloudy, Rainy, or Sunny.
• Calculate relative frequency by dividing each
category’s frequency by the sample size.

Weather   Frequency   Relative Frequency

Cloudy        1       1/28 = 0.036
Rainy        20       20/28 = 0.714
Sunny         7       7/28 = 0.250
Total        28       28/28 = 1.000
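
The relative-frequency calculation can be sketched in a few lines. The list of daily observations is rebuilt from the counts in the table (1 cloudy, 20 rainy, 7 sunny); the variable names are illustrative only.

```python
from collections import Counter

# Daily weather for the 28 days of February 2010, rebuilt from the
# frequencies in the table (the day-by-day order is not known).
observations = ["Cloudy"] * 1 + ["Rainy"] * 20 + ["Sunny"] * 7

frequency = Counter(observations)
n = len(observations)  # sample size

# Relative frequency = category frequency / sample size.
relative_frequency = {category: count / n for category, count in frequency.items()}

for category in sorted(frequency):
    print(f"{category:<8}{frequency[category]:>4}{relative_frequency[category]:>8.3f}")
```

The relative frequencies always sum to 1, which is a quick check that no observation was dropped or counted twice.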

LO 2.1
Summarizing Qualitative Data
• A pie chart is a segmented circle whose segments
portray the relative frequencies of the categories of a
qualitative variable.

• In this example, the circle is divided into sectors proportional to
the relative frequencies of the categories of the variable Marital
Status.
Source: Pew Research Center analysis of Decennial Census (1960 -
2000) and American Community Survey data (2008, 2010)

LO 2.2
Summarizing Qualitative Data
• A bar chart depicts the frequency or the relative
frequency for each category of the qualitative data as
a series of horizontal or vertical bars which are
proportional to the values that are to be depicted.

• For example, 2010’s data may emphasize the decline or rise in the
proportions compared to 1960’s.

LO 2.2
Summarizing Quantitative Data

• A frequency distribution for quantitative data groups data into
intervals called classes, and records the number of observations
that fall into each class.
• Guidelines when constructing frequency distribution:
• Classes are mutually exclusive.
• Classes are exhaustive.
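
One way to see what "mutually exclusive" and "exhaustive" mean in practice is a small counting sketch. It assumes classes of the form "lower up to upper", read here as including the lower limit and excluding the upper limit; the prices are hypothetical values made up for illustration.

```python
def frequency_distribution(values, lower, width, number_of_classes):
    """Count values in equal-width classes [lower, lower + width), and so on.

    Each value maps to exactly one class index (mutually exclusive), and a
    value outside every class raises an error (classes must be exhaustive).
    """
    counts = [0] * number_of_classes
    for value in values:
        index = int((value - lower) // width)
        if not 0 <= index < number_of_classes:
            raise ValueError(f"{value} falls outside every class")
        counts[index] += 1
    return counts


# Hypothetical house prices in $1,000s (not the data behind these slides).
prices = [330, 399, 400, 455, 512, 560, 599, 600, 735]
counts = frequency_distribution(prices, lower=300, width=100, number_of_classes=5)
print(counts)
```

Note that a price of exactly 400 is counted in "400 up to 500", not in "300 up to 400", so no observation is counted twice.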

LO 2.3
Summarizing Quantitative Data

• The number of classes usually ranges from 5 to 20. This is a
guideline, not an absolute rule.
• Approximating the class width:

$$\frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}$$
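
As a worked illustration of the class-width approximation, the sketch below uses assumed values: a smallest price of 330 and a largest price of 735 (in $1,000s) with five classes. These numbers are hypothetical and chosen only to show the arithmetic.

```python
def approximate_class_width(largest, smallest, number_of_classes):
    """Return (largest value - smallest value) / number of classes."""
    return (largest - smallest) / number_of_classes


# Hypothetical extremes in $1,000s, with 5 classes.
raw_width = approximate_class_width(largest=735, smallest=330, number_of_classes=5)
print(raw_width)
```

The raw result is usually rounded up to a convenient number, for example a width of 100 so that classes run 300 up to 400, 400 up to 500, and so on.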

LO 2.3
Example

Class (in $1,000s) Frequency

300 up to 400 4
400 up to 500 11
500 up to 600 14
600 up to 700 5
700 up to 800 2
Total = 36

LO 2.3
Summarizing Quantitative Data

Question: What is the price range over this time period?
Question: How many of the houses sold in the $500,000 up to $600,000
range?

Class (in $1,000s) Frequency
300 up to 400 4
400 up to 500 11
500 up to 600 14
600 up to 700 5
700 up to 800 2
Total = 36

LO 2.3
Summarizing Quantitative Data

Question: What is the price range over this time period?
Answer: $300,000 up to $800,000
Question: How many of the houses sold in the $500,000 up to $600,000
range?
Answer: 14 houses

Class (in $1,000s) Frequency
300 up to 400 4
400 up to 500 11
500 up to 600 14
600 up to 700 5
700 up to 800 2
Total = 36

LO 2.3
Summarizing Quantitative Data
• A cumulative frequency distribution specifies how
many observations fall below the upper limit of a
particular class.
Class (in $1,000s) Frequency Cumulative Frequency
300 up to 400 4 4
400 up to 500 11 4 + 11 = 15
500 up to 600 14 4 + 11 + 14 = 29
600 up to 700 5 4 + 11 + 14 + 5 = 34
700 up to 800 2 4 + 11 + 14 + 5 + 2 = 36
Total 36

• Question: How many houses sold for less than $600,000?
Answer: 29 houses
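
The running totals in the cumulative frequency column can be reproduced with a short sketch. The class labels and frequencies are taken from the house-price table; the variable names are illustrative.

```python
from itertools import accumulate

classes = ["300 up to 400", "400 up to 500", "500 up to 600",
           "600 up to 700", "700 up to 800"]
frequencies = [4, 11, 14, 5, 2]

# Each cumulative frequency is the sum of all frequencies up to and
# including the current class.
cumulative = list(accumulate(frequencies))

for label, freq, cum in zip(classes, frequencies, cumulative):
    print(f"{label:<15}{freq:>4}{cum:>6}")

# Houses sold for less than $600,000: cumulative frequency of "500 up to 600".
below_600 = cumulative[classes.index("500 up to 600")]
```

The last cumulative frequency always equals the total number of observations (36 here).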
LO 2.3
Summarizing Quantitative Data
• A relative frequency distribution identifies the
proportion or fraction of values that fall into each class.

$$\text{Class relative frequency} = \frac{\text{Class frequency}}{\text{Total number of observations}}$$

• A cumulative relative frequency distribution gives the proportion
or fraction of values that fall below the upper limit of each class.

LO 2.3
Summarizing Quantitative Data
• Here are the relative frequency and the cumulative relative
frequency distributions for the house-price data.

Class (in $1,000s)   Frequency   Relative Frequency   Cumulative Relative Frequency
300 up to 400            4       4/36 = 0.11          0.11
400 up to 500           11       11/36 = 0.31         0.11 + 0.31 = 0.42
500 up to 600           14       14/36 = 0.39         0.11 + 0.31 + 0.39 = 0.81
600 up to 700            5       5/36 = 0.14          0.11 + 0.31 + 0.39 + 0.14 = 0.95
700 up to 800            2       2/36 = 0.06          0.11 + 0.31 + 0.39 + 0.14 + 0.06 ≈ 1.0
Total                   36       1.0
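
The two relative-frequency columns can be computed from the class frequencies alone. The following is a minimal sketch using the house-price frequencies; it accumulates the raw counts first and divides by the total afterward, which is an implementation choice of this example rather than the method used to build the table.

```python
from itertools import accumulate

frequencies = [4, 11, 14, 5, 2]
total = sum(frequencies)

# Relative frequency = class frequency / total number of observations.
relative = [f / total for f in frequencies]

# Cumulative relative frequency = cumulative frequency / total.
cumulative_relative = [c / total for c in accumulate(frequencies)]

for rel, cum in zip(relative, cumulative_relative):
    print(f"{rel:.2f}  {cum:.2f}")
```

Because the counts are accumulated before dividing, a value can differ in the last digit from a table built by adding rounded relative frequencies: 34/36 rounds to 0.94, whereas 0.81 + 0.14 gives 0.95. The final value is exactly 1.0 with this approach.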

LO 2.3
Summarizing Quantitative Data
Use the data on the previous slide to answer the
following two questions.

• Question: What percent of the houses sold for at least $500,000
but not more than $600,000?
Answer: 39%

• Question: What percent of the houses sold for less than $600,000?
Answer: 81%
LO 2.3
Summarizing Quantitative Data

- Histogram
- Polygon
- Ogive
LO 2.4
Summarizing Quantitative Data
• A histogram is a visual representation of a
frequency or a relative frequency distribution.

- Bar height represents the respective class frequency (or relative
frequency).
- Bar width represents the class width.

LO 2.4
Summarizing Quantitative Data
• Here are the frequency and relative frequency
histograms for the house-price data.

• Note that the only difference is the y-axis scale.

LO 2.4
Summarizing Quantitative Data
• Shape of Distribution: typically symmetric or skewed
- Symmetric: mirror image on both sides of its center.

Symmetric Distribution

LO 2.4
Summarizing Quantitative Data
• Skewed distribution
- Positively skewed: data form a long, narrow tail to the right.
- Negatively skewed: data form a long, narrow tail to the left.

LO 2.4
Summarizing Quantitative Data
• A polygon is a visual representation of a frequency
or a relative frequency distribution.

- Plot the class midpoints on the x-axis and the associated frequency
(or relative frequency) on the y-axis.
- Neighboring points are connected with a straight line.
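
The points that a polygon connects can be listed with a short sketch. It uses the class limits and frequencies of the house-price data; the names `class_limits` and `polygon_points` are illustrative.

```python
# Class limits (in $1,000s) and frequencies for the house-price data.
class_limits = [(300, 400), (400, 500), (500, 600), (600, 700), (700, 800)]
frequencies = [4, 11, 14, 5, 2]

# The midpoint of a class is the average of its lower and upper limits.
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Each polygon point pairs a class midpoint (x) with its frequency (y).
polygon_points = list(zip(midpoints, frequencies))
print(polygon_points)
```

Passing these points to any line-plotting tool, in order, draws the polygon.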

LO 2.4
Summarizing Quantitative Data
• Here is a polygon for the house-price data.

LO 2.4
Summarizing Quantitative Data
• An ogive is a visual representation of a
cumulative frequency or a cumulative relative
frequency distribution.
- Plot the cumulative frequency (or cumulative relative frequency) of
each class above the upper limit of the corresponding class.
- The neighboring points are then connected.

LO 2.4
Summarizing Quantitative Data
• Here is an ogive for the house-price data.

• Use the ogive to approximate the percentage of houses that sold
for less than $550,000.
Answer: 60%
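
Reading a value off an ogive amounts to linear interpolation between two neighboring points. The sketch below assumes the ogive starts at the lower limit of the first class with a cumulative value of 0 and that points are joined by straight lines; the function name `ogive_value` is hypothetical.

```python
def ogive_value(x, lower_limit, upper_limits, cumulative_relative):
    """Linearly interpolate the ogive at x.

    Assumes the ogive starts at (lower_limit, 0) and passes through
    (upper limit, cumulative relative frequency) for each class.
    """
    xs = [lower_limit] + list(upper_limits)
    ys = [0.0] + list(cumulative_relative)
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the range covered by the classes")
    for i in range(1, len(xs)):
        if x <= xs[i]:
            x0, x1 = xs[i - 1], xs[i]
            y0, y1 = ys[i - 1], ys[i]
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# House-price data: upper class limits in $1,000s and exact cumulative shares.
upper_limits = [400, 500, 600, 700, 800]
cumulative_relative = [4 / 36, 15 / 36, 29 / 36, 34 / 36, 36 / 36]

share_below_550 = ogive_value(550, 300, upper_limits, cumulative_relative)
print(f"{share_below_550:.0%}")
```

With exact fractions the interpolated share is 22/36, about 61%, which agrees with the roughly 60% read from the chart.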
LO 2.4
Stem-and-Leaf Diagrams (1)

• A stem-and-leaf diagram provides a visual display of quantitative
data.
• It gives an overall picture of the data’s center and
variability.
• Each value of the data set is separated into two
parts: the stem consists of the leftmost digits,
while the leaf is the last digit.

LO 2.5
Stem-and-Leaf Diagrams (2)
• The following data set shows the wealthiest
people in the world and their associated ages.
• The leftmost digit is the stem while the last digit is the leaf.
For example, an age of 36 has stem 3 and leaf 6.
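
The stem/leaf split can be sketched for non-negative whole numbers as follows. The ages are hypothetical values made up for illustration (the original data set is not reproduced here), and `stem_and_leaf` is an illustrative helper name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (leftmost digits) to its sorted leaves (last digit).

    Works for non-negative integers; stems with no values are omitted.
    """
    diagram = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 36 -> stem 3, leaf 6
        diagram[stem].append(leaf)
    return dict(diagram)


# Hypothetical ages (illustration only).
ages = [36, 40, 41, 52, 54, 54, 63, 67, 71, 74, 79, 83]

for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Rows with many leaves show where the data cluster, which is how the diagram conveys center and variability.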

LO 2.5
Discussion:
A police officer is concerned with excessive speeds on a portion of
Interstate 90 with a posted speed limit of 65 miles per hour. Using his radar
gun, he records the following speeds for 25 cars and trucks:

Construct a stem-and-leaf diagram. Are the officer’s concerns warranted?


Scatterplots (1)

• A scatterplot is used to determine if two variables are related.
- Each point is a pairing (xi, yi): (x1, y1), (x2, y2), etc.
- This scatterplot shows income (y-axis) against education (x-axis).
LO 2.6
Scatterplots (2)
• Linear relationship: upward or downward-sloping
trend of the data.

- Positive linear relationship: as x increases, so does y.
- Negative linear relationship (shown here): as x increases, y
decreases.

LO 2.6
Scatterplots (3)
• Nonlinear relationship

- As x increases, y increases at an increasing (or decreasing) rate.
- As x increases, y decreases at an increasing (or decreasing) rate.

LO 2.6
Scatterplots (4)
• No relationship: data are randomly scattered with
no discernible pattern.

- In this scatterplot, there is no apparent relationship between x
and y.

LO 2.6
