Data and Data Preparation

Chapter 1
Data and Data Preparation

The Relevance of Statistics – Example 1
• Headline of newspaper states “What global warming?” after record
amounts of snow in the winter of 2010.
• Problem with conclusion: It is incorrect to draw a conclusion based on one
year’s worth of data.
LO 1.1
The Relevance of Statistics – Example 2
• The CFO of Starbucks Corp. claims that business is picking up since
sales at stores open at least a year climbed 4% in the quarter ended
December 27, 2009.
• Problem with conclusion: The CFO overstated the company’s financial
position by failing to mention that Starbucks closed more than 800 stores
over the past few years.
LO 1.1
The Relevance of Statistics – Example 3
• Researchers found that infants who sleep with a nightlight are much
more likely to develop myopia later in life.
• Problem with conclusion: This is an example of the correlation-to-causation
fallacy. Even if two variables are highly correlated, one does not
necessarily cause the other.
LO 1.1
Statistics
• Statistics is the methodology of extracting useful information from a
data set.
Keys to good statistical analysis
• Find the right data.
• Use the appropriate statistical tools.
• Clearly communicate the numerical information in written language.
Two Branches of Statistics
• Descriptive statistics: collecting, organizing, and presenting the data.
• Inferential statistics: drawing conclusions about a population based on
sample data from that population.
Population and Sample
• Population: consists of all items of interest.
• Sample: a subset of the population.
• A sample statistic is calculated from the sample data and is used to make
inferences about the unknown population parameter.
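As a minimal sketch of the idea, the Python snippet below computes one sample statistic, a sample proportion. The 50 responses are invented for illustration and the function name is hypothetical; the unknown population parameter would be the same proportion measured over every item of interest.

```python
# Hypothetical sample: True means the respondent reported a problem.
# These 50 responses are invented for illustration only.
responses = [True] * 1 + [False] * 49


def sample_proportion(sample):
    """Return the fraction of observations in the sample that are True."""
    return sum(sample) / len(sample)


# The sample statistic; it estimates the unknown population proportion.
p_hat = sample_proportion(responses)
print(f"Sample proportion: {p_hat:.2f}")
```

A different sample would generally give a different value, which is why the statistic is used to make inferences about the parameter rather than treated as the parameter itself.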
Discussion
• It came as a big surprise when Apple’s touch screen iPhone 4,
considered by many to be the best smartphone ever, was found to
have a problem (The New York Times, June 24, 2010). Users complained
of weak reception, and sometimes even dropped calls, when they
cradled the phone in their hands in a particular way. A quick survey at a
local store found that 2% of iPhone 4 users experienced this reception
problem.
• Describe the relevant population.
• Does 2% denote the population parameter or the sample statistic?
Discussion
• Why do we need sampling (instead of using the population)?
Types of Data (1)
• Cross-sectional data
• Data collected by recording a characteristic of many subjects at the
same point in time, or without regard to differences in time.
• Subjects might include individuals, households, firms, industries,
regions, and countries.
Types of Data (2)
• Time series data
• Data collected by recording a characteristic of a subject over
several time periods.
• Data can include hourly, daily, weekly, monthly, quarterly, or
annual observations.
Types of Data (3)
• Structured data
• Data that has a well-defined length and format.
• Numbers, dates, strings of words, and so on.
• Unstructured data
• Data that does not conform to a predefined row-column format.
• Reports, emails, multimedia, and so on.
• Big data
• Massive volume of data that is difficult to manage, process, and
analyze using traditional tools.
Types of Variables
• Qualitative – gender, race, political affiliation
• Quantitative – test scores, age, weight
• Discrete: countable number of distinct values, e.g. number of
cars
• Continuous: uncountable number of values within an interval,
e.g. time, stock return
Discussion
• Which of the following variables are qualitative and which
are quantitative? If the variable is quantitative, then specify
whether the variable is discrete or continuous.
• Colors of cars in a mall parking lot.
• Time it takes each student to complete a final exam.
• The number of patrons who frequent a restaurant.
Scales of Measurement
Qualitative Variables
- Nominal
- Ordinal
Quantitative Variables
- Interval
- Ratio
LO 1.4
• Nominal scale: data are simply categories used to group observations;
the categories cannot be ranked.
• Ordinal scale: data may be categorized and ranked with respect to
some characteristic or trait.
Example
Tweens Survey
• What is the scale of measurement of the radio station data?
• Solution: These are nominal data.

Example
Tweens Survey
• How are the data based on the ratings of the food quality similar to or
different from the radio station data?
• Solution: These are ordinal data, since they can be both categorized and ranked.
• The Interval Scale
• Data may be categorized and ranked with respect to some
characteristic or trait.
• Differences between interval values are meaningful. Thus the
arithmetic operations of addition and subtraction are meaningful.
LO 1.4
• The Ratio Scale
• The strongest level of measurement.
• Ratio data may be categorized and ranked with respect to
some characteristic or trait.
• Differences between ratio values are meaningful, and a true zero
point makes ratios meaningful as well.
• Business Examples: Sales, Profits, and Inventory Levels
LO 1.4
Example
Tweens Survey
• How are the time data classified? In what ways do the time data differ
from ordinal data? What is a potential weakness of this measurement
scale?
• Solution: Clock time responses are on an interval scale. With this type of
data we can calculate meaningful differences; however, there is no
apparent zero point, so ratios are not meaningful.
LO 1.4
Example
Tweens Survey
• What is the measurement scale of the money data? Why is it
considered the most sophisticated form of data?
• Solution: Since the tweens’ responses are in dollar amounts, these
are ratio-scaled data; ratio-scaled data have a natural zero point,
which allows the calculation of ratios.
LO 1.4
Chapter 2
Tabular and Graphical
Methods
Summarizing Qualitative Data
• A frequency distribution for qualitative data groups data into
categories and records how many observations fall into each
category.
• Example: weather conditions in Seattle, WA, during February 2010.
LO 2.1
• Categories: Cloudy, Rainy, or Sunny.
• Calculate relative frequency by dividing each
category’s frequency by the sample size.
Weather   Frequency   Relative Frequency
Cloudy        1       1/28 = 0.036
Rainy        20       20/28 = 0.714
Sunny         7       7/28 = 0.250
Total        28       28/28 = 1.000
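The relative frequencies in the table can be reproduced with a short Python sketch. The counts come from the table; the variable names are illustrative only.

```python
# Frequencies from the Seattle weather table (February 2010).
frequencies = {"Cloudy": 1, "Rainy": 20, "Sunny": 7}

# The sample size is the sum of all category frequencies.
n = sum(frequencies.values())

# Relative frequency = category frequency / sample size.
relative = {category: count / n for category, count in frequencies.items()}

for category, count in frequencies.items():
    print(f"{category:<8}{count:>4}{relative[category]:>8.3f}")
print(f"{'Total':<8}{n:>4}{sum(relative.values()):>8.3f}")
```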
LO 2.1
• A pie chart is a segmented circle whose segments
portray the relative frequencies of the categories of a
qualitative variable.
• In this example, the circle is divided into sectors proportional to the
relative frequencies of the categories of the variable Marital Status.
Source: Pew Research Center analysis of Decennial Census (1960 -
2000) and American Community Survey data (2008, 2010)
LO 2.2
• A bar chart depicts the frequency or the relative
frequency for each category of the qualitative data as
a series of horizontal or vertical bars which are
proportional to the values that are to be depicted.
• For example, the 2010 data may emphasize the decline or rise in the
proportions compared with 1960.
LO 2.2
Summarizing Quantitative Data
• A frequency distribution for quantitative data groups data into
intervals called classes, and records the number of observations that
fall into each class.
• Guidelines when constructing a frequency distribution:
• Classes are mutually exclusive.
• Classes are exhaustive.
LO 2.3
• The number of classes usually ranges from 5 to 20. This is a guideline,
not an absolute rule.
• Approximating the class width:
Class width ≈ (Largest value − Smallest value) / Number of classes
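The class-width approximation can be sketched in Python as follows. The prices are hypothetical values in $1,000s, invented for illustration, and the function name is an assumption. Rounding the result to a convenient width is common practice, but that step is assumed here rather than stated on the slide.

```python
def approximate_class_width(values, number_of_classes):
    """Approximate class width: (largest - smallest) / number of classes."""
    return (max(values) - min(values)) / number_of_classes


# Hypothetical house prices in $1,000s (not the data behind the slides).
prices = [330, 412, 455.5, 512, 599.9, 640, 735]

width = approximate_class_width(prices, number_of_classes=5)
print(f"Approximate class width: {width:.1f}")  # (735 - 330) / 5
```

With these invented values the approximation is 81, so an analyst might settle on a rounder width such as 100.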
LO 2.3
Example
Class (in $1,000s)   Frequency
300 up to 400            4
400 up to 500           11
500 up to 600           14
600 up to 700            5
700 up to 800            2
Total                   36
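A frequency distribution like this one can be tallied with a short Python sketch. It assumes that "300 up to 400" means 300 ≤ price < 400, which keeps the classes mutually exclusive. The seven prices are hypothetical, since the 36 underlying house prices are not listed here, and the function name is illustrative.

```python
# Class limits in $1,000s: 300 up to 400, 400 up to 500, ..., 700 up to 800.
limits = [300, 400, 500, 600, 700, 800]

# Hypothetical prices in $1,000s, invented for illustration.
prices = [330, 400, 455.5, 512, 599.9, 600, 735]


def tally(values, limits):
    """Count how many values fall in each class [lower, upper)."""
    counts = [0] * (len(limits) - 1)
    for value in values:
        for i in range(len(limits) - 1):
            if limits[i] <= value < limits[i + 1]:
                counts[i] += 1
                break
        else:
            # Reached only if no class contains the value (not exhaustive).
            raise ValueError(f"{value} falls outside every class")
    return counts


counts = tally(prices, limits)
for i, count in enumerate(counts):
    print(f"{limits[i]} up to {limits[i + 1]}: {count}")
```

Because each value lands in exactly one class, the counts always sum to the number of observations.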
LO 2.3
• Question: What is the price range over this time period?
- Answer: $300,000 up to $800,000
• Question: How many of the houses sold in the $500,000 up to $600,000 range?
- Answer: 14 houses
LO 2.3
• A cumulative frequency distribution specifies how
many observations fall below the upper limit of a
particular class.
Class (in $1,000s)   Frequency   Cumulative Frequency
300 up to 400            4       4
400 up to 500           11       4 + 11 = 15
500 up to 600           14       4 + 11 + 14 = 29
600 up to 700            5       4 + 11 + 14 + 5 = 34
700 up to 800            2       4 + 11 + 14 + 5 + 2 = 36
Total                   36
• Question: How many houses sold for less than $600,000?
- Answer: 29 houses
LO 2.3
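The running totals in the cumulative frequency column can be sketched in Python with itertools.accumulate from the standard library. The frequencies are those of the house-price table; the variable names are illustrative.

```python
from itertools import accumulate

classes = ["300 up to 400", "400 up to 500", "500 up to 600",
           "600 up to 700", "700 up to 800"]
frequencies = [4, 11, 14, 5, 2]

# Each cumulative frequency is the sum of this class and all lower classes.
cumulative = list(accumulate(frequencies))

for label, freq, cum in zip(classes, frequencies, cumulative):
    print(f"{label:<15}{freq:>4}{cum:>5}")

# Houses that sold for less than $600,000: cumulative count through
# the "500 up to 600" class.
below_600 = cumulative[classes.index("500 up to 600")]
print("Sold for less than $600,000:", below_600)
```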
• A relative frequency distribution identifies the proportion or fraction
of values that fall into each class.
Class relative frequency = Class frequency / Total number of observations
• A cumulative relative frequency distribution gives the proportion or
fraction of values that fall below the upper limit of each class.
LO 2.3
• Here are the relative frequency and the cumulative relative
frequency distributions for the house-price data.
Class (in $1,000s)   Frequency   Relative Frequency   Cumulative Relative Frequency
300 up to 400            4       4/36 = 0.11          0.11
400 up to 500           11       11/36 = 0.31         0.11 + 0.31 = 0.42
500 up to 600           14       14/36 = 0.39         0.11 + 0.31 + 0.39 = 0.81
600 up to 700            5       5/36 = 0.14          0.11 + 0.31 + 0.39 + 0.14 = 0.95
700 up to 800            2       2/36 = 0.06          0.11 + 0.31 + 0.39 + 0.14 + 0.06 ≈ 1.0
Total                   36       1.0
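A minimal Python sketch of both columns follows, using the frequencies from the house-price table. One design choice is an assumption of this sketch, not the slide's method: the cumulative values are computed from the cumulative counts rather than by adding rounded relative frequencies.

```python
from itertools import accumulate

frequencies = [4, 11, 14, 5, 2]
n = sum(frequencies)  # total number of observations

# Class relative frequency = class frequency / total number of observations.
relative = [f / n for f in frequencies]

# Cumulative relative frequency = cumulative frequency / total observations.
cumulative_relative = [c / n for c in accumulate(frequencies)]

for rel, cum in zip(relative, cumulative_relative):
    print(f"{rel:.2f}  {cum:.2f}")
```

Working from the counts avoids rounding drift. The fourth cumulative value is 34/36, which rounds to 0.94, while adding the already-rounded relative frequencies gives 0.95. For the same reason the rounded column sums to 1.01 and is reported as approximately 1.0.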
LO 2.3
Use the data on the previous slide to answer the
following two questions.
• Question: What percent of the houses sold for at least $500,000 but
less than $600,000?
- Answer: 39%
• Question: What percent of the houses sold for less than $600,000?
- Answer: 81%
LO 2.3
• Histogram
• Polygon
• Ogive
LO 2.4
• A histogram is a visual representation of a frequency or a relative
frequency distribution.
- Bar height represents the respective class frequency (or relative frequency).
- Bar width represents the class width.
LO 2.4
• Here are the frequency and relative frequency
histograms for the house-price data.
• Note that the only difference is the y-axis scale.
LO 2.4
• Shape of distribution: typically symmetric or skewed
- Symmetric: mirror image on both sides of its center.
Symmetric Distribution
LO 2.4
• Skewed distribution
- Positively skewed: data form a long, narrow tail to the right.
- Negatively skewed: data form a long, narrow tail to the left.
LO 2.4
• A polygon is a visual representation of a frequency or a relative
frequency distribution.
- Plot the class midpoints on the x-axis and the associated frequency
(or relative frequency) on the y-axis.
- Neighboring points are connected with a straight line.
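As a sketch of the first step, the Python snippet below computes the class midpoints for the house-price classes and pairs each with its frequency. These pairs are the points a polygon would connect; the variable names are illustrative, and no plotting library is assumed.

```python
# Class limits in $1,000s and class frequencies from the house-price data.
limits = [300, 400, 500, 600, 700, 800]
frequencies = [4, 11, 14, 5, 2]

# The midpoint of a class is the average of its lower and upper limits.
midpoints = [(lower + upper) / 2 for lower, upper in zip(limits, limits[1:])]

# (midpoint, frequency) pairs: the points to plot and join with straight lines.
points = list(zip(midpoints, frequencies))
for x, y in points:
    print(f"midpoint {x:.0f}: frequency {y}")
```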
LO 2.4
• Here is a polygon for the house-price data.
LO 2.4
• An ogive is a visual representation of a cumulative frequency or a
cumulative relative frequency distribution.
- Plot the cumulative frequency (or cumulative relative frequency) of each
class above the upper limit of the corresponding class.
- The neighboring points are then connected.
LO 2.4
• Here is an ogive for the house-price data.
• Use the ogive to approximate the percentage of houses that sold for
less than $550,000.
- Answer: approximately 60%
LO 2.4
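Reading a value off the ogive amounts to linear interpolation between neighboring points. The Python sketch below makes that explicit for the house-price data. It assumes the ogive starts at 0 at the lowest class limit (300), and the function name is hypothetical.

```python
from itertools import accumulate

limits = [300, 400, 500, 600, 700, 800]  # class limits in $1,000s
frequencies = [4, 11, 14, 5, 2]
n = sum(frequencies)

# Ogive points: cumulative relative frequency plotted above each upper limit.
# Assumption: the curve starts at 0 above the lowest class limit.
xs = limits
ys = [0.0] + [c / n for c in accumulate(frequencies)]


def ogive_value(x, xs, ys):
    """Linearly interpolate the ogive at x between neighboring points."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            fraction = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + fraction * (ys[i + 1] - ys[i])
    raise ValueError("x lies outside the range of the class limits")


share = ogive_value(550, xs, ys)
print(f"Share below $550,000: {share:.1%}")
```

The interpolated value is 22/36, about 61%, which agrees with the approximate answer of 60% read from the graph.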
Stem-and-Leaf Diagrams (1)
• A stem-and-leaf diagram provides a visual display of quantitative data.
• It gives an overall picture of the data’s center and variability.
• Each value of the data set is separated into two parts: the stem
consists of the leftmost digits, while the leaf is the last digit.
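The stem and leaf split can be sketched in Python with divmod, which separates the last digit of a whole number from the digits to its left. The ages below are hypothetical values invented for illustration, and the function name is an assumption.

```python
from collections import defaultdict

# Hypothetical ages, invented for illustration.
ages = [36, 41, 45, 52, 52, 58, 63, 67, 70]


def stem_and_leaf(values):
    """Map each stem (leftmost digits) to its sorted leaves (last digits)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 36 -> stem 3, leaf 6
        stems[stem].append(leaf)
    return dict(stems)


diagram = stem_and_leaf(ages)
for stem in sorted(diagram):
    leaves = " ".join(str(leaf) for leaf in diagram[stem])
    print(f"{stem} | {leaves}")
```

Rows with many leaves show where the data are concentrated, and the spread of stems gives a sense of variability.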
LO 2.5
Stem-and-Leaf Diagrams (2)
• The following data set shows the wealthiest
people in the world and their associated ages.
• The leftmost digit is the stem and the last digit is the leaf. For
example, an age of 36 has stem 3 and leaf 6.
LO 2.5
Discussion:
A police officer is concerned with excessive speeds on a portion of
Interstate 90 with a posted speed limit of 65 miles per hour. Using his radar
gun, he records the following speeds for 25 cars and trucks:
Construct a stem-and-leaf diagram. Are the officer’s concerns warranted?

Scatterplots (1)
• A scatterplot is used to determine if two variables are related.
- Each point is a pairing (xi, yi): (x1, y1), (x2, y2), etc.
- This scatterplot shows income against education.
LO 2.6
Scatterplots (2)
• Linear relationship: upward- or downward-sloping trend of the data.
- Positive linear relationship: as x increases, so does y.
- Negative linear relationship (shown here): as x increases, y decreases.
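The direction of a linear relationship is read from the plot in these slides. As a numeric companion, the Python sketch below uses the sample correlation coefficient, a standard measure that is an addition of this sketch and not part of the slide. Its sign matches the direction of the trend. The data are hypothetical.

```python
import math


def pearson_r(x, y):
    """Sample correlation coefficient between paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical pairings (xi, yi), invented for illustration.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 7, 8, 10]    # y tends to rise as x increases
y_down = [10, 8, 7, 4, 2]  # y tends to fall as x increases

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print("positive relationship" if r_up > 0 else "negative relationship")
print("positive relationship" if r_down > 0 else "negative relationship")
```

A coefficient near zero does not rule out a nonlinear relationship, so the scatterplot itself remains the first check.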
LO 2.6
Scatterplots (3)
• Nonlinear relationship
- As x increases, y increases at an increasing (or decreasing) rate.
- As x increases, y decreases at an increasing (or decreasing) rate.
LO 2.6
Scatterplots (4)
• No relationship: data are randomly scattered with no discernible pattern.
- In this scatterplot, there is no apparent relationship between x and y.
LO 2.6
