Defining and
Collecting Data
Chapter Overview
1 Defining Variables
2 Collecting Data
3 Types of Sampling Methods
4 Data Cleaning
5 Other Data Preprocessing Tasks
6 Types of Survey Errors
1. Defining Variables
4. Data Cleaning
Data cleaning fixes defects and ensures the quality of your data.
Organizing Categorical Data
EXAMPLE 2.2
The sample of 479 retirement funds includes the variable Risk Level, which
has the defined categories low, average, and high. Construct a
summary table of the retirement funds, categorized by risk.
Risk level   Frequency   Percentage
Low          147         30.69%
Average      224         46.76%
High         108         22.55%
Total        479         100.00%
The percentages for each category are calculated by dividing the number of funds in
each category by the total sample size: 147/479 for low, 224/479 for average, and 108/479 for high.
Observe that almost half the funds have an average risk, about 30% have low risk, and
less than a quarter have high risk.
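The calculation behind the summary table can be sketched in a few lines of Python. This is a minimal sketch, not the textbook's own procedure: the list of labels is rebuilt from the three frequencies quoted in the example because the underlying data file is not shown, and the name summary_table is a hypothetical helper.

```python
from collections import Counter

# Frequencies quoted in the example: 147 low, 224 average, 108 high.
# The label list is rebuilt from those counts (assumption: the real file is not available here).
risk_levels = ["Low"] * 147 + ["Average"] * 224 + ["High"] * 108


def summary_table(values):
    """Return (category, frequency, percentage) rows for a categorical variable."""
    counts = Counter(values)          # frequency of each category
    total = sum(counts.values())      # total sample size
    return [(cat, freq, round(100 * freq / total, 2)) for cat, freq in counts.items()]


rows = summary_table(risk_levels)
for category, frequency, percentage in rows:
    print(f"{category:<8} {frequency:>5} {percentage:>7.2f}%")
```

Each percentage is the category frequency divided by the total of 479, which is the same division described in the text.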
Visualizing Categorical Data
[Figure: Risk levels in the Retirement Funds sample. Risk categories (High, Average, Low) plotted against Frequency.]
Pie chart & Doughnut chart, side by side
Organizing Numerical Data
3 Ordered Array
4 Frequency Distribution
EXAMPLE 2.3
A manufacturer of insulation randomly selects 20 winter days
Raw data: 24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
2 Find range: 58 - 12 = 46
3 Select number of classes: 5 (usually between 5 and 15)
4 Compute class interval (width): 10 (46/5, then round up)
5 Determine class boundaries (limits)
6 Compute class midpoints: 15, 25, 35, 45, 55
7 Count observations & assign to classes
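The steps for building the frequency distribution can be sketched as follows, using the 20 values from Example 2.3. One assumption is labeled in the code: the first class is taken to start at 10, which is what a width of 10 and a first midpoint of 15 imply. Each class includes its lower boundary and excludes its upper boundary.

```python
data = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
        32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

ordered = sorted(data)                  # ordered array
data_range = max(data) - min(data)      # 58 - 12 = 46

num_classes = 5                         # chosen number of classes
width = 10                              # 46 / 5 = 9.2, rounded up to a convenient 10
first_lower = 10                        # assumption: implied by midpoint 15 and width 10

# Class boundaries and midpoints
classes = [(first_lower + i * width, first_lower + (i + 1) * width)
           for i in range(num_classes)]
midpoints = [(low + high) // 2 for low, high in classes]

# Count the observations that fall in each class (low <= x < high)
frequencies = [sum(1 for x in data if low <= x < high) for low, high in classes]

for (low, high), mid, freq in zip(classes, midpoints, frequencies):
    print(f"{low} but less than {high}   midpoint {mid}   frequency {freq}")
```

The frequencies should add up to the sample size of 20, which is a quick check that every observation landed in exactly one class.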
Visualizing one variable
Histogram
Visualizing two variables
Scatter plot
Time-series plot
Chapter 3
Numerical Descriptive Measures
Chapter Overview
1 Measures of Central Tendency
2 Measures of Variation and Shape
3 Exploring Numerical Variables
4 Numerical Descriptive Measures for a Population
5 The Covariance and the Coefficient of Correlation
6 Descriptive Statistics: Pitfalls and Ethical Issues
Measures of Central Tendency
The mean
The arithmetic mean serves as a “balance point” in a set of data.
The sample mean is the sum of the values in a sample divided by
the number of values in the sample.
EXAMPLE 3.1
Nutritional data about a sample of seven breakfast cereals (stored in Cereals)
includes the number of calories per serving.
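The definition of the sample mean translates directly into code. The cereal calorie values are not reproduced in these slides, so this sketch uses the 20 values from Example 2.3 instead; that substitution is an assumption made purely for illustration.

```python
# Values from Example 2.3 (the cereal data from Example 3.1 is not shown in the slides).
data = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
        32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def sample_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


mean = sample_mean(data)
print(f"n = {len(data)}, sum = {sum(data)}, mean = {mean}")
```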
Median
The sample median is a measure of central tendency that
divides the data into two equal parts, half below the
median and half above.
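To find the sample median, arrange the data in ascending order. If the number of
values is odd, the median is the middle value. If the number of values is even,
the median is the average of the two middle values.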
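A minimal sketch of the median procedure, again using the 20 values from Example 2.3 as illustrative input (an assumption, since no data is listed for the median example). The helper name sample_median is hypothetical.

```python
data = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
        32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def sample_median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)        # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                  # odd count: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle values


median = sample_median(data)
print(median)
```

With 20 values, the two middle values are the 10th and 11th in the ordered array (30 and 32), so the median is their average.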
EXAMPLE 3.3
Mode
The mode is the value that
appears most frequently.
There may be no mode or
several modes.
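The sketch below shows one way to find the mode while covering both special cases named above: several modes and no mode. It uses the 20 values from Example 2.3 as illustrative input (an assumption), and it treats a data set in which every value occurs exactly once as having no mode. The helper name modes is hypothetical.

```python
from collections import Counter

data = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
        32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def modes(values):
    """Return the most frequent value(s), or an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(data))          # two values tie for most frequent
print(modes([1, 2, 3]))     # no value repeats
```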
EXAMPLE 3.3
Geometric Mean
EXAMPLE 3.4
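The formulas on these slides did not survive extraction, so the sketch below relies on the standard definitions: the geometric mean is the nth root of the product of n values, and the geometric mean rate of return is the nth root of the product of the growth factors (1 + R), minus 1. The input numbers are hypothetical and are not the data of Example 3.4.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))


def geometric_mean_return(returns):
    """Geometric mean rate of return for a sequence of period returns (as decimals)."""
    growth_factors = [1 + r for r in returns]
    return geometric_mean(growth_factors) - 1


# Hypothetical illustration: a 100% gain followed by a 50% loss.
returns = [1.0, -0.5]
print(geometric_mean_return(returns))      # overall growth rate per period
print(sum(returns) / len(returns))         # arithmetic mean of the same returns
```

In this illustration the investment ends where it started, which the geometric mean rate of return reflects, while the arithmetic mean of the two returns is 25%.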
3.2 Measures of
Variation and Shape
Measures of variation give information on the
spread (also called variability or dispersion) of the data
values.
Measures of variation include the range, the variance,
the standard deviation, and the coefficient of
variation.
Range = largest value − smallest value
Interquartile range = Q3 − Q1
The interquartile range measures the spread of the middle 50%
of the values.
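The measures of variation listed in this section can be computed with Python's standard statistics module. This sketch uses the 20 values from Example 2.3 as illustrative input (an assumption). Quartile conventions differ between textbooks and software: statistics.quantiles with its default settings interpolates between ranked values, so Q1 and Q3 here may not match a textbook's rounding rules.

```python
import statistics

data = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
        32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

data_range = max(data) - min(data)               # largest minus smallest
variance = statistics.variance(data)             # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)                 # sample standard deviation
cv = std_dev / statistics.mean(data) * 100       # coefficient of variation, in percent

# Quartiles with the module's default interpolation (may differ from textbook rules).
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                                    # spread of the middle 50% of the values

print(f"range = {data_range}")
print(f"variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
print(f"coefficient of variation = {cv:.2f}%")
print(f"Q1 = {q1}, Q3 = {q3}, interquartile range = {iqr}")
```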
The empirical rule states that for population data from a symmetric
mound-shaped distribution such as the normal distribution, the
following are true:
Approximately 68% of the values are within ±1 standard
deviation from the mean.
Approximately 95% of the values are within ±2 standard
deviations from the mean.
Approximately 99.7% of the values are within ±3 standard
deviations from the mean.
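The three percentages in the empirical rule can be checked against the normal distribution itself. This sketch assumes Python 3.8 or later, where statistics.NormalDist is available. The helper name share_within is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within ±{k} standard deviations: {share_within(k):.4f}")
```

The rule is stated for exactly normal data, so real mound-shaped data sets will only approximate these proportions.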
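Chebyshev’s Theorem
For heavily skewed data sets and data sets that do not appear to
be normally distributed, use Chebyshev’s theorem instead of
the empirical rule. For any data set, regardless of shape, at least
(1 − 1/k²) × 100% of the values lie within ±k standard deviations
of the mean, for any k greater than 1.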
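Chebyshev's theorem gives a lower bound rather than an approximate percentage: for any k greater than 1, at least 1 − 1/k² of the values lie within k standard deviations of the mean. A minimal sketch of that bound follows. The function name is hypothetical.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3):
    print(f"within ±{k} standard deviations: at least {chebyshev_lower_bound(k):.2%}")
```

Compared with the empirical rule, these bounds are lower (75% versus about 95% for two standard deviations) because they hold for data of any shape.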
3.5 The Covariance and
the Coefficient of
Correlation
Covariance
Coefficient of Correlation
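The slide formulas for these two measures were lost in extraction, so the sketch below relies on the standard sample definitions: the covariance is the sum of the products of paired deviations from the means divided by n − 1, and the coefficient of correlation is the covariance divided by the product of the two sample standard deviations. The paired values are hypothetical and chosen so that the relationship is perfectly linear.

```python
import math


def sample_covariance(x, y):
    """Sum of paired deviation products divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Sample covariance divided by the product of the sample standard deviations."""
    std_x = math.sqrt(sample_covariance(x, x))   # covariance of x with itself is its variance
    std_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (std_x * std_y)


# Hypothetical paired data with a perfect positive linear relationship (y = 2x).
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

print(sample_covariance(x, y))
print(correlation(x, y))
```

The covariance depends on the units of the two variables, while the coefficient of correlation always falls between −1 and +1, which makes it easier to interpret.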
3.6 Descriptive
Statistics: Pitfalls and
Ethical Issues
You should report the summary measures that best describe and communicate the
important aspects of the data set.
You should document both good and bad results.
In all presentations, you need to report results in a fair, objective, and neutral
manner.
You should not use inappropriate summary measures to distort facts.
Thank You
Don't forget to review the lesson. See
you in the next lesson.