
Chapter 1

Defining and
Collecting Data
Chapter Overview
1 Defining Variables
2 Collecting Data
3 Types of Sampling Methods
4 Data Cleaning
5 Other Data Preprocessing Tasks
6 Types of Survey Errors
1. Defining Variables

Categorical (qualitative) variables take categories as their
values, such as “yes”, “no”, or “blue”, “brown”, “green”.

Numerical (quantitative) variables have values that represent
a counted or measured quantity.
- Discrete variables arise from a counting
process.
- Continuous variables arise from a measuring
process.
2. Collecting Data

- Data is collected from either a population or a sample.
- Population: all of the items or individuals of interest
that you seek to study.
- Parameter: a numerical measurement describing
some characteristic of a population.
- Sample: a portion of a population of interest.
- Statistic: a numerical measurement describing some
characteristic of a sample.
We use statistics to make inferences about
parameters.
Data Sources

- Capturing data generated by ongoing business activities.
- Distributing data compiled by an organization or individual.
- Compiling the responses from a survey.
- Conducting a designed experiment and recording the
outcomes of the experiment.
- Conducting an observational study and recording the results
of the study.
* Primary Sources: The data collector is the one using the data
for analysis.
* Secondary Sources: The person performing data analysis is
not the data collector.
3. Types of Sampling Methods
Nonprobability Sample: select the items or individuals
without regard to their probability of occurrence.
- Convenience sample: items are selected based only on the
fact that they are easy, inexpensive, or convenient to sample.
- Judgment sample: you get the opinions of preselected experts
in the subject matter.
Probability Sample: select items based on known
probabilities.
- Simple random sample:
+ Every individual or item from the population has an equal
chance of being selected.
+ Selection may be with replacement (the selected individual is
returned to the frame for possible reselection) or without
replacement (the selected individual isn’t returned to the
frame).
+ Samples are obtained from a table of random numbers or from
computer random number generators.
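The two selection modes can be sketched with Python's standard `random` module. The frame of 100 numbered items, the sample size of 10, and the fixed seed are all assumptions made only for illustration.

```python
import random

# Hypothetical frame: 100 items identified by the numbers 1..100.
frame = list(range(1, 101))

# Fixed seed (an assumption) so that the run is repeatable.
rng = random.Random(42)

# Without replacement: a selected item is not returned to the frame,
# so no item can appear twice in the sample.
without_replacement = rng.sample(frame, k=10)

# With replacement: a selected item is returned to the frame,
# so the same item may be selected again.
with_replacement = rng.choices(frame, k=10)

print("Without replacement:", without_replacement)
print("With replacement:   ", with_replacement)
```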
- Systematic sample:
+ Decide on sample size: n
+ Divide population of N individuals into groups of k individuals:
k = N/n (round k to the nearest integer).
+ Randomly select one individual from the first group.
+ Select every kth individual thereafter.
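A minimal sketch of the systematic sampling steps, assuming the population is held in a Python list. The function name `systematic_sample` and the population of 40 numbered individuals are hypothetical.

```python
import random

def systematic_sample(population, n, rng):
    """Pick a random start in the first group, then every kth individual."""
    N = len(population)
    k = round(N / n)                 # group size, rounded to the nearest integer
    start = rng.randrange(k)         # random position inside the first group
    # Every kth individual thereafter. If k was rounded up, fewer than n
    # individuals may be available, so the slice caps the result at n.
    return population[start::k][:n]

# Hypothetical population of N = 40 numbered individuals, sample size n = 4,
# so k = 40 / 4 = 10.
population = list(range(1, 41))
sample = systematic_sample(population, n=4, rng=random.Random(1))
print(sample)
```

The selected individuals are always exactly k positions apart; only the starting point is random.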
- Stratified sample:
+ Divide the population into two or more subgroups (called strata)
according to some common characteristic.
+ A simple random sample is selected from each subgroup, with
sample sizes proportional to strata sizes.
+ Samples from subgroups are combined into one.
+ This is a common technique when sampling a population of
voters, stratifying across racial or socio-economic lines.
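Proportional allocation can be sketched as follows. The three strata (A, B, C), their sizes, and the function name `stratified_sample` are assumptions for illustration only.

```python
import random

def stratified_sample(strata, n, rng):
    """Proportional-allocation stratified sample.

    strata maps a stratum name to the list of its members.
    """
    N = sum(len(members) for members in strata.values())
    combined = []
    for members in strata.values():
        # Sample size proportional to the stratum's share of the population.
        # Rounding means the sizes may not add up to exactly n in general.
        n_stratum = round(n * len(members) / N)
        # Simple random sample (without replacement) inside the stratum.
        combined.extend(rng.sample(members, n_stratum))
    return combined

# Hypothetical population of 100 split into strata of 60, 30, and 10.
strata = {
    "A": [f"A{i}" for i in range(60)],
    "B": [f"B{i}" for i in range(30)],
    "C": [f"C{i}" for i in range(10)],
}
sample = stratified_sample(strata, n=20, rng=random.Random(7))
print(sample)
```

With these sizes, a sample of 20 takes 12, 6, and 2 members from strata A, B, and C.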
- Cluster sample:
+ Population is divided into several “clusters”. Clusters are often
naturally occurring groups, such as counties, election districts,
city blocks, households.
+ Take a random sample of one or more clusters and study all
items in each selected cluster.
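A short sketch of cluster sampling, assuming four hypothetical city blocks as clusters and households as items. Whole clusters are selected at random and every item in a selected cluster is kept.

```python
import random

# Hypothetical clusters: city blocks and the households in each block.
clusters = {
    "block_1": ["h1", "h2", "h3"],
    "block_2": ["h4", "h5"],
    "block_3": ["h6", "h7", "h8", "h9"],
    "block_4": ["h10", "h11"],
}

rng = random.Random(3)  # fixed seed (an assumption) for a repeatable run

# Randomly select two whole clusters.
chosen = rng.sample(sorted(clusters), k=2)

# Study all items in each selected cluster.
sample = [item for name in chosen for item in clusters[name]]

print(chosen)
print(sample)
```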
4. Data Cleaning
+ Collected data may be inaccurate or inconsistent, and such
data may affect statistical results.
+ Cleaning data fixes defects and ensures your data has quality.
+ Data cleaning seeks to correct the following types of anomalies:
• Invalid variable values
• Encoding errors
• Data integration errors
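As one illustration of the first anomaly type, the sketch below flags invalid variable values. The record layout, the allowed categories, and the plausible age range are all hypothetical choices, not rules from the slides.

```python
# Hypothetical survey records; field names and limits are assumptions.
records = [
    {"id": 1, "answer": "yes", "age": 34},
    {"id": 2, "answer": "no", "age": -5},      # impossible age
    {"id": 3, "answer": "maybe?", "age": 41},  # undefined category
    {"id": 4, "answer": "yes", "age": 230},    # outside the plausible range
]

VALID_ANSWERS = {"yes", "no"}   # defined categories of the categorical variable
MIN_AGE, MAX_AGE = 0, 120       # assumed plausible range of the numerical variable

def invalid_fields(record):
    """Return the names of the fields whose values are invalid."""
    problems = []
    if record["answer"] not in VALID_ANSWERS:
        problems.append("answer")
    if not MIN_AGE <= record["age"] <= MAX_AGE:
        problems.append("age")
    return problems

# Map each problematic record id to its invalid fields.
flagged = {}
for record in records:
    problems = invalid_fields(record)
    if problems:
        flagged[record["id"]] = problems

print(flagged)
```

Flagging is kept separate from correcting on purpose: how to repair a flagged value depends on the data source.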
5. Other Data Preprocessing Tasks
+ Reformatting means rearranging the structure of the
data or changing the electronic encoding of the data, or
both.
+ Stacking and Unstacking Data
When collecting data for a variable, you may need to
break the data into two or more groups for analysis.
+ Recoding Variables
After data collection, you may need to review the
categories you have defined for a categorical variable or
transform a numerical variable into a categorical variable
by assigning each individual numerical value to one of
several groups.
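Recoding a numerical variable into a categorical one can be sketched as below. The age groups and their cut-off values are hypothetical; the point is that the groups do not overlap and together cover every possible value.

```python
def recode_age(age):
    """Assign a numerical age to exactly one of three groups."""
    if age < 18:
        return "under 18"
    if age < 65:
        return "18 to 64"
    return "65 and over"

# Hypothetical numerical values, including the boundary cases 18 and 65.
ages = [12, 18, 40, 64, 65, 80]
groups = [recode_age(age) for age in ages]

for age, group in zip(ages, groups):
    print(age, "->", group)
```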
6. Types of Survey Errors
+ When you collect data using the responses compiled
from a survey, you must verify two things about the survey
to make sure you have results that can be used in a
decision-making process.
+ Coverage Error.
+ Nonresponse Error.
+ Sampling Error.
+ Measurement Error.
+ Ethical Issues About Surveys.
Chapter 2
Organizing and
Visualizing Variables

Organizing Categorical Data

1 Summary Table for one variable
2 The Contingency Table for two variables
1 Summary Table
A summary table tallies the frequencies or percentages of
items in a set of categories so that you can see
differences between categories.
EXAMPLE 2.1

The sample of 479 retirement funds includes the variable Risk Level that
has the defined categories low, average, and high. Construct a
summary table of the retirement funds, categorized by risk.

EXAMPLE
2.2
Risk level   Frequency   Percentage
Low          147          30.69%
Average      224          46.76%
High         108          22.55%
Total        479         100.00%

The percentages for each category are calculated by dividing the number of funds in
each category by the total sample size. 147/479, 224/479, 108/479.
Observe that almost half the funds have an average risk, about 30% have low risk, and
less than a quarter have high risk.
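The percentage column can be reproduced with a few lines of Python. The frequencies come from the table above; the dictionary layout and variable names are illustrative choices.

```python
# Frequencies of the Risk Level categories in the sample of 479 funds.
frequencies = {"Low": 147, "Average": 224, "High": 108}

total = sum(frequencies.values())

# Percentage = category frequency divided by the total sample size.
percentages = {level: 100 * count / total for level, count in frequencies.items()}

print(f"{'Risk level':<12}{'Frequency':>10}{'Percentage':>12}")
for level, count in frequencies.items():
    print(f"{level:<12}{count:>10}{percentages[level]:>11.2f}%")
print(f"{'Total':<12}{total:>10}{100:>11.2f}%")
```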

1 Summary Table: Visualizing
Bar chart: Risk levels in the Retirement Funds sample (Risk categories Low, Average, and High against Frequency, axis from 0 to 250)
Pie chart & Doughnut chart


2 The Contingency Table for two variables
A contingency table cross-tabulates, or tallies jointly, the
data of two or more categorical variables, allowing you to
study patterns that may exist between the variables.

Both the rows and the columns represent variables.
Sample size = 400. Each invoice is categorized as small, medium, or large.
Each invoice is also examined to identify whether it contains any errors.
2 Contingency Table Based On Percentage Of Overall Total

170/400 = 42.50%; 100/400 = 25.00%; 65/400 = 16.25%

83.75% = 335/400 of sampled invoices have no errors;
47.50% = 190/400 of sampled invoices are for small amounts;
35.00% = 140/400 of sampled invoices are for medium amounts;
17.50% = 70/400 of sampled invoices are for large amounts.
2 Contingency Table Based On Percentage of Row Totals

170/190 = 89.47%; 100/140 = 71.43%; 65/70 = 92.86%

Medium invoices have a larger chance (28.57%) of having errors
than small (10.53%) or large (7.14%) invoices.
2 Contingency Table Based On Percentage Of Column Totals

170/335 = 50.75%; 20/65 = 30.77%

There is a 61.54% chance that invoices with errors are of
medium size.
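The three percentage views of the invoice table can be sketched in Python. The counts for invoices without errors (170, 100, 65) and the totals are quoted on the slides. The counts for invoices with errors (20, 40, 5) are inferred from those totals and percentages and should be read as an assumption. Rows are taken to be invoice sizes and columns error status.

```python
# Invoice counts by size (rows) and error status (columns).
# The "Errors" counts are inferred: 190 - 170, 140 - 100, 70 - 65.
counts = {
    "Small":  {"No errors": 170, "Errors": 20},
    "Medium": {"No errors": 100, "Errors": 40},
    "Large":  {"No errors": 65,  "Errors": 5},
}

row_totals = {size: sum(row.values()) for size, row in counts.items()}
col_totals = {}
for row in counts.values():
    for status, count in row.items():
        col_totals[status] = col_totals.get(status, 0) + count
grand_total = sum(row_totals.values())

def pct(part, whole):
    """Percentage of whole represented by part, rounded to two decimals."""
    return round(100 * part / whole, 2)

# Each cell as a percentage of the overall total, its row total,
# and its column total.
overall = {size: {status: pct(c, grand_total) for status, c in row.items()}
           for size, row in counts.items()}
by_row = {size: {status: pct(c, row_totals[size]) for status, c in row.items()}
          for size, row in counts.items()}
by_col = {size: {status: pct(c, col_totals[status]) for status, c in row.items()}
          for size, row in counts.items()}

print("Overall:", overall)
print("By row: ", by_row)
print("By col: ", by_col)
```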
2 The Contingency Table for two variables: Visualizing

Side-by-side bar chart
Organizing Numerical Data

3 Ordered Array
4 Frequency Distribution

Ordered array, frequency distribution, relative frequency distribution,
percentage distribution, cumulative percentage distribution

3 Ordered Array

An ordered array is a sequence of data, in rank order, from the
smallest value to the largest value. It may help identify outliers
(unusual observations).

Stem-and-leaf display

A stem-and-leaf display organizes data into groups (called stems) so that
the values within each group (the leaves) branch out to the right on each
row.

4 Frequency Distribution

The frequency distribution is a summary table in which the data
are arranged into numerically ordered classes.

EXAMPLE 2.3

A manufacturer of insulation randomly selects 20 winter days
and records the daily high temperature.

24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

STEP 1 Sort raw data in ascending order:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

STEP 2 Find range: 58 - 12 = 46
STEP 3 Select number of classes: 5 (usually between 5 and 15)
STEP 4 Compute class interval (width): 10 (46/5 then round up)
STEP 5 Determine class boundaries (limits)
STEP 6 Compute class midpoints: 15, 25, 35, 45, 55
STEP 7 Count observations & assign to classes
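The steps can be followed in Python using the 20 temperatures from the example. The class boundaries are not listed on the slide. The sketch assumes classes of the form "10 but less than 20", which is consistent with the stated width of 10 and midpoints of 15 to 55.

```python
import math

temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

ordered = sorted(temps)                              # sort in ascending order
data_range = ordered[-1] - ordered[0]                # range: 58 - 12 = 46
num_classes = 5                                      # number of classes
width = math.ceil(data_range / num_classes)          # 46/5 = 9.2, rounded up to 10

# Assumed first lower boundary of 10, giving classes [10, 20), ..., [50, 60).
first_lower = 10
classes = [(first_lower + i * width, first_lower + (i + 1) * width)
           for i in range(num_classes)]
midpoints = [(low + high) / 2 for low, high in classes]

# Count observations in each class: low <= value < high.
frequencies = [sum(low <= t < high for t in temps) for low, high in classes]

# Percentage and cumulative percentage distributions.
n = len(temps)
percentages = [100 * f / n for f in frequencies]
cumulative = [sum(percentages[: i + 1]) for i in range(num_classes)]

for (low, high), f, p, c in zip(classes, frequencies, percentages, cumulative):
    print(f"{low} but less than {high}: {f:>2} {p:>5.1f}% {c:>6.1f}%")
```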
Visualizing one variable
Histogram
Visualize two variables
Scatter plot
Time-series plot
Chapter 3
Numerical Descriptive Measures
Chapter Overview
1 Measures of Central Tendency
2 Measures of Variation and Shape
3 Exploring Numerical Variables
4 Numerical Descriptive Measures for a Population
5 The Covariance and the Coefficient of Correlation
6 Descriptive Statistics: Pitfalls and Ethical Issues
3.1 Measures of Central Tendency
The mean
The arithmetic mean serves as a “balance point” in a set of data
The sample mean is the sum of the values in a sample divided by
the number of values in the sample
EXAMPLE 3.1
Nutritional data about a sample of seven breakfast cereals (stored in Cereals)
includes the number of calories per serving.
Median
The sample median is a measure of central tendency that
divides the data into two equal parts, half below the
median and half above.

To find the sample median, we arrange the data in
ascending order.
EXAMPLE
3.3
Mode
The mode is the value that
appears most frequently.
There may be no mode or
several modes.
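The mean, median, and mode defined above can be checked with Python's `statistics` module. The seven calorie values below are hypothetical stand-ins, since the data of the examples is not reproduced in these slides.

```python
import statistics

# Hypothetical calories per serving for a sample of seven cereals.
calories = [80, 100, 100, 110, 130, 190, 200]

# Sample mean: sum of the values divided by the number of values.
mean = sum(calories) / len(calories)

# Sample median: middle value after arranging the data in ascending order.
median = statistics.median(calories)

# Mode(s): the value or values that appear most frequently.
# multimode returns a list, so several modes can be reported.
modes = statistics.multimode(calories)

print("Mean:  ", mean)
print("Median:", median)
print("Modes: ", modes)
```

Here the mean (130) is larger than the median (110) because the two largest values pull the balance point upward.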
EXAMPLE
3.3
Geometric Mean
EXAMPLE
3.4
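The formula slides for the geometric mean did not survive extraction. As a hedged sketch, the code below uses the standard definitions: the geometric mean is the nth root of the product of n values, and the geometric mean rate of return is the geometric mean of the growth factors (1 + R) minus 1. The function names and the sample rates are hypothetical. `math.prod` requires Python 3.8 or later.

```python
import math

def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def geometric_mean_rate_of_return(rates):
    """Average rate of change per period, given rates as decimals."""
    growth_factors = [1 + r for r in rates]
    return geometric_mean(growth_factors) - 1

# Hypothetical investment: loses 50% in year 1, gains 100% in year 2.
rates = [-0.50, 1.00]

arithmetic = sum(rates) / len(rates)
geometric = geometric_mean_rate_of_return(rates)

print(f"Arithmetic mean rate: {arithmetic:.2%}")
print(f"Geometric mean rate:  {geometric:.2%}")
```

The investment ends exactly where it started, which the geometric mean rate of 0% reflects and the arithmetic mean of 25% does not.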
3.2 Measures of
Variation and Shape
Measures of variation give information on the
spread or variability or dispersion of the data
values.
Measures of variation: the range, the variance,
the standard deviation and the coefficient of
variation.
Range

Range = the largest value - the smallest value


Sample Variance and
Sample Standard Deviation
EXAMPLE
3.5
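The range, sample variance, and sample standard deviation can be sketched as follows. The formula slides are missing here, so the code assumes the usual definition: the sample variance is the sum of squared differences from the sample mean divided by n - 1. The data values are hypothetical.

```python
import math
import statistics

# Hypothetical sample of seven values.
values = [80, 100, 100, 110, 130, 190, 200]

n = len(values)
mean = sum(values) / n

# Range = the largest value - the smallest value.
value_range = max(values) - min(values)

# Sample variance: sum of squared differences from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in values) / (n - 1)

# Sample standard deviation: square root of the sample variance.
std_dev = math.sqrt(variance)

print("Range:             ", value_range)
print("Sample variance:   ", variance)
print("Sample std. dev.:  ", round(std_dev, 2))

# Cross-check against the standard library's sample (n - 1) versions.
assert math.isclose(variance, statistics.variance(values))
assert math.isclose(std_dev, statistics.stdev(values))
```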
Coefficient of Variation
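The defining slide is missing here. The sketch assumes the usual definition of the coefficient of variation: the sample standard deviation divided by the sample mean, expressed as a percentage. Because it is unit-free, it allows comparing the variability of variables measured in different units. Both data sets are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean, as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical measurements in different units.
weights_kg = [60, 62, 64, 66, 68]
heights_cm = [150, 160, 170, 180, 190]

cv_weights = coefficient_of_variation(weights_kg)
cv_heights = coefficient_of_variation(heights_cm)

print(f"CV of weights: {cv_weights:.2f}%")
print(f"CV of heights: {cv_heights:.2f}%")
```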
Z Scores
Suppose the mean math SAT score is 490, with a standard
deviation of 100. Compute the Z-score for a test score of 620.
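The exercise can be worked with the usual Z-score definition, assumed here because the formula slide is missing: the difference between the value and the mean, divided by the standard deviation. The function name is illustrative.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std_dev

# Mean math SAT score 490, standard deviation 100, test score 620.
z = z_score(620, mean=490, std_dev=100)
print(f"Z = {z:.2f}")  # prints "Z = 1.30"
```

A score of 620 is therefore 1.3 standard deviations above the mean.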
Shape
Describes how data are distributed.
Two useful shape related statistics are:
- Skewness: measures the extent to which data values
are not symmetrical around the mean.
- Kurtosis: measures the peakedness of the curve of the
distribution - that is, how sharply the curve rises
approaching the center of the distribution.
3.3 Exploring Numerical
Variables
Quartiles
Interquartile Range

Interquartile range = Q3 − Q1
The interquartile range measures the spread of the middle 50%
of the values.

The five-number summary for a variable consists of the
smallest value, Q1, Q2, Q3, and the largest value.
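A sketch of the quartiles, interquartile range, and five-number summary using `statistics.quantiles` (Python 3.8 or later). Its default "exclusive" method interpolates at position p(n + 1), which may differ slightly from a textbook's rounding rules for other sample sizes. The seven data values are hypothetical.

```python
import statistics

# Hypothetical sample of seven values.
values = [80, 100, 100, 110, 130, 190, 200]

# Q1, Q2 (the median), and Q3 divide the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4)

# Interquartile range = Q3 - Q1, the spread of the middle 50% of the values.
iqr = q3 - q1

# Five-number summary: smallest value, Q1, Q2, Q3, largest value.
five_number_summary = (min(values), q1, q2, q3, max(values))

print("Q1, Q2, Q3:", q1, q2, q3)
print("Interquartile range:", iqr)
print("Five-number summary:", five_number_summary)
```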
Boxplot
The boxplot visualizes the shape of the distribution of the values for a variable.

Boxplots can be drawn either horizontally or vertically. When drawn vertically,
the lowest values appear towards the bottom and Q1 is below Q3.
3.4 Numerical
Descriptive Measures
for a Population
Population Mean
The arithmetic mean serves as a “balance point” in a set of data.
The population mean is the sum of the values in the population divided by
the number of values in the population.
Population Variance and
Population Standard Deviation
The Empirical Rule

The empirical rule states that for population data from a symmetric
mound-shaped distribution such as the normal distribution, the
following are true:
Approximately 68% of the values are within ±1 standard
deviation from the mean.
Approximately 95% of the values are within ±2 standard
deviations from the mean.
Approximately 99.7% of the values are within ±3 standard
deviations from the mean.
Chebyshev’s Theorem
For heavily skewed sets of data and data sets that do not appear to
be normally distributed, you should use Chebyshev’s theorem instead of
the empirical rule.
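Chebyshev's theorem states that, for any data set, at least (1 - 1/k²) × 100% of the values lie within k standard deviations of the mean, for k > 1. The sketch below compares that lower bound with the empirical rule. The mean of 50 and standard deviation of 5 are hypothetical.

```python
def chebyshev_min_fraction(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

# Approximate fractions from the empirical rule (mound-shaped data only).
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}

# Hypothetical population mean and standard deviation.
mean, std_dev = 50.0, 5.0

for k in (1, 2, 3):
    low, high = mean - k * std_dev, mean + k * std_dev
    # The theorem gives no useful bound for k = 1.
    bound = f"{chebyshev_min_fraction(k):.2%}" if k > 1 else "not applicable"
    print(f"{low:.0f} to {high:.0f}: "
          f"empirical rule about {EMPIRICAL_RULE[k]:.1%}, "
          f"Chebyshev at least {bound}")
```

The Chebyshev bounds (75% for k = 2, about 88.89% for k = 3) are weaker than the empirical rule because they hold for any shape of distribution.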
3.5 The Covariance and
the Coefficient of
Correlation
Covariance
Coefficient of Correlation

Features of the Coefficient of Correlation

Ranges between -1 and 1.
The closer to -1, the stronger the negative linear relationship.
The closer to 1, the stronger the positive linear relationship.
The closer to 0, the weaker the linear relationship.
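The formula slides for covariance and correlation are missing here. The sketch assumes the usual sample definitions: the sample covariance is the sum of the products of paired deviations from the means divided by n - 1, and the coefficient of correlation is the covariance divided by the product of the two sample standard deviations. The data are hypothetical and chosen to show the two extremes.

```python
import math

def sample_covariance(x, y):
    """Sum of products of paired deviations from the means, over n - 1."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    s_x = math.sqrt(sample_covariance(x, x))  # covariance of x with itself = variance
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # rises exactly in step with x
y_down = [10, 8, 6, 4, 2]    # falls exactly in step with x

print("Covariance (x, y_up):   ", sample_covariance(x, y_up))
print("Correlation (x, y_up):  ", round(correlation(x, y_up), 4))
print("Correlation (x, y_down):", round(correlation(x, y_down), 4))
```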

3.6 Descriptive
Statistics: Pitfalls and
Ethical Issues
Pitfalls and Ethical Issues

You should report the summary measures that best describe and communicate the
important aspects of the data set.
You should document both good and bad results.
In all presentations, you need to report results in a fair, objective, and neutral
manner.
You should not use inappropriate summary measures to distort facts.
Thank You
Don't forget to review the lesson. See you in the next
lesson.
