Module 4

introduction to data analysis
basic statistics
Sampling Design
● Non-Probability
○ Purposive Sampling (hey, can I interview you?)
○ Snowball Sampling (do you know someone like you?)
○ Quota Sampling (I only need 30 people, can I interview the first 30 people in this group?)
● Probability
○ Simple Random Sampling (let’s do draw lots)
○ Systematic Sampling (I’ll interview every 4th house in this street)
○ Stratified Sampling (I’ll take sample for every year level or group)
2
Sampling vs Non-sampling Errors
3
Sampling Errors
Sampling errors are affected by factors such as the size and design of the sample, population
variability, and sampling fraction.
Categories of Sampling Errors:
● Population Specification Error – Happens when the analysts do not understand who to survey.
For example, for a survey of breakfast cereals, the population can be the mother, children, or the
entire family.
● Selection Error – Occurs when the survey participation is self-selected by the respondents
implying only those who are interested respond. Selection error can be reduced by encouraging
participation.
● Sample Frame Error – Occurs when a sample is selected from the wrong population data.
● Non-Response Error – Occurs when a useful response is not obtained from the surveys. It may
happen due to the inability to contact potential respondents or their refusal to respond.
https://corporatefinanceinstitute.com/resources/knowledge/other/non-sampling-error/ 4
Non-sampling Errors
most common non-sampling errors include errors in data entry, biased questions
and decision-making, non-responses, false information, and inappropriate
analysis.
Mechanics of Non-Sampling Error

• Random errors -- Random errors are errors that cannot be accounted for and
just happen. In statistical studies, it is believed that each random error offsets
each other, generally speaking, so they are of little to no concern.
• Systematic errors -- Systematic errors affect the sample of the study and, as a
result, will often create useless data. A systematic error is consistent and
repeatable, so the creators of the study must take great care to mitigate such an
error.
Non-sampling Errors
• Non-response error -- caused by the differences between the people who choose to participate
compared to the people who do not participate in a given survey.
• Measurement error -- refers to all errors relating to the measurement of each sampling unit, as
opposed to errors relating to how they were selected, often arises when there are confusing questions,
low-quality data due to sampling fatigue (i.e., someone is tired of taking a survey), and low-quality
measurement tools.
• Interviewer error -- occurs when the interviewer (or administrator) makes an error when recording a
response. In qualitative research, an interviewer may lead a respondent to answer a certain way. In
quantitative research, an interviewer may ask the question in a different way, which leads to a
different end result.
• Adjustment error -- situation where the analysis of the data adjusts it in such a way that it is not
entirely accurate. Forms of adjustment error include errors with weighting the data, data cleaning, and
imputation.
• Processing error -- A processing error arises when there is a problem with processing the data that
causes an error of some kind. An example will be if the data were entered incorrectly or if the data file is
corrupt.
Sampling vs Non-sampling Errors
7
Reducing Errors
suppose we now have our representative
sample data, how do we get insights from
it?
9
well that’s where basic statistics comes
in.
10
Identify your data
is it univariate or multivariate?
is it categorical or continuous?
11
then?
12
Look for patterns, by doing
exploratory data analysis.
13
How to look for patterns?
14
Descriptive Statistics and Data
Visualization!
Descriptive Statistics
• Univariate
• Mean
• Median
• Mode - categorical
• Variance
• Kurtosis
• Skewness
• Multivariate
• Pairwise Correlation
Data Visualization: Next Session
• Univariate
• Plot the Histogram (Distribution)
• Plot the Data if Time Series
income distribution
• Multivariate
• Plot X and Y
• 3D
y
x
Time Series
(not covered)
Basic Statistics
● Mean
● Median
● Mode
● Variance
● Standard Deviation
● Kurtosis
● Skewness
Basic Statistics
● Mean (the average value)
● Median (the middle value)
● Mode (the frequent value)
● Variance (the value for homogeneity or heterogeneity)
● Standard Deviation (standardized variance)
● Kurtosis (another measure for variance)
● Skewness (the location of concentration of points)
Central Tendencies
● Mean -- the sum of a variable’s values divided by the total number of values
● Median -- the middle value of a variable
● Mode -- the value that occurs most often

Central Tendencies
The incomes of five randomly selected people in the United States are:
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Mean =
● Median =
● Mode =
Central Tendencies
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Mean = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000
● Median =
● Mode =
Central Tendencies
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Mean = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000
● Median = $45,000
● Mode =
Central Tendencies
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Mean = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000
● Median = $45,000
● Mode = $10,000
Example 1
Mean, Median, Mode
Mean, Median and Mode
Employee Salary (Annual)
John 420,000
Mike 510,000
Kate 630,000
Shane 450,000
Catty 550,000
Mathew 900,000
Sam 1,000,000
John 420,000
Mike 510,000
Kate 630,000
Shane 450,000
Catty 550,000
Mathew 900,000
Sam 1,000,000
Mean
John 420,000
Mike 510,000
Kate 630,000
Shane 450,000
Catty 550,000
Mathew 900,000
Sam 1,000,000
Mean: (420,000 + 510,000 + 630,000 + . . . + 1,000,000) / 7 = 637,142.86

Median: Sort the data first and take the
middle
John 420,000
Mike 510,000
Kate 630,000
Shane 450,000
Catty 550,000
Mathew 900,000
Sam 1,000,000
Median: 420000, 450000, 510000, 550000, 630000, 900000, 1000000

Mode: Applicable for Categorical
Variables Only
John 420,000
Mike 510,000
Kate 630,000
Shane 450,000
Catty 550,000
Mathew 900,000
Sam 1,000,000
Example 2
Mean, Median, Mode
Customer Quantity of orders Country
John 160 UK
Mike 1440 UK
Kate 1200 EIRE
Shane 10 France
Catty 20 UK
Mathew 30 France
Sam 60 Belgium
Mean
John 160 UK
Mike 1440 UK
Kate 1200 EIRE
Shane 10 France
Catty 20 UK
Mathew 30 France
Sam 60 Belgium
Median
John 160 UK
Mike 1440 UK
Kate 1200 EIRE
Shane 10 France
Catty 20 UK
Mathew 30 France
Sam 60 Belgium
Mode
John 160 UK
Mike 1440 UK
Kate 1200 EIRE
Shane 10 France
Catty 20 UK
Mathew 30 France
Sam 60 Belgium
Mean
John 160 UK
Mike 1440 UK
Kate 1200 EIRE
Shane 10 France
Catty 20 UK
Mathew 30 France
Sam 60 Belgium
Mean: (160 + 1440 + 1200 + . . . + 60) / 7 = 417.14

Median: Sort the data first and take the
middle
Customer Quantity of orders
John 160
Mike 1440
Kate 1200
Shane 10
Catty 20
Mathew 30
Sam 60
Median: 10, 12, 30, 60, 160, 1200, 1440

Variables Only
John 160 UK
Mike 1440 UK
Kate 1200 EIRE
Shane 10 France
Catty 20 UK
Mathew 30 France
Sam 60 Belgium
Variables Only
John 160 UK
Mike 1440 UK
Kate 1200 EIRE
Shane 10 France
Catty 20 UK
Mathew 30 France
Sam 60 Belgium
Number from UK = 3, Number from France = 2, Number of Belgium = 1, Number of EIRE = 1

Mode = UK
Central Tendencies
$10,000, $10,000, $45,000, $60,000, and $1,000,000

Central Tendencies
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Mean --
● Median --
● Mode --
Central Tendencies
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Mean = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000
● Median = $45,000
● Mode = $10,000
what are the insights from these
statistics (mean, median, mode, etc.)?
how sure are we on our estimates?
well, we can estimate that using the
variance.
Measures of Dispersion
● Range -- difference between the smallest and largest values in the data
● Variance -- calculated by taking the average of the squared differences between
each value and the mean
● Standard Deviation -- square root of the variance
● Skew -- measure of whether some values of a variable are extremely different from
the majority of the values
Degrees of Freedom
Standard deviation in a population is:
The estimate of population standard deviation calculated from a random sample

Variance
Employee Salary (Annual) in 1000s Mean = 637,142.86
Mean in 1000s = 637
John 160
Variance = (Salary - Mean)^2 / (n - 1)
Mike 1440
(Salary - Mean)^2 = (420 - 637)^2
Kate 1200 + (510 - 637)^2
+ (630 - 637)^2
Shane 10
:
Catty 20 + (1,000 - 637)^2
= 51123.81
Mathew 30
Variance = 306742.86 / (7 - 1) = 306742.86 / 6 = 51123.81
Sam 60
Standard Deviation
Employee Quantity of Orders Mean = 637,142.86
Mean in 1000s = 637
John 160 Variance = (Salary - Mean)^2 / (n - 1)
Mike 1440
(Salary - Mean)^2 = (420 - 637)^2
Kate 1200 + (510 - 637)^2
+ (630 - 637)^2
Shane 10 :
Catty 20 + (1,000 - 637)^2
= 51123.81
Mathew 30
Variance = 306742.86 / (7 - 1) = 306742.86 / 6 = 51123.81
Sam 60 Standard Deviation = Square Root of Variance = 226.11
Variance
Employee Quantity of Orders
Mean = 417.14
John 160 Variance = (Salary - Mean)^2 / (n - 1)
Mike 1440 (Salary - Mean)^2 = (160 - 417.14)^2

+ (1440 - 417.14)^2
Kate 1200
+ (1200 - 417.14)^2
Shane 10 :
+ (60 - 417.14)^2
Catty 20 = 2326135.72
Mathew 30
Variance = 2326135.72 / (7 - 1) = 2326135.72 / 6 = 387689.28
Sam 60
Standard Deviation
Employee Quantity of Orders Mean = 417.14
Variance = (Salary - Mean)^2 / (n - 1)
John 160
Mike 1440 (Salary - Mean)^2 = (160 - 417.14)^2

+ (1440 - 417.14)^2
Kate 1200 + (1200 - 417.14)^2
:
Shane 10 + (60 - 417.14)^2
Catty 20 = 2326135.72
Mathew 30 Variance = 2326135.72 / (7 - 1) = 2326135.72 / 6 = 387689.28

Standard Deviation = Square Root of Variance = 622.65
Sam 60
the larger the variance/standard
deviation the larger the variability of the
data, and vice-versa
histogram
Histogram
Employee Quantity of Orders
John 160
Mike 1440
Kate 1200
Shane 10
Catty 20
Mathew 30
Sam 60
Histogram
John 420,000
Mike 510,000
Kate 630,000
Shane 450,000
Catty 550,000
Mathew 900,000
Sam 1,000,000
Histogram of Salaries of 1000 Employees
Employee New Salary (Annual)
John 408,562.78
Mike 524,922.50
Kate 570,671.90
Shane 495,470.25
Catty 695,920.38
Mathew 637,457.63
Sam 625,171.31
: :
Bell Curve
Skewed to the Right (Positively Skewed)
Skewed to the Left (Negatively Skewed)
Fitting a Model
Fitting a Model
Before Fitting: Data and Model
Big Difference or
Big Error
Data Model
After Fitting: Data and Model
Small Difference
or Small Error
Data Model
Well Fitted Normal Distribution
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Range --
● Variance --
● Standard Deviation --
● Skew --
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Range -- 1,000,000 - 10,000 = 990,000

● Variance --
● Skew --
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Range -- 1,000,000 - 10,000 = 990,000

● Variance -- [(10,000 - 225,000)^2 + (10,000 - 225,000)^2 + (45,000 - 225,000)^2 +
(60,000 - 225,000)^2 + (1,000,000 - 225,000)^2] / 5 = 150,540,000,000
● Skew --
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Range -- 1,000,000 - 10,000 = 990,000

● Variance -- [(10,000 - 225,000)^2 + (10,000 - 225,000)^2 + (45,000 - 225,000)^2 +
(60,000 - 225,000)^2 + (1,000,000 - 225,000)^2] / 5 = 150,540,000,000
● Standard Deviation -- Square Root (150,540,000,000) = 387,995
● Skew --
$10,000, $10,000, $45,000, $60,000, and $1,000,000
● Range -- 1,000,000 - 10,000 = 990,000

● Variance -- [(10,000 - 225,000)^2 + (10,000 - 225,000)^2 + (45,000 - 225,000)^2 +
(60,000 - 225,000)^2 + (1,000,000 - 225,000)^2] / 5 = 150,540,000,000
● Standard Deviation -- Square Root (150,540,000,000) = 387,995
● Skew -- Income is positively skewed
● Range -- difference between the smallest and largest values in the data
● Variance -- calculated by taking the average of the squared differences between
each value and the mean
● Standard Deviation -- square root of the variance
● Skew -- measure of whether some values of a variable are extremely different from
the majority of the values
Measure of Centrality
Exercises

Module 4

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Module 4

Uploaded by

Copyright:

Available Formats

introduction to data analysis

Mechanics of Non-Sampling Error

● Median -- the middle value of a variable

● Mode -- the value that occurs most often

$10,000, $10,000, $45,000, $60,000, and $1,000,000

$10,000, $10,000, $45,000, $60,000, and $1,000,000

● Mean = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000

$10,000, $10,000, $45,000, $60,000, and $1,000,000

● Mean = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000

$10,000, $10,000, $45,000, $60,000, and $1,000,000

● Mean = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000

Mean: (420,000 + 510,000 + 630,000 + . . . + 1,000,000) / 7 = 637,142.86

Median: 420000, 450000, 510000, 550000, 630000, 900000, 1000000

Kate 1200 EIRE

Kate 1200 EIRE

Kate 1200 EIRE

Kate 1200 EIRE

Kate 1200 EIRE

Mean: (160 + 1440 + 1200 + . . . + 60) / 7 = 417.14

Median: 10, 12, 30, 60, 160, 1200, 1440

Kate 1200 EIRE

Kate 1200 EIRE

Number from UK = 3, Number from France = 2, Number of Belgium = 1, Number of EIRE = 1

$10,000, $10,000, $45,000, $60,000, and $1,000,000

● Median -- the middle value of a variable

● Mode -- the value that occurs most often

$10,000, $10,000, $45,000, $60,000, and $1,000,000

$10,000, $10,000, $45,000, $60,000, and $1,000,000

● Mean = (10,000 + 10,000 + 45,000 + 60,000 + 1,000,000) / 5 = $225,000

The estimate of population standard deviation calculated from a random sample

Mike 1440 (Salary - Mean)^2 = (160 - 417.14)^2

Mike 1440 (Salary - Mean)^2 = (160 - 417.14)^2

Mathew 30 Variance = 2326135.72 / (7 - 1) = 2326135.72 / 6 = 387689.28

$10,000, $10,000, $45,000, $60,000, and $1,000,000

$10,000, $10,000, $45,000, $60,000, and $1,000,000

● Range -- 1,000,000 - 10,000 = 990,000

$10,000, $10,000, $45,000, $60,000, and $1,000,000

● Range -- 1,000,000 - 10,000 = 990,000

$10,000, $10,000, $45,000, $60,000, and $1,000,000

● Range -- 1,000,000 - 10,000 = 990,000

$10,000, $10,000, $45,000, $60,000, and $1,000,000

● Range -- 1,000,000 - 10,000 = 990,000

You might also like