Quantitative data
◦ It can be either discrete or continuous
Continuous data: Data that can take on any value in an
interval
Discrete data: Data that can take on only integer
values, such as counts
Qualitative data
◦ Data that can take on only a specific set of values
representing a set of possible categories
Ordinal data: Categorical data that has an explicit
ordering
Binary data: A special case of categorical data with just
two categories of values (0/1, true/false)
Data can be measured on one of four scales,
depending on the type of data
1. Nominal scale
Used for categorical variables, whose values are labels for
categories. The values are not numeric, so they cannot be added,
subtracted, divided or multiplied, and they have no order
2. Ordinal scale
Used for ordinal data, i.e. data where order matters.
Distances along the scale are not meaningful
3. Interval scale
Ordered labels with meaningful distances, but there is no
absolute zero value and ratios are not meaningful
4. Ratio scale
The ratio scale is the same as the interval scale except for
two differences: i) it has a true zero point, ii) ratios are meaningful
Practical Statistics for Data Scientists by Peter
Bruce and Andrew Bruce
https://www.statisticshowto.com
Research Methodologies
Median
◦ The value such that one-half of the data lies above it and one-half lies below it
e.g. data set 9 7 14 6 19 3 2 10 17
sorted 2 3 6 7 9 10 14 17 19
median = 9
e.g. data set 9 12 8 15 8 6
sorted 6 8 8 9 12 15
median = (8 + 9) / 2 = 8.5
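The two median examples can be checked with Python's standard `statistics` module. This is only a small sketch; the variable names are illustrative and not part of the slides.

```python
from statistics import median

odd_sized = [9, 7, 14, 6, 19, 3, 2, 10, 17]   # nine values: one middle value
even_sized = [9, 12, 8, 15, 8, 6]             # six values: two middle values

median_odd = median(odd_sized)     # middle value of the sorted data
median_even = median(even_sized)   # mean of the two middle values

print(median_odd, median_even)
```

With an even number of values there is no single middle value, so the two middle values (8 and 9 here) are averaged.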
Weighted median
◦ The value such that one-half of the sum of the weights lies above and
below the sorted data
e.g. data set 12 10 13 8 3 7 1 7 4
1 3 4 7 7 8 10 12 13
weighted median = 10
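The weighted median example does not list the weights. The sketch below assumes, purely for illustration, that each value is used as its own weight; under that assumption the result matches the slide. The function name `weighted_median` is hypothetical.

```python
def weighted_median(values, weights):
    """Return the first sorted value at which the running weight
    reaches half of the total weight."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("values must not be empty")


data = [12, 10, 13, 8, 3, 7, 1, 7, 4]
assumed_weights = data   # assumption: each value acts as its own weight
wm = weighted_median(data, assumed_weights)
print(wm)
```

Here the total weight is 65, so the running weight first reaches 32.5 at the value 10. With different weights the answer would generally differ.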
Outlier
◦ A data value that is very different from most of the data
Robust
◦ An estimate is said to be robust if it is not sensitive to extreme values
Deviations
◦ The difference between the observed values and the estimate of
location
e.g. data set 6 13 5 2 9 mean = 7
deviations -1 6 -2 -5 2
Mean absolute deviation
◦ The mean of the absolute value of the deviations from the mean
e.g. data set 6 13 5 2 9 mean = 7
absolute deviations 1 6 2 5 2
mean absolute deviation = 3.2
Variance
◦ The sum of squared deviations from the mean divided by n – 1
where n is the number of data values
e.g. data set 6 13 5 2 9 mean = 7
deviations -1 6 -2 -5 2
squared deviations 1 36 4 25 4
variance = 70 / 4 = 17.5
Standard deviation
◦ The square root of the variance
Median absolute deviation from the median (MAD)
◦ The median of the absolute value of the deviations from the
median
e.g. data set 6 13 5 2 9 median = 6
absolute deviations 0 7 1 4 3
MAD = 3
Range
◦ The difference between the largest and the smallest value
in a data set
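The variability estimates above can be reproduced for the data set 6 13 5 2 9 with a short sketch using the standard `statistics` module. The variable names are illustrative; `variance` and `stdev` divide by n – 1, which matches the definition given here.

```python
from statistics import mean, median, variance, stdev

data = [6, 13, 5, 2, 9]

center = mean(data)                                   # 7
deviations = [x - center for x in data]               # observed value minus mean
mean_abs_dev = mean([abs(d) for d in deviations])     # mean absolute deviation

sample_var = variance(data)    # sum of squared deviations / (n - 1)
sample_sd = stdev(data)        # square root of the variance

med = median(data)                                    # 6
mad = median([abs(x - med) for x in data])            # median absolute deviation
data_range = max(data) - min(data)

print(deviations, mean_abs_dev, sample_var, sample_sd, mad, data_range)
```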
Percentile
◦ The value such that P percent of the values take on this
value or less and (100–P) percent take on this value or
more
e.g. data set 5 2 4 8 1 2 5 6
we want to compute 80th percentile
sort data 1 2 2 4 5 5 6 8
let i be the position of the 80th percentile
i = (80 / 100) × 8 = 6.4, rounded up to 7
so the 80th percentile is the 7th value in the sorted data,
which is 6
Interquartile range (IQR)
◦ The difference between the 75th percentile and the 25th
percentile
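The position rule used in the percentile example can be sketched as a small function. One assumption is made here: when i is already a whole number, the value at that position is used as is. Other conventions exist, and statistical libraries often interpolate between neighbouring values, so their results may differ slightly. The function name `percentile` is illustrative.

```python
import math


def percentile(data, p):
    """P-th percentile using the position rule i = ceil(P / 100 * n)."""
    ordered = sorted(data)
    # 1-based position in the sorted data, never below the first value
    position = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[position - 1]


data = [5, 2, 4, 8, 1, 2, 5, 6]
p80 = percentile(data, 80)
iqr = percentile(data, 75) - percentile(data, 25)   # interquartile range
print(p80, iqr)
```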
Mode
◦ The most commonly occurring category or value in a
data set
Expected value
◦ When the categories can be associated with a numeric
value, this gives an average value based on a category’s
probability of occurrence
e.g. numeric value of Cat. 1 = 300, probability of Cat. 1 = 0.05
numeric value of Cat. 2 = 50, probability of Cat. 2 = 0.15
numeric value of Cat. 3 = 0, probability of Cat. 3 = 0.80
expected value = (300)(0.05) + (50)(0.15) + (0)(0.80)
expected value = 22.5
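The expected value is a probability-weighted sum, which the following sketch reproduces for the three categories above. The list names are illustrative.

```python
values = [300, 50, 0]                 # numeric value of each category
probabilities = [0.05, 0.15, 0.80]    # probability of each category

# The probabilities of all categories should add up to 1
assert abs(sum(probabilities) - 1.0) < 1e-9

expected_value = sum(v * p for v, p in zip(values, probabilities))
print(expected_value)
```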
[Figure: histogram; x-axis: Age]
Bar graph
◦ Similar to histogram, but depicts
qualitative/categorical data
[Figure: bar graph of Frequency by Movie Genre (Sci-Fi, Thriller, Comedy, Horror, Other)]
Probability Density Function (PDF)
◦ Another plot to draw the distribution of quantitative
data
[Figure: probability density plot; x-axis: Weight, y-axis: Probability]
Symmetric vs skewed
Light-tailed vs heavy-tailed
Unimodal vs multimodal
[Figure: symmetric distribution]
[Figure: left-skewed distribution and right-skewed distribution]
[Figure: bimodal distribution]
Point vs interval estimate
◦ Point estimation gives us a particular value as an
estimate of the population parameter
◦ Interval estimation gives us a range of values which is
likely to contain the population parameter. This interval
is called a confidence interval
Systematic error vs random error
◦ Systematic error is associated with faulty measurement
equipment or sampling process that gives
unrepresentative samples
◦ Random error occurs as a result of sampling variability
The importance of unbiased sampling
◦ Classic example of Literary Digest poll of 1936 US
elections
Sampling distribution of a statistic refers to
the distribution of some sample statistic (e.g.
mean, median etc.), over many samples
drawn from the same population
Standard error (SE) of a statistic is the
standard deviation of its sampling
distribution
Sampling distribution of the sample means is
approximately normally distributed (bell-
shaped), if the sample size is large enough
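Standard error of the sample means: $\mathrm{SE} = \sigma / \sqrt{n}$,
where $\sigma$ is the standard deviation of the population and n
is the sample size
[Figure: normal curve of sample means; x-axis marked at $\mu$, $\mu \pm \sigma/\sqrt{n}$, $\mu \pm 2\sigma/\sqrt{n}$ and $\mu \pm 3\sigma/\sqrt{n}$]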
[Figure: Income data [1]]
Regardless of the distribution of the population, the
sampling distribution of the sample means is
approximately normal, provided the sample size is
large enough (n ≥ 30) and a large number of samples
is drawn
The sampling distribution of the sample means
is centered at the true population mean, provided the
sample size is large enough (n ≥ 30) and a large
number of samples is drawn
Standard error of the sample means depends on
the standard deviation of the population
Standard error decreases with increase in sample
size
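These statements can be illustrated with a small simulation. The sketch below assumes a right-skewed population (exponential with rate 1, so its mean and standard deviation are both 1), draws many samples of size 30, and compares the spread of the sample means with σ / √n. The population, the seed and the number of samples are illustrative choices, not taken from the slides.

```python
import math
import random
from statistics import fmean, stdev

random.seed(0)   # fixed seed so the run is repeatable

SAMPLE_SIZE = 30
NUM_SAMPLES = 5000
POP_MEAN = 1.0   # mean of the assumed exponential(1) population
POP_SD = 1.0     # standard deviation of the same population

# One sample mean per simulated sample
sample_means = [
    fmean([random.expovariate(1.0) for _ in range(SAMPLE_SIZE)])
    for _ in range(NUM_SAMPLES)
]

center_of_means = fmean(sample_means)             # should sit near POP_MEAN
observed_se = stdev(sample_means)                 # spread of the sample means
theoretical_se = POP_SD / math.sqrt(SAMPLE_SIZE)  # sigma / sqrt(n)

print(center_of_means, observed_se, theoretical_se)
```

Rerunning with a larger `SAMPLE_SIZE` should shrink both the observed and the theoretical standard error.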
1. Practical Statistics for Data Scientists by
Peter Bruce and Andrew Bruce
2. https://www.statisticshowto.com
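99% of sample means: $\mu \pm 2.576 \times \mathrm{SE}$
95% of sample means: $\mu \pm 1.96 \times \mathrm{SE}$
90% of sample means: $\mu \pm 1.645 \times \mathrm{SE}$
[Figure: normal curve of sample means with the central 95% marked; x-axis marked at $\mu$, $\mu \pm s/\sqrt{n}$, $\mu \pm 2s/\sqrt{n}$ and $\mu \pm 3s/\sqrt{n}$]
95% of sample means: $\mu \pm 1.96 \times s/\sqrt{n}$
($\bar{x} \pm 1.96 \times s/\sqrt{n}$)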
95% confidence interval for the population mean, built from the sample mean ($\bar{x}$),
when n ≥ 30
◦ $\bar{x} \pm 1.96 \times \mathrm{SE}$
◦ Here $\bar{x}$ is the “point estimate”, 1.96 × SE is the
“margin of error” and 1.96 is the “critical value”
General form of confidence interval
◦ point estimate ± margin of error
◦ where margin of error = critical value × SE
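The general form can be sketched in a few lines with the standard `statistics` module. The sample below is made-up illustrative data with n = 36, and the helper names `critical_z` and `confidence_interval` are hypothetical. The critical value comes from the standard normal distribution, which is only appropriate under the n ≥ 30 condition stated above.

```python
import math
from statistics import NormalDist, mean, stdev


def critical_z(confidence):
    """Critical value z* that leaves (1 - confidence) / 2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def confidence_interval(sample, confidence=0.95):
    """Return (point estimate, margin of error) for the mean."""
    point_estimate = mean(sample)
    standard_error = stdev(sample) / math.sqrt(len(sample))
    margin_of_error = critical_z(confidence) * standard_error
    return point_estimate, margin_of_error


# Hypothetical sample of 36 measurements, for illustration only
sample = [14 + (i * 7 % 11) * 0.5 for i in range(36)]

estimate, margin = confidence_interval(sample)
print(f"{estimate:.2f} ± {margin:.2f}")
```

Calling `critical_z` with 0.90, 0.95 and 0.99 gives values close to the familiar 1.645, 1.96 and 2.576.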
[Figure: standard normal curve with the central 95% marked and 2.5% in each tail; the critical value z* sits at the 97.5th percentile]
Critical t-value depends on two things:
1. Confidence level
2. Degrees of freedom (df)
1. Random sampling is used to draw sample
2. Sampling distribution of sample statistic is
normally distributed, or population is
normally distributed in case of n < 30
3. Individual data values / observations in the
sample are independent of each other
Practical Statistics for Data Scientists by Peter
Bruce and Andrew Bruce
https://www.statisticshowto.com
Khan Academy (khanacademy.org) video
lectures on confidence intervals
* Check, J., & Schutt, R. K. (2012). Survey research. In J. Check & R. K. Schutt (Eds.), Research methods in
education (pp. 159–185). Thousand Oaks, CA: Sage Publications.
Survey Design
Population and Sample
Instrumentation
Variables in study
Data analysis and interpretation
Survey Design:
Independent variables
Dependent variables
◦ e.g. “Students with a higher IQ achieve higher grades”: IQ level is the independent variable and grades are the dependent variable.
Materials:
* Creswell, J. W., & Plano Clark, V. L. (2011). Designing and conducting mixed methods research (2nd ed.). Thousand Oaks, CA: Sage.
In this approach, a researcher collects both
quantitative and qualitative data, analyzes
them separately, and then compares the
results to see if the findings confirm or
disconfirm each other.
One issue with this approach is the sample size
for the qualitative and quantitative data
collection processes.
One option is to use the same number of individuals in both the
qualitative and quantitative databases.
◦ The qualitative sample will then be larger, which limits
the amount of data collected from any one
individual
Another question is whether the individuals in the
qualitative sample should also be
individuals in the quantitative sample.
In this design we collect and analyze
quantitative data which then is followed by
the collection and analysis of qualitative data.
Two phases
The quantitative results typically inform the
types of participants to be purposefully
selected for the qualitative phase and the
types of questions that will be asked of the
participants.
The overall intent of this design is to have the
qualitative data help explain in more detail
the initial quantitative results.
In case of outliers, the participants selected
may affect the final conclusions.
Easy to carry out but time-consuming.
In this design we collect and analyze
qualitative data which then is followed by the
collection and analysis of quantitative data.
Two phases
How to develop instrument out of the
findings from qualitative data
Sample selection may be a problem
Again, a time-consuming activity
Accurately predicting coin tosses:
◦ Probability of accurately predicting two consecutive
coin tosses = $(1/2)^2$ = 0.25
◦ Probability of accurately predicting seven
consecutive coin tosses = $(1/2)^7$ ≈ 0.008
Null hypothesis (denoted by H0): Statement of
zero or no change
Alternative hypothesis (denoted by Ha or H1):
statement that there is a change (what we
hope to prove)
We test only the null hypothesis i.e. we either
“reject the null hypothesis” (which suggests
the alternative hypothesis) or “fail to reject
the null hypothesis”
A website’s interface is changed with the intent to increase the
mean amount of time people spend on it. Before the change,
the average visit duration was 15 minutes
◦ H0: μ = 15 min.
◦ Ha: μ > 15 min.
According to the provincial government, the current literacy rate of the
province is 65%. Ali suspects that it is less than 65%. He
randomly sampled 120 people and found that 59.2% of them
were literate
◦ H0: p = 0.65
◦ Ha: p < 0.65
A restaurant owner installed a new automated drink machine.
The machine is designed to dispense 530mL of liquid on the
medium size setting. The owner suspects that the machine may
not be dispensing the said quantity of liquid in medium drinks [3]
◦ H0: μ = 530mL
◦ Ha: μ ≠ 530mL
Test statistic: A test statistic is computed on the sample to
measure the difference between the observed data and what
would be expected under the null hypothesis
◦ e.g. test can be z-test or t-test, and test statistic can be z-statistic or t-
statistic
p-value: It is the probability of finding results at least as extreme
as the observed ones when the null hypothesis is true
◦ Its value is between 0 and 1
◦ A high value means the data are consistent with the null hypothesis (it does not prove
the null hypothesis), and a low value indicates strong evidence against the null hypothesis
Significance level (α): It is a pre-defined value for a hypothesis
test. If the calculated p-value is ≤ α, we reject the null
hypothesis (i.e. the results are statistically significant), otherwise
we fail to reject the null hypothesis
◦ Typical values for significance level are 0.05 and 0.01
Statistical significance: A result is statistically significant when it
is very unlikely to have occurred given the null hypothesis is true
Type of test: It depends on the alternative
hypothesis. It can be either one-tailed (right or
left-tailed) or two-tailed
Critical region: The critical region (also called
rejection region) is the region of values that
corresponds to the rejection of the null
hypothesis
◦ It depends on the value of significance level and the type
of test
Acceptance region: The region of values where
we fail to reject the null hypothesis
Critical value(s): The value(s) which separate the
critical region from the acceptance region
H0: μ = 0, Ha: μ > 0, α = 0.05
Suppose n ≥ 30 so the test is z-test
[Figure: standard normal curve for a right-tailed test; the acceptance region (95%) lies to the left of the critical value and the critical region (α = 5%) lies in the right tail; the value of the test statistic is marked on the axis]
H0: μ = 0, Ha: μ ≠ 0, α = 0.05
Suppose n ≥ 30 so the test is z-test
[Figure: standard normal curve for a two-tailed test; acceptance region (95%) in the center]
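The two cases above differ only in where the critical value(s) fall. A minimal sketch, assuming a z-test and using the standard normal distribution from Python's `statistics` module, is given below. The function name `in_critical_region` and the example z values are illustrative.

```python
from statistics import NormalDist


def in_critical_region(z, alpha=0.05, tail="right"):
    """Return True when the z-statistic falls in the rejection region."""
    std_normal = NormalDist()
    if tail == "right":
        return z > std_normal.inv_cdf(1 - alpha)           # critical value ≈ 1.645
    if tail == "left":
        return z < std_normal.inv_cdf(alpha)               # critical value ≈ -1.645
    if tail == "two":
        return abs(z) > std_normal.inv_cdf(1 - alpha / 2)  # critical values ≈ ±1.96
    raise ValueError("tail must be 'right', 'left' or 'two'")


print(in_critical_region(1.8, tail="right"))   # compared with 1.645
print(in_critical_region(1.8, tail="two"))     # compared with 1.96
```

The same test statistic can lead to different decisions depending on the type of test, because a two-tailed test splits α between both tails.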
1. Practical Statistics for Data Scientists by
Peter Bruce and Andrew Bruce
2. https://www.statisticshowto.com
3. Khan Academy (khanacademy.org) video
tutorials on hypothesis testing
H0: μ = μ1, Ha: μ > μ1
[Figure: two sampling distributions centered at µ1 and µ2, marking α (probability of Type I error), β (probability of Type II error) and 1 − β (power)]
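For a right-tailed z-test, the relation between α, β and power can be sketched numerically. The function below assumes a known population standard deviation and a sample large enough for the normal approximation. The function name and all numbers in the usage lines are hypothetical.

```python
import math
from statistics import NormalDist


def power_right_tailed(mu1, mu2, sigma, n, alpha=0.05):
    """Power of a right-tailed z-test of H0: mu = mu1 when the true mean is mu2."""
    se = sigma / math.sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    cutoff = mu1 + z_crit * se                # reject H0 when the sample mean exceeds this
    beta = NormalDist(mu2, se).cdf(cutoff)    # probability of a Type II error
    return 1 - beta


# Hypothetical values: H0 mean 15, sigma 5, n = 36
power_small_shift = power_right_tailed(15, 16, sigma=5, n=36)
power_large_shift = power_right_tailed(15, 17, sigma=5, n=36)
print(power_small_shift, power_large_shift)
```

When µ2 equals µ1 the function returns α, since rejecting H0 is then exactly a Type I error. Power grows as µ2 moves further from µ1.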
Testing about population proportion (p)
◦ We use z-test and the calculated statistic is z-statistic
◦ Test statistic is calculated as
$z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 (1 - p_0)}{n}}}$
where $\hat{p}$ is the sample proportion
$p_0$ is the population proportion from the H0
n is the sample size
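As a worked sketch, the formula can be applied to the literacy example stated earlier (H0: p = 0.65, Ha: p < 0.65, n = 120, sample proportion 0.592). The function name is illustrative, and the left-tailed p-value is taken from the standard normal distribution.

```python
import math
from statistics import NormalDist


def proportion_z_statistic(p_hat, p0, n):
    """z-statistic for a test about a population proportion."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)


z = proportion_z_statistic(p_hat=0.592, p0=0.65, n=120)
p_value = NormalDist().cdf(z)   # left-tailed test, because Ha is p < 0.65

print(round(z, 2), round(p_value, 3))
```

The resulting p-value is above 0.05, so at α = 0.05 this sample would not lead to rejecting the null hypothesis.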
p-value for right-tailed z-test
◦ e.g. H0: μ = μ1, Ha: μ > μ1
Assume z-statistic = 1.15
p-value = 1 – 0.8749 = 0.1251
p-value for two-tailed z-test
◦ e.g. H0: μ = μ1, Ha: μ ≠ μ1
Assume z-statistic = -2.05
p-value = 2 × 0.0202= 0.0404
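The table lookups in the z-test examples can be reproduced with the standard normal CDF in Python's `statistics` module. This is a small sketch; the function name `z_p_value` is illustrative.

```python
from statistics import NormalDist


def z_p_value(z, tail):
    """p-value of a z-statistic for a right-, left- or two-tailed test."""
    cdf = NormalDist().cdf
    if tail == "right":
        return 1 - cdf(z)              # area to the right of z
    if tail == "left":
        return cdf(z)                  # area to the left of z
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))   # both tails beyond |z|
    raise ValueError("tail must be 'right', 'left' or 'two'")


right_tailed = z_p_value(1.15, "right")
two_tailed = z_p_value(-2.05, "two")
print(round(right_tailed, 4), round(two_tailed, 4))
```

At α = 0.05 the right-tailed example fails to reject the null hypothesis, while the two-tailed example rejects it.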
p-value for left-tailed t-test
◦ e.g. H0: μ = μ1, Ha: μ < μ1, n = 10 → df = 9
Assume t-statistic = -1.39
p-value ≈ 0.10
t-distribution calculator
◦ https://homepage.divms.uiowa.edu/~mbognar/applets/t.html
For mean
i. if n ≥ 30
We use z-distribution to build a confidence interval or do a hypothesis test
ii. if n < 30 and the population is normally distributed
We use t-distribution to build a confidence interval or do a
hypothesis test
iii. if n < 30 and the distribution of the population is unknown
We plot the distribution of the sample data to check that it is
roughly symmetric and has no outliers. If so, we use t-distribution
to build a confidence interval or do a hypothesis test
For proportion
◦ Let $\hat{p}$ be the sample proportion and n be the sample size:
the sampling distribution of sample proportion is normally
distributed if $n\hat{p} \ge 10$ AND $n(1 - \hat{p}) \ge 10$. And we use
z-distribution to build a confidence interval or do a
hypothesis testing
Example 1: $\hat{p}$ = 0.28, n = 50
$n\hat{p}$ = 14, $n(1 - \hat{p})$ = 36
We can approximate the sampling distribution of sample proportion
with the normal distribution
Example 2: $\hat{p}$ = 0.16, n = 50
$n\hat{p}$ = 8, $n(1 - \hat{p})$ = 42
We cannot approximate the sampling distribution of sample
proportion with the normal distribution
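The condition in the two examples is easy to check in code. A minimal sketch follows; the function name and the `threshold` parameter are illustrative.

```python
def normal_approximation_ok(p_hat, n, threshold=10):
    """Check that both n * p_hat and n * (1 - p_hat) reach the threshold."""
    return n * p_hat >= threshold and n * (1 - p_hat) >= threshold


example_1 = normal_approximation_ok(0.28, 50)   # 14 and 36
example_2 = normal_approximation_ok(0.16, 50)   # 8 and 42
print(example_1, example_2)
```

Example 2 fails because the expected number of successes (8) is below 10, even though the number of failures (42) is large.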
If we are sampling with replacement then
individual observations / data values are
independent
If we are sampling without replacement then
individual observations / data values aren't
technically independent since removing each
observation changes the population
◦ However, we can treat the individual observations as
independent if the sample size is no more than 10% of
the population. This is known as 10% condition.
e.g. if sample size = 50 then population size must be ≥ 500