
FINAL SB

CHAPTER 1: OVERVIEW OF STATISTICS


1.1. What is a statistic?
- Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data.
- Data science is a trilogy of tasks involving data modeling, analysis, and decision making.
- Statistic: A single measure, reported as a number, used to summarize a sample data set. For example, the average
height of students in a university.

Uses of Statistics in Business


 Summarize business data
 Draw conclusions from business data
 Make reliable forecasts about business activity
 Improve business processes

 Different measures can be used to summarize different sets of data.

 Different measures can be used to answer different types of questions about the same data set.

- 2 types of statistics: descriptive statistics and inferential statistics.

 Descriptive statistics: refers to the collection, organization, presentation, and summary of data (either using charts and
graphs or using a numerical summary).
Include: Collecting and Describing Data
- Sampling and Surveys (Chapter 2)
- Visual Displays (Chapter 3)
- Numerical Summaries (Chapter 4)
- Probability Models (Chapter 5-8)

 Inferential statistics: refers to generalizing from a sample to a population, estimating unknown population parameters,
drawing conclusions, and making decisions.

Include: Making Inferences from Samples


- Estimating Parameters (Chapter 8)
- Testing Hypotheses (Chapter 9-16)
- Regression and Trends (Chapter 12-14)
- Quality Control (Chapter 17)

1.2. Why study statistics?


Statistics is used for: communication, computer skills, information management, technical literacy, and process improvement.

1.3. Statistics in business


- Auditing: Help auditors check invoices.
- Marketing: Identify customer preferences  Guide the marketing strategy.
- Health Care: Save money by finding better ways to manage patient appointments, schedule procedures, and rotate staff.
- Finance: Statistical information is used to guide investments.
- Quality Improvement, Purchasing, Medicine, Operations Management, Product Warranty
- Production: Statistical quality control charts are used to monitor the output of a production process.

1.4. Critical thinking:

- Statistics is an essential part of critical thinking because it allows us to test an idea against empirical evidence.
- Empirical data are data collected through observation and experiment.
- We use statistical tools to compare our prior ideas with empirical data, but pitfalls do occur:
 Conclusions from Small Samples
Ex: “My aunt smoked all her life and lived to 90. Smoking doesn’t hurt you.”
 Conclusions from Nonrandom Samples
Ex: “Rock stars die young. Look at Buddy Holly, …”
 Conclusions from Rare Events
Ex: “Mary in my office won the lottery. Her system must have worked.” – Millions of people play the lottery; someone will eventually win.
“Tom’s SUV rolled over. SUVs are dangerous.” – Millions of people drive SUVs; some will roll over, some won’t.
 Poor Survey Methods
 Assuming a Causal Link
 Generalization to Individuals
Ex: “Men are taller than women” is true on average, but it does not mean every man is taller than every woman.
 Unconscious Bias
Ex: Assuming that heart attacks were more likely to occur in men than in women.
 Significance versus Importance
Ex: People born in the fall may be slightly taller than others; a difference can be statistically significant yet too small to be important.

CHAPTER 2: DATA COLLECTION

2.1. Variables and Data:

In scientific research, data arise from experiments whose results are recorded systematically. In business, data usually
arise from accounting transactions or management processes.

Data Terminology:

- An observation is a single member of a collection of items that we want to study, such as a person, firm, or region.
- A variable is a characteristic of the subject or individual, such as an employee’s income or an invoice amount.
- The data set consists of all the values of all of the variables for all of the observations we have chosen to observe.
- A data set with n observations and m variables forms an n × m matrix: each row displays an observation and each column displays a variable.

Categorical and Numerical Data:


A data set may contain a mixture of data types: categorical and numerical data.

Categorical Data: (also qualitative data) have values that are described by words rather than numbers.
Numerical Data: (also quantitative data) arise from counting, measuring something, or some kind of mathematical
operation.
 A variable with a countable number of distinct values is discrete. A numerical variable that can have any value within
an interval is continuous.
Time Series Data and Cross-Sectional Data:

- If each observation in the sample represents a different equally spaced point in time (years, months, days), we have time
series data.
- The periodicity is the time between observations. It may be annual, quarterly, monthly, weekly, daily, hourly, etc.
Ex: a firm’s sales, market share, debt/equity ratio, employee absenteeism, inventory turnover, and product quality ratings.
 For time series, we are interested in trends and patterns over time.

- If each observation represents a different individual unit (e.g., a person, firm, geographic area) at the same point in time,
we have cross-sectional data.
 For cross-sectional data, we are interested in variation among observations (e.g., accounts receivable in 20 Subway
franchises) or in relationships (e.g., whether accounts receivable are related to sales volume in 20 Subway franchises as
shown in Figure 2.2).

2.2. Level of Measurement:

Collecting data:
Primary data collection: observation, survey, and experiment.
Secondary data collection: based on electronic and print documents.

Statisticians sometimes refer to four levels of measurement for data: nominal, ordinal, interval, and ratio.
 Nominal: the weakest level of measurement and the easiest to recognize. Data are labels or names used to identify an attribute of the element. We usually code nominal data numerically.
 Ordinal: codes connote a ranking of data values. Ordinal data have the same properties as nominal data, but the order of the ranking is meaningful. Ordinal data can be treated as nominal, but not vice versa.
 Interval: not only a rank but also meaningful intervals between scale points.
 Ratio: the strongest level of measurement. Ratio data have all the properties of the other three data types, but in addition possess a meaningful zero that represents the absence of the quantity being measured.
2.3 Sampling Concepts:

- A sample involves looking only at some items selected from the population, while a census is an examination of all
items in a defined population.
- A parameter is a measurement or characteristic of the population, typically represented by a Greek letter, while a statistic is a numerical value calculated from a sample, typically represented by a Roman letter.

- The sampling frame is the group from which we take the sample. If the frame differs from the target population, then
our estimates might not be accurate.

2.4 Sampling Methods:

- There are two main categories of sampling methods.

In random sampling, items are chosen by randomization or a chance procedure.
Non-random sampling is less scientific but is sometimes used for expediency.

Random sampling methods: simple random, systematic, stratified, and cluster sampling.
Non-random sampling methods: convenience, judgment, and quota sampling.

CHAPTER 3: DESCRIBING DATA VISUALLY

3.1. Stem-and-Leaf Displays and Dot Plots:

- To visualize small data sets.


- The type of graph you use to display your data depends on the type of data you have.

Stem-and-Leaf Display:
- Separate each sorted data value into leading digits (the stem) and trailing digits (the leaf). For two-digit data, the stem is the tens digit and the leaf is the ones digit.

Dot plots:
- Shows variability by displaying the range of the data.
- Shows center by revealing where the data values tend to cluster and where the midpoint lies.
- Reveals some things about the shape of the distribution if the sample is large enough.

3.2 FREQUENCY DISTRIBUTIONS AND HISTOGRAMS:


A frequency distribution: a tabulation of n data values into k classes called bins. The table shows the frequency of data values within each bin.
 Too many bins: some bins are likely to be sparsely populated, or even empty.
 Too few bins: dissimilar data values are lumped together.
Sturges’ Rule suggests the number of bins: k = 1 + 3.3 log10(n)
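As a quick illustration (a minimal Python sketch with hypothetical, randomly generated data; numpy assumed available), Sturges’ Rule can be computed and fed straight into a frequency tabulation:

import numpy as np

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=50, scale=10, size=100)   # hypothetical invoice amounts

n = len(data)
k = int(np.ceil(1 + 3.3 * np.log10(n)))   # Sturges' Rule: 100 values -> 8 bins

counts, edges = np.histogram(data, bins=k)   # frequency distribution
print(k, counts)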

Histogram: is a graphical representation of a frequency distribution.

- A modal class is a histogram bar that is higher than those on either side.
- A histogram with a single modal class is unimodal, one with two modal classes is bimodal, and one with more than two
modes is multimodal.

 A histogram’s skewness is indicated by the direction of its longer tail.


 If neither tail is longer, the histogram is symmetric.
 A right-skewed (or positively skewed) histogram has a longer right tail, with most data values clustered on the left side.
 A left-skewed (or negatively skewed) histogram has a longer left tail, with most data values clustered on the right side.

Graphing Numerical Data:

Charts:
• line chart
• column chart (vertical display) and bar chart (horizontal display)
• pie chart: for qualitative data (categories or nominal scale)
• scatter plot
• Tables

Deceptive graphs

Error 1: Nonzero Origin. A nonzero origin will exaggerate the trend; measured distances do not match the stated values or axis demarcations.
Error 2: Elastic Graph Proportions. By shortening the X-axis in relation to the Y-axis, vertical change is exaggerated.
Error 3: Dramatic Titles and Distracting Pictures. These serve more to grab the reader’s attention than to convey the chart’s content.
Error 4: 3-D and Novelty Graphs.
Error 5: Rotated Graphs. By making a graph 3-dimensional and rotating it through space, the author can make trends appear to dwindle into the distance or loom alarmingly toward you.
Error 6: Unclear Definitions or Scales (missing or unclear units of measurement).
Error 7: Vague Sources. The author lost the citation or mixed data from several sources.
Error 8: Complex Graphs. Complicated visual displays make the reader work harder.
Error 9: Gratuitous Effects. Slide shows often use many colors and special effects.
Error 11: Area Trick. One visual trick is enlarging the width of the bars as their height increases, so the bar area misstates the true proportion.

CHAPTER 4: DESCRIPTIVE STATISTICS


4.2. Measures of Center: include the arithmetic mean, median, mode, and geometric mean.

Mean: the most familiar statistical measure of center (=AVERAGE(data)); it is affected by outliers.
 x̄ = (x1 + x2 + ⋯ + xn) / n

Median: the middle value in the sorted array (50% above, 50% below). (=MEDIAN(data)).

 Median position: (n + 1) / 2

Mode: Most frequently occurring data value. (=MODE.SNGL(data)).

Geometric mean: used to measure the rate of change of a variable over time. (=GEOMEAN(data)).
 G = ⁿ√( x1 · x2 · … · xn )
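A short Python sketch of these measures of center (hypothetical growth factors; statistics.geometric_mean requires Python 3.8+):

import numpy as np
from statistics import mode, geometric_mean

growth = [1.05, 1.08, 1.02, 1.08]   # hypothetical annual growth factors

print(np.mean(growth))           # arithmetic mean  (=AVERAGE)
print(np.median(growth))         # median           (=MEDIAN)
print(mode(growth))              # mode             (=MODE.SNGL)
print(geometric_mean(growth))    # geometric mean   (=GEOMEAN)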

Mean < Median < Mode: skewed left
Mean = Median = Mode: symmetric
Mode < Median < Mean: skewed right

Midrange: the point halfway between the lowest and highest values of X
 Midrange = (x_max + x_min) / 2

Quartiles (denoted Q1, Q2, Q3): divide the data into 4 groups of approximately equal size, that is, the 25th, 50th, and 75th percentiles.

 Q1 position = (n + 1) / 4
 Q2 position = (n + 1) / 2 (the median)
 Q3 position = 3(n + 1) / 4
Box & whisker plot:

- The box plot is based on the five-number summary: x_min, Q1, Q2, Q3, and x_max.

Shows:
 center (position of the median Q2)
 variability (width of the “box” defined by Q1 and Q3, and the range between x_min and x_max)
 shape (skewness if the whiskers are of unequal length and/or if the median is not in the center of the box)

- Midhinge = (Q1 + Q3) / 2
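A minimal numpy sketch of the five-number summary behind a box plot (illustrative data; note that np.percentile interpolates, so its quartiles can differ slightly from the (n + 1)/4 textbook positions):

import numpy as np

x = np.array([20, 25, 26, 30, 32, 35, 38, 41, 45, 90])   # hypothetical balances

q1, q2, q3 = np.percentile(x, [25, 50, 75])
print(x.min(), q1, q2, q3, x.max())   # five-number summary

print(q3 - q1)          # interquartile range IQR
print((q1 + q3) / 2)    # midhinge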

4.3. Measures of Variability:

 Range = x_max − x_min

 Interquartile Range: IQR = Q3 − Q1

 Sample variance (=VAR.S(Data)):
 s² = Σ (xi − x̄)² / (n − 1)
 where x̄ = sample mean, n = sample size, xi = i-th value of the variable X

 Sample standard deviation (=STDEV.S(Data)):
 s = √( Σ (xi − x̄)² / (n − 1) )

 Coefficient of Variation: measures relative variation
 CV = (s / x̄) · 100%
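A companion sketch for the variability measures (hypothetical data; ddof=1 gives the sample formulas, matching VAR.S and STDEV.S):

import numpy as np

x = np.array([4.0, 8.0, 6.0, 5.0, 3.0, 7.0])   # hypothetical data

print(x.max() - x.min())                       # range
print(np.var(x, ddof=1))                       # sample variance  (=VAR.S)
print(np.std(x, ddof=1))                       # sample std dev   (=STDEV.S)
print(np.std(x, ddof=1) / np.mean(x) * 100)    # CV in percent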

4.4. Measures for a Population:

 Mean: μ = (X1 + X2 + ⋯ + XN) / N

 Population variance: σ² = Σ (Xi − μ)² / N

 Population standard deviation: σ = √( Σ (Xi − μ)² / N )

4.5. Standardized data:

 Chebyshev’s Theorem: the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 − 1/k²]% for any population with mean μ and standard deviation σ.
 Empirical Rule: for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of the data (k = 1: about 68%; k = 2: about 95%; k = 3: about 99.7%).
Outliers: values outside μ ± 3σ (beyond three standard deviations from the mean) are rare (less than 1%) and unusual in a normal distribution.

 Z-score: a measure of distance from the mean. A z-score above 3.0 or below −3.0 is considered an outlier.

 Formula for a population: z = (X − μ) / σ

 Formula for a sample: z = (X − x̄) / s
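A brief sketch (hypothetical data with one planted extreme value) that standardizes a sample and flags |z| > 3 as outliers:

import numpy as np

rng = np.random.default_rng(seed=7)
x = np.append(rng.normal(loc=50, scale=5, size=99), 90.0)  # one extreme value

z = (x - x.mean()) / x.std(ddof=1)   # sample z-scores
print(x[np.abs(z) > 3.0])            # values beyond 3 standard deviations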

4.6. Correlation and Covariance:


For grouped data (mi: class midpoint, fi: class frequency, k classes):

 Population mean: μ = (Σ fi·mi) / N, where N = Σ fi
 Sample mean: x̄ = (Σ fi·mi) / n, where n = Σ fi
 Population variance: σ² = Σ fi(mi − μ)² / N (=VAR.P)
 Sample variance: s² = Σ fi(mi − x̄)² / (n − 1) (=VAR.S)

Covariance (the strength of the linear relationship between two variables):
 Population: σXY = cov(X, Y) = Σ (Xi − μX)(Yi − μY) / N (=COVARIANCE.P)
 Sample: sXY = Σ (xi − x̄)(yi − ȳ) / (n − 1) (=COVARIANCE.S)

 cov(X, Y) > 0 => X and Y tend to move in the same direction
 cov(X, Y) < 0 => X and Y tend to move in opposite directions
 cov(X, Y) = 0 => X and Y have no linear relationship (they are uncorrelated; independence is a stronger condition)

Coefficient of correlation (the relative strength of the linear relationship between two variables):
 r = Σ (xi − x̄)(yi − ȳ) / [ √( Σ (xi − x̄)² ) · √( Σ (yi − ȳ)² ) ]
 =CORREL
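A sketch matching the Excel functions with numpy (hypothetical paired data):

import numpy as np

# Hypothetical paired data: ad spend (x) vs. sales (y), illustrative only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

s_xy = np.cov(x, y, ddof=1)[0, 1]   # sample covariance (=COVARIANCE.S)
r = np.corrcoef(x, y)[0, 1]         # correlation coefficient (=CORREL)
print(s_xy, r)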

4.7. Grouped data:

The weighted mean of a set of data is
 x̄ = (Σ wi·xi) / (Σ wi)
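As a quick aside (a minimal sketch with hypothetical course-grade weights; numpy’s np.average accepts a weights argument that computes exactly this):

import numpy as np

scores = [80, 90, 70]        # exam, project, homework (hypothetical)
weights = [0.5, 0.3, 0.2]    # relative importance

print(np.average(scores, weights=weights))   # weighted mean = 81.0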

4.8. Skewness and Kurtosis:

CHAPTER 5: PROBABILITY
Three views of probability:
 Empirical: estimated from outcome frequency
 Classical: known a priori from the nature of the experiment
 Subjective: based on informed opinion or judgment
RULES OF PROBABILITY
o Complement of an event: P(A) + P(A′) = 1, or P(A′) = 1 − P(A)
o Union of Two Events: P(A ∪ B)
o Intersection of Two Events: P(A ∩ B)
o General Law of Addition: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
o Mutually Exclusive Events: P(A ∩ B) = 0
o Collectively Exhaustive Events: together the events cover the entire sample space, so the probability of their union is 1
o Special Law of Addition (for mutually exclusive events): P(A ∪ B) = P(A) + P(B)
o Conditional probability: P(A|B) = P(A ∩ B) / P(B)
o Event A is independent of event B if P(A|B) = P(A)
 => P(A ∩ B) = P(A)·P(B)
o Odds in favor of event A: P(A) / (1 − P(A))
o Odds against event A: (1 − P(A)) / P(A)

 Marginal probability: dividing a row or column total by the total sample size
 Joint probability: dividing a cell frequency by the total sample size
 Conditional probability: restricting ourselves to a single row or column (dividing a cell frequency by that row or column total)
General Form of Bayes’ Theorem:
 P(Bi | A) = P(A | Bi)·P(Bi) / [ P(A | B1)·P(B1) + P(A | B2)·P(B2) + ⋯ + P(A | Bn)·P(Bn) ]
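A small worked sketch of Bayes’ Theorem with hypothetical numbers (a defect-screening test, two-event partition):

# Hypothetical: 2% of items are defective (B1), 98% are not (B2).
# The test flags 95% of defectives and 10% of good items (false positives).
p_b1, p_b2 = 0.02, 0.98
p_flag_b1, p_flag_b2 = 0.95, 0.10

p_flag = p_flag_b1 * p_b1 + p_flag_b2 * p_b2   # total probability of a flag
p_b1_given_flag = p_flag_b1 * p_b1 / p_flag    # Bayes: P(defective | flagged)
print(round(p_b1_given_flag, 4))               # about 0.1624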

CHAP 6: DISCRETE RANDOM VARIABLES

Random variable:
o discrete random variable
o continuous random variable

Discrete probability distribution:
 each probability is between 0 and 1
 the probabilities sum to 1

PDF: shows the probability of each X-value; the probabilities sum to 1.
CDF: shows the cumulative sum of probabilities, adding from the smallest to the largest X-value; it approaches 1.

Expected value (mean): μ = E(X) = Σx x·P(x)
Variance: σ² = E[(X − μ)²] = Σx (x − μ)²·P(x)
Std dev: σ = √σ²
 Uniform distribution (discrete): 2 parameters, a and b
 PDF: P(X = x) = 1 / (b − a + 1)
 CDF: P(X ≤ x) = (x − a + 1) / (b − a + 1)
 Domain: a ≤ x ≤ b (integers)
 Mean: μ = (a + b) / 2
 Std dev: σ = √( [ (b − a + 1)² − 1 ] / 12 )

 BINOMIAL DISTRIBUTION: has 2 parameters (n, π)

Bernoulli experiment:
 1 parameter: π, the probability of success
 two outcomes: 0 or 1
 expected value (mean): E(X) = π
 variance: π(1 − π)

Each Bernoulli trial is independent, so the probability of success π remains constant on each trial.
The binomial distribution arises when a Bernoulli experiment is repeated n times.
 P(X) = [ n! / (X!(n − X)!) ] · π^X · (1 − π)^(n−X)
Where:
 P(X): probability of X successes in n trials
 X: number of “successes” in the sample
 n: sample size
 π: probability of “success”
Mean: μ = E(X) = nπ
Variance: σ² = nπ(1 − π)
Std dev: σ = √( nπ(1 − π) )
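A quick check of the binomial formula against scipy.stats (hypothetical n = 10 trials, π = 0.3):

from math import comb
from scipy.stats import binom

n, p, x = 10, 0.3, 4   # hypothetical: P(4 successes in 10 trials)

manual = comb(n, x) * p**x * (1 - p)**(n - x)   # textbook formula
print(manual, binom.pmf(x, n, p))               # both ~0.2001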

 POISSON DISTRIBUTION: describes the number of occurrences within a randomly chosen unit of time or space.

 P(X = x | λ) = ( e^(−λ) · λ^x ) / x!
where:
 x = number of events in an area of opportunity
 λ = average number of events per unit
1 parameter: λ
Mean: μ = λ
Variance: σ² = λ
Std dev: σ = √λ
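The same kind of check for the Poisson formula (hypothetical rate λ = 3 events per minute):

from math import exp, factorial
from scipy.stats import poisson

lam, x = 3.0, 5   # hypothetical: P(exactly 5 events in one unit)

manual = exp(-lam) * lam**x / factorial(x)   # textbook formula
print(manual, poisson.pmf(x, lam))           # both ~0.1008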

 HYPERGEOMETRIC DISTRIBUTION
Similar to the binomial (involves a sample of size n and a number of successes X), except that sampling is without replacement from a finite population of N items.
The trials are not independent, and the probability of success is not constant from trial to trial.
 P(X = x | n, N, s) = C(s, x) · C(N − s, n − x) / C(N, n)
Where:
 N = population size
 s = number of items of interest in the population
 N − s = number of items not of interest in the population
 n = sample size
 x = number of items of interest in the sample
 n − x = number of items not of interest in the sample
3 parameters: N, n, s
Mean: μ = ns / N
Std dev: σ = √( nπ(1 − π) ) · √( (N − n) / (N − 1) ), where π = s/N

 GEOMETRIC DISTRIBUTION: describes the number of Bernoulli trials until the first success; the number of trials is not fixed.
1 parameter: π, the probability of success
 PDF: P(X = x) = π(1 − π)^(x−1)
 CDF: P(X ≤ x) = 1 − (1 − π)^x
 Domain: x = 1, 2, 3, …
 Mean: μ = 1/π
 Std dev: σ = √( (1 − π) / π² )
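A sketch checking the last two distributions with scipy.stats (hypothetical lot-sampling and first-success scenarios; note scipy’s hypergeom takes its arguments as pmf(x, N, s, n) in this document’s notation):

from scipy.stats import hypergeom, geom

# Hypergeometric: lot of N = 20 items with s = 6 defectives, sample of n = 5
N, s, n = 20, 6, 5
print(hypergeom.pmf(2, N, s, n))   # P(exactly 2 defectives) ~0.3522

# Geometric: success probability pi = 0.2
pi = 0.2
print(geom.pmf(3, pi))    # P(first success on trial 3) = 0.2 * 0.8**2 = 0.128
print(geom.cdf(3, pi))    # P(X <= 3) = 1 - 0.8**3 = 0.488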

CHAP 7: CONTINUOUS PROBABILITY DISTRIBUTIONS

PDF
 must be nonnegative
 the area under the entire PDF must be 1
 the mean, variance, and shape of the distribution depend on the PDF parameters
CDF
 shows P(X ≤ x)
 approaches 1 as X increases
 useful for computing probabilities

 UNIFORM CONTINUOUS DISTRIBUTION

2 parameters: a and b
 PDF: f(x) = 1 / (b − a)
 CDF: P(X ≤ x) = (x − a) / (b − a)
  P(c ≤ X ≤ d) = (d − c) / (b − a)
 Domain: a ≤ x ≤ b
 Mean: μ = (a + b) / 2
 Std dev: σ = √( (b − a)² / 12 )
 Shape: symmetric; no mode

 NORMAL DISTRIBUTION
2 parameters: the mean μ and standard deviation σ
 PDF: f(x) = ( 1 / (σ√(2π)) ) · e^( −(1/2)·((x − μ)/σ)² )
  In Excel: =NORM.DIST(x, μ, σ, 0)
 CDF: F(x0) = P(X ≤ x0); P(c ≤ X ≤ d) = F(d) − F(c)
  In Excel: =NORM.DIST(x, μ, σ, 1)
 Domain: −∞ < x < +∞
 Mean: μ
 Std dev: σ
 Shape: bell-shaped, symmetrical, mean = median = mode
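A sketch mirroring the two Excel calls with scipy.stats.norm (hypothetical μ = 100, σ = 15):

from scipy.stats import norm

mu, sigma = 100, 15   # hypothetical parameters

print(norm.pdf(110, mu, sigma))   # =NORM.DIST(110, 100, 15, 0)
print(norm.cdf(110, mu, sigma))   # =NORM.DIST(110, 100, 15, 1)

# P(90 <= X <= 120) = F(120) - F(90)
print(norm.cdf(120, mu, sigma) - norm.cdf(90, mu, sigma))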

 STANDARD NORMAL DISTRIBUTION: any normal distribution can be transformed into the standardized normal distribution (Z), with mean 0 and variance 1.
No free parameters: μ = 0 and σ = 1 are fixed.
 PDF: f(Z) = ( 1 / √(2π) ) · e^( −Z²/2 ), where Z = (X − μ) / σ
 CDF: In Excel: =NORM.S.DIST(Z, 1)
 Domain: −∞ < Z < +∞
 Mean: 0
 Std dev: 1
 Shape: bell-shaped, symmetrical

 EXPONENTIAL DISTRIBUTION: describes the waiting time until the next Poisson arrival.
1 parameter: λ
 PDF: f(x) = λ·e^(−λx)
 CDF: probability of waiting less than x: P(X ≤ x) = 1 − e^(−λx)
  In Excel: =EXPON.DIST(x, λ, TRUE)
 Probability of waiting more than x: P(X > x) = e^(−λx)
 Domain: x ≥ 0
 Mean: μ = 1/λ
 Std dev: σ = 1/λ
 Shape: right-skewed
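A sketch for the exponential CDF (hypothetical λ = 2 arrivals per hour; scipy parameterizes by scale = 1/λ):

from math import exp
from scipy.stats import expon

lam, x = 2.0, 0.5   # hypothetical: waiting no more than half an hour

print(1 - exp(-lam * x))           # P(X <= 0.5) by the formula
print(expon.cdf(x, scale=1/lam))   # same value via scipy (~0.6321)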

 Triangular distribution
has 3 parameters (a and c enclose the range, and b is the mode)

CHAP 8: SAMPLING DISTRIBUTIONS AND ESTIMATION
Sampling error: the difference between an estimate and the corresponding population parameter
 = x̄ − μ
Bias: the difference between the expected value of the estimator and the corresponding parameter
 = E(x̄) − μ
=> Unbiased: E(x̄) = μ
Efficiency: a more efficient estimator has a smaller variance.
Consistency: the estimator converges toward the parameter being estimated as the sample size increases.

Sampling distribution of the sample mean:
 Mean: μ_x̄ = μ
 Standard error: σ_x̄ = σ / √n
 => the standard error of the mean decreases as the sample size increases

A confidence interval provides additional information about variability.

Margin of error = (Upper Confidence Limit − Lower Confidence Limit) / 2, i.e., half the width of the confidence interval.
An interval estimate provides more information about a population characteristic than does a point estimate
=> interval estimates are called confidence intervals.

 Confidence interval when we know σ: μ = x̄ ± z(α/2) · σ/√n
 Confidence interval when we don’t know σ: μ = x̄ ± t(α/2, n−1) · s/√n
 t(α/2, n−1) is the critical value of the t distribution with n − 1 d.f. and an area of α/2 in each tail.

The margin of error E = z · σ/√n can be reduced if:
 the population standard deviation can be reduced (σ↓)
 the sample size is increased (n↑)
 the confidence level is decreased, (1 − α)↓

Sample size:
 For the mean: n = z²σ² / E²
 For the proportion: n = z²π(1 − π) / E², where π is estimated by p from a pilot sample (or conservatively use 0.5 as an estimate of π)
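A sketch computing a 95% t interval and the sample-size formula (hypothetical summary numbers):

from math import sqrt, ceil
from scipy.stats import t, norm

n, xbar, s = 25, 50.0, 8.0   # hypothetical sample summary
alpha = 0.05

tcrit = t.ppf(1 - alpha/2, df=n - 1)   # critical t value
E = tcrit * s / sqrt(n)                # margin of error
print(xbar - E, xbar + E)              # 95% CI for the mean

# Sample size for a desired margin of error E = 2 (sigma assumed known = 8)
z = norm.ppf(1 - alpha/2)
print(ceil(z**2 * 8.0**2 / 2.0**2))    # n = z^2*sigma^2/E^2 -> 62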

Two-tailed test vs. one-tailed test (right-tailed: the mean is greater than…; left-tailed: the mean is less than…):

 Two-tailed: H0: μ = μ0 vs H1: μ ≠ μ0
 Right-tailed: H0: μ ≤ μ0 vs H1: μ > μ0
 Left-tailed: H0: μ ≥ μ0 vs H1: μ < μ0

If n < 30 (σ unknown): t = (x̄ − μ0) / (s/√n); compare t with the critical value for the chosen confidence level. If t falls in the rejection region => Reject H0.

If n ≥ 30: z = (x̄ − μ0) / (s/√n); use the z-table to find the cumulative area a for the calculated z.
 Two-tailed: p-value = 2(1 − a); one-tailed: p-value = 1 − a (right-tailed) or a (left-tailed).
 p-value < α = 1 − confidence level => Reject H0
 p-value ≥ α = 1 − confidence level => do not reject H0
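A sketch of a two-tailed one-sample t test with scipy (hypothetical fill-weight data; ttest_1samp returns a two-tailed p-value):

import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical: do these fill weights differ from the target mu0 = 500 g?
x = np.array([498.2, 501.3, 499.8, 497.5, 502.1, 500.4, 496.9, 499.1])

t_stat, p_value = ttest_1samp(x, popmean=500.0)
print(t_stat, p_value)

alpha = 0.05
print("Reject H0" if p_value < alpha else "Do not reject H0")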

CHAP 12: SIMPLE REGRESSION
The sample regression equation provides an estimate of the population regression line.

Coefficient of Determination: R² = SSR/SST = 1 − SSE/SST (0 ≤ R² ≤ 1)

Estimation of model error variance: s² = SSE / (n − 2)

Standard error of the estimate: s_e = √( SSE / (n − 2) )

The variance of the regression slope coefficient (b1): s_b1² = s_e² / Σ (xi − x̄)²

Null and alternative hypotheses:

 H0: β1 = 0
 H1: β1 ≠ 0

Test statistic (d.f. = n − 2):
 t = (b1 − β1) / s_b1
Where:
 b1: the estimated slope (from the Coefficient column of the regression output)
 s_b1: its standard error (from the Standard Error column)
 β1 = 0 (according to H0)
=> If |t_calc| > t(n−2, α/2) or p-value ≤ α
=> Reject H0

Confidence Interval Estimate of the Slope:
 b1 ± t(n−2, α/2) · s_b1

F-Test for Significance

 F test statistic = MSR / MSE
Where:
 MSR = SSR / k, MSE = SSE / (n − k − 1)
 (k = number of independent variables; k = 1 for simple regression)
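A sketch with scipy.stats.linregress (hypothetical data); it returns the slope b1, its standard error, R² via rvalue, and the two-tailed p-value for H0: β1 = 0:

import numpy as np
from scipy.stats import linregress

# Hypothetical: advertising spend (x) vs. sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.7])

res = linregress(x, y)
print(res.slope, res.intercept)   # b1, b0
print(res.rvalue**2)              # R^2
print(res.pvalue)                 # p-value for H0: beta1 = 0 (t test, df = n-2)
print(res.stderr)                 # standard error of the slope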

CHAP 9: ONE-SAMPLE HYPOTHESIS TESTS


9.2. TYPE I AND TYPE II ERROR:

 Rejecting the null hypothesis when it is true is a Type I error (also called a false positive). Failing to reject the null hypothesis when it is false is a Type II error (also called a false negative).
 A Type I error is considered more dangerous.

9.3. DECISION RULES AND CRITICAL VALUES:


A statistical hypothesis is a statement about the value of a population parameter that we are interested in. A
hypothesis test is a decision between two competing, mutually exclusive, and collectively exhaustive
hypotheses about the value of the parameter.

The test statistic measures the difference between the sample statistic and the hypothesized parameter.

The critical value is the boundary between the two regions (reject H0, do not reject H0).

9.4. TESTING A MEAN: KNOWN POPULATION VARIANCE:

Test statistic: z = (x̄ − μ0) / (σ/√n)

P-Value:

The p-value is the probability of obtaining a test statistic as extreme as the one observed, assuming that the null hypothesis is true.

To find the p-value, we can use Excel’s function =NORM.S.DIST

Analogy to Confidence Intervals:

9.5. TESTING A MEAN: UNKNOWN POPULATION VARIANCE:

Test statistic: t = (x̄ − μ0) / (s/√n), d.f. = n − 1

Confidence Intervals:

9.6. TESTING A PROPORTION:

Test statistic: z = (p − π0) / √( π0(1 − π0)/n ), where p = x/n is the sample proportion.

CHAP 10: TWO-SAMPLE HYPOTHESIS TESTS


10.1. Two-sample Tests: compare two sample estimates with each other.

Basis of two-sample tests: they are especially useful because they possess a built-in point of comparison. Three situations use two-sample tests:

• Before versus after
• Old versus new
• Experimental versus control

The two samples can come from: the same population, in which case any differences are due to sampling variation;
or different populations with different parameter values.

10.2. COMPARING TWO MEANS: INDEPENDENT SAMPLES

 Comparing two population means is a common business problem.

Format of hypotheses:

 Left-Tailed Test: H0: μ1 − μ2 ≥ D0 vs H1: μ1 − μ2 < D0
 Two-Tailed Test: H0: μ1 − μ2 = D0 vs H1: μ1 − μ2 ≠ D0
 Right-Tailed Test: H0: μ1 − μ2 ≤ D0 vs H1: μ1 − μ2 > D0
Test statistic: the difference between the sample statistic and the parameter, divided by the standard error of the sample statistic.

Case 1: Known variances
 z_calc = [ (x̄1 − x̄2) − (μ1 − μ2) ] / √( σ1²/n1 + σ2²/n2 )

Case 2: Unknown variances, assumed equal
 t_calc = [ (x̄1 − x̄2) − (μ1 − μ2) ] / √( s_p²/n1 + s_p²/n2 )
 where the pooled variance is
 s_p² = [ (n1 − 1)s1² + (n2 − 1)s2² ] / (n1 + n2 − 2), with d.f. = n1 + n2 − 2

Case 3: Unknown variances, assumed unequal
 t_calc = [ (x̄1 − x̄2) − (μ1 − μ2) ] / √( s1²/n1 + s2²/n2 )
 with Welch’s degrees of freedom:
 d.f. = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]

Finding Welch’s degrees of freedom requires a tedious calculation, but this is easily handled by Excel. When computer software is not available, a conservative quick rule for degrees of freedom is to use:

 d.f. = min(n1 − 1, n2 − 1)

 The test statistic will be identical either way, but the degrees of freedom (and hence the critical values) may differ.
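A sketch comparing Case 2 and Case 3 with scipy (hypothetical samples; equal_var switches between the pooled and Welch tests):

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical waiting times at two branches (minutes)
x1 = np.array([5.2, 6.1, 5.8, 7.0, 6.4, 5.9])
x2 = np.array([4.8, 5.0, 5.6, 4.9, 5.3, 5.1, 4.7])

print(ttest_ind(x1, x2, equal_var=True))    # Case 2: pooled variance
print(ttest_ind(x1, x2, equal_var=False))   # Case 3: Welch's t test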

10.3. CONFIDENCE INTERVAL FOR THE DIFFERENCE OF TWO MEANS, μ1 - μ2

Assuming equal variances:
 (x̄1 − x̄2) ± t(α/2) · √( s_p²/n1 + s_p²/n2 ), with d.f. = n1 + n2 − 2

Assuming unequal variances:
 (x̄1 − x̄2) ± t(α/2) · √( s1²/n1 + s2²/n2 ), with Welch’s d.f. or the quick rule d.f. = min(n1 − 1, n2 − 1)

where x̄: sample mean; s1, s2: sample std. dev.; n1, n2: sample size

10.4. COMPARING TWO MEANS: PAIRED SAMPLES

Paired Data: if the same individuals are observed twice but under different circumstances, we have a paired comparison.

 Paired data typically come from a before-after experiment, but not always.

Paired t Test: define a new variable d = X1 − X2 as the difference between X1 and X2; the test statistic is t = d̄ / (s_d/√n) with d.f. = n − 1.

Step 1: State the hypotheses (H0, H1)  Step 2: Specify the decision rule (d.f., alpha, critical value, rejection condition)  Step 3: Calculate the test statistic  Step 4: Make the decision (compare the t statistic with the critical value)  Step 5: Take action.
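A sketch of the paired t test with scipy (hypothetical before/after scores for the same six employees):

import numpy as np
from scipy.stats import ttest_rel

before = np.array([72, 78, 65, 80, 74, 69])
after  = np.array([75, 82, 70, 81, 79, 74])

print(ttest_rel(after, before))   # paired t test on d = after - before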

Analogy to Confidence Interval:


10.5. COMPARING TWO PROPORTIONS:

Testing for Zero Difference: H0: π1 − π2 = 0

Sample proportions: p1 = x1/n1, p2 = x2/n2

Pooled proportion: p_c = (x1 + x2) / (n1 + n2)

Test statistic: z = (p1 − p2) / √( p_c(1 − p_c)(1/n1 + 1/n2) )

Step 1: State the hypotheses (H0, H1)  Step 2: Specify the decision rule (alpha, critical value, rejection condition)  Step 3: Calculate the test statistic (pooled estimate p_c, then z)  Step 4: Make the decision (compare the test statistic with the critical value, or the p-value with alpha)  Step 5: Take action.

Checking Normality: the normal approximation is reasonable when n1·p1, n1(1 − p1), n2·p2, and n2(1 − p2) are all large enough (a common rule of thumb is at least 10 each).

 Sample sizes in a two-sample test don’t need to be equal. Unequal sample sizes are common, and the formulas still apply.
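A sketch of the two-proportion z test with the pooled proportion (hypothetical conversion counts):

from math import sqrt
from scipy.stats import norm

# Hypothetical: 45 of 200 customers converted with layout A, 30 of 180 with B
x1, n1, x2, n2 = 45, 200, 30, 180
p1, p2 = x1/n1, x2/n2
pc = (x1 + x2) / (n1 + n2)   # pooled proportion

z = (p1 - p2) / sqrt(pc * (1 - pc) * (1/n1 + 1/n2))
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-tailed p-value
print(z, p_value)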
10.6. CONFIDENCE INTERVAL FOR THE DIFFERENCE OF TWO PROPORTIONS, π1 – π2
 (p1 − p2) ± z(α/2) · √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )

10.7. COMPARING TWO VARIANCES:

The F-test: F_calc = s1² / s2², with d.f.1 = n1 − 1 and d.f.2 = n2 − 1.

Two-tailed F-test:
Step 1: State the hypotheses (H0: σ1² = σ2², H1: σ1² ≠ σ2²)  Step 2: Specify the decision rule (d.f., critical values F_L and F_R; reject when F_calc < F_L or F_calc > F_R)  Step 3: Calculate the test statistic F_calc  Step 4: Make the decision (compare the test statistic with the critical values, or the p-value with alpha)  Step 5: Take action.

Folded F-test: this method requires that we put the larger observed variance in the numerator.

CHAP 11: ANALYSIS OF VARIANCE (ANOVA)

ONE-FACTOR ANOVA:
One-factor ANOVA as a Linear Model:

- Each observation yij comes from a population with a common mean (μ) plus a treatment effect (Aj) plus random error (eij):
 yij = μ + Aj + eij (j = 1, 2, …, c and i = 1, 2, …, n)
 where Aj = ȳj − ȳ
- Random error is assumed to be normally distributed with zero mean and the same variance.

N-FACTOR ANOVA:

 Each factor has two or more levels. A particular combination of factor levels is called a treatment.

- Test if the factor has a significant effect on Y:
 H0: μ1 = μ2 = μ3 = … = μc
 H1: Not all the means are equal

 If we cannot reject H0, we conclude that observations within each treatment have the same mean μ.

 Sample sizes within each treatment do not need to be equal. The total number of observations: n = n1 + n2 + … + nc

GROUP AND GRAND MEANS:

The mean of each group (group mean): ȳj = (1/nj) Σi yij

The overall sample mean (grand mean): ȳ = (1/n) Σj Σi yij

PARTITION OF DEVIATIONS:

For a given observation yij, the following relationship holds:
 (yij − ȳ) = (ȳj − ȳ) + (yij − ȳj)

PARTITIONED SUM OF SQUARES:

This relationship is also true for the sums of squared deviations: SST = SSA + SSE, where

 SST = (y11 − ȳ)² + (y12 − ȳ)² + ⋯ + (ycnc − ȳ)²
 SSA = n1(ȳ1 − ȳ)² + n2(ȳ2 − ȳ)² + ⋯ + nc(ȳc − ȳ)²
 SSE = (y11 − ȳ1)² + (y12 − ȳ2)² + ⋯ + (ycnc − ȳc)²

HYPOTHESIS TESTING:

- SSA and SSE are used to test the hypothesis of equal means by dividing each sum of squares by its degrees of freedom (c − 1 for SSA, n − c for SSE).
- These ratios are called Mean Squares (MSA and MSE).

F STATISTIC:

- The F statistic is the ratio of the variance due to treatments (MSA) to the variance due to error (MSE): F = MSA / MSE, with df1 = c − 1 and df2 = n − c.
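A sketch of one-factor ANOVA with scipy (hypothetical output for three machine types); f_oneway returns F = MSA/MSE and its p-value:

from scipy.stats import f_oneway

# Hypothetical output (units/hour) for three machine types
g1 = [48, 52, 50, 49, 51]
g2 = [54, 56, 53, 55, 57]
g3 = [49, 50, 48, 51, 50]

f_stat, p_value = f_oneway(g1, g2, g3)
print(f_stat, p_value)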
TUKEY’S TEST:

 Performed after rejection of equal means in ANOVA.
 Tells which population means are significantly different, e.g., μ1 = μ2 ≠ μ3.
 Tukey’s studentized range test is a multiple comparison test. For c groups, there are c(c − 1)/2 distinct pairs of means to be compared.
 Tukey’s is a two-tailed test for equality of paired means from c groups compared simultaneously.

The hypotheses for each pair are: H0: μj = μk vs H1: μj ≠ μk

HARTLEY’S TEST: checks whether the groups have equal variances.

The hypotheses are: H0: σ1² = σ2² = … = σc² vs H1: not all the variances are equal.

The test statistic is the ratio of the largest sample variance to the smallest sample variance:
 Hcalc = s²max / s²min

The decision rule:

Reject H0 if Hcalc > Hcritical

Hcritical can be found in Hartley’s critical value table, using df1 = c and df2 = n/c − 1.
