Descriptive statistics: refers to the collection, organization, presentation, and summary of data (either using charts and
graphs or using a numerical summary).
Include: Collecting and Describing Data
- Sampling and Surveys (Chapter 2)
- Visual Displays (Chapter 3)
- Numerical Summaries (Chapter 4)
- Probability Models (Chapter 5-8)
Inferential statistics: refers to generalizing from a sample to a population, estimating unknown population parameters,
drawing conclusions, and making decisions.
- Statistics is an essential part of critical thinking because it allows us to test an idea against empirical evidence.
- Empirical data are data collected through observation and experiments.
- We use statistical tools to compare our prior ideas with empirical data, but pitfalls do occur.
Conclusions from Small Samples
Ex: My aunt smoked all her life and lived to 90. Smoking doesn’t hurt you.
Conclusions from Nonrandom Samples
Ex: “Rock stars die young. Look at Buddy Holly,…”
Conclusions from Rare Events
Ex: “Mary in my office won the lottery. Her system must have worked”. – Millions of people play the lottery.
Someone will eventually win.
“Tom’s SUV rolled over. SUVs are dangerous”. – Millions of people drive SUVs. Some will roll over, some
won’t.
Poor survey method
Assuming a Causal Link
Generalization to Individuals
Ex: Men are taller than women.
Unconscious Bias
Ex: Heart attacks were more likely to occur in men than in women.
Significance versus Importance
Ex: Those born in the fall grow taller than others (a difference can be statistically significant yet too small to matter in practice).
In scientific research, data arise from experiments whose results are recorded systematically. In business, data usually
arise from accounting transactions or management processes.
Data Terminology:
- An observation is a single member of a collection of items that we want to study, such as a person, firm, or region.
- A variable is a characteristic of the subject or individual, such as an employee’s income or an invoice amount.
- The data set consists of all the values of all of the variables for all of the observations we have chosen to observe.
- A data set is often arranged as an n × m matrix with n rows and m columns: each row displays an observation and each column displays a variable.
Categorical Data: (also qualitative data) have values that are described by words rather than numbers.
Numerical Data: (also quantitative data) arise from counting, measuring something, or some kind of mathematical
operation.
A variable with a countable number of distinct values is discrete. A numerical variable that can have any value within
an interval is continuous.
Time Series Data and Cross-Sectional Data:
- If each observation in the sample represents a different equally spaced point in time (years, months, days), we have time
series data.
- The periodicity is the time between observations. It may be annual, quarterly, monthly, weekly, daily, hourly, etc.
Ex: firm’s sales, market share, debt/ equity ratio, employee absenteeism, inventory turnover, and product quality
ratings.
For time series, we are interested in trends and patterns over time.
- If each observation represents a different individual unit (e.g., a person, firm, geographic area) at the same point in time,
we have cross-sectional data.
For cross-sectional data, we are interested in variation among observations (e.g., accounts receivable in 20 Subway
franchises) or in relationships (e.g., whether accounts receivable are related to sales volume in 20 Subway franchises as
shown in Figure 2.2).
Collecting data:
Primary data collection: observation, survey, and experiment.
Secondary data collection: based on electronic and print documents.
Statisticians sometimes refer to four levels of measurement for data: nominal, ordinal, interval, and ratio.
Nominal: is the weakest level of measurement and the easiest to recognize. Data are labels or names used to identify an
attribute of the element. We usually code nominal data numerically.
Ordinal: codes connote a ranking of data values. Ordinal data have the same properties as nominal data, but the order of the ranking is meaningful. Ordinal data can be treated as nominal, but not vice versa.
Interval: which not only is a rank but also has meaningful intervals between scale points.
Ratio: is the strongest level of measurement. Ratio data have all the properties of the other three data types, but in
addition possess a meaningful zero that represents the absence of the quantity being measured.
2.3 Sampling Concepts:
- A sample involves looking only at some items selected from the population, while a census is an examination of all
items in a defined population.
- Parameter is a measurement or characteristic of the population represented by a Greek letter while Statistic is a
numerical value calculated from a sample represented by a Roman letter.
- The sampling frame is the group from which we take the sample. If the frame differs from the target population, then
our estimates might not be accurate.
Random sampling:
Non-random sampling:
Stem-and-Leaf Display:
- The stem is the tens digit of the data, and the leaf is the ones digit. Separate the sorted data values into leading digits (the stem) and trailing digits (the leaf).
Dot plots:
- Shows variability by displaying the range of the data
- Center by revealing where the data values tend to cluster and where the midpoint lies.
- Reveal some things about the shape of the distribution if the sample is large enough
- A modal class is a histogram bar that is higher than those on either side.
- A histogram with a single modal class is unimodal, one with two modal classes is bimodal, and one with more than two
modes is multimodal.
Charts:
• line chart
• column chart (vertical display) and bar chart (horizontal display)
• pie chart: for qualitative data (categories or nominal scale)
• scatter plot
• Tables
Deceptive graphs
Error 1: Nonzero origin will exaggerate the trend. Measured distances do not match the stated values or axis
demarcations
Error 2: Elastic Graph Proportions By shortening the X-axis in relation to the Y-axis, vertical change is exaggerated
Error 3: Dramatic Titles and Distracting Pictures grab the reader’s attention rather than convey the chart’s content
Error 4: 3-D and Novelty Graphs
Error 5: Rotated Graphs By making a graph 3-dimensional and rotating it through space, the author can make trends
appear to dwindle into the distance or loom alarmingly toward you.
Error 6: Unclear Definitions or Scales (Missing or unclear units of measurement)
Error 7: Vague Sources: the author lost the citation, mixed data from several sources
Error 8: Complex Graphs Complicated visual displays make the reader work harder.
Error 9: Gratuitous Effects Slide shows often use many colors and special effects
Error 11: Area Trick A common visual trick is enlarging the width of the bars as their height increases, so the bar area misstates the true proportion.
Mean: the most familiar statistical measure of center is the mean (=AVERAGE(data)). It is affected by outliers.
μ = (x1 + x2 + ⋯ + xn) / n
Geometric mean: used to measure the rate of change of a variable over time (=GEOMEAN(data)).
G = (x1 · x2 · … · xn)^(1/n)
Quartiles (denoted Q1, Q2, Q3): divide the data into 4 groups of approximately equal size, that is, the 25th, 50th, and 75th percentiles.
Q1 position = (n + 1)/4
Q2 position = (n + 1)/2 (the median)
Q3 position = 3(n + 1)/4
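The (n + 1) quartile-position rule above can be sketched in Python. The sample data and the linear interpolation between adjacent sorted values are illustrative assumptions, not part of the notes.

```python
# Sketch of the (n + 1)/4 quartile-position rule above, with linear
# interpolation when a position falls between two sorted values.
# The sample data are made up for illustration.

def quartile(sorted_data, q):
    """q = 1, 2, or 3. Position rule: q * (n + 1) / 4 (1-based)."""
    n = len(sorted_data)
    pos = q * (n + 1) / 4
    lo = int(pos)              # integer part of the position
    frac = pos - lo            # fractional part, used to interpolate
    if lo < 1:
        return sorted_data[0]
    if lo >= n:
        return sorted_data[-1]
    # interpolate between the lo-th and (lo+1)-th sorted values
    return sorted_data[lo - 1] + frac * (sorted_data[lo] - sorted_data[lo - 1])

data = sorted([12, 7, 9, 15, 11, 8, 14])   # n = 7
print(quartile(data, 1), quartile(data, 2), quartile(data, 3))
```

With n = 7 the positions land on whole numbers (2, 4, 6), so no interpolation is needed and the quartiles are the 2nd, 4th, and 6th sorted values.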
Box & whisker plot:
- Box plot based on the five-number summary: xmin, Q1, Q2, Q3 and xmax.
Shows:
center (position of the median Q2)
variability (width of the “box” defined by Q1 and Q3 and the range between xmin and xmax).
shape (skewness if the whiskers are of unequal length and/or if the median is not in the center
of the box).
- Midhinge: (Q1 + Q3) / 2
Sample variance: s² = Σ(xi − x̄)² / (n − 1) (=VAR.S(Data))
Sample standard deviation: s = √[Σ(xi − x̄)² / (n − 1)] (=STDEV.S(Data))
Coefficient of variation: CV = (s / x̄) · 100%
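A quick check of the sample variance, standard deviation, and coefficient of variation defined above, using Python's standard library; the data values are made up for illustration.

```python
# Sample variance (n - 1 divisor), std dev, and CV, matching the
# =VAR.S and =STDEV.S conventions above. The data are made up.
import statistics

data = [10, 12, 9, 15, 14]
xbar = statistics.mean(data)
s2 = statistics.variance(data)   # divides by n - 1, like =VAR.S
s = statistics.stdev(data)       # like =STDEV.S
cv = s / xbar * 100              # coefficient of variation, in percent

print(xbar, s2, round(s, 4), round(cv, 2))
```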
4.4. Measures of Population:
Mean: μ = (X1 + X2 + ⋯ + XN) / N
Population variance: σ² = Σ(Xi − μ)² / N
Population standard deviation: σ = √[Σ(Xi − μ)² / N]
Chebyshev’s Theorem: the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 − 1/k²]% for any population with mean μ and standard deviation σ.
Empirical Rule: for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of the data (k = 1: about 68%; k = 2: about 95%; k = 3: about 99.7%).
Outliers: values outside μ ± 3σ (beyond three standard deviations from the mean) are rare (less than 1%) and unusual in a normal distribution.
Z-score: a measure of distance from the mean. A z-score above 3.0 or below −3.0 is considered an outlier.
Formula for a population: z = (X − μ) / σ
Formula for a sample: z = (X − x̄) / s
Grouped data (k classes with midpoints mi and frequencies fi):
Population mean: μ = (Σ fi·mi) / N, where N = Σ fi
Sample mean: x̄ = (Σ fi·mi) / n
Population variance: σ² = Σ fi·(mi − μ)² / N (=VAR.P)
Sample variance: s² = Σ fi·(mi − x̄)² / (n − 1) (=VAR.S)
Covariance (the strength of the linear relationship between two variables):
Population: σXY = cov(X, Y) = Σ(Xi − μX)(Yi − μY) / N (=COVARIANCE.P)
Sample: sXY = Σ(Xi − x̄)(Yi − ȳ) / (n − 1) (=COVARIANCE.S)
cov(X, Y) > 0 => X and Y tend to move in the same direction
cov(X, Y) < 0 => X and Y tend to move in opposite directions
cov(X, Y) = 0 => X and Y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not guarantee independence)
Coefficient of correlation (the relative strength of the linear relationship between two variables):
r = Σ(Xi − x̄)(Yi − ȳ) / [√(Σ(Xi − x̄)²) · √(Σ(Yi − ȳ)²)] (=CORREL)
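The sample covariance and correlation formulas above can be verified by hand in Python; the data pairs are made up for illustration.

```python
# Sample covariance and correlation computed directly from the
# definitions above. The (x, y) pairs are made up.
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# sample covariance: divide by n - 1 (like =COVARIANCE.S)
cross = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
s_xy = cross / (n - 1)

# correlation: cross-product scaled by both sums of squares (like =CORREL)
ss_x = sum((xi - xbar) ** 2 for xi in x)
ss_y = sum((yi - ybar) ** 2 for yi in y)
r = cross / sqrt(ss_x * ss_y)

print(round(s_xy, 4), round(r, 4))
```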
CHAPTER 5: PROBABILITY
probability
Empirical: Estimated from outcome frequency
Classical: Known a priori by the nature of the experiment
Subjective: Based on informed opinion or judgment
RULES OF PROBABILITY
o Complement of an event P(A) + P(A′) = 1 or P(A′ ) = 1 – P(A)
o Union of Two Events: P(A ∪ B)
o Intersection of Two Events: P(A ∩ B)
o General law of addition: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
o Mutually Exclusive Events P(A ∩ B) = 0
o Collectively Exhaustive Events: the events together cover the entire sample space, so P(A ∪ B ∪ ⋯) = 1
o Special Law of Addition (for mutually exclusive events): P(A ∪ B) = P(A) + P(B)
o Conditional probability: P(A|B) = P(A ∩ B) / P(B)
o Event A is independent of event B: P(A|B) = P(A).
=> P(A ∩ B) = P(A).P(B)
o Odds in favor of event A: P(A) / [1 − P(A)]
o Odds against event A: [1 − P(A)] / P(A)
Marginal probability: divide a row or column total by the total sample size.
Joint probability: divide a cell frequency by the total sample size.
Conditional probability: restrict attention to a single row or column (divide a cell frequency by that row or column total).
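A small sketch of marginal, joint, and conditional probabilities from a contingency table, as described above; the 2×2 counts are invented for illustration.

```python
# Marginal, joint, and conditional probabilities from a contingency
# table. The counts (rows: gender, columns: preference) are made up.

table = {
    ("M", "Yes"): 30, ("M", "No"): 20,
    ("F", "Yes"): 25, ("F", "No"): 25,
}
total = sum(table.values())    # total sample size = 100

# joint probability: cell frequency / total
p_m_yes = table[("M", "Yes")] / total

# marginal probability: row total / total
row_m = table[("M", "Yes")] + table[("M", "No")]
p_m = row_m / total

# conditional probability P(Yes | M): cell / row total
p_yes_given_m = table[("M", "Yes")] / row_m

print(p_m_yes, p_m, p_yes_given_m)
```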
General Form of Bayes’ Theorem:
P(Bi | A) = P(A | Bi)·P(Bi) / [P(A | B1)·P(B1) + P(A | B2)·P(B2) + ⋯ + P(A | Bn)·P(Bn)]
Each Bernoulli trial is independent so that the probability of success π remains constant on each trial
The binomial distribution arises when a Bernoulli experiment is repeated n times.
P(X) = [n! / (X!(n − X)!)] · π^X · (1 − π)^(n−X)
Where:
P(X): probability of X successes in n trials
X: number of “successes” in the sample
n: sample size
π: probability of “success”
Mean: μ = E(X) = nπ
Variance: σ² = nπ(1 − π)
Std dev: σ = √[nπ(1 − π)]
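The binomial PMF above can be sketched with math.comb for the n-choose-X term; n = 10 and π = 0.3 are made-up parameters for illustration.

```python
# Binomial PMF from the formula above, plus a check that the
# probabilities sum to 1 and the mean equals n * pi.
from math import comb

def binom_pmf(x, n, pi):
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

n, pi = 10, 0.3                  # made-up parameters
pmf = [binom_pmf(x, n, pi) for x in range(n + 1)]

total = sum(pmf)                 # should be 1
mean = sum(x * p for x, p in zip(range(n + 1), pmf))   # should be n * pi
print(round(total, 6), round(mean, 6))
```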
POISSON DISTRIBUTION describes the number of occurrences within a randomly chosen unit of time or
space
P(X = x | λ) = e^(−λ) · λ^x / x!
where:
x = number of events in an area of opportunity
λ = average number of events per unit
1 parameter: λ
Mean: μ = λ
Variance: σ² = λ
Std dev: σ = √λ
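A sketch of the Poisson PMF above; the rate λ = 2.5 events per unit is a made-up value for illustration.

```python
# Poisson PMF from the formula above: P(X = x | lam) = e^-lam lam^x / x!
from math import exp, factorial

def poisson_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

lam = 2.5                                   # made-up rate
p0 = poisson_pmf(0, lam)                    # P(X = 0)
p_le_2 = sum(poisson_pmf(x, lam) for x in range(3))   # P(X <= 2)

print(round(p0, 4), round(p_le_2, 4))
```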
HYPERGEOMETRIC DISTRIBUTION
Similar to the binomial (involve sample size of n and the number of successes X) except that sampling is
without replacement from a finite population of N items
The trials are not independent and the probability of success is not constant from trial to trial.
P(X = x | n, N, s) = C(s, x) · C(N − s, n − x) / C(N, n)
Where
N = population size
s = number of items of interest in the population
N – s = number of events not of interest in the population
n = sample size
x = number of items of interest in the sample
n – x = number of events not of interest in the sample
3 parameters: N, n, s
Mean: μ = ns/N = nπ, where π = s/N
Std dev: σ = √[nπ(1 − π)] · √[(N − n)/(N − 1)]
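The hypergeometric PMF above (sampling without replacement) can be sketched the same way; N = 20, s = 7, and n = 5 are made-up values for illustration.

```python
# Hypergeometric PMF: C(s, x) * C(N - s, n - x) / C(N, n).
# Population size N, items of interest s, sample size n are made up.
from math import comb

def hypergeom_pmf(x, n, N, s):
    return comb(s, x) * comb(N - s, n - x) / comb(N, n)

N, s, n = 20, 7, 5
pmf = [hypergeom_pmf(x, n, N, s) for x in range(min(n, s) + 1)]

total = sum(pmf)                                 # should be 1
mean = sum(x * p for x, p in enumerate(pmf))     # should be n * s / N
print(round(total, 6), round(mean, 6))
```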
geometric distribution describes the number of Bernoulli trials until the first success, the number of
trials is not fixed
1 parameter π, the probability of success
PDF: P(X = x) = π(1 − π)^(x−1)
Domain: x = 1, 2, 3, … (integers)
Mean: μ = 1/π
Std dev: σ = √[(1 − π)/π²]
UNIFORM CONTINUOUS DISTRIBUTION (over the interval a to b)
Shape: symmetric, no mode
Std dev: σ = (b − a)/√12
NORMAL DISTRIBUTION
2 parameters the mean μ and standard
deviation σ
PDF: f(x) = [1/(σ√(2π))] · e^(−(1/2)·((x − μ)/σ)²)
In Excel: =NORM.DIST(x, μ, σ, 0)
CDF: F(x0) = P(X ≤ x0); P(c ≤ X ≤ d) = F(d) − F(c)
In Excel: =NORM.DIST(x, μ, σ, 1)
Domain −∞ < x <+∞
Mean μ μ
Std devσ σ
Shape Bell Shaped
Symmetrical
Mean= Median = Mode
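The normal CDF calculations above (=NORM.DIST with cumulative = 1) can be sketched with Python's statistics.NormalDist; μ = 100 and σ = 15 are made-up parameters.

```python
# Normal-distribution probabilities via the CDF, as defined above.
# mu = 100, sigma = 15 are made-up parameters.
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)

# P(X <= 115), like =NORM.DIST(115, 100, 15, 1); here 115 is mu + 1 sigma
p = dist.cdf(115)

# P(c <= X <= d) = F(d) - F(c), here the mu +/- 2 sigma band
p_band = dist.cdf(130) - dist.cdf(70)

print(round(p, 4), round(p_band, 4))
```

The two results match the Empirical Rule: roughly 84% of the distribution lies below μ + 1σ, and about 95% lies within μ ± 2σ.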
STANDARD NORMAL DISTRIBUTION: Any normal distribution can be transformed into the standardized
normal distribution (Z), with mean 0 and variance 1.
Parameters: none (the standardized variable Z has mean 0 and standard deviation 1)
PDF: f(Z) = [1/√(2π)] · e^(−Z²/2), where Z = (X − μ)/σ
CDF in Excel: =NORM.S.DIST(Z, 1)
Domain −∞< Z<+ ∞
Mean μ 0
Std devσ 1
Shape Bell Shaped
Symmetrical
EXPONENTIAL DISTRIBUTION describes waiting time until the next Poisson arrival
1 parameter λ
PDF: f(x) = λe^(−λx)
CDF (probability of waiting less than x): P(X ≤ x) = 1 − e^(−λx)
In Excel: =EXPONDIST(x, λ, TRUE)
Triangular distribution
has 3 parameters (a and c enclose the range, and b is the mode)
CHAP 8:
Sampling error: the difference between an estimate and the corresponding population parameter, x̄ − μ.
Bias: the difference between the expected value of the estimator and the corresponding parameter, E(x̄) − μ.
=> Unbiased: E(x̄) = μ
Efficiency: A more efficient estimator has smaller variance.
Consistency: The estimator converges toward the parameter being estimated as the sample size increases.
Mean of the sampling distribution of x̄: μx̄ = μ
Standard error of the mean: σx̄ = σ/√n
The standard error of the mean decreases as the sample size increases.
Margin of error: E = zσ/√n
Sample size for the mean: n = z²σ² / E²
Sample size for the proportion: n = z²π(1 − π) / E², where π is estimated by p from a pilot sample (or conservatively use 0.5 as an estimate of π).
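The two sample-size formulas above can be sketched numerically; the z-value, σ, and margins of error are made-up inputs, and the result is rounded up since n must be a whole number.

```python
# Sample sizes for estimating a mean and a proportion, from the
# formulas above. All numeric inputs are made up for illustration.
from math import ceil

z = 1.96          # approximate z for 95% confidence
sigma = 12        # assumed population standard deviation
E_mean = 2        # desired margin of error for the mean

# n = z^2 sigma^2 / E^2, rounded up to the next whole observation
n_mean = ceil(z**2 * sigma**2 / E_mean**2)

pi = 0.5          # conservative estimate of pi
E_prop = 0.05     # desired margin of error for the proportion
n_prop = ceil(z**2 * pi * (1 - pi) / E_prop**2)

print(n_mean, n_prop)
```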
CHAP 12:
sample regression equation provides an estimate of the population regression line
Coefficient of Determination: R² = SSR/SST = 1 − SSE/SST, where 0 ≤ R² ≤ 1
Estimation of Model Error Variance:
Test of the slope, H0: β1 = 0:
If |tstat| > t(n−2, α/2) or p-value ≤ α => Reject H0
Confidence Interval Estimate of the Slope: b1 ± t(n−2, α/2) · s(b1)
Rejecting the null hypothesis when it is true is a Type I error (also called a false positive). Failure to reject the null hypothesis when it is false is a Type II error (also called a false negative).
A Type I error is usually considered the more serious one.
Test statistic that measures the difference between the sample statistic and the hypothesized parameter.
Critical value is the boundary between the two regions (reject H0, do not reject H0).
Test statistic:
P-Value:
The p-value is the probability of obtaining a test statistic as extreme as the one observed, assuming that the null
hypothesis is true.
Basis of a two-sample test: two-sample tests are especially useful because they possess a built-in point of comparison. The two samples can come from:
- the same population, so any differences are due to sampling variation; or
- different populations with different parameter values.
Format of hypothesis:
zcalc = [(x̄1 − x̄2) − (μ1 − μ2)] / √(σ1²/n1 + σ2²/n2)
Pooled variance (equal variances assumed):
sp² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2), with d.f. = n1 + n2 − 2
Welch’s degrees of freedom (unequal variances):
d.f. = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)]
Finding Welch’s degrees of freedom requires a tedious calculation, but this is easily handled by Excel. When computer software is not available, a conservative quick rule for the degrees of freedom is to use d.f. = min(n1 − 1, n2 − 1).
Test statistics will always be identical, but the degrees of freedom (and hence the critical values) may differ.
x̄1, x̄2: sample means; s1, s2: sample std. dev.; n1, n2: sample sizes
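The two-sample t statistic and Welch's degrees of freedom above can be sketched from summary statistics; the means, standard deviations, and sample sizes below are made up for illustration.

```python
# Two-sample t statistic with Welch's degrees of freedom and the
# conservative quick rule. The summary statistics are made up.
from math import sqrt

x1, s1, n1 = 25.0, 4.0, 12   # sample 1: mean, std dev, size
x2, s2, n2 = 22.0, 6.0, 15   # sample 2

# test statistic for H0: mu1 - mu2 = 0
se = sqrt(s1**2 / n1 + s2**2 / n2)
t_calc = (x1 - x2) / se

# Welch's degrees of freedom
v1 = s1**2 / n1
v2 = s2**2 / n2
df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# conservative quick rule
df_quick = min(n1 - 1, n2 - 1)

print(round(t_calc, 3), round(df_welch, 1), df_quick)
```

Note that Welch's d.f. (about 24) is larger than the quick rule's (11), which is why the quick rule is called conservative: it leads to a larger critical value.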
x bar: sample mean; s1 s2: sample std. dev.; n1 n2: sample size
Paired Data: If the same individuals are observed twice but under different circumstances, we have a paired
comparison.
Paired data typically come from a before-after experiment, but not always.
Paired t Test: define a new variable d = X1 − X2 as the difference between X1 and X2.
Step 1: State the hypotheses (H0, H1). Step 2: Specify the decision rule (d.f., α, critical value, rejection condition). Step 3: Calculate the test statistic. Step 4: Make the decision (compare the t statistic with the critical value). Step 5: Take action.
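The paired t calculation above can be sketched as follows: form the differences d, then t = d̄ / (s_d / √n) with n − 1 degrees of freedom. The before/after data are made up for illustration.

```python
# Paired t test statistic: differences d = X1 - X2, then
# t = dbar / (s_d / sqrt(n)), d.f. = n - 1. Data are made up.
from math import sqrt
from statistics import mean, stdev

before = [210, 205, 193, 182, 259]
after  = [197, 207, 178, 170, 204]

d = [b - a for b, a in zip(before, after)]
n = len(d)
dbar = mean(d)
s_d = stdev(d)                   # sample std dev of the differences
t_calc = dbar / (s_d / sqrt(n))
df = n - 1                       # degrees of freedom

print(round(t_calc, 3), df)
```

The resulting t statistic would then be compared with the critical value t(n−1, α/2), exactly as in Steps 2 and 4 above.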
Sample Proportions:
Pooled Proportion:
Test Statistic:
Step 1: State the hypotheses (H0, H1). Step 2: Specify the decision rule (d.f., α, critical value, rejection condition). Step 3: Calculate the test statistic (pooled estimate pc, then the test statistic). Step 4: Make the decision (compare the test statistic with the critical value, or the p-value with α). Step 5: Take action.
Checking Normality:
Sample sizes in a two-sample test do not need to be equal. Unequal sample sizes are common, and the formulas still apply.
10.6. CONFIDENCE INTERVAL FOR THE DIFFERENCE OF TWO PROPORTIONS, π1 – π2
The F-test:
Two-tailed F-test:
Step 1: State the hypotheses (H0, H1). Step 2: Specify the decision rule (d.f., FL and FR; reject when Fcalc > FR or Fcalc < FL). Step 3: Calculate the test statistic (Fcalc). Step 4: Make the decision (compare Fcalc with the critical values, or the p-value with α). Step 5: Take action.
Folded F-test: this method requires that we put the larger observed variance in the numerator.
ONE-FACTOR ANOVA:
One-factor ANOVA as a Linear Model:
- Each observation yij comes from a population with a common mean (μ) plus a treatment effect (Aj) plus random error (εij):
yij = μ + Aj + εij (j = 1, 2, …, c and i = 1, 2, …, n)
where Aj = ȳj − ȳ
- Random error is assumed to be normally distributed with zero mean and the same variance.
N-FACTOR ANOVA:
Each factor has two or more levels. A particular combination of factor levels is called a treatment
If we cannot reject H0, we conclude that observations within each treatment have the same mean μ.
Sample sizes within each treatment do not need to be equal. The total number of observations: n = n1+n2+..
HYPOTHESIS TESTING:
- SSA and SSE are used to test the hypothesis of equal means by dividing each sum of squares by its degrees of freedom.
- These ratios are called Mean Squares (MSA and MSE).
F STATISTIC:
- The F statistic is the ratio of the variance due to treatments (MSA) to the variance due to error (MSE).
HARTLEY’S TEST (for equal variances):
The test statistic is the ratio of the largest sample variance to the smallest sample variance.
Hcritical can be found in Hartley’s test statistic tables, using df1 = c and df2 = n/c − 1.