File Tổng Hợp Kiến Thức SB (SB Knowledge Summary File)
I – Introduction to statistics:
1. What is Statistics?
- Statistics is the science of collecting, organizing, analyzing, interpreting, presenting data and
drawing conclusions from data.
- Example:
A teacher wants to get a better overview of student performance in the classroom, so she follows these steps:
● The teacher collects the process scores and test scores of students (collect)
● She compiles a score table (organize)
● Then the teacher calculates the average score of each student and the average score of the
whole class (analyze)
● Based on these results, she evaluates the learning situation of each student and of the whole
class (interpret), and adjusts the program and teaching methods accordingly (present and conclude).
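The teacher's workflow above can be sketched in Python (the student names and scores here are made up for illustration):

```python
# Hypothetical class scores: collect/organize, then analyze and interpret.
scores = {"An": [7.5, 8.0], "Binh": [6.0, 5.5], "Chi": [9.0, 9.5]}  # collect + organize

student_avgs = {name: sum(s) / len(s) for name, s in scores.items()}  # analyze
class_avg = sum(student_avgs.values()) / len(student_avgs)

for name, avg in student_avgs.items():  # interpret
    print(f"{name}: {avg:.2f} ({'above' if avg >= class_avg else 'below'} class average)")
print(f"Class average: {class_avg:.2f}")
```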
2. Overview of Statistics:
3. Two Branches of Statistics:
- Statistics: The branch of mathematics that transforms data into useful information for decision
makers. Statistics can be applied in many fields including physical and social science, business,
government, and manufacturing.
- Statistics is separated into two branches: descriptive statistics (collecting, summarizing, and presenting data) and inferential statistics (drawing conclusions about a population based on sample data).
II – Data collection:
1. Variables and data:
a. Data Terminology
- Data: facts and figures collected for analysis, presentation, and interpretation.
- An observation: a single member of a collection of items that we want to study, such as a
person, firm, or region.
Example: an employee, or an invoice mailed last month
- A variable: a characteristic about the items that we want to study (e.g., student name, Gender,
DOB).
Example: an employee’s income or an invoice amount.
- Data set: all the values of all of the variables for all of the observations we chose.
Data usually are entered into a spreadsheet or database as an n × m matrix.
Example:
Jan 5.000.000đ
Feb 6.000.000đ
Mar 6.500.000đ
Apr 4.000.000đ
May 7.000.000đ
Jun 6.000.000đ
Jul 5.500.000đ
Aug 5.500.000đ
Sep 4.500.000đ
Oct 6.500.000đ
Dec 7.000.000đ
From the dataset:
● Categorical (qualitative) data: values that are described by words rather than numbers -
nonnumerical values.
+ Verbal label (labels describing different categories or groups)
+ Coded (codes representing different categories or groups)
● Numerical (quantitative) data: arise from counting, measuring something, or some kind of
mathematical operation. Two types: Discrete (integers), Continuous (physical measurements,
financial variables)
- Cross-sectional Data: observation represents a different individual unit (e.g., a person, firm,
geographic area) at the same point in time.
→ variation among observations and relationships
Ex: daily closing prices of a group of 20 stocks recorded on December 1, 2015.
- Time Series Data: each observation represents the same unit measured at successive points in time.
- Combining the two data types gives pooled cross-sectional and time series data.
Ex: monthly unemployment rates for the 13 Canadian provinces or territories for the last 60 months
2. Level of measurement:
- Four levels of measurement for data: nominal, ordinal, interval, and ratio
a. Nominal Measurement:
- Nominal data: the weakest level of measurement and the easiest to recognize; values merely
identify a category. “Nominal” data are the same as “qualitative”, “categorical” or “classification” data. The only
permissible mathematical operations are counting (e.g., frequencies).
Example:
● Hair color: Hair color is a nominal variable because there is no order to the different colors.
We can have black hair, brown hair, blonde hair, red hair, etc. These colors are not ranked in any way,
so it makes no sense to say that black hair is better than brown hair, or vice versa.
● Gender: The categories of gender might be "male," "female," "non-binary," and "other". These
categories are not ordered from high to low, and there is no inherent meaning to the order in which
they are listed.
● Zip codes: Zip codes are nominal variables because they are simply labels that are used to
identify different geographic locations. Note that although zip codes are series of numbers, there is no
order to zip codes, so it makes no sense to say that one zip code is better than another.
→ Nominal measurements have no natural order
b. Ordinal Measurement
- Ordinal data codes imply a ranking of data values. There is no clear meaning to the distance
between.
- Like nominal data, ordinal data lacks the properties that are required to compute many
statistics, such as the average. Ordinal data can be treated as nominal, but not vice versa.
Example:
● Customer satisfaction ratings: Customer satisfaction ratings can be ranked in order, from least
satisfied to most satisfied, such as very dissatisfied, somewhat dissatisfied, neutral, somewhat
satisfied, and very satisfied. This data is ordinal because there is a clear order to the ratings, but the
exact difference between each rating is not known.
● Educational levels: Educational levels can be ranked in order, from lowest to highest, such as
elementary school, middle school, high school, and college. This data is ordinal because there is a
natural order to the levels, but the exact difference between each level is not known.
→ Ordinal measurements have ordering, but the differences between values have no clear meaning
c. Interval Measurement
- Interval data is not only a rank but also has meaningful intervals between scale points.
Intervals between numbers represent distances, which have a meaningful and constant
interpretation. The difference between any two values on the scale is always the same. However,
ratios are not meaningful for interval data, and the measurements do not have a true zero value
Example:
→ However, it is important to note that interval data does not have an absolute zero point. This
means that we cannot say that 0 degrees Celsius is the same as no temperature, or that an IQ score of
0 represents no intelligence.
→ Interval measurements have ordering and the differences between values have meaning, but ratios
have no meaning
d. Ratio Measurement
- The data have all the properties of interval data and the ratio of two values is meaningful. The
measurements have a true zero value. We can recode ratio measurements downward into ordinal or
nominal measurements (but not conversely).
Example:
● Height: A person's height can be measured in centimeters, inches, or feet. Zero represents no
height, and the distance between each unit of measurement is equal. For example, the difference
between 5 feet and 6 feet is the same as the difference between 1 meter and 2 meters.
● Weight: A person's weight can be measured in kilograms, pounds, or grams. Zero represents
no weight, and the distance between each unit of measurement is equal. For example, the difference
between 50 kilograms and 55 kilograms is the same as the difference between 110 pounds and 121
pounds.
→ Ratio measurements have ordering, the differences between values have meaning, and ratios have
meaning
3. Sampling concepts:
- A population: the collection of all items of interest or under investigation, could be finite or
infinite.
- A sample: looking only at some items selected from the population - an observed subset of
the population.
Sampling is used when: infinite population, destructive testing, timely results, accuracy, cost,
sensitive information. In general, a sample is usually preferred because observing an entire
population requires far more resources.
- A census: an examination of all items in a defined population. A census is used when: small
population, large sample size, a database exists, legal requirement.
- Rule of Thumb: A population may be treated as infinite when the population size N is at least
20 times the sample size n (i.e., when N/n ≥ 20)
- Parameters and Statistics:
● A parameter is a specific characteristic of a population
● A statistic is a specific characteristic of a sample
- From a sample of n items, chosen from a population, we compute statistics that can be used
as estimates of parameters found in the population.
● Population mean = µ
● Population proportion = π
● Sample mean = x̄
● Sample proportion = p
Ex: Imagine you are interested in determining the average income (mean) of all residents (population)
in a city. In this case:
● Parameter: The population mean (µ) would be the actual average income of all residents in
the entire city.
Now, conducting a survey of a specific neighborhood in that city, you collect data from a
sample of 100 households:
● Statistic: The sample mean (x̄) would be the average income of the 100 households you
surveyed. This is an estimate that you can use to make an inference about the population
mean.
- Target Population:
● The target population contains all the individuals in which we are interested
● The sampling frame is the group from which we take the sample
4. Sampling methods:
- Two main categories: Statistical Sampling and Nonstatistical Sampling
- Statistical Sampling (Random Sampling Methods)
a. Simple Random Sampling
- Simple random sample: every item in the population has an equal chance of being selected.
Example: Select one student at random from a list of 15 students to do an oral test.
Sampling without replacement: once an item has been selected for the sample, it cannot be
selected again. This becomes a problem when the sample size n is close to the population
size N, because it causes a bias (a tendency to over- or underestimate) in the research
results.
A finite population is effectively infinite if the sample is less than 5 percent of the population (if
n/N < .05)
Sampling with replacement: the same random number could show up more than once.
Duplicates are unlikely when n is much smaller than N
b. Systematic Sampling
- Systematic sample: choose every kth item from a sequence or list, starting from a randomly
chosen entry among the first k items on the list.
● Decide on sample size: n
● Divide the frame of N individuals into n groups of k individuals each: k = N/n
● Randomly select one individual from the first group of k
● Select every kth individual thereafter
Example:
Imagine you are a teacher and you want to survey your students about their favorite subject in
school. You have a class of 30 students and you want to select a sample of 10 students to participate
in the survey.
1. Determine the sampling interval: Divide the total population size (30 students) by the desired
sample size (10 students) to get the sampling interval. In this case, the sampling interval is k = 3.
2. Select a random starting point: Randomly select a student from the first group of 3 as the starting
point. For example, let's say you randomly select student number 2.
3. Select the sample: Starting from the randomly selected student, select every third student
until you have 10 students. In this case, the sample would be students number 2, 5, 8, 11, 14, 17, 20,
23, 26, and 29.
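The three steps above can be sketched in Python; this is a minimal version that assumes the frame size is an exact multiple of n:

```python
import random

def systematic_sample(frame, n):
    """Systematic sampling: pick every k-th item after a random start among the first k."""
    k = len(frame) // n              # sampling interval
    start = random.randrange(k)      # random starting point within the first group (0-based)
    return frame[start::k][:n]

students = list(range(1, 31))        # class of 30 students, numbered 1..30
sample = systematic_sample(students, 10)
print(sample)                        # e.g. [2, 5, 8, 11, 14, 17, 20, 23, 26, 29]
```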
c. Stratified Sampling
- Divide population into homogeneous subgroups (called strata) according to some common
characteristic (e.g. age, gender, occupation)
- Select a simple random sample from each subgroup
- Combine samples from subgroups into one
Example:
Imagine you're a teacher and you want to conduct a survey to understand the study habits of
your students. Your students come from three different grade levels: 9th grade, 10th grade, and 11th
grade.
In this scenario, stratified sampling involves dividing your students into three strata based on
their grade levels and then randomly selecting samples from each stratum.
1. Identify Strata:
- Stratum 1: 9th-grade students
- Stratum 2: 10th-grade students
- Stratum 3: 11th-grade students
2. Determine Sample Size: Decide how many students you want to survey overall.
3. Randomly Sample Within Each Stratum: Randomly select a certain number of students from
each grade level. For example, if you want to survey 30 students in total, you might choose 10
students randomly from each grade.
4. Collect Data: Survey the selected students about their study habits.
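The stratified steps above can be sketched as follows (the student IDs are hypothetical):

```python
import random

def stratified_sample(strata, n_per_stratum):
    """Draw a simple random sample from each stratum, then combine them."""
    return {name: random.sample(members, n_per_stratum)
            for name, members in strata.items()}

strata = {
    "grade9":  [f"9-{i}" for i in range(1, 41)],
    "grade10": [f"10-{i}" for i in range(1, 41)],
    "grade11": [f"11-{i}" for i in range(1, 41)],
}
sample = stratified_sample(strata, 10)   # 10 students per grade, 30 in total
```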
d. Cluster Sampling
- Divide population into several “clusters” (e.g. regions), each representative of the population
● One-stage cluster sampling: randomly selected k clusters
● Two-stage cluster sampling: randomly select k clusters and then choose a random sample of
elements within each cluster.
Example:
Let's say you're a researcher studying the eating habits of people in a city. Instead of individually
selecting people, you decide to use cluster sampling.
1. Define Clusters: Divide the city into clusters based on geographical regions. For example, you
might have four clusters: District 1, District 2, District 3, District 4.
2. Randomly Select Clusters: Randomly choose two clusters out of the four. Let's say you select
District 1 and District 2 clusters.
3. Include All Members in Selected Clusters: Instead of surveying individuals, you survey
everyone in the selected clusters. So, in District 1 and District 2 clusters, you would survey all the
residents.
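One-stage cluster sampling, as in the district example, can be sketched like this (the resident lists are placeholders):

```python
import random

def one_stage_cluster_sample(clusters, k):
    """One-stage cluster sampling: randomly pick k clusters, then survey everyone in them."""
    chosen = random.sample(list(clusters), k)
    return {name: clusters[name] for name in chosen}

clusters = {
    "District 1": ["resident A", "resident B"],
    "District 2": ["resident C"],
    "District 3": ["resident D", "resident E"],
    "District 4": ["resident F"],
}
selected = one_stage_cluster_sample(clusters, 2)   # e.g. Districts 1 and 2
```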
e. Nonstatistical Sampling — examples in detail:
Judgment Sample: Let's say you're a manager in a company and you want to gather feedback
about a new workplace initiative. Instead of randomly selecting employees, you decide to use
judgment sampling.
1. Define Criteria: Identify specific criteria that you believe are relevant to the success of the new
initiative. For instance, you might consider employees who have been with the company for a
significant amount of time, or those who have experience with similar initiatives.
2. Use Expertise: Utilize your knowledge and experience as a manager to handpick employees
who you believe would provide valuable insights. This could involve considering individuals who have
shown a keen interest in similar projects before.
3. Select Participants: Choose a small number of employees based on your criteria. For example,
you might decide to interview five employees who meet the specific criteria you've outlined.
4. Conduct Interviews: Interview the selected employees to gather their opinions and feedback
on the new initiative.
Convenience sample: Let's say you're curious about your co-workers' opinions on a new office
policy, and you decide to use convenience sampling to gather quick feedback during lunchtime.
1. Select a Convenient Location: Choose a spot where your co-workers often gather during
lunch, like the breakroom or a common area.
2. Approach Easily Accessible Participants: Rather than selecting participants randomly or with
specific criteria, you opt for convenience. As people enter the breakroom, you approach them for a
quick chat about the new office policy.
3. Ask a Few Questions: Keep it brief and ask a few simple questions about their thoughts on the
new policy. For example, "Hey, what do you think about the new office policy on flexible work hours?"
4. Record Responses: Jot down or mentally note their responses. Since this method is more
about ease than representativeness, you're not aiming for a comprehensive survey but rather a quick
sense of opinions.
5. Repeat as Convenient: Continue this process during lunchtime over the next few days,
approaching co-workers as it's convenient for both you and them.
Focus Group: Let's say you work for Apple which is planning to launch a new version of an iPod,
and you want to gather in-depth insights from current iPod users. You decide to use a focus group
sampling method.
1. Identify Target Participants: Define your target group, in this case, current iPod users. This
might include people who use iPods for music, podcasts, and other media.
2. Recruit Participants: Reach out to a diverse group of iPod users, ensuring you have
participants who use different models and have varied experiences with their devices.
3. Schedule a Focus Group: Set up a time and place for a focus group session. This could be a
meeting room where participants can comfortably discuss their experiences.
4. Moderate the Discussion: As the moderator, guide the discussion by asking open-ended
questions about their experiences with iPods. For example, you might ask about their favorite
features, any challenges they've faced, and what improvements they'd like to see in a new version.
5. Record Insights: Take notes on the participants' responses and interactions during the focus
group. This qualitative data can provide rich, detailed insights into their thoughts and opinions.
f. Sources of Error
Example:
- Nonresponse bias:
● Phone Surveys Bias: If a survey only calls mobile phones and ignores landlines, it could
misrepresent opinions, especially among older individuals who primarily use landlines.
● Extreme Responses Online: In an online customer survey, low responses may lead to bias if
only customers with extreme experiences (very positive or negative) participate, missing
the views of those with moderate experiences.
- Selection bias:
● Hospital Study Bias: Research relying only on hospital data for a disease might introduce bias
by excluding individuals who never sought medical help, giving an incomplete picture.
● Volunteer Clinical Trials: Clinical trials with volunteers may not represent the whole population
if those with milder symptoms are more likely to participate, potentially skewing the results.
5. Surveys:
a. Survey
b. Questionnaire Design
EXERCISE
G. Industry codes
ANSWER:
SESSION 2: DESCRIBING DATA
I - Center:
1. Arithmetic Mean (mean): sensitive to outliers (an outlier is an observation or data point that
significantly deviates from the overall pattern or trend of a dataset), so it should not be used when
outliers are present
- The most common measure of central tendency
Example:
2. Median: The median (denoted M) is the 50th percentile or midpoint of the sorted sample data
set x1, x2, . . . , xn
- Requires an ordered array; the most appropriate measure of central tendency for ordinal data;
the “middle” number (50% above, 50% below)
- Median is not affected by extreme values
- Example: 1,2,7,8,9,10
- Position: (6+1)/2 =3.5
- Median: (7+8)/2=7.5
Explanation: Since there are 6 numbers, the median will be the average of the 3rd and 4th
numbers. => The 3rd number is 7, and the 4th number is 8.
3. Mode: the value that appears most frequently in a data set
- Example: 3, 5, 2, 8, 5, 3, 9, 6, 5
Explanation: The number 5 appears most frequently (three times), so the mode of the given set is 5.
- Example:
Lee’s scores: 60, 70, 70, 70, 80 Mean = 70, Median = 70, Mode = 70
Pat’s scores: 45, 45, 70, 90, 100 Mean = 70, Median = 70, Mode = 45
Sam’s scores: 50, 60, 70, 80, 90 Mean = 70, Median = 70, Mode = none
Xiao’s scores: 50, 50, 70, 90, 90 Mean = 70, Median = 70, Modes = 50, 90
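The score examples above can be checked with Python's statistics module:

```python
import statistics

lee = [60, 70, 70, 70, 80]
pat = [45, 45, 70, 90, 100]

# Lee: mean 70, median 70, mode 70
print(statistics.mean(lee), statistics.median(lee), statistics.mode(lee))
# Pat: mean 70, median 70, mode 45
print(statistics.mean(pat), statistics.median(pat), statistics.mode(pat))
# Xiao has two modes, which multimode() reports as a list
print(statistics.multimode([50, 50, 70, 90, 90]))
```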
- Geometric mean return: used to average rates of change over time; Ri is the percentage change from the beginning of period i until its end.
Example:
An investment of $100,000 declined to $50,000 at the end of year one and rebounded to
$100,000 at end of year two:
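The investment example can be computed directly: the arithmetic mean of the two yearly returns (+25%) is misleading, while the geometric mean gives the true average return of 0% (the investment ends where it started):

```python
# Year 1: $100,000 -> $50,000 (R1 = -50%); Year 2: $50,000 -> $100,000 (R2 = +100%)
returns = [-0.50, 1.00]

arithmetic_mean = sum(returns) / len(returns)        # 0.25 -> a misleading +25%

growth = 1.0
for r in returns:
    growth *= (1 + r)                                # cumulative growth factor: 0.5 * 2 = 1.0
geometric_mean = growth ** (1 / len(returns)) - 1    # 0.0 -> the true average return

print(f"arithmetic: {arithmetic_mean:.0%}, geometric: {geometric_mean:.0%}")
```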
a. Ordered array:
- Ordered array: a sequence of data ranked in a particular order
● Shows range (min to max)
● Signals -> variability
o larger ranges indicate higher variability
o smaller ranges suggest lower variability
● May help identify outliers (unusual observations)
● Large data => less useful
Example:
● Raw form: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38
● Ordered array from smallest to largest : 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
b. The stem-and-leaf:
- The stem-and-leaf: a tool of exploratory data analysis (EDA) that seeks to reveal essential data
features in an intuitive way.
● Reveal:
+ Central tendency
+ Dispersion
- A stem-and-leaf plot works well for small samples of integer data with a limited range but
becomes awkward when you have decimal data (e.g., $60.39) or multi-digit data (e.g., $3,857).
c. Dot Plots:
- A dot plot is another simple graphical display of n individual values of numerical data.
- The basic steps in making a dot plot are to (1) make a scale that covers the data range, (2)
mark axis demarcations and label them, and (3) plot each data value as a dot above the scale at its
approximate location.
- A dot plot shows
● variability by displaying the range of the data.
● the center by revealing where the data values tend to cluster and where the midpoint lies.
● some things about the shape of the distribution if the sample is large enough.
d. Frequency distribution:
- Frequency distribution: is a table formed by classifying n data values into k classes called bins
- Example: Suppose you have a dataset representing the scores of 20 students in a math test:
78,85,92,78,90,85,80,88,92,78,85,88,90,92,78,80,85,90,88,80
Score Frequency
78 4
80 3
85 4
88 3
90 3
92 3
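The frequency table above can be reproduced with collections.Counter:

```python
from collections import Counter

scores = [78, 85, 92, 78, 90, 85, 80, 88, 92, 78,
          85, 88, 90, 92, 78, 80, 85, 90, 88, 80]
freq = Counter(scores)           # bin = one distinct score value

for score in sorted(freq):
    print(score, freq[score])    # 78 4, 80 3, 85 4, 88 3, 90 3, 92 3
```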
e. Histogram:
- Definition: A histogram is a graphical representation of a frequency distribution.
- Y-axis:
● The number of data values (or a percentage) within each bin of a frequency distribution
● Frequency, relative frequency, or percentage
- X-axis:
● Ticks show the end points of each bin
● The class boundaries (or class midpoints)
- No gaps between bars
g. Scatter Plots:
- A scatter plot shows n pairs of observations (x1, y1), (x2, y2), . . . , (xn, yn) as dots (or some other
symbol) on an X-Y graph. A scatter plot is a starting point for bivariate data analysis.
Example:
- On a log scale, equal distances represent equal ratios, so a trend shows:
● increasing percent growth as concave upward (a convex function),
● constant percent growth as a straight line,
● declining percent growth as concave downward.
- The distance from 100 to 1,000 equals the distance from 1,000 to 10,000, because both
represent the same 10:1 ratio.
- Suited only to positive data values.
- Usually displayed on the vertical axis, where it reveals more detail for small data values.
II- Deceptive graphs: graphical representations of data that are intentionally or unintentionally
designed in a way that misleads the viewer, distorts the data, or presents information in a manner
that can create a false or exaggerated impression.
1. Error 1: Nonzero Origin
- A nonzero origin will exaggerate the trend. Measured distances do not match the stated
values or axis demarcations. The accounting profession is particularly aggressive in enforcing this rule.
- Although zero origins are preferred, sometimes a nonzero origin is needed to show sufficient
detail. For instance, tracking a satellite's movement relative to a reference point on Earth requires a
nonzero origin to accurately represent its position in calculations, ensuring the relative position is
accounted for.
3. Error 3: Dramatic Titles and Distracting Pictures
- The title often is designed more to grab the reader’s attention than to convey the chart’s
content
4. Error 4: 3-D and Novelty Graphs
- Depth may enhance the visual impact of a bar chart, but it introduces ambiguity in bar height.
7. Error 7: Vague Sources
- Vague sources may indicate that the author lost the citation, didn’t know the data source, or
mixed data from several sources.
8. Error 8: Complex Graphs
- Complicated visual displays make the reader work harder. Keep your main objective in mind.
10. Error 10: Estimated Data
- In a spirit of zeal to include the “latest” figures, the last few data points in a time series are
often estimated. At a minimum, estimated points should be noted.
11. Error 11: Area Trick
- One of the most pernicious visual tricks is simultaneously enlarging the width of the bars as
their height increases, so the bar area misstates the true proportion.
SESSION 3: DESCRIPTIVE STATISTICS
I - Data types:
1. Quartiles:
- The quartiles (denoted Q1, Q2, Q3) are scale points that divide the sorted data into four
groups of approximately equal size, that is, the 25th, 50th, and 75th percentiles, respectively.
2. Box-and-whisker Plot:
A box-and-whisker plot shows:
● center (position of the median Q2)
● variability (width of the “box” defined by Q1 and Q3 and the range between xmin and
xmax).
● shape (skewness if the whiskers are of unequal length and/or if the median is not in
the center of the box)
3. Range
- Simplest measure of variation
- Difference between the largest and the smallest values in a set of data:
- Problems:
● The range considers only the two extreme data values. It is desirable to seek a broad-based
measure of variability that is based on all the data values x1, x2, . . . , xn
- Solution:
● Variance: the average (approximately) of the squared deviations of values from the mean
o Squaring the distances (xi − mean) prevents negative and positive deviations from
canceling each other out and distorting the total.
o Use (n − 1) for the sample variance because the sample mean is itself estimated from
the data, which uses up one degree of freedom; dividing by (n − 1) makes the sample
variance an unbiased estimator of the population variance.
=> Variance measures how spread out the data points are around the measure of
central tendency.
- Standard deviation:
- Interquartile Range:
● Interquartile range = 3rd quartile – 1st quartile
● IQR = Q3 – Q1
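These spread measures can be sketched using Sam's scores from earlier; note that quartile values depend on the interpolation method, and statistics.quantiles uses the "exclusive" method by default:

```python
import statistics

data = [50, 60, 70, 80, 90]

sample_var = statistics.variance(data)        # divides by (n - 1): 250
sample_std = statistics.stdev(data)           # sqrt(250) ≈ 15.81
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles: 55.0, 70.0, 85.0
iqr = q3 - q1                                 # interquartile range: 30.0
data_range = max(data) - min(data)            # simplest measure of variation: 40
```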
- Values outside the inner fences: unusual
- Values outside the outer fences: outliers => the fences are used to determine the outliers
- In a boxplot, the whisker ends mark the smallest and largest values that lie within the inner fences.
- If the median is near the center of the interquartile range, the distribution is roughly symmetric
(mean approximately equal to median); if mean < median, the distribution is left-skewed; if mean > median, it is
right-skewed.
- Population:
● Population summary measures: parameters
● Population mean: the sum of the values in the population divided by the population size, N
+ μ = population mean
+ N = population size
● Population variance:
● Population standard deviation: the square root of the population variance (measures variation in the same units as the data)
4. Shape:
- Describes how data are distributed
- Measures of shape:
- Skewness refers to the symmetry or asymmetry of the frequency distribution
- Chebyshev's Theorem: states that a certain proportion of any data set must fall within a
particular range around the mean, determined by the standard deviation of the data.
Regardless of how the data are distributed, at least (1 − 1/k²) × 100% of the values will fall within
k standard deviations of the mean (for k > 1)
Ex: For k = 2: (1 − 1/2²) × 100% = 75% (μ ± kσ = μ ± 2σ)
Let μ = 72, σ = 8 (standard deviation)
=> At least 75% of the scores will be within the interval 72 ± 2(8) = 72 ± 16, i.e. [56, 88]
(regardless of how the scores are distributed)
Note: Chebyshev's Theorem applies to any population with mean μ and standard deviation σ
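Chebyshev's bound can be wrapped in a small helper that reproduces the example above:

```python
def chebyshev_interval(mu, sigma, k):
    """At least (1 - 1/k**2) of any data lie within mu ± k*sigma (for k > 1)."""
    coverage = 1 - 1 / k**2
    return coverage, (mu - k * sigma, mu + k * sigma)

coverage, (lo, hi) = chebyshev_interval(mu=72, sigma=8, k=2)
print(f"at least {coverage:.0%} of scores lie in [{lo}, {hi}]")  # at least 75% in [56, 88]
```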
- The Empirical Rule: Applies only to normal or bell-shaped distributions.
● If the data distribution is approximately bell-shaped, the interval μ ± kσ contains a known
percentage of the data (about 68%, 95%, and 99.7% for k = 1, 2, 3)
- Z Scores:
● Measures distance from the mean
=> used to compare a raw observation to the population or sample mean
Ex:
Z-score of 2.0: a value is 2.0 standard deviations from the mean
Z score > 3.0 or < -3.0: an outlier
Ex: Mean = 14.0, Standard Deviation = 3.0
=> What is the Z score of the value 18.5?
Answer: Z = (18.5 − 14.0)/3.0 = 1.5, so the value lies 1.5 standard deviations above the mean.
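As a one-line helper:

```python
def z_score(x, mean, std):
    """Distance of x from the mean, in standard-deviation units."""
    return (x - mean) / std

z = z_score(18.5, mean=14.0, std=3.0)
print(z)  # 1.5
```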
● Finance (asset weights in investment portfolios)
- Approximations for Grouped Data: the sample mean, variance, and standard deviation can be
approximated from grouped data by weighting each class midpoint by its class frequency.
5. Linear Relationship:
- The Covariance:
● Measures the direction of the linear relationship between two variables
● Population Covariance:
● Sample Covariance:
- Interpreting:
● cov(X,Y) > 0: X and Y tend to move in the same direction
● cov(X,Y) < 0: X and Y tend to move in opposite directions
● cov(X,Y) = 0: no linear relationship between X and Y (zero covariance does not by itself imply independence)
- Coefficient of Correlation:
● Measures the relative strength of the linear relationship between two variables
● Sample Coefficient of correlation
- Features:
● Unit free
● Ranges between −1 and 1
○ the closer the coefficient is to −1, the stronger the negative linear relationship
○ the closer the coefficient is to +1, the stronger the positive linear relationship
○ the closer the coefficient is to 0, the weaker the linear relationship
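Sample covariance and the correlation coefficient can be sketched as follows (the x and y data are made up to show a perfect positive linear relationship):

```python
import math

def sample_cov(x, y):
    """Sample covariance: sum of cross-deviations divided by (n - 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson r: covariance scaled by the two standard deviations (unit free, in [-1, 1])."""
    sx = math.sqrt(sample_cov(x, x))   # sample standard deviation of x
    sy = math.sqrt(sample_cov(y, y))
    return sample_cov(x, y) / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]          # y is perfectly linear in x
print(correlation(x, y))      # ≈ 1.0
```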
SESSION 4: PROBABILITY
I. Random experiments
1. Sample Space
- Random experiment is an observational process whose results cannot be known in advance.
- The set of all possible outcomes (denoted S) is the sample space for the experiment.
● A sample space with a countable number of outcomes is discrete.
+ Flip a coin, the sample space consists of 2 outcomes S = {Head, Tail}
+ Roll a die, the sample space consists of 6 outcomes S = {1, 2, 3, 4, 5, 6}
- If the outcome of the experiment is a continuous measurement, the sample space cannot be
listed, but it can be described by a rule. E.g. S = {all X such that X > 0}
2. Event
II. Probability
- The probability of an event is a number that measures the relative likelihood that the event
will occur.
● The probability of an event A, denoted P(A), must lie within the interval from 0 to 1:
0 ≤ P(A) ≤ 1
1. Assigning Probability
- Three distinct ways of assigning probability:
a. Empirical Approach
- Counting the frequency of observed outcomes (f) defined in our experimental sample space
and dividing by the number of observations (n). The estimated probability is f/n
b. Classical Approach
- A priori: the process of assigning probabilities before actually observing the event or trying an
experiment.
- When flipping a coin, rolling a pair of dice, drawing cards, picking lottery numbers, or playing
roulette, the nature of the process allows us to envision the entire sample space.
Example: Flipping a fair coin assumes the coin has only two equally likely outcomes: heads (H)
or tails (T). In this case, the probability of getting heads P(H) or tails P(T) is 1/2 or 0.5.
To generalize the classical approach, consider rolling a fair six-sided die. Each face of the die
has one number from 1 to 6. The classical probability of rolling any specific number, say 3, is
1/6, because there is one favorable outcome (rolling a 3) out of six possible outcomes (rolling
a 1, 2, 3, 4, 5, or 6).
c. Subjective Approach
- A subjective probability reflects someone’s informed judgment about the likelihood of an
event - needed when there is no repeatable random experiment.
Note: The subjective approach is usually useful only when information is absent (probabilities
cannot be assigned and no historical data exist). Later, when more data become available, the
problem can be revisited with a different approach.
- The union of A and B is denoted A ∪ B, read as “A or B”.
- If A and B are mutually exclusive, i.e. A ∩ B = φ, then P(A ∩ B) = 0
Example:
● Event A = a day in January. Event B = a day in February
8. Conditional Probability
- The probability of event A given that event B has occurred is a conditional probability.
- Denoted P(A | B). The vertical line “ | ” is read as “given”.
P(A | B) = P(A ∩ B) / P(B), for P(B) > 0
9. Independent Events
- Two events A & B are independent if and only if:
P(A | B) = P(A)
- Events A and B are independent when the probability of one event is not affected by the fact
that the other event has occurred.
Note: If A and B are independent, then P(A | B) = P(A), and the multiplication rule simplifies to P(A ∩ B) = P(A) × P(B)
a. Relationship with Probability
- The odds in favor of event A occurring are:
Odds in favor = P(A)/P(A') = P(A)/[1 − P(A)]
- The odds against event A occurring are:
Odds against = P(A')/P(A) = [1 − P(A)]/P(A)
1. Joint Probabilities
- A joint probability represents the probability of the intersection of two events.
- Found by dividing a cell count (excluding the total row and column) by the total sample size
● P(No GPS∩AC) = 55/100 = 0.55
2. Marginal probability
- The marginal probability of an event is found by dividing a row or column total by the total
sample size.
● P(AC) = 90/100 = 0.9
● P(GPS) = 40/100 = 0.4
3. Conditional probability
- Found by restricting ourselves to a single row or column of the given condition.
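The joint, marginal, and conditional probabilities above can be checked from the contingency counts. The notes give P(No GPS ∩ AC) = 0.55, P(AC) = 0.9, and P(GPS) = 0.4 for 100 cars; the full cell breakdown below is an assumption consistent with those totals:

```python
# Contingency counts for 100 cars; only the totals (AC = 90, GPS = 40) and the
# 55-car "AC, no GPS" cell are from the notes, the rest is inferred for illustration.
counts = {("AC", "GPS"): 35, ("AC", "no GPS"): 55,
          ("no AC", "GPS"): 5, ("no AC", "no GPS"): 5}
n = sum(counts.values())                                        # 100

p_joint = counts[("AC", "no GPS")] / n                          # joint: 0.55
ac_total = counts[("AC", "GPS")] + counts[("AC", "no GPS")]     # 90
p_ac = ac_total / n                                             # marginal: 0.9
p_gps_given_ac = counts[("AC", "GPS")] / ac_total               # conditional: restrict to the AC row
print(p_joint, p_ac, round(p_gps_given_ac, 3))                  # 0.55 0.9 0.389
```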
V. Decision tree
Given AC or no AC:
VI. Bayes’ Theorem
- Bayes’ Theorem is used to revise previously calculated probabilities based on new
information.
- Developed by Thomas Bayes in the 18th Century.
- It is an extension of conditional probability.
- The prior (marginal) probability of an event B is revised after event A has been considered to
yield a posterior (conditional) probability.
- In situations where P(A) is not given, the form of Bayes' Theorem is:
P(Bi | A) = P(A | Bi)P(Bi) / [P(A | B1)P(B1) + P(A | B2)P(B2) + … + P(A | Bk)P(Bk)]
- where:
● Bi = ith event of k mutually exclusive and collectively exhaustive events
● A = new event that might impact P(Bi)
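Bayes' Theorem can be sketched as a small function; the defective-parts numbers below are hypothetical:

```python
def bayes(priors, likelihoods):
    """Posterior P(Bi | A) for mutually exclusive, collectively exhaustive events Bi.

    priors:      P(Bi)   (must sum to 1)
    likelihoods: P(A | Bi)
    """
    total = sum(p * l for p, l in zip(priors, likelihoods))   # P(A), by total probability
    return [p * l / total for p, l in zip(priors, likelihoods)]

# Hypothetical: 1% of parts are defective (B1); a test flags 95% of defective
# parts and 10% of good parts (A = "the test flags the part").
posteriors = bayes(priors=[0.01, 0.99], likelihoods=[0.95, 0.10])
print(round(posteriors[0], 3))   # P(defective | flagged) ≈ 0.088
```

Note how the posterior revises the 1% prior upward, but only to about 9%, because false positives from the many good parts dominate.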
1. Counting Rules
a. Counting Rule 1:
- If event 1 can occur in n1 ways, event 2 in n2 ways, …, and event m in nm ways, then the
sequence of m events can occur in n1 × n2 × … × nm ways
Example: You want to go to a park, eat at a restaurant, and see a movie. There are 3 parks, 4
restaurants, and 6 movie choices. How many different possible combinations are there?
Answer: 3 × 4 × 6 = 72 different combinations.
b. Counting Rule 2:
- If any one of k different mutually exclusive and collectively exhaustive events can occur on
each of n trials, the number of possible outcomes is kⁿ
Example: If you roll a fair die 3 times, there are 6³ = 216 possible outcomes
2. Factorials
- The number of unique ways that n items can be arranged in a particular order is n!, the product
of all integers from 1 to n:
n! = 1 × 2 × 3 × … × (n − 2) × (n − 1) × n
Example: You have five books to put on a bookshelf. How many different ways can these
books be placed on the shelf?
3. Permutations
- Choose X items at random without replacement from a group of n items. The number of
ways of arranging X objects selected from n objects in order is:
nPX = n!/(n − X)!
Example: You have five books and are going to put three on a bookshelf. How many different
ordered arrangements of three books are possible? 5P3 = 5!/2! = 60.
4. Combinations
- A combination is a collection of X items chosen at random without replacement from n items.
The number of ways of selecting X objects from n objects, irrespective of order, is:
nCX = n!/[X!(n − X)!]
Example: You have five books and are going to randomly select three to read regardless of the
order. How many different combinations of books might you select? 5C3 = 10.
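The counting rules above can be verified with Python’s math module:

```python
import math

# Counting Rule 1: m independent choices multiply
assert 3 * 4 * 6 == 72            # parks x restaurants x movies

# Counting Rule 2: k outcomes on each of n trials -> k**n
assert 6 ** 3 == 216              # three rolls of a die

# Factorial: orderings of n distinct items
assert math.factorial(5) == 120   # five books on a shelf

# Permutations: ordered selections of X from n, n!/(n - X)!
assert math.perm(5, 3) == 60

# Combinations: unordered selections of X from n, n!/(X!(n - X)!)
assert math.comb(5, 3) == 10
```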
SESSION 5: DISCRETE PROBABILITY DISTRIBUTION
1. Random Variables
- A random variable is a function or rule that assigns a numerical value to each outcome in the
sample space of a random experiment.
- A discrete random variable has a countable number (integer) of distinct values.
● Some have a clear upper limit (e.g., number of absences in a class of 40 students)
● Others do not (e.g., number of text messages you receive in a given hour).
2. Probability Distributions
a. Discrete Probability Distribution
- A discrete probability distribution assigns a probability to each value of a discrete random
variable X. The distribution must follow the rules of probability:
P(xi) = P(X = xi)
0 ≤ P(xi) ≤ 1
ΣP(xi) = 1 (the probabilities sum to 1)
- More than one random variable value can be assigned to the same probability, but one
random variable value cannot have two different probabilities.
1. Expected Value
- The expected value E(X) of a discrete random variable is the sum of all X-values weighted by
their respective probabilities:
E(X) = Σ xi P(xi)
- Because E(X) is an average (a weighted mean), E(X) is the mean of the distribution and uses the symbol μ.
2. Variance and Standard Deviation
- The variance Var(X) of a discrete random variable is the sum of the squared deviations
about its expected value, weighted by the probability of each X-value:
Var(X) = Σ (xi − μ)² P(xi)
- Var(X) is a weighted average of the squared deviations, so it measures variability around the mean.
- The standard deviation is the square root of the variance and is denoted σ:
σ = √Var(X)
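A short sketch computing E(X), Var(X), and σ; the values and probabilities below are a hypothetical distribution:

```python
# Expected value and variance of a discrete random variable.
# Hypothetical distribution of X (values and probabilities are assumed).
xs = [0, 1, 2, 3]
ps = [0.1, 0.3, 0.4, 0.2]
assert abs(sum(ps) - 1.0) < 1e-12          # must be a valid distribution

mu = sum(x * p for x, p in zip(xs, ps))                 # E(X)
var = sum((x - mu) ** 2 * p for x, p in zip(xs, ps))    # Var(X)
sigma = var ** 0.5                                      # standard deviation

print(round(mu, 4), round(var, 4), round(sigma, 4))     # 1.7 0.81 0.9
```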
III - Uniform distribution
IV - Binomial distribution
1. Bernoulli Experiments
- A random experiment with only 2 outcomes is a Bernoulli experiment.
● One outcome is labeled a “success” (denoted X = 1) and the other a “failure” (denoted X = 0).
● π is P(success); 1 – π is P(failure).
● We assume π < 0.5 for convenience.
2.Binomial Distribution
- The binomial distribution arises when a Bernoulli experiment is repeated n times.
- In a binomial experiment, X = the number of successes in n trials.
-> Binomial random variable X is the sum of n independent Bernoulli random variables.
- P(X = x) is determined by the two parameters n and π. The binomial probability function is:
P(X = x) = nCx × π^x × (1 − π)^(n−x)
Example: A fair coin is flipped 4 times. What is the probability of getting exactly 2 heads?
→ Plugging in the values, we get: P(X = 2) = 4C2 × (0.5)^2 × (1 − 0.5)^(4−2) = 6 × 0.25 × 0.25
= 0.375
3. Binomial Shape
- π < 0.5 : skewed right
- π = 0.5 : symmetric
- π > 0.5 : skewed left
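The binomial probability function can be sketched directly from the formula, with the coin-flip example above as a check:

```python
from math import comb

def binom_pmf(x, n, pi):
    """P(X = x) for a binomial with n trials and success probability pi."""
    return comb(n, x) * pi**x * (1 - pi)**(n - x)

# Fair coin flipped 4 times: probability of exactly 2 heads (example above)
assert abs(binom_pmf(2, 4, 0.5) - 0.375) < 1e-12

# Sanity check: the pmf sums to 1 over x = 0..n
assert abs(sum(binom_pmf(x, 4, 0.5) for x in range(5)) - 1.0) < 1e-12
```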
V - Hypergeometric distribution
- The hypergeometric distribution is similar to the binomial except that sampling is without
replacement from a finite population of N items
- The trials are not independent and the probability of success is not constant from trial to trial
- Finding probability of “X=xi” items of interest in the sample (n) where there are “s” items of
interest in the population (N)
- The hypergeometric distribution has three parameters:
● N (the number of items in the population)
● n (the number of items in the sample)
● s (the number of successes in the population)
- Hypergeometric Distribution Formula:
P(X = x) = [sCx × (N−s)C(n−x)] / NCn
Where
● N = population size
● s = number of items of interest in the population
● N – s = number of events not of interest in the population
● n = sample size
● x = number of items of interest in the sample
● n – x = number of events not of interest in the sample
Example: 3 different computers are selected from 10 in the department. 4 of the 10 computers
have illegal software loaded. What is the probability that 2 of the 3 selected computers have illegal
software loaded?
● N = 10
● s = 4
● n = 3
● x = 2
-> The probability that 2 of the 3 selected computers have illegal software loaded is 0.30, or 30%.
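A sketch of the hypergeometric formula, checked against the computer example above:

```python
from math import comb

def hypergeom_pmf(x, N, s, n):
    """P(X = x): x successes in a sample of n drawn without replacement
    from a population of N items containing s successes."""
    return comb(s, x) * comb(N - s, n - x) / comb(N, n)

# Example above: N = 10 computers, s = 4 with illegal software, sample n = 3
p = hypergeom_pmf(2, N=10, s=4, n=3)
assert abs(p - 0.30) < 1e-12   # 4C2 * 6C1 / 10C3 = 36/120 = 0.30
```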
VI - Geometric distribution
- The geometric distribution describes the number of Bernoulli trials until the first success.
Example:
Suppose you are playing a game of darts. The probability of hitting the bullseye is 0.2.
What is the probability of hitting the bullseye on your second try?
Solution: In this case, we are interested in the probability of getting a success (hitting
the bullseye) after one failure (missing the bullseye).
The formula for the geometric distribution is: P(X = x) = π(1 − π)^(x−1)
where:
● P(X = x) is the probability of getting the first success on the xth trial.
● π is the probability of success on each individual trial.
● x is the number of the trial on which the first success occurs.
In this case, π = 0.2 and x = 2.
→ Plugging these values into the formula, we get: P(X = 2) = 0.2 × (1 − 0.2)^(2−1) = 0.16
Therefore, the probability of hitting the bullseye on your second try is 0.16.
a. Poisson Distribution Formula
P(X = x) = λ^x × e^(−λ) / x!
where λ is the mean number of events per unit of time or space.
- The Poisson distribution is always right-skewed. The larger the λ, the less right-skewed the distribution.
- Poisson Distribution Example
Example: An average number of houses sold per day by a real estate company is 2.
What is the probability that 3 houses will be sold tomorrow?
X = 3; λ = 2: P(X = 3) = 2³ × e^(−2) / 3! = 8 × 0.1353 / 6 ≈ 0.1804
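A sketch of the Poisson formula, using the house-sales example above:

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return lam**x * exp(-lam) / factorial(x)

# Example above: mean 2 houses sold per day, P(exactly 3 sold tomorrow)
p = poisson_pmf(3, 2)
print(round(p, 4))   # 0.1804
```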
- Both the binomial and hypergeometric involve sample size of n and the number of successes
X.
- The binomial sample is with replacement while the hypergeometric sample is without
replacement
- Rule of Thumb: If n/N < 0.05, we can use the binomial approximation to the hypergeometric,
using sample size n and π = s/N.
- A linear transformation of a random variable X is performed by adding a constant, multiplying
by a constant, or both
- Two useful rules about the mean and standard deviation of a transformed random variable aX + b,
where a and b are any constants (a ≥ 0):
● Rule 1: μ(aX+b) = aμ + b — adding a constant to all X-values shifts the mean but leaves the standard deviation unchanged.
● Rule 2: σ(aX+b) = aσ — multiplying all X-values by a constant multiplies the standard deviation by that constant.
Example: Professor Hardtack gave a tough exam whose scores had μ = 40 and σ = 10.
- If he adds 20 points to every student’s score (Rule 1), the mean rises by 20 points, but σ stays 10.
- If instead he multiplies every exam score by 1.5 (Rule 2), the mean rises to 40 × 1.5 = 60 and the
standard deviation rises from 10 to 15, increasing the dispersion. In other words, this policy would
“spread out” the students’ exam scores; some scores might even exceed 100.
● Covariance
- When X and Y are dependent, the covariance of them, denoted by Cov(X,Y) or σxy, describes
how the variables vary in relation to each other.
- Cov(X,Y) > 0 : indicates that the two variables move in the same direction
- Cov(X,Y) < 0 : indicates that the two variables move in opposite directions.
- We use both the covariance and the variances of X and Y to calculate the standard deviation
of the sum of X and Y.
SESSION 6: CONTINUOUS PROBABILITY DISTRIBUTION
2. Probabilities as Areas
- Continuous probability functions are smooth curves.
● Unlike discrete distributions, the area at any single point = 0.
● The entire area under any PDF must be 1.
- P(a < X < b) is the integral of the probability density function f(x) over the interval from a to b.
Because P(X = a) = 0, the expression P(a < X < b) is equal to P(a ≤ X ≤ b).
II - Uniform continuous distribution
1.Characteristics of the Uniform Distribution:
- The uniform continuous distribution has equal probabilities for all possible outcomes of the
random variable; it is also known as the rectangular distribution, denoted U(a, b) for short.
- If X is a random variable that is uniformly distributed between a and b:
● The PDF has constant height: f(x) = 1/(b − a) for a ≤ X ≤ b
● The CDF is:
P(X ≤ x) = (x − a)/(b − a)
SUMMARY TABLE
● Symmetrical
- The random variable has an infinite theoretical range: - ∞ to + ∞
- The Normal PDF
● The formula for the normal PDF is:
f(x) = [1/(σ√(2π))] × e^(−(x−μ)²/(2σ²))
where μ is the mean and σ is the standard deviation.
- The probability for a range of values is measured by the area under the curve
- The shape of the distribution is unaffected by the z transformation; only the scale changes.
We can express the problem in original units (X) or in standardized units (Z):
z = (x − μ)/σ
2. Finding Normal Probabilities
- Normal Areas from Appendix C-2. Appendix C-2 shows cumulative normal areas from the left
to z.
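Where a printed table is unavailable, the cumulative normal area that Appendix C-2 tabulates can be approximated with the error function; a minimal sketch:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Cumulative standard normal area from the left up to z
    (what Appendix C-2 tabulates), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Symmetry: P(Z < -1.31) equals P(Z > 1.31)
assert abs(norm_cdf(-1.31) - (1 - norm_cdf(1.31))) < 1e-12
print(round(norm_cdf(-1.31), 4))   # 0.0951
```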
V - Normal Approximations
1. Normal Approximation to the Binomial
- Normal approximation to the binomial: If n is sufficiently large then we can actually use the
normal distribution to approximate the probabilities related to the binomial distribution
Example: For 4, 16, and 64 flips of a fair coin with X defined as the number of heads in n tries. As
sample size increases, it becomes easier to visualize a smooth, bell-shaped curve overlaid on the bars.
- The logic of this approximation is that as n becomes large, the discrete binomial bars become
more like a smooth, continuous, normal curve
- Rule of thumb: when nπ > 10 and n(1- π) > 10, then it is appropriate to use the normal
approximation to the binomial.
- The binomial mean and standard deviation are used as the normal parameters: µ = nπ and σ = √(nπ(1 − π)).
* Tips: Usually this type of exercise tells you that the count follows a Poisson distribution with a given mean; use µ = λ and σ = √λ for the normal approximation.
Example: On Wednesday between 10 a.m. and noon, customer billing inquiries arrive at a mean rate
of 42 inquiries per hour at Consumers Energy. What is the probability of receiving more than 50 calls?
● The standardized Z-value for the event “more than 50” uses the continuity-corrected point
50.5: P(X > 50.5) = P(Z > 1.31), since
z = (x − µ)/σ = (50.5 − 42)/6.48074 ≃ 1.31
● Using Appendix C-2 we look up P(Z < -1.31) = .0951, which is the same as P(Z > 1.31) because
the normal distribution is symmetric.
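The whole approximation can be checked in a few lines (`norm_cdf` is an erf-based cumulative normal sketch; the small difference from the table value .0951 comes from rounding z to 1.31):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Cumulative standard normal area up to z, via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Normal approximation to the Poisson example above:
# lam = 42 inquiries/hour, P(X > 50) with continuity correction at 50.5
lam = 42
mu, sigma = lam, sqrt(lam)          # Poisson mean and standard deviation
z = (50.5 - mu) / sigma
p = 1 - norm_cdf(z)
print(round(z, 2), round(p, 4))     # 1.31 0.0948 (table lookup at z = 1.31 gives .0951)
```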
VI - Exponential distribution
1.Characteristics of the Exponential Distribution (Usually for continuous variables)
- Often used to model the length of time between two occurrences of an event (the time
between 2 events that happen)
Examples:
● Time between trucks arriving at an unloading dock
● Time between transactions at an ATM Machine
● Time between phone calls to the main operator
Examples: Between 2 p.m. and 4 p.m. on Wednesday, patient insurance inquiries arrive at Blue
Choice insurance at a mean rate of 2.2 calls per minute. What is the probability of waiting more than
30 seconds for the next call?
P(X > 0.50) = e^(−λx) = e^(−(2.2)(0.5)) = .3329, or 33.29%
● There is about a 33 percent chance of waiting more than 30 seconds before the next call
arrives. Since x = 0.50 is a point that has no area in a continuous model, P(X ≥ 0.50) and P(X >
0.50) refer to the same event (unlike, say, a binomial model, in which a point does have a
probability). The probability that 30 seconds or less (0.50 minute) will be needed before the
next call arrives is: P(X ≤ 0.50) = 1 − e^(−(2.2)(0.5)) = 1 − .3329 = .6671
- The count of customer arrivals is a discrete random variable: Poisson distribution. When the
count of customer arrivals has a Poisson distribution
● The distribution of the time between two customer arrivals will have an exponential
distribution
SUMMARY TABLE
Finding Probability
- Probability of waiting more than x: P(X > x) = e^(−λx)
- Probability of waiting less than x: P(X < x) = 1 − e^(−λx)
Example: Customers arrive at the service counter at the rate of 20 per hour. What is the
probability that the arrival time between consecutive customers is less than 6 minutes?
With λ = 20 per hour and x = 6/60 = 0.1 hour: P(X < 0.1) = 1 − e^(−(20)(0.1)) = 1 − e^(−2) = .8647
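Both exponential examples above can be verified with a short sketch:

```python
from math import exp

def expon_tail(x, lam):
    """P(X > x) for an exponential distribution with rate lam (events per unit time)."""
    return exp(-lam * x)

# Example above: 2.2 calls/minute, P(waiting more than 30 seconds = 0.5 minute)
assert abs(expon_tail(0.5, 2.2) - 0.3329) < 1e-3

# Example above: 20 customers/hour, P(gap less than 6 minutes = 0.1 hour)
p = 1 - expon_tail(0.1, 20)
print(round(p, 4))   # 0.8647
```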
SESSION 7: SAMPLING, SAMPLING DISTRIBUTION, CONFIDENCE
INTERVAL
- A sample statistic: a random variable whose value depends on items included in the random
sample.
- Some samples may represent the population well, while other samples could differ greatly
from the population (particularly if the sample size is small)
⇒ In larger samples, the sample means would tend to be even closer to μ. This fact is the basis
for statistical estimation.
- To make inferences about a population we must consider four factors:
● Sampling variation (uncontrollable).
● Population variation (uncontrollable).
● Sample size (controllable).
● Desired confidence in the estimate (controllable).
- An estimator: a statistic derived from a sample to infer the value of a population parameter.
- An estimate: the value of the estimator in a particular sample.
- Sample estimator of population parameters.
Examples of estimators
2. Sampling error
- Sampling error is the difference between an estimate and the corresponding population
parameter (the sampled value and the true population value).
- This is because you cannot choose a sample that is perfectly representative of the population.
Sampling error = X̄ − μ
3. Properties of Estimators
a. Bias
- The bias is the difference between the expected value of the estimator and the true
parameter.
Bias = E(X̄) − μ
- An unbiased estimator neither overstates nor understates the true parameter on average:
E(X̄) = μ
- Sample mean (X̄) and sample proportion (p) are unbiased estimators of μ and π.
- Sampling error is an inevitable risk of random sampling, whereas bias
is systematic.
b. Efficiency
- Efficiency refers to the variance of the estimator’s sampling distribution
- Smaller variance means a more efficient estimator. We prefer the minimum variance
estimator
c. Consistency
- A consistent estimator converges toward the parameter being estimated as the sample size
increases: the larger the sample, the closer its characteristics are to those of the whole
population.
- The variances of the three estimators X̄, s, and p diminish as n increases, so all are consistent
estimators.
- Sampling distribution of an estimator: the probability distribution of all possible values the
statistic may assume when a random sample of size n is taken.
- The central limit theorem establishes that, in many situations, for independent and identically
distributed random variables, the sampling distribution of the standardized sample mean
tends toward the standard normal distribution even if the population is not originally normally
distributed. This lets normal-based methods be applied to many problems involving other
types of distributions, reducing the burden of calculation and analysis.
- The sampling error of the sample mean is described by its standard deviation, called the
standard error of the mean:
σX̄ = σ/√n
● Even if your population is not normal, by the Central Limit Theorem, if the sample size
is large enough, the sample means will have approximately a normal distribution.
a. Uniform Population
- The Rule of Thumb says that n ≥ 30 is required to ensure a normal distribution for the sample
mean, but actually a much smaller n will suffice if the population is symmetric.
- The Central Limit Theorem predicts:
● The distribution of sample means drawn from the population will be normal.
● The standard error of the sample mean X will decrease as the sample size increases.
b. Skewed Population
- The Central Limit Theorem predicts
● The distribution of sample means drawn from any population will approach normality.
● The standard error of the sample mean X will diminish as sample size increases.
- In highly skewed populations, even n ≥ 30 will not ensure normality, though it is not a bad
rule.
- In severely skewed populations, the mean is a poor measure of center to begin with due to
outliers.
c. Range of Sample Means
- The Central Limit Theorem permits us to define an interval within which the sample means are
expected to fall.
- As long as the sample size n is large enough, we can use the normal distribution regardless of
the population shape (or any n if the population is normal to begin with).
- We use the familiar z-values for the standard normal distribution. If we know μ and σ, the CLT
allows us to predict the range of sample means for samples of size n:
- You can make the standard error σ/√n as small as you want by increasing n. The grand mean X̿ of
many sample means will be very close to µ.
b. Choosing a Confidence Level
- In order to gain confidence, we must accept a wider range of possible values for μ. Greater
confidence implies loss of precision (i.e., a greater margin of error)
- Common confidence levels are 90%, 95%, and 99%.
- If σ is known and the population is normally distributed, then we can safely construct the
confidence interval for μ.
- If σ is known but we do not know whether the population is normal
- Rule of thumb: n ≥ 30 is sufficient to assume a normal distribution for X (by the CLT) as
long as the population is reasonably symmetric and has no outliers.
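A minimal sketch of the z-based confidence interval for μ with σ known; the sample numbers below are hypothetical:

```python
from math import sqrt

# Confidence interval for mu with known sigma: x_bar ± z_{alpha/2} * sigma/sqrt(n)
# Hypothetical numbers; 1.96 is the familiar z-value for 95% confidence.
x_bar, sigma, n = 50.0, 8.0, 64
z = 1.96
margin = z * sigma / sqrt(n)        # 1.96 * 8 / 8 = 1.96
lo, hi = x_bar - margin, x_bar + margin
print(round(lo, 2), round(hi, 2))   # 48.04 51.96
```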
- Confidence intervals using t will be wider (other things being the same), because tα/2 is always greater
than zα/2.
b. Degrees of Freedom
- Knowing the sample size allows us to calculate a parameter called degrees of freedom (d.f)
used to determine the value of the t statistic used in the confidence interval formula.
- The degree of freedom (d.f) is the number of observations that are free to vary after sample
mean has been calculated.
- They depend on the sample size (n) and the number of restrictions (usually 1).
c. Comparison of z and t
- As degrees of freedom increase, the t-values approach the familiar normal z-values.
- Standard error σp will decrease as n increases like the standard error for 𝑋. We say that p = x/n
is a consistent estimator of π.
a. Standard Error of the Proportion
- The standard error of the proportion is denoted σp:
σp = √(π(1 − π)/n)
- σp is largest when the population proportion is near π = .50 and smallest
when π is near 0 or 1.
b. Confidence Interval for π
e. Rule of Three
- If in n independent trials, no events occur, the upper 95% confidence bound is approximately
3/n
- Example: When proofreading, if you read about 50 pages and find no mistakes, the upper 95%
confidence bound on the chance of a mistake per page is approximately 3/50.
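A sketch of the confidence interval for a proportion and of the Rule of Three; the sample numbers below are hypothetical:

```python
from math import sqrt

# Confidence interval for pi: p ± z_{alpha/2} * sqrt(p(1-p)/n)
# Hypothetical sample: 40 successes in n = 200 trials, 95% confidence.
x, n = 40, 200
p = x / n                            # 0.2
z = 1.96
margin = z * sqrt(p * (1 - p) / n)
lo, hi = p - margin, p + margin
print(round(lo, 4), round(hi, 4))    # 0.1446 0.2554

# Rule of Three: zero events in n trials -> upper 95% bound approximately 3/n
n_pages = 50
print(3 / n_pages)                   # 0.06
```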
6. Sample size determination for a mean
a. Sample Size to Estimate μ
b. Estimate σ
- Method 1: Take a Preliminary Sample
● Take a small preliminary sample and use the sample estimates in place of σ. This method is
the most common, though its logic is somewhat circular (i.e., take a sample to plan a sample).
- Method 2: Assume Uniform Population
● Estimate upper and lower limits a and b and set σ = [(b - a)2/12 ]1/2 .
- Method 3: Assume Normal Population
● Estimate upper and lower bounds a and b and set σ = (b − a)/6. This assumes normality with
most of the data within μ ± 3σ, so the range is 6σ.
- Method 4: Poisson Arrivals
● In the special case when λ is a Poisson arrival rate, then σ = √λ.
- Sample size to estimate a proportion π with margin of error E:
n = (zα/2 / E)² × π(1 − π)
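The sample-size logic above can be sketched as follows; the range a, b and the margin E below are hypothetical:

```python
from math import ceil, sqrt

# Sample size to estimate mu with margin of error E: n = (z * sigma / E)^2
# Hypothetical: sigma estimated from a uniform range via Method 2.
a, b = 10, 50
sigma = sqrt((b - a) ** 2 / 12)      # Method 2: uniform-population estimate
z, E = 1.96, 2.0
n = ceil((z * sigma / E) ** 2)       # always round the sample size up
print(round(sigma, 3), n)            # 11.547 129
```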
- The chi-square test assesses whether the observed frequencies of outcomes in a contingency table
differ significantly from the expected frequencies.
- Note: the chi-square interval for a variance is mainly of theoretical interest; in practice it has
relatively few applications.
- If the population is normal, construct a confidence interval for the population variance σ2
using the chi-square distribution with degrees of freedom equal to d.f. = n – 1
- Lower-tail and upper-tail percentiles for the chi-square distribution (denoted χ²L and χ²U) can
be found in Appendix E.
EXERCISE
3. What is the approximate width of an 80% confidence interval for the true population
proportion if there are 12 successes in a sample of 80?
A. ± .078
B. ± .066
C. ± .051
D. ± .094
Answer: C. p = 12/80 = .15; for 80% confidence, zα/2 = 1.28; width = ±1.28 × √(.15 × .85/80) ≈ ±.051
4. A highway inspector needs an estimate of the mean weight of trucks crossing a bridge on the
Interstate highway system. She selects a random sample of 49 trucks and finds a mean of 15.8 tons
with a sample standard deviation of 3.85 tons. The 90 percent confidence interval for the
population mean is
A. 14.72 to 16.88 tons.
B. 14.90 to 16.70 tons.
C. 14.69 to 16.91 tons.
D. 14.88 to 16.72 tons.
Degrees of freedom = df = n - 1 = 49 - 1 = 48
At 90% confidence level, the t-value is:
α = 1 - 90% = 1 - 0.90 = 0.1
α / 2 = 0.1/ 2 = 0.05
tα /2,df = t 0.05,48 = 1.677
Margin of error = E = tα /2,df* (s /√n)
= 1.677* ( 3.85/√49)
= 0.92235
Margin of error = E = 0.92
The 90% confidence interval estimate of the population mean is,
x-E<μ< x+E
15.8-0.92 < μ < 15.8+0.92
14.88 < μ < 16.72
(14.88,16.72)
The 90% confidence interval estimate of the population mean is : 14.88 to 16.72
5. Last week, 108 cars received parking violations in the main university parking lot. Of these, 27
had unpaid parking tickets from a previous violation. Assuming that last week was a random sample
of all parking violators, find the 95 percent confidence interval for the percentage of parking violators
that have prior unpaid parking tickets.
A. 18.1% to 31.9%
B. 16.8% to 33.2%
C. 15.3% to 34.7%
D. 19.5% to 30.5%
Point estimate = sample proportion = p = x / n = 27/108 = 0.25
1 - p = 1 - 0.25 = 0.75
At 95% confidence level:
α = 1 - 0.95 = 0.05
α/2 = 0.025
Zα/2 = Z0.025 = 1.96
Margin of error = E = Zα/2 * √((p * (1 - p)) / n)
= 1.96 * √((0.25 * 0.75) / 108)
= 0.0817
A 95% confidence interval for the population proportion π is:
p - E < π < p + E
0.25 - 0.0817 < π < 0.25 + 0.0817
0.1683 < π < 0.3317
The 95% confidence interval for the population proportion is 16.8% to 33.2% → Answer: B
SESSION 8: ONE-SAMPLE HYPOTHESIS TESTS
- Population proportion: often denoted by the symbol π (pi), is a statistical measure that
represents the proportion or percentage of a specific characteristic within an entire population. The
population proportion is calculated by dividing the number of elements in the population with the
specific characteristic by the total number of elements in the population
- Some examples of hypothesis testing are:
● Testing whether the average height of men is different from the average height of women.
● Testing whether the proportion of voters who prefer candidate A is greater than 50%.
● Testing whether there is a positive correlation between income and education level.
● Testing whether the mean blood pressure of patients who receive a new drug is lower than
the mean blood pressure of patients who receive a placebo.
● Testing whether there is a relationship between gender and academic performance.
c. Relationship between the two types of error
- The relationship between Type I Error and Type II Error is that they are inversely related. That
is, if you decrease the probability of one type of error, you increase the probability of the other type
of error, and vice versa.
- This is because the two types of errors depend on the same factors, such as the sample size,
the effect size, and the variability of the data. Therefore, there is a trade-off between minimizing
Type I Error and minimizing Type II Error. You have to balance the risks and consequences of both
types of errors and choose an appropriate level of significance and power for your test.
3. Critical value
a. Statistical Hypothesis
- Sample mean x̄ is close to the stated population mean μ
➔ H0 is not rejected. And vice versa.
- The question: how “close” is close enough to not reject the hypothesis?
➔ The answer is based on the critical value.
b. Decision Rule
- A decision rule usually specifies a test statistic and a critical value (or a critical region) that
determines the rejection or acceptance of the null hypothesis.
- The test statistic is a numerical summary of the data that reflects the strength of the evidence
against the null hypothesis.
- The critical value (or region) is a threshold that defines the boundary between rejecting and
failing to reject the null hypothesis.
- Depending on the type of test statistic and the alternative hypothesis, there are different
types of decision rules:
● Upper-tailed test: Reject the null hypothesis if the test statistic is greater than the critical
value.
● Lower-tailed test: Reject the null hypothesis if the test statistic is less than the critical value.
● Two-tailed test: Reject the null hypothesis if the test statistic is either greater than the upper
critical value or less than the lower critical value.
- The decision rule is based on a chosen significance level, which is the probability of rejecting
the null hypothesis when it is true. The significance level is usually denoted by α, and it determines
the size of the critical region. A smaller α means a more stringent decision rule, and a larger α means
a more lenient decision rule.
For example, when the significance level is α = 0.05 (two-tailed), the corresponding confidence level
is 95%; similarly, α = 0.01 corresponds to 99% and α = 0.10 to 90%.
4. Hypothesis tests: Testing a mean
● Step 2: Specify the decision rule
+ We use the level of significance (α) to find the critical value of the test statistic that
determines the threshold for rejecting the null hypothesis.
+ Example: A right-tailed test with α = .05 and known σ, then the critical value of z will be 1.645,
therefore our decision rule is:
Reject H0 if z > 1.645
Otherwise do not reject H0
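The two steps above (compute the test statistic, apply the decision rule) can be sketched as follows; the numbers are hypothetical:

```python
from math import sqrt

# Right-tailed one-sample z test for a mean (sigma known), per the steps above.
# Hypothetical numbers: H0: mu = 100, alpha = .05.
mu0, sigma, n, x_bar = 100.0, 15.0, 36, 104.5
z = (x_bar - mu0) / (sigma / sqrt(n))   # test statistic
critical = 1.645                        # right-tailed critical value, alpha = .05
reject = z > critical
print(round(z, 2), reject)              # 1.8 True
```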
● Compare the p-value with α
+ If p-value < α , reject H0
+ If p-value ≥ α , do not reject H0
- Sample proportion in the category of interest: p
- If X and n − X are at least 5, then p can be approximated by a normal distribution with mean π
and standard deviation √(π(1 − π)/n).
- Formula to calculate Z:
● If p is approximately normal: z = (p − π0)/√(π0(1 − π0)/n)
● An equivalent form (in terms of the number in the category of interest, X): z = (X − nπ0)/√(nπ0(1 − π0))
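A sketch of the proportion z statistic, checked against a sample of 24 successes in 64 trials with π0 = .5 (the Gondor emissions exercise in these notes):

```python
from math import sqrt

def prop_z(p_hat, pi0, n):
    """One-sample z statistic for a test of a proportion."""
    return (p_hat - pi0) / sqrt(pi0 * (1 - pi0) / n)

# 24 of 64 cars meet the standard; H0: pi = .5
z = prop_z(24 / 64, 0.5, 64)
print(z)   # -2.0
```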
Sample Exercises
1. “I believe your airplane’s engine is sound,” states the mechanic. “I’ve been over it carefully,
and can’t see anything wrong, I’d be happy to tear the engine down completely for an internal
inspection at a cost of $1,500. But I believe that roughness you heard in the engine on your last
flight was probably just a bit of water in the fuel, which passed harmlessly through the engine and
is now gone”. As the pilot considers the mechanic’s hypothesis, the cost of Type I error is
A. The pilot will experience the thrill of no-engine flight.
B. The pilot will be out $1,500 unnecessarily.
C. The mechanic will lose a good customer.
D. Impossible to determine without knowing a.
Answer: B
Solution: The cost of a Type I error is the cost of rejecting a true null hypothesis. In this scenario,
a Type I error would occur if the pilot decides to pay $1,500 for the internal inspection, even though
the engine is sound and the roughness was caused by water in the fuel. The cost of this error would
be $1,500 unnecessarily.
2. Guidelines for the Jolly Blue Giant Health Insurance Company say that the average
hospitalization for a triple hernia operation should not exceed 30 hours. A diligent auditor studied
records of 16 randomly chosen triple hernia operations at Hackmore Hospital and found a mean
hospital stay of 40 hours with a standard deviation of 20 hours. “Aha!” she cried, “the average stay
exceeds the guideline.” State her hypothesis and test it at α = .025
Hint:
To test this hypothesis, we can use a one-tailed t-test with a significance level of α = .025. The
null hypothesis is that the true mean hospital stay is equal to or less than 30 hours, while the
alternative hypothesis is that the true mean hospital stay is greater than 30 hours.
Calculate the test statistic, then compare it to the critical value to decide whether to reject the
null hypothesis:
● test statistic: t = (40-30)/(20/√16) = 2
● critical value: α = .025, df = n - 1 = 16 - 1 = 15 => critical value = 2.131
Answer: We fail to reject the null hypothesis. There is not enough evidence at the 0.025
significance level to conclude that the average hospitalization for a triple hernia operation exceeds 30
hours based on the sample data.
3. In the nation of Gondor, the EPA requires that half the new cars sold will meet a certain
particulate emission standard a year later. A sample of 64 one-year-old cars revealed that only 24
met the particulate emission standard. Test the hypotheses to see whether the proportion is below
the requirement.
Hint:
To test this hypothesis, we can use a one-tailed z-test with a significance level of α = .05. The null
hypothesis is that the true proportion of cars meeting the standard is equal to or greater than 0.5,
while the alternative hypothesis is that the true proportion is less than 0.5.
Calculate the test statistic (using the formula for a proportion), then compare it to the critical value
to decide whether to reject the null hypothesis.
● test statistic (z score): z = (24/64 - 0.5)/√(0.5 × (1 - 0.5)/64) = -2
● critical value = -1.645 (α = .05)
Answer: The null hypothesis is rejected, we can conclude that there is sufficient evidence to
suggest that the proportion of cars meeting the particulate emission standard in Gondor is less than
0.5, as required by the EPA.
SESSION 9: TWO-SAMPLE HYPOTHESIS TESTS
I- Independent samples
- Independent samples are often used to test the difference between two population means or
proportions
- Assumptions:
● Samples are randomly and independently drawn
● Populations are normally distributed or both sample sizes are at least 30
1. Situation 1
Example: You want to compare the average hours of sleep between students from two different
schools. You know the population variances for both schools.
2. Situation 2
Example: You want to compare the average study hours per week of two different groups of
students. You don't know the population variances, but you assume they are equal.
3. Situation 3
Example: You want to assess whether there's a difference in the average grades of two different
courses. You don't know the population variances, and you cannot assume them to be equal.
Note: If the sample sizes are equal, the Case 2 and Case 3 test statistics will always be identical,
but the degrees of freedom (and hence the critical values) may differ. If you have no information
about the population variances, then the best choice is Case 3.
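Case 3 (unequal, unknown variances, i.e. Welch's test) can be sketched as follows; the summary statistics below are hypothetical:

```python
from math import sqrt

# Case 3: variances unknown and not assumed equal (Welch's t statistic).
# Hypothetical summary statistics for two independent samples.
x1, s1, n1 = 78.0, 10.0, 25
x2, s2, n2 = 72.0, 12.0, 30

se = sqrt(s1**2 / n1 + s2**2 / n2)      # standard error of the difference
t = (x1 - x2) / se

# Welch-Satterthwaite degrees of freedom
num = (s1**2 / n1 + s2**2 / n2) ** 2
den = (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
df = num / den
print(round(t, 3), round(df, 1))        # 2.023 53.0
```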
4. Paired samples
- Suppose you measure each individual’s blood pressure before and after a treatment. You then
have two sets of observations: the measurements before and after treatment for each individual.
These measurements are paired because each "after" measurement is related to a specific
"before" measurement.
➔ Use a paired t-test.
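A minimal paired t-test sketch on hypothetical before/after data:

```python
from math import sqrt

# Paired t test: work with the differences, as in the before/after setup above.
# Hypothetical before/after measurements for 5 subjects.
before = [140, 152, 138, 160, 145]
after = [132, 148, 135, 151, 140]

d = [b - a for b, a in zip(before, after)]          # paired differences
n = len(d)
d_bar = sum(d) / n                                  # mean difference
s_d = sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))   # sd of differences
t = d_bar / (s_d / sqrt(n))                         # df = n - 1
print(round(d_bar, 2), round(t, 3))                 # 5.8 5.01
```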
5. Population proportion test
- This test is used for hypotheses and confidence intervals concerning the difference of two
population proportions, π1 − π2.
6. Population variance test
Sample Exercises
1. Two well-known aviation training schools are being compared using random samples of their
graduates. It is found that 70 of 140 graduates of Fly-More Academy passed their FAA exams on the
first try, compared with 104 of 260 graduates of Blue Yonder Institute. Test the pass rates for
equality at ∝ = .05
Hint:
Firstly, calculate the pooled sample proportion (p) under the assumption that the pass rates are
equal, then calculate the test statistic (z). Compare it to the critical value to decide whether to
reject the hypothesis of equal pass rates.
2. Suppose you are working for a fitness company, and you want to determine if a new workout
program is more effective in terms of weight loss compared to the company's previous program.
You select a sample of 20 participants who have completed both programs, and you record the
weight loss (in pounds) for each participant before and after completing both programs. The data is
as follows:
Before Program A: 190, 195, 200, 185, 205, 210, 192, 198, 187, 193, 199, 204, 189, 194, 201,
188, 203, 197, 191, 206
After Program A: 180, 187, 195, 178, 198, 205, 185, 190, 180, 182, 192, 199, 178, 183, 197, 175,
198, 191, 177, 201
Perform a paired-samples t-test to determine if there is a statistically significant difference in
weight loss before and after completing Program A.
Hint:
Step 1: Calculate the differences between the "Before" and "After" weights for each participant.
Step 2: Calculate the mean (average) of the differences.
Step 3: Calculate the standard deviation of the differences. You can use the formula for the
sample standard deviation.
Step 4: Calculate the t-statistic.
Step 5: Determine the degrees of freedom (df). In a paired-samples t-test, df is equal to the
number of pairs minus 1.
Step 6: Find the critical t-value for your desired level of significance (alpha). Let's assume a 0.05
level of significance (95% confidence level) for a two-tailed test. You can use a t-table or a calculator
to find the critical t-value.
Step 7: Compare the calculated t-statistic to the critical t-value to determine statistical
significance.
Answer:
Conclusion: Since the calculated t-statistic falls outside the range of the critical t-values, we can
reject the null hypothesis. This means there is a statistically significant difference in weight loss before
and after completing Program A. In other words, the new workout program appears to be more
effective in terms of weight loss compared to the previous program.
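Steps 1 through 7 can be sketched in Python with only the standard library (the critical value t.025 = 2.093 for 19 d.f. is taken from a t table):

```python
import math
import statistics

# Paired-samples t test following Steps 1-7 (data as given in the exercise).
before = [190, 195, 200, 185, 205, 210, 192, 198, 187, 193,
          199, 204, 189, 194, 201, 188, 203, 197, 191, 206]
after  = [180, 187, 195, 178, 198, 205, 185, 190, 180, 182,
          192, 199, 178, 183, 197, 175, 198, 191, 177, 201]

d = [b - a for b, a in zip(before, after)]      # Step 1: differences
d_bar = statistics.mean(d)                      # Step 2: mean difference
s_d = statistics.stdev(d)                       # Step 3: sample std. dev. of differences
t = d_bar / (s_d / math.sqrt(len(d)))           # Step 4: t statistic
df = len(d) - 1                                 # Step 5: degrees of freedom = 19

print(round(t, 2), df)  # t ≈ 11.98 with 19 d.f., well beyond the critical 2.093
```

Because 11.98 far exceeds 2.093, the paired test confirms the conclusion stated above.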
SESSION 10: ANOVA
I - Analysis of variance
1. Purpose of analysis of variance
- ANOVA compares more than two means simultaneously and traces sources of variation to
potential explanatory factors (analysis of variance is commonly referred to as ANOVA).
- Analysis of variance seeks to identify sources of variation in a numerical dependent variable Y
(the response variable). Variation in the response variable about its mean either is explained by one
or more categorical independent variables (the factors) or is unexplained (random error)
- Note: When conducting multiple t-tests, the probability of making a Type I error (incorrectly
rejecting a true null hypothesis) increases with the number of tests. ANOVA controls the overall Type I
error rate, reducing the risk of false positives when comparing multiple groups.
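The inflation of the Type I error rate can be quantified. A minimal sketch, assuming for illustration that the pairwise tests are independent (a simplification, since pairwise tests share data):

```python
# Family-wise Type I error rate when running m independent tests at level alpha.
# Illustrates why ANOVA is preferred over many pairwise t-tests.
def familywise_error(alpha, m):
    return 1 - (1 - alpha) ** m

# Comparing c = 5 group means pairwise requires m = C(5, 2) = 10 t-tests.
print(familywise_error(0.05, 10))  # ≈ 0.40: roughly a 40% chance of a false positive
```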
3. N-Factor ANOVA (Involved Two or more Factors)
4. Treatment
- Each possible value of a factor or combination of factors is a treatment.
- Test if each factor has a significant effect on Y:
H0: μ1 = μ2 = μ3
H1: Not all the means are equal
If we cannot reject H0, we conclude that observations within each treatment have the same
mean μ.
Note: If : μ1 = μ2 ≠ μ3 we also reject H0
5. ANOVA Assumptions (Only used for One-factor/One-way ANOVA)
- Independence: The observations in each group are independent of each other and the
observations within groups were obtained by a random sample.
- Normality: Each sample was drawn from a normally distributed population
- Homogeneity of Variances / Homoskedasticity: The variances of the populations that the
samples come from are equal
II - One-factor anova
1. Purpose: Compares the means of c treatments (groups)
Example: Paint quality is a major concern of car makers. A key characteristic of paint is its viscosity, a
continuous numerical variable. Viscosity is to be tested for dependence on application temperature
(low, medium, high), as illustrated in Figure 11.4. Although temperature is a numerical variable, it has
been coded into categories that represent the test conditions of the experiment because the car
maker did not want to assume that viscosity was linearly related to temperature.
2. Data Format
- Sample sizes within each treatment do not need to be equal.
- Total number of observations: n = n1 + n2 + n3 + … + nc
- Advantages of having balanced sample sizes:
1. Equal sample size ensures that each treatment contributes equally to the analysis;
2. Reduces problems arising from violations of the assumptions (e.g., nonindependent Y values,
unequal variances or nonidentical distributions within treatments, or non-normality of Y); and
3. Increases the power of the test (i.e., the ability of the test to detect differences in treatment
means).
- Example of one factor: Is there a difference among the incomes of students from 3 schools?
T1 = UEH ; T2 = UAH ; T3 = IU
y11: student 1 (UEH)
y21: student 2 (UEH)
y12: student 1 (UAH)
y22: student 2 (UAH)
…
➔ If the incomes of all students are the same ➔ ȳ is not affected
➔ If the incomes are not all the same ➔ ȳ is affected
3. Hypothesis to Be Tested
- The question of interest is whether the mean of Y varies from treatment to treatment. The
hypotheses to be tested are:
4. One-Factor ANOVA as a Linear Model
- An equivalent way to express the one-factor model is to say that observations in treatment j
came from a population with a common mean (μ) plus a treatment effect (Tj) plus random error (εij):
● yij: observation
● μ :common mean
● Tj: treatment effect
● εij: random error
- The random error is assumed to be normally distributed with zero mean and the same
variance for all treatments.
- Testing hypotheses:
● If the null hypothesis is true (Tj = 0 for all j), then knowing that an observation came from
treatment j does not help explain the variation in Y, and the ANOVA model becomes:
● If the null hypothesis is false, then at least some of the Tj must be nonzero. In that case, the Tj
that are negative (below μ) must be offset by the Tj that are positive (above μ ) when weighted by
sample size.
III - Decomposition of variation
1. Group Means
- The mean of each group is calculated in the usual way by summing the observations in the
treatment and dividing by the sample size:
- The overall sample mean or grand mean y can be calculated either by summing all the
observations and dividing by n or by taking a weighted average of the c sample means:
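The equivalence of the two ways of computing the grand mean can be checked in Python (the income figures below are purely illustrative):

```python
import statistics

# Grand mean as a weighted average of the c group means (hypothetical income
# data for the three-school example; the numbers are illustrative only).
groups = {"UEH": [5, 7, 6], "UAH": [4, 5], "IU": [8, 9, 7, 8]}

means = {g: statistics.mean(y) for g, y in groups.items()}
n = sum(len(y) for y in groups.values())

# Weighted average of the group means...
grand_weighted = sum(len(y) * means[g] for g, y in groups.items()) / n
# ...equals the mean of all n observations pooled together.
grand_pooled = statistics.mean([v for y in groups.values() for v in y])

print(round(grand_weighted, 4), round(grand_pooled, 4))  # both ≈ 6.5556
```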
IV - Partition of deviations
1. Partition Of Deviations
- Any deviation of an observation from the grand mean ȳ may be expressed in two parts:
yij − ȳ = (ȳj − ȳ) + (yij − ȳj), i.e., the variation in yij = the variation of the predicted scores + the variation of the errors of prediction
2. Hypothesis Testing
- SSA and SSE are used to test the hypothesis of equal means by dividing each sum of squares by its
degrees of freedom.
- These ratios are called Mean Squares (MSA and MSE)
V - Test statistic
1. F statistic
- F statistic: the ratio of the variance due to treatments (MSA) to the variance due to error (MSE)
-> used to find out whether the treatment means differ significantly.
- When F is near zero ➔ little difference among treatments ➔ not reject H0
- Decision Rule: Reject H0 if F > Fα; otherwise do not reject
- Note: If the result is significant (reject the null hypothesis) -> Check the P- value of the
individual variable since the F-test can only test the overall effect on Y. We can test P-value to examine
the effect of each variable; otherwise the F-test result might be caused by joint-effects of all variables.
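The F statistic can be computed directly from the definitions. A minimal sketch with made-up data for three treatment groups (any real data arranged as c lists would work the same way):

```python
import statistics

# One-factor ANOVA from the definitions (illustrative data, three treatments).
groups = [[48, 52, 50, 51], [55, 57, 54, 58], [49, 50, 48, 53]]

all_y = [v for g in groups for v in g]
n, c = len(all_y), len(groups)
grand = statistics.mean(all_y)

# Decomposition: SST = SSA (between treatments) + SSE (within treatments)
ssa = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
sse = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
sst = sum((v - grand) ** 2 for v in all_y)

msa = ssa / (c - 1)        # mean square between treatments, d.f. = c - 1
mse = sse / (n - c)        # mean square within treatments,  d.f. = n - c
f = msa / mse              # reject H0 when F exceeds the critical value F_alpha

print(abs(sst - (ssa + sse)) < 1e-9, round(f, 2))
```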
- After running ANOVA, suppose the result indicates that the mean sales revenues for the three stores
A, B, and C are not all equal. Therefore, we reject the null hypothesis and conclude that the mean sales
revenues for these three stores differ. We then use Tukey's test to examine each pair of stores
(AB, BC, CA). Suppose the result shows that B is not significantly different from C (A ≠ B = C). In this
case, we would refine our conclusion to state that only Store A has a statistically different mean sales
revenue, while the difference between Stores B and C is not statistically significant.
Example: Cupertino, San Jose, Santa Clara are restaurants that sell tofu pizza. We want to
examine if there was a difference in means among the three restaurants.
( In reviewing the graph of the sample means, it appears that Santa Clara has a much higher number
of sales than Cupertino and San Jose. There will be three pairwise post‐hoc tests to run.)
Solutions:
- The hypotheses are
H0a: μ1 = μ2 ; H1a: μ1 ≠ μ2
H0b: μ1 = μ3 ; H1b: μ1 ≠ μ3
H0c: μ2 = μ3 ; H1c: μ2 ≠ μ3
- These three tests will be conducted with an overall significance level of 𝛼 = 5%.
- Here are the differences of the sample means for each pair ranked from lowest to highest:
- The HSD critical values (using statistical software) for this particular test:
- HSDcrit at 5% significance level = 1.85
- HSDcrit at 1% significance level = 2.51
- For each test, reject H0 if the difference of means is greater than HSDcrit
- Test 2 and Test 3 show significantly different means at both the 1% and 5% level.
- Conclusion: Santa Clara has a significantly higher mean number of tofu pizzas sold compared
to both San Jose and Cupertino. There is no significant difference in mean sales between San
Jose and Cupertino.
- Decision Rule:
T(c, n−c) is the critical value of Tukey's test statistic for the desired level of significance; reject H0 if Tcalc > T(c, n−c).
3. Hartley’s Test
- Hartley’s test is used to test for Homogeneity of Variances (Homoskedasticity)
- The test statistic is the ratio of the largest sample variance to the smallest sample variance:
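The test statistic is simple to compute. A sketch with illustrative data (the critical value must still come from an Fmax table):

```python
import statistics

# Hartley's F_max statistic: ratio of the largest to the smallest sample
# variance across groups (illustrative data; compare against an F_max table).
groups = [[12, 14, 13, 15], [11, 18, 10, 17], [13, 13, 14, 12]]

variances = [statistics.variance(g) for g in groups]
f_max = max(variances) / min(variances)

print(round(f_max, 2))  # a ratio near 1 supports the equal-variances assumption
```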
4. Levene’s Test
- An alternative to Hartley's F test, also used to test homogeneity of variances. Both the Hartley and
Levene tests should be performed prior to a one-way ANOVA to ensure that the ANOVA assumptions are met.
- Levene's test does not assume a normal distribution.
- Based on the distances of the observations from their sample medians rather than
their sample means.
Sample Exercises
1. Using the following Excel results:
(a) What was the overall sample size?
(b) How many groups were there?
(c) Write the hypotheses.
(d) Find the critical value of F for α = .10.
(e) Calculate the test statistic.
(f ) Do the population means differ at α = .10?
c) Hypothesis:
H0: μ1 = μ2 = μ3 = μ4 = μ5
H1: Not all the means are equal
d) Critical value
We use the F table (Appendix F in the textbook) with df1 = c − 1 = 4 (the column) and
df2 = n − c = 35 (the row).
e)
Test statistic:
f) Since Fcalc = 1.8 < Fcrit = 2.64, we fail to reject H0: the population means do not differ significantly.
SESSION 11: SIMPLE REGRESSION
2. Correlation Coefficient
- Sample correlation coefficient (Pearson correlation coefficient) - denoted r - measures the
degree of linearity in the relationship between two random variables X and Y.
- Its value will fall in the interval [-1, 1].
- Negative correlation: when 𝑥𝑖 is above its mean, 𝑦𝑖 tends to be below its mean (and vice versa)
- Positive correlation: 𝑥𝑖 and 𝑦𝑖 are above/below their means at the same time
- The formula for the sample correlation coefficient
Correlation coefficient only measures the degree of linear relationship between X and Y.
● In very large samples, even very small correlations could be “significant”.
● A larger sample does not mean that the correlation is stronger nor does its increased
significance imply increased importance.
● To calculate the test statistic, we first need to calculate the value for r. Using Excel’s function
=CORREL(array1,array2), we find r = .4356 for the variables Quant GMAT and Verbal GMAT. We must
then calculate tcalc:
tcalc = r√((n − 2)/(1 − r²)) = .4356√((30 − 2)/(1 − (.4356)²)) = 2.561
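The same calculation can be done in Python, starting from the r value reported above:

```python
import math

# t statistic for testing H0: rho = 0, using the r reported in the text
# (r = .4356 from Excel's =CORREL for Quant vs. Verbal GMAT, n = 30).
r, n = 0.4356, 30
t_calc = r * math.sqrt((n - 2) / (1 - r ** 2))

print(round(t_calc, 3))  # ≈ 2.561
```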
II - Simple regression
1. What is Simple Regression?
- The simple linear model in slope-intercept form: Y = slope × X + y-intercept. In statistics, this
straight-line model is referred to as a simple regression equation.
● The Y variable as the response variable (the dependent variable)
● The X variable as the predictor variable (the independent variable)
- Only the dependent variable (not the independent variable) is treated as a random variable
2. Interpreting an Estimated Regression Equation
- Cause and effect are not proven by a simple regression
➔ we cannot assume that the explanatory variable is “causing” the variation in the response
variable
● Example: There are 2 scenarios to consider
+ Third-variable problem: A third variable can influence both variables under study, making
them appear causally linked when they are not. For example, vitamin D and bone thinning
are closely correlated, but they are not directly causing each other. Instead, calcium, a
third variable, affects both variables independently. In this case, concluding a causal
relationship would be a research bias.
+ Directionality problem: Both variables A and B could have a causal relationship, but the
direction of influence is unclear. For example, vitamin D levels and depression are
correlated, but it's not clear whether low vitamin D causes depression, or whether
depression causes people to consume less vitamin D.
- Inclusion of a random error ε is necessary because other, unspecified variables also may affect Y.
- The regression model without the error term represents the expected value of Y for a given x
value and is called the simple regression equation:
E(Y|x) = β0 + β1x (simple regression equation)
- The fitted regression equation is used to predict the expected value of Y for a given value of X. A
residual is the vertical distance between each 𝑦𝑖 and the estimated regression line on a scatter plot of
(𝑥𝑖, 𝑦𝑖) values.
IV - Ordinary least squares formulas
1. Slope and Intercept
- The ordinary least squares method (OLS method): estimates a regression so as to ensure the
best fit
➔ the slope and intercept are selected so that the residuals are as small as possible, creating a
straight line as close as possible to the data points.
- Residuals can be either positive or negative, and residuals around the regression line always
sum to zero
- The fitted coefficients b0 and b1 are chosen so that the fitted linear model ŷ = b0 + b1x has the
smallest possible sum of squared residuals (SSE):
- Differential calculus is used to obtain the coefficient estimators b0 and b1 that minimize SSE
- The OLS formula for the slope can also be:
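The OLS slope and intercept formulas can be sketched directly (illustrative data; note that the residuals sum to zero, as stated above):

```python
import statistics

# OLS slope and intercept from the formulas (illustrative data).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

b1 = sxy / sxx            # slope
b0 = y_bar - b1 * x_bar   # intercept (the line passes through the point of means)

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b1, 2), round(b0, 2), abs(sum(residuals)) < 1e-9)  # residuals sum to 0
```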
2. Sources of Variation in Y
- The total variation is expressed as a sum of squares (SST) and split into two parts:
+ Variation explained by the linear relationship between x and y (SSR)
+ Variation attributable to factors other than the linear relationship between x and y (SSE)
3. Coefficient of Determination
- The coefficient of determination: the portion of the total variation in the dependent variable
that is explained by variation in the independent variable.
=> Measures how well a statistical model predicts an outcome
● The coefficient of determination is called R-squared and is denoted R².
Example: If R² = 0.0867 => 8.67% of the variation in y can be explained by the x-variable.
- Examples of Approximate R2 Values
● The range of the coefficient of determination is 0 ≤ R² ≤ 1
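R² can be computed from the sums of squares. A minimal sketch on illustrative data:

```python
import statistics

# Coefficient of determination R^2 = 1 - SSE/SST (equivalently SSR/SST),
# shown with a hand-rolled OLS fit on illustrative data.
x = [1, 2, 3, 4, 5]
y = [3.0, 4.5, 6.1, 7.4, 9.0]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)                       # total variation
r_squared = 1 - sse / sst

print(round(r_squared, 4))  # close to 1: x explains nearly all variation in y
```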
- We divide by n – 2 because the simple regression model uses two estimated parameters, b0 and b1.
- The standard error of the estimate
- The magnitude of se should always be judged relative to the size of the y values in the sample
data
● Inferences about the regression model
- The variance of the regression slope coefficient (b1) is estimated by
where:
sb1 = Estimate of the standard error of the least squares slope
se = √(SSE/(n − 2)) = Standard error of the estimate
Confidence Intervals for Slope and Intercept
These standard errors are used to construct confidence intervals for the true slope and intercept,
using Student’s t with d.f. = n − 2 degrees of freedom and any desired confidence level.
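A sketch of the confidence interval for the slope on illustrative data (the critical value t.025 = 3.182 for 3 d.f. is taken from a t table such as Appendix D):

```python
import math
import statistics

# Confidence interval for the true slope beta1: b1 ± t * s_b1 with d.f. = n - 2
# (illustrative data).
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
x_bar, y_bar = statistics.mean(x), statistics.mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))        # standard error of the estimate
s_b1 = s_e / math.sqrt(sxx)           # standard error of the slope

t_crit = 3.182                        # t.025 with d.f. = n - 2 = 3, from a t table
lo, hi = b1 - t_crit * s_b1, b1 + t_crit * s_b1
print(round(b1, 3), round(lo, 3), round(hi, 3))
```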
2. Hypothesis Tests
- If β1 = 0 ➔ X does not influence Y
→ the regression model collapses to a constant β0 plus a random error term:
- For either coefficient, we use a t test with d.f. = n - 2. The hypotheses and test statistics
3. Slope versus correlation
- The test for zero slope is the same as the test for zero correlation.
➔ The t test for zero slope will always yield exactly the same tcalc as the t test for zero correlation.
VI - Confidence and prediction intervals for y
- Construct an Interval Estimate for Y
Appendix D:
Exercise:
A hypothesis test is conducted at the 5 percent level of significance to test whether the
population correlation is zero. If the sample consists of 25 observations and the correlation coefficient
is 0.60, then the computed test statistic would be:
Key:
H0: ρ = 0
H1: ρ ≠ 0
ρ is the population correlation coefficient.
tcalc = r√((n − 2)/(1 − r²)) = .60√((25 − 2)/(1 − 0.6²)) = 3.597
This is compared with the critical t value for 23 degrees of freedom, two-tailed: 2.069 (see Appendix
D: d.f. = 23, significance level for two-tailed test = .05)
Since computed t exceeds critical t, reject H0 and conclude the population correlation coefficient
significantly differs from 0 at the 5% level of significance.