
Chap 1: stat

Thursday, June 28, 2018 12:20 AM

Statistic (data science): -> single measure


-> number
-> summarize a sample data set
Tasks: - data modeling
- analysis
- decision making
Kinds:
- Descriptive: collect-summarize-present-analyze
(charts, graphs, numerical summary)
- Inferential: generalizing -> estimating unknown population parameters,
drawing conclusions, making decisions
▪ Estimating
▪ Hypothesis test: test the claim…
Application:
- Audit
- Economics: make a forecast
- Marketing: survey customer preference
- Production: output
- Finance: guide an investment

Empirical data: collected through observation + experiments


=> pitfalls -----> large population, small sample
-----> nonrandom samples
-----> rare events
-----> poor survey method (relying on memory)
-----> assumed causal links (crime rates - full moon)
-----> generalization (individuals -> groups)
-----> unconscious bias (for many years it is assumed that…)
-----> statistical significance rather than practical importance

session 0 Page 1
Chap 2: data
Thursday, June 28, 2018 10:04 AM

• Data: facts, figures, ... collected for
- analysis
- presentation
- interpretation
Types:
○ Categorical (qualitative): values described by words rather than numbers
○ Numerical (quantitative)

○ Time series: study trends, patterns over time


○ Cross-sectional: same time, different individuals (price of a group of 20 stocks on 30/6)
○ Pooled = time series+cross-sectional (price of a group of 20 stocks last week)

• Data set: collection of data value as a whole


• Subject/individual: item for study
• Variable: characteristic about the subject
• Observation: a single data value
Variables (columns):

Employee Name     Sex      DOB           Income per year ($)
Gladys Simpson    Female   1-May-1971    120,000
Divid Hinds       Male     17-Dec-1968   135,000
Kenneth Henry     Male     3-Sep-1965    98,000

Observation: a single data value in the table (e.g., 120,000)

Data types (by number of variables):

- Univariate (1 variable)
- Bivariate (2 variables)
- Multivariate (>2 variables)

Collecting data

1 data collection Page 2


(information available somewhere)

1 data collection Page 3


Scales of measurement
Thursday, June 28, 2018 9:58 AM

Def:
- Determine the amount of information contained in the data
- Indicate the most appropriate
-> data summarization
-> statistical analysis
Types: ways to measure depend on the scale
➢ Nominal scale: category, classification data
- Data (labels, names) -> identify an attribute of the element
- Can code data numerically -> no numerical meaning
- Example:
• Students of a university are classified as Business, Humanities, Education, and so on.
- No ordering
➢ Ordinal: ranking (nominal data + order of data)
- Ex:
• Uni students: freshman, sophomore, junior/senior
- Ordering but no clear meaning to the distance between data
➢ Interval: ordinal data + meaningful distances
- Difference between measurements -> meaningful
- No true zero value, ratios have no meaning (cannot say that 4 degrees is twice as hot as 2 degrees)
➢ Ratio: interval data + true zero
- Ratio of the 2 values is meaningful (A has twice as much money as B)

1 data collection Page 4


Sampling concepts
Thursday, June 28, 2018 2:29 PM

• Population
- collection of all items of interest or under investigation
- finite or infinite.
• Census: examination of all items in a defined population.
• Sample: observed subset of the population.

• Parameter: specific characteristic of a population


• Statistic: specific characteristic of a sample
• Target population: population we are interested in (e.g., U.S. gasoline prices).
• Sampling frame: the group from which we take the sample (e.g., 115,000 stations).
• With replacement: the same item may be drawn more than once (duplicates allowed)
- n is much smaller than N --> duplicates are unlikely
- Without replacement ---> duplicates are not allowed

1 data collection Page 5


Sampling Method
Monday, July 16, 2018 3:57 PM

1 data collection Page 6


Non-random sampling
Monday, July 16, 2018 4:02 PM

• Convenience Sample
• Sample -> available (e.g., ask co-worker opinions at lunch).
• Judgment Sample
• Use expert knowledge to choose “typical” items (e.g., which employees to interview).
• Focus Groups
• In-depth dialog with a representative panel of individuals (e.g. iPhone users).

1 data collection Page 7


Statistical Sampling
Monday, July 16, 2018 4:04 PM

Statistical Sampling: sample based on known or calculable probability

Simple random:
• Every member of the population has an equal chance of being selected
• Sample of given size
• Selection: with replacement / without replacement
• The sample can be obtained using a table of random numbers or a computer random number generator

Stratified:
• Divide population -> subgroups (called strata) that have a common characteristic (e.g. age, gender, occupation)
• Select a simple random sample from each subgroup
• Combine samples from subgroups into one

Systematic:
• Sample size: n
• Divide N individuals into n groups of k individuals: k=N/n
(e.g., divide 36 people into 9 groups of 4 people each)
• Select one individual from each group (those wearing red shirts in the figure)

Cluster:
• One-stage cluster sampling: randomly select k clusters
• Two-stage:
- randomly select k clusters
- choose a random sample of elements within each cluster.
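
The sampling methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, the fixed seed, and the two made-up strata are hypothetical, and the systematic variant shown takes the same randomly chosen position from every group of k.

```python
import random

def simple_random_sample(population, n, rng):
    """Every member has an equal chance; drawn without replacement."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Split N items into n groups of k = N // n items and take the
    same randomly chosen position from every group."""
    k = len(population) // n
    start = rng.randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Simple random sample from each stratum, combined into one sample."""
    combined = []
    for members in strata.values():
        combined.extend(rng.sample(members, n_per_stratum))
    return combined

rng = random.Random(42)          # fixed seed only so the sketch is repeatable
people = list(range(1, 37))      # 36 individuals, as in the example above

srs = simple_random_sample(people, 9, rng)
systematic = systematic_sample(people, 9, rng)   # 9 groups of k = 4
strata = {"group_a": people[:18], "group_b": people[18:]}  # hypothetical strata
stratified = stratified_sample(strata, 3, rng)

print(srs, systematic, stratified, sep="\n")
```

Sampling with replacement would use `rng.choices(population, k=n)` instead of `rng.sample`, which is why duplicates can appear in that case.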

1 data collection Page 8


Make a survey
Monday, July 16, 2018 9:51 PM

1 data collection Page 9


Rule of thumb
Tuesday, July 17, 2018 1:59 PM

1 data collection Page 10


Categories data
Monday, July 16, 2018 9:57 AM

Tabulating data: summary table
Graphing data: bar + pie charts for qualitative data (categories or nominal scale)
❖ Purpose: to see differences between or among categories.
- Height of bar => frequency
- Size of pie slice => percentage

Bar chart tips:
- Numerical variable -> Y axis
- Category labels -> X axis
- Height/length: proportional to the quantity displayed

Pie chart (%):
- Conveys a general idea
- Ineffective -> too many slices
- Represents parts of a whole
- Needs relative frequencies to construct

Pareto:
- portray categorical data (nominal scale)
- Categories: descending order of frequency => the most common categories appear first
- Cumulative polygon included

2 describe data Page 11


Numerical data
Monday, July 16, 2018 10:41 AM

2 describe data Page 12


Ordered array
Monday, July 16, 2018 10:47 AM

Ordered array: a sequence of data in rank order

- Shows range (min to max)

- Signals -> variability

- May help identify outliers (unusual observations)

- Large data => less useful

○ Ex:
raw form
24, 26, 24, 21, 27, 27, 30, 41, 32, 38
ordered array from smallest to largest
21, 24, 24, 26, 27, 27, 30, 32, 38, 41

Stem-and-Leaf Diagram: tool of: exploratory data analysis (EDA)


- Useful => small data set
- Most closely resemble: rudimentary bar chart
- Reveal:
○ Central tendency
○ Dispersion
- Method: Separate the sorted data into:
○ leading digits (the stem)
○ trailing digits (the leaves)
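
As a small illustration of the method, the sketch below builds a stem-and-leaf diagram for the ten values used in the ordered-array example. The function name is made up for this note, and it assumes non-negative integers with the tens digit as the stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} using tens digits as stems and unit digits as leaves."""
    plot = defaultdict(list)
    for v in sorted(values):          # sort first, then split each value
        plot[v // 10].append(v % 10)
    return dict(plot)

data = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
diagram = stem_and_leaf(data)

for stem, leaves in diagram.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 2 | 1 4 4 6 7 7
# 3 | 0 2 8
# 4 | 1
```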

Dot Plots:
• The simplest graphical display of n individual values of numerical data:
• Tool for data exploration => easy to understand
• Not good for large samples (e.g., > 5,000).
• Basic steps:
1. Make a scale -> cover data range
2. Mark axis demarcations (boundaries) + label them
3. Plot data

• Range -> dispersion
• Clustering -> central tendency
• Reveal shape (when sample is large enough)

2 describe data Page 14


Frequency distribution
Monday, July 16, 2018 11:24 AM

Frequency distribution:
• A frequency distribution is a table containing class groupings and the corresponding frequencies with which data fall within each grouping

Histogram:

Y-axis:
• the number of data values (or a percentage) within each bin of a frequency distribution
• Frequency, relative frequency, or percentage

X-axis:
• ticks show the end points of each bin
• The class boundaries (or class midpoints)

No gaps between bars

Frequency Polygons and Ogives (cumulative):

❖ Frequency Polygon: a line graph
○ Connects the midpoints of the histogram intervals,
○ Plus extra intervals at the beginning and end so that the line will touch the X-axis
○ Same purpose -> histogram

❖ Ogives (%):
○ Line graph of the cumulative frequencies

2 describe data Page 15


Scatter Plots
Monday, July 16, 2018 2:43 PM

Scatter Plots:

• possible relationships between two numerical variables

2 describe data Page 16


Time Series Plot
Monday, July 16, 2018 3:05 PM

- Patterns in the values of a variable over time.

Log scales (ratio scale):


- Useful -> time series that grow at a compound annual percentage rate
○ Ex: GDP, the national debt, or your future income.
- Reveal: quantity growth
○ increasing percent (concave upward or convex function),
○ constant percent (straight line)
○ declining percent (concave downward).
- Equal distances => equal ratios
○ the distance from 100 to 1,000 => same => distance from 1,000 to 10,000
○ both have the same 10:1 ratio
- Suited: positive data values
- Displayed: vertical axis => reveal more detail for small data values

2 describe data Page 17


Pictograms
Monday, July 16, 2018 3:34 PM

2 describe data Page 18


Numerical data property
Friday, July 13, 2018 10:17 AM

3 descriptive statistics Page 19


Center
Friday, July 13, 2018 10:20 AM

1) Arithmetic Mean (mean): the most common measure of central tendency
- Affected by extreme values => not suitable when there are outliers

2) Median: Not affected by extreme values
- “middle” number in an ordered array (50% above, 50% below)
- Most appropriate measure of central tendency for ordinal data

From <http://wps.prenhall.com/wps/grader>

Ex: 1,2,7,8,9,10
○ Position: (6+1)/2 = 3.5
○ Median: (7+8)/2 = 7.5

4) Mode: Not affected by extreme values (applies to nominal scale)
- Value that occurs most often
- Used for either numerical or categorical (nominal) data
- No mode/several modes

5) Geometric mean: measure the rate of change of a variable over time
(use when the data are percentages or interest rates)
- Mean rate of return:
○ Measures the status of an investment over time

$\bar{R}_G = [(1 + R_1)(1 + R_2) \cdots (1 + R_n)]^{1/n} - 1$

$R_i$ is the rate of return in time period i


Ex:
An investment of $100,000 declined to $50,000 at the end of year one and rebounded to
$100,000 at end of year two:
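
The example above can be checked numerically. The sketch below assumes the standard geometric mean rate of return, the nth root of the product of (1 + R_i) minus 1; the function name is made up for this note. The two yearly returns are -50% (100,000 to 50,000) and +100% (50,000 to 100,000).

```python
import math

def geometric_mean_return(returns):
    """Geometric mean rate of return for a list of per-period returns."""
    growth = math.prod(1 + r for r in returns)   # total growth factor
    return growth ** (1 / len(returns)) - 1

returns = [-0.50, 1.00]                          # year 1: -50%, year 2: +100%
arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(f"arithmetic mean return: {arithmetic:.0%}")   # 25%
print(f"geometric mean return:  {geometric:.0%}")    # 0%
```

The arithmetic mean suggests a 25% average gain even though the investment ended where it started; the geometric mean of 0% matches the actual outcome.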

3 descriptive statistics Page 20


Central (overall)
Friday, July 13, 2018 11:01 AM

3 descriptive statistics Page 21


Quartiles
Friday, July 13, 2018 11:13 AM

Descriptive Statistics:
split the ranked data -> 4 segments: an equal number of values per segment

- First quartile position: Q1 = (n+1)/4


- Second quartile position: Q2 = (n+1)/2 (the median position)
- Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values
Ex:

Q1, Q3: noncentral location


Q2 : median (central tendency)

Box-and-whisker Plot: A Graphical display of data using 5-number summary:

Minimum* -- Q1 -- Median -- Q3 -- Maximum
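
A minimal sketch of the quartile positions and the 5-number summary, applied to the ten values from the ordered-array example. One assumption to flag: when a position such as (n+1)/4 is fractional, this sketch interpolates linearly between the two neighbouring values; some textbooks round the position instead, so results can differ slightly.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional position (linear interpolation)."""
    lower = int(position)
    fraction = position - lower
    if lower < 1:
        return float(sorted_values[0])
    if lower >= len(sorted_values):
        return float(sorted_values[-1])
    low_val = sorted_values[lower - 1]
    high_val = sorted_values[lower]
    return low_val + fraction * (high_val - low_val)

def five_number_summary(values):
    data = sorted(values)
    n = len(data)
    q1 = value_at_position(data, (n + 1) / 4)        # first quartile position
    q2 = value_at_position(data, (n + 1) / 2)        # median position
    q3 = value_at_position(data, 3 * (n + 1) / 4)    # third quartile position
    return data[0], q1, q2, q3, data[-1]

data = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
summary = five_number_summary(data)
print(summary)   # (21, 24.0, 27.0, 33.5, 41)
```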

3 descriptive statistics Page 22


Vari (Range)
Friday, July 13, 2018 11:39 AM

• Simplest measure of variation


• Difference between the largest and the smallest values in a set of data:

Problems:
A. Ignores how the data are distributed

Solution:

- Variance: Average (approximately) of squared deviations of values from the mean
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
$\bar{X}$ = mean
n = sample size
$X_i$ = ith value of the variable X
- Standard deviation: $S = \sqrt{S^2}$
• Used to measure variation (most common)
• Show variation about the mean
• Same units as the original data
- Coefficient of Variation: $CV = \frac{S}{\bar{X}} \times 100\%$
○ Measures relative variation
○ Always in percentage (%)
○ Shows variation relative to mean
○ Can be used to compare two or more sets of data measured in different units
S: standard deviation
$\bar{X}$: mean (average)

Ex: Both stocks have the same standard deviation, but by CV stock B is less variable relative to its price

B. Sensitive to outliers:

Solution: Interquartile Range:


• Interquartile range = 3rd quartile – 1st quartile
IQR = Q3 – Q1

• Values outside the inner fences: unusual


• Values outside the outer fences: outliers => used to identify outliers
• In a Boxplot, Xmin and Xmax: smallest and highest values in the inner fences.

3 descriptive statistics Page 23


3 descriptive statistics Page 24
Variation (overall)
Friday, July 13, 2018 11:32 AM

(spread/ variability of the data value)

3 descriptive statistics Page 25


Population
Sunday, July 15, 2018 11:46 AM

• Population summary measures: parameters


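• Population mean: the sum of the values in the population divided by the population size, N

$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$

μ = population mean

N = population size

• Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

• Population Standard Deviation: $\sigma = \sqrt{\sigma^2}$

(measures variation, same units as the original data)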

3 descriptive statistics Page 26


Shape
Sunday, July 15, 2018 2:03 PM

• Describes how data are distributed


• Measures of shape:
• Symmetry / asymmetry
• peakedness

Kurtosis: represents too high / too low

3 descriptive statistics Page 27


Standardized data
Sunday, July 15, 2018 2:10 PM

Chebyshev's Theorem
• Regardless of how the data are distributed, at least (1 - 1/k²) x 100% of the values will fall within
k standard deviations of the mean (for k > 1)
- Ex:
(1 - 1/2²) x 100% = 75% ….... k=2 (μ ± kσ = μ ± 2σ)
Let μ = 72, σ = 8 (standard deviation)
=> At least 75% of the scores will be within the interval 72 ± 2(8) or [56, 88]
(regardless of how the scores are distributed)
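
The Chebyshev example can be reproduced with a short helper (the function name is made up for this note):

```python
def chebyshev_interval(mu, sigma, k):
    """Minimum share of values within k standard deviations of the mean,
    plus the interval itself. Only meaningful for k > 1."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    min_share = 1 - 1 / k**2
    return min_share, (mu - k * sigma, mu + k * sigma)

share, interval = chebyshev_interval(mu=72, sigma=8, k=2)
print(f"at least {share:.0%} of values lie in {interval}")
# at least 75% of values lie in (56, 88)
```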

The Empirical Rule: (not applicable to skewed populations)

• Data distribution: approximately bell-shaped
=> interval μ ± kσ
(contains a known percentage of the data)

Z Scores:
• Measure distance from the mean
- Ex:
Z-score of 2.0: a value is 2.0 standard deviations from the mean
• Z score > 3.0 or < -3.0: an outlier

- Ex:
Mean: 14.0 Standard Deviation: 3.0
What is the Z score of the value 18.5?
Z = (18.5 - 14.0) / 3.0 = 1.5

➢ The value 18.5 is 1.5 standard deviations above the mean

• Negative Z-score: a value is less than the mean

3 descriptive statistics Page 29


Grouped data
Sunday, July 15, 2018 2:57 PM

Weighted mean:
- A sum in which each data value $x_j$ is multiplied by a weight $w_j$ that represents a fraction of the total (i.e., the k
weights must sum to 1): $\bar{x}_w = \sum_{j=1}^{k} w_j x_j$

- Ex:
Your instructor gives
▪ a weight of 30 percent to homework, 20 percent to the midterm exam, 40
percent to the final exam, and 10 percent to a term project (so that .30 + .20
+ .40 + .10 = 1.00).
▪ your scores on these were 85, 68, 78, and 90. Your weighted average for
the course would be:

= (0.3 x 85) + (0.2 x 68) + (0.4 x 78) + (0.1 x 90) = 79.3
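
The course-grade example can be verified with a short sketch (the function name is made up for this note; it refuses weights that do not sum to 1):

```python
import math

def weighted_mean(values, weights):
    """Sum of weight * value, assuming the weights are fractions that sum to 1."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))

scores = [85, 68, 78, 90]              # homework, midterm, final, project
weights = [0.30, 0.20, 0.40, 0.10]
course_average = weighted_mean(scores, weights)
print(round(course_average, 1))        # 79.3
```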

Application:
- Accounting (weights for cost categories),
- finance (asset weights in investment portfolios)

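Approximations for Grouped Data:

❖ Mean:
Population: $\mu = \frac{\sum_j f_j m_j}{N}$
Sample: $\bar{x} = \frac{\sum_j f_j m_j}{n}$

f: frequency
m: midpoint
N or n: size

❖ Variance:
Population: $\sigma^2 = \frac{\sum_j f_j (m_j - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum_j f_j (m_j - \bar{x})^2}{n - 1}$

❖ Standard deviation: square root of the variance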

3 descriptive statistics Page 30


Linear Rela
Sunday, July 15, 2018 8:48 PM

The Covariance:
- measures the strength of the linear relationship between two variables

- Population Covariance: $\sigma_{XY} = \frac{\sum_{i=1}^{N} (X_i - \mu_X)(Y_i - \mu_Y)}{N}$
- Sample Covariance: $cov(X,Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

- Interpreting:

○ cov(X,Y) > 0 X and Y move: same direction

○ cov(X,Y) < 0 X and Y move: opposite directions

○ cov(X,Y) = 0 X and Y: no linear relationship (this alone does not prove independence)

- Coefficient of Correlation:
○ Measures the relative strength of the linear relationship between two variables
○ Sample coefficient of correlation: $r = \frac{cov(X,Y)}{S_X S_Y}$

○ Features:
• Unit free
• Ranges between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
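
A minimal sketch of the sample covariance and the sample coefficient of correlation, assuming the usual n - 1 divisor. The function names and the small data set are made up for illustration.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # cov(X, X) is the variance of X
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

hours = [1, 2, 3, 4, 5]          # hypothetical data
score = [2, 4, 5, 4, 5]
print(sample_covariance(hours, score))
print(sample_correlation(hours, score))
```

Rescaling one variable (for example, hours to minutes) changes the covariance but leaves the correlation unchanged, which is the sense in which r is unit free.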

3 descriptive statistics Page 31


3 descriptive statistics Page 32
Chap 5: Probability
Thursday, June 28, 2018 2:29 PM

• Random experiment: results cannot be known in advance
• Sample space: all possible outcomes of the experiment
• Discrete: countable number of outcomes
• Ex:
Flip a coin, the sample space consists of 2 outcomes S = {H, T}
• Continuous sample space: measurement outcome (weight, height,…)
• Event: subset of outcomes that satisfy a condition (e.g., a die shows an odd number)
• Simple/ elementary event: single outcome
• Probability: number that measures --> relative likelihood that the event will occur
• The probability of event A: P(A)

• Complement: P(A) + P(A′) = 1
- A': complement of an event A
• Intersection: A ∩ B -> "AND"
• Union: A ∪ B -> "OR"
• General law of addition:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

• Mutually Exclusive events (disjoint): can't occur at the same time
- Event A = a day in January. Event B = a day in February
- If A ∩ B = ∅, then P(A ∩ B) = 0
- ∅: null (empty) set
• Special Law of addition:
- Mutually exclusive => P(A ∪ B) = P(A) + P(B)
• Collectively Exhaustive Events:
- One of the events must occur
- The set of events covers the entire sample space
Ex: A = Weekday; B = Weekend;
C = January; D = Spring;
• Events A, B, C and D: collectively exhaustive (but not mutually exclusive – a
weekday can be in January or in Spring)
• Events A and B are collectively exhaustive and also mutually exclusive

• Conditional Probability:
- Based on Contingency table
- The probability of event A given that event B has occurred
P(A | B) = P(A ∩ B) / P(B)

• Independent Event: the probability of one event is not affected by the fact that the other
event has occurred
P(A | B) = P(A)

• Multiplication Rules:
P(A ∩ B) = P(A | B) P(B)
If A,B: independent
P(A ∩ B) = P(A) P(B)

• Odds in favor of event A:
P(A) / (1 − P(A))

• Odds against event A:
(1 − P(A)) / P(A)

• Bayes' Theorem
- revise previously calculated probabilities based on new information.
- an extension of conditional probability.
P(B | A) = P(A | B) P(B) / P(A)

- P(A) is not given:
P(A) = P(A | B) P(B) + P(A | B′) P(B′)

(among all machines A, what is the probability that a machine is broken (B)?)

A: machine A
B: machine is broken
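
To make the Bayes' theorem step concrete, here is a small sketch that assumes the standard form, with P(A) expanded through B and its complement when P(A) is not given. All three input probabilities are invented purely for illustration, as is the function name.

```python
def bayes_posterior(p_a_given_b, p_b, p_a_given_not_b):
    """P(B | A) when P(A) is not given: expand P(A) over B and its complement."""
    p_a = p_a_given_b * p_b + p_a_given_not_b * (1 - p_b)   # total probability
    return p_a_given_b * p_b / p_a

# Hypothetical inputs (not from the notes):
p_b = 0.02              # prior: P(B), share of broken items
p_a_given_b = 0.60      # P(A | B)
p_a_given_not_b = 0.30  # P(A | B')

posterior = bayes_posterior(p_a_given_b, p_b, p_a_given_not_b)
print(f"P(B | A) = {posterior:.4f}")   # revised probability of B after observing A
```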
Assessing probability
• Empirical (relative frequency approach): estimated through observation and experiment
- P(A) = f/n
- f: the frequency of observed outcomes defined in our experimental sample space
- n: number of observations
P(a missed scan) = number of missed scans / number of items scanned
• Classical: know the probability before observing the event / doing the experiment
- 50% chance of heads on coin flip
• Subjective: judgment about the likelihood of an event
- needed when there is no repeatable random experiment

4 probability Page 35
Counting rules
Friday, July 13, 2018 9:43 AM

Rules for counting the number of possible outcomes


1) k mutually exclusive and collectively exhaustive events on each of n trials => number of possible outcomes: $k^n$

- Example: If you roll a fair die 3 times then there are 6^3 = 216 possible outcomes
- k=6: mutually exclusive, collectively exhaustive events
- n=3: trials

2) the number of possible outcomes: (k1)(k2)…(kn)

- k1 events on the first trial
- k2 events on the second trial, …
- kn events on the nth trial
- Example:
○ You want to go to a park, eat at a restaurant, and see a movie. There are 3 parks, 4 restaurants,
and 6 movie choices. How many different possible combinations are there?
○ Answer: (3)(4)(6) = 72 different possibilities

3) n! = (n)(n − 1)…(1)

- ways that n items can be arranged in order
- Example:
○ You have five books to put on a bookshelf. How many different ways can these books be placed
on the shelf?
○ Answer: 5! = (5)(4)(3)(2)(1) = 120 different possibilities

4)
Permutations: $_nP_X = \frac{n!}{(n - X)!}$

- ways of arranging X objects selected from n objects in order
- Example:
○ You have five books and are going to put three on a bookshelf. How many different ways can the
books be ordered on the bookshelf?
○ Answer: 5!/(5 − 3)! = 120/2 = 60 different possibilities

5)
Combinations: $_nC_X = \frac{n!}{X!(n - X)!}$

- ways of selecting X objects from n objects, no order
- Example:
○ You have five books and are going to randomly select three to read. How many different
combinations of books might you select?
○ Answer: 5!/(3!(5 − 3)!) = 120/((6)(2)) = 10 different possibilities
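
The five counting rules can be checked with Python's standard library (`math.perm` and `math.comb` require Python 3.8 or later). The variable names are just labels for the examples in this section.

```python
import math

die_rolls = 6 ** 3                    # rule 1: k^n outcomes for 3 rolls of a die
outings = 3 * 4 * 6                   # rule 2: parks x restaurants x movies
shelf_orders = math.factorial(5)      # rule 3: arrange 5 books in order
ordered_picks = math.perm(5, 3)       # rule 4: arrange 3 of 5 books, order matters
unordered_picks = math.comb(5, 3)     # rule 5: choose 3 of 5 books, order ignored

print(die_rolls, outings, shelf_orders, ordered_picks, unordered_picks)
# 216 72 120 60 10
```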

4 probability Page 37
Thursday, August 23, 2018 10:58 PM

Random Variable ----represents-----> possible numerical value from an uncertain event
Discrete random variables:
-> produce outcomes from a counting process (e.g. number of classes
you are taking)
Probability distribution:
- All possible numerical outcomes -> mutually exclusive
- Probability of occurrence ---associated with---> each outcome

Experiment:
Toss 2 Coins.
Let X = # heads.
P(0) = .25, P(1) = .50, P(2) = .25

Expected value (mean or weighted average)
$E(X) = \mu = \sum_i x_i P(x_i)$
E(X) = (0 x .25) + (1 x .50) + (2 x .25) = 1.0

Variance
$\sigma^2 = \sum_i (x_i - \mu)^2 P(x_i)$
(μ = mean = E(X))

Standard Deviation
$\sigma = \sqrt{\sigma^2}$

Continuous random variables:
-> produce outcomes from a measurement process (e.g. your
weight,…)
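
The two-coin experiment makes a convenient numerical check. The sketch assumes the usual definitions for a discrete random variable: the mean is the probability-weighted sum of the values, and the variance is the probability-weighted sum of squared deviations from that mean.

```python
import math

# X = number of heads when tossing 2 fair coins
distribution = {0: 0.25, 1: 0.50, 2: 0.25}

mean = sum(x * p for x, p in distribution.items())
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())
std_dev = math.sqrt(variance)

print(mean, variance, round(std_dev, 4))   # 1.0 0.5 0.7071
```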

5 Discrete Distribution Page 39


PDF & CDF
Wednesday, August 29, 2018 10:26 PM

PDF (Probability Distribution Function) -> shows the probability for each value

• P(x) ≥ 0 for any value of x

• Individual probabilities sum to 1

CDF (Cumulative Distribution Function) -> shows the probability that X is less than or equal to x0
• Denote F(x0)

• Example: F(1) = P(X≤1) = P(0)+P(1) = 0.75

5 Discrete Distribution Page 40


Transformation of random variables
Wednesday, August 29, 2018 10:58 PM

E(aX + b) = a·E(X) + E(b) = a·E(X) + b (a, b = constants)

5 Discrete Distribution Page 41


Probability Distributions
Wednesday, August 29, 2018 11:02 PM

5 Discrete Distribution Page 42


Uniform distribution
Wednesday, August 29, 2018 11:06 PM

Uniform distribution:
• Random variables: finite number of integer value from a to b
• Depends: a&b
○ a: lower limit
○ b: upper limit
• Same probability for every X: P(X = x) = 1/(b − a + 1) (each value is equally likely to occur)

5 Discrete Distribution Page 43


Bernoulli + Binomial
Wednesday, August 29, 2018 11:06 PM

Bernoulli:
• Experiment has 2 outcomes
• Success (X=1), failure (X=0)
○ "success" ---defined--> less likely outcome
=> π < 0.5 for convenience.

Probability of success: π
Probability of failure: 1 − π
Mean: E(X) = π
Variance: π(1 − π)

Binomial: repeat Bernoulli n times
• From a finite population with replacement or from an infinite population

n: number of trials (fixed)
π: probability of success (constant)
PDF: $P(X = x) = \frac{n!}{x!(n - x)!} \pi^x (1 - \pi)^{n - x}$
Mean: $\mu = n\pi$
Standard deviation: $\sigma = \sqrt{n\pi(1 - \pi)}$
Shape: skewed right if π < 0.5,
skewed left if π > 0.5,
symmetric if π = 0.50.
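
A short sketch of the binomial distribution, assuming the standard binomial formula (number of combinations times π^x (1 − π)^(n − x)). The function name is made up, and `p` stands in for π. With n = 2 and π = 0.5 it reproduces the two-coin distribution from earlier in these notes.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with constant success probability p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 2, 0.5
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = n * p
std_dev = math.sqrt(n * p * (1 - p))

print(probabilities)             # [0.25, 0.5, 0.25]
print(mean, round(std_dev, 4))   # 1.0 0.7071
```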

5 Discrete Distribution Page 44


Hypergeometric+Geometric
Wednesday, August 29, 2018 11:07 PM

Hypergeometric distribution:
• Similar: binomial but
- "n" trials from a finite population
- Sample taken without replacement
- Outcomes of trials: dependent

PDF: $P(X = x) = \frac{\binom{A}{x}\binom{N - A}{n - x}}{\binom{N}{n}}$
Where
N = population size
A = number of items of interest in the population
N – A = number of items not of interest in the population
n = sample size
x = number of items of interest in the sample
n – x = number of items not of interest in the sample
Mean: $\mu = \frac{nA}{N}$
Standard deviation: $\sigma = \sqrt{\frac{nA(N - A)}{N^2} \cdot \frac{N - n}{N - 1}}$
Shape: symmetric if A/N = 0.5, otherwise skewed right or skewed left

Geometric: -> number of Bernoulli trials (not fixed) until the first success
-> Depends on π

5 Discrete Distribution Page 45


Poisson distribution
Wednesday, August 29, 2018 11:07 PM

Poisson distribution (model of rare events): independent occurrences in a
given area of opportunity (time, space, volume,…)

- The average number of events per unit is λ (lambda)

- λ increases -> less skewed, more bell-shaped

PDF: $P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}$
where:
x = number of events in an area of opportunity
λ = expected number of events (average number of events per unit)
e = base of the natural logarithm system (2.71828...)
Mean: $\mu = \lambda$
S.D: $\sigma = \sqrt{\lambda}$
Shape: skewed right
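
A minimal sketch of the Poisson probabilities, assuming the standard formula λ^x e^(−λ) / x!. The function name and the rate of 2 events per unit are made up for illustration.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) when events occur independently at an average rate lam per unit."""
    return lam**x * math.exp(-lam) / math.factorial(x)

lam = 2.0   # hypothetical average number of events per unit
for x in range(5):
    print(x, round(poisson_pmf(x, lam), 4))

mean = lam
std_dev = math.sqrt(lam)
```

Printing the probabilities for a larger `lam` (say 20) shows the distribution becoming less skewed and more bell-shaped, as noted above.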

5 Discrete Distribution Page 46


CONTINUOUS DISTRIBUTION
Friday, August 31, 2018 10:35 AM

A variable that can assume any value in an interval


. Thickness of an item
. Time required to complete a task
. Temperature in a room

These can potentially take on any value, depending only on the ability to
measure accurately

Discrete Variable: each value of X has its own probability P(X)

Continuous Variable: events are intervals and probabilities are areas
underneath smooth curves. A single point has no probability

Example: Your weight. On an ordinary scale your weight is 50, but on a
more precise scale your weight may be 50.2

6. Continuous Distribution Page 47


PDF and CDF of Continuous Distributions
Friday, August 31, 2018 10:36 AM

Probability Density Function (PDF)


. Denote f(x); must be nonnegative
. Total area under curve = 1
. Mean, variance, and shape depend on the PDF parameters

In the discrete case the PDF gives the probability of a single value; in the continuous case
probabilities are defined over a range of values
Integral of f(x) from a -> b is P(a ≤ X ≤ b)
Integral of f(x) from −∞ -> +∞ equals 1
The area at any single point = 0

Cumulative Distribution Function (CDF)


. Denote F(x)
. Shows P(X ≤ x)
. Useful for finding probabilities

6. Continuous Distribution Page 48


THE UNIFORM DISTRIBUTION
Friday, August 31, 2018 10:37 AM

The uniform distribution is a probability distribution that has equal probabilities for all possible outcomes of the random
variable
Also called a rectangular distribution

If X is a random variable that is uniformly distributed between a and b, its PDF has constant height f(x) = 1/(b − a) for a ≤ x ≤ b
• Area = base x height = (b − a) x 1/(b − a) = 1

Example:
SUMMARY

Example:

6. Continuous Distribution Page 49


6. Continuous Distribution Page 50
THE NORMAL DISTRIBUTION
Friday, August 31, 2018 10:40 AM
Normal distribution

Bell Shaped
Symmetrical
Mean = Median = Mode

Centre is determined by the mean, μ


Spread is determined by the standard deviation, σ
The random variable has an infinite theoretical
range: −∞ to +∞

Formula for the normal PDF

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

e = the mathematical constant approximated by 2.71828
π = the mathematical constant approximated by 3.14159
μ = the population mean
σ = the population standard deviation
x = any value of the continuous variable, −∞ < x < +∞

6. Continuous Distribution Page 51


FINDING NORMAL PROBABILITIES
Friday, August 31, 2018 10:45 AM

The probability for a range of values is measured by the area under the curve

SUMMARY FOR NORMAL DISTRIBUTION

6. Continuous Distribution Page 52


STANDARDIZED NORMAL
Friday, August 31, 2018 10:46 AM

Any normal distribution can be transformed into the standardized normal


distribution (Z), with mean 0 and variance 1

Need to transform X units into Z units by subtracting the mean of X and dividing
by its standard deviation: Z = (X − μ) / σ
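
A short sketch of the transformation, reusing the Z-score example from earlier in these notes (mean 14.0, standard deviation 3.0, value 18.5). `statistics.NormalDist` is part of the Python standard library (3.8+); using its `cdf` in place of a printed Z table is a convenience chosen here, not something the notes prescribe.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

mu, sigma = 14.0, 3.0
z = z_score(18.5, mu, sigma)              # 1.5
p_below = NormalDist().cdf(z)             # P(Z <= 1.5) on the standardized normal
p_below_direct = NormalDist(mu, sigma).cdf(18.5)   # same area in X units

print(z, round(p_below, 4), round(p_below_direct, 4))
```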

SUMMARY OF STANDARDIZED NORMAL DISTRIBUTIONS

6. Continuous Distribution Page 53


FINDING NORMAL PROBABILITIES
Friday, August 31, 2018 10:49 AM

6. Continuous Distribution Page 54


STANDARDIZED NORMAL TABLE
Friday, August 31, 2018 10:50 AM

For a given Z-value a, the table shows F(a) (the area under the curve from −∞ to a)

For negative Z-values, use the fact that the distribution is symmetric to find the
needed probability

6. Continuous Distribution Page 55


FINDING THE X VALUE FOR A KNOWN PROBABILITY
Friday, August 31, 2018 10:54 AM

Steps to find the X value for a known probability:

1. Find the Z value for the known probability
2. Convert to X units using the formula X = μ + Zσ
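
The two steps can be sketched as follows. The mean, standard deviation, and target probability are hypothetical values chosen for illustration, and `NormalDist().inv_cdf` (standard library, Python 3.8+) plays the role of the reverse table lookup.

```python
from statistics import NormalDist

def x_for_probability(p, mu, sigma):
    """X value such that P(X <= x) = p for a normal distribution."""
    z = NormalDist().inv_cdf(p)      # step 1: Z value for the known probability
    return mu + z * sigma            # step 2: convert Z back to X units

mu, sigma = 14.0, 3.0                # hypothetical distribution
x_95 = x_for_probability(0.95, mu, sigma)
print(round(x_95, 2))                # value below which 95% of the area lies
```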

6. Continuous Distribution Page 56


6. Continuous Distribution Page 57
NORMAL APPROXIMATION BINOMIAL DISTRIBUTIONS
Friday, August 31, 2018 10:56 AM

When is Approximation Needed?


Rule of Thumb: When nπ ≥ 10 and n(1 − π) ≥ 10, then it is appropriate to use
the normal approximation to the binomial
• In this case, the normal µ and σ are set equal to the binomial mean and standard deviation:
µ = nπ and σ = √(nπ(1 − π)).

6. Continuous Distribution Page 58


POISSON DISTRIBUTIONS (NORMAL APPROXIMATION)
Friday, August 31, 2018 10:57 AM

• The normal approximation to the Poisson works best when


❖ λ is large (e.g., when λ exceeds the values in Appendix B).
• Set the normal µ and σ equal to the Poisson mean and standard deviation: µ = λ, σ = √λ.

6. Continuous Distribution Page 59


THE EXPONENTIAL DISTRIBUTION
Friday, August 31, 2018 10:59 AM

Often used to model the length of time between two occurrences of an event
(the time between arrivals)

Examples:
Time between trucks arriving at an unloading dock
Time between transactions at an ATM
Time between phone calls to the main operator

SUMMARY OF EXPONENTIAL DISTRIBUTIONS

6. Continuous Distribution Page 60


6. Continuous Distribution Page 61
Some terminology
Tuesday, August 7, 2018 12:59 PM

In larger samples, the sample means would tend to be even closer to μ. This fact is the basis for statistical
estimation.

How to make inferences about a population that take into account four factors:
• Sampling variation (uncontrollable).
• Population variation (uncontrollable).
• Sample size (controllable).
• Desired confidence in the estimate (controllable).

Estimators:
Estimator: - a statistic derived from a sample
- used to infer the value of a population parameter
- a random variable (random samples vary)
Estimate: the value of the estimator in a specific sample.

Sample proportion:
• p = x/n
○ x: number of successes in the sample
○ n: the sample size
• Parameter: π

Sampling error: difference between an estimate and the corresponding population parameter

Ex: Sampling Error = $\bar{x} - \mu$

Bias: difference between the expected value (i.e., the average value) of the estimator and the true parameter
• For the mean:

Bias = $E(\bar{X}) - \mu$
Unbiased estimator:
• The expected value of the estimator equals the parameter being estimated.
• The sample mean is an unbiased estimator of the population mean when $E(\bar{X}) = \mu$
• Neither overstates nor understates the true parameter on average.
• Can be studied mathematically or by simulation experiments.
• Examples:
▪ Sample mean ($\bar{x}$)
▪ Sample proportion (p)

Sampling distribution:
• Probability distribution of all possible values of a statistic for a given size sample selected from a population

Efficiency:
• Refers to: variance of the estimator's sampling distribution
• More efficient estimator -> smaller variance

Consistency
• Estimator converges toward the parameter being estimated as the sample size increases

7. sampling distribution Page 63


Expected value
Thursday, August 23, 2018 10:39 PM

Different samples (same size, from the same population) -> yield different sample means
$\bar{X}$: sample mean

μ: population mean

❖ Expected value of the sample means: $E(\bar{X}) = \mu$

❖ Standard Error of the mean: variability in the mean from sample to sample
$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
○ Decreases when sample size increases

❖ Population -normal- ( mean μ and standard deviation σ )

=> sampling distribution of $\bar{X}$ -> normally distributed

7. sampling distribution Page 64


Central limit theorem
Thursday, August 23, 2018 11:10 PM

Central limit theorem --helps us--> to approximate the shape of the sampling
distribution of X bar even when we don't know what the population looks like

- If population: not normal
○ Sample means from the population -> will be approximately normal -> if the sample size is large enough
- Most distributions: n > 30 -> nearly normal
- Fairly symmetric distributions: n > 15

7. sampling distribution Page 65


Z-value for sampling dis of the mean
Thursday, August 23, 2018 11:53 PM

7. sampling distribution Page 66


Confidence interval
Monday, August 20, 2018 11:49 PM

- Sample mean: point estimate of an unknown population mean


- Confidence interval (interval estimate): additional information about variability (population characteristic)
○ Contain the unknown population parameter
○ confidence level = 95% => (1 − α) = 0.95
• α: significance level
• (Interpret) In the long run, 95% of all the confidence intervals that can be constructed will contain the
unknown true parameter
• A specific interval either will contain or will not contain the true parameter (no probability involved in a specific
interval)

○ Assumptions: population:
▪ Normally distributed (=> any sample size is okay)
▪ Not normal => use large sample (n > 30)

8. Confidence Interval Page 67


σ2 Known
Friday, August 24, 2018 11:44 PM

8. Confidence Interval Page 68


Common levels of confidence
Friday, August 24, 2018 11:54 PM

8. Confidence Interval Page 69


σ Unknown
Friday, August 24, 2018 11:55 PM

Population standard deviation σ unknown --substitute--> sample standard deviation S

➢ Extra uncertainty: because S varies from sample to sample
➢ Using Student's t Distribution (instead of the normal distribution) => important
when the sample size is small
- t: a family of distributions
- t: depends on degrees of freedom (d.f.) -> number of observations
that can change freely after sample mean has been calculated (d.f. = n − 1)
- Appendix D
Confidence Interval Estimate:

$\bar{X} \pm t_{n-1,\alpha/2} \frac{S}{\sqrt{n}}$

Estimated standard error of the mean: $\frac{S}{\sqrt{n}}$

$t_{n-1,\alpha/2}$: the critical value of the t distribution with
- n-1 d.f.
- an area of α/2 in each tail.

8. Confidence Interval Page 70


Population proportion
Saturday, August 25, 2018 12:27 AM

An interval estimate for the population proportion ( π ) can be calculated by adding an


allowance for uncertainty to the sample proportion ( p )

❖ Sample proportion:
○ Distribution: approximately normal if the sample size is large
○ Standard deviation: $\sigma_p = \sqrt{\pi(1-\pi)/n}$

❖ Sample data: π is unknown, so the standard deviation is estimated by $\sqrt{p(1-p)/n}$

❖ Confidence Interval Endpoints: $p \pm Z_{\alpha/2} \sqrt{p(1-p)/n}$

○ where
• Zα/2 is the standard normal value for the level of confidence desired
• p is the sample proportion
• n is the sample size
○ Note: must have X = np > 5 and n – X = n(1-p) > 5 (to make sure that the normal
distribution can be used as an approximation)
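A minimal sketch of the proportion interval p ± Zα/2 · √(p(1-p)/n), including the X > 5 and n – X > 5 check from the note. The function name and the counts (25 out of 100) are hypothetical:

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(x, n, confidence=0.95):
    """Normal-approximation interval for a population proportion."""
    if x <= 5 or n - x <= 5:
        raise ValueError("normal approximation needs X > 5 and n - X > 5")
    p = x / n                                            # sample proportion
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # Z_{alpha/2}
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


# Hypothetical sample: 25 items of interest out of 100.
low, high = proportion_interval(x=25, n=100)
print(f"95% CI for pi: ({low:.4f}, {high:.4f})")
```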

8. Confidence Interval Page 71


Pop variance - Chi-square
Saturday, August 25, 2018 12:45 AM

Population normal => the scaled sample variance (n – 1)s²/σ² follows the chi-square
distribution (χ²) with degrees of freedom d.f. = n – 1.

- Lower (χ²L = χ²1-α/2)
- Upper (χ²U = χ²α/2)
- Appendix E
- Sample variance: s²
- Confidence interval: $\frac{(n-1)s^2}{\chi^2_U} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_L}$
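A sketch of the variance interval (n-1)s²/χ²U ≤ σ² ≤ (n-1)s²/χ²L, assuming the two chi-square critical values are read from a table and passed in. The sample variance 4.0 and n = 10 are made up; 2.700 and 19.023 are the usual table values for 9 d.f. at α = 0.05:

```python
def variance_interval(s2, n, chi2_lower, chi2_upper):
    """Interval for a population variance (population assumed normal)."""
    df = n - 1
    # Dividing by the larger critical value gives the lower endpoint.
    return df * s2 / chi2_upper, df * s2 / chi2_lower


# Hypothetical sample: n = 10, s^2 = 4.0.
low, high = variance_interval(s2=4.0, n=10, chi2_lower=2.700, chi2_upper=19.023)
print(f"95% CI for sigma^2: ({low:.4f}, {high:.4f})")
```

The interval is not symmetric around s² because the chi-square distribution is skewed.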

8. Confidence Interval Page 72


Determine sample size
Saturday, August 25, 2018 12:52 AM

Sampling error (margin of error):
- Imprecision in the estimate of the population parameter
- Amount added to and subtracted from the point estimate => to form the
confidence interval
- Z in all the formulas below is Zα/2

For the mean, must know:
• The desired level of confidence (1 - α), which determines the critical Z value
• The acceptable sampling error, e
• The standard deviation, σ

For the proportion, must know:
• The desired level of confidence (1 - α), which determines the critical value, Zα/2
• The acceptable sampling error, e
• The true proportion of events of interest, π
○ π can be estimated with a pilot sample if necessary (or conservatively use 0.5 as an
estimate of π)
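The inputs listed above feed the standard sample-size formulas n = Z²σ²/e² (mean) and n = Z²π(1-π)/e² (proportion), rounded up. A minimal sketch; the function names and the example inputs are hypothetical:

```python
from math import ceil
from statistics import NormalDist


def z_crit(confidence):
    """Critical value Z_{alpha/2} for the given confidence level."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def n_for_mean(sigma, e, confidence=0.95):
    """Sample size needed to estimate a mean within +/- e."""
    return ceil((z_crit(confidence) * sigma / e) ** 2)


def n_for_proportion(e, pi=0.5, confidence=0.95):
    """Sample size needed to estimate a proportion within +/- e.
    pi = 0.5 is the conservative default when nothing is known."""
    return ceil(z_crit(confidence) ** 2 * pi * (1 - pi) / e ** 2)


# Hypothetical inputs.
n_mean = n_for_mean(sigma=45, e=5, confidence=0.90)
n_prop = n_for_proportion(e=0.03, pi=0.12)
n_cons = n_for_proportion(e=0.03)          # conservative, pi = 0.5
print(n_mean, n_prop, n_cons)
```

Always round up: rounding down would give a sampling error slightly larger than e.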

8. Confidence Interval Page 73


Hypothesis
Thursday, August 23, 2018 10:57 PM

Definition: a claim (assertion) about a population parameter


Population mean:

Population proportion:

The null hypothesis H0 (maintained hypothesis): states the assertion to be tested
- Always contains the “=” sign.
- Always about a population parameter, not about a sample statistic.

The alternative hypothesis H1: the opposite of the null
- What the researcher is trying to prove

9.One-sample Hypothesis Tests Page 74


Critical value
Sunday, August 26, 2018 10:01 PM

Sample mean --close to---> stated population mean


=> H0 is not rejected.
=> Vice versa
How "close" is enough?
❖ Critical Value

9.One-sample Hypothesis Tests Page 75


Error in Hyp test
Sunday, August 26, 2018 10:04 PM

❖ Type I Error (false positive)
○ Reject a true null hypothesis
○ A serious type of error
○ The probability is α (level of significance)
❖ Type II Error (false negative)
○ Fail to reject a false null hypothesis
○ The probability is β --depends on--> α, sample size

9.One-sample Hypothesis Tests Page 76


Confidence Level
Sunday, August 26, 2018 10:11 PM

Confidence coefficient: (1-α)


- The probability of not rejecting H0 when it is true.

Confidence Level (of a hypothesis test): (1-α)*100%.
Power of statistical test (1-β) :
- The probability of rejecting H0 when it is false.

9.One-sample Hypothesis Tests Page 77


Hypothesis Tests: Mean
Sunday, August 26, 2018 10:20 PM

D.f: n-1

- Convert sample statistic (X̄) to test statistic (Z statistic)

- Critical Z values ---based on---> level of significance α

- Decision Rule: If the test statistic falls in the rejection region, reject H0; otherwise do not
reject H0

➢ Lower-tail test / left-tailed test
➢ Upper-tail test / right-tailed test
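The decision rule can be sketched for a Z test of the mean, assuming σ is known so that Z = (X̄ - μ0)/(σ/√n). The function name, the `tail` argument and the numbers are hypothetical:

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Return (Z statistic, reject H0?) using the critical value approach."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    std = NormalDist()
    if tail == "lower":                       # H1: mu < mu0
        reject = z < std.inv_cdf(alpha)
    elif tail == "upper":                     # H1: mu > mu0
        reject = z > std.inv_cdf(1 - alpha)
    else:                                     # H1: mu != mu0
        reject = abs(z) > std.inv_cdf(1 - alpha / 2)
    return z, reject


# Hypothetical example: H0: mu = 3, sample mean 2.84, sigma = 0.8, n = 100.
z, reject = z_test_mean(xbar=2.84, mu0=3, sigma=0.8, n=100)
print(f"Z = {z:.2f}, reject H0: {reject}")
```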

9.One-sample Hypothesis Tests Page 78


P-value approach to testing
Sunday, August 26, 2018 10:26 PM

P-value (observed level of significance):


- Probability of obtaining a test statistic more extreme (≤ or ≥) than the observed
sample value given H0 is true

- Smallest value of α for which H0 can be rejected

- Sample Statistic (eg: X̄) ---> Test Statistic (eg: Z statistic)

- Compare the p-value with α

○ If p-value < α, reject H0

○ If p-value ≥ α, do not reject H0
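A short sketch of the p-value for a Z statistic, using the standard normal CDF from Python's standard library. The function name and the observed value Z = -2.0 are hypothetical:

```python
from statistics import NormalDist


def p_value(z, tail="two"):
    """p-value of an observed Z statistic under H0."""
    cdf = NormalDist().cdf
    if tail == "lower":
        return cdf(z)                  # P(Z <= z)
    if tail == "upper":
        return 1 - cdf(z)              # P(Z >= z)
    return 2 * (1 - cdf(abs(z)))       # both tails


# Hypothetical observed statistic, two-tailed test.
p = p_value(-2.0)
for alpha in (0.05, 0.01):
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(f"p = {p:.4f}, alpha = {alpha}: {decision}")
```

The same data lead to rejection at α = 0.05 but not at α = 0.01, which is why the p-value is called the observed level of significance.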



9.One-sample Hypothesis Tests Page 79


Connection to Confidence Interval
Sunday, August 26, 2018 10:36 PM

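Connection to confidence interval (applies to two-tailed tests only): H0 is rejected at level α
when the hypothesized value falls outside the (1 - α) confidence interval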

9.One-sample Hypothesis Tests Page 80


Hyp test for proportion
Sunday, August 26, 2018 10:45 PM

- Involves a categorical variable
- 2 possible outcomes: possesses / doesn't possess the characteristic of interest
Proportion of the population in the category of interest: π
Sample proportion in the category of interest: p

- X and n - X are at least 5 => p can be approximated by a normal distribution with
Mean: $\mu_p = \pi$    Standard deviation: $\sigma_p = \sqrt{\pi(1-\pi)/n}$

- p is approximately normal => test statistic: $Z = \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}}$
- An equivalent form (in terms of the number in the category of interest, X): $Z = \frac{X - n\pi}{\sqrt{n\pi(1-\pi)}}$
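A sketch showing that the two forms of the proportion test statistic, Z = (p - π)/√(π(1-π)/n) and Z = (X - nπ)/√(nπ(1-π)), give the same value. The function name and the data (25 of 500, hypothesized π = 0.08) are hypothetical:

```python
from math import sqrt


def z_test_proportion(x, n, pi0):
    """Z statistic for H0: pi = pi0, computed in both equivalent forms."""
    if x < 5 or n - x < 5:
        raise ValueError("need X and n - X to be at least 5")
    p = x / n
    z_from_p = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)
    z_from_x = (x - n * pi0) / sqrt(n * pi0 * (1 - pi0))
    return z_from_p, z_from_x


# Hypothetical data.
z_p, z_x = z_test_proportion(x=25, n=500, pi0=0.08)
print(f"Z (from p) = {z_p:.4f}, Z (from X) = {z_x:.4f}")
```

Note that the standard deviation uses the hypothesized π, not the sample p, because the test is carried out assuming H0 is true.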

9.One-sample Hypothesis Tests Page 81


Independent samples
Monday, August 20, 2018 11:43 PM

Goal: test hypothesis -> difference between 2 population means: μ1 – μ2

The point estimate for the difference: X̄1 – X̄2

Assumptions:
• Samples are randomly and independently drawn
• Populations are normally distributed or both sample sizes are at least 30

10. Two-sample Hypothesis Tests Page 82


The paired difference test
Tuesday, August 21, 2018 12:15 AM

10. Two-sample Hypothesis Tests Page 83


2 pop proportion
Tuesday, August 21, 2018 12:41 AM

Goal: test a hypothesis / form a confidence interval
-> for the difference between 2 population proportions: π1 – π2

The point estimate for the difference: p1 – p2

Testing for zero difference (π1 – π2 = 0)

Pooled estimate: $\bar{p} = \frac{X_1 + X_2}{n_1 + n_2}$

(X1 and X2: number of items of interest in samples 1 and 2)
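A sketch of the zero-difference test using the pooled estimate p̄ = (X1 + X2)/(n1 + n2). It assumes the usual pooled Z statistic Z = (p1 - p2)/√(p̄(1-p̄)(1/n1 + 1/n2)); the function name and counts are hypothetical:

```python
from math import sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Pooled Z statistic for H0: pi1 - pi2 = 0."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                     # pooled estimate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se, p_pool


# Hypothetical data: 36 of 72 in sample 1, 31 of 50 in sample 2.
z, p_pool = two_proportion_z(36, 72, 31, 50)
print(f"pooled p = {p_pool:.4f}, Z = {z:.2f}")
```

With these numbers |Z| is below 1.96, so a two-tailed test at α = 0.05 would not reject H0.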

10. Two-sample Hypothesis Tests Page 84


Test 2 pop variance
Tuesday, August 21, 2018 12:51 AM

➢ F critical: from the F table
➢ 2 d.f. required:
- Numerator d.f. (the larger sample variance goes in the numerator): column in F table
- Denominator d.f.: row in F table

10. Two-sample Hypothesis Tests Page 85


Analysis of Variance
Wednesday, August 29, 2018 10:29 PM

Purpose: compare more than two means simultaneously (if there are only two --> use
a t-test)
Variation in Y about its mean is either explained by one or more
categorical independent variables (the factors) or is unexplained (random error)

One-factor ANOVA -> includes only one factor

N-factor ANOVA -> involves two or more factors

OVERVIEW
Each possible value of a factor or combination of factors is a treatment.
Test if each factor has a significant effect on Y:
H0: μ1 = μ2 = μ3

H1: Not all the means are equal


If we cannot reject H0, we conclude that observations within each treatment have
the same mean μ
ANOVA ASSUMPTION

1/ Observations on Y are independent


2/ Populations being sampled are normal and have equal variances

11. ANOVA Page 86


ONE FACTOR ANOVA
Friday, August 31, 2018 10:26 AM

Purpose: Compares the means of c treatments (groups)

* Sample sizes within each treatment do not need to be equal

Total number of observations: n = n1 + n2 + n3 + … + nc
Observation on Y (Yij) -- includes -->
• explained variation
• unexplained error, which depends on the characteristics of each observation ->
independent
With population means, we could conclude directly which one is higher. Sample
means are only estimates, so they have to be compared with a test statistic (built
from the sample means, standard deviations and sample sizes) at a chosen
confidence level 1 - α
Example of one factor: income of students
A1: UEH - A2: UAH - A3: IU
Y11: student 1
Y21: student 2

-> If the incomes of all students are the same -> Ȳ is not affected
-> If the incomes of all students are not the same -> Ȳ is affected

One factor ANOVA as a Linear Model: $y_{ij} = \mu + A_j + \varepsilon_{ij}$

11. ANOVA Page 87


yij: observation
μ: common mean
Aj: treatment effect
εij: random error

Testing hypotheses:
H0: A1 = A2 = A3 = … = Ac = 0
H1: Not all Aj are zero

If H0 is true, the ANOVA model is
yij = μ + εij

The same mean in all groups, or no factor effect

11. ANOVA Page 88


DECOMPOSITION OF VARIATION
Friday, August 31, 2018 10:29 AM

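The mean of each group (group mean): $\bar{y}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}$

The overall sample mean (grand mean): $\bar{y} = \frac{1}{n} \sum_{j=1}^{c} \sum_{i=1}^{n_j} y_{ij}$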

11. ANOVA Page 89


PARTITION OF DEVIATIONS
Friday, August 31, 2018 10:31 AM

HYPOTHESIS TESTING
SSA and SSE are used to test the hypothesis of equal means by dividing each sum
of squares by its degrees of freedom
These ratios are called Mean Squares: MSA = SSA/(c - 1) and MSE = SSE/(n - c)

11. ANOVA Page 90


F STATISTICS:
Friday, August 31, 2018 10:32 AM

F STATISTIC: ratio of the variance due to treatments (MSA) to the variance due to error (MSE): F = MSA / MSE

When F is near zero -> little difference among treatments -> do not reject H0
Decision Rule: Reject H0 if F > Fα (with c - 1 and n - c d.f.), otherwise do not reject
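A by-hand sketch of the one-factor ANOVA F statistic, with SSA as the between-treatment sum of squares and SSE as the within-treatment sum of squares. The function name and the three small groups are hypothetical, and the critical value 5.14 is the usual F table value for α = 0.05 with 2 and 6 d.f.:

```python
from statistics import mean


def one_factor_anova(groups):
    """Return (SSA, SSE, MSA, MSE, F) for a list of treatment groups."""
    n = sum(len(g) for g in groups)            # total observations
    c = len(groups)                            # number of treatments
    grand = sum(sum(g) for g in groups) / n    # grand mean
    ssa = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    sse = sum((y - mean(g)) ** 2 for g in groups for y in g)
    msa = ssa / (c - 1)
    mse = sse / (n - c)
    return ssa, sse, msa, mse, msa / mse


# Hypothetical data: c = 3 treatments, 3 observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ssa, sse, msa, mse, f_stat = one_factor_anova(groups)
F_CRIT = 5.14                                  # table value, alpha = 0.05, d.f. (2, 6)
print(f"F = {f_stat:.2f}, reject H0: {f_stat > F_CRIT}")
```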

11. ANOVA Page 91


THE TUKEY'S TEST
Friday, August 31, 2018 10:32 AM

When? --> After rejection of equal means in ANOVA

Tells which population means are significantly different
Ex: μ1 = μ2 ≠ μ3

Tukey's studentized range test is a multiple comparison test
--> for c groups, there are c(c-1)/2 distinct pairs of means to be compared

Tukey's test is a two-tailed test for equality of paired means from c groups
compared simultaneously
The hypotheses are: H0: μj = μk vs. H1: μj ≠ μk

Decision Rule: reject H0 if $T_{calc} = \frac{|\bar{y}_j - \bar{y}_k|}{\sqrt{MSE\,(1/n_j + 1/n_k)}} > T_{c,n-c}$

Tc,n-c is a critical value of the Tukey test statistic Tcalc for the desired level of
significance

5% critical values of Tukey test statistics

11. ANOVA Page 92


HARTLEY'S TEST
Friday, August 31, 2018 10:33 AM

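Tests the ANOVA assumption of equal variances (H0: all population variances are equal)
The test statistic is the ratio of the largest sample variance to the smallest
sample variance: $H_{calc} = s^2_{max} / s^2_{min}$

Decision Rule: Reject H0 if Hcalc > Hcritical

Hcritical can be found in the table of critical values for Hartley's test, using
df1 = c, df2 = n/c - 1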
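A minimal sketch of Hartley's statistic and the two degrees of freedom used to look up Hcritical. The function name and the groups are hypothetical; the critical value itself still has to come from a Hartley table:

```python
from statistics import variance


def hartley_h(groups):
    """Return (H_calc, df1, df2) for Hartley's test of equal variances."""
    variances = [variance(g) for g in groups]   # sample variances (n - 1)
    h_calc = max(variances) / min(variances)
    c = len(groups)
    n = sum(len(g) for g in groups)
    return h_calc, c, n / c - 1                 # df1 = c, df2 = n/c - 1


# Hypothetical data: 3 groups of 3 observations.
groups = [[1, 2, 3], [2, 4, 6], [5, 6, 7]]
h_calc, df1, df2 = hartley_h(groups)
print(f"H = {h_calc:.2f}, df1 = {df1}, df2 = {df2:.0f}")
```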

11. ANOVA Page 93


LEVENE'S TEST - run in Minitab only
Friday, August 31, 2018 10:33 AM

Alternative to Hartley's F test


Levene's test does not assume a normal distribution
Based on the distances of the observations from their sample medians rather
than their sample means

11. ANOVA Page 94


Regression analysis
Friday, August 31, 2018 1:20 PM

• Regression analysis is used to:

• Predict dependent variable based on independent variable(s)


• Explain the impact of changes in an independent variable on the
dependent variable
• Dependent variable (y): the variable we wish to explain
• Independent variable (x): the variable used to explain the dependent
variable
• Relationship between x and y ---described by---> linear function
• Changes in y ----assumed to be caused by---> changes in x
• Linear regression population equation model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

12. Simple linear regression Page 95


12. Simple linear regression Page 96
Least Squares Estimators
Friday, August 31, 2018 1:31 PM

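b0 & b1 are obtained by:

- Finding the b0 & b1 that minimize the sum of the squared differences
between y and ŷ: $SSE = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - (b_0 + b_1 x_i))^2$

- Differential calculus is used to obtain the coefficient estimators b0 and
b1 that minimize SSE (e.g., setting the derivatives to zero)

- The slope coefficient estimator is $b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

- And the constant or y-intercept is $b_0 = \bar{y} - b_1 \bar{x}$

- The regression line always goes through the point of means $(\bar{x}, \bar{y})$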

12. Simple linear regression Page 97


Intercept+slope
Friday, August 31, 2018 6:20 PM

b0 : estimated average value of y when the value of x is zero (if x = 0 is in the range of observed
x values)

b1 : estimated change in the average value of y as a result of a one-unit change in x

House Price in $1000s (Y)    Square Feet (X)
245 1400
312 1600
279 1700
308 1875
199 1100
219 1550
405 2350
324 2450
319 1425
255 1700

• Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the
range of sizes observed, $98,248.33 is the portion of the house price not explained by square
feet
• Here, b1 = 0.10977 tells us that the price of a house increases by 0.10977($1000) =
$109.77, on average, for each additional square foot of size
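The values b0 = 98.24833 and b1 = 0.10977 can be reproduced from the ten houses in the table with the usual least squares formulas. A minimal sketch; the function and variable names are my own:

```python
# House data from the table above: price in $1000s (y), size in square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]


def least_squares(x, y):
    """Return (b0, b1) of the least squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept: line passes through (x_bar, y_bar)
    return b0, b1


b0, b1 = least_squares(sqft, price)
print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")
```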

12. Simple linear regression Page 98


Measures of variation
Friday, August 31, 2018 6:30 PM

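Total variation = explained variation + unexplained variation: SST = SSR + SSE
- SST = $\sum (y_i - \bar{y})^2$: measures the variation of the yi values around their mean, ȳ
- SSR = $\sum (\hat{y}_i - \bar{y})^2$: explained variation attributable to the linear relationship
between x and y
- SSE = $\sum (y_i - \hat{y}_i)^2$: variation attributable to factors other than the linear
relationship between x and y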

12. Simple linear regression Page 99


Coefficient of Determination, R2
Friday, August 31, 2018 6:34 PM

• Coefficient of determination (R-squared)
• the portion of the total variation in the dependent variable ---explained by--->
variation in the independent variable
• denoted as R²: R² = SSR / SST (explained variation / total variation), with 0 ≤ R² ≤ 1
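A sketch of R² for the house price example, computed as explained variation over total variation. It assumes the usual sums of squares SST, SSR and SSE around a least squares line; the names are my own:

```python
# House data from the earlier example: price in $1000s (y), square feet (x).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]


def variation(x, y):
    """Return (SST, SSR, SSE) for the least squares line through (x, y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained
    return sst, ssr, sse


sst, ssr, sse = variation(sqft, price)
r2 = ssr / sst
print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}, R^2 = {r2:.5f}")
```

Here about 58% of the variation in house prices is explained by variation in square feet.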

12. Simple linear regression Page 100


Estimation of model Error variance
Friday, August 31, 2018 6:40 PM

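• An estimator for the variance of the population model error is $s_e^2 = \frac{SSE}{n-2} = \frac{\sum e_i^2}{n-2}$

• Division by n – 2: because 2 degrees of freedom are used up in estimating b0 and b1

• $s_e = \sqrt{s_e^2}$ is the standard error of the estimate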

12. Simple linear regression Page 101


Inferences About the Regression Model
Friday, August 31, 2018 6:45 PM

12. Simple linear regression Page 102


Confidence interval estimate of the slope
Friday, August 31, 2018 6:47 PM

The slope coefficient estimator is

12. Simple linear regression Page 103


12. Simple linear regression Page 104
T test: inference about the slope
Friday, August 31, 2018 6:46 PM

12. Simple linear regression Page 105


F-test for significance
Friday, August 31, 2018 6:48 PM

Simple regression model => the number of independent variables is 1 => k = 1

12. Simple linear regression Page 106


Prediction+prediction interval
Friday, August 31, 2018 6:49 PM

• The regression equation ---can be used to---> predict a value for y, given
a particular x
• For a specified value, xn+1, the predicted value is $\hat{y}_{n+1} = b_0 + b_1 x_{n+1}$
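A small sketch of the point prediction ŷ = b0 + b1·x, using the rounded coefficients from the house price example. The 2000 square foot house and the function name are hypothetical:

```python
def predict(b0, b1, x_new):
    """Point prediction from the fitted regression line."""
    return b0 + b1 * x_new


# Coefficients from the house price example (price in $1000s, x in square feet).
B0, B1 = 98.24833, 0.10977
y_hat = predict(B0, B1, 2000)      # hypothetical house of 2000 square feet
print(f"Predicted price: {y_hat:.2f} ($1000s)")
```

The prediction is only trustworthy for x values inside the observed range (1100 to 2450 square feet in that example).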

12. Simple linear regression Page 107
