Chap 1: Stat: Tasks: - Data Modeling
session 0 Page 1
Chap 2: data
Thursday, June 28, 2018 10:04 AM
Observation
Collecting data
Def:
- Determines the amount of information contained in the data
- Indicates the most appropriate data summarization and statistical analysis
Types: the way data are measured depends on the scale of measurement
➢ Nominal scale : category, classification data
- Data (labels, names) -> identify an attribute of the element
- Data can be coded numerically, but the codes have no numerical meaning
- Example:
• Students of a university are classified as Business, Humanities, Education, and so on.
- No ordering
➢ Ordinal scale: ranking (nominal data + order of data)
- Ex:
• Uni students: freshman, sophomore, junior/senior
- Ordering but no clear meaning to the distance between data
➢ Interval scale: ordinal data, plus:
- Differences between measurements -> meaningful
- No true zero value, so ratios have no meaning (e.g., 4 degrees is not "twice as hot" as 2 degrees)
➢ Ratio scale: interval data, plus:
- The ratio of two values is meaningful (e.g., A has twice as much money as B)
• Population
- collection of all items of interest or under investigation
- finite or infinite.
• Census: examination of all items in a defined population.
• Sample: observed subset of the population.
• Convenience Sample
• Sample whatever is readily available (e.g., ask co-workers' opinions at lunch).
• Judgment Sample
• Use expert knowledge to choose “typical” items (e.g., which employees to interview).
• Focus Groups
• In-depth dialog with a representative panel of individuals (e.g. iPhone users).
Simple random:
• Every member of the population has an equal chance of being selected
• Sample of a given size
• Selection: with replacement / without replacement
• The sample can be obtained using a table of random numbers or a computer random number generator
Stratified:
• Divide the population into subgroups (called strata) that share a common characteristic (e.g. age, gender, occupation)
• Select a simple random sample from each subgroup
• Combine the samples from the subgroups into one
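The with/without-replacement distinction can be sketched with Python's standard library; the population of 100 member IDs below is made up for illustration:

```python
import random

population = list(range(1, 101))  # hypothetical population: member IDs 1..100

random.seed(42)  # stands in for a table of random numbers

# Without replacement: each member can be selected at most once
sample_without = random.sample(population, k=10)

# With replacement: the same member may be drawn more than once
sample_with = random.choices(population, k=10)

print(sorted(sample_without))
print(sorted(sample_with))
```

`random.sample` implements selection without replacement; `random.choices` selects with replacement.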
Systematic:
Cluster:
Tabulating data: Summary table
Graphing data:
Bar + Pie Charts: qualitative data (categories or nominal scale)
❖ Purpose: to see differences between or among categories
- Height of bar => frequency
- Size of pie slice => percentage
Tips:
Bar:
- Numerical variable -> Y axis
- Category labels -> X axis
- Height/length proportional to the quantity displayed
Pie (%):
- Conveys a general idea
- Ineffective with too many slices
- Represents parts of a whole
- Needs relative frequencies to construct
Pareto:
- portray categorical data (nominal scale)
- Categories: descending order of frequency => the most common categories appear first
- Cumulative polygon included
○ Ex:
raw form
24, 26, 24, 21, 27, 27, 30, 41, 32, 38
ordered array from smallest to largest
21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Dot Plots:
• The simplest graphical display of n individual values of numerical data:
• Tool for data exploration => easy to understand
• Not good for large samples (e.g., > 5,000).
• Basic steps:
1. Make a scale -> cover data range
2. Mark axis demarcations and label them
Frequency distribution:
• A frequency distribution is a table listing the data groupings (classes) and the corresponding frequencies with which data fall within each grouping
X-axis: the class intervals
❖ Ogives (%):
○ Line graph of the cumulative frequencies
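The cumulative frequencies behind an ogive can be tabulated from the ordered array in the earlier example (21, 24, 24, 26, 27, 27, 30, 32, 38, 41); the class width of 5 is an assumption for illustration:

```python
data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
classes = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45)]  # assumed intervals

cumulative = []
running = 0
for lo, hi in classes:
    freq = sum(1 for x in data if lo <= x < hi)  # frequency within [lo, hi)
    running += freq
    # cumulative % = running count / n * 100; these points form the ogive
    cumulative.append((hi, running, 100 * running / len(data)))

for upper, cum_f, cum_pct in cumulative:
    print(f"< {upper}: cumulative frequency {cum_f} ({cum_pct:.0f}%)")
```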
Scatter Plots:
From <http://wps.prenhall.com/wps/grader>
4) Mode:
- Value that occurs most often
- Not affected by extreme values (applies to nominal scale)
- Used for either numerical or categorical (nominal) data
- There may be no mode or several modes
Median example: 1, 2, 7, 8, 9, 10
○ Position: (6+1)/2 = 3.5
○ Median: (7+8)/2 = 7.5
Descriptive Statistics:
Quartiles: split the ranked data into 4 segments with an equal number of values per segment
Problems:
A. Ignores the data distribution
Solution:
- Variance: average (approximately) of the squared deviations of values from the mean
S² = Σ(Xi − X̄)² / (n − 1)
where
X̄ = mean
n = sample size
Xi = ith value of the variable X
- Standard deviation:
• Used to measure variation (most common)
• Shows variation about the mean
• Same units as the original data
S: standard deviation
X̄: mean (average)
- Coefficient of Variation: CV = (S / X̄) × 100%
○ Measures relative variation
○ Always in percentage (%)
○ Shows variation relative to the mean
○ Can be used to compare two or more sets of data measured in different units
Ex: Both stocks have the same standard deviation, but stock B is less variable relative to its price, as its CV shows
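A minimal sketch of the CV comparison; the two price series are invented so that both stocks share the same standard deviation while B's mean price is higher:

```python
from statistics import mean, stdev

stock_a = [45, 50, 55]    # hypothetical prices, mean 50
stock_b = [95, 100, 105]  # hypothetical prices, mean 100

def cv(prices):
    """Coefficient of variation: (S / mean) * 100, in percent."""
    return stdev(prices) / mean(prices) * 100

# Same absolute variation (S = 5 for both) ...
print(stdev(stock_a), stdev(stock_b))
# ... but B varies less relative to its price level
print(f"CV A = {cv(stock_a):.1f}%, CV B = {cv(stock_b):.1f}%")
```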
A. Sensitive to outliers:
• Population variance:
σ² = Σ(Xi − μ)² / N
μ = population mean
N = population size
Chebyshev's Theorem
• Regardless of how the data are distributed, at least (1 − 1/k²) × 100% of the values will fall within k standard deviations of the mean (for k > 1)
- Ex:
k = 2: (1 − 1/2²) × 100% = 75% (μ ± kσ = μ ± 2σ)
Let μ = 72, σ = 8 (standard deviation)
=> At least 75% of the scores will be within the interval 72 ± 2(8), i.e., [56, 88]
(regardless of how the scores are distributed)
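The interval in the example can be computed directly from the theorem; a small sketch:

```python
def chebyshev_interval(mu, sigma, k):
    """Return (lower, upper, minimum fraction inside) per Chebyshev's theorem."""
    assert k > 1, "the theorem applies for k > 1"
    bound = 1 - 1 / k ** 2  # at least this fraction lies within mu +/- k*sigma
    return mu - k * sigma, mu + k * sigma, bound

low, high, frac = chebyshev_interval(mu=72, sigma=8, k=2)
print(f"At least {frac:.0%} of values lie in [{low}, {high}]")  # [56, 88]
```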
Z Scores:
• Measure distance from the mean
- Ex:
Z-score of 2.0: a value is 2.0 standard deviations from the mean
• Z score > 3.0 or < -3.0: an outlier
- Ex:
Mean: 14.0 Standard Deviation: 3.0
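Using the mean (14.0) and standard deviation (3.0) from this example, a z-score sketch; the data values plugged in are hypothetical:

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd

def is_outlier(x, mean, sd):
    """Flag values whose z-score is beyond +/- 3.0."""
    return abs(z_score(x, mean, sd)) > 3.0

print(z_score(18.5, 14.0, 3.0))     # 1.5: within the usual range
print(is_outlier(24.5, 14.0, 3.0))  # True: z = 3.5
```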
Weighted mean:
- A sum in which each data value is given a weight wj representing a fraction of the total (i.e., the k weights must sum to 1).
- Ex:
Your instructor gives
▪ a weight of 30 percent to homework, 20 percent to the midterm exam, 40
percent to the final exam, and 10 percent to a term project (so that .30 + .20
+ .40 + .10 = 1.00).
▪ your scores on these were 85, 68, 78, and 90. Your weighted average for
the course would be:
= (0.3 x 85) + (0.2 x 68) + (0.4 x 78) + (0.1 x 90) = 79.3
Application:
- Accounting (weights for cost categories),
- finance (asset weights in investment portfolios)
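The course-grade example above can be checked with a short sketch:

```python
def weighted_mean(values, weights):
    """Sum of w_j * x_j, where the k weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * x for w, x in zip(weights, values))

scores = [85, 68, 78, 90]           # homework, midterm, final, project
weights = [0.30, 0.20, 0.40, 0.10]  # .30 + .20 + .40 + .10 = 1.00

print(round(weighted_mean(scores, weights), 2))  # 79.3
```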
f: frequency
m: midpoint
N or n: size
❖ Variance:
The Covariance:
- measures the strength of the linear relationship between two variables
- Interpreting:
- Coefficient of Correlation:
○ Measures the relative strength of the linear relationship between two variables
○ Sample coefficient of correlation
○ Features:
• Unit free
• Ranges between −1 and 1
• The closer to −1, the stronger the negative linear relationship
• The closer to +1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
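A sketch of sample covariance and the coefficient of correlation from their definitions; the two series are made-up numbers chosen for an exact linear relationship:

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance: sum of (xi - xbar)(yi - ybar) / (n - 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Coefficient of correlation: cov(x, y) / (Sx * Sy); unit-free, in [-1, 1]."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]      # y = 2x: perfect positive linear relationship
print(sample_corr(x, y))  # 1.0
```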
4 probability Page 33
• Events A and B are collectively exhaustive and also mutually exclusive
• Conditional Probability:
- Based on Contingency table
- The probability of event A given that event B has occurred
• Independent Event: the probability of one event is not affected by the fact that the other
event has occurred
• Multiplication Rules:
If A,B: independent
• Bayes'Theorem
- revise previously calculated probabilities based on new information.
- an extension of conditional probability.
A: machine A
B: the machine is defective
Assessing probability
• Empirical (relative frequency approach): estimated through observation and experiment
- P(f) = f/n
- f: the frequency of observed outcomes defined in our experimental sample space
- n: number of observations
P(a missed scan) =number of missed scans / number of items scanned
• Classical: the probability is known before observing the event or doing the experiment
- 50% chance of heads on coin flip
• Subjective: judgment about the likelihood of an event
- needed when there is no repeatable random experiment
Counting rules
Friday, July 13, 2018 9:43 AM
- Example: If you roll a fair die 3 times then there are 6³ = 216 possible outcomes
- k=6: mutually exclusive, collectively exhaustive events
- n=3: trials
3)
4) Permutations
5) Combinations
- Ways of selecting X objects from n objects, without regard to order
- Example:
○ You have five books and are going to randomly select three to read. How many different
combinations of books might you select?
Answer
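The book-selection answer can be verified with the standard library:

```python
from math import comb, factorial, perm

# Combinations: choose 3 of 5 books, order irrelevant
print(comb(5, 3))  # 10

# Same count from the formula n! / (x! * (n - x)!)
assert comb(5, 3) == factorial(5) // (factorial(3) * factorial(2))

# For contrast, counting ordered selections gives permutations
print(perm(5, 3))  # 60
```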
Thursday, August 23, 2018 10:58 PM
Experiment:
Toss 2 coins. Let X = # heads.
E(X) = (0 × 0.25) + (1 × 0.50) + (2 × 0.25) = 1.0
Variance
Standard Deviation
E(aX + b) = a·E(X) + b (a, b constants)
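The two-coin example and the rule E(aX + b) = a·E(X) + b can be checked numerically; a = 3, b = 2 are arbitrary constants chosen for the check:

```python
# Distribution of X = number of heads in 2 coin tosses
dist = {0: 0.25, 1: 0.50, 2: 0.25}

def expectation(d):
    """E(X) = sum of x * P(x)."""
    return sum(x * p for x, p in d.items())

def variance(d):
    """Var(X) = sum of (x - E(X))^2 * P(x)."""
    mu = expectation(d)
    return sum((x - mu) ** 2 * p for x, p in d.items())

print(expectation(dist), variance(dist))  # 1.0 0.5

# Linearity: E(aX + b) = a*E(X) + b
shifted = {3 * x + 2: p for x, p in dist.items()}
assert expectation(shifted) == 3 * expectation(dist) + 2
```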
Uniform distribution:
• Random variables: finite number of integer value from a to b
• Depends: a&b
○ a: lower limit
○ b: upper limit
• Equal probability for every value of X (each value is equally likely to occur)
Bernoulli:
• Experiment has 2 outcomes
• Success (X=1), failure (X=0)
○ "Success" is usually defined as the less likely outcome, so that π < 0.5 for convenience
Probability of success: π
Probability of failure: 1 − π
Mean: E(X) = π
Variance: π(1 − π)
Mean
Standard deviation
Hypergeometric distribution:
• Similar: binomial but
- "n" trials from a finite population
- Sample taken without replacement
- Outcomes of trials: dependent
PDF
Where
N = population size
A = number of items of interest in the population
N − A = number of items not of interest in the population
n = sample size
x = number of items of interest in the sample
n − x = number of items not of interest in the sample
Mean
Standard deviation
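A sketch of the hypergeometric PDF built from its definition, P(X = x) = C(A, x) C(N − A, n − x) / C(N, n); the 10-item population with 4 items of interest is invented:

```python
from math import comb

def hypergeom_pmf(x, N, A, n):
    """P(X = x) when n items are drawn without replacement from a
    population of N that contains A items of interest."""
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

# Hypothetical: N = 10 items, A = 4 of interest, sample n = 3
p1 = hypergeom_pmf(1, N=10, A=4, n=3)
print(p1)  # 0.5

# The mean matches n * A / N
mean = sum(x * hypergeom_pmf(x, 10, 4, 3) for x in range(4))
assert abs(mean - 3 * 4 / 10) < 1e-12
```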
Geometric: -> Bernoulli trials (not a fixed number) repeated until the first success
-> Depends on π
Poisson:
PDF where:
x = number of events in an area of opportunity
λ = expected number of events (average number of events per unit)
e = base of the natural logarithm system (2.71828...)
Mean
S.D
These can potentially take on any value, depending only on the ability to
measure accurately
Example: your weight. On an ordinary scale it reads 50, but on a more precise scale it may read 50.2
The uniform distribution is a probability distribution that has equal probabilities for all possible outcomes of the random
variable
Also called a rectangular distribution
If X is a random variable that is uniformly distributed between a and b, its PDF has constant height 1/(b − a)
• Area = base × height = (b − a) × 1/(b − a) = 1
SUMMARY
Bell Shaped
Symmetrical
Mean = Median = Mode
The probability for a range of values is measured by the area under the curve
Need to transform X units into Z units by subtracting the mean of X and dividing
by its standard deviation
For a given Z-value a, the table shows F(a) (the area under the curve from −∞ to a)
For negative Z-values, use the fact that the distribution is symmetric to find the
needed probability
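The transform-and-look-up steps can be sketched in code; here `phi` plays the role of the cumulative table F(a), implemented with the standard library's `erf` rather than a printed table:

```python
from math import erf, sqrt

def phi(z):
    """F(z): area under the standard normal curve from -infinity to z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(x, mu, sigma):
    """Transform X units into Z units, then take the area up to z."""
    z = (x - mu) / sigma
    return phi(z)

# Symmetry for negative Z-values: F(-a) = 1 - F(a)
assert abs(phi(-1.0) - (1 - phi(1.0))) < 1e-12

# Hypothetical X ~ N(100, 15): P(X < 115) = P(Z < 1.0)
print(round(normal_cdf(115, 100, 15), 4))  # 0.8413
```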
Often used to model the length of time between two occurrences of an event
(the time between arrivals)
Examples:
Time between trucks arriving at an unloading dock
Time between transactions at an ATM
Time between phone calls to the main operator
In larger samples, the sample means would tend to be even closer to μ. This fact is the basis for statistical
estimation.
How to make inferences about a population that take into account four factors:
• Sampling variation (uncontrollable).
• Population variation (uncontrollable).
• Sample size (controllable).
• Desired confidence in the estimate (controllable).
Estimators:
Estimator: - a statistic derived from a sample
- used to infer the value of a population parameter
- a random variable (random samples vary)
Estimate: the value of the estimator in a specific sample.
Sample proportion:
• p = x/n
○ x: number of successes in the sample
○ n: the sample size
• Parameter: π
Sampling error: difference between an estimate and the corresponding population parameter
Bias: difference between the expected value (i.e., the average value) of the estimator and the true parameter
• For the mean:
Bias = E(X̄) − μ
Unbiased estimator:
• The expected value of the estimator equals the parameter being estimated.
• The sample mean is an unbiased estimator of the population mean (for random samples).
• Neither overstates nor understates the true parameter on average.
• Can be studied mathematically or by simulation experiments.
▪ sample mean ( x )
▪ Sample proportion (p)
Sampling distribution:
• Probability distribution of all possible values of a statistic for a given size sample selected from a population
Efficiency:
• Refers to the variance of the estimator's sampling distribution
• A more efficient estimator has a smaller variance
Consistency
• The estimator converges toward the parameter being estimated as the sample size increases
Different samples (same size, from the same population) -> yield different sample means
X̄: sample mean
μ: population mean
❖ Standard Error of the mean: variability in the mean from sample to sample
○ Decrease when sample size increase
Central limit theorem --help us--> to approximate the shape of the sampling
distribution of X bar even when we don't know what the population looks like
=> (1 − α) = 0.95
α: significance level
(Interpretation) In the long run, 95% of all the confidence intervals that can be constructed will contain the unknown true parameter
A specific interval either will contain or will not contain the true parameter (no probability involved in a specific
interval)
○ Assumptions: population:
▪ Normally distributed (=> any sample size is okay)
▪ Not normal => use a large sample (n > 30)
❖ Sample proportion:
○ Distribution: approximately normal if the sample size is large
○ Standard deviation:
❖ Sample data:
○ where
• Zα/2 is the standard normal value for the level of confidence desired
• p is the sample proportion
• n is the sample size
○ Note: must have X = np > 5 and n – X = n(1-p) > 5 (to make sure that we can
use normal distribution to estimate)
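The interval p ± Zα/2 · sqrt(p(1 − p)/n), with the np > 5 check, as a sketch; the 40-out-of-100 counts are invented:

```python
from math import sqrt

def proportion_ci(x, n, z=1.96):
    """Confidence interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    # The normal approximation requires X = np > 5 and n - X > 5
    assert x > 5 and n - x > 5, "normal approximation not appropriate"
    p = x / n
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical: 40 successes in n = 100, 95% confidence (z = 1.96)
low, high = proportion_ci(40, 100)
print(round(low, 3), round(high, 3))  # 0.304 0.496
```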
Sample size for the mean:
- Must know:
• The desired level of confidence (1 − α), which determines the critical value, Zα/2
• The acceptable sampling error, e
• The standard deviation, σ
Sample size for the proportion:
- Must know:
• The desired level of confidence (1 − α), which determines the critical Z value
• The acceptable sampling error, e
• The true proportion of events of interest, π
○ π can be estimated with a pilot sample if necessary (or conservatively use 0.5 as an estimate of π)
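These requirements feed the sample-size formulas n = (Zα/2 σ / e)² for the mean and n = Z² π(1 − π) / e² for the proportion, each rounded up. A sketch with hypothetical inputs:

```python
from math import ceil

def n_for_mean(z, sigma, e):
    """Sample size for estimating a mean: (z * sigma / e)^2, rounded up."""
    return ceil((z * sigma / e) ** 2)

def n_for_proportion(z, e, pi=0.5):
    """Sample size for a proportion: z^2 * pi * (1 - pi) / e^2, rounded up.
    pi defaults to the conservative estimate 0.5."""
    return ceil(z ** 2 * pi * (1 - pi) / e ** 2)

# Hypothetical: 95% confidence (z = 1.96), sigma = 45, error e = 5
print(n_for_mean(1.96, 45, 5))       # 312
# Hypothetical: 95% confidence, error e = 0.03, conservative pi = 0.5
print(n_for_proportion(1.96, 0.03))  # 1068
```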
Population proportion:
The null hypothesis H0 (maintained hypothesis): states the assertion to be tested
The alternative hypothesis H1: the opposite of the null; what the researcher is trying to prove
D.f: n-1
- Decision Rule: If the test statistic falls in the rejection region, reject H0; otherwise do not
reject H0
-
- p is approximately normal
- An equivalent form (in terms of the number in the category of interest, X):
Pooled estimate:
➢ F critical: F table
➢ 2 d.f required:
- Numerator: larger sample variance (column in the F table)
- Denominator: smaller sample variance (row in the F table)
Purpose: compare more than two means simultaneously (with only two means it is a t-test)
Variation in Y about its mean is explained by one or more
categorical independent variables (the factors) or is unexplained (random error)
OVERVIEW
Each possible value of a factor or combination of factors is a treatment.
Test if each factor has a significant effect on Y:
H0: μ1 = μ2 = μ3
-> If the incomes of all students are the same -> Ȳ is not affected
-> If the incomes of all students are not the same -> Ȳ is affected
Testing hypotheses:
H0: A1 = A2 = A3 = … = Ac = 0
H1: Not all Aj are zero
HYPOTHESIS TESTING
SSA and SSE are used to test the hypothesis of equal means by dividing each sum of squares by its degrees of freedom
These ratios are called Mean Squares (MSA and MSE)
When F is near zero -> little difference among treatments -> not reject H0
Decision Rule: Reject H0 if F > Fa, otherwise do not reject
Tukey's Test is a two-tailed test for equality of paired means from c groups compared simultaneously
The hypotheses are:
Decision Rule
Tc,n-c is a critical value of the Tukey test statistic Tcalc for the desired level of significance
The test statistic is the ratio of the largest sample variance to the smallest
sample variance
b0 : estimated average value of y when the value of x is zero (if x = 0 is in the range of observed
x values)
• Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the
range of sizes observed, $98,248.33 is the portion of the house price not explained by square
feet
• Here, b1 = .10977 tells us that the value of a house increases by 0.10977 × ($1000) = $109.77, on average, for each additional square foot of size
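The fitted values above (b0 = 98.24833, b1 = 0.10977, with price in $1000s and size in square feet) turn into a one-line prediction sketch; the 2000-square-foot house is hypothetical:

```python
B0 = 98.24833  # intercept from the house-price example ($1000s)
B1 = 0.10977   # slope: extra $1000s per additional square foot

def predict_price(square_feet):
    """Estimated regression equation: y-hat = b0 + b1 * x."""
    return B0 + B1 * square_feet

price = predict_price(2000)  # in $1000s
print(f"${price * 1000:,.0f}")  # about $317,788
```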
• The regression equation can be used to predict a value for y, given a particular x
• For a specified value, xn+1, the predicted value is ŷ = b0 + b1·xn+1