Chap 1: Stat: Tasks: - Data Modeling
session 0 Page 1
Chap 2: data
Thursday, June 28, 2018 10:04 AM
Observation
Collecting data
Def:
- Determines the amount of information contained in the data
- Indicates the most appropriate data summarization and statistical analysis
Types: the way data are measured depends on the scale of measurement
➢ Nominal scale : category, classification data
- Data (labels, names) -> identify an attribute of the element
- Data can be coded numerically, but the codes have no numerical meaning
- Example:
• Students of a university are classified as Business, Humanities, Education, and so on.
- No ordering
➢ Ordinal scale: ranking (nominal data + order of data)
- Ex:
• Uni students: freshman, sophomore, junior/senior
- Ordering but no clear meaning to the distance between data
➢ Interval scale: ordinal data, plus:
- Differences between measurements -> meaningful
- No true zero value, so ratios have no meaning (e.g., 4 degrees is not "twice as hot" as 2 degrees)
➢ Ratio scale: interval data, plus:
- The ratio of two values is meaningful (e.g., A has twice as much money as B)
• Population
- collection of all items of interest or under investigation
- finite or infinite.
• Census: examination of all items in a defined population.
• Sample: observed subset of the population.
• Convenience Sample
• Sample whatever is readily available (e.g., ask co-workers' opinions at lunch).
• Judgment Sample
• Use expert knowledge to choose “typical” items (e.g., which employees to interview).
• Focus Groups
• In-depth dialog with a representative panel of individuals (e.g. iPhone users).
Simple random:
• Every member of the population has an equal chance of being selected
• Sample of a given size
• Selection: with replacement / without replacement
• The sample can be obtained using a table of random numbers or a computer random number generator
Stratified:
• Divide the population into subgroups (called strata) that share a common characteristic (e.g. age, gender, occupation)
• Select a simple random sample from each subgroup
• Combine the samples from the subgroups into one
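The with/without-replacement distinction can be sketched with Python's standard library; the population of 100 member IDs below is made up for illustration:

```python
import random

population = list(range(1, 101))  # hypothetical population: member IDs 1..100

random.seed(42)  # stands in for a table of random numbers

# Without replacement: each member can be selected at most once
sample_without = random.sample(population, k=10)

# With replacement: the same member may be drawn more than once
sample_with = random.choices(population, k=10)

print(sorted(sample_without))
print(sorted(sample_with))
```

`random.sample` implements selection without replacement; `random.choices` selects with replacement.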
Systematic:
Cluster:
Tabulating data: Summary table
Graphing data:
Bar + Pie Charts: qualitative data (categories or nominal scale)
❖ Purpose: to see differences between or among categories
- Height of bar => frequency
- Size of pie slice => percentage
Tips:
Bar:
- Numerical variable -> Y axis
- Category labels -> X axis
- Height/length proportional to the quantity displayed
Pie (%):
- Conveys a general idea
- Ineffective with too many slices
- Represents parts of a whole
- Needs relative frequencies to construct
Pareto:
- portray categorical data (nominal scale)
- Categories: descending order of frequency => the most common categories appear first
- Cumulative polygon included
○ Ex:
raw form
24, 26, 24, 21, 27, 27, 30, 41, 32, 38
ordered array from smallest to largest
21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Dot Plots:
• The simplest graphical display of n individual values of numerical data:
• Tool for data exploration => easy to understand
• Not good for large samples (e.g., > 5,000).
• Basic steps:
1. Make a scale -> cover data range
2. Mark axis demarcations and label them
Frequency distribution:
• A frequency distribution is a table listing the data groupings (classes) and the corresponding frequencies with which data fall within each grouping
X-axis: the class intervals
❖ Ogives (%):
○ Line graph of the cumulative frequencies
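The cumulative frequencies behind an ogive can be tabulated from the ordered array in the earlier example (21, 24, 24, 26, 27, 27, 30, 32, 38, 41); the class width of 5 is an assumption for illustration:

```python
data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
classes = [(20, 25), (25, 30), (30, 35), (35, 40), (40, 45)]  # assumed intervals

cumulative = []
running = 0
for lo, hi in classes:
    freq = sum(1 for x in data if lo <= x < hi)  # frequency within [lo, hi)
    running += freq
    # cumulative % = running count / n * 100; these points form the ogive
    cumulative.append((hi, running, 100 * running / len(data)))

for upper, cum_f, cum_pct in cumulative:
    print(f"< {upper}: cumulative frequency {cum_f} ({cum_pct:.0f}%)")
```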
Scatter Plots:
From <http://wps.prenhall.com/wps/grader>
4) Mode:
- Value that occurs most often
- Not affected by extreme values (applies to nominal scale)
- Used for either numerical or categorical (nominal) data
- There may be no mode or several modes
Median example: 1, 2, 7, 8, 9, 10
○ Position: (6+1)/2 = 3.5
○ Median: (7+8)/2 = 7.5
Descriptive Statistics:
Quartiles: split the ranked data into 4 segments with an equal number of values per segment
Problems:
A. Ignores the data distribution
Solution:
- Variance: average (approximately) of the squared deviations of values from the mean
S² = Σ(Xi − X̄)² / (n − 1)
where
X̄ = mean
n = sample size
Xi = ith value of the variable X
- Standard deviation:
• Used to measure variation (most common)
• Shows variation about the mean
• Same units as the original data
S: standard deviation
X̄: mean (average)
- Coefficient of Variation: CV = (S / X̄) × 100%
○ Measures relative variation
○ Always in percentage (%)
○ Shows variation relative to the mean
○ Can be used to compare two or more sets of data measured in different units
Ex: Both stocks have the same standard deviation, but stock B is less variable relative to its price, as its CV shows
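A minimal sketch of the CV comparison; the two price series are invented so that both stocks share the same standard deviation while B's mean price is higher:

```python
from statistics import mean, stdev

stock_a = [45, 50, 55]    # hypothetical prices, mean 50
stock_b = [95, 100, 105]  # hypothetical prices, mean 100

def cv(prices):
    """Coefficient of variation: (S / mean) * 100, in percent."""
    return stdev(prices) / mean(prices) * 100

# Same absolute variation (S = 5 for both) ...
print(stdev(stock_a), stdev(stock_b))
# ... but B varies less relative to its price level
print(f"CV A = {cv(stock_a):.1f}%, CV B = {cv(stock_b):.1f}%")
```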
A. Sensitive to outliers:
• Population variance:
σ² = Σ(Xi − μ)² / N
μ = population mean
N = population size
Chebyshev's Theorem
• Regardless of how the data are distributed, at least (1 − 1/k²) × 100% of the values will fall within k standard deviations of the mean (for k > 1)
- Ex:
k = 2: (1 − 1/2²) × 100% = 75% (μ ± kσ = μ ± 2σ)
Let μ = 72, σ = 8 (standard deviation)
=> At least 75% of the scores will be within the interval 72 ± 2(8), i.e., [56, 88]
(regardless of how the scores are distributed)
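The interval in the example can be computed directly from the theorem; a small sketch:

```python
def chebyshev_interval(mu, sigma, k):
    """Return (lower, upper, minimum fraction inside) per Chebyshev's theorem."""
    assert k > 1, "the theorem applies for k > 1"
    bound = 1 - 1 / k ** 2  # at least this fraction lies within mu +/- k*sigma
    return mu - k * sigma, mu + k * sigma, bound

low, high, frac = chebyshev_interval(mu=72, sigma=8, k=2)
print(f"At least {frac:.0%} of values lie in [{low}, {high}]")  # [56, 88]
```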
Z Scores:
• Measure distance from the mean
- Ex:
Z-score of 2.0: a value is 2.0 standard deviations from the mean
• Z score > 3.0 or < -3.0: an outlier
- Ex:
Mean: 14.0 Standard Deviation: 3.0
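Using the mean (14.0) and standard deviation (3.0) from this example, a z-score sketch; the data values plugged in are hypothetical:

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd

def is_outlier(x, mean, sd):
    """Flag values whose z-score is beyond +/- 3.0."""
    return abs(z_score(x, mean, sd)) > 3.0

print(z_score(18.5, 14.0, 3.0))     # 1.5: within the usual range
print(is_outlier(24.5, 14.0, 3.0))  # True: z = 3.5
```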
Weighted mean:
- A sum in which each data value is given a weight wj representing a fraction of the total (i.e., the k weights must sum to 1).
- Ex:
Your instructor gives
▪ a weight of 30 percent to homework, 20 percent to the midterm exam, 40
percent to the final exam, and 10 percent to a term project (so that .30 + .20
+ .40 + .10 = 1.00).
▪ your scores on these were 85, 68, 78, and 90. Your weighted average for
the course would be:
= (0.3 x 85) + (0.2 x 68) + (0.4 x 78) + (0.1 x 90) = 79.3
Application:
- Accounting (weights for cost categories),
- finance (asset weights in investment portfolios)
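The course-grade example above can be checked with a short sketch:

```python
def weighted_mean(values, weights):
    """Sum of w_j * x_j, where the k weights must sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * x for w, x in zip(weights, values))

scores = [85, 68, 78, 90]           # homework, midterm, final, project
weights = [0.30, 0.20, 0.40, 0.10]  # .30 + .20 + .40 + .10 = 1.00

print(round(weighted_mean(scores, weights), 2))  # 79.3
```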
f: frequency
m: midpoint
N or n: size
❖ Variance:
The Covariance:
- measures the strength of the linear relationship between two variables
- Interpreting:
- Coefficient of Correlation:
○ Measures the relative strength of the linear relationship between two variables
○ Sample coefficient of correlation
○ Features:
• Unit free
• Ranges between −1 and 1
• The closer to −1, the stronger the negative linear relationship
• The closer to +1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
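A sketch of sample covariance and the coefficient of correlation from their definitions; the two series are made-up numbers chosen for an exact linear relationship:

```python
from math import sqrt

def sample_cov(x, y):
    """Sample covariance: sum of (xi - xbar)(yi - ybar) / (n - 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Coefficient of correlation: cov(x, y) / (Sx * Sy); unit-free, in [-1, 1]."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]      # y = 2x: perfect positive linear relationship
print(sample_corr(x, y))  # 1.0
```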
4 probability Page 33
• Events A and B are collectively exhaustive and also mutually exclusive
• Conditional Probability:
- Based on Contingency table
- The probability of event A given that event B has occurred
• Independent Event: the probability of one event is not affected by the fact that the other
event has occurred
• Multiplication Rules:
If A,B: independent
• Bayes'Theorem
- revise previously calculated probabilities based on new information.
- an extension of conditional probability.
A: machine A
B: the machine is defective
Assessing probability
• Empirical (relative frequency approach): estimated through observation and experiment
- P(f) = f/n
- f: the frequency of observed outcomes defined in our experimental sample space
- n: number of observations
P(a missed scan) =number of missed scans / number of items scanned
• Classical: the probability is known before observing the event or doing the experiment
- 50% chance of heads on coin flip
• Subjective: judgment about the likelihood of an event
- needed when there is no repeatable random experiment
Counting rules
Friday, July 13, 2018 9:43 AM
- Example: If you roll a fair die 3 times then there are 6³ = 216 possible outcomes
- k=6: mutually exclusive, collectively exhaustive events
- n=3: trials
3)
4) Permutations
5) Combinations
- Ways of selecting X objects from n objects, without regard to order
- Example:
○ You have five books and are going to randomly select three to read. How many different
combinations of books might you select?
Answer
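The book-selection answer can be verified with the standard library:

```python
from math import comb, factorial, perm

# Combinations: choose 3 of 5 books, order irrelevant
print(comb(5, 3))  # 10

# Same count from the formula n! / (x! * (n - x)!)
assert comb(5, 3) == factorial(5) // (factorial(3) * factorial(2))

# For contrast, counting ordered selections gives permutations
print(perm(5, 3))  # 60
```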
Thursday, August 23, 2018 10:58 PM
Experiment:
Toss 2 coins. Let X = # heads.
E(X) = (0 × 0.25) + (1 × 0.50) + (2 × 0.25) = 1.0
Variance
Standard Deviation
E(aX + b) = a·E(X) + b (a, b constants)
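The two-coin example and the rule E(aX + b) = a·E(X) + b can be checked numerically; a = 3, b = 2 are arbitrary constants chosen for the check:

```python
# Distribution of X = number of heads in 2 coin tosses
dist = {0: 0.25, 1: 0.50, 2: 0.25}

def expectation(d):
    """E(X) = sum of x * P(x)."""
    return sum(x * p for x, p in d.items())

def variance(d):
    """Var(X) = sum of (x - E(X))^2 * P(x)."""
    mu = expectation(d)
    return sum((x - mu) ** 2 * p for x, p in d.items())

print(expectation(dist), variance(dist))  # 1.0 0.5

# Linearity: E(aX + b) = a*E(X) + b
shifted = {3 * x + 2: p for x, p in dist.items()}
assert expectation(shifted) == 3 * expectation(dist) + 2
```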
Uniform distribution:
• Random variables: finite number of integer value from a to b
• Depends: a&b
○ a: lower limit
○ b: upper limit
• Equal probability for every value of X (each value is equally likely to occur)
Bernoulli:
• Experiment has 2 outcomes
• Success (X=1), failure (X=0)
○ "Success" is usually defined as the less likely outcome, so that π < 0.5 for convenience
Probability of success: π
Probability of failure: 1 − π
Mean: E(X) = π
Variance: π(1 − π)
Mean
Standard deviation
Hypergeometric distribution:
• Similar: binomial but
- "n" trials from a finite population
- Sample taken without replacement
- Outcomes of trials: dependent
PDF
Where
N = population size
A = number of items of interest in the population
N − A = number of items not of interest in the population
n = sample size
x = number of items of interest in the sample
n − x = number of items not of interest in the sample
Mean
Standard deviation
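A sketch of the hypergeometric PDF built from its definition, P(X = x) = C(A, x) C(N − A, n − x) / C(N, n); the 10-item population with 4 items of interest is invented:

```python
from math import comb

def hypergeom_pmf(x, N, A, n):
    """P(X = x) when n items are drawn without replacement from a
    population of N that contains A items of interest."""
    return comb(A, x) * comb(N - A, n - x) / comb(N, n)

# Hypothetical: N = 10 items, A = 4 of interest, sample n = 3
p1 = hypergeom_pmf(1, N=10, A=4, n=3)
print(p1)  # 0.5

# The mean matches n * A / N
mean = sum(x * hypergeom_pmf(x, 10, 4, 3) for x in range(4))
assert abs(mean - 3 * 4 / 10) < 1e-12
```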
Geometric: -> Bernoulli trials (not a fixed number) repeated until the first success
-> Depends on π
Poisson:
PDF where:
x = number of events in an area of opportunity
λ = expected number of events (average number of events per unit)
e = base of the natural logarithm system (2.71828...)
Mean
S.D
These can potentially take on any value, depending only on the ability to
measure accurately
Example: your weight. On an ordinary scale it reads 50, but on a more precise scale it may read 50.2
The uniform distribution is a probability distribution that has equal probabilities for all possible outcomes of the random
variable
Also called a rectangular distribution
If X is a random variable that is uniformly distributed between a and b, its PDF has constant height 1/(b − a)
• Area = base × height = (b − a) × 1/(b − a) = 1
SUMMARY
Bell Shaped
Symmetrical
Mean = Median = Mode
The probability for a range of values is measured by the area under the curve
Need to transform X units into Z units by subtracting the mean of X and dividing
by its standard deviation
For a given Z-value a, the table shows F(a) (the area under the curve from −∞ to a)
For negative Z-values, use the fact that the distribution is symmetric to find the
needed probability
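The transform-and-look-up steps can be sketched in code; here `phi` plays the role of the cumulative table F(a), implemented with the standard library's `erf` rather than a printed table:

```python
from math import erf, sqrt

def phi(z):
    """F(z): area under the standard normal curve from -infinity to z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(x, mu, sigma):
    """Transform X units into Z units, then take the area up to z."""
    z = (x - mu) / sigma
    return phi(z)

# Symmetry for negative Z-values: F(-a) = 1 - F(a)
assert abs(phi(-1.0) - (1 - phi(1.0))) < 1e-12

# Hypothetical X ~ N(100, 15): P(X < 115) = P(Z < 1.0)
print(round(normal_cdf(115, 100, 15), 4))  # 0.8413
```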
Often used to model the length of time between two occurrences of an event
(the time between arrivals)
Examples:
Time between trucks arriving at an unloading dock
Time between transactions at an ATM
Time between phone calls to the main operator
In larger samples, the sample means would tend to be even closer to μ. This fact is the basis for statistical
estimation.
How to make inferences about a population that take into account four factors:
• Sampling variation (uncontrollable).
• Population variation (uncontrollable).
• Sample size (controllable).
• Desired confidence in the estimate (controllable).
Estimators:
Estimator: - a statistic derived from a sample
- used to infer the value of a population parameter
- a random variable (random samples vary)
Estimate: the value of the estimator in a specific sample.
Sample proportion:
• p = x/n
○ x: number of successes in the sample
○ n: the sample size
• Parameter: π
Sampling error: difference between an estimate and the corresponding population parameter
Bias: difference between the expected value (i.e., the average value) of the estimator and the true parameter
• For the mean:
Bias = E(X̄) − μ
Unbiased estimator:
• The expected value of the estimator equals the parameter being estimated.
• The sample mean is an unbiased estimator of the population mean (for random samples).
• Neither overstates nor understates the true parameter on average.
• Can be studied mathematically or by simulation experiments.
▪ sample mean ( x )
▪ Sample proportion (p)
Sampling distribution:
• Probability distribution of all possible values of a statistic for a given size sample selected from a population
Efficiency:
• Refers to the variance of the estimator's sampling distribution
• A more efficient estimator has a smaller variance
Consistency
• The estimator converges toward the parameter being estimated as the sample size increases
Different samples (same size, from the same population) -> yield different sample means
X̄: sample mean
μ: population mean
❖ Standard Error of the mean: variability in the mean from sample to sample
○ Decrease when sample size increase
Central limit theorem --help us--> to approximate the shape of the sampling
distribution of X bar even when we don't know what the population looks like
=> (1 − α) = 0.95
α: significance level
(Interpretation) In the long run, 95% of all the confidence intervals that can be constructed will contain the unknown true parameter
A specific interval either will contain or will not contain the true parameter (no probability involved in a specific
interval)
○ Assumptions: population:
▪ Normally distributed (=> any sample size is okay)
▪ Not normal => use a large sample (n > 30)
❖ Sample proportion:
○ Distribution: approximately normal if the sample size is large
○ Standard deviation:
❖ Sample data:
○ where
• Zα/2 is the standard normal value for the level of confidence desired
• p is the sample proportion
• n is the sample size
○ Note: must have X = np > 5 and n – X = n(1-p) > 5 (to make sure that we can
use normal distribution to estimate)
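The interval p ± Zα/2 · sqrt(p(1 − p)/n), with the np > 5 check, as a sketch; the 40-out-of-100 counts are invented:

```python
from math import sqrt

def proportion_ci(x, n, z=1.96):
    """Confidence interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    # The normal approximation requires X = np > 5 and n - X > 5
    assert x > 5 and n - x > 5, "normal approximation not appropriate"
    p = x / n
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical: 40 successes in n = 100, 95% confidence (z = 1.96)
low, high = proportion_ci(40, 100)
print(round(low, 3), round(high, 3))  # 0.304 0.496
```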
Sample size for the mean:
- Must know:
• The desired level of confidence (1 − α), which determines the critical value, Zα/2
• The acceptable sampling error, e
• The standard deviation, σ
Sample size for the proportion:
- Must know:
• The desired level of confidence (1 − α), which determines the critical Z value
• The acceptable sampling error, e
• The true proportion of events of interest, π
○ π can be estimated with a pilot sample if necessary (or conservatively use 0.5 as an estimate of π)
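These requirements feed the sample-size formulas n = (Zα/2 σ / e)² for the mean and n = Z² π(1 − π) / e² for the proportion, each rounded up. A sketch with hypothetical inputs:

```python
from math import ceil

def n_for_mean(z, sigma, e):
    """Sample size for estimating a mean: (z * sigma / e)^2, rounded up."""
    return ceil((z * sigma / e) ** 2)

def n_for_proportion(z, e, pi=0.5):
    """Sample size for a proportion: z^2 * pi * (1 - pi) / e^2, rounded up.
    pi defaults to the conservative estimate 0.5."""
    return ceil(z ** 2 * pi * (1 - pi) / e ** 2)

# Hypothetical: 95% confidence (z = 1.96), sigma = 45, error e = 5
print(n_for_mean(1.96, 45, 5))       # 312
# Hypothetical: 95% confidence, error e = 0.03, conservative pi = 0.5
print(n_for_proportion(1.96, 0.03))  # 1068
```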
Population proportion:
The null hypothesis H0 (maintained hypothesis): states the assertion to be tested
The alternative hypothesis H1: the opposite of the null; what the researcher is trying to prove
D.f: n-1
- Decision Rule: If the test statistic falls in the rejection region, reject H0; otherwise do not
reject H0
-
- p is approximately normal
- An equivalent form (in terms of the number in the category of interest, X):
Pooled estimate:
➢ F critical: F table
➢ 2 d.f required:
- Numerator: larger sample variance (column in the F table)
- Denominator: smaller sample variance (row in the F table)
Purpose: compare more than two means simultaneously (with only two means it is a t-test)
Variation in Y about its mean is explained by one or more
categorical independent variables (the factors) or is unexplained (random error)
OVERVIEW
Each possible value of a factor or combination of factors is a treatment.
Test if each factor has a significant effect on Y:
H0: μ1 = μ2 = μ3
-> If the incomes of all students are the same -> Ȳ is not affected
-> If the incomes of all students are not the same -> Ȳ is affected
Testing hypotheses:
H0: A1 = A2 = A3 = … = Ac = 0
H1: Not all Aj are zero
HYPOTHESIS TESTING
SSA and SSE are used to test the hypothesis of equal means by dividing each sum of squares by its degrees of freedom
These ratios are called Mean Squares (MSA and MSE)
When F is near zero -> little difference among treatments -> not reject H0
Decision Rule: Reject H0 if F > Fa, otherwise do not reject
Tukey's Test is a two-tailed test for equality of paired means from c groups compared simultaneously
The hypotheses are:
Decision Rule
Tc,n-c is a critical value of the Tukey test statistic Tcalc for the desired level of significance
The test statistic is the ratio of the largest sample variance to the smallest
sample variance
b0 : estimated average value of y when the value of x is zero (if x = 0 is in the range of observed
x values)
• Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the
range of sizes observed, $98,248.33 is the portion of the house price not explained by square
feet
• Here, b1 = .10977 tells us that the value of a house increases by 0.10977 × ($1000) = $109.77, on average, for each additional square foot of size
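The fitted values above (b0 = 98.24833, b1 = 0.10977, with price in $1000s and size in square feet) turn into a one-line prediction sketch; the 2000-square-foot house is hypothetical:

```python
B0 = 98.24833  # intercept from the house-price example ($1000s)
B1 = 0.10977   # slope: extra $1000s per additional square foot

def predict_price(square_feet):
    """Estimated regression equation: y-hat = b0 + b1 * x."""
    return B0 + B1 * square_feet

price = predict_price(2000)  # in $1000s
print(f"${price * 1000:,.0f}")  # about $317,788
```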
• The regression equation can be used to predict a value for y, given a particular x
• For a specified value, xn+1, the predicted value is ŷ = b0 + b1·xn+1