
1. Overview: Procedures for Statistical Analysis
a. Collect Data → Exploratory Data Analysis → Pre-Processing → Normality →
Hypothesis Testing
b. Pre-processing: needed when the normality assumption must hold. 1) Delete outliers?
2) Transformation (loses interpretability of the data). 3) Scaling: standardization
(z-score equation), which addresses the scale problem. 4) Imputation (for N/A values).
2. What is the difference between a parameter and a statistic?
a. Parameter: describes a population, e.g. µ (mu, population mean) and σ (population
standard deviation)
b. Statistic: describes a sample, e.g. x̄ (x bar, sample mean) and s (sample standard
deviation)
3. Explain the two types of variables in detail. (Describe the 4 levels of measurement.)
a. Categorical is a non-numeric observation that is never used directly in calculations
(Nominal, ordinal)
b. Quantitative is a numerical observation that is used directly in calculations.
(Interval, ratio)
4. Types of probability-based sampling methods
a. Simple Random Sampling, Stratified, and Cluster
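A minimal sketch of the three sampling methods, assuming a made-up population of 12 units split into two groups. The population, group labels, sample sizes, and cluster size are all hypothetical and chosen only for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 12 units, each tagged with a group label.
population = [(i, "A" if i < 6 else "B") for i in range(12)]

# Simple random sampling: every unit has the same chance of being selected.
srs = random.sample(population, 4)

# Stratified sampling: split into strata, then sample within each stratum.
strata = {}
for unit in population:
    strata.setdefault(unit[1], []).append(unit)
stratified = [u for group in strata.values() for u in random.sample(group, 2)]

# Cluster sampling: pick whole clusters at random and keep every unit in them.
clusters = [population[i:i + 3] for i in range(0, 12, 3)]
cluster_sample = [u for c in random.sample(clusters, 2) for u in c]
```

Stratified sampling guarantees that both groups appear in the sample, while cluster sampling keeps or drops whole blocks of units at once.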
5. Examples for a stem and leaf plot (I know this)
6. What is an outlier? And how could we treat the outlier?
a. An outlier is an extreme value that does not fit in with the rest of its data. It
usually lies either below the lower fence or above the upper fence (the fences are
commonly set at Q1 - 1.5 x IQR and Q3 + 1.5 x IQR). We can treat an outlier by
collecting more data so that the values are spread out more, which may lower or
raise the fence values (doing so changes the entire data set). We can also delete
the outlier from our data values.
7. Explain the 5 number summary in detail.
a. The five-number summary includes the Min, Q1, Median, Q3, Max.
i. The min is the minimum (lowest) value of the data set. It typically falls
within the fences. If it falls below the lower fence it is an outlier, and
the smallest value inside the fences is then used as the minimum.
ii. Q1, or quartile 1, is the median of the lower (left) half of the data set
(excluding Q2).
iii. The median, or Q2, is the middle value of the data set.
iv. Q3, or quartile 3, is the median of the upper (right) half of the data set
(excluding Q2).
v. The max is the maximum (highest) value of the data set. It typically falls
within the fences. If it falls above the upper fence it is an outlier, and
the largest value inside the fences is then used as the maximum.
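The five-number summary can be sketched as follows, assuming the median-of-halves rule described above (Q2 is left out of both halves when the number of values is odd). The function name and the scores are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # left half; excludes Q2 when n is odd
    upper = data[half + (n % 2):]  # right half; excludes Q2 when n is odd
    return data[0], median(lower), median(data), median(upper), data[-1]

scores = [52, 55, 61, 64, 68, 70, 73, 77, 85]  # hypothetical data
summary = five_number_summary(scores)
```

Other software may use a different quartile rule, so Q1 and Q3 can differ slightly from these values on the same data.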
8. How to define an outlier in the data?
a. The easiest way to define an outlier is to look at the data values and see whether
one value is very different from the rest. For example, in a data set of values in
the 50s, 60s, 70s, and 80s, a value in the 100s is most likely an outlier.
b. We can also calculate the upper and lower fences and compare them with the
data set values. If the upper fence is 90 and the max value in the data set is 91
or more, then that value is an outlier.
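The fence check can be sketched like this, assuming the common 1.5 x IQR convention for the fences (the notes do not say which rule is used) and the median-of-halves quartiles. The function name and the data are hypothetical.

```python
from statistics import median

def fences(values):
    """Return (lower fence, upper fence) using the assumed 1.5 x IQR rule."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[: n // 2])          # median of the lower half
    q3 = median(data[n // 2 + n % 2 :])  # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

scores = [52, 55, 61, 64, 68, 70, 73, 77, 85, 140]  # hypothetical data
lower, upper = fences(scores)
outliers = [x for x in scores if x < lower or x > upper]
```

With these made-up scores the only value outside the fences is 140, which matches what a quick look at the data suggests.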
9. Explain variance and standard deviation in detail. If we doubled the data, does the
variance change?
a. Variance is a measure of variability; it tells you the degree of spread in your data.
It is the average squared distance from the mean, which equals the standard
deviation squared. Standard deviation is the square root of the variance and
measures the typical distance from the mean; it is always positive or zero. The
population standard deviation is denoted by sigma (σ).
b. If every data value is doubled, the variance changes. The standard deviation
doubles, so the variance becomes 4 times larger. For example, if the variance of
a data set is 9^2 = 81 and we double every value, the variance becomes
18^2 = 324, which is 4 times that of the original data set.
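A quick numerical check of the doubling claim, using a small hypothetical data set and the population formulas from Python's statistics module:

```python
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical data
doubled = [2 * x for x in data]     # every value multiplied by 2

sd, var = pstdev(data), pvariance(data)
sd2, var2 = pstdev(doubled), pvariance(doubled)

# Doubling every value doubles the standard deviation
# and multiplies the variance by 2^2 = 4.
print(sd, var, sd2, var2)
```

The same pattern holds for any multiplier c: the standard deviation is multiplied by |c| and the variance by c^2.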
10. What is a standardized score (equation and explanation needed)?
a. The standardized score is the z-score, which tells us how many standard
deviations a data point is from the mean, and so where it is located in relation
to the mean. Z-score equation: z_i = (x_i - x̄) / s
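A minimal sketch of standardization, assuming the sample standard deviation s is used in the denominator. The function name and the scores are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize each value: subtract the sample mean, divide by sample SD."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(x - x_bar) / s for x in values]

scores = [60, 70, 80, 90, 100]  # hypothetical data
z = z_scores(scores)
```

After standardization the values have mean 0 and standard deviation 1, which is why z-scores are useful for putting variables on a common scale.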
11. What is the empirical rule?
a. The empirical rule is the three-sigma rule, or the 68-95-99.7 rule. It is used in
relation to the standard deviation of an approximately normal data set. About
68% of the data falls within 1 standard deviation of the mean, about 95% falls
within 2 standard deviations of the mean, and about 99.7% falls within 3
standard deviations of the mean.
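The three percentages can be checked against the standard normal distribution. This sketch uses statistics.NormalDist from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability of landing within k standard deviations of the mean, k = 1, 2, 3.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} SD: {p:.4f}")
```

The exact normal probabilities are slightly different from the rounded 68, 95, and 99.7 figures, which is why the rule is stated with "about".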
12. Write down properties of a normal distribution.
a. The mean, mode, and median are all the same.
b. A normal distribution is a symmetric, bell-shaped curve.
c. Characterized by the mean and standard deviation: N(µ, σ²)
13. What is the central limit theorem?
a. The central limit theorem states that the sampling distribution of the sample
mean x̄ will approximately follow a normal distribution as long as the sample size
is large enough (rule of thumb: n is greater than or equal to 30), even when the
population itself is not normal. Notation: x̄ ~ N(µ, σ²/n)
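A small simulation can illustrate the theorem. This is only a sketch: it assumes an exponential population with µ = 1 and σ² = 1 (chosen because it is clearly skewed), n = 30, and 5000 repeated samples, none of which come from the notes.

```python
import random
from statistics import mean, pvariance

random.seed(1)        # fixed seed so the sketch is repeatable
n, reps = 30, 5000    # sample size and number of repeated samples (assumed)

# Draw many samples from a skewed Exponential(1) population
# and record the mean of each sample.
sample_means = [
    mean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(reps)
]

center = mean(sample_means)       # should be close to mu = 1
spread = pvariance(sample_means)  # should be close to sigma^2 / n = 1/30
```

A histogram of sample_means would look roughly bell-shaped even though the population is skewed, and its center and spread should be near µ and σ²/n.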
14. Rules of probability
a. Probability is always a number between 0 and 1
b. Recall P(E) = (number of outcomes in event E) / (number of all possible outcomes),
when all outcomes are equally likely
c. A complete set of probabilities always adds up to 1
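The three rules can be checked on a single fair die, assuming all six faces are equally likely. The helper name prob is hypothetical.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]  # one fair die, outcomes assumed equally likely

def prob(event):
    """P(E) = outcomes in E / all possible outcomes."""
    favorable = sum(1 for o in outcomes if event(o))
    return Fraction(favorable, len(outcomes))

p_even = prob(lambda o: o % 2 == 0)                       # P(roll is even)
p_each = [prob(lambda o, k=k: o == k) for k in outcomes]  # P(roll = k) for each k
```

Every probability here lies between 0 and 1, and the six single-face probabilities add up to exactly 1.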
15. What is the meaning of independent in probability?
a. Independent means that the occurrence of one event does not change the
probability of another event. If two events A and B are independent, they do
not affect each other, and P(A and B) = P(A) x P(B).
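Independence can be tested with the product rule P(A and B) = P(A) x P(B). The sketch below assumes two fair dice; the events a, b, and c are hypothetical examples.

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))  # two fair dice, 36 outcomes

def prob(event):
    """Probability of an event over the equally likely 36 outcomes."""
    return Fraction(sum(1 for o in space if event(o)), len(space))

a = lambda o: o[0] % 2 == 0      # first die is even
b = lambda o: o[1] >= 5          # second die shows 5 or 6
c = lambda o: o[0] + o[1] >= 10  # total is at least 10

# A and B involve different dice, so the product rule should hold.
independent_ab = prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)
# C depends on the first die, so A and C should fail the product rule.
independent_ac = prob(lambda o: a(o) and c(o)) == prob(a) * prob(c)
```

Here A and B are independent, while A and C are not: knowing that the first die is even changes the chance that the total reaches 10.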
16. Brief example of a 2-way contingency table and explain
a. We know this! (The joint probabilities in a 2-way contingency table always add up
to 1)
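As a brief illustration, here is a sketch of a 2-way table built from made-up counts. The group labels (A, B), the answers (yes, no), and every count are hypothetical.

```python
from fractions import Fraction

# Hypothetical counts: rows are groups (A, B), columns are answers (yes, no).
counts = {("A", "yes"): 20, ("A", "no"): 30,
          ("B", "yes"): 10, ("B", "no"): 40}
total = sum(counts.values())

# Joint probabilities: each cell count divided by the grand total.
joint = {cell: Fraction(n, total) for cell, n in counts.items()}

# Marginal probabilities: add the joint probabilities across a row or a column.
p_row = {r: sum(p for (row, _), p in joint.items() if row == r) for r in ("A", "B")}
p_col = {c: sum(p for (_, col), p in joint.items() if col == c) for c in ("yes", "no")}
```

The four joint probabilities sum to 1, and so do the row marginals and the column marginals.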
17. What is a random variable? And how to denote it?
a. A random variable is a function whose numerical values depend on the
outcomes of a random process. Random variables can be dependent on or
independent of one another. A random variable is denoted by a capital letter
such as X, Y, or Z.
18. Steps to hypothesis testing.
a. 1) State the null and the alternative hypothesis
b. 2) State the significance level = alpha
c. 3) Compute test statistic and p-value (or use the C.I. approach)
d. 4) Reject or do not reject the null
e. 5) Draw conclusions as to why you reject or do not reject the null
19. Approaches for hypothesis testing.
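a. Test statistic vs. critical value: do not reject the null if |Z| < |Zalpha/2|
b. P-value vs. significance level: reject the null if the p-value < alpha
c. C.I. approach: for H0: mu = 3 vs. Ha: mu is not = 3, reject the null if 3 falls
outside the confidence interval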
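The three approaches can be compared on one example. This is a sketch of a two-sided one-sample z-test, assuming a known population standard deviation; the sample mean, sigma, n, and alpha below are hypothetical numbers, and only the hypotheses mu = 3 vs. mu not = 3 come from the notes.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs: H0: mu = 3 vs. Ha: mu != 3, sigma assumed known.
mu0, x_bar, sigma, n, alpha = 3.0, 3.4, 1.0, 36, 0.05
std_error = sigma / sqrt(n)

# Approach a: test statistic vs. critical value.
z = (x_bar - mu0) / std_error
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
reject_by_critical_value = abs(z) > z_crit

# Approach b: two-sided p-value vs. significance level.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
reject_by_p_value = p_value < alpha

# Approach c: reject if mu0 lies outside the (1 - alpha) x 100% interval.
ci = (x_bar - z_crit * std_error, x_bar + z_crit * std_error)
reject_by_ci = not (ci[0] <= mu0 <= ci[1])
```

For a two-sided test at the same alpha, the three approaches always reach the same decision; they are three views of one comparison.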
20. How to interpret the (1 - alpha) x 100% confidence interval
21. What is the purpose of SLR (simple linear regression)?
a. To model the relationship between two variables (explanatory and response)
22. What is the population model for SLR?
a. Y = alpha + BX + E, E ~ N(0, σ²)
i. Y = response (dependent r.v.), alpha = intercept, B = slope, X = explanatory
(independent) variable, E = random error, N(0, σ²) = normal distribution
with mean 0 and variance σ²
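To make the population model concrete, the sketch below simulates data from a line with normally distributed errors and then recovers the intercept and slope by ordinary least squares. The least-squares formulas are not covered in these notes, and the true intercept, slope, error standard deviation, and x values are all hypothetical.

```python
import random

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical population values: intercept, slope, and error SD.
alpha_true, beta_true, sigma = 1.0, 2.0, 0.5
x = [i / 10 for i in range(200)]
y = [alpha_true + beta_true * xi + random.gauss(0, sigma) for xi in x]

# Ordinary least-squares estimates of the slope and intercept.
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sxy / sxx
alpha_hat = y_bar - beta_hat * x_bar
```

The estimates alpha_hat and beta_hat are sample statistics, so they land near the population values 1.0 and 2.0 but will not match them exactly.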
