   a. Collect Data → Exploratory Data Analysis → Pre-Processing → Normality → Hypothesis Testing
   b. Pre-processing: needed for the normality assumption
      1) Delete outliers?
      2) Transformation (loses interpretability of the data)
      3) Scaling: standardization (z-score equation)
      4) Imputation (N/A values), scale problems
2. What is the difference between a parameter and a statistic?
   a. Parameter: describes a population, e.g. µ (mu, mean) and σ (sigma, standard deviation)
   b. Statistic: describes a sample, e.g. x̄ (x bar, mean) and s (standard deviation)
3. Explain the two types of variables in detail. (Describe the 4 levels of measurement.)
   a. Categorical: a non-numeric observation that is never used directly in calculations. (Nominal, ordinal)
   b. Quantitative: a numerical observation that is used directly in calculations. (Interval, ratio)
4. Types of probability-based sampling methods
   a. Simple random sampling, stratified sampling, and cluster sampling
5. Examples for a stem-and-leaf plot (I know this)
6. What is an outlier? And how could we treat the outlier?
   a. An outlier is an extreme value that does not fit in with the rest of the data; it is usually either below the lower fence or above the upper fence. We can treat an outlier by transforming the data, which changes the spread and therefore moves the fences (doing so changes the entire data set). We can also delete the outlier from the data.
7. Explain the five-number summary in detail.
   a. The five-number summary consists of the Min, Q1, Median, Q3, and Max.
      i. The min is the lowest value of the data set, and it typically falls within the fences. However, it can be an outlier; if so, the smallest value that is still inside the fences is used as a second minimum (for example, as the end of the whisker in a box plot).
      ii. Q1, or quartile 1, is the median of the lower (left) half of the data set (excluding Q2).
      iii. The median, or Q2, is the middle value of the ordered data set.
      iv. Q3, or quartile 3, is the median of the upper (right) half of the data set (excluding Q2).
      v. The max is the highest value of the data set, and it typically falls within the fences. If it does not, it is considered an outlier, and the largest value that is still inside the fences is used as a second maximum.
8. How to define an outlier in the data?
   a. The easiest way is to look at the data values and see whether one value is very different from the rest. For example, in a data set made up of values in the 50s, 60s, 70s, and 80s, a value in the 100s is most likely an outlier.
   b. We can also calculate the lower and upper fences (usually Q1 - 1.5 × IQR and Q3 + 1.5 × IQR) and compare them with the data values. If the upper fence is 90 and the max value in the data set is 91 or more, then that value is an outlier.
9. Explain variance and standard deviation in detail. If we double the data, does the variance change?
   a. Variance is a measure of variability; it tells you the degree of spread in your data. It is the standard deviation squared. The standard deviation is, roughly, the typical distance of the data values from the mean; it is always positive or zero. The population standard deviation is denoted by sigma (σ), and the sample standard deviation by s.
   b. Yes. If every data value is doubled, the standard deviation doubles and the variance is multiplied by 4. For example, if the variance of a data set is 9^2 = 81 and we double the data, the variance becomes 18^2 = 324, which is 4 times the variance of the original data set.
10. What is a standardized score (equation and explanation needed)?
    a. The standardized score is the z-score: it measures how many standard deviations a data value is from the mean, so it tells us where a data point is located in relation to the mean. Z-score equation: zi = (xi - x̄) / s
11. What is the empirical rule?
    a. The empirical rule is the three-sigma rule, or the 68-95-99.7 rule. It describes approximately normal data in terms of the standard deviation of the data set.
       About 68% of the data set falls within 1 standard deviation of the mean, about 95% falls within 2 standard deviations of the mean, and about 99.7% falls within 3 standard deviations of the mean.
12. Write down the properties of a normal distribution.
    a. The mean, mode, and median are all the same.
    b. A normal distribution is symmetric, with a bell-shaped curve.
    c. It is characterized by the mean and the variance (equivalently, the standard deviation): N(µ, σ²)
13. What is the central limit theorem?
    a. The central limit theorem states that the sampling distribution of the sample mean is approximately normal as long as the sample size is large enough (rule of thumb: n is greater than or equal to 30). Notation: x̄ ~ N(µ, σ²/n)
14. Rules of probability
    a. A probability is always a number between 0 and 1.
    b. Recall, for equally likely outcomes: P(E) = (number of outcomes in event E) / (number of all possible outcomes)
    c. The probabilities of all possible outcomes always add up to 1.
15. What is the meaning of independent in probability?
    a. Independent means that the occurrence of one event does not change the probability of another event. If two events A and B are independent, they do not affect each other, and P(A and B) = P(A) × P(B).
16. Brief example of a 2-way contingency table, with explanation
    a. We know this! (The joint probabilities in a 2-way contingency table always add up to 1.)
17. What is a random variable? And how to denote it?
    a. A random variable is a function whose numerical values depend on the outcomes of a random process. It is denoted by a capital letter such as X, Y, or Z. A random variable can be dependent or independent.
18. Steps of hypothesis testing
    a. 1) State the null and the alternative hypotheses.
    b. 2) State the significance level, α.
    c. 3) Compute the test statistic and p-value (or use the C.I. approach).
    d. 4) Reject or do not reject the null.
    e. 5) Draw a conclusion and explain why you reject or do not reject the null.
19. Approaches for hypothesis testing
    a. Test statistic vs. critical value: do not reject the null when |Z| < |Z(α/2)|
    b. P-value vs. significance level: reject the null when the p-value is less than α
    c. C.I. approach, e.g. null: µ = 3 vs. alternative: µ is not = 3; reject the null when 3 falls outside the confidence interval
20. How to interpret (1 - α) × 100
21. What is the purpose of the SLR (simple linear regression)?
    a. To model the linear relationship between two variables (explanatory and response)
22. What is the population model for SLR?
    a. Y = α + βX + ε, ε ~ N(0, σ²)
       i. Y = dependent r.v. (response), α = intercept, β = slope, X = independent variable (explanatory), ε = error, N(0, σ²) = normal distribution with mean 0 and variance σ²
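
A small simulation can make the population model in item 22 more concrete. The sketch below generates data from a line plus normally distributed errors with mean 0, then recovers the intercept and slope with ordinary least squares. Least-squares estimation is not covered in these notes; it is brought in here only as the usual way to estimate the two coefficients. The function name `fit_slr` and the values `ALPHA`, `BETA`, and `SIGMA` are hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_slr(x, y):
    """Least-squares estimates (intercept, slope) for y = alpha + beta * x + error."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: the fitted line passes through the point (x_bar, y_bar).
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical population values (assumptions, not taken from the notes).
ALPHA, BETA, SIGMA = 2.0, 0.5, 1.0

rng = np.random.default_rng(seed=0)
x = rng.uniform(0.0, 10.0, size=200)                 # explanatory variable
errors = rng.normal(loc=0.0, scale=SIGMA, size=200)  # errors with mean 0
y = ALPHA + BETA * x + errors                        # response variable

alpha_hat, beta_hat = fit_slr(x, y)
print(f"intercept estimate: {alpha_hat:.3f}, slope estimate: {beta_hat:.3f}")
```

With a few hundred points the estimates should land close to the hypothetical intercept and slope, though not exactly on them, because each simulated sample carries its own random error.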