# Module 3

Interpreting Scatterplots
- Form: linear, curved, clusters, no pattern
- Direction: positive, negative, no direction
- Strength: how well the points fit the form
- Outliers: points that deviate from the pattern

Correlation Coefficient: the measure of the direction and strength of a linear relationship.

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

- r is always between -1 and 1.
- Values of r close to 0 imply weak or no linear association. When X and Y are independent, knowing X tells nothing about Y.
- Watch out for outliers: a single outlier can change r substantially.
- Rule of thumb: values larger than 0.8 or smaller than -0.8 represent very strong correlation. Anything between -0.3 and 0.3 indicates weak or no association.

Least Squares Regression Analysis: describes the linear relationship and how one variable changes with the other.

$$Y = a + bX$$

Interpreting Slope: the slope b represents the increase or decrease in Y for any one-unit increase in X.

Coefficient of determination $R^2$: $R^2$ varies between 0 and 1. If it is close to 1, there is an almost perfect linear association between Y and X. Values of $R^2$ larger than 0.6 show that x is a strong predictor of y, and that the regression line on x provides a good explanation of changes in y. Values of $R^2$ less than 0.25 indicate that x is not a good predictor of y, and that a large fraction of the variation in y is not explained by changes in x.
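The quantities in this module can be computed by hand from a small data set. The sketch below is one way to do it, assuming the standard textbook relations $b = r \, s_y / s_x$, $a = \bar{y} - b\bar{x}$ and $R^2 = r^2$ for a single predictor (these relations are not stated in the notes). The function name `fit_line` and the data are hypothetical.

```python
import statistics

def fit_line(xs, ys):
    """Return (r, a, b, r_squared) for the least squares line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std. deviations
    # Correlation: sum of products of standardized values, divided by n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x        # slope: change in y per one-unit increase in x
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return r, a, b, r ** 2

if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    xs = [1, 2, 3, 4, 5]
    ys = [2.1, 3.9, 6.2, 7.8, 10.1]
    r, a, b, r2 = fit_line(xs, ys)
    print(f"r = {r:.3f}, line: y = {a:.3f} + {b:.3f} x, R^2 = {r2:.3f}")
```

With a single predictor, $R^2$ is simply the square of r, so a correlation of 0.8 corresponds to $R^2 = 0.64$.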

Example residual explanation: The residual plot of the least squares regression line for CPU usage time (TIME) on LINET shows no pattern in the residuals. The residuals are randomly scattered around the zero line in a horizontal band. The regression line appears to be a good representation of the relationship between TIME and LINET.

# Module 4

Sampling approaches use statistical techniques to produce representative samples from a large population.

Sampling plan: states the objectives of the study, the target population, the method of sampling, the set of characteristics and variables of interest, and more.

Probabilistic sampling: chooses the individuals or units in the sample using a random procedure. Eliminates selection bias. Types:
- Simple random sample: a set of individuals chosen in such a way that everyone has the same chance of being chosen.
- Stratified sampling: the population is divided into groups of similar units (strata), and a random sample is taken from each stratum.

Sampling approaches that do not use a random procedure:
- Convenience sampling: individuals are chosen based on the ease with which the data can be collected. Conclusions can't be generalized to the entire population.
- Voluntary response sampling: individuals select themselves by responding to a data collection appeal.

Observational Study vs. Experiment
- Observational study: the researcher has no control over the factors and can't assign subjects to the variables of interest. Just observe. (Hard to prove causation.)
- Experiment: the researcher is able to control the factors and assign individuals to different treatments. (Can prove cause and effect.)

Experimental Design:
- Controlled experiments: compare several treatments. One of them is often a placebo.
- Randomization: subjects are assigned randomly to each treatment. Reduces bias.
- Replication: each treatment is applied to many units to reduce chance variation.
- Double-blind process: neither the researchers nor the subjects know which treatment the subjects are in. Reduces unconscious bias.

# Module 6

Point estimate: a single value computed from a sample to estimate the value of a population parameter. Point estimates alone provide no information on the accuracy of the estimate.
Confidence Interval: a range of values that is believed to contain the value of the population parameter.

The sample mean is denoted by $\bar{x}$. The population mean is denoted by $\mu$.

Central Limit Theorem: given a sample of size n from a population with mean $\mu$ and standard deviation $\sigma$, if n is large, the sampling distribution of the sample mean is approximately normal, centered at the population mean $\mu$, with standard deviation equal to $\sigma / \sqrt{n}$. To get more accurate estimates, we should increase the sample size.

Large Sample Confidence Interval for the population mean $\mu$, where s is the sample standard deviation:

$$\bar{x} \pm z^* \frac{s}{\sqrt{n}}, \qquad z^* \approx 1.96 \text{ (95\%)}, \quad z^* \approx 2.576 \text{ (99\%)}$$

To reduce the margin of error:
- Reduce the confidence level
- Increase the sample size
- Reduce the variation s

(The sample size should be over 30.)

Example: The analysis shows with 95% confidence that the user's mean time between strokes is between 0.225 and 0.375 seconds. The observed attempt has an average time of 0.39 seconds, which lies above the 95% confidence interval. We conclude that the recent attempt cannot be explained by chance variation and therefore is an unauthorized intrusion.

# Module 7

Sampling Distribution of Sample Proportions

We use proportions when dealing with categorical data where a variable has two possible outcomes. For large samples, the sampling distribution of the sample proportion $\hat{p}$ is approximately normal with mean $p$ and standard deviation

$$\sqrt{\frac{p(1-p)}{n}}$$

In large samples, the standard deviation of the sample proportion is computed by replacing the population proportion p with the sample estimate $\hat{p}$:

$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Large Sample Confidence Interval for the population proportion p:

$$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}, \qquad z^* \approx 1.645 \text{ (90\%)}, \quad 1.96 \text{ (95\%)}, \quad 2.576 \text{ (99\%)}$$
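The large-sample confidence intervals for a mean and for a proportion can be sketched in a few lines. This sketch assumes the standard forms $\bar{x} \pm z^* s/\sqrt{n}$ and $\hat{p} \pm z^* \sqrt{\hat{p}(1-\hat{p})/n}$, and it takes $z^*$ from the standard normal distribution rather than from a printed table. The function names and the input numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_star(level):
    """Critical value z* leaving (1 - level) / 2 in each tail, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(0.5 + level / 2)

def mean_ci(x_bar, s, n, level=0.95):
    """Large-sample confidence interval for a population mean."""
    margin = z_star(level) * s / sqrt(n)
    return x_bar - margin, x_bar + margin

def proportion_ci(successes, n, level=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = successes / n
    margin = z_star(level) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    for level in (0.90, 0.95, 0.99):
        low, high = mean_ci(x_bar=0.30, s=0.30, n=64, level=level)
        print(f"{level:.0%} CI for the mean: ({low:.3f}, {high:.3f})")
    low, high = proportion_ci(successes=40, n=100)
    print(f"95% CI for the proportion: ({low:.3f}, {high:.3f})")
```

Running the loop shows the trade-off listed under margin of error: for the same data, a higher confidence level gives a wider interval.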

Residuals: vertical deviations, that is, the difference between the observed y and the y predicted by the least squares regression line. The average of the residuals is equal to zero. Patterns in a residual plot:
- 1st, Linear: residuals are randomly scattered around the zero line.
- 2nd, Problem (non-linear association): a curved pattern indicates that the relationship between y and x is curved and not linear.
- 3rd, Problem (non-constant variance): the variation in y increases as x increases (fan shaped).
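A short sketch of how residuals are obtained and why a curved pattern is a warning sign. It assumes the standard least squares formulas for the slope and intercept (not stated in the notes). The function names and the data are hypothetical.

```python
from statistics import mean

def least_squares(xs, ys):
    """Return intercept a and slope b of the least squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def residuals(xs, ys):
    """Observed y minus the y predicted by the least squares line."""
    a, b = least_squares(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

if __name__ == "__main__":
    # Hypothetical curved data (y = x^2), fitted with a straight line.
    xs = [1, 2, 3, 4, 5]
    ys = [1, 4, 9, 16, 25]
    res = residuals(xs, ys)
    print("residuals:", res)
    print("mean of residuals:", mean(res))
```

For this data the fitted line is y = -7 + 6x and the residuals are 2, -1, -2, -1, 2. They average to zero, but they are positive at both ends and negative in the middle: the curved pattern that signals a non-linear association.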

# Module 8

Test of significance: uses a probability to measure how well the data support a hypothesis.

Null and Alternative Hypotheses
- Null hypothesis ($H_0$): states that there is no difference.
- Alternative hypothesis ($H_a$): must be true if the null hypothesis is false.

Ex.: Verify that the mean response time of a call center is less than 15 minutes, that is, test $H_0: \mu = 15$ against $H_a: \mu < 15$.

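Null and Alternative Hypotheses on the population average: $H_0: \mu = \mu_0$ against one of $H_a: \mu < \mu_0$, $H_a: \mu > \mu_0$ (one-sided tests) or $H_a: \mu \neq \mu_0$ (two-sided test), where $\mu_0$ is the general mean stated in the null hypothesis.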

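Calculating a Test Statistic: How far away is the sample mean $\bar{x}$ from the general mean $\mu_0$ (the value stated in the null hypothesis)? The distance is measured in standard errors:

$$z = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

If the sample mean is 2 or 3 standard errors away from the general mean, then it is far away.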

P-Value: the probability that the test statistic is equal to or more extreme than the value obtained from the data when the null hypothesis is true. The smaller the p-value, the stronger the evidence against the null hypothesis.
- If the p-value is less than $\alpha = 0.05$, the null hypothesis is rejected at the 5% significance level. The test result is called statistically significant.
- If the p-value is less than $\alpha = 0.01$, the null hypothesis is rejected at the 1% significance level. The test result is called highly significant.
- If the p-value is larger than 0.05, the null hypothesis cannot be rejected. The test is not significant.

Finding the p-value from the standard normal table, where z* is the computed test statistic:
- One-sided test, $H_a$: $\mu$ < general mean. Find the entry in the standard normal table corresponding to z*.
- One-sided test, $H_a$: $\mu$ > general mean. Find the entry in the standard normal table corresponding to z* and compute 1 minus that value.
- Two-sided test, $H_a$: $\mu \neq$ general mean. Find the entry in the table corresponding to $-|z^*|$ and compute 2 times that value.

Example answers:
- The p-value is extremely small, so we can conclude that the data provide very strong evidence against the null hypothesis. The test is highly significant, indicating that the mean response time is significantly lower than 15 minutes.
- Note that the p-value is 0.053, which is slightly larger than 0.05, so we cannot reject the null hypothesis at the 5% significance level. However, since the p-value is less than 0.10, we could reject it at the 10% significance level. Thus the test does not provide enough evidence at the 5% level to support the claim that there is a significant change in the mean percentage of visits from search engines. More data are necessary to evaluate the test hypotheses.

Confidence Interval based decision rule: Construct a 95% confidence interval. Use the sample mean, not the general mean, to compute it.
- If the value stated in the null hypothesis is not contained in the 95% C.I., reject the null hypothesis at the 5% significance level in favor of $H_a$.
- If the value stated in the null hypothesis is contained in the 95% C.I., you cannot reject the null hypothesis at the 5% significance level.

Why? The sample data provide a range of plausible values for the population mean. If the general mean $\mu_0$ is not contained in the C.I., there is strong evidence that the population value is not equal to $\mu_0$, and therefore we can reject $H_0$.
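The table lookups and the confidence-interval rule above can be sketched in code. This is a minimal sketch, assuming the large-sample test statistic $z = (\bar{x} - \mu_0)/(s/\sqrt{n})$ and using the standard normal distribution function in place of the printed table. The function names, the `alternative` labels, and the call-center numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test(x_bar, mu0, s, n, alternative):
    """Return (z, p_value) for a large-sample test on a population mean."""
    z = (x_bar - mu0) / (s / sqrt(n))   # distance from mu0 in standard errors
    phi = NormalDist().cdf              # plays the role of the standard normal table
    if alternative == "less":           # Ha: mu < mu0
        p = phi(z)
    elif alternative == "greater":      # Ha: mu > mu0
        p = 1 - phi(z)
    elif alternative == "two-sided":    # Ha: mu != mu0
        p = 2 * phi(-abs(z))
    else:
        raise ValueError("alternative must be 'less', 'greater' or 'two-sided'")
    return z, p

def reject_by_ci(x_bar, mu0, s, n, z_star=1.96):
    """Reject H0 at the 5% level if mu0 lies outside the 95% confidence interval."""
    margin = z_star * s / sqrt(n)
    return not (x_bar - margin <= mu0 <= x_bar + margin)

if __name__ == "__main__":
    # Hypothetical sample: 64 calls, mean 14 minutes, std. deviation 4 minutes.
    z, p = z_test(x_bar=14, mu0=15, s=4, n=64, alternative="less")
    print(f"z = {z:.2f}, one-sided p-value = {p:.4f}")
    print("reject H0 by 95% C.I.:", reject_by_ci(x_bar=14, mu0=15, s=4, n=64))
```

The confidence-interval rule corresponds to the two-sided test: here the two-sided p-value is about 0.046, below 0.05, and the 95% interval (13.02, 14.98) does not contain 15, so both approaches reject the null hypothesis at the 5% level.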