An association exists between two variables if a particular value for one variable is more likely to occur with certain values of another variable. The response variable is the outcome variable on which comparisons are made (ex. income level). The explanatory variable defines the groups to be compared with respect to values on the response variable.

Positive correlation: both variables go up together. Negative correlation: one goes up as the other goes down. Positive correlation indicates positive association; negative correlation indicates negative association. Correlation always falls between -1 and 1. The closer r is to 1 in ABSOLUTE VALUE, the stronger the STRAIGHT-LINE (linear) association; the closer to 0 in absolute value, the weaker the correlation. For example, -.879 is still a stronger correlation than .750. Two variables have the same correlation no matter which is treated as the response variable. Correlation does not imply causation; this can also be expressed as association does not imply causation.

r^2 is the coefficient of determination. It is the proportion of the variability in y that is explained by the linear relationship with x, and it denotes the strength of the linear association of x and y. Ex. What proportional reduction in error do we get by using the regression line to make predictions instead of simply using the mean of y? Answer: use r^2.

Simpson's Paradox: the direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable. It is possible for the association to reverse after adjusting for a third variable.

Regression analysis: often used for observations of a quantitative response variable over time. A regression equation is often called a prediction equation. The regression line predicts the response variable y as a straight-line function of the value x of the explanatory variable. *Construct a scatterplot before finding a correlation or regression line. Regression equation: y^ = a + bx; find it with LinRegTTest. The correlation and regression line are NONRESISTANT: they are prone to distortion by outliers. Prediction errors are called RESIDUALS (residual = y - y^).
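The claims about r and r^2 can be checked numerically. The sketch below is only an illustration on made-up data (the x and y values and the helper name linear_fit are assumptions, not part of the notes): it fits y^ = a + bx by least squares and confirms that the proportional reduction in error equals r^2 and that r is the same whichever variable is treated as the response.

```python
import math

def linear_fit(xs, ys):
    """Return intercept a, slope b, and correlation r of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = mean_y - b * mean_x            # intercept
    r = sxy / math.sqrt(sxx * syy)     # correlation
    return a, b, r

# Made-up data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, r = linear_fit(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]   # y - y^

# Prediction error using the mean of y versus using the regression line
mean_y = sum(ys) / len(ys)
error_using_mean = sum((y - mean_y) ** 2 for y in ys)
error_using_line = sum(e ** 2 for e in residuals)
reduction = (error_using_mean - error_using_line) / error_using_mean

# Swapping the roles of x and y leaves the correlation unchanged
_, _, r_swapped = linear_fit(ys, xs)

print(f"y^ = {a:.3f} + {b:.3f}x, r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"proportional reduction in error = {reduction:.4f}")
```

On a TI calculator the same a, b, r, and r^2 come from LinRegTTest; the code only makes the arithmetic behind them visible.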
Extrapolation refers to using a regression line to predict y values for x values outside the observed range of data. Predictions about the future using time series data are called forecasts. Regression outliers are well removed from the trend that the rest of the data follows. When an observation has a large effect on the results of a regression analysis it is influential. For an observation to be influential, its x value must be relatively low or high compared to the rest of the data, and it must be a regression OUTLIER.

Sampling frame: the list of all subjects in the population from which the sample is taken. Sampling design is the next step after the sampling frame: the method used to select the sample. A sound sampling design can prevent sampling bias but cannot prevent response or nonresponse bias.

A lurking variable is a variable, usually unobserved (not measured in the study), that influences the association between the variables of primary interest. A lurking variable may be a common cause of both the response and explanatory variable. Lurking variables have the potential to be confounding. Confounding: when two explanatory variables are both associated with a response variable but are also associated with each other. For a LURKING variable to become a CONFOUNDING variable it must be included in the study and associated with both the response and the explanatory variable. Multiple causes (more common): when a response has several causes, it becomes difficult to study the effect of any single variable.

Randomization: randomly assigning experimental units (subjects) to the treatments helps to balance out lurking variables. ANECDOTAL EVIDENCE: comes from personal observations; not representative of the entire population. Systematic sample: take every kth subject from the sampling frame.

Completely randomized design: subjects are randomly assigned to one of the treatments. Matched pair: two observations for a particular subject, matched because they both come from the same person. Crossover design (a really good design): a matched-pairs design in which subjects cross over during the experiment from using one treatment to using another treatment.
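To see why a regression outlier with an extreme x value is influential, here is a minimal sketch on made-up data (the points and the helper name slope are assumptions chosen for illustration). Five points follow a clear downward trend; adding one point far to the right and far above the trend flips the sign of the fitted slope.

```python
def slope(xs, ys):
    """Least-squares slope b of the line y^ = a + bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up data: a perfect downward trend
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 3, 2, 1]

# One extra observation with a very high x value that is far from the trend
outlier_x, outlier_y = 20, 20

slope_without = slope(xs, ys)
slope_with = slope(xs + [outlier_x], ys + [outlier_y])

print(f"slope without the outlier: {slope_without:.3f}")
print(f"slope with the outlier:    {slope_with:.3f}")
```

This is the nonresistance of the regression line in action: a single observation changes the direction of the fitted association.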
Blocking (matching of subjects is a type of blocking): in experiments with matching, a block is a set of matched experimental units. Randomized block design: a block design with random assignment of treatments to units within blocks (to reduce possible bias, treatments are usually randomly assigned within a block).

Design of studies: the best way to collect data. REDUCE NOISE, INCREASE SIGNAL. Observational study: merely observes rather than experiments with study subjects. Some researchers use this term to refer only to studies that use available subjects (ex. a convenience sample) and not to sample surveys that randomly select people. Experimental study: assigns to each subject a treatment; subjects in an experimental study are often referred to as EXPERIMENTAL UNITS. Researchers "impose" a treatment or condition (such as exposure or nonexposure to cell phone radiation). An experiment CAN CONTROL for lurking variables, gives the strongest INFERENCE, and CAN ESTABLISH cause and effect. EU = the "thing" to which a treatment is applied. A subject does NOT have to be a person. Replication involves more than one EU per condition (treatment).

A simple random sample is often just called a random sample. RANDOM SAMPLING IS THE BEST: all subjects in the frame have an equal chance of being selected (more precisely, every possible sample of size n is equally likely). You are much more likely to get a representative sample if you let chance rather than convenience determine the sample. A random sample design is implemented by using random numbers to select n subjects from the sampling frame.

Methods of collecting sample surveys: personal interviews, telephone interviews, and self-administered questionnaires. Undercoverage: having a sampling frame that lacks representation from parts of the population. Sampling bias: results from the sampling method (ex. nonrandom sampling or undercoverage). Nonresponse bias: when some sampled subjects cannot be reached or refuse to participate.
To reduce bias, experiments should be double blind, with neither the subject nor the data collector knowing which treatment a subject was assigned. Response bias: occurs when subjects give an incorrect response (ex. lying), or when the question wording or the way the interviewer asks the questions is confusing or misleading. Convenience sample: not random; an easy and cheap way to obtain data. Volunteer sample: the MOST COMMON type of CONVENIENCE SAMPLE. It is not ideal, but it is sometimes necessary in both observational studies and experiments.

Key parts of a sample survey (the most common use of a sample survey is to estimate population percentages): 1) Identify the population of all subjects of interest. 2) Construct a sampling frame (which attempts to list all the subjects in the population). 3) Use a random sampling design, implemented using random numbers. 4) Be cautious of sampling bias due to nonrandom samples and undercoverage, and of response bias and nonresponse bias.

Stratified random sample: divides the population into separate groups, called STRATA, and then selects a simple random sample from each STRATUM. Cluster random sample: takes a simple random sample of clusters (such as city blocks), most often defined by location, and uses the subjects in the selected clusters as the sample.

Factor: a categorical explanatory variable in an experiment; its categories are the treatments. Multifactor experiment (factorial design): has at least two explanatory variables and allows you to test combinations of treatments. Experimental studies are preferable to nonexperimental studies but are not always possible.

Case-control study: an example of a retrospective study. Subjects who have a response outcome of interest (ex. cancer) serve as cases; other subjects not having that outcome serve as controls. The cases and controls are compared on an explanatory variable, such as whether they were smokers.
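The three random sampling designs can be sketched in a few lines. Everything below is hypothetical: the 60-subject sampling frame, the "urban"/"rural" strata, and the block labels are invented purely to show how random numbers select the sample in each design.

```python
import random

rng = random.Random(0)  # fixed seed so the illustration is repeatable

# Hypothetical sampling frame: 60 subjects, each tagged with a stratum and a city block
frame = [
    {"id": i, "stratum": "rural" if i % 3 == 0 else "urban", "block": i // 10}
    for i in range(60)
]

# Simple random sample: choose n subjects straight from the frame
simple_sample = rng.sample(frame, 12)

# Stratified random sample: a simple random sample from each stratum
stratified_sample = []
for stratum in ("rural", "urban"):
    members = [s for s in frame if s["stratum"] == stratum]
    stratified_sample.extend(rng.sample(members, 6))

# Cluster random sample: randomly choose whole clusters (blocks),
# then use every subject in the chosen clusters
all_blocks = sorted({s["block"] for s in frame})
chosen_blocks = rng.sample(all_blocks, 2)
cluster_sample = [s for s in frame if s["block"] in chosen_blocks]

print(len(simple_sample), len(stratified_sample), len(cluster_sample))
```

Stratifying guarantees each stratum is represented, while cluster sampling trades some precision for convenience because only the chosen blocks need to be visited.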

Census: a complete enumeration of an entire population; also, a survey that attempts to count the number of people in the population and to measure certain characteristics. Prospective studies: follow subjects into the future, tracking exposure and disease status over time. Retrospective studies: look at the subjects' past. Cohort study design: at the beginning, none of the subjects have the disease. Cross-sectional study: takes place at one point in time.

Influential observation: can strongly affect the correlation and regression equation. Contingency table: used for two categorical variables. Scatterplot: used for two quantitative variables; it displays the relationship and shows whether the association is positive or negative.

A RANDOM VARIABLE is a numerical measurement of the outcome of a random phenomenon. X refers to the variable itself; x refers to a particular value of the random variable. (Ex. X = number of heads in 3 flips defines the random variable; x = 2 represents a possible value for the random variable.) The probability distribution of a random variable specifies its possible values and their probabilities. It is the randomness of the variable that allows us to specify probabilities for the outcomes. For a probability distribution to be valid, each probability must fall between 0 and 1, and the sum of all probabilities must be 1.

Parameters: numerical summaries of a probability distribution or a population, such as the mean and SD, or a population mean or population proportion. Most are denoted by Greek letters. The mean of a probability distribution for a discrete random variable can be interpreted as the expected value of that variable: the value that can be expected as the average in a long run of observations. (It is not unusual for the expected value of a random variable to equal a number that is not a possible outcome.) The mean is a weighted average, with each value x weighted by its probability; the weighting matters when the values of x are not equally likely. SD: describes how far the random variable falls, on the average, from the mean of its distribution; the larger the SD, the greater the spread. The mean of a continuous distribution is the value of X where the graph would balance. If all possible values are equally likely, the height of the probability distribution is constant and its graph is a flat straight line.
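As a worked instance of these definitions, the sketch below uses the notes' own random variable, X = number of heads in 3 flips, and assumes a fair coin (that assumption gives the probabilities 1/8, 3/8, 3/8, 1/8). It checks that the distribution is valid and computes the mean as a probability-weighted average, along with the SD.

```python
import math

# X = number of heads in 3 flips of a fair coin (fairness is assumed here)
values = [0, 1, 2, 3]
probs = [1 / 8, 3 / 8, 3 / 8, 1 / 8]

def is_valid_distribution(probs):
    """Each probability is between 0 and 1 and they sum to 1."""
    return all(0 <= p <= 1 for p in probs) and math.isclose(sum(probs), 1.0)

# Mean (expected value): each value x weighted by its probability p(x)
mean = sum(x * p for x, p in zip(values, probs))

# SD: square root of the probability-weighted squared distance from the mean
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))
sd = math.sqrt(variance)

print(is_valid_distribution(probs), mean, round(sd, 4))
```

The expected value here is 1.5 heads, which is not a possible outcome of three flips, exactly the situation the notes describe.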
Discrete random variable: X has separate values such as 0, 1, 2, 3. The mean of a probability distribution for a discrete random variable is the sum of x*p(x) over all possible values x. Continuous random variable: can take any value in an interval, for example time, age, and size measurements like height and weight. The interval containing all possible values has probability equal to 1. In practice continuous variables are recorded as discrete values because of rounding.

Normal distribution (the most important): continuous, symmetric around the mean, bell-shaped, and characterized by its mean and standard deviation. The probability is 0.68 within 1 SD, 0.95 within 2 SDs, and 0.997 within 3 SDs of the mean. The mean and standard deviation completely describe the density curve. The z-score for a value x of a random variable is the number of SDs that x falls from the mean: z = (x - mean)/SD. A negative (positive) z-score indicates that the value is below (above) the mean. A STANDARD NORMAL DISTRIBUTION has a mean of ZERO and a standard deviation of ONE.

Probabilities for NORMAL CURVES are found using normalcdf(lower bound, upper bound, mean, standard deviation). The invNorm function is used to find the value that corresponds to a certain cumulative probability: invNorm(area under the curve to the left, mean, SD). With the default mean 0 and SD 1 it returns a z-score. Finding probabilities for OTHER normally distributed random variables: 1. State the problem in terms of the observed random variable, P(X < x). 2. Draw a picture to show the desired probability under the given normal curve. 3. Find the area under the normal curve using normalcdf(.

Conditions for a BINOMIAL DISTRIBUTION: 0) We are counting the number of successes in a fixed number n of trials. 1) Each trial has exactly two possible outcomes (binary data). 2) Each trial has the same probability of success p. 3) The trials are independent. Conditions 2 and 3 are closely linked and are usually checked together. ALWAYS CHECK TO SEE IF THE BINOMIAL CONDITIONS APPLY. Ex. with n = 17 and p = 0.6: n * p = 17 * 0.6 = 10.2 expected successes and n * (1 - p) = 17 * 0.4 = 6.8 expected failures.
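For readers without a TI calculator, the two normal functions can be approximated with Python's standard library. This is a rough equivalent, not the calculator's own implementation: the function names normalcdf and inv_norm are chosen here to mirror the calculator commands, and the example mean of 100 and SD of 15 are made up.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area_to_left, mean=0.0, sd=1.0):
    """Value with the given area to its left under the normal curve."""
    return NormalDist(mean, sd).inv_cdf(area_to_left)

def z_score(x, mean, sd):
    """Number of SDs that x falls from the mean."""
    return (x - mean) / sd

# The 68 / 95 / 99.7 rule on the standard normal curve
for k in (1, 2, 3):
    print(f"within {k} SD: {normalcdf(-k, k):.4f}")

# Made-up example: mean 100, SD 15
print(z_score(130, 100, 15))                        # SDs above the mean
print(round(normalcdf(-1e99, 130, 100, 15), 4))     # area to the left of 130
print(round(inv_norm(0.90, 100, 15), 2))            # 90th percentile
```

As on the calculator, the area given to inv_norm is always the area to the left of the point wanted, and -1e99 or 1e99 stands in for an unbounded tail.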
EX of checking binomial conditions: deal 10 cards from a shuffled deck and count the number of red cards. 1. Two categories? Yes, red card = success and black card = failure. 2. Fixed number n? Yes, n = 10. 3. Independent observations? No, cards are not replaced, so the draws are not independent. 4. Probability is the same? No, cards are not replaced, so p changes as each new card is drawn.

P(X = x) = binompdf(# of trials, probability, # of successes looking for) finds the probability of exactly x successes out of n trials. P(X <= x) = binomcdf(# of trials, probability, # of successes looking for) is the cumulative distribution function: it adds up the probabilities of all numbers of successes from 0 up to x.
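The two calculator functions are short enough to write out. The sketch below is an assumed stand-in for binompdf and binomcdf built on math.comb, plus a comparison for the card example: because the 10 cards are dealt without replacement, the exact probability comes from counting hands (the helper name cards_without_replacement is invented here), and it differs from what the binomial formula would give.

```python
from math import comb

def binompdf(n, p, x):
    """P(X = x): exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(X <= x): adds the probabilities of 0, 1, ..., x successes."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

def cards_without_replacement(red_wanted, dealt=10, red=26, black=26):
    """Exact P(red_wanted red cards) when dealing without replacement."""
    ways = comb(red, red_wanted) * comb(black, dealt - red_wanted)
    return ways / comb(red + black, dealt)

# If the binomial conditions held (n = 10, p = 0.5): exactly 5 red cards
binomial_answer = binompdf(10, 0.5, 5)

# What actually happens when cards are not replaced
exact_answer = cards_without_replacement(5)

print(f"binomial: {binomial_answer:.4f}, without replacement: {exact_answer:.4f}")

# At least x successes uses the complement: P(X >= x) = 1 - P(X <= x - 1)
at_least_3 = 1 - binomcdf(10, 0.5, 2)
```

The gap between the two answers is the practical cost of ignoring the independence condition.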

P(X >= x) = 1 - binomcdf(n, p, x-1). To find the probability of at least x successes out of n trials, written as P(X >= x), there isn't a function to do it directly. We take advantage of the complementary probability rule and realize that P(X >= x) = 1 - P(X <= x-1) (so if we did NOT get at least 3, we must have gotten at most 2), and use the cumulative distribution function (cdf).

EX (X >= 1, binomcdf): Current estimates suggest that 20% of people in the U.S. with computers have Internet. Suppose 15 people with computers were randomly and independently sampled. What is the probability that at least 1 of those sampled has Internet at home? P(X >= 1) = 1 - P(X <= 0) = 1 - binomcdf(15, .20, 0) = .9648. Part 2: Let X represent the number of people with Internet in the sample of 15. Find the probability that 10 or more of those sampled have Internet at home. 1 - binomcdf(15, .20, 9) = 0.000113. *Remember to enter 1 less than what you're looking for.

EX (another X >= x, binomcdf): If there was no racial profiling, we would not be surprised if between about 87 and 135 of the 262 drivers stopped were Black. The actual number stopped (207) is well above these values, even accounting for random variation, so the number of Black drivers stopped is too high. Answer by finding P(X >= 207): since 207 is evidence of profiling, so is any number above 207. P(X >= 207) = 1 - P(X <= 206) = 1 - binomcdf(262, .422, 206) = 1 - 1 = 0 (not exactly zero, but so close it rounds to zero). So the probability of getting 207 or more without profiling is essentially zero.

EX (X = x, binompdf): A balanced die with 4 sides is rolled 40 times. Let X = number of 3's. What are n and p? n = 40, p = 0.25. Find the mean and standard deviation of the distribution of X: μ = 40*.25 = 10, SD = sqrt(40(.25)(.75)) = 2.74. If you observe x = 0, would you be skeptical of the die? The probability that x = 0 is binompdf(40, .25, 0) = .0000101.

EX (true or false): The normal random variable X is the number of successes in n trials. Answer: False. Such a random variable would be binomial, not normal.

EX: You are given $160 and told to pick one of two wagers for an outcome based on flipping a fair coin. You win $320 if it comes up heads and lose $80 if it comes up tails. Enter 320 and 80 into L1 and .5 and .5 into L2, then run 1-Var Stats L1,L2, which gives σ = $120.

EX: Find the probability that a normal random variable takes a value greater than 1.43 SDs above the mean: P(z > 1.43) = normalcdf(1.43, 1e99, 0, 1) = .0764.

EX: Find the probability that a normal random variable assumes a value within 1.43 SDs of the mean: P(-1.43 < z < 1.43) = normalcdf(-1.43, 1.43, 0, 1) = 0.8472. The TI-83 defaults to 0, 1 for mean and SD.

Z-score EX: Find the z-score such that the interval within z standard deviations of the mean for a normal distribution contains a) 29% of the probability, b) 77% of the probability. a) .29/2 = .145; .50 + .145 = .645; invNorm(.645) = z-score of .37. b) .77/2 = .385; .50 + .385 = .885; invNorm(.885) = z-score of 1.20.

Z-score EX: Find the z-score such that the probability that X is within z standard deviations of the mean is 0.50, that is P(-z < X < z) = 0.50. Basically, what are the upper and lower bounds for the middle 50% of the area? There must be 25% below -z and 25% above z, so use invNorm(.25) = -.6745. So P(-.6745 < X < .6745) = .50.

EX: The probability that Z is < z is 0.881. What is z? invNorm(0.881) = 1.18. WHAT IF P(X > x) = 0.881, what is x? invNorm(1 - 0.881, 0, 1) = -1.18.

EX: Adult systolic blood pressure is normally distributed with μ = 120, σ = 20. What percentage of adults have systolic blood pressure less than 100? normalcdf(-1e99, 100, 120, 20) = .1587. EX (X > x): 10% of adults have systolic blood pressure above what level? P(X > x) = .10, find x: invNorm(.90, 120, 20) = 145.63. *The area entered is the area to the left of the point desired. Since we are told 10% is to the right, 90% must be to the left. What is the first quartile? P(X < x) = 0.25, find x: invNorm(0.25, 120, 20) = 106.5.

EX: Readings of blood pressure have a mean of 123 and an SD of 18. A reading above 137 is high. What is the z-score for a blood pressure reading of 137? (137 - 123)/18 = 0.78. What proportion is between 117 and 137? normalcdf(117, 137, 123, 18) = 0.4122.

EX: An index that is a standardized measure used in observing infants over time is approximately normal with a mean of 103 and an SD of 14. What proportion of children has an index of at least 125? normalcdf(125, 1e99, 103, 14) = .0580. What is the proportion of children having an index of at least 88? normalcdf(88, 1e99, 103, 14) = 0.8580. Find the index score at the 96th percentile: invNorm(.96, 103, 14) = 127.5. 4% of the population have an index score below invNorm(.04, 103, 14) = 78.49.

EX (a to d): The process of manufacturing a ball bearing results in weights that have an approximately normal distribution with mean 0.15g and standard deviation 0.003g. a. Suppose you select one ball bearing at random. What is the probability that it weighs less than 0.148g? Answer: normalcdf(-9999, 0.148, 0.15, 0.003) = .2525, so about a 25% chance (one in four). b. Suppose you select a sample of 100 ball bearings at random. What does the central limit theorem tell us about the sampling distribution of the sample mean of these 100 ball bearings? What is the probability that the sample mean is less than 0.148g? Answer: The distribution of the sample mean would be normal, with mean .15 and SE = .003/sqrt(100) = .0003. So the probability that the sample mean is less than 0.148g is given by normalcdf(-9999, 0.148, 0.15, 0.0003) = 1.315E-11, or 0.00000000001315, way small, so it won't happen. c. What is the 95th percentile of weights of ball bearings? (That is, what weight will have 95% of the distribution below it?) Answer: invNorm(.95, 0.15, 0.003) = 0.1549. d. Find the number X such that 90% of ball bearings will have weights between (0.15 - X)g and (0.15 + X)g. Answer: The central 90% would be between the 5th and 95th percentiles, so the 5th percentile is invNorm(.05, .15, .003) = .1451. To find X, we take .15 - X = 0.1451 and solve for X. This gives X = .0049.

EX part 1: If we randomly select 1 individual, what is the probability their pulse is greater than 90, with a mean of 73.67 and SD of 11.75? P(X > 90) = normalcdf(90, 1e99, 73.67, 11.75) = .0823. Part 2: If we select five individuals, what is the probability the mean of the five is greater than 90? To find the new SD take 11.75/sqrt(5) = 5.25, then use P(mean of x > 90) = normalcdf(90, 1e99, 73.67, 5.25) = .0009.

EX: The selling price of homes can be predicted using y^ = 9.1 + 76.5x, where y is the selling price in thousands of dollars and x is the size of the house in thousands of square feet. How much do you predict the house will sell for if it is 2,000 square feet? Answer: 9.1 + 76.5(2) = $162,100. How much for 3,000 square feet? Answer: 9.1 + 76.5(3) = $238,600, so for every thousand square foot increase the price increases by $76,500. The correlation is also positive, because as the square footage increases the selling price increases. One 3,000 sq. foot home sold for $300,000. Find the residual. The RESIDUAL is the difference y - y^: $300,000 - $238,600 = $61,400, so the house sold for $61,400 more than the expected price.

EX: The percent of the population in a country using cell phones can be predicted using y^ = -0.14 + 2.62x, where y is the percentage of the population that uses cell phones and x is gross domestic product (GDP, in thousands of dollars per capita). Predict cell phone use at the (i) minimum x value, x = 0.7, and (ii) maximum x value, x = 34.2, roughly the range of observed x values. Answer (i): y^ = -0.14 + 2.62(0.7) = 1.69%. Answer (ii): y^ = -0.14 + 2.62(34.2) = 89.46%. Interpretation of the slope: for every $1,000 increase in GDP, predicted cell phone usage increases by 2.62 percentage points. The country with the maximum GDP actually has 45.1% of its population using cell phones. Find the residual: y - y^ = 45.1 - 89.46 = -44.36%.

EX (tricky regression line): There is a regression line y^ = -2.7 + 0.26x for 51 observations on y = murder rate and x = percent with a college education. Find how the predicted murder rate increases as the percent with a college education increases from x = 15% to x = 40%. Answer: -2.7 + 0.26(15) = 1.2 and -2.7 + 0.26(40) = 7.7, so the predicted murder rate increases from 1.2 to 7.7. When the line is fitted with only 50 observations, y^ = 8.2 - 0.17x. Find how the predicted rate decreases as the percent with a college education increases from 15% to 40%. Answer: 8.2 - 0.17(15) = 5.65 and 8.2 - 0.17(40) = 1.4, so it decreases from 5.65 to 1.4. This makes the 51st observation a regression outlier, because it pulls the line up on the right and suggests a positive correlation when without the 51st observation it is a negative correlation.

Chapter 7. Population distribution: the probability distribution from which we take the sample. It is described by PARAMETERS, which are usually unknown. Data distribution: describes the sample data; it is the distribution described by sample statistics, ex. the sample proportion and sample mean. Sampling distribution: the probability distribution of a sample statistic, such as a sample proportion or sample mean. It tells how close a sample statistic falls to an unknown parameter.

Standard error (the standard deviation of the sampling distribution): describes how much a statistic varies from sample to sample. As the sample size increases the standard error decreases. The standard deviation of the sampling distribution of the sample proportion is the standard error of the sample proportion. Nearly all sample means fall within 3 SEs of the mean μ.

Central Limit Theorem: states that for random samples of sufficiently large size (at least about 30 is usually enough) the sampling distribution of the sample mean is approximately normal. This theorem holds true no matter the shape of the population distribution. It applies to sample proportions as well, because the sample proportion is a sample mean when the possible values are 0 and 1. The binomial distribution also becomes approximately normal as the sample size increases. A sample proportion is never exactly normal, only approximately normal for large n. If the POPULATION DISTRIBUTION is NORMAL, the sampling distribution of the mean is also normal for any sample size.

Margin of error (in percent) = 1/sqrt(n)*100. Histogram: a graph that uses bars to portray the frequencies or relative frequencies for a quantitative variable.

EX (probability distribution): For the population of people who suffer occasionally from migraine headaches, p = 0.30 is the proportion who get some relief from taking a certain medicine. For a particular subject, let x = 1 if they get relief and x = 0 if they do not. For a random sample of 50 people who suffer from migraines, state the probability distribution: for each observation the probability that the medicine helps is 0.30 and that it does not is 0.70. For a sample of n = 50, the sampling distribution of the sample proportion has mean = 0.30 and standard error = sqrt(.30(.70)/50) = .0648. This SE describes the SD of the sampling distribution.

EX: At a university, 60% of 7,400 students are female. The news takes a random sample of 50 students and surveys them. a) Find the mean and standard error of the sampling distribution of the sample proportion of females. Mean = 0.60. The standard error is se = sqrt(0.6(1 - 0.6)/50) = 0.0693. b) They report that 25 of the 50 in the sample were female. Is it "unusual" to get a proportion of 0.50 females in a sample of size 50 from a population that is 0.60 female? Answer: No, since .50 is less than 2 standard errors away from .6; in fact it is z = (.5 - .6)/0.0693 = -1.44 standard errors away.

EX: An exit poll in a recent election was conducted in order to predict the winner during the evening news, prior to the polls officially closing. 1000 people were randomly selected as they left the polls, and 53% say they voted for Cleedus Aardvark. If the truth in the population is that it is a tie (i.e. 50% support Cleedus), what is the probability that a poll of 1000 people could have a proportion of .53 or higher? Would you be willing to declare him the winner? Answer: In order to compute a probability for p^ we need to know its probability distribution. By the Central Limit Theorem it is approximately normal with a mean of 0.50 and standard error of sqrt(.50(.50)/1000) = 0.0158, so then P(p^ >= 0.53) = normalcdf(0.53, 1e99, 0.50, 0.0158) = 0.0289.

EX: The population distribution of the number of years of education for self-employed individuals in a certain region has a mean of 13.8 and an SD of 4.8. Identify the random variable. Answer: the number of years of education. The mean of the sampling distribution for sample size 36 is 13.8, with SE of 4.8/sqrt(36) = 0.8; this describes the variability of the mean for samples of size 36. The mean for sample size 144 is still 13.8. The SE is 4.8/sqrt(144) = 0.4. The mean stays the same and the SE decreases as n increases.

EX: Jan's all you can eat restaurant charges $8.80 per customer to eat at the restaurant. The restaurant's expense per customer has a distribution that is skewed to the right with a mean of $8.20 and an SD of $4. If 100 customers have the characteristics of a random sample, find the mean and standard error of the sampling distribution of the restaurant's sample mean expense per customer. MEAN = $8.20, SE = $4/sqrt(100) = $.40. Since the sample size n = 100 is large, the CLT is applicable and the sampling distribution of the sample mean is approximately a normal distribution.

EX: Suppose weekly income has a distribution that is skewed to the right with a mean of $600 and an SD of $150. They plan to randomly sample 100 farmers and use the sample mean weekly income to estimate the mean. What is the SE? SE = 150/sqrt(100) = 15. Find the probability that the sample mean is within $21 of $600: normalcdf(579, 621, 600, 15) = .8385.

EX: What is the probability that the sample mean from a randomly selected sample of 36 people will be > 200 but < 250 mg/dL? Answer: normalcdf(200, 250, 219, 50/sqrt(36)) = .9886, where 50/sqrt(36) is the SE of the sample mean.

EX (population, data, and sampling distributions, 7.2): Sunshine City was designed to attract retired people. Its current population of 55,000 residents has a mean age of 60 years with a standard deviation of 13 years. The distribution of ages is skewed to the left. A random sample of 100 residents of Sunshine City has a mean of 57.5 and an SD of about 14. Answer: The center of the population distribution is 60 and its spread is 13. The center of the data distribution is 57.5 and its spread is about 14; since the population distribution is skewed to the left, the shape of the data distribution is probably skewed to the left. The center of the sampling distribution of the sample mean is 60. The spread of the sampling distribution of the sample mean is 13/sqrt(100) = 1.3.
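The standard error calculations in these examples all follow two formulas, sd/sqrt(n) for a sample mean and sqrt(p(1 - p)/n) for a sample proportion. The sketch below recomputes the exit poll and ball bearing numbers from the notes, then runs a small simulation to illustrate the Central Limit Theorem. The simulation is an added illustration, not part of the notes: the exponential population, the seed, and the 2000 repetitions are all assumptions.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def se_mean(sd, n):
    """Standard error of a sample mean."""
    return sd / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

# Exit poll: true p = 0.50, n = 1000, probability of a sample proportion >= 0.53
se_poll = se_proportion(0.50, 1000)
p_at_least_53 = 1 - NormalDist(0.50, se_poll).cdf(0.53)

# Ball bearings: sd = 0.003g, n = 100
se_bearings = se_mean(0.003, 100)

# Simulation with a right-skewed population (exponential, mean 1 and SD 1):
# draw many samples of size 36 and look at how the sample means vary
rng = random.Random(1)
n = 36
sample_means = [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(2000)]
simulated_se = stdev(sample_means)

print(f"exit poll SE = {se_poll:.4f}, P(p^ >= 0.53) = {p_at_least_53:.4f}")
print(f"ball bearing SE = {se_bearings:.4f}")
print(f"simulated SE = {simulated_se:.4f}, formula SE = {se_mean(1.0, n):.4f}")
```

Even though the simulated population is strongly skewed, the sample means cluster around the population mean with a spread close to sd/sqrt(n), which is the behavior the theorem describes.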
