
An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.
Response variable: the outcome variable on which comparisons are made (ex. income level).
Explanatory variable: defines the groups to be compared with respect to values on the response variable.
Positive correlation: both variables go up together. Negative correlation: one goes up while the other goes down. Positive correlation indicates positive association; negative correlation indicates negative association.
Correlation (r) denotes the strength of the LINEAR association of x & y and always falls between -1 & 1. The closer r is to 1 in ABSOLUTE VALUE, the stronger the STRAIGHT-LINE association; the closer to 0 in absolute value, the weaker the correlation. Ex. -.879 is a stronger correlation than .750.
Two variables have the same correlation no matter which is treated as the response variable.
r^2 is the coefficient of determination: the proportion of the variability in y that is explained by the linear relationship with x. Ex. What proportional reduction in error do we get by using the regression line to make predictions instead of simply using the mean of y? Answer: use r^2.
Correlation does not imply causation; more generally, association does not imply causation.
Simpson's Paradox: the direction of an association between two variables can change after we include a third variable and analyze the data at separate levels of that variable. The association can even reverse after adjusting for the third variable.
Regression analysis: often used for observations of a quantitative response variable over time.
A regression equation is often called a prediction equation. The regression line predicts the response variable y as a straight-line function of the value x of the explanatory variable.
*Construct a scatterplot before finding a correlation or regression line.
Regression equation: y^ = a + bx; find a and b with LinRegTTest.
The correlation and regression line are NONRESISTANT: they are prone to distortion by outliers.
Prediction errors are called RESIDUALS (residual = y - y^).
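The regression notes above can be checked by hand with a small sketch. The data below are made up purely for illustration (they are not from these notes), and `fit_line` is a hypothetical helper name. It computes the least-squares line y^ = a + bx, the correlation r, the residuals, and r^2 as the proportional reduction in error compared with predicting the mean of y every time.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (a, b, r): intercept and slope of y^ = a + b*x, plus the correlation r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    r = sxy / sqrt(sxx * syy)   # correlation, always between -1 and 1
    return a, b, r

# Made-up illustrative data (an assumption, not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = fit_line(xs, ys)

# Residual = observed y minus predicted y^
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Proportional reduction in error: errors around the mean vs. errors around the line
mean_y = sum(ys) / len(ys)
error_using_mean = sum((y - mean_y) ** 2 for y in ys)
error_using_line = sum(e ** 2 for e in residuals)
r_squared = (error_using_mean - error_using_line) / error_using_mean

print(f"y^ = {a:.2f} + {b:.2f}x, r = {r:.3f}, r^2 = {r_squared:.3f}")
```

The proportional reduction in error computed at the end matches r squared, which is why the notes say "use r^2" for that question.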
Extrapolation: using a regression line to predict y values for x values outside the observed range of data.
Forecasts: predictions about the future using time series data.
Regression outliers are well removed from the trend that the rest of the data follows.
Influential observation: an observation that has a large effect on the results of a regression analysis. To be influential it has to be a regression OUTLIER (and typically its x value is low or high compared with the rest of the data).
Sampling Frame: the list of all subjects in the population from which the sample is taken.
Sampling Design: the next step after the sampling frame. A sound sampling design can prevent sampling bias but cannot prevent response or nonresponse bias.
Lurking variable: a variable, usually unobserved (not measured in the study), that influences the association between the variables of primary interest. A lurking variable may be a common cause of both the response and explanatory variable.
Confounding: when two explanatory variables are both associated with a response variable but are also associated with each other. Lurking variables have the potential to be confounding: for a LURKING variable to become a CONFOUNDING variable, it must be included in the study and be associated with BOTH the response and the explanatory variable.
Randomization: randomly assigning experimental units (subjects) to the treatments helps to balance out lurking variables.
ANECDOTAL EVIDENCE: comes from personal observations. Not representative of the entire population.
Systematic sample: take every kth one.
Multiple causes (more common): when an outcome has several causes, it becomes difficult to study the effect of any single variable.
Crossover Design (really good design): a matched pairs design in which subjects cross over during the experiment from using one treatment to using another treatment.
Matched pair: two observations for a particular subject; they are paired because they both come from the same person.
Completely Randomized Design: subjects are randomly assigned to one of the treatments.
Block: in experiments with matching, a set of matched experimental units (matching of subjects is a type of blocking).
Randomized block design: a block design with random assignment of treatments to units within blocks (to reduce possible bias, treatments are usually randomly assigned within a block).
Design of Studies: the best way to collect data. REDUCE NOISE, INCREASE SIGNAL.
Observational Study: merely observes rather than experiments with study subjects. Some researchers use this term to refer only to studies that use available subjects (ex. convenience sample) and not to sample surveys that randomly select people.
Experimental Study: assigns to each subject a treatment; subjects in an experimental study are often referred to as EXPERIMENTAL UNITS. Researchers impose a treatment or condition (such as exposure or nonexposure to cell phone radiation). CAN CONTROL for lurking variables, gives the strongest INFERENCE. CAN ESTABLISH cause & effect.
A simple random sample is often called a random sample. RANDOM SAMPLING IS THE BEST: all subjects in the frame have an equal chance of being selected.
Simple Random Sampling: you are much more likely to get a representative sample if you let chance rather than convenience determine the sample.
Random Sample Design: implemented by using random numbers to select n subjects from the sampling frame.
EU (experimental unit) = the thing to which treatment is applied. A subject does NOT have to be a person.
Replication: involves more than one EU per condition (treatment).
Methods of collecting sample surveys: personal interviews, telephone interviews, and self-administered questionnaires.
Undercoverage: having a sampling frame that lacks representation from parts of the population.
Sampling Bias: results from the sampling method (ex. nonrandom sampling or undercoverage).
Nonresponse Bias: when some sampled subjects cannot be reached or refuse to participate.
To Reduce Bias: experiments should be double blind, with neither the subject nor the data collector knowing which treatment a subject was assigned.
Response Bias: occurs when subjects give an incorrect response (ex. lying), or when the question wording or the way the interviewer asks the questions is confusing or misleading.
Volunteer Sample: the MOST COMMON type of CONVENIENCE SAMPLE (not ideal), however sometimes necessary in both observational studies and experiments.
Convenience Sample: not random; an easy and cheap way to obtain data.
Key Parts of a Sample Survey (the most common use of a sample survey is to estimate population percentages):
1) Identify the population of all subjects of interest.
2) Construct a sampling frame (attempts to list all the subjects in the population).
3) Use a random sampling design, implemented using random #s.
4) Be cautious of sampling bias due to nonrandom samples, undercoverage, response bias, and nonresponse bias.
Stratified Random Sample: divides the population into separate groups, called STRATA, and then selects a simple random sample from each STRATUM.
Cluster Random Sample: takes a simple random sample of clusters (such as city blocks), most often by location.
Factor: a categorical explanatory variable in an experiment; the categories are the treatments.
Experimental studies are preferable to non-experimental studies but are not always possible.
Multi-Factor Experiments (Factorial Design): have at least two explanatory variables; allow you to test for a combination of treatments.
Case-Control Study (Case-Control Design): an example of a retrospective study. Subjects who have a response outcome of interest (ex. cancer) serve as cases; other subjects not having that outcome serve as controls. The cases and controls are compared on an explanatory variable, like whether they were smokers.
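The sampling designs described in these notes (simple random, systematic, stratified) can be sketched with random numbers as follows. This is only an illustration under stated assumptions: the sampling frame of 100 numbered subjects and the two strata are hypothetical, and the fixed seed is there only to make the sketch repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable (an assumption)

# Hypothetical sampling frame: a list of every subject in the population
frame = [f"subject_{i}" for i in range(1, 101)]

# Simple random sample: every subject in the frame has an equal chance
simple_random = rng.sample(frame, 10)

# Systematic sample: random starting point, then take every kth subject
k = 10
start = rng.randrange(k)
systematic = frame[start::k]

# Stratified random sample: a simple random sample from each stratum
strata = {
    "stratum_A": frame[:50],   # hypothetical grouping
    "stratum_B": frame[50:],
}
stratified = []
for name, members in strata.items():
    stratified.extend(rng.sample(members, 5))

print(simple_random)
print(systematic)
print(stratified)
```

A cluster sample would instead draw a simple random sample of whole groups (such as city blocks) and keep every subject inside the chosen groups.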

Census: a complete enumeration of an entire population. Also a survey that attempts to count the # of people in the population and to measure certain characteristics.
Prospective Studies: follow subjects into the future, tracking exposure and disease status over time.
Retrospective Studies: look at the subjects' past.

Cohort Study Design: at the beginning, none of the subjects have the disease.
Influential Observation: can strongly affect the correlation and regression equation.
Cross-Sectional: at one point in time.
Contingency Table: used for two categorical variables.
Scatterplot: used for two quantitative variables; displays the relationship and shows whether the correlation is positive or negative.
Probability Distribution: to be valid, the sum of all probabilities must be 1, and each probability must fall between 0 & 1. The probability distribution of a random variable specifies its possible values and their probabilities. It is the randomness of the variable that allows us to specify probabilities for the outcomes.
Parameters: numerical summaries of a probability distribution or population; most are denoted by Greek letters (ex. the mean and SD, a population mean, or a population proportion).
The mean of a probability distribution for a discrete random variable can be interpreted as the expected value of that variable. It is the value that can be expected as the average in a long run of observations. (It is not unusual for the expected value of a random variable to equal a number that is not a possible outcome.)
SD: the larger the SD, the greater the spread; it describes how far the random variable falls, on average, from the mean of its distribution.
The mean for a continuous distribution is the value of X where the graph would be in balance.
The mean is called a weighted average: used when the values of x are not equally likely.
If all possible values are equally likely, then the value of the probability distribution is constant and its graph is a flat (horizontal) straight line.
A RANDOM VARIABLE is a numerical measurement of the outcome of a random phenomenon. X refers to the variable itself; x refers to a particular value of the random variable. (Ex. X = number of heads in 3 flips defines the random variable; x=2 represents a possible value for the random variable.)
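As a quick check of the expected-value idea, here is a minimal sketch using the notes' own random variable, X = number of heads in 3 flips of a fair coin. The function names are hypothetical helpers. It verifies the two validity conditions and computes the mean as a weighted average of the possible values.

```python
from math import isclose, sqrt

def is_valid_distribution(dist):
    """Each probability is between 0 and 1, and the probabilities sum to 1."""
    probs = list(dist.values())
    return all(0 <= p <= 1 for p in probs) and isclose(sum(probs), 1.0)

def mean_and_sd(dist):
    """Mean = sum of x * p(x); SD = square root of the sum of (x - mean)^2 * p(x)."""
    mu = sum(x * p for x, p in dist.items())
    variance = sum((x - mu) ** 2 * p for x, p in dist.items())
    return mu, sqrt(variance)

# X = number of heads in 3 flips of a fair coin
heads_in_3_flips = {0: 1 / 8, 1: 3 / 8, 2: 3 / 8, 3: 1 / 8}

valid = is_valid_distribution(heads_in_3_flips)
mu, sd = mean_and_sd(heads_in_3_flips)
print(valid, mu, round(sd, 3))  # True 1.5 0.866
```

The expected value 1.5 is not a possible outcome of X, which illustrates the parenthetical remark in the notes.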
Discrete Random Variable: X has separate values such as 0, 1, 2, 3. The mean of a probability distribution for a discrete random variable = the sum of x*p(x) over all possible values x.
Continuous Random Variable: can take any value in an interval, for example time, age, and size measurements like height and weight. The interval containing all possible values has probability equal to 1. In practice, continuous variables are measured in discrete values because of rounding.
Normal Distribution (most important): continuous, symmetric around the mean, bell-shaped, and characterized by its mean and standard deviation. The probability is 0.68 within 1 SD, 0.95 within 2 SDs, and 0.997 within 3 SDs of the mean.
z-score: for a value x of a random variable, the number of SDs that x falls from the mean. z = (x - mean)/SD. A negative (positive) z-score indicates that the value is below (above) the mean.
A STANDARD NORMAL DISTRIBUTION has a mean of ZERO and a standard deviation of ONE.
The mean and standard deviation completely describe the density curve.
Probabilities for NORMAL CURVES are found using normalcdf(lower bound, upper bound, mean, standard deviation).
invNorm is used to find the value (the z value when mean=0 and sd=1) that corresponds to a certain probability: invNorm(area under the curve to the left, mean, sd).
Finding probabilities for OTHER normally distributed random variables:
1. State the problem in terms of the observed random variable, P(X<x).
2. Draw a picture to show the desired probability under the given normal curve.
3. Find the area under the normal curve using normalcdf(lower bound, upper bound, mean, standard deviation).
Conditions for a BINOMIAL DISTRIBUTION:
0) Counting the # of successes in a fixed # of trials.
1) Each trial has exactly two possible outcomes.
2) Each trial has the same probability of success.
3) The trials are independent.
Conditions 2 & 3 usually go together (sampling without replacement from a small population violates both).
Ex. with n=17 and p=0.6: expected successes = n*p = 17*0.6 = 10.2; expected failures = n*(1-p) = 17*0.4 = 6.8.
ALWAYS CHECK TO SEE IF BINOMIAL CONDITIONS APPLY: 1) binary data, 2) the same probability of success for each trial, 3) independent trials.
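For readers without the calculator, the normalcdf and invNorm steps described above can be approximated in Python. This is a sketch, not the calculator's implementation: the function names simply mirror the calculator's, and `NormalDist` from the standard library (Python 3.8+) does the work.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area_left, mean=0.0, sd=1.0):
    """Value that has the given area to its LEFT under the normal curve."""
    return NormalDist(mean, sd).inv_cdf(area_left)

def z_score(x, mean, sd):
    """Number of SDs that x falls from the mean."""
    return (x - mean) / sd

# The 68 / 95 / 99.7 rule, checked on the standard normal (mean 0, SD 1)
within_1 = normalcdf(-1, 1)
within_2 = normalcdf(-2, 2)
within_3 = normalcdf(-3, 3)
print(round(within_1, 4), round(within_2, 4), round(within_3, 4))
```

As on the calculator, leaving out the mean and SD defaults to the standard normal, and a very large bound such as 1e99 stands in for infinity.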
EX of binomial conditions: Deal 10 cards from a shuffled deck and count the # of red cards.
1. Two categories? Yes, red card=success & black=failure.
2. Fixed # n? Yes, n=10.
3. Independent observations? No, cards are not replaced, so the draws are not independent.
4. Probability is the same? No, cards are not replaced, so p changes as each new card is drawn.
P(X=x) = binompdf(# of trials, probability, # of successes looking for): the probability of exactly x successes out of n trials.
P(X<=x) = binomcdf(# of trials, probability, # of successes looking for): the cumulative distribution function (adds up the probabilities of 0 successes up to x successes).
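The two calculator functions above can be sketched in a few lines of Python. The names mirror the calculator's, but this is only an illustrative reimplementation using `math.comb` (Python 3.8+); `at_least` is a hypothetical helper that applies the complement rule. The check uses X = number of heads in 3 fair coin flips.

```python
from math import comb

def binompdf(n, p, x):
    """P(X = x): probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomcdf(n, p, x):
    """P(X <= x): adds up the probabilities of 0, 1, ..., x successes."""
    return sum(binompdf(n, p, k) for k in range(x + 1))

def at_least(n, p, x):
    """P(X >= x) via the complement rule: 1 - P(X <= x - 1)."""
    return 1 - binomcdf(n, p, x - 1)

# X = number of heads in 3 flips of a fair coin
exactly_two = binompdf(3, 0.5, 2)    # 0.375
at_most_two = binomcdf(3, 0.5, 2)    # 0.875
at_least_two = at_least(3, 0.5, 2)   # 0.5
print(exactly_two, at_most_two, at_least_two)
```

Note the "one less" in `at_least`: the probability of at least 2 heads is 1 minus the probability of at most 1 head.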

P(X>=x) = 1-binomcdf(n,p,x-1): To find the probability of at least x successes out of n trials, written as P(X >= x), there isn't a function to do it directly. So we take advantage of the complementary probability rule, P(X >= x) = 1 - P(X <= x-1) (if we did NOT get at least 3, we must have gotten at most 2), and use the cumulative distribution function (cdf).
Margin of ERROR (approximate, in %, for a sample proportion) = (1/sqrt(n))*100.
The 0/1 data behind a proportion are NEVER normal; only the sampling distribution of the sample proportion becomes approximately normal as n grows.
Histogram: a graph that uses bars to portray the frequencies or relative frequencies for a quantitative variable.
Chapter 7: as the sample size increases, the standard error decreases.
Standard Error (the standard deviation of the sampling distribution): describes how much a statistic varies from sample to sample.
Population Distribution: the probability distribution from which we take the sample. It is described by PARAMETERS, which are usually unknown.
If the POPULATION DISTRIBUTION is NORMAL, the sampling distribution of the mean is also normal for any sample size.
Central Limit Theorem: for random samples of sufficiently large size (at least about 30 is usually enough), the sampling distribution of the sample mean is approximately normal. This holds no matter the shape of the population distribution. It applies to sample proportions as well, because the sample proportion is a sample mean when the possible values are 0 and 1; its sampling distribution also becomes normal as the sample size increases.
Data Distribution: describes the sample data; it is the distribution described by sample statistics, ex. sample proportion and sample mean.
Sampling Distribution: the probability distribution of a sample statistic, such as a sample proportion or sample mean. Tells how close a sample statistic falls to an unknown parameter. The standard deviation of the sampling distribution of the sample proportion is the standard error of the sample proportion.
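A small simulation can make the standard error and the Central Limit Theorem concrete. The sketch below is an illustration under assumptions: the population is an exponential distribution with mean 1 and SD 1 (chosen only because it is strongly skewed to the right), and the seed, sample size, and number of repetitions are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, stdev

def se_mean(sd, n):
    """Standard error of the sample mean: population SD / sqrt(n)."""
    return sd / sqrt(n)

def se_proportion(p, n):
    """Standard error of the sample proportion: sqrt(p(1-p)/n)."""
    return sqrt(p * (1 - p) / n)

rng = random.Random(1)   # fixed seed so the sketch is repeatable
n = 30                   # sample size
reps = 2000              # number of simulated samples

# Draw many samples from a right-skewed population (mean 1, SD 1)
sample_means = []
for _ in range(reps):
    sample = [rng.expovariate(1.0) for _ in range(n)]
    sample_means.append(sum(sample) / n)

center = mean(sample_means)       # should land near the population mean, 1
observed_se = stdev(sample_means) # should land near 1 / sqrt(30)
theoretical_se = se_mean(1.0, n)

# Share of sample means within 2 SEs of the population mean (roughly 95% if near normal)
share_within_2se = sum(abs(m - 1.0) <= 2 * theoretical_se for m in sample_means) / reps

print(round(center, 3), round(observed_se, 3), round(theoretical_se, 3), share_within_2se)
```

Even though the population is skewed, the simulated sample means cluster around the population mean with a spread close to SD/sqrt(n), which is the pattern the theorem describes.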
Almost all (about 99.7%) of the values of a sample statistic fall within 3 SEs of the mean of its (approximately normal) sampling distribution.
Ex (Population Distribution, 7.2): Sunshine City was designed to attract retired people. Its current population of 55,000 residents has a mean age of 60 years with a standard deviation of 13 years. The distribution of ages is skewed to the left. A random sample of 100 residents of Sunshine City has a mean of 57.5 and SD=14.
Answer: The center of the population distribution is 60 and its spread is 13. The center of the data distribution is 57.5 and its spread is 14; since the population distribution is skewed to the left, the shape of the data distribution is probably skewed to the left. The center of the sampling distribution of the sample mean is 60, and its spread is 13/sqrt(100)=1.3. Since the sample size n=100 is large, the CLT is applicable and the sampling distribution of the sample mean is approximately normal.
Ex: Jan's all-you-can-eat restaurant charges $8.80 per customer. The restaurant's expense per customer has a distribution that is skewed to the right with a mean of $8.20 and an SD of $4. If 100 customers have the characteristics of a random sample, find the mean and standard error of the sampling distribution of the restaurant's sample mean expense per customer.
Answer: MEAN=$8.20, SE=$4/sqrt(100)=$.40. Apply the CLT: the probability that the sample mean expense is below the $8.80 charge is normalcdf(-99999,8.8,8.2,.4)=.933.
Ex (Probability distribution): For the population of people who suffer occasionally from migraine headaches, p=0.30 is the proportion who get some relief from taking a certain medicine. For a particular subject, let x=1 if they get relief and x=0 if they do not. For a random sample of 50 people who suffer from migraines, state the probability distribution.
Answer: For each observation, the probability that the medicine helps is 0.30 and that it does not is 0.70. For the sample proportion, mean=0.30 and standard error=sqrt(.30(.70)/50)=.0648; this SE describes the SD of the sampling distribution.
EX: Find the probability that a normal random variable takes a value greater than 1.43 SDs above the mean. P(z>1.43) = normalcdf(1.43,1e99,0,1) = 0.0764
EX: Find the probability that a normal random variable assumes a value within 1.43 SDs of the mean. P(-1.43<z<1.43) = normalcdf(-1.43,1.43,0,1) = 0.8472 (the TI-83 defaults to 0,1 for mean and sd)
EX part 1: If we randomly select 1 individual, what is the probability their pulse is greater than 90, given a mean of 73.67 and SD of 11.75? P(X>90) = normalcdf(90,1e99,73.67,11.75) = 0.0823
EX part 2: If we select five individuals, what is the probability the mean of the five is greater than 90? To find the new SD (the standard error), take 11.75/sqrt(5)=5.25, then P(mean of x > 90) = normalcdf(90,1e99,73.67,5.25) = .0009
Ex P(X=x), binompdf: A balanced die with 4 sides is rolled 40 times. For the binomial distribution of X=number of 3s, what are n and p? Find the mean and standard deviation of the distribution of X. If you observe x=0, would you be skeptical of the die? Find the probability that x=0.
Answer: n=40, p=0.25, mean=40*.25=10, SD=sqrt(40(.25)(.75))=2.74. The probability that x=0 is binompdf(40,0.25,0)=.0000101, so observing x=0 would make you skeptical of the die.
EX: Adult blood pressure is normally distributed with mean=120 & SD=20. What is the first quartile? P(X<x)=0.25, find x: invNorm(0.25,120,20)=106.5
EX: WHAT IF P(Z>z)=0.881, what is z? invNorm(1-0.881,0,1) = -1.18
EX: The probability that Z is < z is 0.119. What is z? invNorm(0.119,0,1) = -1.18
Ex P(X>=1), binomcdf: Current estimates suggest that 20% of people in the U.S. with computers have Internet at home. Suppose 15 people with computers were randomly and independently sampled. What is the probability that at least 1 of those sampled has Internet at home? P(X>=1) = 1-P(X<=0) = 1-binomcdf(15,0.20,0) = 0.9648
Part 2: Let X represent the number of people with Internet in the sample of 15. Find the probability that 10 or more of those sampled have Internet at home. P(X>=10) = 1-binomcdf(15,0.20,9) = 0.000113 *remember to enter 1 less than what you're looking for.
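The two-part pulse example in this group is worth recomputing, because the only thing that changes between the parts is which standard deviation goes into the normal curve: the SD of individuals (11.75) for one person, and the standard error 11.75/sqrt(5) for the mean of five. A minimal sketch using the standard library's `NormalDist` (variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

mean_pulse = 73.67
sd_pulse = 11.75
n = 5

# Part 1: one randomly selected individual, so use the SD of individuals
p_individual = 1 - NormalDist(mean_pulse, sd_pulse).cdf(90)

# Part 2: the mean of five individuals, so use the standard error SD / sqrt(n)
se = sd_pulse / sqrt(n)
p_sample_mean = 1 - NormalDist(mean_pulse, se).cdf(90)

print(round(p_individual, 4))   # about 0.082
print(round(se, 2))             # about 5.25
print(round(p_sample_mean, 4))  # about 0.0009
```

A single pulse above 90 is uncommon but not rare, while an average above 90 across five people is far less likely, because sample means vary less than individual values.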
EX: Another P(X>=x) binomcdf example: If there was no racial profiling, we would not be surprised if between about 87 and 135 of the 262 drivers stopped were African American. The actual # stopped (207) is well above these values, so the number of African American drivers stopped is too high, even accounting for random variation. Answer with P(X>=207): since 207 is evidence of profiling, so is any # above 207. P(X>=207) = 1-P(X<=206) = 1-binomcdf(262,.422,206) = 1-1 = 0 (not exactly zero, but so close it rounds to zero). So the probability of getting 207 or more without profiling is essentially zero.
Ex P(X>x): 10% of adults have systolic blood pressure above what level? Given: adult systolic blood pressure is normally distributed with mean=120 & SD=20. P(X>x)=0.10, find x: invNorm(.90,120,20)=145.63 *The area entered is the area to the left of the point desired. Since we are told 10% is to the right, 90% must be to the left.*
Ex: Adult systolic blood pressure is normally distributed with mean=120, SD=20. What percentage of adults have systolic blood pressure less than 100? normalcdf(-1e99,100,120,20)=.1587
EX: Readings of blood pressure have a mean of 123 and an SD of 18. A reading above 137 is high. What is the z-score for a blood pressure reading of 137? (137-123)/18=0.78. What proportion is between 117 & 137? normalcdf(117,137,123,18)=0.4122
Z-score EX: Find the z-score such that the interval within z standard deviations of the mean for a normal distribution contains a) 29% of the probability, b) 77% of the probability. Start from 0.50, the area below the mean (always use .5), and add half of the central area. a) 0.29/2=.145, 0.50+.145=.645, invNorm(.645) gives a z-score of .37. b) .77/2=.385, .50+.385=.885, invNorm(.885) gives a z-score of 1.20.
Z-score ex: Find the z-score such that the probability that a standard normal variable is within z standard deviations of the mean is 0.50: P(-z<Z<z)=0.50. Basically, what are the upper and lower bounds for the middle 50% of the area?
So there must be 25% below -z and 25% above z; now use invNorm(.25)=-.6745. So P(-.6745<Z<.6745)=.50
Ex: An index that is a standardized measure used in observing infants over time is approximately normal with a mean of 103 and an SD of 14. What proportion of children has an index of at least 125? normalcdf(125,1e99,103,14)=0.0580. What proportion of children has an index of at least 88? normalcdf(88,1e99,103,14)=0.8580. Find the index score at the 96th percentile: invNorm(.96,103,14)=127.51. 4% of the population have an index score below invNorm(.04,103,14)=78.49.
Ex: You are given $160 and told to pick one of two wagers for an outcome based on flipping a fair coin. You win $320 if it comes up heads and lose $80 if it comes up tails. Enter 320 & -80 into L1 and .5 & .5 into L2, then 1-Var Stats L1,L2 gives an expected value of $120 (.5(320)+.5(-80)=120).
Ex: At a university, 60% of 7,400 students are female. The news takes a random sample of 50 students and surveys them. They report that 25 of the 50 in the sample were female. a) Find the mean and standard error of the sampling distribution of the sample proportion of females, for a sample of n=50. Answer: the mean is .60 and the standard error is se=sqrt(0.6(1-0.6)/50)=0.0693. b) Is it unusual to get a proportion of 0.50 females in a sample of size 50 from a population that is 0.60 female? Answer: no, since .50 is less than 2 standard errors away from .60; in fact it is z=(.5-.6)/0.0693=-1.44 standard errors away.
EX (true/false): The normal random variable X is the number of successes in n trials. Answer: False, such a random variable would be binomial, not normal.
EX a-d: The process of manufacturing a ball bearing results in weights that have an approximately normal distribution with mean 0.15g and standard deviation 0.003g.
a. Suppose you select one ball bearing at random. What is the probability that it weighs less than 0.148g? Answer: normalcdf(-9999, .148, .15, .003) = 0.2525, so about a 25% chance (one in four).
b. What is the 95th percentile of weights of ball bearings?
(That is, what weight will have 95% of the distribution below it?) Answer: invNorm(.95, .15, .003) = .1549
c. Find the number X such that 90% of ball bearings will have weights between (0.15-X)g and (0.15+X)g. Answer: The central 90% would be between the 5th and 95th percentiles, so the 5th percentile is invNorm(.05,.15,.003)=0.1451. To find X, we set .15-X=0.1451 and solve for X. This gives X=.0049
d. Suppose you select a sample of 100 ball bearings at random. What does the Central Limit Theorem tell us about the sampling distribution of the sample mean of these 100 ball bearings? What is the probability that the sample mean is less than 0.148g? Answer: The distribution of the sample mean would be normal, with mean .15 and SE=.003/sqrt(100)=.0003. So the probability that the sample mean is less than 0.148g is normalcdf(-9999, .148, .15, .0003)=1.315E-11, or 0.00000000001315: way small, so it won't happen.
Ex: The selling price of homes can be predicted using y^=9.1+76.5x, where y is the selling price in thousands of dollars and x is the size of the house in thousands of square feet. How much do you predict a house will sell for if it is 2,000 square feet? Answer: 9.1+76.5(2)=162.1, or $162,100. How much for 3,000 square feet? Answer: 9.1+76.5(3)=238.6, or $238,600. So for every thousand-square-foot increase, the predicted price increases by $76,500. The correlation is positive because as the square footage increases, the selling price increases. One 3,000 sq. foot home sold for $300,000. Find the residual: the RESIDUAL is the difference y-y^ = $300,000-$238,600 = $61,400, so the house sold for $61,400 more than the predicted price.
EX: The percent of the population in a country using cell phones can be predicted using y^=-0.14+2.62x, where y is the percentage of the population that uses cell phones and x is gross domestic product (GDP, in thousands of dollars per capita).
Predict cell phone use at the (i) minimum x value, 0.7, and (ii) maximum x value, 34.2. Answer (i): y^=-0.14+2.62(0.7)=1.69%. Answer (ii): y^=-0.14+2.62(34.2)=89.46%. Interpretation of the slope: for every $1,000 increase in per capita GDP, predicted cell phone usage increases by 2.62 percentage points. The country with the maximum GDP, 34.2, actually has 45.1% of its population using cell phones. Find the residual: y-y^ = 45.1-89.46 = -44.36, so actual usage is 44.36 percentage points below the prediction.
EX (tricky regression line): There is a regression line y^=-2.7+0.26x for 51 observations on y=murder rate and x=percent with a college education. Find how the predicted murder rate increases as the % with a college education increases from x=15% to x=40%, roughly the range of observed x values. Answer: -2.7+0.26(15)=1.2 & -2.7+0.26(40)=7.7, so the predicted murder rate increases from 1.2 to 7.7. When the line is fitted with only 50 observations, y^=8.2-0.17x. Find how the predicted rate decreases as the percent with a college education increases from 15% to 40%. Answer: 8.2-0.17(15)=5.65 & 8.2-0.17(40)=1.4, so it decreases from 5.65 to 1.4. This makes the 51st observation influential: it is a regression outlier that pulls the line up on the right and suggests a positive correlation, when without the 51st observation the correlation is negative.
Ex: The population distribution of # of years of education for self-employed individuals in a certain region has a mean of 13.8 and an SD of 4.8. Identify the random variable. Answer: the # of years of education. The mean of the sampling distribution for sample size 36 is 13.8. The SE is 4.8/sqrt(36)=0.8; this describes the variability of the sample mean for samples of size 36. For sample size 144 the mean is still 13.8, with SE of 4.8/sqrt(144)=0.4. The mean stays the same and the SE decreases as n increases.
EX: Suppose weekly income has a distribution that is skewed to the right with a mean of $600 and an SD of $150. The plan is to randomly sample 100 farmers and use the sample mean weekly income to estimate the population mean. What is the SE? SE=150/sqrt(100)=15. Find the probability that the sample mean is within $21 of $600.
Answer: normalcdf(579,621,600,15)=0.8385
EX: An exit poll in a recent election was conducted in order to predict the winner during the evening news, prior to the polls officially closing. 1000 people were randomly selected as they left the polls, and 53% say they voted for Cleedus Aardvark. If the truth in the population is that it is a tie (i.e. 50% support Cleedus), what is the probability that a poll of 1000 people could have a proportion of .53 or higher? Would you be willing to declare him the winner? Answer: In order to compute a probability for p^ we need to know its probability distribution. By the Central Limit Theorem it is approximately normal, with a mean equal to the population proportion p=0.50 and a standard error of sqrt(.50(.50)/1000)=0.0158, so P(p^ >= 0.53) = normalcdf(0.53,999999,0.5,0.0158) = 0.0289
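The exit poll calculation can be retraced step by step. The sketch below assumes, as the example does, that the true support is 50% and that the normal approximation applies; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

p_true = 0.50      # assumed population proportion (a tie)
n = 1000           # number of voters polled
observed = 0.53    # sample proportion reported by the poll

# Standard error of the sample proportion under the assumed p
se = sqrt(p_true * (1 - p_true) / n)

# How many standard errors the observed proportion is above p
z = (observed - p_true) / se

# Probability of a sample proportion of 0.53 or higher if the race is really tied
p_at_least_observed = 1 - NormalDist(p_true, se).cdf(observed)

print(round(se, 4), round(z, 2), round(p_at_least_observed, 4))
```

The sampling distribution is centered at the assumed population value 0.50, not at the observed 0.53; the observed value only enters as the cutoff.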
EX: What is the probability that the sample mean from a randomly selected sample of 36 people will be > 200 but < 250 mg/dL? Answer: normalcdf(200, 250, 219, 50/sqrt(36)) = .9886, where 219 is the population mean and 50/sqrt(36) is the SE of the sample mean.