Associations between Numerical Variables; Simple Linear Regression

Recall: Bivariate data can be represented in the following form:

$$(x_1, y_1), (x_2, y_2), (x_3, y_3), \ldots, (x_n, y_n)$$

Each $(x_i, y_i)$ is called an ordered pair, where

$x_i$ - the observed value of variable x from person i sampled from a population.
$y_i$ - the observed value of variable y from the same person i sampled from the population.

Some examples of data that are bivariate in nature:
1. A Statistics 213 student's first midterm mark and second midterm mark.
2. The number of years of postsecondary education an individual has and their annual income.
3. A year's inflation and interest rate.
4. The average price of oil in a month and the average price of the Canadian dollar (relative to the U.S. dollar).
5. The number of cigarettes a smoker consumes in a day and his/her life-span.

René Descartes developed the Cartesian coordinate system, a coordinate system that specifies each point uniquely in a plane by a pair of numerical coordinates, which are the signed distances from the point to two fixed perpendicular directed lines, measured in the same unit of length. Each reference line is called a coordinate axis, or just axis, of the system, and the point where they meet is its origin, usually at ordered pair (0, 0). The coordinates can also be defined as the positions of the perpendicular projections of the point onto the two axes, expressed as signed distances from the origin.

Using the Cartesian plane to graph our bivariate data produces a scatterplot. In a scatterplot, each point represents one observation. The location of the point depends on the values of the two variables.

[Figure: example scatterplot]

Recognizing Trend

The trend of an association is the general tendency of the scatterplot as you scan from left to right. Usually trends are either increasing (uphill, /) or decreasing (downhill, \), but other possibilities exist.
Increasing trends are called positive associations (or positive trends) and decreasing trends are negative associations (or negative trends).

Seeing Strength of Association

Weak associations result in a large amount of scatter in the scatterplot. A large amount of scatter means that points have a great deal of spread in the vertical direction. This vertical spread makes it somewhat harder to detect a trend. Strong associations have little vertical variation.

Identifying Shape

The simplest shape for a trend is linear. Fortunately, linear trends are quite common in nature. Linear trends always increase (or decrease) at the same rate. They are called linear because the trend can be summarized with a straight line. Nonlinear trends are more difficult to summarize than linear trends. This course does not cover nonlinear trends.

Measuring Strength of Association with Correlation

The correlation coefficient is a number that measures the strength of the linear association between two numerical variables, for example, the relationship between people's heights and weights. We can't emphasize enough that the correlation coefficient makes sense only if the trend is linear and both variables are numerical.

The correlation coefficient, represented by the letter r, is always a number between -1 and +1. Both the value and the sign (positive or negative) of r have information we can use. If the value of r is close to -1 or +1, then the association is very strong; if r is close to 0, the association is weak. If the value of the correlation coefficient is positive, then the trend is positive; if the value is negative, the trend is negative.

$$r = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{(n-1)\, s_x s_y}, \qquad -1 \le r \le 1,$$

where

$$s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \quad \text{and} \quad s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}.$$

Correlation Does Not Mean Causation!

Quite often, you'll hear someone use the correlation coefficient to support a claim of cause and effect. For example, one of the authors once read that a politician wanted to close liquor stores in a city because there was a positive correlation between the number of liquor stores in a neighborhood and the amount of crime.
As you learned, we can't form cause-and-effect conclusions from observational studies. If your data came from an observational study, it doesn't matter how strong the correlation is. Even a correlation near 1 is not enough to conclude that changing one variable (closing down liquor stores) will lead to a change in the other variable (crime rate). If you learn nothing else from this course, remember this: No matter how tempting, do not conclude that a cause-and-effect relationship between two variables exists just because there is a correlation, no matter how close to +1 or -1 that correlation might be!

Correlation is not causation! Just because things are related does not determine cause, nor does it exclude other associated variables.

Example: The following data were collected from a random sample of n = 10 patients at a medical clinic. For each patient, their total serum cholesterol and their body mass index (BMI) were recorded. The data are given below.

[Table: x = BMI and y = Total Cholesterol for the 10 patients]

A scatterplot of the data is given below. From this one can attempt to make generalizations: is there a relationship?

[Figure: scatterplot of Total Cholesterol (y) against BMI (x)]

Descriptive Statistics: BMI, TotalCholesterol

Variable          N   Mean    StDev  Minimum  Q1      Median  Q3      Maximum
TotalCholesterol  10  167.40  28.25  132.00   147.75  160.00  188.50  228.00

$$n = 10, \quad \bar{x} = 24.17, \quad \bar{y} = 167.40, \quad s_x = 4.42818, \quad s_y = 28.24675$$

The correlation coefficient is given by

$$r = \frac{\sum x_i y_i - n\,\bar{x}\,\bar{y}}{(n-1)\, s_x s_y} = \frac{41255.6 - (10)(24.17)(167.40)}{(10-1)(4.42818)(28.24675)} = \frac{795.02}{1125.74} \approx 0.707$$

(to within rounding of the summary statistics).

The value of r can be computed with MINITAB: Stat > Basic Statistics > Correlation.

Modeling Linear Trends; Simple Linear Regression

The regression line is a tool for making predictions about future observed values. It also provides us with a useful way of summarizing a linear relationship.
Specifically, simple linear regression is a statistical method that attempts to build a statistical prediction model, one that will attempt to predict the value of one variable based entirely on its historical relationship with another variable.

The variable whose value is being predicted is called the response variable, denoted by y; the variable being used as the basis for the prediction is called the predictor variable, denoted by x. In some texts, the response variable is referred to as the dependent variable y; the predictor variable is commonly referred to as the independent variable x.

Simple linear regression attempts to fit the data, or model the data, collected on both the predictor variable x and the response variable y in the following form:

$$y_i = \beta_0 + \beta_1 x_i + e_i$$

where:

$x_i$ - value of the predictor variable from subject i or element i, randomly selected from the target population.
$y_i$ - value of the response variable from subject i or element i, again randomly selected from the target population.
$\beta_0$ - y-intercept term, or starting point for y when x = 0.
$\beta_1$ - the slope term; it represents the average change in the response variable y that corresponds to a unit change in the predictor variable x.
$e_i$ - the error term.

This model is estimated, to remove the error term, by

$$\hat{y}_i = b_0 + b_1 x_i$$

where:

$$b_1 = r\,\frac{s_y}{s_x} \quad \text{and} \quad b_0 = \bar{y} - b_1 \bar{x}$$

Estimate the model that predicts a person's total serum cholesterol as a linear function of their body mass index, or BMI.

Solution: The regression line is found from

$$b_1 = r\,\frac{s_y}{s_x} \approx 4.52$$

$$b_0 = \bar{y} - b_1 \bar{x} = 167.40 - (4.52)(24.17) \approx 58.18$$

(The value 58.18 carries the unrounded slope through the calculation; substituting the rounded 4.52 gives 58.15.)

The estimate of the prediction model that attempts to predict a person's total serum cholesterol, $\hat{y}$, for a given value of their BMI, x, is

$$\hat{y} = 58.18 + 4.52x$$

In Excel (with the Analysis ToolPak add-in enabled): Data tab > Data Analysis > Regression.

MINITAB output: Stat > Regression > Regression.
Coefficients

Term      Coef         SE Coef      T-Value      P-Value
Constant  [illegible]  [illegible]  [illegible]  [illegible]
BMI       4.52         1.60         2.83         [illegible]

Note: The MINITAB estimates differ slightly from what we computed by hand, due to rounding errors in the hand computation.

An Application of the Model

Using the model, predict a person's total serum cholesterol if they have a BMI of [value illegible]. Interpret the meaning of this predicted value.

Interpret the meaning of the y-intercept term and the slope term in the context of the question.

Interpreting the Slope

The slope tells us how to compare the y-values for objects that are 1 unit apart on the x-variable. We should pay attention to whether the slope is 0 or very close to 0. When the slope is 0, we say that no (linear) relationship exists between x and y. The reason is that a 0 slope means that no matter what value of x you consider, the predicted value of y is always the same.

Interpreting the Intercept

The intercept tells us the predicted mean y-value when the x-value is 0. Quite often, this is not terribly helpful. Sometimes it is senseless. For example, the regression line to predict weight, given someone's height, tells us that if a person is 0 inches tall, then his or her predicted weight is negative 231.5 pounds!

Pitfalls to Avoid

• Don't fit linear models to nonlinear associations.
• Units: the units (of x and y respectively) that you use to calculate the regression will be required when using and understanding the regression and its output.
• Correlation is not causation.
• Beware of outliers.
• Don't extrapolate!
  o Extrapolation means that we use the regression line to make predictions beyond the range of our data. This practice can be dangerous, because although the association may have a linear shape for the range we're observing, that might not be true over a larger range.
This means that our predictions might be wrong outside the range of observed x-values.

Evaluating the Accuracy of our Regression

The Coefficient of Determination, r²

To determine "how well" our model building replicates "real life", we can compute a statistic that will determine "how good" our prediction model is. This statistic is called the coefficient of determination. The coefficient of determination is computed by squaring the correlation coefficient r. The resulting r² represents the proportion of all the variation in the response variable y that is explained by its linear relationship with the predictor variable x.

In our last example, the coefficient of determination is r² = (0.707)² = 0.4998.

An In-Context Meaning of the Coefficient of Determination, r²

49.98% of the variation in a person's total serum cholesterol (y) can be explained by its linear relationship (regression) with their BMI (x).

Using Residual Plots

A residual, or residual error, is how far an observed value of the response variable y is from what our model predicts y to be, for a given value of the predictor variable x. A residual is found by

$$\text{residual}_i = \text{error}_i = y_i - \hat{y}_i \quad (\text{observed} - \text{predicted})$$

Note: $\sum \text{error}_i = 0$

What do residual errors look like?

[Figure: regression line with the vertical residuals of each point]

This is why it can be called the least squared error line, or least squares line for short; simply put, it is the "line of best fit".

If we pivot the regression function until it is perfectly horizontal, we can visualize these residual errors more clearly as still positive or negative deviations from the predicted values.

MINITAB: Stat > Regression > Regression > Fit Regression Model.

One of our major concerns, when considering residuals, is that we don't want to see any recognizable patterns in our residuals; as with sampling, we want them to be random! Three patterns that can arise are: lines, curves, or wedges.
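The two facts stated in this section, that the residuals of a least-squares line sum to zero and that r² is the proportion of variation in y explained by the line, can be checked numerically. The following is a minimal sketch, assuming the estimates b1 = r·s_y/s_x and b0 = ȳ − b1·x̄. The five data points are made up for illustration and are not the BMI data. The identity r² = 1 − SSE/SST is the standard one for simple linear regression rather than something taken from these notes.

```python
from statistics import mean, stdev

# Hypothetical (x, y) pairs, invented for illustration only (NOT the BMI data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divisor n - 1)

# Correlation coefficient: r = (sum(x*y) - n*x_bar*y_bar) / ((n - 1)*s_x*s_y)
sum_xy = sum(a * b for a, b in zip(x, y))
r = (sum_xy - n * x_bar * y_bar) / ((n - 1) * s_x * s_y)

# Least-squares slope and intercept
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# Residual = observed - predicted, one per data point
fitted = [b0 + b1 * a for a in x]
residuals = [obs - fit for obs, fit in zip(y, fitted)]

# Proportion of variation explained: 1 - (unexplained variation / total variation)
sse = sum(e ** 2 for e in residuals)
sst = sum((obs - y_bar) ** 2 for obs in y)
r_squared = 1 - sse / sst

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, r = {r:.4f}")
print(f"sum of residuals = {sum(residuals):.2e}")
print(f"r^2 from r = {r ** 2:.4f}, from residuals = {r_squared:.4f}")
```

Both routes to r² agree, and the residual sum is zero up to floating-point rounding, which is what the note "Σ error = 0" asserts.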
[Figure: typical residual-plot patterns]

Recall our BMI example. Consider the first data point: $(x_1 = 25.9,\ y_1 = 165)$. Using our model, a person having a BMI of 25.9 would have an average/mean total serum cholesterol of

$$\hat{y}_1 = 58.18 + 4.52(25.9) = 175.25$$

The observed value of y for this person was 165, which is below 175.25. The residual corresponding to this data point is

$$e_1 = y_1 - \hat{y}_1 = \text{observed } y - \text{predicted } y = 165 - 175.25 = -10.25$$

A residual plot is a two-dimensional graph that plots the residuals, or $e_i$'s, on the y-axis, and the predicted values of y for given values of x, the $\hat{y}_i$'s, on the x-axis. A residual plot for our BMI/Total Serum Cholesterol data is provided below.

[Figure: Residuals vs. Fits plot for the BMI/Total Serum Cholesterol model]

Example: What is the relationship between a student's midterm and final exam score in an introductory statistics course, such as Statistics 213? To investigate this, a statistics professor randomly selected 10 students who completed an introductory statistics course similar to Statistics 213 in the Winter term of 2011. For each student chosen, their midterm exam mark and final exam mark (in percentage of marks earned) were observed. The raw data are provided below.

[Table: Midterm Exam Score (%) - x and Final Exam Score (%) - y for the 10 students]

(a) From the scatterplot, what can you say about the
    (i) direction of the relationship?
    (ii) strength of the relationship?
    (iii) shape: does the relationship appear linear?

(b) Find the value of the correlation coefficient.

(c) Find the value of [remainder of question illegible].
(d) Estimate the model that predicts your final exam mark in an introductory statistics class as a linear function of a student's midterm mark.

(e) Take your average (in percentage) of midterm exams #1 and #2. Using this, predict your Statistics 213 final exam mark and interpret the meaning of this predicted value.

(f) Using your answer in [reference illegible], find the [remainder of question illegible].

MINITAB output

Pearson correlation of MEPercent and FEPercent = [illegible]

Regression Analysis: FEPercent versus MEPercent

Coefficients

Term       Coef         SE Coef      T-Value  P-Value
Constant   [illegible]  [illegible]  2.78     [illegible]
MEPercent  0.667        0.210        3.18     [illegible]
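Parts (d) and (e) of an exercise like this one, together with the earlier warning against extrapolation, can be sketched in a few lines. This is only an illustrative sketch: the marks below are invented (they are not the ten students' data), and the helper names `fit_line` and `predict` are hypothetical. The slope is computed as S_xy/S_xx, which is algebraically the same as r·s_y/s_x.

```python
from statistics import mean

def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    b1 = s_xy / s_xx            # equal to r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

def predict(x_new, x_observed, b0, b1):
    """Predict y at x_new, refusing to extrapolate beyond the observed x-range."""
    lo, hi = min(x_observed), max(x_observed)
    if not lo <= x_new <= hi:
        raise ValueError(
            f"x = {x_new} is outside the observed range [{lo}, {hi}]; "
            "the linear trend may not hold there"
        )
    return b0 + b1 * x_new

# Hypothetical midterm and final exam percentages, invented for illustration.
midterm = [50.0, 60.0, 70.0, 80.0, 90.0]
final = [55.0, 60.0, 72.0, 75.0, 88.0]

b0, b1 = fit_line(midterm, final)
print(f"final_hat = {b0:.2f} + {b1:.2f} * midterm")
print(f"predicted final for a 75% midterm average: {predict(75.0, midterm, b0, b1):.2f}")
```

Raising an error for out-of-range inputs is one possible design choice; a warning would serve equally well. The point is that a predicted value is interpreted as the mean final mark among students with that midterm average, and only within the range of midterm marks actually observed.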
