hooo

© All Rights Reserved


Summary

How to look for relationships between continuous variables using correlation and regression

Functions

Introduction

This help sheet covers correlation and regression. The main difference between these approaches is the issue of causality: correlation does not examine causality and simply describes whether and how a change in one variable is related to another. For example, we might ask how CO2 levels relate to the air temperature, without implying that CO2 drives temperature changes or vice versa; we simply want to know “does temperature increase (or decrease) as CO2 increases?” We get this information from the p-value of the correlation test, which indicates how strong the evidence is that a correlation exists. We can also ask “how strong is the correlation between temperature and CO2?”, which indicates how closely they are associated (see below).

In contrast, we could also model temperature according to CO2 levels using linear regression. Here, we assume temperature responds to changes in CO2 levels. This allows us to say how much temperature increase we would expect from a given CO2 increase; in other words, we can use CO2 levels to predict temperature.

We’ll use R’s built-in cars dataset, which gives the speed (in mph) and stopping distance (dist, in feet) of cars. We’ll start by having a look at the data:
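A minimal sketch of a first look at the data, assuming they are R's built-in cars data frame (50 rows, columns speed and dist):

```r
# Assumption: the data are R's built-in cars data frame.
str(cars)    # structure: 50 observations of speed and dist
head(cars)   # first six rows

# Scatter plot of stopping distance against speed.
plot(dist ~ speed, data = cars)
```

The formula dist ~ speed puts dist on the y-axis and speed on the x-axis.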

[Figure: scatter plot of stopping distance (dist, y-axis, 0 to 120) against speed (x-axis, 5 to 25)]

Looks like stopping distance is pretty closely related to speed; let’s test this using Pearson’s correlation:
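A minimal sketch of this test, assuming the data are R's built-in cars data frame; the object name pearson is only illustrative:

```r
# Assumption: the data are R's built-in cars data frame.
# Pearson's correlation between stopping distance and speed.
pearson <- cor.test(cars$dist, cars$speed, method = "pearson")
print(pearson)

# The pieces usually reported:
pearson$statistic   # t
pearson$parameter   # degrees of freedom
pearson$p.value     # evidence for a correlation
pearson$estimate    # cor, the strength of the correlation
```

Swapping the two variables in the call gives the same result.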

Note the method="pearson" bit, which tells R to use Pearson’s correlation. Remember also that the order of dist and speed doesn’t actually matter here as we aren’t assuming any causality.

We can see that there is strong evidence for a correlation (t = 9.46, df = 48, p = 1.5 x 10^-12). We can also see that the correlation is strong and positive; distance and speed are closely related (the correlation statistic, cor, is 0.81). The strength of correlation can vary from 1 (perfectly positively correlated; dist and speed fall on a perfect straight line, with no scatter) through 0 (no linear relationship between dist and speed) to -1 (dist and speed are perfectly negatively correlated: the points again fall on a straight line, but stopping distance decreases as speed goes up, and vice versa). Remember that it is possible to get a significant weak correlation or a non-significant strong correlation; there is a difference between the evidence for the correlation and the strength of that correlation.

Let’s say we discover that one (or both) of our variables isn’t normally distributed; what do we do then? If we can’t transform the data to normalise it (help sheet 8), we need to use a non-parametric alternative. The most common option here is Spearman’s rank correlation, which ranks both variables separately and then sees if, for example, the cars with the highest speed tended to also have high stopping distances:
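A minimal sketch of the rank-based test, assuming the data are R's built-in cars data frame; the object name spearman is only illustrative:

```r
# Assumption: the data are R's built-in cars data frame.
# Spearman's rank correlation: same function, different method.
# exact = FALSE requests the approximate p-value directly; these data
# contain tied values, so an exact p-value cannot be computed anyway.
spearman <- cor.test(cars$dist, cars$speed, method = "spearman", exact = FALSE)
print(spearman)

spearman$statistic  # S
spearman$estimate   # rho, the strength of the correlation
spearman$p.value
```

Leaving out exact = FALSE gives the same numbers, with a warning about ties.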

The Spearman test also supports a correlation between speed and stopping distance (S = 3532, n = 50, p = 8.83 x 10^-14; note that we have to use the sample size n here instead of quoting the degrees of freedom, because the test doesn’t estimate any parameters). The estimate of the strength of the correlation, rho, is similar to that estimated in the Pearson correlation (0.83, indicating a strong positive correlation).

Linear Regression

We know that stopping distance is positively correlated with the speed of the car: but how much does stopping distance increase for every extra mph of speed? To predict this, we need to use linear regression:
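A minimal sketch of the fit, assuming the data are R's built-in cars data frame and that the fitted object is called model:

```r
# Assumption: the data are R's built-in cars data frame.
# The formula dist ~ speed reads "dist is modelled by speed", so unlike
# correlation the order of the two variables matters here.
model <- lm(dist ~ speed, data = cars)
print(model)

coef(model)   # "(Intercept)" is the intercept, "speed" is the gradient
```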

Our linear model (lm, assigned to the object model) predicts that for every 1 mph increase in speed, the stopping distance increases by around 3.9 feet. It also predicts a stopping distance of -17.5 feet at a speed of 0 mph, as indicated by the intercept with the y-axis, which is not a particularly sensible prediction! To make this clearer, let’s check what our modelled relationship looks like on a graph. Plot the graph from earlier, this time using ylim to extend the y-axis between -20 and 120 and xlim to extend the x-axis between 0 and 25 (see help sheet 9 for more on graphical functions). Next, use abline (a function for plotting a straight line with intercept a and gradient b) to put a line on based on our model:
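One way to draw this, sketched under the assumption that the data are R's built-in cars data frame:

```r
# Assumption: the data are R's built-in cars data frame.
model <- lm(dist ~ speed, data = cars)

# Scatter plot with extended axes so the intercept is visible.
plot(dist ~ speed, data = cars, xlim = c(0, 25), ylim = c(-20, 120))

# abline accepts a fitted lm object and draws its line; passing the two
# numbers as abline(a = intercept, b = gradient) draws the same line.
abline(model)
```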

[Figure: scatter plot of stopping distance (dist, y-axis, -20 to 120) against speed (x-axis, 0 to 25), with the fitted regression line]

We can see that our line crosses x = 0 (the y-axis) just above -20 (at -17.5 feet), and for every 10 mph increase we get an increase in stopping distance of around 40 feet (10 x 3.9 = 39 feet).

But to know if this fitted relationship is actually any good, we need to test the model to see if it explains a significant amount of variation. This is done using the summary command:
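A minimal sketch, assuming the data are R's built-in cars data frame; the name model_summary is only illustrative:

```r
# Assumption: the data are R's built-in cars data frame.
model <- lm(dist ~ speed, data = cars)
model_summary <- summary(model)
print(model_summary)

# Individual pieces of the printed output can be pulled out by name:
model_summary$fstatistic     # F value, numerator df, denominator df
model_summary$r.squared      # proportion of variation explained
model_summary$adj.r.squared
model_summary$coefficients   # estimates, standard errors, t and p values
```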

Lots of information! However, there isn’t that much here that you haven’t met before. Let’s start at the bottom first. The test statistic is the F-statistic (89.57). The p-value is 1.49 x 10^-12. Here, we have two degrees of freedom: 1 for the line and 48 for the data*. This means we would report our result like so: There was a significant increase in stopping distance with speed (F(1,48) = 89.57, p = 1.49 x 10^-12). Another way of saying this is that fitting our relationship between stopping distance and speed provides a significantly better explanation of the stopping distance than just using a mean stopping distance (intercept), with no relationship between stopping distance and speed.

How well does our model explain the variation in stopping distance? To test this, we can look at our R-squared value, which tells us what proportion of the variation in the data is explained by our model. Here, R-squared = 0.65 or 65%, so we’re doing a decent job of explaining stopping distance, but there’s still quite a bit (35%) of unexplained variation in the data. This can be seen in the scatter around the line in our graph above; if the points lined up perfectly on the line, our R-squared value would be 100%. The adjusted R-squared value accounts for the fact that the more complicated you make your model, the more of the data you can explain. We’ll be using pretty simple models though, so it doesn’t really matter which R-squared value you choose to use.
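To make the "proportion of variation explained" idea concrete, R-squared can be worked out by hand from the residuals. This is a sketch assuming the data are R's built-in cars data frame; the variable names are illustrative:

```r
# Assumption: the data are R's built-in cars data frame.
model <- lm(dist ~ speed, data = cars)

rss <- sum(residuals(model)^2)                # variation left unexplained
tss <- sum((cars$dist - mean(cars$dist))^2)   # total variation around the mean
r_squared <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors (here, 1).
n <- nrow(cars)
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 1 - 1)

c(r_squared = r_squared, adj_r_squared = adj_r_squared)
```

Both values should agree with those reported by summary for the same model.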

The table of Coefficients just tells you the same information we looked at earlier (estimates of the intercept and gradient of the model), together with information on how accurate those estimates are. You don’t need to worry about the information on the residuals too much; this just tells you a bit more about the scatter around the line.

We can use our modelled relationship to predict stopping distances based on speed. For example, our model predicts that the stopping distance at 150 mph would be as follows:

intercept + (gradient x 150) = 572.3 feet (using the unrounded coefficients), or nearly 175 metres, about one and a half football pitches! However, here we’re extrapolating beyond the range of our data, which is often ill-advised: we didn’t measure the stopping distances of cars at speeds any higher than 25 mph.
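A minimal sketch of this prediction, assuming the data are R's built-in cars data frame; the names predicted and by_hand are illustrative:

```r
# Assumption: the data are R's built-in cars data frame.
model <- lm(dist ~ speed, data = cars)

# predict needs a data frame whose column name matches the predictor.
predicted <- predict(model, newdata = data.frame(speed = 150))
predicted

# The same calculation by hand: intercept + gradient * speed.
by_hand <- coef(model)[["(Intercept)"]] + coef(model)[["speed"]] * 150
by_hand

# The range of speeds actually measured, to check for extrapolation.
range(cars$speed)
```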

Linear regression makes a number of assumptions about the data, and isn’t valid if these assumptions aren’t met; see help sheet 8 for details.

*The degrees of freedom thing is a little complex, but is to do with the way the test is being done. We’ve fitted a line with two parameters: an intercept (-17.5 feet at speed = 0 mph) and a gradient (3.9 feet for every mph increase in speed). We’re comparing this line to the null hypothesis of no relationship between stopping distance and speed; this hypothesis explains stopping distance using only a single parameter, a mean stopping distance, which doesn’t vary with speed. So our more complicated model, with df = 2 (intercept and gradient), is being compared to the simpler null model, with df = 1 (intercept, but no gradient). So our treatment degrees of freedom = 2 - 1 = 1. Since we have 50 datapoints, and have fitted two parameters (intercept and gradient), we’re left with 50 - 2 = 48 freely varying datapoints: so our error degrees of freedom = 48.
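The comparison described in this note can be made explicit by fitting both models and comparing them. This is a sketch assuming the data are R's built-in cars data frame; the object names are illustrative:

```r
# Assumption: the data are R's built-in cars data frame.
null_model <- lm(dist ~ 1, data = cars)      # mean stopping distance only
model      <- lm(dist ~ speed, data = cars)  # intercept and gradient

comparison <- anova(null_model, model)
print(comparison)

comparison$Res.Df   # residual df of each model: 50 - 1 and 50 - 2
comparison$Df       # extra parameters in the fuller model
comparison$F        # F-statistic for the comparison
```

The F-statistic and degrees of freedom from this comparison should match those at the bottom of the summary output for the regression.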
