Epidemiology and Statistics 1

Dentistry Degree

Practice 10: Linear Regression


Relationship. We say that two statistical variables X and Y are related if knowing the value of
variable X gives us some information about the value of variable Y.
Relationships are studied by means of different techniques, and the technique to use depends
on the type of variables involved. In the case of two continuous numerical variables, the
simplest technique is Linear Regression.
Linear Regression. In a Linear Regression, we have two variables, x and y. Variable x is
called the independent variable (because it can take any value) and variable y is called the
dependent variable (because its value depends on the value of x).
A linear regression consists in describing the relationship between variables x and y by means
of the following expression:
y = a + bx
where a and b are two numbers called the intercept and the slope, respectively. When we perform
a linear regression on a dataset composed of pairs of values of x and y, we are essentially
looking for the values of a and b that best describe the relationship between x and y. This
relationship is called linear because in a dispersion graph (a scatter plot of y vs x), the cloud
of points representing the data lies along a line.
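
As an illustration of how a and b can be obtained, here is a minimal sketch in Python. The handout does not name the fitting method; ordinary least squares is assumed here because it is the usual choice. The function name `fit_line` and the data are hypothetical, invented only for this example.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: joint variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(a, b)  # 1.0 2.0
```

With real data the points will not lie exactly on a line, and a and b are then estimates of the line that passes closest to the cloud of points.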
Correlation and Slope. Correlation is a measurement of the strength and the sign of the linear
relationship. It is a number denoted by r, and it can take any value between -1 and 1. If
variable y tends to increase as variable x increases, we say that the correlation is positive.
If variable y tends to decrease as variable x increases, we have negative correlation. If r = 0,
there is no linear correlation between the variables. If r = 1 or r = -1, the variables are fully
correlated and all the points lie exactly on a line.
The slope b represents the mathematical dependence between variables x and y: it is the change
in y for each one-unit increase in x. If b = 0, y does not change with x and we say that there
is no correlation.
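
The sign and size of r can be checked by hand with a short sketch. It assumes the Pearson correlation coefficient (the handout does not name the coefficient, but it is the one normally reported with a linear regression); `pearson_r` and all data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # y rises with x: r close to 1
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # y falls as x rises: r close to -1
r_none = pearson_r(xs, [1, 2, 3, 2, 1])   # rise then fall: r close to 0
print(r_up, r_down, r_none)
```

The last case shows why r = 0 only rules out a linear relationship: y clearly depends on x there, but not along a line. The slope b and r always share the same sign because both are built from the same quantity `sxy`.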
Tests. When performing Linear Regression, we have 3 important statistical tests to check the
significance of the relationship: 1) Test of the correlation coefficient (r); 2) Test of
existence of correlation (slope b); and 3) F-test of equality of variances. Each test provides
a p-value to check the relationship.
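
For reference, the usual statistics behind these three tests can be written as follows. This is the standard textbook formulation for a regression with a single independent variable, not something stated in this handout; n is the number of data pairs, SE(b) is the standard error of the slope, and MS denotes a mean square (a sum of squares divided by its degrees of freedom).

```latex
\begin{aligned}
t_r &= \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad
t_b = \frac{b}{\mathrm{SE}(b)}, \qquad
F = \frac{\mathrm{MS}_{\text{model}}}{\mathrm{MS}_{\text{residual}}} \\[4pt]
t_r &= t_b, \qquad F = t_b^{2}
\end{aligned}
```

Both t statistics have n - 2 degrees of freedom and F has 1 and n - 2. Because of the two identities in the second line, the three tests give the same p-value when there is only one independent variable.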
Prediction. Once we know that variables x and y are related and we know the values of
parameters a and b, we can predict the expected value of y for any possible value of x, and
vice versa. We only have to substitute the value of x in the previous formula (or the value of
y, in case we want to know x).

Example
In order to study the relationship between dental attrition (i.e. tooth
erosion) and age, we have measured the dental attrition index (DAI) in 24
individuals of different ages. We expect higher dental attrition as age
increases.

1. First of all, we can draw the dispersion graph of the variables Age and DAI and check
visually that the points lie roughly along a line.

2. Once we see that the relationship between the Age and the DAI can be summarized by a
straight line, we perform a Linear Regression. We have to correctly select the independent and
the dependent variables (here, Age is the independent variable and DAI is the dependent one).

3. The results of the Linear Regression include:

The formula that relates both variables (Age, DAI) and the values of the parameters
(Intercept and Slope).
The p-value of the test of existence of relationship (in which H0: b = 0).
The value of the correlation coefficient and the p-value of the test.

4. The F-test of equality of variances shows the variance explained by the model.

The sum of squares of the model shows how much of the variance of the data is explained
by the linear model.
The sum of squares of the residuals shows how much of the variance of the data remains
unexplained.
The F-value is the result of comparing the variance of the model with the variance of
the residuals. If the variance of the model is much larger than the variance of the
residuals, the linear regression model explains a significant part of the results, and a
small p-value supports the idea that a linear regression is a good way of explaining the
data.
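
The decomposition above can be sketched in a few lines of Python. The data and the function name `sums_of_squares` are hypothetical, and the line is fitted by ordinary least squares (an assumption, since the handout does not name the method). The p-value itself is not computed here because it requires the F distribution, which statistical software provides.

```python
def sums_of_squares(xs, ys):
    """Return (ss_model, ss_residual, f_value) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    fitted = [a + b * x for x in xs]
    # Variation explained by the line: fitted values around the mean of y.
    ss_model = sum((f - mean_y) ** 2 for f in fitted)
    # Variation left unexplained: observed values around the fitted line.
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    # Mean squares: 1 degree of freedom for the model, n - 2 for residuals.
    f_value = (ss_model / 1) / (ss_residual / (n - 2))
    return ss_model, ss_residual, f_value


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
ss_model, ss_residual, f_value = sums_of_squares(xs, ys)
print(round(ss_model, 2), round(ss_residual, 2), round(f_value, 2))  # 3.6 2.4 4.5
```

The two sums of squares add up to the total variation of y around its mean (3.6 + 2.4 = 6 in this example), which is why they can be read as the explained and unexplained parts.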

5. Now that we have a formula that describes the relationship, we can ask what the expected
dental attrition that corresponds to an age of 45 is. So we get:
DAI = 12.78 + 1.35 * Age = 12.78 + 1.35 * 45 = 73.53

And we can also find the age that corresponds to an attrition of 85:
DAI = 12.78 + 1.35 * Age

Age = (DAI - 12.78) / 1.35 = (85 - 12.78) / 1.35 = 53.5
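
The two calculations of step 5 can be reproduced with a short sketch. The intercept 12.78 and the slope 1.35 come from the example; the function names `predict_dai` and `age_for_dai` are hypothetical.

```python
INTERCEPT = 12.78  # a, taken from the example above
SLOPE = 1.35       # b, taken from the example above


def predict_dai(age):
    """Expected DAI for a given age: DAI = a + b * Age."""
    return INTERCEPT + SLOPE * age


def age_for_dai(dai):
    """Age corresponding to a given DAI: Age = (DAI - a) / b."""
    return (dai - INTERCEPT) / SLOPE


print(round(predict_dai(45), 2))  # 73.53
print(round(age_for_dai(85), 1))  # 53.5
```

The second function only rearranges the fitted formula, exactly as done by hand in the example.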

Exercise to deliver
In a recent study of mercury (Hg) release from filled teeth and its effects on the body, the
mercury concentration in plasma (nmol/l) was measured together with the dental surface
covered with amalgam in 18 patients. Study the relationship between these two variables.
Report the following results:
1. Dispersion graph.
2. Equation that gives the concentration of mercury in terms of the surface of teeth
exposed.
3. The estimates of the Intercept and the Slope, and their respective Standard Errors.
4. The p-values of the 3 common hypothesis tests (correlation, slope and F-test).
5. What is the expected concentration of mercury in a patient that has a surface of 500
covered with amalgam? What surface of teeth is expected to be covered if a
patient exhibits a mercury concentration of 10?

*You must deliver these exercises before next Friday at 12 pm. 1 point will be
subtracted for each day of delay in delivery. This is individual work: if I notice
group work, the mark will be divided by the total number of people I consider
to have worked together. Avoid sending the same document and explain / argue all
the exercises in your own words.