
Key Terms for Regression Diagnostics
Standardized residuals
Residuals divided by the standard error of the residuals.

Outliers
Records (or outcome values) that are distant from the rest of the data (or the predicted outcome).

Influential value
A value or record whose presence or absence makes a big difference in the regression equation.

Leverage
The degree of influence that a single record has on a regression equation.
Synonyms
hat-value

Non-normal residuals
Non-normally distributed residuals can invalidate some technical requirements of regression, but are usually not a concern in data science.

Heteroskedasticity
When some ranges of the outcome experience residuals with higher variance
(may indicate a predictor missing from the equation).

Partial residual plots
A diagnostic plot to illuminate the relationship between the outcome variable and a single predictor.
Synonyms
added variables plot
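As a rough illustration of the terms above, the following R sketch computes standardized residuals, hat-values, and partial residuals for a small linear model. The mtcars data set that ships with base R is used purely as a stand-in, and the model formula and object names (fit, diag_table, partial) are assumptions made for this example rather than anything taken from the text.

```r
# Illustrative sketch: mtcars (bundled with R) stands in for real data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

sresid <- rstandard(fit)   # standardized residuals
lev    <- hatvalues(fit)   # leverage (hat-values)

# Partial residuals: a matrix with one column per predictor (here wt and hp)
partial <- residuals(fit, type = "partial")

diag_table <- data.frame(sresid = sresid, hat = lev)

# Records with the largest absolute standardized residuals come first
print(head(diag_table[order(-abs(diag_table$sresid)), ], 5))
```

Plotting a predictor against its column of partial, for example mtcars$wt against partial[, "wt"], gives a basic partial residual plot for that predictor.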

Outliers
Generally speaking, an extreme value, also called an outlier, is one that is distant from most of the other observations. Just as outliers need to be handled for estimates of location and variability (see "Estimates of Location" on page 13), outliers can cause problems with regression models. In regression, an outlier is a record whose actual y value is distant from the predicted value. You can detect outliers by examining the standardized residual, which is the residual divided by the standard error of the residuals.

There is no statistical theory that separates outliers from nonoutliers. Rather, there are (arbitrary) rules of thumb for how distant from the bulk of the data an observation needs to be in order to be called an outlier. For example, with the boxplot, outliers are those data points that are too far above or below the box boundaries (see "Percentiles and Boxplots" on page 20), where "too far" = "more than 1.5 times the inter-quartile range." In regression, the standardized residual is the metric that is typically used to determine whether a record is classified as an outlier. Standardized residuals can be interpreted as "the number of standard errors away from the regression line."
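The two rules of thumb above can be sketched in a few lines of R. This is a minimal example on simulated data (an assumption for illustration, not the King County data), with one unusually low outcome planted at record 10. It also shows that R's rstandard function goes one step beyond the plain definition by adjusting each residual for that record's leverage.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n)
y[10] <- y[10] - 8          # plant one unusually low outcome at record 10

fit <- lm(y ~ x)
res <- residuals(fit)

# Definition from the text: residual divided by the residual standard error
simple_sresid <- res / sigma(fit)

# rstandard() also adjusts for each record's leverage h_i:
#   e_i / (sigma * sqrt(1 - h_i))
manual_sresid <- res / (sigma(fit) * sqrt(1 - hatvalues(fit)))
same <- isTRUE(all.equal(unname(manual_sresid), unname(rstandard(fit))))

# Boxplot-style rule applied to the residuals:
# flag anything more than 1.5 * IQR beyond the quartiles
q   <- unname(quantile(res, c(0.25, 0.75)))
iqr <- q[2] - q[1]
flagged <- which(res < q[1] - 1.5 * iqr | res > q[2] + 1.5 * iqr)

print(same)
print(flagged)
print(round(simple_sresid[flagged], 2))
```

For records with low leverage the two versions of the standardized residual are nearly identical, so either works as a screening metric; the cutoff used to call something an outlier remains a judgment call.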

Let's fit a regression to the King County house sales data for all sales in zip code 98105:
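# Keep only the sales in zip code 98105
house_98105 <- house[house$ZipCode == 98105,]

# Regress the adjusted sale price on the home's characteristics
lm_98105 <- lm(AdjSalePrice ~ SqFtTotLiving + SqFtLot + Bathrooms +
               Bedrooms + BldgGrade, data=house_98105)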

We extract the standardized residuals using the rstandard function and obtain the index of the smallest residual using the order function:

sresid <- rstandard(lm_98105)
idx <- order(sresid)
sresid[idx[1]]
    20431
-4.326732
The biggest overestimate from the model is more than four standard errors away from the regression line, corresponding to an overestimate of $757,753. The original data record corresponding to this outlier is as follows:
house_98105[idx[1], c('AdjSalePrice', 'SqFtTotLiving', 'SqFtLot',
                      'Bathrooms', 'Bedrooms', 'BldgGrade')]
AdjSalePrice SqFtTotLiving SqFtLot Bathrooms Bedrooms BldgGrade
       (dbl)         (dbl)   (int)     (int)    (int)     (int)
      119748          2900    7276         3        6
In this case, it appears that there is something wrong with the record: a house of that size typically sells for much more than $119,748 in that zip code. Figure 4-4 shows an excerpt from the statutory deed from this sale: it is clear that the sale involved only partial interest in the property. In this case, the outlier corresponds to a sale that is anomalous and should not be included in the regression. Outliers could also be the result of other problems, such as a "fat-finger" data entry or a mismatch of units (e.g., reporting a sale in thousands of dollars versus simply dollars).
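To make the unit-mismatch case concrete, here is a hedged sketch on simulated sales data. Everything in it is hypothetical: the data frame sales, its columns price and sqft, the planted error at record 17, and the cutoff of 3 standard errors are assumptions chosen for illustration, not values from the King County data. It flags the record with the most negative standardized residual, reports the overestimate in dollars, and refits the model without the flagged records.

```r
# Simulated sales (hypothetical data, not the King County records)
set.seed(42)
n <- 200
sqft  <- round(runif(n, 800, 4000))
price <- 50000 + 300 * sqft + rnorm(n, sd = 40000)
price[17] <- price[17] / 1000    # unit mismatch: entered in thousands of dollars
sales <- data.frame(price = price, sqft = sqft)

fit_all <- lm(price ~ sqft, data = sales)
sresid  <- rstandard(fit_all)

# Record with the most negative standardized residual (largest overestimate)
worst <- order(sresid)[1]
raw_overestimate <- unname(-residuals(fit_all)[worst])   # in dollars

# Drop records beyond an arbitrary cutoff of 3 standard errors and refit
keep <- unname(abs(sresid) <= 3)
fit_clean <- lm(price ~ sqft, data = sales[keep, ])

print(sales[worst, ])
print(round(raw_overestimate))
print(rbind(all = coef(fit_all), clean = coef(fit_clean)))
```

Comparing the two rows of coefficients shows how much a single bad record moves the fit. In practice the flagged record should be inspected, as with the deed above, before deciding whether to exclude it.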
