
One

Introduction

1.1 OVERVIEW

Statistical Learning is a set of tools for understanding data. Typically, these tools are classified into two groups:

• Supervised Learning

  – Build statistical models with inputs and outputs.
  – Predict or estimate an output based on one or more inputs.
  – e.g. predicting weight based on height.

• Unsupervised Learning

  – Learn relationships and structure from inputs alone.
  – e.g. grouping customers based on demographic information.

Here are three real-world data sets to illustrate some applications.

1.2 WAGE DATA

Income survey for males from the central Atlantic region of the United
States.
[Figure: Wage plotted against Age, Year, and Education Level (three panels, Wage on the vertical axis).]

• Inputs: Age, Year, Education Level

• Output: Wage

• We can see trends quite clearly:

  – Wage increases with Age, then decreases;
  – Wage increases slowly but steadily over time;
  – higher Education Level is associated with higher Wage.

• There is still a lot of variation in Wage when using any one of the inputs alone.

• Can we combine the inputs to predict Wage more accurately? (See the sketch below.)
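As a first taste, the three inputs can be combined in a single linear model. A minimal R sketch, assuming the Wage data set from the ISLR2 package that accompanies the textbook:

    # Minimal sketch, assuming the ISLR2 package is installed
    # (install.packages("ISLR2") if needed).
    library(ISLR2)

    # Combine all three inputs in one linear model
    fit <- lm(wage ~ age + year + education, data = Wage)
    summary(fit)  # each input's contribution, holding the others fixed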

1.3 STOCK MARKET DATA

Daily percentage changes in the Standard & Poor’s 500 (S&P) index between 2001 and 2005.

[Figure: Boxplots of the percentage change in the S&P index for Yesterday, Two Days Previous, and Three Days Previous, grouped by Today’s Direction (Down vs. Up).]

• Inputs: the past 5 days’ percentage changes.

• Output: Up or Down today.

• Direction is more “important” than magnitude!

• Classification problem: categorical (or qualitative) output.

• Only a very weak trend is visible. Is that expected? (A quick check below.)
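One quick way to probe this is to regress today’s Direction on the five lag variables. A minimal logistic-regression sketch, assuming the Smarket data set from the ISLR2 package (which contains these S&P data):

    # Minimal sketch, assuming the ISLR2 package is installed.
    library(ISLR2)

    # Logistic regression: Direction (Up/Down) on the past five days' returns
    glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5,
                   data = Smarket, family = binomial)
    summary(glm.fit)  # the weak trend shows up as weak coefficient estimates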

1.4 GENE EXPRESSION DATA

The NCI60 data set consists of 6830 gene expression measurements for
each of 64 cancer cell lines.

• Inputs: Gene expression measurements.

• No outputs.

• The goal here is to determine whether there are clusters within the 64 cancer cell lines.

• Each cell line is described by 6830 numbers. How can we visualize them?

• One way is to use Principal Component Analysis (PCA), one of many dimension reduction methods (sketched below).
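A minimal sketch of this PCA step in R, assuming the NCI60 data as packaged in ISLR2 (a list with a 64 x 6830 matrix `data` and a vector of cancer-type labels `labs`):

    library(ISLR2)

    nci.data <- NCI60$data  # 64 cell lines x 6830 expression measurements
    nci.labs <- NCI60$labs  # actual cancer types (used only for colouring)

    # PCA on the standardized measurements
    pr.out <- prcomp(nci.data, scale = TRUE)

    # Project the 64 cell lines onto the first two principal components
    plot(pr.out$x[, 1:2], col = as.numeric(as.factor(nci.labs)),
         pch = 19, xlab = "Z1", ylab = "Z2")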

[Figure: The 64 cell lines projected onto the first two principal components Z1 and Z2; two panels, the right one coloured by cancer type.]

• The plot on the left suggests there are four clusters.

• The plot on the right is labelled with the actual 14 cancer types.

1.5 FOCUS OF THIS COURSE / SEQUENCE

• Widely applicable methods across disciplines.

  – do more than linear regression!

• Understand the motivation and trade-offs.

  – not just a series of black boxes.
  – be able to select the most suitable method.

• Application to real-world problems.

  – spend a substantial amount of time implementing in R.
  – handle real-world data sets.

1.6 THE SEQUENCE ST3248/4248

• Part I gives a broad overview of common problems and popular approaches.

  – Linear regression model and its extensions, classification methods, resampling methods, regularization and model selection, principal components and clustering methods.
  – This corresponds to ISLR: Ch01-06, Ch12.

• Part II builds on the knowledge in Part I, introducing more tools.

  – Non-linear regression, non-parametric smoothing methods, tree-based methods, support vector machines, neural networks and ensemble learning.
  – This corresponds to ISLR: Ch07-10 and selected topics from ESL.
  – Group Project.

1.7 COMPARISON WITH OTHER COURSES

• There is significant overlap in topics with other modules.

• These modules are complementary to what we are doing, and we encourage interested students to take them as well.

• ST3131 Regression Analysis (required for ST4248)

  – theoretical motivation, distributional and optimality proofs, analysis of variance, sums of squares.

• CS3244 Machine Learning

  – more emphasis on feasibility and efficiency; topics related to AI.
  – Python implementation.

1.8 NOTATION AND SIMPLE MATRIX ALGEBRA

• $n$ is the number of distinct data points, or observations.

• $p$ is the number of variables (inputs).

• e.g. the Wage data set has $n = 3000$ people and $p = 12$ variables (we only saw 3 of them).

• $x_{ij}$ is the value of the $j$th variable for the $i$th observation, where $i = 1, \ldots, n$ and $j = 1, \ldots, p$.

• We let $\mathbf{X}$ be our $n \times p$ data matrix whose $(i,j)$th element is $x_{ij}$:


 
$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

• Vectors are by default represented as columns.

  – the $i$th row, corresponding to the $i$th observation, is represented by

    $$x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}$$

  – the $j$th column, corresponding to the $j$th variable, is represented by

    $$\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}$$
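In R, extracting these rows and columns is plain bracket indexing on a matrix. A small illustrative sketch (the numbers are made up):

    # A 3 x 2 data matrix X: n = 3 observations, p = 2 variables
    X <- matrix(c(1, 2,
                  3, 4,
                  5, 6), nrow = 3, byrow = TRUE)

    X[2, ]   # 2nd row: observation x_2 (returned as a plain vector)
    X[, 1]   # 1st column: variable x_1
    X[2, 1]  # the single entry x_21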

• So

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \end{pmatrix},$$

where $^T$ denotes the transpose of a matrix or vector.

• The output variable of interest is denoted by

$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

• Hence, our observed data can be written as

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}.$$

These are realizations of the random variables $(X, Y)$.

• Dimension notation:

$$a \in \mathbb{R}, \quad a \in \mathbb{R}^k, \quad a \in \mathbb{R}^n, \quad A \in \mathbb{R}^{r \times s}$$

denote a scalar, a $k$-vector, an $n$-vector, and an $r$ by $s$ matrix, respectively.

• Matrix multiplication. Let $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times s}$. Then

$$(AB)_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}$$

for $i = 1, \ldots, r$ and $j = 1, \ldots, s$.

  – e.g.

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\ 3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$$

  – note that the internal dimension $d$ must match.
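The same product can be checked directly in R, whose `%*%` operator performs matrix multiplication and stops with an error if the internal dimensions do not match:

    A <- matrix(c(1, 2,
                  3, 4), nrow = 2, byrow = TRUE)
    B <- matrix(c(5, 6,
                  7, 8), nrow = 2, byrow = TRUE)

    A %*% B
    #      [,1] [,2]
    # [1,]   19   22
    # [2,]   43   50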

1.9 GETTING THE SOFTWARE R

• R is free (as in both beer and speech).

• The website for the R project is

  http://www.r-project.org

• A mirror site can be found at

  http://cran.stat.nus.edu.sg

• Follow the links to download R for your OS.

• Optional but recommended: RStudio.

  https://www.rstudio.com

  – An integrated development environment (IDE) for R. Also free (the local Desktop edition).
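Once installed, a quick sanity check from the R (or RStudio) console:

    R.version.string  # prints the installed R version
    1 + 1             # R as a calculator; should print [1] 2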

