
One

Introduction

1.1 OVERVIEW

Statistical Learning is a set of tools for understanding data. Typically, these tools are classified into two groups:

• Supervised Learning

  – Build statistical models with inputs and outputs.
  – Predict or estimate an output based on one or more inputs.
  – e.g. predicting weight based on height.

• Unsupervised Learning

  – Learn relationships and structure from inputs alone.
  – e.g. grouping customers based on demographic information.

Here are three real-world data sets to illustrate some applications.

1.2 WAGE DATA

Income survey for males from the central Atlantic region of the United
States.
[Figure: Wage plotted against Age, Year, and Education Level (three panels, Wage on the vertical axis).]

• Inputs: Age, Year, Education Level

• Output: Wage

• We can see trends quite clearly:

  – Wage increases with Age, then decreases;
  – Wage increases slowly but steadily over time;
  – higher Education Level is associated with higher Wage.

• There is still a lot of variation in Wage when using any one of the inputs alone.

• Can we combine the inputs to predict Wage more accurately? (See the sketch below.)
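As a first taste, the three inputs can be combined in a single linear model. A minimal R sketch, assuming the Wage data set from the ISLR2 package that accompanies the textbook:

    # Minimal sketch, assuming the ISLR2 package is installed
    # (install.packages("ISLR2") if needed).
    library(ISLR2)

    # Combine all three inputs in one linear model
    fit <- lm(wage ~ age + year + education, data = Wage)
    summary(fit)  # each input's contribution, holding the others fixed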

1.3 STOCK MARKET DATA

Daily percentage changes in the Standard & Poor’s 500 (S&P) index between 2001 and 2005.

[Figure: Boxplots of the percentage change in the S&P index for Yesterday, Two Days Previous, and Three Days Previous, grouped by Today’s Direction (Down vs. Up).]

• Inputs: the past 5 days’ percentage changes.

• Output: Up or Down today.

• Direction is more “important” than magnitude!

• Classification problem: categorical (or qualitative) output.

• Only a very weak trend is visible. Is that expected? (A quick check below.)
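One quick way to probe this is to regress today’s Direction on the five lag variables. A minimal logistic-regression sketch, assuming the Smarket data set from the ISLR2 package (which contains these S&P data):

    # Minimal sketch, assuming the ISLR2 package is installed.
    library(ISLR2)

    # Logistic regression: Direction (Up/Down) on the past five days' returns
    glm.fit <- glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5,
                   data = Smarket, family = binomial)
    summary(glm.fit)  # the weak trend shows up as weak coefficient estimates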

1.4 GENE EXPRESSION DATA

The NCI60 data set consists of 6830 gene expression measurements for
each of 64 cancer cell lines.

• Inputs: Gene expression measurements.

• No outputs.

• The goal here is to determine whether there are clusters within the 64 cancer cell lines.

• Each cell line is described by 6830 numbers. How can we visualize them?

• One way is to use Principal Component Analysis (PCA), one of many dimension reduction methods (sketched below).
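A minimal sketch of this PCA step in R, assuming the NCI60 data as packaged in ISLR2 (a list with a 64 x 6830 matrix `data` and a vector of cancer-type labels `labs`):

    library(ISLR2)

    nci.data <- NCI60$data  # 64 cell lines x 6830 expression measurements
    nci.labs <- NCI60$labs  # actual cancer types (used only for colouring)

    # PCA on the standardized measurements
    pr.out <- prcomp(nci.data, scale = TRUE)

    # Project the 64 cell lines onto the first two principal components
    plot(pr.out$x[, 1:2], col = as.numeric(as.factor(nci.labs)),
         pch = 19, xlab = "Z1", ylab = "Z2")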

[Figure: The 64 cell lines projected onto the first two principal components Z1 and Z2; two panels, the right one coloured by cancer type.]

• The plot on the left suggests there are four clusters.

• The plot on the right is labelled with the actual 14 cancer types.

1.5 FOCUS OF THIS COURSE / SEQUENCE

• Widely applicable methods across disciplines.

  – do more than linear regression!

• Understand the motivation and trade-offs.

  – not just a series of black boxes.
  – be able to select the most suitable method.

• Application to real-world problems.

  – spend a substantial amount of time implementing in R.
  – handle real-world data sets.

1.6 THE SEQUENCE ST3248/4248

• Part I gives a broad overview of common problems and popular approaches.

  – Linear regression model and its extensions, classification methods, resampling methods, regularization and model selection, principal components and clustering methods.
  – This corresponds to ISLR: Ch01-06, Ch12.

• Part II builds on the knowledge in Part I, introducing more tools.

  – Non-linear regression, non-parametric smoothing methods, tree-based methods, support vector machines, neural networks and ensemble learning.
  – This corresponds to ISLR: Ch07-10 and selected topics from ESL.
  – Group Project.

1.7 COMPARISON WITH OTHER COURSES

• There is significant overlap in topics with other modules.

• These modules are complementary to what we are doing, and we encourage interested students to take them as well.

• ST3131 Regression Analysis (required for ST4248)

  – theoretical motivation, distributional and optimality proofs, analysis of variance, sums of squares.

• CS3244 Machine Learning

  – more emphasis on feasibility and efficiency; topics related to AI.
  – Python implementation.

1.8 NOTATION AND SIMPLE MATRIX ALGEBRA

• $n$ is the number of distinct data points, or observations.

• $p$ is the number of variables (inputs).

• e.g. the Wage data set has $n = 3000$ people and $p = 12$ variables (we only saw 3 of them).

• $x_{ij}$ is the value of the $j$th variable for the $i$th observation, where $i = 1, \ldots, n$ and $j = 1, \ldots, p$.

• We let $\mathbf{X}$ be our $n \times p$ data matrix whose $(i,j)$th element is $x_{ij}$:


 
$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

• Vectors are by default represented as columns.

  – the $i$th row, corresponding to the $i$th observation, is represented by

    $$x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix}$$

  – the $j$th column, corresponding to the $j$th variable, is represented by

    $$\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix}$$
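In R, extracting these rows and columns is plain bracket indexing on a matrix. A small illustrative sketch (the numbers are made up):

    # A 3 x 2 data matrix X: n = 3 observations, p = 2 variables
    X <- matrix(c(1, 2,
                  3, 4,
                  5, 6), nrow = 3, byrow = TRUE)

    X[2, ]   # 2nd row: observation x_2 (returned as a plain vector)
    X[, 1]   # 1st column: variable x_1
    X[2, 1]  # the single entry x_21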

• So

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \end{pmatrix},$$

where $^T$ denotes the transpose of a matrix or vector.

• The output variable of interest is denoted by

$$\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$$

• Hence, our observed data can be written as

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}.$$

These are realizations of the random variables $(X, Y)$.

• Dimension notation:

$$a \in \mathbb{R}, \quad a \in \mathbb{R}^k, \quad a \in \mathbb{R}^n, \quad A \in \mathbb{R}^{r \times s}$$

denote a scalar, a $k$-vector, an $n$-vector, and an $r$ by $s$ matrix, respectively.

• Matrix multiplication. Let $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times s}$. Then

$$(AB)_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}$$

for $i = 1, \ldots, r$ and $j = 1, \ldots, s$.

  – e.g.

$$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix} = \begin{pmatrix} 1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\ 3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8 \end{pmatrix} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$$

  – note that the internal dimension $d$ must match.
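The same product can be checked directly in R, whose `%*%` operator performs matrix multiplication and stops with an error if the internal dimensions do not match:

    A <- matrix(c(1, 2,
                  3, 4), nrow = 2, byrow = TRUE)
    B <- matrix(c(5, 6,
                  7, 8), nrow = 2, byrow = TRUE)

    A %*% B
    #      [,1] [,2]
    # [1,]   19   22
    # [2,]   43   50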

1.9 GETTING THE SOFTWARE R

• R is free (as in both beer and speech).

• The website for the R project is

  http://www.r-project.org

• A mirror site can be found at

  http://cran.stat.nus.edu.sg

• Follow the links to download R for your OS.

• Optional but recommended: RStudio.

  https://www.rstudio.com

  – An integrated development environment (IDE) for R. Also free (the local Desktop edition).
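Once installed, a quick sanity check from the R (or RStudio) console:

    R.version.string  # prints the installed R version
    1 + 1             # R as a calculator; should print [1] 2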

