Introduction
1.1 OVERVIEW
• Supervised Learning
• Unsupervised Learning
Here are three real-world data sets that illustrate some applications.
1.2 WAGE DATA
Income survey for males from the central Atlantic region of the United
States.
[Figure: three panels plotting Wage (y-axis, about 50 to 300) against age (20 to 80), year (2003 to 2009), and education level (1 to 5).]
• Inputs: Age, Year, Education Level
• Output: Wage
1.3 STOCK MARKET DATA
Standard & Poor’s 500 (S&P) index between 2001 and 2005.
[Figure: three panels showing the percentage change in S&P (y-axis, about −4 to 6) for the Down and Up groups.]
• Inputs: past 5 days’ percentage change.
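To make the input structure concrete, here is a minimal Python sketch of how each day could be paired with the previous five days' percentage changes. It assumes the output is the direction of the current day's move (Up or Down), as the panel labels in the figure suggest. The numbers, the function name `make_lagged_rows`, and the choice to count a zero change as Down are all hypothetical and not taken from the S&P data.

```python
# Hypothetical daily percentage changes (made-up numbers, not S&P data).
pct_change = [0.4, -0.2, 1.1, -0.6, 0.3, 0.9, -1.2, 0.5]


def make_lagged_rows(series, n_lags=5):
    """Pair each day's direction with the previous n_lags percentage changes."""
    rows = []
    for t in range(n_lags, len(series)):
        # Most recent day first: lag 1, lag 2, ..., lag n_lags.
        lags = series[t - n_lags:t][::-1]
        # Assumption: a zero change is counted as "Down".
        direction = "Up" if series[t] > 0 else "Down"
        rows.append((lags, direction))
    return rows


rows = make_lagged_rows(pct_change)
for lags, direction in rows:
    print(lags, "->", direction)
```

Each row holds five inputs and one output, so a series of length T yields T − 5 usable observations.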
1.4 GENE EXPRESSION DATA
The NCI60 data set consists of 6830 gene expression measurements for
each of 64 cancer cell lines.
• No outputs.
[Figure: two panels, each plotting Z2 (about −60 to 20) against Z1 (about −40 to 60).]
• In the plot on the right, the points are labelled with the 14 actual types of cancer.
1.5 FOCUS OF THIS COURSE / SEQUENCE
1.6 THE SEQUENCE ST3248/4248
1.7 COMPARISON WITH OTHER COURSES
1.8 NOTATION AND SIMPLE MATRIX ALGEBRA
• $n$ denotes the number of observations and $p$ the number of variables; e.g. the Wage data set has $n = 3000$ people and $p = 12$ variables (we only saw 3 of them).
• $x_{ij}$ is the value of the $j$th variable for the $i$th observation, where $i = 1, \ldots, n$ and $j = 1, \ldots, p$.
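• Vectors are by default represented as columns. The $i$th observation and the $j$th variable are written as
\[
x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{pmatrix},
\qquad
\mathbf{x}_j = \begin{pmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{pmatrix},
\]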
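• so
\[
X =
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
=
\begin{pmatrix}
x_1^T \\ x_2^T \\ \vdots \\ x_n^T
\end{pmatrix}
=
(\mathbf{x}_1 \;\; \mathbf{x}_2 \;\; \cdots \;\; \mathbf{x}_p),
\]
where $^T$ denotes the transpose of a matrix or vector.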
• The output variable of interest is denoted by
\[
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}.
\]
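As a rough illustration of this notation, the sketch below stores a small data matrix as a list of rows and pulls out an entry, an observation, a variable, and the transpose. The values and the helper names (`entry`, `row`, `column`, `transpose`) are hypothetical; indices are 1-based to match the notation, whereas Python lists are 0-based.

```python
# Hypothetical data matrix with n = 3 observations and p = 2 variables.
X = [
    [25, 2003],
    [40, 2006],
    [58, 2009],
]
y = [75.0, 110.0, 95.0]  # hypothetical outputs, one per observation

n = len(X)     # number of observations (rows)
p = len(X[0])  # number of variables (columns)


def entry(X, i, j):
    """Return x_ij using 1-based indices, as in the notation."""
    return X[i - 1][j - 1]


def row(X, i):
    """Return the i-th observation (x_i1, ..., x_ip)."""
    return list(X[i - 1])


def column(X, j):
    """Return the j-th variable (x_1j, ..., x_nj)."""
    return [r[j - 1] for r in X]


def transpose(X):
    """Swap rows and columns: an n x p matrix becomes p x n."""
    return [list(col) for col in zip(*X)]


print(n, p)
print(entry(X, 2, 1))
print(row(X, 2))
print(column(X, 2))
print(transpose(X))
```

Transposing twice returns the original matrix, and the output vector `y` has one entry per row of `X`.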
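• Dimension notation
$a \in \mathbb{R}$ (scalar), $a \in \mathbb{R}^k$ (vector of length $k$), $\mathbf{a} \in \mathbb{R}^n$ (vector of length $n$), $\mathbf{A} \in \mathbb{R}^{r \times s}$ ($r \times s$ matrix).
• Matrix multiplication: for $\mathbf{A} \in \mathbb{R}^{r \times d}$ and $\mathbf{B} \in \mathbb{R}^{d \times s}$, the $(i, j)$th element of the product is
\[
(\mathbf{A}\mathbf{B})_{ij} = \sum_{k=1}^{d} a_{ik} b_{kj}
\]
for $i = 1, \ldots, r$ and $j = 1, \ldots, s$.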
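– e.g.
\[
\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}
\begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}
=
\begin{pmatrix}
1 \times 5 + 2 \times 7 & 1 \times 6 + 2 \times 8 \\
3 \times 5 + 4 \times 7 & 3 \times 6 + 4 \times 8
\end{pmatrix}
=
\begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}
\]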
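The worked 2 × 2 product can be checked with a short Python sketch of the usual row-by-column rule. The function name `matmul` is a hypothetical helper written for this illustration, using plain nested lists rather than any particular library.

```python
def matmul(A, B):
    """Multiply an r x d matrix A by a d x s matrix B (lists of rows)."""
    r, d = len(A), len(A[0])
    if len(B) != d:
        raise ValueError("columns of A must equal rows of B")
    s = len(B[0])
    # Entry (i, j) is the sum over k of A[i][k] * B[k][j].
    return [
        [sum(A[i][k] * B[k][j] for k in range(d)) for j in range(s)]
        for i in range(r)
    ]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```

Swapping the order generally gives a different result: `matmul(B, A)` here is `[[23, 34], [31, 46]]`, so matrix multiplication is not commutative.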
1.9 GETTING THE SOFTWARE R