2.1 Intro Statistical Learning 1

9/1/2021
Data Mining
BIF 524 - CSC 498
Data is the sword of the 21st century, those who wield it well,
the Samurai. – Jonathan Rosenberg
Before we start
• Instructor: Joseph Rebehmed
• Contact: joseph.rebehmed@lau.edu.lb
• Office hours: TR, 9:00 – 11:00 AM; W, 5:30 – 7:30 PM
& by appointment (Online)
• Lecture: MWF, 9:00 – 9:50 AM
AKSOB 1003, Online via Collaborate platform
• Grading: (subject to 5% variation)
• Midterm: 30%
• Project: 25%
• Final Exam: 35%
• Participation: 10%
Textbook
https://www.statlearning.com/
Course Description
This course covers the fundamental techniques and applications
for mining data; topics include concepts from:
• Machine learning
• Statistics
• Techniques and algorithms for parametric and non-parametric
classification, clustering, and classifier assessment
• Supervised vs unsupervised learning
• Expert systems
• Graphical models
Course Description (2)
This course aims to provide a very applied overview of:
• modern non-linear methods such as:
• Generalized Additive Models,
• Decision Trees,
• Boosting, Bagging,
• Support Vector Machines
• more classical linear approaches such as:
• Logistic Regression,
• Linear Discriminant Analysis,
• K-Means Clustering,
• Nearest Neighbors.
• The course covers many cases/data sets, plus some additional
interesting applications and lab sessions.
Teaching/Learning methods
• Active learning approaches, no more passive learners
• The most important kind of learning comes from doing, not
from standing on the sidelines.
• In parallel to “Lectures”, this course makes extensive use of:
• In-class group activities
• Dialogues, discussions and sharing ideas
• Reading provided materials before class, lecture preparation
• Plenty of applications
Tips for success
• Actively participate in class
• Don’t wait until the last minute to start your assignments or to
study for an exam.
• Please communicate with me if you have any
questions/difficulties/challenges.
Additional Remarks
• Reading the textbook is a must.
• Deadlines must be respected.
• Make-ups and Incompletes: students are not automatically
entitled to make-ups; an F will be given until reasons (in writing and
within one week of absence) are presented and approved.
• Some of the exam questions will be based on class discussion
and assignments.
• No mobile phones in the classroom.
Introduction
Introduction (2)
• Statistical learning refers to a set of tools for modelling and
understanding complex datasets.
• With the explosion of “Big Data” problems, statistical learning
has become a very hot field in many scientific areas (marketing,
finance, CS, biology, etc.).
• People with statistical learning skills are in high demand.
• Many companies are using Machine Learning in different and/or
cool ways.
Pinterest – Improved Content Discovery
Twitter – Curated Timelines
IBM – Better Healthcare
Statistical Learning Problems
• Identify the risk factors for prostate cancer.
• Predict whether someone will have a heart attack based
on demographic, diet and clinical measurements.
• Customize an email spam detection system.
• Classify a tissue sample into one of several cancer
classes, based on gene expression profile.
Notation
• Let n represent the number of distinct data points, or
observations, in our sample, and p the number of variables.
• Let x_ij represent the value of the jth variable for the ith observation,
where i = 1, 2, ..., n and j = 1, 2, ..., p.
• Let X denote an n×p matrix whose (i, j)th element is x_ij.
• The input variables are typically denoted using the symbol X,
with a subscript to distinguish them. The inputs go by different
names, such as predictors, independent variables, features or
sometimes just variables.
• The output variable is often called the response or dependent
variable and is typically denoted using the symbol Y.
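The notation above can be made concrete with a small sketch. The data values below are invented purely for illustration, and the helper name x is an assumption, not something defined in the slides; it only translates the 1-based indices i and j into Python's 0-based list indexing.

```python
# Toy data set (made-up values): n = 4 observations, p = 3 variables.
X = [
    [63.0, 1.0, 120.0],   # observation 1
    [47.0, 0.0, 135.0],   # observation 2
    [55.0, 1.0, 142.0],   # observation 3
    [71.0, 0.0, 118.0],   # observation 4
]
Y = [1, 0, 1, 0]          # one response value per observation

n = len(X)       # number of observations (rows of X)
p = len(X[0])    # number of variables (columns of X)

def x(i, j):
    """Return x_ij, using the 1-based indices of the notation."""
    return X[i - 1][j - 1]

print(n, p)      # 4 3
print(x(2, 3))   # 135.0  (3rd variable of the 2nd observation)
```

Each row of X is one observation and each column is one variable, so X has n rows and p columns, matching the n×p convention.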
What is Statistical Learning?
• In ML, we have a large set of inputs X and corresponding
outputs Y but not the function f(X).
• We believe that there is a relationship between Y and at least
one of the X’s.
• The goal is to find/model the relationship as:
$$Y_i = f(X_i) + \varepsilon_i$$
• where f is some fixed but unknown function and ε is a random
error term, which is independent of X and has mean zero.
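To see what this model means in practice, here is a minimal simulation sketch. The function f, the sample size and the error standard deviation sd are all assumptions chosen for illustration; in a real problem f is unknown and only the pairs (X_i, Y_i) are observed.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def f(x):
    """Hypothetical 'true' function; in practice f is unknown."""
    return 0.1 * math.sin(2 * math.pi * x)

n = 1000
sd = 0.01  # assumed standard deviation of the error term

X = [random.uniform(0.0, 1.0) for _ in range(n)]
# Errors have mean zero and are drawn independently of X.
eps = [random.gauss(0.0, sd) for _ in range(n)]
# Each observed response is the systematic part plus the error.
Y = [f(x_i) + e_i for x_i, e_i in zip(X, eps)]

mean_eps = sum(eps) / n
print(f"average error: {mean_eps:.5f}")  # close to, but not exactly, zero
```

The sample average of the errors is only approximately zero; the mean-zero property belongs to the distribution of ε, not to any particular sample.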
Simple Example
The function f that connects the input variable to the output variable is
in general unknown. In this situation one must estimate f based on the
observed points.
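One simple way to estimate f from observed points is to assume it is a straight line and fit that line by ordinary least squares. The sketch below is only an illustration under that assumption: the slides do not say which estimator is used, and true_f, the noise level and the sample size are invented here to generate data.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def true_f(x):
    """Hypothetical f, used only to generate the data."""
    return 2.0 + 3.0 * x

n = 200
xs = [random.uniform(0.0, 1.0) for _ in range(n)]
ys = [true_f(x) + random.gauss(0.0, 0.1) for x in xs]

def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x (closed form)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx          # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept
    return b0, b1

b0_hat, b1_hat = fit_line(xs, ys)
print(f"estimated f(x) = {b0_hat:.2f} + {b1_hat:.2f} x")
```

The estimate uses only the observed (x, y) pairs; true_f appears solely to create them, which mirrors the situation where f itself is never seen.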
Different Standard Deviations
The difficulty of estimating f will depend on the standard deviation of
the ε’s.
[Figure: four scatter plots of y against x (x from 0.0 to 1.0), one per
error standard deviation: sd=0.001, sd=0.005, sd=0.01 and sd=0.03.]
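The effect of the error standard deviation can be sketched numerically. The example below assumes a hypothetical linear f (the function behind the plots is not given) and fits a least-squares line for each of the four sd values. To isolate the effect of sd, the same standard-normal draws z are rescaled for every noise level, which is a design choice of this sketch.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

TRUE_SLOPE = 0.1

def true_f(x):
    """Hypothetical linear f; the f behind the plots is not given."""
    return -0.05 + TRUE_SLOPE * x

def fit_slope(xs, ys):
    """Least-squares slope of y on x (closed form)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx

n = 50
xs = [random.uniform(0.0, 1.0) for _ in range(n)]
z = [random.gauss(0.0, 1.0) for _ in range(n)]  # shared noise pattern

sds = [0.001, 0.005, 0.01, 0.03]
errors = []
for sd in sds:
    # Same noise pattern, scaled to the requested standard deviation.
    ys = [true_f(x) + sd * z_i for x, z_i in zip(xs, z)]
    errors.append(abs(fit_slope(xs, ys) - TRUE_SLOPE))

for sd, err in zip(sds, errors):
    print(f"sd={sd}: |estimated slope - true slope| = {err:.6f}")
```

Because the least-squares slope is linear in y, the slope error here grows in proportion to sd, so noisier data give a worse estimate of f from the same number of points.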