9/1/2021

Data Mining
BIF 524 - CSC 498

Data is the sword of the 21st century, those who wield it well,
the Samurai. – Jonathan Rosenberg


Before we start

• Instructor: Joseph Rebehmed


• Contact: joseph.rebehmed@lau.edu.lb
• Office hours: TR, 9:00 – 11:00 AM; W, 5:30 – 7:30 PM
& by appointment (Online)
• Lecture: MWF, 9:00 – 9:50 AM
AKSOB 1003, Online via Collaborate platform
• Grading: (subject to 5% variation)
• Midterm: 30%
• Project: 25%
• Final Exam: 35%
• Participation: 10%

Textbook

https://www.statlearning.com/


Course Description
This course covers the fundamental techniques and applications
for mining data; topics include concepts from:

• Machine learning
• Statistics
• Techniques and algorithms for parametric and non-parametric
classification, clustering, and classifier assessment
• Supervised vs. unsupervised learning
• Expert systems
• Graphical models

Course Description (2)

This course aims to provide a very applied overview of:

• modern non-linear methods such as:
• Generalized Additive Models,
• Decision Trees,
• Boosting, Bagging,
• Support Vector Machines

• more classical linear approaches such as:


• Logistic Regression,
• Linear Discriminant Analysis,
• K-Means Clustering,
• Nearest Neighbors.

• Cover many cases/data sets in the course, plus some additional
interesting applications and lab sessions


Teaching/Learning methods

• Active learning approaches, no more passive learners


• The most important kind of learning comes from doing, not
from standing on the sidelines.

• In parallel with “Lectures”, this course makes extensive use of:
• In-class group activities
• Dialogues, discussions and sharing of ideas
• Reading the provided materials before class (lecture preparation)

• Plenty of applications

Tips for success

• Actively participate in class

• Don’t wait until the last minute to start your assignments or to
study for an exam.

• Please communicate with me if you have any
questions/difficulties/challenges.


Additional Remarks

• Reading the textbook is a must.


• Deadlines must be respected.
• Make-ups and Incompletes: students are not automatically
entitled to make-ups; an F will be given until reasons (in writing and
within one week of the absence) are presented and approved.
• Some of the exam questions will be based on class discussion
and assignments.
• No mobile phones in the classroom.

Introduction


Introduction (2)

• Statistical learning refers to a set of tools for modelling and
understanding complex datasets.

• With the explosion of “Big Data” problems, statistical learning
has become a very hot field in many scientific areas (marketing,
finance, CS, biology, etc.).

• People with statistical learning skills are in high demand.

• Many companies are using Machine Learning in different and/or
cool ways.

Pinterest – Improved Content Discovery


Twitter – Curated Timelines

IBM – Better Healthcare


Statistical Learning Problems

• Identify the risk factors for prostate cancer.

• Predict whether someone will have a heart attack based
on demographic, diet and clinical measurements.

• Customize an email spam detection system.

• Classify a tissue sample into one of several cancer
classes, based on gene expression profile.


Notation
• Use n to represent the number of distinct data points, or
observations, in our sample, and p the number of variables.
• x_ij represents the value of the jth variable for the ith observation,
where i = 1, 2, . . ., n and j = 1, 2, . . . , p.
• X denotes the n×p matrix whose (i, j)th element is x_ij.

• The input variables are typically denoted using the symbol X,
with a subscript to distinguish them. The inputs go by different
names, such as predictors, independent variables, features or
sometimes just variables.
• The output variable is often called the response or dependent
variable and is typically denoted using the symbol Y.
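
As a minimal sketch of this notation in Python (assuming NumPy; the numbers and the helper `x` are made up for illustration), the data matrix X can be held as an n×p array with one row per observation and one column per variable. Note that NumPy indexes from 0, whereas the notation above counts from 1.

```python
import numpy as np

# Hypothetical toy data: n = 4 observations (rows), p = 3 variables (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
    [5.9, 3.0, 5.1],
])

# Hypothetical response: one value of Y per observation.
y = np.array([0.2, 0.2, 1.3, 1.8])

n, p = X.shape  # number of observations, number of variables


def x(i, j):
    """Return x_ij using the 1-based indices of the notation above."""
    return X[i - 1, j - 1]


print(n, p)      # dimensions of the data matrix
print(x(3, 2))   # value of the 2nd variable for the 3rd observation
```

Here `X[:, j - 1]` would give all n values of the jth variable, and `X[i - 1, :]` all p measurements for the ith observation.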

What is Statistical Learning?

• In ML, we have a large set of inputs X and corresponding
outputs Y but not the function f(X).
• We believe that there is a relationship between Y and at least
one of the X’s.
• The goal is to find/model the relationship as:

$Y_i = f(X_i) + \varepsilon_i$

• where f is some fixed but unknown function and ε is a random
error term, which is independent of X and has mean zero.
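
To make the model concrete, the sketch below simulates data from it in Python (assuming NumPy). In a real problem f is unknown; the sine-shaped `f`, the sample size, the noise level and the seed used here are all assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def f(x):
    """Hypothetical 'true' function; in practice f is unknown."""
    return 0.1 * np.sin(2 * np.pi * x)


n = 1000
X = rng.uniform(0.0, 1.0, size=n)              # inputs
eps = rng.normal(loc=0.0, scale=0.01, size=n)  # mean-zero errors, drawn independently of X
Y = f(X) + eps                                 # observed outputs: Y_i = f(X_i) + eps_i

# The sample mean of the errors should be close to zero.
print(eps.mean())
```

In practice only `X` and `Y` are observed; `f` and `eps` are visible here only because the data were simulated.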


Simple Example

The function f that connects the input variable to the output variable is
in general unknown. In this situation one must estimate f based on the
observed points.

Different Standard Deviations


The difficulty of estimating f will depend on the standard deviation of
the ε’s.

[Figure: four scatter plots of y against x, with x from 0.0 to 1.0, one panel each for sd=0.001, sd=0.005, sd=0.01 and sd=0.03]
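
The effect of the error standard deviation can be sketched numerically in Python (assuming NumPy). This is not the code behind the figure: the linear `f`, the sample size and the seed are assumptions. The same standard-normal draws are rescaled for each sd so that only the spread of the errors changes between runs, and f is estimated each time by a least-squares line.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible


def f(x):
    """Hypothetical true function (assumed linear for this sketch)."""
    return 0.2 * x - 0.1


n = 100
x = rng.uniform(0.0, 1.0, size=n)
z = rng.standard_normal(n)         # one set of draws, rescaled for each sd
grid = np.linspace(0.0, 1.0, 200)  # points at which the estimate is compared with f

rmse = {}
for sd in (0.001, 0.005, 0.01, 0.03):
    y = f(x) + sd * z                           # same noise pattern, different spread
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares straight line
    f_hat = slope * grid + intercept            # estimate of f on the grid
    rmse[sd] = float(np.sqrt(np.mean((f_hat - f(grid)) ** 2)))

for sd, err in rmse.items():
    print(f"sd={sd}: RMSE of the estimate of f = {err:.6f}")
```

Under these assumptions the estimation error grows with sd: the noisier the observations, the further the fitted line tends to sit from the true f.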
