
Modified based on materials from: http://www.mmds.org

Chenhao Ma
machenhao@cuhk.edu.cn
• Course Information
• What is data mining? - Introduction

¡ Instructor:
§ Dr. Chenhao Ma
§ machenhao@cuhk.edu.cn, 319b Daoyuan Bldg
§ Office hour: 5:00-6:00PM Tuesday
§ Research interest: large-scale data management
and data mining

¡ Teaching assistants:
§ Miss Yue Zhang
§ 223040247@link.cuhk.edu.cn
§ 109, 4F SDS research space Zhixin Bldg
§ Office hour: 3:00-4:00PM Friday

§ Mr. Yuyang Liang
§ 119010174@link.cuhk.edu.cn
§ 4F SDS research space Zhixin Bldg
§ Office hour: 11:00AM-12:00PM Friday
¡ USTFs
§ Mr. Xinyang Gao
§ 120030046@link.cuhk.edu.cn

§ Miss Mengzhen Zhang
§ 120090384@link.cuhk.edu.cn

¡ Lecture
§ 3:30-4:50pm Tue/Thu
§ Teaching A Bldg 101

¡ Tutorial
§ 6:00-6:50pm Thu (starting from next week)
§ Teaching C Bldg 308

¡ Working language: English
§ After-class discussion can be in English/Chinese
¡ Knowledge:
§ understand the key models and concepts of
contemporary data mining
§ understand the strengths and limitations of
popular data mining techniques
§ understand popular data mining algorithms for
solving real-world problems
§ be able to identify promising applications of data
mining using biomedical, textual and graph data

¡ Skills:
§ Utilize a programming language to learn, visualize,
and mine new insights from big data
§ Utilize existing software to analyze available data
to inform critical decisions
§ Study data scientifically, and use it to prove
hypotheses
§ Be able to actively manage and participate in data
mining projects executed by consultants or
specialists in data mining

¡ Values/Attitude:
§ Be equipped with theoretical data mining knowledge
and be able to communicate with domain experts to
solve their data science problems
§ Be more vigilant about data science issues and
aware of data science developments in
society
§ Be aware of the impact of data mining in social,
industrial, environmental and technological contexts
§ Be more literate in data science, and develop enough
knowledge to disseminate
new developments related to data science

¡ 4 assignments (40%)
§ Theoretical and programming questions
§ A1 will be released around week #4
§ Roughly one assignment every three weeks after the
first
§ Late submission: up to 2 days with 20% penalty.
§ Not every API used in the assignments will be
introduced in the lectures
§ Start early ~

¡ Midterm (20%)
§ In week #7 or #8
§ Probably a 1.5-hour closed-book written exam
§ Details will be announced later

¡ Final exam (40%)
§ Details will be announced later

¡ Content: Programming practice and lecture
review
¡ Colab 0 (the tutorial for Spark) will be
released soon

¡ Leskovec, J., Rajaraman, A., and Ullman, J., Mining of
Massive Datasets (3rd edition). Cambridge
University Press. ISBN-13: 978-1108476348
§ We mainly follow this one in lectures
§ Available online: http://www.mmds.org/

¡ Tan, P., Steinbach, M., Karpatne, A., and Kumar, V.,
Introduction to Data Mining (second edition).
Pearson. ISBN 978-0321321367

¡ Han, J., Pei, J. and Kamber, M., 2011. Data Mining:
Concepts and Techniques. Elsevier. ISBN 978-0-12-381479-1
¡ Official prerequisites of this course:
§ CSC1001 or CSC1003 (programming skills)
§ STA2001 or MAT3280 or STA2003 (probability)
§ CSC3100 (data structures)
¡ The following would be helpful:
§ Discrete math
§ Database systems (SQL, relational algebra)
§ Common Linux Commands

¡ Each of the topics listed is important for a part
of the course:
§ If you are missing an item of background, you can
consider just-in-time learning of the needed
material.

¡ Colab 0 can also help you decide whether to
take this course.
§ Programming skill is important.
§ You need to be comfortable with writing code in
Python or Java.

¡ ACM-ICPC
§ ACM: Association for Computing Machinery
§ ICPC: International Collegiate Programming Contest

¡ Online Judge (OJ) systems
§ Many programming problems
§ CUHKSZ OJ
§ http://oj.cuhk.edu.cn/ (accessible on campus)
§ SJTU OJ
§ https://acm.sjtu.edu.cn/OnlineJudge/
§ PKU OJ
§ http://poj.org/

Week  Content/topic/activity
1     Introduction to data mining
2     Map-Reduce
3     Spark
4     Frequent items and association rules
5     Finding similar items
6     Mining data streams
7     Mid-term
8     Mining data streams, Link analysis
9     Link analysis
10    Clustering
11    Advertising on the Web
12    Recommender systems
13    Graph mining
14    Recap
¡ Feedback is important and highly
appreciated!
§ Talk to course instructors and TAs
§ Send us emails
§…

Figure: big data application areas (Social, User Tracking & Engagement, Government, eCommerce, Financial Services, Real Time Search)

¡ Big data is often characterized by the
"three Vs"
¡ Volume: In a big data environment, the
amounts of data collected and processed are
much larger than those stored in typical
relational databases.

¡ Variety: Big data consists of a rich variety of
data types.
¡ Velocity: Big data arrives at the organization
at high speed and from multiple sources
simultaneously.

¡ In the big data era, a huge amount of data is
being generated every day
Recent Twitter statistics:
https://www.omnicoreagency.com/twitter-statistics/
¡ Data volume is increasing exponentially (40%
increase per year)

Figure: data volume in zettabytes from 2010 to 2025. A forecast by IDC & Seagate. Image by Sven Balnojan.
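To make the 40% figure concrete, here is a minimal sketch of what compound growth at that rate implies. It assumes the rate stays constant over the whole 2010 to 2025 period, which the actual forecast may not follow exactly; the variable names are illustrative only.

```python
import math

annual_growth = 0.40  # 40% increase per year, as stated on the slide

# Doubling time under compound growth: solve (1 + g)^t = 2 for t.
doubling_time_years = math.log(2) / math.log(1 + annual_growth)

# Overall growth factor from 2010 to 2025, assuming a constant rate.
years = 2025 - 2010
growth_factor = (1 + annual_growth) ** years

print(f"doubling time: {doubling_time_years:.2f} years")
print(f"growth factor over {years} years: {growth_factor:.0f}x")
```

Under this assumption the data volume doubles roughly every two years and grows by more than two orders of magnitude over the 15-year span.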


¡ Different Types:
§ Relational Data (Tables/Transaction/Legacy Data)
§ Text Data (Web)
§ Semi-structured Data (XML)
§ Spatial Data
§ Temporal Data
§ Graph Data
§ Social Network, Semantic Web (RDF), …
§ One application can be generating/collecting many
types of data
¡ Different Sources:
§ Movie reviews from IMDB and Rotten Tomatoes
§ Product reviews from different provider websites

To extract knowledge -> all these types of
data need to be linked together

Figure labels: Social Media, Banking, Finance, Gaming, Customer, Search Engine, Entertain, Purchase

¡ Velocity essentially measures how fast the
data is coming in.
¡ Data is being generated fast and needs to be
processed fast
§ Late decisions -> missing opportunities

¡ High velocity is common in online data analytics, for
example
§ E-Promotions: based on your current location,
your purchase history, what you like -> send
promotions right now for the store next to you
§ Healthcare monitoring: sensors monitoring your
activities and body -> any abnormal
measurements require immediate reaction

'The Owner of This iPhone Was in a Severe Car Crash' - or Just on a
Roller Coaster (WSJ)
The statistics for 1 second in many applications.
http://www.internetlivestats.com/one-second/

Data contains value and knowledge
¡ But to extract the knowledge,
data needs to be
§ Stored (systems)
§ Managed (databases)
§ And ANALYZED <- this class

Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science

¡ Given lots of data
¡ Discover patterns and models that are:
§ Valid: hold on new data with some certainty
§ Useful: should be possible to act on the item
§ Unexpected: non-obvious to the system
§ Understandable: humans should be able to
interpret the pattern

¡ Descriptive methods
§ Find human-interpretable patterns that
describe the data
§ Example: Clustering

¡ Predictive methods
§ Use some variables to predict unknown
or future values of other variables
§ Example: Recommender systems

¡ A risk with “Data mining” is that an analyst
can “discover” patterns that are meaningless
¡ Statisticians call it Bonferroni’s principle:
§ Roughly, if you look in more places for interesting
patterns than your amount of data will support,
you are bound to find crap

Example:
¡ We want to find (unrelated) people who at least twice
have stayed at the same hotel on the same day
§ 10^9 people being tracked
§ 1,000 days
§ Each person stays in a hotel 1% of the time (1 day out of 100)
§ Hotels hold 100 people (so 10^5 hotels)
§ If everyone behaves randomly (i.e., no terrorists), will the
data mining detect anything suspicious?
¡ Expected number of “suspicious” pairs of people:
§ ~250,000
§ … too many combinations to check: we need some
additional evidence to find “suspicious” pairs of people in
some more efficient way
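The ~250,000 figure can be reproduced with a short calculation. The sketch below assumes 10^9 people and 10^5 hotels, that people choose days and hotels independently and uniformly at random, and that a "suspicious" event is one pair of people sharing a hotel on two distinct days. The variable names are illustrative.

```python
from math import comb

# Parameters of the example (assumed: 10^9 people, 10^5 hotels).
n_people = 10**9
n_days = 1_000
n_hotels = 10**5
p_hotel = 0.01  # probability that a given person is in a hotel on a given day

# Two specific people are in the same hotel on one given day if both
# visit a hotel (p^2) and the second picks the same hotel (1 / n_hotels).
p_same_hotel_one_day = p_hotel**2 / n_hotels

# The coincidence must happen on two specific, distinct days.
p_two_days = p_same_hotel_one_day**2

# Number of candidate (pair of people, pair of days) combinations.
pairs_of_people = comb(n_people, 2)
pairs_of_days = comb(n_days, 2)

expected_suspicious = pairs_of_people * pairs_of_days * p_two_days
print(f"expected suspicious pairs: {expected_suspicious:,.0f}")
```

The exact binomial counts give a value slightly under 250,000; the slide's round number comes from approximating the pair counts as 5 x 10^17 and 5 x 10^5. Either way, purely random behavior produces hundreds of thousands of "suspicious" pairs.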
Figure labels: Usage, Quality, Context, Streaming, Scalability

¡ Data mining overlaps with:
§ Databases: Large-scale data, simple queries
§ Machine learning: Complex models
§ CS Theory: (Randomized) Algorithms
¡ Different cultures:
§ To a DB person, data mining is an extreme form of
analytic processing: queries that
examine large amounts of data
§ Result is the query answer
§ To an ML person, data mining
is the inference of models
§ Result is the parameters of the model
¡ In this class we will do both!
Figure: Venn diagram with Data Mining at the overlap of CS Theory, Machine Learning, and Database systems
¡ Data mining combines the best of machine learning,
statistics, artificial intelligence, and databases, but
with more stress on
§ Scalability (big data)
§ Algorithms
§ Computing architectures
§ Automation for handling
large data
Figure: Data Mining at the overlap of Statistics, Machine Learning, and Database systems

¡ We will learn to mine different types of data:
§ Data is high dimensional
§ Data is a graph
§ Data is infinite/never-ending
§ Data is labeled
¡ We will learn to use different models of
computation:
§ MapReduce
§ Streams and online algorithms
§ Single machine in-memory
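As a first taste of the MapReduce model listed above, here is a minimal single-machine sketch of word counting in plain Python. It only imitates the map, shuffle, and reduce phases in memory; it is not Hadoop or Spark code, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Map every document, group by word, then reduce each group.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data is big", "data mining"])
print(counts)
```

In a real cluster the map and reduce functions run on many machines and the framework performs the shuffle; the programmer supplies only the two functions.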
¡ We will learn to solve real-world problems:
§ Recommender systems
§ Market Basket Analysis
§ Spam detection
§ Duplicate document detection
¡ We will learn various “tools”:
§ Linear algebra (Rec. Sys., Communities)
§ Optimization (stochastic gradient descent)
§ Dynamic programming (frequent itemsets)
§ Hashing (LSH, Bloom filters)
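To give a flavor of the hashing tools mentioned above, the following is a minimal Bloom filter sketch. The sizes (1024 bits, 3 hash functions) and the salted SHA-256 hashing scheme are assumptions chosen for readability, not a tuned or course-specified design.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = [False] * n_bits

    def _positions(self, item):
        # Derive n_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add("alice@example.com")
print(seen.might_contain("alice@example.com"))
```

An item that was added is always reported as possibly present; an item that was never added is usually, but not always, reported as absent. That trade-off is what lets the filter use far less memory than storing the items themselves.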
¡ Please work on Colab 0

¡ Extra materials:
§ A Systematic View of Data Science
§ By M. Tamer Özsu (https://cs.uwaterloo.ca/~tozsu/)
§ https://cs.uwaterloo.ca/~tozsu/presentations/DataScience-2022-04.pdf

§ OLTP and OLAP: a practical comparison
§ https://www.stitchdata.com/resources/oltp-vs-olap/

I ♥ data

How do you want that data?

