ch01 Intro

Modified based on materials from: http://www.mmds.org
Chenhao Ma
machenhao@cuhk.edu.cn
• Course Information
• What is data mining? - Introduction
¡ Instructor:
§ Dr. Chenhao Ma
§ machenhao@cuhk.edu.cn, 319b Daoyuan Bldg
§ Office hour: 5:00-6:00PM Tuesday
§ Research interest: large-scale data management
and data mining
¡ Teaching assistants:
§ Miss Yue Zhang
§ 223040247@link.cuhk.edu.cn
§ 109, 4F SDS research space Zhixin Bldg
§ Office hour: 3:00-4:00PM Friday
§ Mr. Yuyang Liang

§ 4F SDS research space Zhixin Bldg
§ Office hour: 11:00AM-12:00PM Friday
¡ USTFs
§ Mr Xinyang Gao
§ Miss Mengzhen Zhang

¡ Lecture
§ 3:30-4:50pm Tue/Thu
§ Teaching A Bldg 101
¡ Tutorial
§ 6:00-6:50pm Thu (starting from next week)
§ Teaching C Bldg 308
¡ Working language: English

§ After-class discussion can be in English/Chinese
¡ Knowledge:
§ understand the key models and concepts of
contemporary data mining
§ understand the strengths and limitations of
popular data mining techniques
§ understand popular data mining algorithms for
solving real-world problems
§ be able to identify promising applications of data
mining using biomedical, textual and graph data
¡ Skills:
§ Utilize a programming language to learn, visualize,
and mine new insights from big data
§ Utilize existing software to analyze available data
to inform critical decisions
§ Study data scientifically, and use it to prove
hypotheses
§ Be able to actively manage and participate in data
mining projects executed by consultants or
specialists in data mining
¡ Values/Attitude:
§ Be equipped with theoretical data mining knowledge
and be able to communicate with domain experts to
solve their data science problems
§ Be more vigilant about data science issues and
aware of data science developments in
society
§ Be aware of the impact of data mining in social,
industrial, environmental and technological contexts
§ Be more literate in data science, and develop enough
knowledge to disseminate new data science
developments
¡ 4 assignments (40%)
§ Theoretical and programming questions
§ A1 will be released around week #4
§ Roughly one assignment every three weeks after the
first
§ Late submission: up to 2 days with 20% penalty.
§ Not every API used in the assignments will be
introduced in lectures
§ Start early
¡ Midterm (20%)
§ In week #7 or #8
§ Probably 1.5-hour closed-book written exam
§ Details will be announced later
¡ Final exam (40%)

§ Details will be announced later
¡ Content: Programming practice and lecture
review
¡ Colab 0 (the tutorial for Spark) will be
released soon
¡ Leskovec, J., Rajaraman, A., and Ullman, J., Mining of
Massive Datasets (3rd edition). Cambridge
University Press. ISBN-13: 978-1108476348
§ We mainly follow this one in lectures
§ Available online: http://www.mmds.org/
¡ Tan, P., Steinbach, M., Karpatne, A., and Kumar, V.,
Introduction to Data Mining (2nd edition).
Pearson. ISBN 978-0321321367
¡ Han, J., Pei, J., and Kamber, M., 2011. Data Mining:
Concepts and Techniques. Elsevier. ISBN 978-0-12-381479-1
¡ Official prerequisites of this course are
§ CSC1001 or CSC1003 (programming skills)
§ STA2001 or MAT3280 or STA2003 (probability)
§ CSC3100 (Data structure)
¡ The following would be helpful:
§ Discrete math
§ Database systems (SQL, relational algebra)
§ Common Linux Commands
¡ Each of the topics listed is important for a part
of the course:
§ If you are missing an item of background, you can
consider just-in-time learning of the needed
material.
¡ Colab 0 can also help you decide whether to
take this course.
§ Programming skill is important.
§ You need to be comfortable with writing code in
Python or Java.
¡ ACM-ICPC
§ ACM: Association for Computing Machinery
§ ICPC: International Collegiate Programming Contest
¡ Online Judge (OJ) systems

§ Many programming problems
§ CUHKSZ OJ
§ http://oj.cuhk.edu.cn/ (accessible on campus)
§ SJTU OJ
§ https://acm.sjtu.edu.cn/OnlineJudge/
§ PKU OJ
§ http://poj.org/
Week  Content/topic/activity
1     Introduction to data mining
2     Map-Reduce
3     Spark
4     Frequent items and association rules
5     Finding similar items
6     Mining data streams
7     Mid-term
8     Mining data streams, Link analysis
9     Link analysis
10    Clustering
11    Advertising on the Web
12    Recommender systems
13    Graph mining
14    Recap
¡ Feedback is important and highly
appreciated!
§ Talk to course instructors and TAs
§ Send us emails
§…
[Figure labels: Social, User Tracking & Engagement, Government, eCommerce, Financial Services, Real Time Search]
¡ Big data is often characterized by the
"three Vs"
¡ Volume: In a big data environment, the
amounts of data collected and processed are
much larger than those stored in typical
relational databases.
¡ Variety: Big data consists of a rich variety of
data types.
¡ Velocity: Big data arrives at the organization
at high speed and from multiple sources
simultaneously.
¡ In the big data era, a huge amount of data is
generated every day
Recent Twitter statistics:
https://www.omnicoreagency.com/twitter-statistics/
¡ Data volume is increasing exponentially (40%
increase per year)
Data amount in zettabytes from 2010 to 2025
A forecast by IDC & Seagate. Image by Sven Balnojan.
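As a rough illustration of what a 40% annual increase implies, the sketch below compounds that rate over the 15 years covered by the chart and derives the doubling time. The function names are made up for this example, and it assumes the growth rate stays constant every year, which the forecast itself does not guarantee.

```python
import math

ANNUAL_GROWTH = 0.40  # 40% increase per year, as quoted on the slide


def growth_factor(years, rate=ANNUAL_GROWTH):
    """Factor by which the data volume multiplies after `years` years
    of constant compound growth."""
    return (1.0 + rate) ** years


def doubling_time(rate=ANNUAL_GROWTH):
    """Number of years needed for the data volume to double."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    print(f"Growth factor over 15 years (2010 to 2025): {growth_factor(15):.1f}x")
    print(f"Doubling time: {doubling_time():.2f} years")
```

Under this assumption the volume doubles roughly every two years, which is why a fixed-capacity system falls behind quickly.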

¡ Different Types:
§ Relational Data (Tables/Transaction/Legacy Data)
§ Text Data (Web)
§ Semi-structured Data (XML)
§ Spatial Data
§ Temporal Data
§ Graph Data
§ Social Network, Semantic Web (RDF), …
§ One application can be generating/collecting many
types of data
¡ Different Sources:
§ Movie reviews from IMDB and Rotten Tomatoes
§ Product reviews from different provider websites
To extract knowledge -> all these types of
data need to be linked together
[Figure labels: Social Media, Banking, Finance, Gaming, Customer, Search Engine, Entertain, Purchase]
¡ Velocity essentially measures how fast the
data is coming in.
¡ Data is generated fast and needs to be
processed fast
§ Late decisions -> missing opportunities
¡ Velocity matters most in online data analytics, for
example:
§ E-Promotions: based on your current location,
your purchase history, what you like -> send
promotions right now for the store next to you
§ Healthcare monitoring: sensors monitoring your
activities and body -> any abnormal
measurements require immediate reaction
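To make the healthcare monitoring example concrete, here is a minimal sketch of flagging abnormal readings as they arrive, without storing the whole stream. It is not a method from the slides: it uses Welford's online mean/variance update with a simple z-score threshold, and the class name, function name, threshold, warm-up length and sample heart-rate values are all assumptions chosen for illustration.

```python
import math


class RunningStats:
    """Online mean and variance (Welford's method), one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def std(self):
        # Sample standard deviation; undefined for fewer than 2 values.
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def flag_abnormal(readings, z_threshold=3.0, warmup=5):
    """Return (index, value) pairs that deviate strongly from the readings
    seen so far. Flagged values are not added to the statistics."""
    stats = RunningStats()
    alerts = []
    for index, value in enumerate(readings):
        std = stats.std()
        is_abnormal = (
            stats.count >= warmup
            and std > 0
            and abs(value - stats.mean) > z_threshold * std
        )
        if is_abnormal:
            alerts.append((index, value))  # react immediately
        else:
            stats.update(value)
    return alerts


# Hypothetical heart-rate readings with one abnormal spike.
heart_rate = [72, 74, 71, 73, 75, 72, 74, 150, 73, 72]
alerts = flag_abnormal(heart_rate)
print(alerts)  # [(7, 150)]
```

Each reading is processed once with constant memory, which is the property that matters when data arrives faster than it can be stored and re-scanned.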
The Owner of This iPhone Was in a Severe Car Crash'—or Just on a
Roller Coaster - WSJ
The statistics for 1 second in many applications.
http://www.internetlivestats.com/one-second/
Data contains value and knowledge
¡ But to extract the knowledge,
data needs to be:
§ Stored (systems)
§ Managed (databases)
§ And ANALYZED <- this class
Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science
¡ Given lots of data
¡ Discover patterns and models that are:
§ Valid: hold on new data with some certainty
§ Useful: should be possible to act on the item
§ Unexpected: non-obvious to the system
§ Understandable: humans should be able to
interpret the pattern
¡ Descriptive methods
§ Find human-interpretable patterns that
describe the data
§ Example: Clustering
¡ Predictive methods
§ Use some variables to predict unknown
or future values of other variables
§ Example: Recommender systems
¡ A risk with “Data mining” is that an analyst
can “discover” patterns that are meaningless
¡ Statisticians call it Bonferroni’s principle:
§ Roughly, if you look in more places for interesting
patterns than your amount of data will support,
you are bound to find crap
Example:
¡ We want to find (unrelated) people who at least twice
have stayed at the same hotel on the same day
§ 10^9 people being tracked
§ 1,000 days
§ Each person stays in a hotel 1% of the time (1 day out of 100)
§ Hotels hold 100 people (so 10^5 hotels)
§ If everyone behaves randomly (i.e., no terrorists) will the
data mining detect anything suspicious?
¡ Expected number of “suspicious” pairs of people:
§ ~250,000
§ … too many combinations to check – we need to have some
additional evidence to find “suspicious” pairs of people in
some more efficient way
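The estimate above can be reproduced with a short calculation. The sketch below assumes that people choose days and hotels independently and uniformly at random, and it counts a pair as "suspicious" when the two people share a hotel on two given days. The function and variable names are illustrative and not part of the course material.

```python
from math import comb


def expected_suspicious_pairs(people, days, hotels, p_stay):
    """Expected number of (pair of people, pair of days) combinations in
    which both people stay at the same hotel on both days, assuming
    purely random behaviour."""
    # Both people happen to be in some hotel on a given day.
    p_both_in_hotel = p_stay * p_stay
    # ... and they picked the same one out of `hotels` hotels.
    p_same_hotel_one_day = p_both_in_hotel / hotels
    # The same coincidence on two given (distinct) days.
    p_same_hotel_two_days = p_same_hotel_one_day ** 2

    pairs_of_people = comb(people, 2)
    pairs_of_days = comb(days, 2)
    return pairs_of_people * pairs_of_days * p_same_hotel_two_days


# Numbers from the example: 10^9 people, 1,000 days, 10^5 hotels, 1% stay rate.
estimate = expected_suspicious_pairs(10**9, 1000, 10**5, 0.01)
print(f"Expected suspicious pairs: {estimate:,.0f}")
```

With the numbers from the example the result is roughly 250,000, matching the figure on the slide: random behaviour alone produces a quarter of a million "suspicious" pairs, so the pattern by itself carries no evidence.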
Usage
Quality
Context
Streaming
Scalability
¡ Data mining overlaps with:
§ Databases: Large-scale data, simple queries
§ Machine learning: Complex models
§ CS Theory: (Randomized) Algorithms
¡ Different cultures:
§ To a DB person, data mining is an extreme form of
analytic processing: queries that
examine large amounts of data
§ Result is the query answer
§ To an ML person, data mining
is the inference of models
§ Result is the parameters of the model
¡ In this class we will do both!
[Figure: Data Mining at the overlap of CS Theory, Machine Learning, and Database systems]
¡ Data mining combines the best of machine learning,
statistics, artificial intelligence, and databases, but
puts more stress on
§ Scalability (big data)
§ Algorithms
§ Computing architectures
§ Automation for handling
large data
[Figure: Data Mining at the overlap of Statistics, Machine Learning, and Database systems]
¡ We will learn to mine different types of data:
§ Data is high dimensional
§ Data is a graph
§ Data is infinite/never-ending
§ Data is labeled
¡ We will learn to use different models of
computation:
§ MapReduce
§ Streams and online algorithms
§ Single machine in-memory
¡ We will learn to solve real-world problems:
§ Recommender systems
§ Market Basket Analysis
§ Spam detection
§ Duplicate document detection
¡ We will learn various “tools”:
§ Linear algebra (Rec. Sys., Communities)
§ Optimization (stochastic gradient descent)
§ Dynamic programming (frequent itemsets)
§ Hashing (LSH, Bloom filters)
¡ Please work on Colab 0
¡ Extra materials:
§ A Systematic View of Data Science
§ By M. Tamer Özsu (https://cs.uwaterloo.ca/~tozsu/)
§ https://cs.uwaterloo.ca/~tozsu/presentations/DataScience-2022-04.pdf
§ OLTP and OLAP: a practical comparison

§ https://www.stitchdata.com/resources/oltp-vs-olap/
