Chenhao Ma
machenhao@cuhk.edu.cn
• Course Information
• What is data mining? -
Introduction
¡ Instructor:
§ Dr. Chenhao Ma
§ machenhao@cuhk.edu.cn, 319b Daoyuan Bldg
§ Office hour: 5:00-6:00PM Tuesday
§ Research interest: large-scale data management
and data mining
¡ Teaching assistants:
§ Miss. Yue Zhang
§ 223040247@link.cuhk.edu.cn
§ 109, 4F SDS research space Zhixin Bldg
§ Office hour: 3:00-4:00PM Friday
¡ Lecture
§ 3:30-4:50pm Tue/Thu
§ Teaching A Bldg 101
¡ Tutorial
§ 6:00-6:50pm Thu (starting from next week)
§ Teaching C Bldg 308
¡ Skills:
§ Utilize a programming language to learn, visualize,
and mine new insights from big data
§ Utilize existing software to analyze available data
to inform critical decisions
§ Study data scientifically, and use it to prove
hypotheses
§ Be able to actively manage and participate in data
mining projects executed by consultants or
specialists in data mining
¡ Values/Attitude:
§ Be equipped with theoretical data mining knowledge
and be able to communicate with domain experts to
solve their data science problems
§ Be more vigilant about data science issues and
aware of data science developments in
society
§ Be aware of the impact of data mining in social,
industrial, environmental and technological contexts
§ Be more literate in data science, and develop enough
knowledge of it to disseminate
new data science developments
¡ 4 assignments (40%)
§ Theoretical and programming questions
§ A1 will be released around week #4
§ Roughly one assignment every three weeks after the
first
§ Late submission: up to 2 days with 20% penalty.
§ Not every API used in the assignments will be
introduced in lectures
§ Start early ~
¡ Midterm (20%)
§ In week #7 or #8
§ Probably 1.5-hour closed-book written exam
§ Details will be announced later
¡ Content: Programming practice and lecture
review
¡ Colab 0 (the tutorial for Spark) will be
released soon
¡ Leskovec, J., Rajaraman, A., and Ullman, J., Mining of
Massive Datasets (3rd edition), Cambridge
University Press, ISBN-13: 978-1108476348
§ We mainly follow this one in lectures
§ Available online: http://www.mmds.org/
¡ Each of the topics listed is important for a part
of the course:
§ If you are missing an item of background, you can
consider just-in-time learning of the needed
material.
¡ ACM-ICPC
§ ACM: Association for Computing Machinery
§ ICPC: International Collegiate Programming Contest
Week Content/ topic/ activity
1 Introduction to data mining
2 Map-Reduce
3 Spark
4 Frequent items and association rules
5 Finding similar items
6 Mining data streams
7 Mid-term
8 Mining data streams, Link analysis
9 Link analysis
10 Clustering
11 Advertising on the Web
12 Recommender systems
13 Graph mining
14 Recap
¡ Feedback is important and highly
appreciated!
§ Talk to course instructors and TAs
§ Send us emails
§…
[Figure labels: Social; User Tracking & Engagement; Government]
¡ The Vs of big data were often referred to
as the "three Vs"
¡ Volume: In a big data environment, the
amounts of data collected and processed are
much larger than those stored in typical
relational databases.
¡ Variety: Big data consists of a rich variety of
data types.
¡ Velocity: Big data arrives at the organization
at high speed and from multiple sources
simultaneously.
¡ In the big data era, a huge amount of data is
being generated every day
Recent Twitter statistics
https://www.omnicoreagency.com/twitter-statistics/
¡ Data volume is increasing exponentially (40%
increase per year)
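To make the growth rate concrete, here is a minimal sketch of the compound-growth arithmetic behind a 40% annual increase. The function names are illustrative, and the constant-rate assumption is ours rather than a claim from the slides.

```python
import math

ANNUAL_GROWTH = 0.40  # 40% increase per year, the figure quoted above


def growth_factor(years: float, rate: float = ANNUAL_GROWTH) -> float:
    """Multiplicative increase in volume after `years` of compound growth."""
    return (1.0 + rate) ** years


def doubling_time(rate: float = ANNUAL_GROWTH) -> float:
    """Years needed for the volume to double at a constant annual rate."""
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    print(f"Doubling time: {doubling_time():.2f} years")
    print(f"Growth over 10 years: x{growth_factor(10):.1f}")
```

Under this assumption the volume doubles roughly every two years and grows nearly 29-fold in a decade.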
[Figure labels: Social Media; Banking; Finance; Gaming; Customer; Search Engine; Entertain; Purchase]
¡ Velocity essentially measures how fast the
data is coming in.
¡ Data is being generated fast and needs to be
processed fast
§ Late decisions -> missing opportunities
¡ High velocity is typically encountered in online data
analytics, for example:
§ E-Promotions: based on your current location,
your purchase history, what you like -> send
promotions right now for the store next to you
§ Healthcare monitoring: sensors monitoring your
activities and body -> any abnormal
measurements require immediate reaction
'The Owner of This iPhone Was in a Severe Car Crash' - or Just on a
Roller Coaster - WSJ
The statistics for 1 second in many applications.
http://www.internetlivestats.com/one-second/
Data contains value and knowledge
¡ But to extract the knowledge,
data needs to be
§ Stored (systems)
§ Managed (databases)
§ And ANALYZED <- this class
¡ Given lots of data
¡ Discover patterns and models that are:
§ Valid: hold on new data with some certainty
§ Useful: should be possible to act on the item
§ Unexpected: non-obvious to the system
§ Understandable: humans should be able to
interpret the pattern
¡ Descriptive methods
§ Find human-interpretable patterns that
describe the data
§ Example: Clustering
¡ Predictive methods
§ Use some variables to predict unknown
or future values of other variables
§ Example: Recommender systems
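As an illustration of a descriptive method, the sketch below clusters one-dimensional points with a bare-bones k-means loop. k-means is just one clustering technique, chosen here for brevity and not prescribed by the slides. The data, the initial centers, and the function name are made up for the example.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: assign each point to its nearest center, then
    move each center to the mean of its assigned points."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical data: two obvious groups, around 1 and around 9.5.
POINTS = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
final_centers, final_clusters = kmeans_1d(POINTS, centers=[0.0, 5.0])
```

The resulting groups are a human-interpretable summary of the data, which is what distinguishes a descriptive method from a predictive one.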
¡ A risk with “Data mining” is that an analyst
can “discover” patterns that are meaningless
¡ Statisticians call it Bonferroni’s principle:
§ Roughly, if you look in more places for interesting
patterns than your amount of data will support,
you are bound to find crap
Example:
¡ We want to find (unrelated) people who at least twice
have stayed at the same hotel on the same day
§ 10^9 people being tracked
§ 1,000 days
§ Each person stays in a hotel 1% of time (1 day out of 100)
§ Hotels hold 100 people (so 10^5 hotels)
§ If everyone behaves randomly (i.e., no terrorists) will the
data mining detect anything suspicious?
¡ Expected number of “suspicious” pairs of people:
§ ~250,000
§ … too many combinations to check – we need to have some
additional evidence to find “suspicious” pairs of people in
some more efficient way
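The ~250,000 figure can be reproduced with a short calculation. This is a sketch under assumptions of ours: people choose days and hotels independently and uniformly at random, and a "suspicious" event is a pair of people at the same hotel on each of two given days (the hotel may differ between the two days). The function and constant names are illustrative.

```python
from math import comb

PEOPLE = 10**9       # people being tracked
DAYS = 1_000         # days of observation
P_IN_HOTEL = 0.01    # chance a given person is in some hotel on a given day
HOTELS = 10**5       # number of hotels


def expected_suspicious_pairs(people=PEOPLE, days=DAYS,
                              p_in_hotel=P_IN_HOTEL, hotels=HOTELS):
    """Expected number of (pair of people, pair of days) combinations in
    which both people are at the same hotel on both days, purely by chance."""
    # Both people are in a hotel on a given day, and it is the same hotel.
    p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels
    # The same coincidence happens on two given days (independently).
    p_same_hotel_two_days = p_same_hotel_one_day ** 2
    return comb(people, 2) * comb(days, 2) * p_same_hotel_two_days


if __name__ == "__main__":
    print(f"Expected suspicious pairs: {expected_suspicious_pairs():,.0f}")
```

With exact pair counts the result comes out just under 250,000. Approximating the number of pairs of n items by n^2/2 gives the round figure quoted above.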
[Figure labels: Usage; Quality; Context; Streaming; Scalability]
¡ Data mining overlaps with:
§ Databases: Large-scale data, simple queries
§ Machine learning: Complex models
§ CS Theory: (Randomized) Algorithms
¡ Different cultures:
§ To a DB person, data mining is an extreme form of
analytic processing – queries that
examine large amounts of data
§ Result is the query answer
¡ We will learn to mine different types of data:
§ Data is high dimensional
§ Data is a graph
§ Data is infinite/never-ending
§ Data is labeled
¡ We will learn to use different models of
computation:
§ MapReduce
§ Streams and online algorithms
§ Single machine in-memory
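To give a first feel for the MapReduce model listed above, here is a single-machine sketch of word count split into map, shuffle, and reduce steps. It only mimics the programming model in plain Python. It is not how Hadoop or Spark is invoked, and all function names and the sample documents are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, shuffle, then reduce per key."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input documents.
DOCUMENTS = ["big data mining", "data streams", "mining big data"]
counts = word_count(DOCUMENTS)
```

In a real cluster the map and reduce calls run in parallel on different machines and the framework handles the shuffle; the logic per key stays the same.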
¡ We will learn to solve real-world problems:
§ Recommender systems
§ Market Basket Analysis
§ Spam detection
§ Duplicate document detection
¡ We will learn various “tools”:
§ Linear algebra (Rec. Sys., Communities)
§ Optimization (stochastic gradient descent)
§ Dynamic programming (frequent itemsets)
§ Hashing (LSH, Bloom filters)
¡ Please work on Colab 0
¡ Extra materials:
§ A Systematic View of Data Science
§ By M. Tamer Özsu (https://cs.uwaterloo.ca/~tozsu/)
§ https://cs.uwaterloo.ca/~tozsu/presentations/DataScience-2022-04.pdf
I ♥ data