Session 1-Introduction To Data Mining

Session 1: Introduction to Data Mining
Lecturer: Dr. Le Hoang Son

Vietnam National University (VNU)
lehoangson@hus.edu.vn
sonlh@vnu.edu.vn
Contents
1 What is Data Mining?
2 Data Mining Data
3 Data Mining Functionalities
4 Data Mining Systems
5 Major Issues
6 Discussion & Exercises
2
1. What is Data Mining?
 The progress of computer hardware
technology:
 Great boost to the database and
information industry
 Huge number of databases and
information repositories available
 The abundance of data raises the need for

powerful data analysis tools.
 It is the fact that important decisions are

often made based on a decision maker’s
intuition.
 Decision makers indeed do not have the

tools to extract the valuable knowledge
embedded in the vast amounts of data.
3
Situation 1
HOW DO YOU KNOW THE PERSON WHO

HAS THE ATM CARD IS THE REAL OWNER
OR NOT?
4
Situation 2
How’s about the evade of customer with TID =

100?
5
Situation 3
HOW ABOUT THE STOCK TOMORROW?
6
Challenges to data store and management
 Data mining can be viewed as a result of the natural evolution of information
technology
 The database system industry:

 Data collection and database creation
 Data management (including data storage and retrieval, database
transaction processing).
 Data analysis (involving data warehousing and data mining)
 The development in database systems:

 Hierarchical and network database systems
 Relational database systems
 Object-oriented model
 Spatial, temporal, multimedia, active, stream, sensor, engineering databases
 Distribution, diversification, and sharing of data on WWW
7
Facebook
A large social network (~ 1 billion users) that allows registered users to
create profiles, upload photos and video, send messages with friends
and family
8
YouTube
A large video social network that allows billions of people to discover,
watch, share originally-created videos and connect to others across the
globe for original content creators and advertisers large and small
9
Gmail
A free, advertising-supported email and service provided by Google. As
of February 2016, it was the most widely used web-based email
provider with over 1 billion active users worldwide.
10
The need of data mining tools
 The fast-growing, tremendous amount of
data, collected and stored in large and
numerous data repositories, has far
exceeded our human ability for
comprehension without powerful tools
 As a result, data collected in large data

repositories become “data tombs”—data
archives that are seldom visited
 The widening gap between data and

information calls for a systematic
development of data mining tools that will
turn data tombs into “golden nuggets” of
knowledge
11
Definitions
 Data mining refers to
extracting or “mining”
knowledge from large
amounts of data
 Data mining should have

been more appropriately
named “knowledge mining
from data”
 Knowledge mining from

data, knowledge extraction,
data/pattern analysis, data
archaeology, and data
dredging
9
Steps in Data Mining
13
Descriptions
DATA PROCESSING:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for
mining by performing summary or aggregation operations, for instance)
DATA MINING:
1. Data mining (an essential process where intelligent methods are applied in order to extract
data patterns)
KNOWLEDGE EXTRACTION:
1. Pattern evaluation (to identify the truly interesting patterns representing knowledge based
on some interestingness measures)
2. Knowledge presentation (where visualization and knowledge representation techniques are
used to present the mined knowledge to the user)
14
Architecture of a typical data mining system
15
Explanation
 Database, data warehouse, World Wide Web, or other
information repository: Data cleaning and data
integration techniques may be performed
 Database or data warehouse server: responsible for

fetching the relevant data, based on the user’s data
mining request
 Knowledge base: This is the domain knowledge that is

used to guide the search or evaluate the interestingness
of resulting patterns
 Data mining engine: characterization, association,

correlation analysis, classification, prediction, cluster
analysis, outlier analysis and evolution analysis
 Pattern evaluation module: This component typically

employs interestingness measures to find patterns
16
Knowledge discovery process
17
Data Mining vs. others
 A data analysis system that does not handle large amounts of data should be
more appropriately categorized as a machine learning system, a statistical data
analysis tool, or an experimental system prototype
 A system that can only perform data or information retrieval, including finding
aggregate values, or that performs deductive query answering in large databases
should be more appropriately categorized as a database system, an information
retrieval system, or a deductive database system
 Data mining involves an integration of techniques:

 Database and data warehouse technology
 Statistics, machine learning
 High-performance computing,
 Pattern recognition, neural networks, data visualization, information retrieval, image and signal
processing, and spatial or temporal data analysis
 Data mining is considered one of the most important frontiers in database and
information systems
18
Contents
2 Data Mining Data
5 Major Issues
19
Relational Database
 The AllElectronics company is described by
the following relation tables: customer, item,
employee, and branch
 Relational data can be accessed by

database queries written in a relational query
language, such as SQL
 Analysis question: “How many sales

transactions occurred in the month of
December?”
 Data mining systems can analyze customer

data to predict the credit risk of new
customers based on their income, age, and
previous credit information
20
Data Warehouse
 AllElectronics has branches around the world.
Each branch has its own set of databases.
 Provide an analysis of the company’s sales per

item type per branch for the third quarter
 A data warehouse is a repository of information

collected from multiple sources, stored under a
unified schema, and that usually resides at a
single site
 Data warehouses are constructed via a process

of data cleaning, data integration, data
transformation, data loading, periodic data
refreshing
 Data are organized around major subjects to

provide information from a historical
perspective (5–10 years) and are typically
summarized
21
Data Cube Representation
22
Transactional Database
 A transactional database consists of a file
where each record represents a transaction
 A transaction typically includes a unique

transaction identity number (trans ID) and a list
of the items making up the transaction (such as
items purchased in a store)
 “Which items sold well together?” This kind of

data analysis enable to bundle groups of items
together as a strategy for maximizing sales
 A regular data retrieval system is not able to

answer queries like the one above
 Data mining systems for transactional data can

do so by identifying frequent itemsets, that is,
sets of items that are frequently sold together
23
Object-Relational Database
 Extends the relational model by providing a rich
data type for handling complex objects and
object orientation
 Inherits the essential concepts of object-

oriented databases, where, in general terms,
each entity is considered as an object
 Each object has associated with it the following:

 A set of variables that describe the objects
 A set of messages that the object can use

to communicate with other objects, or with
the rest of the database system
 A set of methods, where each method

holds the code to implement a message
24
Temporal Database
 Stores relational data that include time-related
attributes involving several timestamps, each
having different semantics
 Data mining techniques can be used to find the

characteristics of object evolution, or the trend
of changes for objects in the database
 Analysis question: “When is the best time to

purchase AllElectronics stock?”
 Such analyses typically require defining

multiple granularity of time, for example,
time may be decomposed according to fiscal
years, academic years, or calendar years
25
Spatial Database
 Spatial database contain spatial-related
information, for examples geographic map
databases, very large-scale integration (VLSI)
or computed-aided design databases, and
medical and satellite image databases
 Data mining may uncover patterns describing

the characteristics of houses located near a
specified kind of location
 A spatial database that stores spatial objects

that change with time is called a spatio-
temporal database, from which interesting
information can be mined
 For example, we may be able to group the

trends of changing an event on a map
26
Multimedia Database
 Multimedia databases store image, audio, and
video data.
 They are used in applications such as picture

content-based retrieval, voice-mail systems,
video-on-demand systems, the World Wide
Web, and speech-based user interfaces that
recognize spoken commands
 Multimedia databases must support large

objects (~ several GB video) so that specialized
storage and search techniques are also
required
 An example of analysis: “what are the behavior

of objects in the scene?”
27
Heterogeneous Database
 A heterogeneous database consists of a set of
interconnected, autonomous component
databases
 The components communicate in order to

exchange information and answer queries
 Objects in one component database may differ

greatly from objects in other component
databases, making it difficult to assimilate their
semantics into the overall heterogeneous
database
 For example, the problem in exchanging

information regarding student academic
performance among different schools. Each
school may have its own computer system and
use its own curriculum and grading system
 How to find best students among them?

28
Stream Database
 Data flow in and out of an observation platform
(or window) dynamically.
 Unique features: huge or possibly infinite

volume, dynamically changing, flowing in and
out in a fixed order, allowing only one or a small
number of scans, and demanding fast (often
real-time) response time.
 Examples: time-series data, power supply,

network traffic, stock exchange,
telecommunications, Web click streams, video
surveillance, weather monitoring
 Analysis: “Detect intrusions of a computer

network based on the anomaly of message
flow?”
 Continuous query model
29
World Wide Web
 Understanding user access patterns:
 Improve system design by providing efficient access
between highly correlated objects
 Better marketing decisions e.g., by placing
advertisements in frequently visited documents,
or by providing better customer/user classification and
behavior analysis
 Web usage mining or Weblog mining

 Web page analysis: based on linkages among Web
pages can help rank Web pages based on their
importance, influence, and topics
 Automated Web page clustering and
classification: arrange Web pages in a
multidimensional manner based on their contents
 Web community analysis: identify hidden Web
social networks and communities and observe their
evolution
 It is important to understand the semantic

meaning of diverse Web pages and structure
them in an organized way for systematic
information retrieval and data mining
30
Contents
2 Data Mining Data
5 Major Issues
31
Data Mining Functionalities
Data Mining Tasks
Descriptive Predictive
 Descriptive mining tasks characterize the general properties of the data in the
database
 Predictive mining tasks perform inference on the current data in order to make
predictions
 In some cases, users may have no idea regarding what kinds of patterns in their
data may be interesting, and hence may like to search for several different kinds of
patterns in parallel
32
Statistical division
33
Machine learning division
34
Characterization and Discrimination
 Question:
 Produce a description summarizing the characteristics of
customers who spend more than $1,000 a year at AllElectronics
 The system should allow users to drill down on any dimension,
such as on occupation in order to view these customers
according to their type of employment
 Data characterization: summarizing data of the class (the target
class) in general terms
 Question:
 Compare two groups of AllElectronics customers, such as those
who shop for computer products regularly vs. those who rarely
shop for such products
 80% of the customers who frequently purchase computer
products are between 20 and 40 years old and have a university
education, whereas 60% of the customers who infrequently buy
such products are either seniors or youths, and have no
university degree.
 Data discrimination: comparison of the target class with one or a
set of comparative classes (often called the contrasting classes)
35
Mining Frequent Patterns, Associations, and Correlations
 Determine which items are frequently purchased

together within the same transactions in AllElectronics?
 buys(X, “computer”) ⇒ buys(X, “software”)
[support= 1%, confidence = 50%]
 age(X, “20...29”) ∧ income(X, “20K...29K”) ⇒
buys(X, “CD player”) [support= 2%, confidence =
60%]
 X is a variable representing a customer

 A confidence of 50% means that if a customer buys a
computer, there is a 50% chance that she will buy
software as well
 1% support means that 1% of all of the transactions
under analysis showed that computer and software
were purchased together
Association Rule Mining: discovery of interesting

associations and correlations within data
36
Classification and Prediction
 Question:
 Classify a large set of items in the store, based
on three kinds of responses to a sales
campaign: good response, mild response, and
no response
 Derive a model for each of these three classes
based on the descriptive features of the items,
such as price, brand, place made, type, and
category
 The resulting classification should maximally
distinguish each class from the others,
presenting an organized picture of the data set
 Classification: finding a model (or function) that

describes and distinguishes data classes or
concepts, for the purpose of being able to use the
model to predict the class of objects whose class
label is unknown
 The derived model is based on the analysis of a set

of training data (i.e., data objects whose class label
is known)
37
Cluster Analysis
 Question:
 Identify homogeneous
subpopulations of customers
 These clusters may represent
individual target groups for
marketing
 Clustering: analyzes data objects

without consulting a known class
label.
 The objects are clustered or grouped

based on the principle of maximizing
the intraclass similarity and
minimizing the interclass similarity
38
Outlier and Evolution Analysis
 Question:
 Uncover fraudulent usage of credit
cards by detecting purchases of
extremely large amounts for a given
account number in comparison to
regular charges incurred by the
same account
 Outlier Analysis: detect data objects that

do not comply with the general behavior
or model of the data
 Question:
 Identify stock evolution regularities
for overall stocks and for the stocks
of particular companies
 Evolution Analysis: describes and

models regularities or trends for objects
whose behavior changes over time
39
Visualization Analysis
 Input: 3D cubes, distribution
charts, curves, surfaces, link
graphs, image frames and
movies, parallel coordinates
 Output: pie charts, scatter

plots, box plots,
association rules, parallel
coordinates, dendograms,
temporal evolution
40
A brief of data mining methods…
41
A brief of data mining methods (2)
42
43
44
Contents
2 Data Mining Data
5 Major Issues
45
Data Mining Systems
 Data mining is an interdisciplinary field, the
confluence of a set of disciplines, including
database systems, statistics, machine
learning, visualization, and information
science
 Depending on the data mining approach

used, techniques from other disciplines
may be applied, such as neural networks,
fuzzy and/or rough set theory
 Depending on the kinds of data to be

mined or on the given data mining
application, the data mining system may
also integrate techniques from spatial data
analysis, information retrieval, etc.
 It is necessary to provide a clear

classification of data mining systems
46
Classification by Database
 Data mining systems can be classified
according to data models
 Relational, transactional, object-
relational, or data warehouse
mining system

according to the special types of data
handled
 Spatial, time-series, text, stream
data, multimedia data mining
system, or a World Wide Web
mining system
47
Classification by kinds of knowledge mined
 Data mining systems can be classified by
data mining functionalities
 Characterization, discrimination,
association and correlation analysis,
classification, prediction, clustering,
outlier analysis, and evolution

data regularities (occurring patterns) versus
data irregularities (such as exceptions or
outliers)

levels of abstraction
 Generalized knowledge (at a high level
of abstraction), primitive-level knowledge
(at a raw data level), or knowledge at
multiple levels (considering several
levels of abstraction)
48
Classification by techniques
according to underlying data mining
techniques employed
 Degree of user interaction involved
(autonomous systems, interactive
exploratory systems, query-driven
systems)
 Methods of data analysis employed

(database-oriented or data
warehouse–oriented techniques,
machine learning, statistics,
visualization, pattern recognition,
neural networks, and so on)
49
Classification by applications
 Finance
 Telecommunications
 DNA analysis
 Stock
 Markets
 E-mail
 Etc.
50
Data Mining Query
 A data mining query is defined in
terms of data mining task primitives
 The set of task-relevant data to be
mined
 The kind of knowledge to be
mined
 The background knowledge to be
used in the discovery process
 The interestingness measures
and thresholds for pattern
evaluation
 The expected representation for
visualizing the discovered patterns
51
An example of Data Mining Query Language
52
Contents
2 Data Mining Data
5 Major Issues
53
Mining methodology and user interaction issues
 Mining different kinds of knowledge in databases:

Because different users can be interested in
different kinds of knowledge, data mining should
cover a wide spectrum of data analysis and
knowledge discovery tasks (classification,
prediction)
 Interactive mining of knowledge at multiple levels of

abstraction: user can interact with the data mining
system to view data and discovered patterns at
multiple granularities and from different angles
 Incorporation of background knowledge: Domain

knowledge such as integrity constraints and
deduction rules, can help focus and speed up a data
mining process, or judge the interestingness of
discovered patterns
54
Mining methodology and user interaction issues (2)
 Data mining query languages and ad hoc data mining:

high-level data mining query languages need to be
developed to allow users to describe ad hoc data mining
tasks by facilitating the specification of the relevant sets of
data for analysis
 Presentation and visualization of data mining results:

Discovered knowledge should be expressed in high-level
languages, visual representations
 Handling noisy or incomplete data: Data cleaning methods

and data analysis methods that can handle noise are
required, as well as outlier mining methods for the
discovery and analysis of exceptional cases
 Pattern evaluation—the interestingness problem: The use

of interestingness measures or user-specified constraints
to guide the discovery process and reduce the search
space is another active area of research
55
Performance issues
 Efficiency and scalability of data mining algorithms: the running time of a

data mining algorithm must be predictable and acceptable in large
databases
 Parallel, distributed, and incremental mining algorithms: The huge size of

many databases, the wide distribution of data, and the computational
complexity of some data mining methods are factors motivating the
development of parallel and distributed data mining algorithms.
56
The diversity of database types issues
 Handling of relational and complex types of

data: databases may contain complex data
objects
 It is unrealistic to expect one system to
mine all kinds of data, given the diversity
of data types and different goals of data
mining
 Specific data mining systems should be
constructed for mining specific kinds of
data
 Mining information from heterogeneous

databases and global information systems:
 Web mining, which uncovers interesting
knowledge about Web contents, Web
structures, Web usage, and Web
dynamics
57
Contents
2 Data Mining Data
5 Major Issues
58
Summary
 Data mining is the task of discovering interesting patterns from large amounts of data,
where the data can be stored in databases, data warehouses, or other information
repositories
 Data patterns can be mined from many different kinds of databases, such as relational
databases, data warehouses, and transactional, and object-relational databases
 Data mining functionalities include the discovery of concept/class descriptions,

associations and correlations, classification, prediction, clustering, trend analysis,
outlier and deviation analysis, and similarity analysis. Characterization and
discrimination are forms of data summarization
 Data mining systems can be classified according to the kinds of databases mined, the
kinds of knowledge mined, the techniques used, or the applications adapted
 Efficient and effective data mining in large databases poses numerous requirements
and great challenges to researchers and developers
59
Questions
60
Exercises
61
Exercises (2)
62
Exercises (3)
63
Main reference
64
Click to edit company slogan .

Session 1-Introduction To Data Mining

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Session 1-Introduction To Data Mining

Uploaded by

Copyright:

Available Formats

Session 1: Introduction to Data Mining

Lecturer: Dr. Le Hoang Son

2 Data Mining Data

3 Data Mining Functionalities

4 Data Mining Systems

6 Discussion & Exercises

 The abundance of data raises the need for

 It is the fact that important decisions are

 Decision makers indeed do not have the

HOW DO YOU KNOW THE PERSON WHO

How’s about the evade of customer with TID =

 The database system industry:

 The development in database systems:

 As a result, data collected in large data

 The widening gap between data and

 Data mining should have

 Knowledge mining from

 Database or data warehouse server: responsible for

 Knowledge base: This is the domain knowledge that is

 Data mining engine: characterization, association,

 Pattern evaluation module: This component typically

 Data mining involves an integration of techniques:

2 Data Mining Data

3 Data Mining Functionalities

4 Data Mining Systems

6 Discussion & Exercises

 Relational data can be accessed by

 Analysis question: “How many sales

 Data mining systems can analyze customer

 Provide an analysis of the company’s sales per

 A data warehouse is a repository of information

 Data warehouses are constructed via a process

 Data are organized around major subjects to

 A transaction typically includes a unique

 “Which items sold well together?” This kind of

 A regular data retrieval system is not able to

 Data mining systems for transactional data can

 Inherits the essential concepts of object-

 Each object has associated with it the following:

 A set of messages that the object can use

 A set of methods, where each method

 Data mining techniques can be used to find the

 Analysis question: “When is the best time to

 Such analyses typically require defining

 Data mining may uncover patterns describing

 A spatial database that stores spatial objects

 For example, we may be able to group the

 They are used in applications such as picture

 Multimedia databases must support large

 An example of analysis: “what are the behavior

 The components communicate in order to

 Objects in one component database may differ

 For example, the problem in exchanging

 How to find best students among them?

 Unique features: huge or possibly infinite

 Examples: time-series data, power supply,

 Analysis: “Detect intrusions of a computer

 Continuous query model

 Web usage mining or Weblog mining

 It is important to understand the semantic

2 Data Mining Data

3 Data Mining Functionalities

4 Data Mining Systems