You are on page 1of 36

Data Mining

Some Definitions
• Data : Data are any facts, numbers, or text that can be
processed by a computer.
– operational or transactional data such as, sales, cost, inventory,
payroll, and accounting
– nonoperational data, such as industry sales, forecast data, and
macro economic data
– meta data - data about the data itself, such as logical database
design or data dictionary definitions

• Information: The patterns, associations, or relationships


among all this data can provide information.
Definitions Continued..
• Knowledge: Information can be converted into knowledge
about historical patterns and future trends.
• For example, summary information on retail supermarket
sales can be analyzed in terms of promotional efforts to
provide knowledge of consumer buying behavior. Thus, a
manufacturer or retailer could determine which items are
most susceptible to promotional efforts.

• Data Warehouses: Data warehousing is defined as a process


of centralized data management and retrieval.
Data Warehouse example
Data Warehouse: A Multi-Tiered Architecture

Monitor
Metadata & OLAP Server
Other
sources Integrator

Analysis
Operational Extract Query
DBs Transform Data Serve Reports
Load
Refresh
Warehouse Data mining

Data Marts

Data Sources Data Storage OLAP Engine Front-End Tools


5/14/22 Data Mining: Concepts and Techniques 5
Multidimensional Data
• Sales volume as a function of product, month,
and region Dimensions: Product, Location, Time
Hierarchical summarization paths
on
gi

Industry Region Year


Re

Category Country Quarter


Product

Product City Month Week

Office Day

Month
5/14/22 Data Mining: Concepts and Techniques 6
A Sample Data Cube

Total annual sales


Date of TV in U.S.A.
1Qtr 2Qtr 3Qtr 4Qtr sum
t
uc

TV
od

PC U.S.A
Pr

VCR

Country
sum
Canada

Mexico

sum

5/14/22 Data Mining: Concepts and Techniques 7


Typical OLAP Operations
• Roll up (drill-up): summarize data
– by climbing up hierarchy or by dimension reduction
• Drill down (roll down): reverse of roll-up
– from higher level summary to lower level summary or detailed
data
• Slice : selection on one dimension of the given cube, resulting in a
sub cube
• Dice: define a sub cube by performing a selection on two or more
dimensions
• Pivot (rotate): rotates the data axes in order to provide an
alternative presentation of the data.

5/14/22 Data Mining: Concepts and Techniques 8


• Other operations

– drill across: involving (across) more than one fact table

– drill through: through the bottom level of the cube to its


back-end relational tables (using SQL)

5/14/22 Data Mining: Concepts and Techniques 9


5/14/22 Data Mining: Concepts and Techniques 10
5/14/22 Data Mining: Concepts and Techniques 11
5/14/22 Data Mining: Concepts and Techniques 12
5/14/22 Data Mining: Concepts and Techniques 13
5/14/22 Data Mining: Concepts and Techniques 14
5/14/22 Data Mining: Concepts and Techniques 15
5/14/22 Data Mining: Concepts and Techniques 16
5/14/22 Data Mining: Concepts and Techniques 17
OLTP vs. OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date historical,
detailed, flat relational summarized, multidimensional
isolated integrated, consolidated
usage repetitive ad-hoc
access read/write lots of scans
index/hash on prim. key
unit of work short, simple transaction complex query
# records accessed tens millions
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response

5/14/22 Data Mining: Concepts and Techniques 18


Data Rich, Information Poor
Data Mining process
What Is Data Mining?

• Data mining (knowledge discovery from data)


– Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amount of data

• Other names
– Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting, business
intelligence, etc.
The KDD process
The non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data
Multiple process

non-trivial process
Justified patterns/models
valid
novel Previously unknown

useful Can be used

understandable by human and machine


15
Knowledge discovery from data
KDD process includes

1. Data cleaning (to remove noise and inconsistent data)

2. data integration (where multiple data sources may be


combined)

3. data selection (where data relevant to the analysis task are


retrieved from the database)

4. data transformation (where data are transformed or


consolidated into forms appropriate for mining by performing
summary or aggregation operations)
KDD continued….
5. data mining (an essential process where intelligent methods
are applied in order to extract data patterns.

6. pattern evaluation (to identify the truly interesting patterns


representing knowledge based on some interestingness
measures)

7. knowledge presentation (where visualization and knowledge


representation techniques are used to present the mined
knowledge to the user)

Data mining is a core of knowledge discovery process


Knowledge Discovery (KDD) Process

Pattern Evaluation

Data Mining

Task-relevant Data

Data Warehouse Selection

Data Cleaning

Data Integration

Databases
Architecture: Typical Data Mining System

Graphical User Interface

Pattern Evaluation

Knowled
Data Mining Engine ge-Base

Database or Data Warehouse


Server

data cleaning, integration, and selection

Data World-Wide Other Info


Database Repositories
Warehouse Web
5/14/22 Data Mining: Concepts and Techniques 26
What kind of pattern can be mined?
(Data mining functionalities)
• Concept/Class Description: Characterization and
Discrimination

• Mining Frequent Patterns, Associations and correlations

• Classification and Prediction

• Cluster Analysis

• Outlier Analysis
Which Technologies are used?
• Statistics
– Studies the collection , analysis, interpretation or
explanation and presentation of data
– Statistical model are widely used to model data
• Machine Learning
– Investigate how computers can learn based on data
– Computer is program to automatically learn to
recognize complex pattern and make intelligent
decision base on data
• Supervised learning
• Unsupervised learning
• Semi-supervised learning
Which Technologies are used?

• Database systems and Data warehouse


– Focuses on the creation, maintenances and use of
databases for organizations and end users
– Data warehouse consolidate data in
multidimensional space
• Information retrieval (IR)
– Science of searching a document or information
which can be text or multimedia in a document
which may reside on a web.
Which kind of applications are targeted?

• Business Intelligence
– Provide historical, current, and predictive views of
business operations
• Web search engines
– It is a specialized computer server that searches
for information on the web
Major issues in data Mining
It is partition into five groups

1. Mining methodology
2. User interaction
3. Efficiency and scalability
4. Diversity of data types
5. Data mining and society
Major issues in data Mining

• Mining methodology issues

– Mining different kinds of knowledge in


database
– Mining knowledge in multidimensional space
– Handling noisy or incomplete data
– Pattern evaluation
• User interaction issues

– Interactive mining of knowledge at multiple levels


of abstraction
– Incorporation of background knowledge
– Data mining query languages and ad hoc data
mining
– Presentation and visualization of data mining
result
• Efficiency and scalability issues
– Efficiency and scalability of data mining algorithm
– Parallel, distributed and incremental mining
algorithms

• Diversity of database type issues


– Handling of relational and complex types of data
– Mining information from heterogeneous databases
and global information system
Major issues in data Mining

• Data mining and Society

– Social impact of data mining


– Privacy preserving data mining
– Invisible data mining
Thank you!

You might also like