1. Introduction
Data: a set of discrete facts about events (conceptualization).
No meaning is attached to it, so it may have multiple meanings.
Example: what does “Alex” mean?
Information: aggregation of data (contextualization) that makes
decision making easier.
Meaning is attached and the data is put in context.
Answers the questions: what, who, when, where
Knowledge: know-how and understanding gained through experience.
It includes facts about real-world entities and the relationships
between them.
Answers the ‘how’ question
Data mining: searching for knowledge (interesting patterns) in data.
What is Data Mining?
Data mining (knowledge discovery from data) is a technology that
uses various techniques to discover hidden knowledge in
heterogeneous and distributed historical data stored in large
databases, data warehouses, and other massive information
repositories. The goal is to find patterns in the data that are:
valid: they hold not only for the current data but also for new
data, with some degree of certainty
novel: non-obvious to the system, so they count as new facts
useful: it should be possible to act on them
understandable: humans should be able to interpret them
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process.
[Figure: Databases -> Data Cleaning and Data Integration -> Data
Warehouse -> Data Selection -> Task-relevant Data -> Data Mining ->
Pattern Evaluation]
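The stages named in the KDD figure (cleaning, integration, selection, mining, pattern evaluation) can be sketched as a chain of small functions. This is only an illustrative toy, not a method from the slides: the records, the function names, and the choice of co-occurrence counting as the "mining" step are all assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records from two sources; names and values are illustrative.
source_a = [
    {"customer": "c1", "items": ["bread", "milk"]},
    {"customer": "c2", "items": ["bread", None]},   # contains a missing value
]
source_b = [
    {"customer": "c3", "items": ["milk", "bread", "eggs"]},
    {"customer": "c4", "items": []},                # empty basket
]

def clean(records):
    """Data cleaning: drop missing items and records left with no items."""
    cleaned = []
    for r in records:
        items = [i for i in r["items"] if i is not None]
        if items:
            cleaned.append({"customer": r["customer"], "items": items})
    return cleaned

def integrate(*sources):
    """Data integration: combine several cleaned sources into one store."""
    warehouse = []
    for s in sources:
        warehouse.extend(s)
    return warehouse

def select(warehouse):
    """Data selection: keep only the task-relevant attribute (the baskets)."""
    return [set(r["items"]) for r in warehouse]

def mine(baskets):
    """Data mining (assumed task): count how often item pairs occur together."""
    counts = Counter()
    for b in baskets:
        counts.update(combinations(sorted(b), 2))
    return counts

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only pairs seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

warehouse = integrate(clean(source_a), clean(source_b))
result = evaluate(mine(select(warehouse)), min_count=2)
print(result)
```

Each function maps to one box in the figure, so a stage can be swapped (for example, a different mining task) without touching the others.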
Why DM Now: The Need for Business Intelligence
Database Processing vs. Data Mining Processing
[Table: columns Database, Data mining, Comments; rows not recoverable]
Non-volatile: once data enter the data warehouse, they are never
removed, because the data in the warehouse represent the
company’s entire history.
Operational updates of data do not occur in the data warehouse
environment.
It does not require transaction processing, recovery, or
concurrency control mechanisms.
It requires only two data-access operations:
initial loading of data and access of data.
Because data is added all the time, the warehouse keeps growing.
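A minimal sketch of the non-volatile property, assuming a toy in-memory store: the class exposes only the two operations named above (loading and access) and deliberately has no update or delete method. The class name `Warehouse` and its methods are hypothetical, chosen for illustration.

```python
class Warehouse:
    """Toy non-volatile store: rows can be loaded and read, never changed or removed."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Loading: append a batch of rows (initial load or a later batch)."""
        # Copy each row so callers cannot alter stored history afterwards.
        self._rows.extend(dict(r) for r in rows)

    def query(self, predicate):
        """Access: return copies of the rows that satisfy the predicate."""
        return [dict(r) for r in self._rows if predicate(r)]

    def __len__(self):
        return len(self._rows)

wh = Warehouse()
wh.load([{"year": 2022, "sales": 100}, {"year": 2023, "sales": 120}])
wh.load([{"year": 2024, "sales": 150}])   # the warehouse only grows

recent = wh.query(lambda r: r["year"] >= 2023)
print(len(wh), recent)
```

Because queries return copies, changing a result does not alter the stored history, which is why no concurrency control or recovery logic is needed in this sketch.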
Data Warehouse Stores Heterogeneous Data
[Figure: relational databases, hierarchical databases, network
databases, flat files, and spreadsheets feed a data extraction
process and a data cleanup process, which load the data warehouse.
End users access the warehouse through query and analysis tools.]
Data Mart
A subset of a data warehouse for small and medium-size
businesses or departments within larger companies.
Its scope is confined to specific, selected groups, such as a
marketing data mart.
Motivation
Only a small portion of data cells
may be “above the water”.
Only “interesting” data cells:
those above a certain threshold.
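The threshold idea above can be illustrated with a small aggregation that keeps only the cells whose count reaches a minimum. This is a hedged sketch: the slide does not name a specific algorithm, so the grouping by (region, product), the sample facts, and the function name `iceberg_cells` are assumptions.

```python
from collections import Counter

# Hypothetical sales facts, one (region, product) pair per sale.
sales = [
    ("north", "tv"), ("north", "tv"), ("north", "tv"),
    ("north", "radio"),
    ("south", "tv"), ("south", "tv"),
    ("south", "phone"),
]

def iceberg_cells(facts, min_count):
    """Aggregate facts into cells and keep only cells at or above min_count."""
    counts = Counter(facts)
    return {cell: n for cell, n in counts.items() if n >= min_count}

cells = iceberg_cells(sales, min_count=2)
print(cells)
```

Of the four distinct cells in this toy data, only two pass the threshold; the rest stay "below the water" and are never reported.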
Data Warehouse Back-End Tools and Utilities
Data extraction
get data from multiple, heterogeneous, and external sources
Data cleaning
detect errors in the data and rectify them when possible
Data transformation
convert data from legacy or host format to warehouse format
Load
sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
Refresh
propagate the updates from the data sources to the warehouse
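The five back-end steps listed above can be sketched end to end on a tiny example. Everything here is illustrative: the CSV text stands in for a legacy source, the warehouse is a plain dictionary, and all function names are hypothetical rather than part of any real ETL tool.

```python
import csv
import io

# Hypothetical legacy source: a CSV export with untidy formatting.
LEGACY_CSV = """id,name,amount
1, Alice ,10.50
2,Bob,not-a-number
3,carol,7
"""

def extract(text):
    """Data extraction: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Data cleaning: detect rows with a non-numeric amount and drop them."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue
        good.append(row)
    return good

def transform(rows):
    """Data transformation: convert source strings to the warehouse format."""
    return [
        {"id": int(r["id"]), "name": r["name"].strip().title(),
         "amount": float(r["amount"])}
        for r in rows
    ]

def load(warehouse, rows):
    """Load: sort the rows, store them, build an index, and summarize."""
    for r in sorted(rows, key=lambda r: r["id"]):
        warehouse["rows"].append(r)
        warehouse["index"][r["id"]] = r
    warehouse["total"] = sum(r["amount"] for r in warehouse["rows"])

def refresh(warehouse, new_text):
    """Refresh: propagate newly arrived source rows to the warehouse."""
    load(warehouse, transform(clean(extract(new_text))))

warehouse = {"rows": [], "index": {}, "total": 0.0}
load(warehouse, transform(clean(extract(LEGACY_CSV))))
print([r["name"] for r in warehouse["rows"]], warehouse["total"])
```

In this sketch cleaning simply drops bad rows; a real pipeline would more often try to rectify them, as the list above notes.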
Database & data warehouse: Differences
The data warehouse and operational environments are
separated. Data warehouse receives its data from
operational databases.
Data warehouse environment is characterized by read-only
transactions to very large data sets.
Operational environment is characterized by numerous update
transactions to a few data entities at a time.
Data warehouse contains historical data over a long time horizon.
Ultimately, information is created from data warehouses. Such
information becomes the basis for rational decision making.
The data found in the data warehouse is analyzed to discover previously
unknown data characteristics, relationships, dependencies, or trends.
Data Mining in Business Intelligence
Business Intelligence:
Business intelligence is information about a company's past
performance that is used to help predict the company's future
performance.
It can reveal emerging trends from which the company might profit.
BI takes advantage of data mining and data warehousing to help
organizations gather their information in a more timely and more
valuable manner.
BI:
keeps the organization informed about market trends,
alerts it to new market potential,
helps determine how competitors are doing
Without such information and knowledge, the organization may
suffer false growth or setbacks.
[Figure: Data Mining in Business Intelligence. A layered diagram in
which the potential to support business decisions increases toward
the top. Recovered layers: Decision Making (end user); Data
Exploration (statistical summary, querying, and reporting).]
Are All the “Discovered” Patterns Interesting?
Data mining may generate thousands of patterns; not all of them are
interesting
Suggested approach: Human-centered, query-based, focused mining
Interestingness measures
A pattern is interesting if it is easily understood by humans, valid on new or test
data with some degree of certainty, potentially useful, novel, or validates some
hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures
Objective: based on statistics and structures of patterns, e.g., support,
confidence, etc.
Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty,
actionability, etc.
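The two objective measures named above, support and confidence, can be computed directly from a list of transactions. The sketch below uses the standard definitions (support as the fraction of transactions containing an itemset, confidence as the support of the whole rule divided by the support of its antecedent); the transactions themselves are made up for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent. Assumes the antecedent occurs at least once."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

s = support({"diaper", "beer"}, transactions)
c = confidence({"diaper"}, {"beer"}, transactions)
print(f"support={s:.2f} confidence={c:.2f}")
```

A rule is then usually kept only if both values exceed user-chosen minimums, which is one way to prune the thousands of generated patterns.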
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle volumes such as terabytes of data
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
High complexity of data
Data streams and sensor data
Time-series data, temporal data, sequence data
Structured data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Data Mining Functionalities
Multidimensional concept description: Characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs.
wet regions
Frequent patterns, association, correlation vs. causality
Diaper -> Beer [support 0.5%, confidence 75%] (Correlation or causality?)
Classification and prediction
Construct models (functions) that describe and distinguish classes or
concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
Predict some unknown or missing numerical values
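As a minimal illustration of classification, the sketch below labels a car from its gas mileage using a 1-nearest-neighbor rule. The slides do not prescribe a classifier, so the choice of nearest neighbor, the training values, and the labels "low" and "high" are all assumptions made for the example.

```python
# Hypothetical labeled training data: (miles per gallon, class label).
training = [(15.0, "low"), (18.0, "low"), (30.0, "high"), (35.0, "high")]

def classify(mpg, training):
    """Return the label of the training car whose mileage is closest
    to mpg (1-nearest neighbor; ties go to the earlier training row)."""
    nearest = min(training, key=lambda pair: abs(pair[0] - mpg))
    return nearest[1]

print(classify(17.0, training), classify(32.0, training))
```

The model here is just the stored examples plus a distance rule; the same structure extends to several attributes by replacing the absolute difference with a multi-dimensional distance.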
Data Mining Functionalities (2)
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
Maximizing intra-class similarity & minimizing inter-class similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior of the
data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera -> large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
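Outlier analysis, one of the functionalities listed above, can be sketched with a simple z-score rule: a value is flagged when it lies more than a chosen number of standard deviations from the mean. This is one common heuristic, not the method of the slides; the threshold of 2.0 and the sample amounts are assumptions.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:          # all values identical: nothing can deviate
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise or a meaningful exception (for example, a fraudulent transaction) still has to be judged in context; the rule only says the value does not follow the general behavior of the data.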