1 IT326 - Ch1 - Introduction
Why Data Mining?
Data explosion:
KB, MB, GB, TB, PB, EB, ZB...
“We are living in the information age” is a popular saying; however, we are actually living in the data age.
What is Data Mining?
Data Mining:
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
Alternative names:
• KDD (Knowledge Discovery from Data)
• Knowledge extraction
• Data/pattern analysis
1. Data collection
2. Data selection
3. Data cleaning
4. Data integration
5. Data transformation
6. Data mining
7. Pattern evaluation (Section 1.4.6)
8. Knowledge presentation
Data selection, cleaning, integration, and transformation (steps 2-5) are forms of data preprocessing.
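The steps listed above can be walked through on a toy data set. The following is only a minimal sketch: the two data sources, the attribute names, the age cut-off, and the `min_gap` threshold are all assumptions made for illustration, and each function stands in for what would be a much larger stage in practice.

```python
# Toy records from two hypothetical sources (step 1: data collection).
store_a = [
    {"customer": "c1", "age": 34, "spend": 120.0},
    {"customer": "c2", "age": None, "spend": 80.0},  # missing age
    {"customer": "c3", "age": 51, "spend": 300.0},
]
store_b = [
    {"customer": "c4", "age": 45, "spend": 260.0},
    {"customer": "c5", "age": 29, "spend": 40.0},
]

def select(rows, fields):
    """Step 2, data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in rows]

def clean(rows):
    """Step 3, data cleaning: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def integrate(*sources):
    """Step 4, data integration: combine records from several sources."""
    return [row for source in sources for row in source]

def transform(rows):
    """Step 5, data transformation: discretize age into two coarse groups."""
    return [
        {"age_group": "40+" if r["age"] >= 40 else "under 40", "spend": r["spend"]}
        for r in rows
    ]

def mine(rows):
    """Step 6, data mining: average spend per age group (a very simple pattern)."""
    groups = {}
    for r in rows:
        groups.setdefault(r["age_group"], []).append(r["spend"])
    return {g: sum(v) / len(v) for g, v in groups.items()}

def evaluate(patterns, min_gap=50.0):
    """Step 7, pattern evaluation: keep the result only if the groups differ enough."""
    gap = max(patterns.values()) - min(patterns.values())
    return patterns if gap >= min_gap else {}

fields = ["age", "spend"]
prepared = transform(integrate(*(clean(select(s, fields)) for s in (store_a, store_b))))
patterns = evaluate(mine(prepared))
print(patterns)  # step 8, knowledge presentation
```

Keeping each step as its own function mirrors the list: any stage can be swapped out without touching the others.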
Multi-Dimensional View of Data Mining
Data view
Kinds of data to be mined
Method view
Kinds of techniques utilized
Application view
Kinds of applications adapted
Chapter Outline
Why Data Mining?
What Is Data Mining?
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
Which Technologies Are Used?
Which Kinds of Applications Are Targeted?
Major Issues in Data Mining
What Kinds of Data Can Be Mined?
A relational database system is a collection of tables, with ER (entity-relationship) models used for modeling and SQL used for querying.
• Example: A data mining system can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information.
Data Warehouses
Transactional database
A file where each record represents a transaction, such as a customer’s purchase: sales (trans_ID, list_of_item_IDs)
trans_ID    list_of_item_IDs
T100        I1, I13, I8, I16
T200        I2, I8
…           …
• Data mining can answer questions such as “Which items sold well together?”
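A question like “Which items sold well together?” can be approached by counting how often each pair of items shares a transaction. The sketch below assumes the (trans_ID, list_of_item_IDs) layout shown above; T100 and T200 come from the table, while T300 and T400 are invented to make the counts non-trivial, and `pair_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# T100 and T200 are from the table; T300 and T400 are invented for illustration.
transactions = {
    "T100": ["I1", "I13", "I8", "I16"],
    "T200": ["I2", "I8"],
    "T300": ["I1", "I8"],
    "T400": ["I2", "I8", "I16"],
}

def pair_counts(transactions):
    """Count, for every pair of items, the number of transactions containing both."""
    counts = Counter()
    for items in transactions.values():
        # Sort so that each pair has one canonical (alphabetical) form.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Counting every pair is only feasible for small item sets; real frequent-pattern algorithms prune the search space instead of enumerating all combinations.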
What Kinds of Patterns Can Be Mined?
1. Class/concept description
2. Mining frequent patterns, associations, and correlations
3. Classification and regression for predictive analysis
4. Cluster analysis
5. Outlier analysis
Classification [Predictive]
Clustering [Descriptive]
Frequent Pattern and Association [Descriptive]
Outlier Analysis [Predictive]
Class/Concept Description
Data characterization:
Summarization of the general characteristics or features of a target class of data.
Example: customers who spend more than $2000 a year, summarized as age 40-50, employed, with good credit ratings
Data discrimination:
Comparison of the general features of the target class against one or a set of contrasting
classes.
Examples:
• Frequent vs. infrequent customers, compared on age, education, and employment
• Dry vs. wet regions, compared on temperature and humidity
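Both ideas can be illustrated with simple summary statistics. In this sketch the customer records, the attribute names, and the `summarize` helper are hypothetical; characterization summarizes the target class (customers spending more than $2000 a year), and discrimination puts the same summary next to that of a contrasting class.

```python
from statistics import mean

# Hypothetical customer records, invented for illustration.
customers = [
    {"age": 45, "employed": True, "annual_spend": 2500},
    {"age": 48, "employed": True, "annual_spend": 3100},
    {"age": 42, "employed": True, "annual_spend": 2200},
    {"age": 23, "employed": False, "annual_spend": 400},
    {"age": 31, "employed": True, "annual_spend": 900},
]

def summarize(rows):
    """Summarize general features of one class of records."""
    return {
        "count": len(rows),
        "mean_age": mean(r["age"] for r in rows),
        "employed_rate": sum(r["employed"] for r in rows) / len(rows),
    }

# Target class: customers who spend more than $2000 a year.
target = [r for r in customers if r["annual_spend"] > 2000]
# Contrasting class: everyone else.
contrast = [r for r in customers if r["annual_spend"] <= 2000]

# Data characterization: summary of the target class alone.
characterization = summarize(target)

# Data discrimination: (target, contrast) values for each feature, side by side.
contrast_summary = summarize(contrast)
discrimination = {
    feature: (characterization[feature], contrast_summary[feature])
    for feature in ("mean_age", "employed_rate")
}

print(characterization)
print(discrimination)
```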
Frequent Patterns and Associations
Frequent itemsets
Example: (milk, bread), (computer, software)
Frequent subsequences
Example: <printer, toner>, <dinner, movie>
Frequent substructures
Example:
Association Analysis:
Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
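One common way to judge whether an association is interesting is to compute its support and confidence. These two measures are not defined in this section, so treat them as an assumption borrowed from standard association rule mining; the baskets below are invented, and `count`, `support`, and `confidence` are hypothetical helper names.

```python
# Hypothetical market baskets, invented for illustration.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)

# Rule under test: milk => bread
rule_support = support({"milk", "bread"}, baskets)
rule_confidence = confidence({"milk"}, {"bread"}, baskets)
print(rule_support, rule_confidence)
```

Here the itemset (milk, bread) appears in 3 of 5 baskets, and 3 of the 4 baskets containing milk also contain bread.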
Applications:
Classification and Prediction
Classification:
Construct a model (function) based on some training examples to describe and distinguish data classes or concepts for future prediction.
Typical methods:
Decision trees, naïve Bayesian classification, support vector machines, neural networks,
classification rules (i.e., IF-THEN rules), logistic regression, …
Applications:
Credit card fraud detection, direct marketing, classifying diseases, …
Predicting wind velocity, temperature, sales amount of a product, stock market, …
Figure 1.9 A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision
tree, or (c) a neural network.
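Form (a), IF-THEN rules, is the easiest of the three to sketch in code. The rules below are hypothetical and hand-written rather than learned from training examples, and the attribute names (`age`, `income`) and class labels are assumptions made for illustration.

```python
# Each rule is (condition, class label); the first matching rule wins.
# All rules are hypothetical, not taken from the figure.
rules = [
    (lambda x: x["age"] == "youth" and x["income"] == "high", "A"),
    (lambda x: x["age"] == "youth" and x["income"] == "low", "B"),
    (lambda x: x["age"] in ("middle_aged", "senior"), "C"),
]

def classify(record, rules, default="unknown"):
    """Return the label of the first rule whose IF-part holds for the record."""
    for condition, label in rules:
        if condition(record):
            return label
    return default  # no rule fired

new_customers = [
    {"age": "youth", "income": "high"},
    {"age": "youth", "income": "low"},
    {"age": "senior", "income": "high"},
    {"age": "youth", "income": "medium"},
]
predictions = [classify(c, rules) for c in new_customers]
print(predictions)
```

A decision tree encodes the same kind of logic: each path from the root to a leaf corresponds to one IF-THEN rule.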
Cluster Analysis
Cluster analysis
Unsupervised learning (Class label is unknown)
Group data to form new categories (i.e., clusters)
Clustering Principle:
Maximizing intraclass similarity (objects within the same cluster are similar to one another)
Minimizing interclass similarity (objects are dissimilar to the objects in other clusters)
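The clustering principle can be seen in a small k-means sketch. k-means is one common clustering method and is not named in this section, so its use here is an assumption; the 2-D points are invented, and the initial centroids are simply the first `k` points so that the run is deterministic.

```python
from math import dist

def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to its nearest centroid."""
    centroids = list(points[:k])  # naive, deterministic initialization
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No class labels are supplied anywhere: the groups emerge only from distances between points, which is what makes this unsupervised.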
Applications:
Cluster houses to find distribution patterns.
Document clustering.
Figure 1.10 A 2-D plot of customer data with respect to customer locations in
a city, showing three data clusters.
Outlier Analysis
Mining Methodology:
Mining various and new kinds of knowledge.
Mining knowledge in multi-dimensional space.
Data mining: An interdisciplinary effort.
Handling noise, uncertainty, and incompleteness of data.
Pattern evaluation.
User Interaction:
Incorporation of background knowledge.
Presentation and visualization of data mining results.
Major Issues in Data Mining