
1 Introduction

IT 326: Data Mining


Third Term 2022-2023

Chapter 1, “Data Mining: Concepts and Techniques” (3rd ed.)


Chapter Outline
 Why Data Mining?
 What Is Data Mining?
 What Kinds of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 Which Technologies Are Used?
 Which Kinds of Applications Are Targeted?
 Major Issues in Data Mining
The Digital Era
 People’s daily lives
 4.6 billion Internet users
 500 million tweets/day

 Data explosion:
 KB, MB, GB, TB, PB, EB, ZB...

 IDC Digital Universe Report


 0.8 ZB (2009) => 40 ZB (2020)

 “We are living in the information age” is a popular saying; however, we are actually living in the data age.
Why Data Mining?

 The explosive growth of data: from terabytes to zettabytes


 Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data:
• Business: e-commerce, transactions, stocks, product descriptions…
• Science: Remote sensing, bioinformatics, scientific experiments, …
• Society and everyone: news, social networks, digital cameras, YouTube, …

 We are drowning in data, but starving for knowledge!


Why Data Mining?

 We need automated analysis of massive data => Data Mining


What is Data Mining?

 Data Mining:
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.

 Alternative names:
• KDD (Knowledge Discovery from Data)
• Knowledge extraction
• Data/pattern analysis
What is Data Mining?

 Data mining can be viewed as a result of the natural evolution of information technology.

 The evolutionary path:

 1960s: Data collection, database creation, network DBMS
 1970s: Relational DBMS
 1980s: Advanced data models (extended relational, OO, etc.)
 1990s: Data analysis and understanding (data mining & data warehousing)
 2000s: Data mining with a variety of applications
What is Data Mining?

Is everything “data mining” ?

What is not Data Mining?
 Look up a phone number in a phone directory
 Query a Web search engine for information about “Amazon”
 Find all customers who have purchased milk

What is Data Mining?
 Certain names are more prevalent in certain US locations (O’Brien, O’Reilly, … in the Boston area)
 Group together similar documents returned by a search engine according to their context (e.g., Amazon rainforest)
Knowledge Discovery (KDD) Process

 Data mining plays an essential role in the knowledge discovery process:

Data Pre-processing
1. Data collection
2. Data selection
3. Data cleaning
4. Data integration
5. Data transformation
6. Data mining
7. Pattern evaluation (Section 1.4.6)
8. Knowledge presentation
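
The steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the two source lists, the record fields, the `run_kdd` name, and the "seen at least twice" threshold are all hypothetical, and the steps run in a slightly different order than they are numbered because the toy data makes that simpler.

```python
from collections import Counter

# Step 1 (data collection): hypothetical raw records from two sources.
store_a = [{"customer": "c1", "item": "Milk "}, {"customer": "c2", "item": "bread"}]
store_b = [{"customer": "c3", "item": "milk"}, {"customer": "c4", "item": None}]


def run_kdd(sources):
    # Step 4 (integration): merge all sources into one collection.
    records = [r for source in sources for r in source]
    # Step 3 (cleaning): drop records whose item is missing.
    records = [r for r in records if r["item"] is not None]
    # Step 2 (selection): keep only the attribute relevant to the task.
    items = [r["item"] for r in records]
    # Step 5 (transformation): normalize whitespace and case.
    items = [i.strip().lower() for i in items]
    # Step 6 (data mining): count how often each item occurs.
    counts = Counter(items)
    # Step 7 (pattern evaluation): keep items seen at least twice (assumed threshold).
    frequent = {item: n for item, n in counts.items() if n >= 2}
    # Step 8 (knowledge presentation): return a sorted, readable summary.
    return sorted(frequent.items())


print(run_kdd([store_a, store_b]))  # [('milk', 2)]
```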
Multi-Dimensional View of Data Mining
 Data view
 Kinds of data to be mined

 Knowledge view (Data mining functions)


 Kinds of knowledge or patterns to be discovered

 Method view
 Kinds of techniques utilized

 Application view
 Kinds of applications adapted
What Kinds of Data Can Be Mined?

 Database Data (Relational database)

A relational database system is a collection of tables, with the entity-relationship (ER) model used for modeling and SQL for querying.

• Example: A data mining system can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information.
What Kinds of Data Can Be Mined?

 Data Warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema at a single site in order to facilitate management decision making.
What Kinds of Data Can Be Mined?

 Transactional database
A file where each record represents a transaction, such as a customer’s purchase: sales (trans_ID, list_of_item_IDs)

trans_ID    list_of_item_IDs
T100        I1, I13, I8, I16
T200        I2, I8
…           …

• Data mining can answer questions such as “Which items sold well together?”
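
One simple way to approach the "sold together" question is to count how often each pair of items appears in the same transaction. The sketch below uses only the two transactions listed on this slide (the elided rows are not filled in); the `pair_counts` helper is a hypothetical name introduced for illustration.

```python
from collections import Counter
from itertools import combinations

# The two transactions shown on the slide; elided rows are left out.
transactions = {
    "T100": ["I1", "I13", "I8", "I16"],
    "T200": ["I2", "I8"],
}


def pair_counts(transactions):
    """Count, for every pair of items, the number of transactions containing both."""
    counts = Counter()
    for items in transactions.values():
        # Sort so each pair has one canonical order, e.g. ("I2", "I8").
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(transactions)
print(counts[("I2", "I8")])  # 1
print(len(counts))           # 7 distinct pairs in total
```

With realistic data, the pairs with the highest counts are the candidates for "items that sell well together".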
What Kinds of Data Can Be Mined?

 Other Kinds of Data (Advanced datasets)


 Data streams and sensor data
 Spatial data
 Time-series data, temporal data, sequence data
 Graphs, social networks data
 Object-relational databases
 Multimedia database
 Text databases
 The World-Wide Web
What Kinds of Patterns Can Be Mined?

Data Mining Functionalities:

1. Class/concept description
2. Mining frequent patterns, associations, and correlations
3. Classification and regression for predictive analysis
4. Cluster analysis
5. Outlier analysis
What Kinds of Patterns Can Be Mined?

 Two main types of data mining tasks:


 Descriptive mining tasks characterize properties of the data in a target data set.
 Predictive mining tasks perform induction on the current data in order to predict
values of new data.

Data Mining Functions
 Classification [Predictive]
 Clustering [Descriptive]
 Frequent Pattern and Association [Descriptive]
 Outlier Analysis [Predictive]
Class/Concept Description

Data characterization:
 Summarization of the general characteristics or features of a target class of data.
 Example: customers who spend more than $2000 a year
 age 40-50, employed, good credit ratings

Data discrimination:
 Comparison of the general features of the target class against one or a set of contrasting
classes.
 Example: frequent vs. infrequent customers
 age, education, employed
 Example: dry vs. wet regions
 temperature, humidity
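
As a rough illustration of the two ideas, characterization summarizes the target class, and discrimination puts that summary next to the same summary for a contrasting class. The customer records, field names, and the `characterize` helper below are hypothetical; only the $2000 threshold comes from the slide.

```python
from statistics import mean

# Hypothetical customer records; the values are illustrative only.
customers = [
    {"age": 45, "spend": 2500},
    {"age": 48, "spend": 3100},
    {"age": 23, "spend": 400},
    {"age": 31, "spend": 900},
]


def characterize(records):
    """Summarize a class of records by its size and mean age."""
    return {"count": len(records), "mean_age": mean(r["age"] for r in records)}


# Target class: customers who spend more than $2000 a year.
target = [c for c in customers if c["spend"] > 2000]
# Contrasting class: everyone else.
contrast = [c for c in customers if c["spend"] <= 2000]

print(characterize(target))    # characterization of the target class
print(characterize(contrast))  # side by side with the target: discrimination
```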
Frequent Patterns and Associations

Frequent Patterns: patterns that occur frequently in data.

 Frequent itemsets
Example: (milk, bread), (computer, software)
 Frequent subsequences
Example: <printer, toner>, <dinner, movie>
 Frequent substructures
Example: [figure not preserved]
Frequent Patterns and Associations

Association Analysis:
 Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.

 buys(X, “computer”) => buys(X, “software”) [support = 1%, confidence = 50%]

For a rule A => B:
 Support: the probability that A and B appear together in a transaction
 Confidence: the probability that B appears in a transaction, given that A appears
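
Both measures can be computed directly from a list of transactions. The sketch below is a minimal illustration; the four baskets are made-up data (so the numbers differ from the 1% and 50% on the slide), and the function names are hypothetical.

```python
def support(transactions, a, b):
    """Fraction of all transactions that contain both a and b."""
    both = sum(1 for t in transactions if a in t and b in t)
    return both / len(transactions)


def confidence(transactions, a, b):
    """Among transactions that contain a, the fraction that also contain b."""
    with_a = [t for t in transactions if a in t]
    if not with_a:
        return 0.0  # the rule never fires, so report zero confidence
    return sum(1 for t in with_a if b in t) / len(with_a)


# Made-up baskets for illustration only.
baskets = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software"},
]

print(support(baskets, "computer", "software"))     # 0.25
print(confidence(baskets, "computer", "software"))  # 0.5
```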

Tahani Almanie | IS 463


Frequent Patterns and Associations

Applications:

 Marketing and Sales Promotion.


 Supermarket shelf management.
 Inventory Management.



Classification and Prediction

Classification:
 Construct a model (function) based on some training examples to describe and distinguish data classes or concepts for future prediction.
Classification and Prediction

Typical methods:
Decision trees, naïve Bayesian classification, support vector machines, neural networks,
classification rules (i.e., IF-THEN rules), logistic regression, …

 Classification predicts categorical (discrete) labels


 Regression is used to predict numerical (continuous) values.

Applications:
 Credit card fraud detection, direct marketing, disease classification, …
 Predicting wind velocity, temperature, sales amount of a product, stock market,…
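
The difference between the two tasks shows up in the type of the output. In the sketch below, a nearest-neighbor classifier returns a categorical label, while a least-squares line returns a number. Both methods are chosen here only for brevity and are not taken from the slides' list of typical methods; the data and function names are hypothetical.

```python
def nearest_neighbor_label(train, x):
    """Classification: return the label of the closest training point."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label


def fit_line(points):
    """Regression: least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data: income (in $1000s) paired with a credit risk label.
labeled = [(20, "high"), (35, "high"), (60, "low"), (90, "low")]
print(nearest_neighbor_label(labeled, 55))  # low (a discrete label)

# Hypothetical data: month paired with sales amount.
sales = [(1, 10.0), (2, 12.0), (3, 14.0)]
slope, intercept = fit_line(sales)
print(slope * 4 + intercept)  # 16.0 (a continuous value)
```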
Classification and Prediction

Figure 1.9 A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision
tree, or (c) a neural network.
Cluster Analysis

Cluster analysis
 Unsupervised learning (Class label is unknown)
 Group data to form new categories (i.e., clusters)
 Clustering Principle:
 Maximizing intra-class similarity (objects are similar to one another within the same cluster)
 Minimizing inter-class similarity (objects are dissimilar to the objects in other clusters)
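
One common way to apply this principle is k-means, which alternates between assigning each object to its nearest cluster center and moving each center to the mean of its cluster. The slides do not name a specific algorithm, so treat the following as an assumed example; the one-dimensional data, the starting centers, and the `kmeans_1d` name are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means on plain numbers; `centers` holds the initial guesses."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


ages = [21, 23, 25, 61, 63, 65]
centers, clusters = kmeans_1d(ages, centers=[20, 70])
print(centers)   # [23.0, 63.0]
print(clusters)  # [[21, 23, 25], [61, 63, 65]]
```

No class labels are supplied anywhere: the two groups emerge only from the distances between the values.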
Cluster Analysis

Applications:
 Cluster houses to find distribution patterns.
 Document clustering.

Figure 1.10 A 2-D plot of customer data with respect to customer locations in
a city, showing three data clusters.
Outlier Analysis

Outlier: A data object that does not comply with the general behavior of the data (noise or exception)
 Useful in fraud detection, rare events analysis
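
A simple, assumed way to flag such objects in numeric data is a z-score test: mark any value that lies more than a chosen number of standard deviations from the mean. The threshold of 2.0, the transaction amounts, and the `zscore_outliers` name are illustrative choices, not something the slides prescribe.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 20, 23, 500]
print(zscore_outliers(amounts))  # [500]
```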
Which Technologies Are Used?
Which Kinds of Applications Are Targeted?

 Retail, telecommunication and advertising
 Banking and stock market
 Market analysis and management
• Target marketing, customer relationship management (CRM), market basket analysis
 Fraud detection and detection of unusual patterns (outliers)
 Text and Web mining
 Health Care, Sports and Entertainment
 Bioinformatics and bio-data analysis
Major Issues in Data Mining

Mining Methodology:
 Mining various and new kinds of knowledge.
 Mining knowledge in multi-dimensional space.
 Data mining: An interdisciplinary effort.
 Handling noise, uncertainty, and incompleteness of data.
 Pattern evaluation.

User Interaction:
 Incorporation of background knowledge.
 Presentation and visualization of data mining results.
Major Issues in Data Mining

Efficiency and Scalability:


 Efficiency and scalability of data mining algorithms.
 Parallel, distributed, and incremental mining methods.

Diversity of data types:


 Handling complex types of data.
 Mining dynamic, networked, and global data repositories.

Data mining and society:


 Social impacts of data mining.
 Privacy-preserving data mining.
Summary

 Why Data Mining?


 What Is Data Mining?
 What Kinds of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 Which Technologies Are Used?
 Which Kinds of Applications Are Targeted?
 Major Issues in Data Mining
