1 IT326 - Ch1 - Introduction

1 Introduction
IT 326: Data Mining

Third Term 2022-2023
Chapter 1, “Data Mining: Concepts and Techniques” (3rd ed.)

Chapter Outline
 Why Data Mining?
 What Is Data Mining?
 What Kinds of Data Can Be Mined?
 What Kinds of Patterns Can Be Mined?
 Which Technologies Are Used?
 Which Kinds of Applications Are Targeted?
 Major Issues in Data Mining
The Digital Era
 People’s daily lives
 4.6 billion Internet users
 500 million tweets/day
 Data explosion:
 KB, MB, GB, TB, PB, EB, ZB...
 IDC Digital Universe Report

 0.8ZB (2009) =>40ZB (2020)
 “We are living in the information age” is a popular saying; however, we are actually living in the data age.
Why Data Mining?
 The explosive growth of data: from terabytes to zettabytes

 Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data:
• Business: e-commerce, transactions, stocks, product descriptions, …
• Science: remote sensing, bioinformatics, scientific experiments, …
• Society and everyone: news, social networks, digital cameras, YouTube, …
 We are drowning in data, but starving for knowledge!

Why Data Mining?
 We need automated analysis of massive data => Data Mining

What is Data Mining?
 Data Mining:
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
 Alternative names:
• KDD (Knowledge Discovery from Data)
• Knowledge extraction
• Data/pattern analysis
 Data mining can be viewed as a result of the natural evolution of information technology.
 The evolutionary path:
 1960s : Data collection, database creation, network DBMS

 1970s : Relational DBMS
 1980s : Advanced data models (extended relational, OO, etc.)
 1990s : Data analysis and understanding (data mining & data warehousing)
 2000s : Data mining with variety of applications
Is everything “data mining”?
What is not Data Mining?
 Look up a phone number in a phone directory
 Query a Web search engine for information about “Amazon”
 Find all customers who have purchased milk
What is Data Mining?
 Certain names are more prevalent in certain US locations (O’Brien, O’Reilly, … in the Boston area)
 Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest)
Knowledge Discovery (KDD) Process
 Data mining plays an essential role in the knowledge discovery process:
1. Data collection
2. Data selection
3. Data cleaning
4. Data integration
5. Data transformation
6. Data mining
7. Pattern evaluation (Section 1.4.6)
8. Knowledge presentation
Data pre-processing: the steps that prepare the data before data mining (step 6).
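A minimal sketch of how the preparation steps might feed the mining step. All records, field names, and function names here are hypothetical, and the "mining" step is a toy stand-in (an average), not a real data mining algorithm.

```python
def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r.get(f) for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that have a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge sources, keeping one record per id."""
    merged = {}
    for source in sources:
        for r in source:
            merged.setdefault(r["id"], r)
    return list(merged.values())

def transform(records):
    """Data transformation: convert text fields to numbers."""
    return [
        {"id": r["id"], "age": int(r["age"]), "spend": float(r["spend"])}
        for r in records
    ]

def mine(records, min_spend):
    """Toy stand-in for data mining: average age of high-spending customers."""
    ages = [r["age"] for r in records if r["spend"] > min_spend]
    return sum(ages) / len(ages) if ages else None

# Hypothetical raw data collected from two sources
store_a = [
    {"id": 1, "age": "34", "spend": "2500", "note": "vip"},
    {"id": 2, "age": None, "spend": "300", "note": ""},
]
store_b = [
    {"id": 3, "age": "50", "spend": "4100", "note": ""},
    {"id": 1, "age": "34", "spend": "2500", "note": "vip"},
]

fields = ["id", "age", "spend"]
prepared = transform(
    integrate(clean(select(store_a, fields)), clean(select(store_b, fields)))
)
average_age = mine(prepared, min_spend=2000)
print(average_age)
```

Each function corresponds to one numbered step, so the order of the calls mirrors the order of the process.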
Multi-Dimensional View of Data Mining
 Data view
 Kinds of data to be mined
 Knowledge view (Data mining functions)

 Kinds of knowledge or patterns to be discovered
 Method view
 Kinds of techniques utilized
 Application view
 Kinds of applications adapted
What Kinds of Data Can Be Mined?
 Database Data (Relational database)
A relational database is a collection of tables, with the ER model used for modeling and SQL used for querying.
• Example: A data mining system can analyze customer data to predict the credit risk of new customers based on their income, age, and previous credit information.
 Data Warehouses
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema at a single site in order to facilitate management decision making.
 Transactional database
A file where each record represents a transaction, such as a customer’s purchase: sales (trans_ID, list_of_item_IDs)

trans_ID    list_of_item_IDs
T100        I1, I13, I8, I16
T200        I2, I8
…           …

• Data mining can answer questions such as “Which items sold well together?”
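One simple way to approach the "sold well together" question is to count how often each pair of items appears in the same transaction. The sketch below assumes this pair-counting approach; T100 and T200 come from the table above, while T300 and T400 are hypothetical rows added only so the counts are not all equal to one.

```python
from collections import Counter
from itertools import combinations

# T100 and T200 are taken from the sales table;
# T300 and T400 are hypothetical rows added for illustration.
transactions = {
    "T100": ["I1", "I13", "I8", "I16"],
    "T200": ["I2", "I8"],
    "T300": ["I1", "I8"],
    "T400": ["I2", "I8", "I16"],
}

def pair_counts(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions.values():
        # sorted(set(...)) gives each pair one canonical order
        # and ignores an item listed twice in the same transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Pairs with the highest counts are candidates for items that sell together; real systems use frequent itemset algorithms rather than enumerating every pair.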
 Other Kinds of Data (Advanced datasets)

 Data streams and sensor data
 Spatial data
 Time-series data, temporal data, sequence data
 Graphs, social networks data
 Object-relational databases
 Multimedia database
 Text databases
 The World-Wide Web
What Kinds of Patterns Can Be Mined?
Data Mining Functionalities:
1. Class/concept description
2. Mining frequent patterns, associations, and correlations
3. Classification and regression for predictive analysis
4. Cluster analysis
5. Outlier analysis
What Kinds of Patterns Can Be Mined?
 Two main types of data mining tasks:

 Descriptive mining tasks characterize properties of the data in a target data set.
 Predictive mining tasks perform induction on the current data in order to predict
values of new data.
Data Mining Functions
 Classification [Predictive]
 Clustering [Descriptive]
 Frequent Pattern and Association [Descriptive]
 Outlier Analysis [Predictive]
Class/Concept Description
Data characterization:
 Summarization of the general characteristics or features of a target class of data.
 Example: customers who spend more than $2000 a year
 age 40-50, employed, good credit ratings
Data discrimination:
 Comparison of the general features of the target class against one or a set of contrasting
classes.
 Example: frequent vs. infrequent customers
 age, education, employed
• Dry vs. wet regions
 temperature, humidity
Frequent Patterns and Associations
Frequent Patterns: patterns that occur frequently in data.
 Frequent itemsets
Example: (milk, bread), (computer, software)
 Frequent subsequences
Example: <printer, toner>, <dinner, movie>
 Frequent substructures
Example:
Association Analysis:
 Mining frequent patterns leads to the discovery of interesting
associations and correlations within data.
 buys (X, “computer”) => buys (X, “software”) [support = 1%, confidence = 50%]
 Support: the chance that A and B appear together
 Confidence: given that A appears, the chance that B also appears
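The two measures can be computed directly from a list of transactions. This is a minimal sketch: the basket contents are hypothetical, and the function names `support` and `confidence` are chosen here for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent: support(A and B) / support(A)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0

# Hypothetical baskets, one set of items per transaction
baskets = [
    {"computer", "software"},
    {"computer"},
    {"milk", "bread"},
    {"computer", "software", "printer"},
    {"milk"},
]

s = support(baskets, {"computer", "software"})       # 2 of 5 baskets
c = confidence(baskets, {"computer"}, {"software"})  # 2 of the 3 computer baskets
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

In this toy data the rule computer => software holds in 2 of 5 baskets overall, and in 2 of the 3 baskets that contain a computer.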
Tahani Almanie | IS 463

Applications:
 Marketing and Sales Promotion.

 Supermarket shelf management.
 Inventory Management.

Classification and Prediction
Classification:
 Construct a model (function) based on some training examples to describe and distinguish data classes or concepts for future prediction.
Typical methods:
Decision trees, naïve Bayesian classification, support vector machines, neural networks,
classification rules (i.e., IF-THEN rules), logistic regression, …
 Classification predicts categorical (discrete) labels

 Regression is used to predict numerical (continuous) values.
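The difference between the two can be sketched with two small functions: one returns a category, the other a number. Nearest-neighbor classification and least-squares line fitting are used here only as simple stand-ins (neither is prescribed by the slides), and all data values are hypothetical.

```python
def nearest_neighbor_label(train, query):
    """Classification: return the label of the closest training example.

    train is a list of (feature_vector, label) pairs.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(train, key=lambda example: sq_dist(example[0], query))
    return label

def fit_line(xs, ys):
    """Regression: least-squares slope a and intercept b for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical training data: (income in thousands, age) -> credit risk class
train = [
    ((20, 25), "high"),
    ((25, 30), "high"),
    ((80, 45), "low"),
    ((90, 50), "low"),
]
risk = nearest_neighbor_label(train, (85, 40))  # a categorical label

# Hypothetical numeric data: advertising budget -> sales amount
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
predicted_sales = a * 5 + b                     # a continuous value
print(risk, predicted_sales)
```

The classifier's output is always one of the labels seen in training, while the regression output can be any number on the fitted line.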
Applications:
 Credit card fraud detection, direct marketing, classifying diseases, …
 Predicting wind velocity, temperature, sales amount of a product, stock market, …
Figure 1.9 A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision
tree, or (c) a neural network.
Cluster Analysis
 Unsupervised learning (class label is unknown)
 Group data to form new categories (i.e., clusters)
 Clustering principle:
 Maximizing intra-class similarity (objects within the same cluster are similar to one another)
 Minimizing inter-class similarity (objects are dissimilar to the objects in other clusters)
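The clustering principle can be illustrated with k-means, one common clustering method (chosen here as an example; the slides do not name a specific algorithm). This sketch assumes 2-D points, squared Euclidean distance, and starting centroids supplied by the caller; the sample points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Simple k-means on 2-D points, starting from the given centroids.

    Returns the final centroids and the clusters from the last assignment.
    """
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [
                (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customer locations forming two visibly separate groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No class labels are given to the function; the groups emerge only from the distances between points, which is what makes the task unsupervised.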
Cluster Analysis
Applications:
 Cluster houses to find distribution patterns.
 Document clustering.
Figure 1.10 A 2-D plot of customer data with respect to customer locations in
a city, showing three data clusters.
Outlier Analysis
Outlier: A data object that does not comply with the general behavior of the data (noise or exception)
 Useful in fraud detection, rare events analysis
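One simple way to flag such objects is a z-score test: values far from the mean, measured in standard deviations, are treated as outliers. This is a sketch of that one technique, not the only approach; the threshold of 2.0 and the transaction amounts are assumptions made for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    threshold population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts with one unusual purchase
amounts = [20, 22, 19, 21, 23, 20, 500]
print(zscore_outliers(amounts))
```

Whether a flagged value is noise to discard or an exception worth investigating (for example, possible fraud) depends on the application.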
Which Technologies Are Used?
Which Kinds of Applications Are Targeted?
 Retail, telecommunication and advertising

 Banking and stock market
 Market analysis and management
Target marketing, customer relationship management (CRM),
market basket analysis
 Fraud detection and detection of unusual patterns (outliers)
 Text and Web mining
 Health Care, Sports and Entertainment
 Bioinformatics and bio-data analysis

Major Issues in Data Mining
Mining Methodology:
 Mining various and new kinds of knowledge.
 Mining knowledge in multi-dimensional space.
 Data mining: An interdisciplinary effort.
 Handling noise, uncertainty, and incompleteness of data.
 Pattern evaluation.
User Interaction:
 Incorporation of background knowledge.
 Presentation and visualization of data mining results.
Major Issues in Data Mining
Efficiency and Scalability:

 Efficiency and scalability of data mining algorithms.
 Parallel, distributed, and incremental mining methods.
Diversity of data types:

 Handling complex types of data.
 Mining dynamic, networked, and global data repositories.
Data mining and society:

 Social impacts of data mining.
 Privacy-preserving data mining.
Summary

