
Chapter 1

Introduction to Data Mining

Meaning of data mining
➢ Extracting information from huge sets of data
➢ Procedure of mining knowledge from data
➢ Efficient discovery of previously unknown, valid, potentially useful, understandable
patterns in large datasets
➢ Analysis of observational data sets to find unsuspected relationships and to summarize
the data in novel ways that are both understandable and useful to the data owner
➢ Popularly known as Knowledge Discovery in Databases (KDD)
➢ Extracted knowledge can be used for any of the following applications
o Market Analysis
o Fraud Detection
o Customer Retention
o Production Control
o Science Exploration

Knowledge discovery steps
➢ Data cleaning
o Noisy data (random error or variance in a measured variable) and irrelevant data are
removed from the collection
o Fill in missing values, smooth out noise while identifying outliers, and correct
inconsistencies in the data
➢ Data integration
o Multiple data sources, often heterogeneous, may be combined in a common source
➢ Data selection (reduction)
o Data relevant to the analysis task are retrieved from the database
➢ Data transformation
o Data are transformed and consolidated into forms appropriate for mining, for example by
performing summary or aggregation operations
➢ Data mining
o Intelligent methods are applied to extract data patterns
➢ Pattern evaluation
o Identifying the truly interesting patterns representing knowledge, based on
interestingness measures
➢ Knowledge representation
o Visualization and knowledge representation techniques are used to present mined
knowledge to users
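
The preprocessing steps listed above (cleaning, integration, selection, transformation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two record lists (`store_a`, `store_b`), the field names, the fixed valid range used to drop noisy values and the mean-fill rule are all hypothetical. Integration is done before cleaning here so that the fill value is computed over all sources.

```python
from statistics import mean

# Hypothetical raw records from two sources (assumed field names).
store_a = [
    {"id": 1, "region": "north", "amount": 120.0},
    {"id": 2, "region": "north", "amount": None},    # missing value
    {"id": 3, "region": "south", "amount": 9999.0},  # implausible value (noise)
]
store_b = [
    {"id": 4, "region": "south", "amount": 80.0},
    {"id": 5, "region": "north", "amount": 100.0},
]


def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [dict(row) for source in sources for row in source]


def clean(rows, low=0.0, high=1000.0):
    """Data cleaning: drop out-of-range values, then fill missing ones with the mean."""
    kept = [r for r in rows if r["amount"] is None or low <= r["amount"] <= high]
    known = [r["amount"] for r in kept if r["amount"] is not None]
    fill = mean(known)
    return [{**r, "amount": fill if r["amount"] is None else r["amount"]} for r in kept]


def select(rows, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in rows]


def transform(rows):
    """Data transformation: aggregate amounts into a total per region."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals


rows = integrate(store_a, store_b)
rows = clean(rows)
rows = select(rows, ["region", "amount"])
totals = transform(rows)
print(totals)
```

The aggregated `totals` dictionary is the kind of consolidated form that a mining step could then work on.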

What kinds of data can be mined?
➢ Data mining can be applied to any kind of data as long as the data are meaningful for a
target application
➢ The most basic forms of data for mining applications are:
Relational database
o Collection of tables, each of which is assigned a unique name
o Each table consists of a set of attributes and usually stores a large set of tuples
o Among the most commonly available and richest information repositories for searching
trends
Data Warehouses
o A repository of information collected from multiple sources
o Stored under a unified schema and usually residing at a single site
o Constructed via a process of data cleaning, data integration, data transformation, data
loading, and periodic data refreshing
o Data in a data warehouse are organized around major subjects
Transactional data
o Set of records representing transactions
o Each with a time stamp, an identifier and a set of items
o Transaction files could also have associated descriptive data for the items
o Typical data mining analysis on transactional data is market basket analysis or
association rules
Multimedia databases
o Include video, images, audio and text media
Spatial databases
o Store geographical information like maps, and global or regional positioning
Time-series databases
o Contain time related data like stock market data or logged activities
o Have a continuous flow of new data coming in
o Data mining in such databases commonly includes the study of trends and correlations
between evolutions of different variables
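
As a small illustration of studying correlations between the evolutions of two variables, the sketch below computes the Pearson correlation of the period-to-period changes of two series. The two price lists are made-up data, and Pearson correlation is just one possible measure, chosen here as an assumption.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def changes(series):
    """Period-to-period changes (the "evolution") of a series."""
    return [b - a for a, b in zip(series, series[1:])]


# Hypothetical daily closing prices of two stocks.
stock_a = [10.0, 11.0, 12.5, 12.0, 13.0, 14.5]
stock_b = [20.0, 21.5, 24.0, 23.5, 25.0, 28.0]

r = pearson(changes(stock_a), changes(stock_b))
print(f"correlation of daily changes: {r:.3f}")
```

A value close to 1 suggests the two series tend to move together; it says nothing about causation.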
What kinds of patterns can be mined?
➢ Data mining functionalities are used to specify the kinds of patterns to be found in data
mining tasks. Such tasks can be classified into two categories
1. Descriptive task
Deals with the general properties of data in the database
Descriptive functions are:
o Class/Concept Description
o Mining of Frequent Patterns
o Mining of Associations
o Mining of correlations
o Mining of clusters
2. Predictive/classification task
➢ Performs induction on the current data in order to make predictions
➢ Process of finding a model that describes the data classes
➢ Uses the model to predict the class of objects whose class label is unknown
➢ Derived model is based on the analysis of sets of training data
➢ Derived model can be presented in the following forms: classification (IF-THEN) rules,
decision trees, mathematical formulae and neural networks

Data mining functionalities
Characterization
o Summarization of general features of objects in a target class
o Data corresponding to the user-specified class are typically collected by a query
o Output of data characterization can be presented in various forms, such as pie charts, bar
charts, curves, multidimensional tables and data cubes

Discrimination
o Comparison of the general features of objects between two classes referred
to as the target class and the contrasting class

o Target and contrasting classes can be specified by a user

o Corresponding data objects can be retrieved through database queries
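
Characterization and discrimination can both be sketched with simple per-class summaries. In the example below the customer records, the `segment` attribute used as the class, and the choice of the mean as the summary are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical customer records; "segment" plays the role of the class.
customers = [
    {"segment": "frequent", "age": 34, "monthly_spend": 220.0},
    {"segment": "frequent", "age": 41, "monthly_spend": 260.0},
    {"segment": "frequent", "age": 29, "monthly_spend": 180.0},
    {"segment": "occasional", "age": 52, "monthly_spend": 40.0},
    {"segment": "occasional", "age": 47, "monthly_spend": 60.0},
]


def characterize(rows, target, features):
    """Summarize the general features (here: means) of one class."""
    members = [r for r in rows if r["segment"] == target]  # plays the role of the query
    return {f: mean(r[f] for r in members) for f in features}


def discriminate(rows, target, contrasting, features):
    """Compare the feature summaries of the target and contrasting classes."""
    t = characterize(rows, target, features)
    c = characterize(rows, contrasting, features)
    return {f: {"target": t[f], "contrasting": c[f], "difference": t[f] - c[f]}
            for f in features}


features = ["age", "monthly_spend"]
profile = characterize(customers, "frequent", features)
comparison = discriminate(customers, "frequent", "occasional", features)
print(profile)
print(comparison)
```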

Association analysis
o Studies the frequency of items occurring together in transactional databases
o Commonly used for market basket analysis
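
A minimal sketch of market basket analysis, assuming the two usual measures, support and confidence. The baskets are made-up data, and the brute-force pair enumeration is only suitable for tiny examples; it is not meant as an efficient frequent-pattern algorithm.

```python
from itertools import combinations

# Hypothetical market-basket transactions (sets of purchased items).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]


def support(itemset, baskets):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    base = support(antecedent, baskets)
    if base == 0:
        raise ValueError("antecedent never occurs")
    return support(set(antecedent) | set(consequent), baskets) / base


def frequent_pairs(baskets, min_support):
    """All item pairs whose support is at least `min_support`."""
    items = sorted(set().union(*baskets))
    return {pair: support(pair, baskets)
            for pair in combinations(items, 2)
            if support(pair, baskets) >= min_support}


pairs = frequent_pairs(transactions, min_support=0.4)
print(pairs)
print(confidence({"bread"}, {"milk"}, transactions))
```

Here the rule "bread implies milk" holds in three of the four baskets that contain bread.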

Classification
o Organization of data in given classes
o Process of finding a model (function) that describes and distinguishes data classes or
concepts
o Uses a training set where all objects are already associated with known class labels
o Classification algorithm learns from the training set and builds a model
o Classification model can be represented in various forms: IF-THEN rules, decision trees
and neural networks
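
To make the idea of learning a model from labeled training data concrete, the sketch below learns a one-level decision tree (a single IF-THEN rule) on one numeric attribute. The training set, the `income` attribute and the class labels are hypothetical, and this exhaustive threshold search is only one simple way to build such a rule.

```python
# Hypothetical training set: (income in thousands, known class label).
training = [
    (18, "reject"), (22, "reject"), (25, "reject"),
    (31, "approve"), (40, "approve"), (55, "approve"),
]


def errors(threshold, data, below, above):
    """Number of training objects misclassified by the rule at `threshold`."""
    return sum((below if x <= threshold else above) != label for x, label in data)


def learn_stump(data):
    """Learn a one-level decision tree: pick the split with the fewest training errors."""
    labels = sorted({label for _, label in data})
    values = sorted({x for x, _ in data})
    # Candidate thresholds are midpoints between consecutive attribute values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        for below in labels:
            for above in labels:
                e = errors(t, data, below, above)
                if best is None or e < best[0]:
                    best = (e, t, below, above)
    _, threshold, below, above = best
    return threshold, below, above


def predict(model, x):
    """Apply the learned IF-THEN rule to an object with an unknown class label."""
    threshold, below, above = model
    return below if x <= threshold else above


model = learn_stump(training)
t, below, above = model
print(f"IF income <= {t} THEN {below} ELSE {above}")
print(predict(model, 28.5), predict(model, 20))
```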

Prediction
o The major idea is to use a large number of past values to consider
probable future values
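
One simple way to use past values to estimate a probable future value is to fit a straight-line trend and extend it. The sketch below assumes equally spaced observations and a roughly linear trend; the sales figures are made up, and other predictive models could be used instead.

```python
def fit_line(ys):
    """Least-squares line y = a + b*t through values observed at t = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two past values")
    ts = range(n)
    mt = sum(ts) / n
    my = sum(ys) / n
    b = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sum((t - mt) ** 2 for t in ts)
    a = my - b * mt
    return a, b


def forecast(ys, steps_ahead=1):
    """Extend the fitted trend `steps_ahead` periods beyond the last observation."""
    a, b = fit_line(ys)
    return a + b * (len(ys) - 1 + steps_ahead)


# Hypothetical past monthly sales.
past_sales = [100.0, 104.0, 109.0, 113.0, 118.0, 121.0]
print(forecast(past_sales))
```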

Clustering
o Organization of data in classes
o In clustering, class labels are unknown and it is up to the clustering algorithm to
discover acceptable classes
o Clustering approaches are based on
➢ Maximizing the similarity between objects in the same class (intra-class similarity)
➢ Minimizing the similarity between objects of different classes (inter-class similarity)
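
The two criteria above can be illustrated with a plain k-means sketch. k-means is used here only as one example of a clustering algorithm; the points and the choice of initial centroids are assumptions, and Euclidean distance stands in for (dis)similarity.

```python
from math import dist


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update centroids."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return clusters, centroids


def intra_distance(cluster, centroid):
    """Average distance to the centroid (small means high intra-class similarity)."""
    return sum(dist(p, centroid) for p in cluster) / len(cluster)


# Hypothetical unlabeled 2-D objects.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
for cluster, centroid in zip(clusters, centroids):
    print(cluster, round(intra_distance(cluster, centroid), 3))
print("distance between centroids:", round(dist(*centroids), 3))
```

A good grouping shows small distances inside each cluster and a large distance between the cluster centroids.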

Outlier analysis
o Data elements that cannot be grouped in a given class or cluster
o Data objects that do not comply with the general behavior or model of the
data
o Many data mining methods discard outliers as noise or exceptions
o However, in some applications (e.g., fraud detection) the rare events
can be more interesting than the more regularly occurring ones
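
A minimal sketch of flagging outliers instead of silently discarding them, assuming a z-score rule. The transaction amounts and the threshold of 2.0 are hypothetical choices; other outlier tests (for example, distance-based or interquartile-range rules) would be equally valid.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one of them is unusually large.
amounts = [52, 48, 50, 51, 49, 47, 53, 50, 500]
outliers = zscore_outliers(amounts)
regular = [v for v in amounts if v not in outliers]
print("flagged for review:", outliers)
print("regular values:", regular)
```

In a fraud detection setting, the flagged values would be the ones passed on for closer inspection.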

Technologies used in data mining
➢ Data mining has incorporated many techniques from other domains
➢ The following figure shows adopted techniques from different domains

Essence of data mining
Moving toward the Information Age
o Vast amounts of data are collected daily; analyzing such data is an important need
o Explosive growth of available data volume is the result of
➢ Computerization of our society
➢ Fast development of powerful data collection and storage tools
o Global backbone telecommunication networks carry tens of petabytes of data traffic
every day
o Explosively growing, widely available, and gigantic body of data makes our time truly the
data age
o Powerful and versatile tools are critically needed
➢ To uncover valuable information from the tremendous amounts of data
➢ To transform such data into organized knowledge
o This necessity has led to the birth of data mining
Data mining as the evolution of information technology
o Data mining can be viewed as a result of the natural evolution of information technology
o Database and information technology has evolved systematically
o Database management systems technology moved towards
➢ The development of advanced database systems and data warehousing
o Computer hardware technology has led to
➢ Supplies of powerful and affordable computers
➢ Data collection equipment and storage media
o This technology provides a great boost to the database and information industry
o It enables a huge number of databases and information repositories to be available for
➢ Transaction management
➢ Information retrieval
➢ Data analysis
o Internet-based global information bases, such as the WWW and various kinds of
interconnected and heterogeneous databases, have emerged and play a vital role in the
information industry
The abundance of data, coupled with the need for powerful data analysis tools
o Described as a data rich but information poor situation
o Fast-growing, tremendous amount of data, collected and stored in large and numerous data
repositories, has exceeded our human ability for comprehension without powerful tools
o Widening gap between data and information calls for the systematic development of data
mining tools that can turn data tombs into “golden nuggets” of knowledge

Relationship between data mining, data warehousing and OLAP
o Data warehouse is a repository of information collected from multiple sources
o Stored under a unified schema and usually residing at a single site
o Data warehouses are constructed via a process of data cleaning, data integration, data
transformation, data loading, and periodic data refreshing
o The following figure shows a typical framework for the construction and use of a data
warehouse for a particular electronics shop
o Data in a data warehouse are organized around major subjects
o Data are stored to provide information from a historical perspective
o Data warehouse is usually modeled by a multidimensional data structure, called a data cube
o Each dimension corresponds to an attribute in the schema
o Each cell stores the value of some aggregate measure
o Data cube allows the precomputation and fast access of summarized data
o Data warehouse systems can provide inherent support for OLAP
OLAP (Online analytical processing)
o OLAP operations use background knowledge regarding the domain of the data being studied
to allow the presentation of data at different levels of abstraction
o Such operations accommodate different user viewpoints
o Drill-down and roll-up are examples of OLAP operations
o They allow the user to view the data at different degrees of summarization
o The following figure shows examples of drill-down and roll-up operations
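
Independently of the figure, roll-up and drill-down can be sketched as aggregation at different levels of a dimension hierarchy. The fact records, the city-to-country hierarchy and the `LEVELS` mapping below are hypothetical; a real data warehouse would precompute and store such aggregates rather than recompute them on demand.

```python
from collections import defaultdict

# Hypothetical fact records: (city, country, quarter, sales).
facts = [
    ("Toronto", "Canada", "Q1", 100),
    ("Toronto", "Canada", "Q2", 150),
    ("Vancouver", "Canada", "Q1", 80),
    ("Chicago", "USA", "Q1", 200),
    ("New York", "USA", "Q1", 300),
    ("New York", "USA", "Q2", 250),
]

# Assumed positions of each dimension level inside a fact record.
LEVELS = {"city": 0, "country": 1, "quarter": 2}


def aggregate(records, dims):
    """Sum the sales measure for every combination of the given dimension levels."""
    cube = defaultdict(int)
    for rec in records:
        key = tuple(rec[LEVELS[d]] for d in dims)
        cube[key] += rec[3]
    return dict(cube)


by_city = aggregate(facts, ["city", "quarter"])        # detailed view
by_country = aggregate(facts, ["country", "quarter"])  # roll-up: city -> country
# Drill-down: go back from the country level to the cities of one country.
canada_cities = aggregate([f for f in facts if f[1] == "Canada"], ["city", "quarter"])
print(by_country)
print(canada_cities)
```

Each key of the resulting dictionaries corresponds to one cell of a small data cube, and each value is the aggregate measure stored in that cell.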

Issues in data mining
➢ Many pending issues have to be addressed in data mining
➢ Some of these issues are
Mining methodology
o Issues affecting the data mining approaches applied and their limitations
o Examples that can dictate mining methodology choices
➢ Versatility of the mining approaches
➢ Diversity of data available
➢ Dimensionality of the domain
➢ Broad analysis needs
➢ Assessment of the knowledge discovered
➢ Control and handling of noise in data

Performance issues
o The issues of scalability and efficiency of the data mining methods when processing
considerably large data
o Incremental updating and parallel programming
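
Incremental updating means folding new data into an existing result instead of rescanning everything. The sketch below maintains item supports as batches of transactions arrive; the class name `IncrementalItemCounts` and the sample batches are hypothetical, and only single-item counts are tracked to keep the example short.

```python
from collections import Counter


class IncrementalItemCounts:
    """Maintains item supports as new transactions arrive, without rescanning old data."""

    def __init__(self):
        self.n_transactions = 0
        self.counts = Counter()

    def update(self, batch):
        """Fold a new batch of transactions into the running counts."""
        for transaction in batch:
            self.n_transactions += 1
            self.counts.update(set(transaction))  # count each item once per transaction

    def support(self, item):
        """Fraction of all transactions seen so far that contain `item`."""
        if self.n_transactions == 0:
            return 0.0
        return self.counts[item] / self.n_transactions


model = IncrementalItemCounts()
model.update([{"bread", "milk"}, {"bread"}])          # first batch
model.update([{"milk", "eggs"}, {"bread", "milk"}])   # later batch, old data untouched
print(model.support("bread"), model.support("milk"), model.support("eggs"))
```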

Data source issues
o Many issues related to the data sources exist, such as
➢ Diversity of data types
➢ Data glut problem
➢ Storing different types of data in a variety of repositories
o Different kinds of data and sources may require distinct algorithms and methodologies
o Proliferation of heterogeneous data sources poses important challenges to the database
community and data mining community
Security and social issues
o Sensitive and private information about individuals or companies is gathered and stored
o This information is collected for customer profiling, user behavior understanding, and
correlating personal data with other information
o This could disclose new implicit knowledge about individuals or groups that could be
against privacy policies
o Important information could be withheld, while other information could be widely
distributed and used without control
User interface issues
o Knowledge discovered by data mining tools is useful only if it is, above all,
understandable by the user
o Good data visualization eases the interpretation of data mining results
o Major issues related to user interfaces and visualization are
➢ Screen real-estate
➢ Information rendering and interaction

Thank You
