
Bahir Dar Institute of Technology

Faculty of Computing
Department of Information System
Data Mining and Data Warehousing

By: Belete B.

Chapter one
Introduction
 In this chapter we will cover the following issues in
brief
– Motivation: Why data mining?
– What is data mining?
– Data Mining: On what kind of data?
– Data mining functionalities

Motivation:
“Necessity is the Mother of Invention”
 Our capacity for generating and collecting data has
increased rapidly in the last several decades
 Huge amounts of data are available at our fingertips
 It is predicted that more data will be produced in the next
year than has been generated during the entire existence of
humankind!
 According to Witten and Frank, it is estimated that the
amount of data stored in the world's databases doubles
(grows by 100%) every twenty months

Motivation:
“Necessity is the Mother of Invention”
 Contributing factors include
– Widespread use of bar codes for most commercial products,
– Computerization of many business, scientific, and governmental
transactions,
– Advances in data collection tools (audio, video, satellite remote
sensing, scanning, image capturing tools)
– Use of the WWW as a global information system (the Internet in
general),
– Development of comprehensive application software,
– New computing and storage technologies

Motivation:
“Necessity is the Mother of Invention”
 All of this has made it easier to create, collect, and store all
types of data
 As a result, it creates a problem called data explosion
 Data explosion is the problem of having a huge amount of
data in an enterprise, stored in databases, data warehouses,
and other information repositories, generated by automated
data collection tools
 As the volume of data increases, the proportion of it that
people can understand decreases substantially; in other
words, as the data get larger, analyzing them becomes
very difficult

Motivation:
“Necessity is the Mother of Invention”
 This shows that people's level of understanding of the
data at hand cannot keep pace with the rate at which data
are generated in various forms, which results in an
increasing information gap
 Consequently, scholars began to recognize this bottleneck and
to look into possible remedies
 Current technological progress permits the storage of and
access to large amounts of data at virtually no cost

Motivation:
“Necessity is the Mother of Invention”
 The true value is not in storing the data, but rather in our ability to
extract useful reports and to find interesting trends & correlations
to support decisions and policies made by businesses
 We are drowning in data, but starving for knowledge!

 To bridge this gap by analyzing large volumes of data and
extracting useful information and knowledge for decision making,
a new generation of computerized methods known as Data
Mining (DM) has emerged in recent years

What is Data Mining?
 Different scholars have provided different definitions of DM
 According to Berry and Linoff (2000); Han and Kamber (2006), DM
is the process of extracting or mining knowledge from large amounts
of data in order to discover meaningful patterns and rules
 Data mining is the extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) information or patterns
from data in large databases (e.g. a data warehouse)
 The term data mining is a misnomer, as it does not directly relate to
what it does
 Data mining would be better described as knowledge mining from data
rather than data mining
 Anyway, we will use the term with this understanding

What is Data Mining?
 Alternative names
– knowledge discovery (mining) from databases (KDD),
– knowledge extraction,
– data/pattern analysis,
– data archeology,
– data dredging,
– information harvesting,
– business intelligence, etc.

What is Data Mining?
 DM involves the use of sophisticated data analysis tools
to discover previously unknown, valid patterns and
relationships in large datasets
– These tools can include statistical models, mathematical
algorithms, and machine learning methods
 According to Han and Kamber (2006), the major reasons
that DM has attracted a great deal of attention in the
information industry in recent years are
– the wide availability of huge amounts of data and
– the imminent need for turning such data into useful
information and knowledge
 The information and knowledge gained can be used for
applications ranging from market analysis, fraud detection,
and customer retention, to production control and science
exploration

Data Mining: On What Kind of Data?
 In principle, data mining is not specific to one type of media
or data
– Data mining should be applicable to any kind of information
repository
– Data mining is being put into use and studied for several kinds of
databases and information repositories, described next
 Relational databases
– A relational database is a collection of tables, each of which is
assigned a unique name
– Relational databases are one of the most commonly available and rich
information repositories, and thus they are a major data form in our
study of data mining
– DM algorithms using relational databases can be more versatile than
DM algorithms specifically written for flat files, since they can take
advantage of the structure inherent to relational databases
– While DM can benefit from SQL for data selection, transformation and
consolidation, it goes beyond what SQL can provide, with tasks such as
predicting, comparing, detecting deviations, etc.
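
The division of labor described above (SQL for selection and consolidation, a further analysis step for tasks such as detecting deviations) might be sketched as follows. This is only a minimal illustration using Python's built-in sqlite3 module; the sales table, its columns and values, and the two-standard-deviation threshold are all assumptions made for the example.

```python
import sqlite3
import statistics

# Hypothetical sales table; names and values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
rows = [("a", 10.0), ("a", 12.0), ("b", 11.0), ("b", 9.0),
        ("c", 10.0), ("c", 95.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# Step 1: SQL handles selection and consolidation (totals per customer).
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales "
    "GROUP BY customer ORDER BY customer"
).fetchall()
conn.close()

# Step 2: an analysis step outside SQL. Flag amounts that lie more than
# two (population) standard deviations away from the mean.
amounts = [amount for _, amount in rows]
mean = statistics.mean(amounts)
stdev = statistics.pstdev(amounts)
deviations = [(c, a) for c, a in rows if abs(a - mean) > 2 * stdev]

print(totals)
print(deviations)
```

Here the query only summarizes what is stored, while the second step makes a judgment about which records look unusual; real deviation detection methods are considerably more elaborate.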
Data Mining: On What Kind of Data?
 Data warehouses
– A data warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually residing
at a single site
– Data warehouses are constructed via a process of data cleaning,
data integration, data transformation, data loading, and periodic
data refreshing
– To facilitate decision making, the data in a data warehouse are
organized around major subjects, such as customer, item, supplier,
and activity
– The data are stored to provide information from a historical
perspective and are typically summarized
 Transactional databases
– A transactional database consists of a file where each record
represents a transaction
– One typical data mining analysis on such data is the so-called market
basket analysis
Data Mining: On What Kind of Data?
 Advanced databases and information repositories
– Spatial databases: store geographical information like maps,
and global or regional positioning
• Such spatial databases present new challenges to data mining
algorithms
– Multimedia databases: include video, images, audio, and text
media
• Multimedia data are characterized by high dimensionality, which
makes data mining even more challenging
– WWW: is the most heterogeneous and dynamic repository
available
• Conceptually, the World Wide Web comprises three major
components: the content of the Web, the structure of the Web, and
the usage of the Web
• Data mining in the WWW, or web mining, is often divided into web
content mining, web structure mining and web usage mining
Data Mining Functionalities
 Data mining functionalities are used to specify the kind of
patterns to be found in data mining tasks
 Generally, data mining tasks can be broadly classified as
– Descriptive (unsupervised)
– Predictive (supervised)
 Descriptive data mining tasks characterize the general
properties of the data in a database
 Predictive data mining tasks perform inference on the
current data in order to make predictions
– They permit the value of one variable to be predicted from the
known values of other variables
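
As a small illustration of predicting the value of one variable from the known value of another, the sketch below fits a straight line by ordinary least squares. Least squares is chosen here only as a familiar example of a predictive task; the function name fit_line and the spend/sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical known data: advertising spend (x) and sales (y).
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(spend, sales)

# Predict sales for a spend value that was not in the known data.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

The model is learned from values that are already known and is then applied to a new input, which is what distinguishes a predictive task from a purely descriptive one.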
Data Mining Functionalities
 The supervised (predictive) data mining functionalities include
 Classification
 Regression
 Time series analysis
 Prediction
 The unsupervised (descriptive) data mining functionalities include
 Association rule discovery
 Clustering analysis
 Summarization
 Sequence discovery

Data Mining Functionalities:
Classification
 Classification is the process of finding a set of models that
describe and distinguish data classes for the purpose of being able
to use the model to predict the class of an object whose class is
unknown
– The derived model is based on a training data set and can be
represented in various forms, such as IF-THEN classification rules,
decision trees, mathematical formulae or neural networks
– Classification approaches normally use a training set where all objects are
already associated with known class labels
– The classification algorithm learns from the training set and builds a model
– The model is used to classify new objects
 There are different algorithms that are used for classification,
such as decision trees, neural networks, genetic algorithms,
naïve Bayes, etc.
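
The train-then-classify cycle described above can be sketched with a very simple rule learner. The example uses a one-rule ("OneR" style) classifier, chosen only because it yields readable IF-THEN rules; it is not one of the algorithms named on the slide, and the function names, attributes, and labels are all hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(records, labels):
    """Learn IF-THEN rules on the one attribute with the fewest training errors."""
    best = None
    for attr in records[0]:
        # Count the class labels seen for each value of this attribute.
        counts = defaultdict(Counter)
        for record, label in zip(records, labels):
            counts[record[attr]][label] += 1
        # One rule per value: IF attr == value THEN majority class.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(label != rules[record[attr]]
                     for record, label in zip(records, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    # Fallback class for attribute values never seen during training.
    default = Counter(labels).most_common(1)[0][0]
    return best[1], best[2], default

def classify(model, record):
    """Apply the learned rules to an object whose class is unknown."""
    attr, rules, default = model
    return rules.get(record[attr], default)

# Hypothetical training set: every object has a known class label.
training = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

model = train_one_rule(training, labels)
prediction = classify(model, {"outlook": "rainy", "windy": "no"})
print(model[0], model[1], prediction)
```

On this toy data the learner picks the outlook attribute, because its rules make no errors on the training set, and then uses those rules to label the new object.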
Data Mining Functionalities
Cluster analysis
 Clustering is a DM technique that finds similarities between
data according to the characteristics found in the data and
groups similar data objects into one cluster
 In cluster analysis, class labels are unknown and a set of
data is given to be grouped
 The objective of clustering is to distribute cases (people,
objects, events etc.) into groups, so that the degree of
association is strong between members of the same
cluster (intra-class similarity) and weak between members of
different clusters (inter-class similarity)

Data Mining Functionalities
Cluster analysis
 Clustering tools assign groups of records to the same cluster if
they have something in common, making it easier to discover
meaningful patterns from the dataset
 Clustering often serves as a starting point for some supervised
DM techniques or modeling
 Generally, similar to classification, clustering is the organization
of data into classes
 However, unlike classification, in clustering, class labels are
unknown and it is up to the clustering algorithm to discover
acceptable classes

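
To make the idea of discovering groups without class labels concrete, here is a minimal sketch of k-means, one widely used clustering algorithm. The slides do not name a specific algorithm, so this choice is an assumption, as are the function name k_means, the seeding by the first k points, and the sample points.

```python
def k_means(points, k, iterations=10):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return clusters, centroids

# Hypothetical unlabeled data: two visibly separate groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0),
          (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]

clusters, centroids = k_means(points, k=2)
print(clusters)
```

No labels are supplied; the algorithm only uses distances between points, so nearby points end up in the same cluster and distant points in different ones.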
Data Mining Functionalities:
Association Rule Mining
 Association rule mining aims to extract interesting
correlations, frequent patterns, associations or causal
structures among sets of items in transaction databases
or other data repositories
 It studies the frequency of items occurring together in
transactional databases, and based on a threshold called
support, identifies the frequent item sets
 Another threshold, confidence, which is the conditional
probability that an item appears in a transaction when
another item appears, is used to pinpoint association rules
 Association analysis is commonly used for market basket
analysis

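
The two thresholds described above can be computed directly from a list of transactions. The sketch below is a minimal illustration rather than a full mining algorithm; the basket contents and the threshold values min_support and min_confidence are made up for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

# Hypothetical thresholds chosen only for this example.
min_support = 0.5
min_confidence = 0.6

s = support({"bread", "milk"}, transactions)       # 2 of 4 baskets
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of 3 bread baskets
rule_is_kept = s >= min_support and c >= min_confidence
print(s, c, rule_is_kept)
```

In this toy data the rule "bread implies milk" has support 0.5 and confidence of about 0.67, so it passes both thresholds.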
Quiz 1
1. What is Data mining and what is it used for?
2. What is the association rule mining technique? Give
two examples of association rule mining algorithms
3. What are the main reasons for DM to attract a
great deal of attention in the information
industry in recent years according to Han and
Kamber?
4. What do we call DM that is applied to the WWW, and what are the
three aspects of DM that can be applied to the WWW?
