Faculty
of Computing
Department of Information System
Data Mining and Data Warehousing
By: Belete B.
Chapter one
Introduction
This chapter briefly covers the following topics
– Motivation: Why data mining?
– What is data mining?
– Data Mining: On what kind of data?
– Data mining functionalities
Motivation:
“Necessity is the Mother of Invention”
Our capacity to generate and collect data has increased
rapidly over the last several decades
A huge amount of data is available at our fingertips
It is predicted that more data will be produced in the next
year than has been generated during the entire existence of
humankind!
According to Witten and Frank, the amount of data stored in
the world's databases is estimated to double every twenty
months
Contributing factors include
– Widespread use of bar codes for most commercial products,
– Computerization of many business, scientific, and governmental
transactions,
– Advances in data collection tools (audio, video, satellite remote
sensing, scanning, image capturing tools)
– Use of the WWW as a global information system (the Internet in
general),
– Development of comprehensive application software,
– New computing and storage technologies
All of this has made it easier to create, collect, and store
all types of data
As a result, it has created a problem known as data explosion
Data explosion is the problem of having huge amounts of
data in an enterprise, generated by automated data collection
tools and stored in databases, data warehouses, and other
information repositories
As the volume of data increases, the proportion of it that
people can understand decreases substantially; in other
words, as the data gets larger, analyzing it becomes very
difficult
This shows that people's understanding of the data at hand
cannot keep pace with the rate at which data is generated in
its various forms, which results in a widening information gap
Consequently, scholars began to recognize this bottleneck and
to look into possible remedies
Current technological progress permits the storage of and
access to large amounts of data at virtually no cost
The true value is not in storing the data, but rather in our ability to
extract useful reports and to find interesting trends & correlations
to support decisions and policies made by businesses
We are drowning in data, but starving for knowledge!
Data mining is also known by other names, such as
– knowledge extraction,
– data/pattern analysis,
– data archeology,
– data dredging,
– information harvesting,
What is Data Mining?
DM involves the use of sophisticated data analysis tools
to discover previously unknown, valid patterns and
relationships in large datasets
– These tools can include statistical models, mathematical
algorithms, and machine learning methods
According to Han and Kamber (2006), DM has attracted a great
deal of attention in the information industry in recent years
mainly because of
– the wide availability of huge amounts of data and
– the imminent need for turning such data into useful
information and knowledge
The information and knowledge gained can be used for
applications ranging from market analysis, fraud detection,
and customer retention to production control and science
exploration
Data Mining: On What Kind of Data?
In principle, data mining is not specific to one type of media
or data
– Data mining should be applicable to any kind of information
repository
– Data mining is being put into use and studied for databases,
Relational databases
– A relational database is a collection of tables, each of which is
assigned a unique name
– Relational databases are among the most commonly available and
richest information repositories, and thus they are a major data form
in our study of data mining
– DM algorithms that use relational databases can be more versatile than
DM algorithms written specifically for flat files, since they can take
advantage of the structure inherent in relational databases
– While DM can benefit from SQL for data selection, transformation, and
consolidation, it goes beyond what SQL can provide, for example by
predicting, comparing, and detecting deviations
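The split between SQL and data mining described above can be sketched roughly as follows. This is only an illustration: the `sales` table, its values, and the threshold of 5 median absolute deviations are assumptions made for the example, and the simple deviation rule stands in for a real DM algorithm.

```python
import sqlite3
import statistics

# Hypothetical sales data; table name, columns, and values are illustrative.
rows = [("ann", 20.0), ("ann", 22.0), ("bob", 19.0),
        ("bob", 21.0), ("cat", 18.0), ("cat", 250.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# SQL handles selection and consolidation: total spend per customer.
totals = dict(conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer"))

# A deviation check that goes beyond a plain query: flag amounts that lie
# far from the median, measured in median absolute deviations (MAD).
# The factor 5 is an arbitrary choice for this sketch.
amounts = [amount for _, amount in rows]
median = statistics.median(amounts)
mad = statistics.median([abs(a - median) for a in amounts])
deviations = [(customer, amount) for customer, amount in rows
              if abs(amount - median) > 5 * mad]

print(totals)
print(deviations)
```

The query only consolidates what is stored, while the second step makes a judgment about which records look unusual.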
Data warehouses
– A data warehouse is a repository of information collected from
multiple sources, stored under a unified schema, and usually
residing at a single site
– Data warehouses are constructed via a process of data cleaning,
data integration, data transformation, data loading, and periodic
data refreshing
– To facilitate decision making, the data in a data warehouse are
organized around major subjects, such as customer, item, supplier,
and activity
– The data are stored to provide information from a historical
perspective and are typically summarized
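The construction steps listed above (cleaning, integration, transformation, loading) and the subject-oriented summaries might look roughly like this in miniature. The two source systems, their field names, and the unified schema are all hypothetical; a real warehouse would use dedicated ETL tooling and a database rather than Python lists.

```python
from collections import defaultdict

# Two hypothetical source systems with different schemas.
store_a = [{"cust": " Ann ", "item": "milk", "price": "2.50"},
           {"cust": "Bob", "item": "bread", "price": "1.20"},
           {"cust": "Bob", "item": None, "price": "3.00"}]  # missing item
store_b = [{"customer_name": "ann", "product": "Bread", "amount_cents": 120}]

def transform_a(rec):
    """Clean a store_a record and map it to the unified schema."""
    if not rec["item"]:
        return None  # cleaning: drop records with a missing item
    return {"customer": rec["cust"].strip().lower(),
            "item": rec["item"].lower(),
            "amount_cents": round(float(rec["price"]) * 100)}

def transform_b(rec):
    """Map a store_b record to the unified schema."""
    return {"customer": rec["customer_name"].strip().lower(),
            "item": rec["product"].lower(),
            "amount_cents": rec["amount_cents"]}

def load(sources):
    """Integrate all sources into one list of unified rows."""
    warehouse = []
    for records, transform in sources:
        for rec in records:
            row = transform(rec)
            if row is not None:
                warehouse.append(row)
    return warehouse

def summarize_by(warehouse, subject):
    """Summarize total sales around one subject, e.g. customer or item."""
    totals = defaultdict(int)
    for row in warehouse:
        totals[row[subject]] += row["amount_cents"]
    return dict(totals)

warehouse = load([(store_a, transform_a), (store_b, transform_b)])
by_customer = summarize_by(warehouse, "customer")
by_item = summarize_by(warehouse, "item")
print(by_customer, by_item)
```

Periodic refreshing would amount to re-running `load` on new source records; it is left out to keep the sketch short.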
Transactional databases
– A transactional database consists of a file in which each record
represents a transaction
– One typical data mining analysis on such data is the so-called market
basket analysis
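As a minimal sketch of market basket analysis, the code below counts how often pairs of items appear together in a set of transactions. The baskets, the minimum support of 0.6, and the function names are assumptions made for illustration; practical systems use algorithms such as Apriori that scale to larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each record is the set of items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def count_items_and_pairs(transactions):
    """Count in how many transactions each item and each item pair occurs."""
    item_counts, pair_counts = Counter(), Counter()
    for basket in transactions:
        items = sorted(basket)
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    return item_counts, pair_counts

item_counts, pair_counts = count_items_and_pairs(transactions)
n = len(transactions)

# Support: fraction of transactions that contain both items of a pair.
min_support = 0.6
frequent_pairs = {pair: count / n for pair, count in pair_counts.items()
                  if count / n >= min_support}

def confidence(antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    return pair_counts[pair] / item_counts[antecedent]

print(frequent_pairs)
print(confidence("beer", "diapers"))
```

Support measures how common a combination is, while confidence measures how often the second item appears given the first.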
Advanced DB and information repositories
– Spatial databases: store geographical information like maps,
and global or regional positioning
• Such spatial databases present new challenges to data mining
algorithms
– Multimedia databases: include video, images, audio, and text
media
• Multimedia data is characterized by its high dimensionality, which
makes data mining even more challenging
– WWW: the most heterogeneous and dynamic repository
available
• Conceptually, the World Wide Web comprises three major
components: the content of the Web, the structure of the Web, and the
usage of the Web
• Data mining in the WWW, or web mining, is often divided into web
content mining, web structure mining, and web usage mining
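The three kinds of web mining can be illustrated with a toy example, one simple count per kind. The pages, links, and access log are invented for this sketch, and plain counting stands in for the much richer techniques used in practice.

```python
from collections import Counter

# Hypothetical page texts (content of the Web).
page_text = {"home": "welcome to our shop",
             "products": "shop products and prices"}

# Hypothetical link graph: page -> pages it links to (structure of the Web).
links = {"home": ["products", "about"],
         "products": ["home", "cart"],
         "about": ["home"],
         "cart": ["home", "products"]}

# Hypothetical access log of (visitor, page) pairs (usage of the Web).
access_log = [("u1", "home"), ("u1", "products"), ("u2", "home"),
              ("u2", "about"), ("u3", "products"), ("u3", "cart")]

# Web content mining: which words occur most often across pages?
word_counts = Counter(word for text in page_text.values()
                      for word in text.split())

# Web structure mining: how many links point to each page?
in_links = Counter(target for targets in links.values() for target in targets)

# Web usage mining: how often is each page visited?
page_visits = Counter(page for _, page in access_log)

print(word_counts.most_common(1))
print(in_links.most_common(1))
print(page_visits)
```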
Data Mining Functionalities
Data mining functionalities are used to specify the kind of
patterns to be found in data mining task
Generally data mining task can be broadly classified as
– Descriptive (unsupervised)
– Predictive (supervised)
Descriptive data mining tasks characterize the general
properties of the data in a database
Predictive data mining tasks perform inference on the
current data in order to make predictions
– Prediction permits the value of one variable to be predicted from the
known values of other variables
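The difference between the two kinds of task can be sketched on a small, made-up dataset. Here a summary of the observed values plays the descriptive role and a least-squares line plays the predictive role; the variable names and numbers are assumptions for illustration only.

```python
import statistics

# Hypothetical observations: advertising spend (x) and resulting sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

# Descriptive task: characterize general properties of the data at hand.
summary = {"mean": statistics.mean(y), "min": min(y), "max": max(y)}

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Predictive task: infer a model from current data, then predict the value
# of one variable (y) from a known value of another (x).
slope, intercept = fit_line(x, y)
predicted = slope * 6.0 + intercept

print(summary)
print(slope, intercept, predicted)
```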
The supervised predictive data mining functionalities include
Classification
Regression
Time series
Prediction
The unsupervised descriptive data mining functionalities include
Association rule discovery
Clustering analysis
Summarization
Sequence discovery
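To make the supervised/unsupervised distinction concrete, the sketch below contrasts one functionality from each list: classification, which uses known labels, and clustering, which does not. The points, labels, nearest-neighbour rule, and the choice of initial centroids are all assumptions for illustration; they are not the specific methods this course prescribes.

```python
# Hypothetical labeled points; the labels are used only by the classifier.
labeled = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"),
           ((8.0, 8.0), "high"), ((9.0, 9.5), "high")]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def classify_1nn(point, labeled):
    """Supervised: return the label of the nearest labeled example."""
    _, label = min(labeled, key=lambda pair: sq_dist(point, pair[0]))
    return label

def kmeans(points, centroids, iterations=10):
    """Unsupervised: group points around centroids, ignoring any labels."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

points = [p for p, _ in labeled]
clusters = kmeans(points, centroids=[points[0], points[-1]])

print(classify_1nn((2.0, 1.0), labeled))
print(clusters)
```

The classifier needs the "low"/"high" labels to answer at all, whereas the clustering step recovers the same two groups from the coordinates alone.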
Data Mining Functionalities:
Classification