Steps:
1. Data Selection
2. Data Cleaning and Preprocessing
3. Data Transformation and Reduction
4. Data Mining Algorithm Selection
5. Post-processing and Interpretation of the Discovered Knowledge
6. Data Visualization
There are three different ways in which data mining (DM) systems use a relational DBMS.
1. DM may not use the DBMS at all: the DM system uses its own memory and storage management.
The DBMS is treated as a data repository from which data is downloaded
into the DM system's own memory structures before the DM algorithm starts. The advantage is that one
can optimize memory management specifically for the data mining algorithm. The drawback is that these
systems ignore field-proven DBMS technologies such as recovery and
concurrency control.
2. Loosely coupled: the DBMS is used only for storage and retrieval of data. One can use
loosely coupled SQL to fetch data records as required by the data mining algorithm.
The front end of the application is implemented in a host programming language with
embedded SQL statements.
3. Tightly coupled: portions of the application program are selectively pushed
into the database system to perform the necessary computation. Data is stored in the
database and all processing is done at the database end. This avoids performance
degradation and takes full advantage of database technology. The performance of this
approach depends on how well the DM process is optimized when it is mapped to a query.
Two approaches have been suggested: one uses the built-in query optimizer of the DBMS,
and the other uses an external optimizer.
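The difference between the loosely and tightly coupled approaches can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory SQLite database, the `purchases` table and its columns are hypothetical, and counting item frequencies stands in for a real mining step.

```python
import sqlite3

# Hypothetical toy table: one row per (transaction id, item) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, "bread"), (1, "milk"), (2, "bread"), (3, "milk"), (3, "bread")],
)

# Loosely coupled: SQL only retrieves records; the counting is done
# in the host language after the rows have been fetched.
loose_counts = {}
for (item,) in conn.execute("SELECT item FROM purchases"):
    loose_counts[item] = loose_counts.get(item, 0) + 1

# Tightly coupled: the counting step is pushed into the database
# and expressed as a query, so only the result leaves the DBMS.
tight_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM purchases GROUP BY item")
)

print(loose_counts)
print(tight_counts)
```

Both variants produce the same counts; what changes is where the work is done and how much data has to cross the boundary between the DBMS and the application.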
Prediction: It makes use of existing variables in the database in order to predict unknown or
future values of interest.
Description: It focuses on finding patterns that describe the data and on their subsequent
presentation for user interpretation.
EXAMPLE: In a supermarket with a limited budget for a mailing campaign to launch
a new product, it is important to identify the section of the population most likely to
buy the new product. The user formulates a hypothesis to identify potential customers and
their common characteristics. Historical data about transactions and demographic
information can then be queried to reveal comparable purchases and the characteristics
shared by those purchasers. The whole operation can be repeated with successive
refinements of the hypothesis until the required limit is reached. The user may come up
with a new hypothesis or may refine the existing one and verify it against the database.
Discovery Model
An association rule is an expression of the form X => Y, where X and Y are
sets of items. Given a database, the goal is to discover all rules whose
support and confidence are greater than or equal to the minimum support and
minimum confidence respectively.
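The text does not spell out how support and confidence are computed. A minimal sketch, assuming the usual definitions (support of an itemset is the fraction of transactions containing it; confidence of X => Y is support(X ∪ Y) divided by support(X)), with hypothetical function names and toy transactions:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, x, y):
    """support(X union Y) / support(X) for the rule X => Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical market-basket data, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(transactions, {"bread"}, {"milk"})  # 2 of 3 bread baskets
print(rule_support, rule_confidence)
```

A rule such as {bread} => {milk} would be reported only if both values reach the user-specified minimum thresholds.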
Example: A retailer may want to know where similarities exist in his customer base, so
that he can create and understand different groups. He can use the existing database of
customers or, more specifically, of transactions collected over a period of
time. Clustering methods will help him identify different categories of customers.
During the discovery process, differences between data sets can be used to
separate them into different groups, and similarities between data sets can be used to group
similar data together.
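The text does not name a particular clustering method. As one possible illustration, the sketch below uses plain k-means on a single attribute; the function name, the spending figures and the starting centers are all hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain k-means on one numeric attribute, starting from the given centers."""
    centers = list(centers)
    groups = []
    for _ in range(iterations):
        # Assignment step: each value joins the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Hypothetical yearly spend per customer.
spend = [12, 15, 14, 90, 95, 88, 300, 310]
centers, groups = kmeans_1d(spend, centers=[0, 100, 400])
print(groups)
```

Customers with similar spend end up in the same group, and the gaps between the groups are what separates one category of customer from another.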
DISCOVERY OF CLASSIFICATION RULES
Classification of large data sets is an important problem in data mining. Given a
database with a number of records and a set of classes such that each record belongs to
one of the given classes, the problem of classification is to decide the class to which a given
record belongs. The classification problem is also concerned with generating a description or a
model for each class from the given data set. The approaches include:
Decision Trees
Neural Networks
Genetic Algorithms
Statistical models such as linear/geometric discriminants.
Applications include
NEURAL NETWORK
SVMs are based on statistical learning theory and are increasingly becoming useful in data
mining. The main idea is to nonlinearly map the data set into a high-dimensional feature
space and use a linear discriminator to classify the data. Their success has been demonstrated
in the areas of regression, classification and decision-tree construction.
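The idea of mapping into a higher-dimensional space and then separating linearly can be seen on a small example. This is not an SVM and involves no training: the feature map and the weights below are hand-picked, hypothetical choices that only show why the mapping helps.

```python
def feature_map(x1, x2):
    """Hypothetical nonlinear map from 2-D input to a 3-D feature space."""
    return (x1, x2, x1 * x2)


def linear_discriminator(features, weights, bias):
    """Sign of w . phi(x) + b: a linear decision rule in feature space."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score >= 0 else -1


# XOR-style points: no straight line in the (x1, x2) plane separates the classes.
points = [(1, 1), (-1, -1), (1, -1), (-1, 1)]
labels = [1, 1, -1, -1]

# Hand-picked weights, not the result of SVM training.
weights, bias = (0.0, 0.0, 1.0), 0.0
predictions = [linear_discriminator(feature_map(*p), weights, bias) for p in points]
print(predictions == labels)
```

In the original two dimensions the classes cannot be split by a line, but after the map the added third coordinate alone separates them with a plane. An actual SVM would additionally choose the separating plane with the largest margin.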
DM Problems
Sequence Mining: It is concerned with mining sequence data. It may be noted that in the
discovery of association rules, we are interested in finding associations between items
irrespective of their order of occurrence. For example, we may be interested in the
association between the purchase of a particular brand of soft drink and the occurrence of
stomach upsets. But it is more relevant to identify whether there is some pattern in the
stomach upsets that occur after the purchase of the soft drink.
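The distinction between an unordered association and an ordered pattern can be sketched as follows. The event names and customer histories are hypothetical, and the two helper functions are only an illustration, not a full sequence-mining algorithm.

```python
def co_occurs(sequence, a, b):
    """True if both events appear in the sequence, in any order."""
    return a in sequence and b in sequence


def follows(sequence, a, b):
    """True if some occurrence of b comes after the first occurrence of a."""
    if a not in sequence:
        return False
    return b in sequence[sequence.index(a) + 1:]


# Hypothetical time-ordered event histories, one list per customer.
histories = [
    ["soft_drink", "stomach_upset"],
    ["stomach_upset", "soft_drink"],
    ["soft_drink", "snack", "stomach_upset"],
    ["snack"],
]

unordered = sum(co_occurs(h, "soft_drink", "stomach_upset") for h in histories)
ordered = sum(follows(h, "soft_drink", "stomach_upset") for h in histories)
print(unordered, ordered)
```

The second history counts toward the association but not toward the sequence pattern, because the stomach upset there precedes the purchase.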
Web Mining: The WWW is a fertile area for DM research. Web mining can be broken down into the following
subtasks.
3. Generalization: to automatically discover general patterns at individual web sites as well as across
multiple sites.
Text Mining
In text mining, text documents can be structured by means of information extraction, text
categorization or NLP techniques, applied as a preprocessing step before performing any kind of KDTs.
Text Mining covers
1. Text Categorization
2. Exploratory Data Analysis
3. Text Clustering
4. Finding Patterns in Text Databases
5. Information Extraction
SPATIAL DATA MINING
It deals with spatial or location data. Developments in IT, digital mapping and remote sensing, and
the global diffusion of GIS, place demands on developing data-driven inductive approaches to
spatial analysis.