ABSTRACT
In recent years the Data Mining field has seen substantial growth and consolidation. Efforts are under way to
establish standards in the area, among them SEMMA and CRISP-DM. Both are emerging as industrial standards and
define a set of sequential steps intended to guide the implementation of data mining applications. This raises the
question of whether there are substantial differences between them and the traditional KDD process. This paper aims
to establish a parallel between these standards and the KDD process and to understand the similarities between them.
KEYWORDS
Data Mining Standards, Knowledge Discovery in Databases, Data Mining.
1. INTRODUCTION
Fayyad considers Data Mining (DM) to be one of the phases of the KDD (Knowledge Discovery in Databases) process
(Fayyad et al., 1996). The data mining phase mainly concerns the means by which patterns are extracted and
enumerated from data. The literature is a source of some confusion because the two terms are used interchangeably,
making it difficult to delimit each concept exactly (Benoît, 2002). In this paper, data mining is seen as one of
the phases of the KDD process, as presented in (Fayyad et al., 1996) and in (Brachman & Anand, 1996).
The growing attention paid to the area stems from the rise of large databases in an increasing and diverse
number of organizations. There is a risk of wasting all the value and wealth of information contained in these
databases unless adequate techniques are used to extract useful knowledge (Chen et al, 1996) (Simoudis, 1996)
(Fayyad, 1996).
In recent years, the data mining area has grown and consolidated. Efforts are under way to establish
standards in the area, both by academics and by people in industry. The academic efforts center on the attempt
to formulate a general framework for DM (Dzeroski, 2006). The bulk of these efforts center on the definition of
a language for DM that can be accepted as a standard, in the same way that SQL was accepted as a standard for
relational databases (Han et al, 1996) (Meo et al, 1998) (Imielinski et al, 1999) (Sarawagi, 2000)
(Botta et al, 2004). The efforts in the industrial field mainly concern the definition of
processes/methodologies that can guide the implementation of DM applications. In this paper, SEMMA and CRISP-DM
were chosen because they are considered to be the most popular. Although this perception is not scientifically
established, it exists because SEMMA and CRISP-DM appear in many of the publications in the area and are actually
used in practice.
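During the analysis of the documentation on SEMMA and on CRISP-DM, the question arose of whether there are
substantial differences between them and the traditional KDD process. In this paper, we aim to establish a
parallel between these and the KDD process as well as an understanding of the similarities between them.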
The sequence of the six stages is not rigid, as shown schematically in Figure 2. CRISP-DM is extremely complete
and well documented. All of its stages are duly organized, structured and defined, so that a project can be
easily understood or revised (Santos & Azevedo, 2005). Although the CRISP-DM process is independent
of the chosen DM tool, it is linked to the SPSS Clementine software.
3. A COMPARATIVE STUDY
Comparing the KDD and SEMMA stages, we would, at first glance, say that they
are equivalent:
• Sample can be identified with Selection,
• Explore can be identified with Preprocessing,
• Modify can be identified with Transformation,
• Model can be identified with Data Mining,
• Assess can be identified with Interpretation/Evaluation.
On closer examination, the five stages of the SEMMA process can be seen as a
practical implementation of the five stages of the KDD process, since SEMMA is directly linked to the SAS
Enterprise Miner software.
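The stage-by-stage correspondence listed above can be written down as a simple lookup table. The following is a minimal Python sketch, not part of either methodology; the names SEMMA_TO_KDD, KDD_STAGES and is_one_to_one are illustrative assumptions. It checks that the mapping is one-to-one, which is what the claim of equivalence amounts to.

```python
# Illustrative sketch only: the dictionary restates the correspondence
# given in the text; all identifiers are hypothetical.
KDD_STAGES = [
    "Selection",
    "Preprocessing",
    "Transformation",
    "Data Mining",
    "Interpretation/Evaluation",
]

SEMMA_TO_KDD = {
    "Sample": "Selection",
    "Explore": "Preprocessing",
    "Modify": "Transformation",
    "Model": "Data Mining",
    "Assess": "Interpretation/Evaluation",
}


def is_one_to_one(mapping, targets):
    """Return True if every target stage is hit exactly once by the mapping."""
    return sorted(mapping.values()) == sorted(targets)


if __name__ == "__main__":
    for semma_stage, kdd_stage in SEMMA_TO_KDD.items():
        print(f"{semma_stage:8} -> {kdd_stage}")
    print("one-to-one:", is_one_to_one(SEMMA_TO_KDD, KDD_STAGES))
```

Because each SEMMA stage maps to exactly one KDD stage and no KDD stage is left over, the check succeeds for this mapping.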
Comparing the KDD stages with the CRISP-DM stages is not as straightforward as in the SEMMA
case. Nevertheless, we can first observe that the CRISP-DM methodology incorporates the steps
that, as mentioned above, must precede and follow the KDD process, that is to say:
• The Business Understanding phase can be identified with the development of an understanding
of the application domain, the relevant prior knowledge and the goals of the end-user
• The Deployment phase can be identified with the consolidation by incorporating this
knowledge into the system.
Concerning the remaining stages, we can say that:
• The Data Understanding phase can be identified as the combination of Selection and Preprocessing,
• The Data Preparation phase can be identified with Transformation,
• The Modeling phase can be identified with Data Mining,
• The Evaluation phase can be identified with Interpretation/Evaluation.
Table 1 summarizes these correspondences.
Table 1. Summary of the correspondences between KDD, SEMMA and CRISP-DM
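As a rough sketch of how the correspondences stated in this section could be tabulated, the Python snippet below builds one row per KDD stage. The row labels "pre-KDD" and "post-KDD" and all identifiers are assumptions introduced here for illustration; a dash marks a step for which the text gives no SEMMA counterpart.

```python
# Illustrative sketch: rows are derived from the correspondences stated
# in the text. "pre-KDD" and "post-KDD" are hypothetical labels for the
# steps that precede and follow the KDD process.
KDD_ROWS = [
    "pre-KDD",
    "Selection",
    "Preprocessing",
    "Transformation",
    "Data Mining",
    "Interpretation/Evaluation",
    "post-KDD",
]

KDD_TO_SEMMA = {
    "Selection": "Sample",
    "Preprocessing": "Explore",
    "Transformation": "Modify",
    "Data Mining": "Model",
    "Interpretation/Evaluation": "Assess",
}

KDD_TO_CRISP_DM = {
    "pre-KDD": "Business Understanding",
    "Selection": "Data Understanding",
    "Preprocessing": "Data Understanding",
    "Transformation": "Data Preparation",
    "Data Mining": "Modeling",
    "Interpretation/Evaluation": "Evaluation",
    "post-KDD": "Deployment",
}


def build_rows():
    """Return (KDD, SEMMA, CRISP-DM) tuples, using '-' where no stage applies."""
    return [
        (kdd, KDD_TO_SEMMA.get(kdd, "-"), KDD_TO_CRISP_DM.get(kdd, "-"))
        for kdd in KDD_ROWS
    ]


def format_table(rows):
    """Render the rows as left-aligned plain-text columns with a header line."""
    all_rows = [("KDD", "SEMMA", "CRISP-DM")] + rows
    widths = [max(len(row[i]) for row in all_rows) for i in range(3)]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in all_rows
    )


if __name__ == "__main__":
    print(format_table(build_rows()))
```

Data Understanding appears in two rows, which reflects why the CRISP-DM comparison is less direct than the SEMMA one.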
REFERENCES
Fayyad, U. M. et al, 1996. From data mining to knowledge discovery: an overview. In Fayyad, U. M. et al. (Eds.),
Advances in knowledge discovery and data mining. AAAI Press / The MIT Press.
Benoît, G., 2002. Data Mining. Annual Review of Information Science and Technology, Vol. 36, No. 1, pp 265-310.
Brachman, R. J. & Anand, T., 1996. The process of knowledge discovery in databases. In Fayyad, U. M. et al. (Eds.),
Advances in knowledge discovery and data mining. AAAI Press / The MIT Press.
Chen, M. et al, 1996. Data Mining: An Overview from a Database Perspective. IEEE Transactions on Knowledge and
Data Engineering, Vol. 8, No. 6, pp 866-883.
Simoudis, E., 1996. Reality check for data mining. IEEE Expert, Vol. 11, No. 5, pp 26-33.
Fayyad, U. M., 1996. Data mining and knowledge discovery: making sense out of data. IEEE Expert, Vol. 11, No. 5, pp
20-25.
Dzeroski, S., 2006. Towards a General Framework for Data Mining. In Dzeroski, S. and Struyf, J. (Eds.), Knowledge
Discovery in Inductive Databases. LNCS 4747. Springer-Verlag.
Han, J. et al, 1996. DMQL: A Data Mining Query Language for Relational Databases. In Proceedings of DMKD-96
(SIGMOD-96 Workshop on KDD). Montreal, Canada.
Meo, R. et al, 1998. An Extension to SQL for Mining Association Rules. Data Mining and Knowledge Discovery, Vol. 2,
pp 195-224. Kluwer Academic Publishers.
Imielinski, T. & Virmani, A., 1999. MSQL: A Query Language for Database Mining. Data Mining and Knowledge
Discovery, Vol. 3, pp 373-408. Kluwer Academic Publishers.
Sarawagi, S. et al, 2000. Integrating Association Rule Mining with Relational Database Systems: Alternatives and
Implications. Data Mining and Knowledge Discovery, Vol. 4, pp 89–125.
Botta, M. et al, 2004. Query Languages Supporting Descriptive Rule Mining: A Comparative Study. Database
Support for Data Mining Applications. LNAI 2682, pp 24-51.
Santos, M. & Azevedo, C., 2005. Data Mining – Descoberta de Conhecimento em Bases de Dados. FCA Publisher.