
ISSN : 2229-6093

S.P.Deshpande,V.M.Thakare, Int. J. Comp. Tech. Appl., Vol 2 (1), 165-170

Intelligence Ingrained Data Mining Engine Architecture

S. P. Deshpande #1, Dr. V. M. Thakare *2

# P.G. Department of Computer Science and Technology, D.C.P.E., Amravati (MS), India.
1 Shrinivasdeshpande68@gmail.com
* Department of Computer Science, S.G.B. Amravati University, Amravati (MS), India.
2 vilthakare@yahoo.co.in

Abstract: Information plays a vital role in every field. Taking full advantage of data requires a tool for automatic summarization of data, extraction of the essence of the stored information, and discovery of patterns in raw data. "Data mining" is a tool that meets these needs. Several attempts have been made to design and develop a generic data mining system, but no system has been found to be completely generic. Domain experts play an important role in the different stages of data mining. The decisions at each stage are influenced by factors such as domain and data details, the aim of the data mining, and the context parameters. This paper presents the architecture of a data mining engine that works for a specific problem domain. The engine is developed with some prerequisite domain knowledge, so a non-expert can use the system effectively for data mining.

Key Words: Data Mining, Engine Architecture, problem specific data mining engine.

I. Introduction

Data mining is the process of extracting hidden predictive information from large databases; it is a powerful technology with great potential to help organizations focus on the most important information in their data warehouses [1,2,3,4]. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by the retrospective tools typical of decision support systems. Data mining is one of the tasks in the process of knowledge discovery from data. The data stored in the database is used to discover patterns, which are then interpreted by applying domain knowledge. Data mining applications can be generic or domain specific. A generic application is required to be an intelligent system that can, on its own, take certain decisions such as selection of data, selection of the data mining method, and presentation and interpretation of the result. Some generic data mining applications cannot take these decisions on their own but guide users in the selection of data, the selection of a data mining method, and the interpretation of the results.

Domain specific applications focus on using domain specific data and a data mining algorithm targeted at a specific objective. Data generating sources produce different types of data, ranging from simple text and numbers to more complex audio-video data. To mine patterns, and thus knowledge, from this data, different types of data mining algorithms are used. Collecting and selecting context specific data and applying a data mining algorithm to generate context specific knowledge is thus a skillful job. In many domain specific data mining applications, domain experts play a vital role in mining useful knowledge.

II. The Problem Definition

Most of the previous studies on data mining applications in various fields, such as Language Science [11,12,11], E-learning [34], Medical Science and Health Care [12,13,15,18], Education [14,33], Banking [29], Market Research [26], E-commerce [31], Engineering [15], Sports Science [24,28,36], Detection of Terrorist Activities [17,27,30,35], Intrusion Detection [21], Quality Control [39], Software Engineering [19,20], Library Science [32], and Bio-informatics [22,23], use a variety of data types, ranging from text to images, stored in a variety of databases and data structures. Different data mining methods are used to extract patterns, and thus knowledge, from these varied databases. Selection of data and methods for data mining is an important task in this process and needs knowledge of the domain. Several attempts have been made to design and develop a generic data mining system, but no system has been found to be completely generic
[6,7,8,9,10]. Thus, for every domain the assistance of a domain expert is mandatory. The domain experts shall be guided by the system to apply their knowledge effectively when using data mining systems to generate the required knowledge [5]. Domain experts are required to select specific data for data mining, clean and transform the data, extract patterns, and finally interpret the patterns for knowledge generation.

Most of the domain specific data mining applications show accuracy above 90%. Generic data mining applications have limitations. From the study of various data mining applications it is observed that no application described as generic is 100% generic. Intelligent interfaces and intelligent agents make an application generic to some extent, but they have limitations. Domain experts play an important role in the different stages of data mining. The decisions at each stage are influenced by factors such as domain and data details, the aim of the data mining, and the context parameters [37]. Domain specific applications aim to extract specific knowledge. The domain experts, by considering the user's interest and other context parameters, guide the system to select the data, preprocess the data, apply the data mining algorithm, and discover the knowledge. The results yielded by domain specific applications are more accurate and useful. Therefore it is concluded that domain specific applications are better suited to data mining. From the above study it seems very difficult to design and develop a data mining system that can work dynamically for any domain. Designing a domain specific system is also a challenging task. A data mining engine that works for a specific problem domain and can be used by a user who does not have much knowledge of the domain is thus the central idea of this study.

III. The Proposed System Architecture

In this research work a new architectural model of a data mining engine is proposed. As the problem domain is known and well defined, the environment in which the system runs is also well defined. The user's requirements are clearly defined and the target processing on the data is also known; in this situation system design becomes easy. The essential domain knowledge is maintained in the knowledge base. The system has four major components: the user interface, the data store, the data mining engine, and the knowledge base.

The problem related data selected from the data repository is preprocessed and then copied into the data mart. The user interface for algorithm selection helps the user select a specific data mining algorithm. The domain knowledge required for the actual mining shall be referenced from the domain knowledge component. The discovered knowledge shall be presented to the users through the user interfaces. The generated knowledge is also useful for future data mining applications and for the knowledge discovery process; therefore it shall also be stored in the knowledge base. The proposed data mining engine architecture is as given below:

[Figure: Proposed Data Mining Engine Architecture]

IV. Implementation of Proposed Architecture

The proposed architecture is implemented for generating player profiles from sports related text. In the pilot application "Profile Generator" the system contains four main components: the user interface, the data mining engine, the data store, and the knowledge base. The data store is a simple folder containing the text of interest. This text is preprocessed; interfaces are available to accomplish the preprocessing work. The
preprocessing includes conversion of the text to lower case, removal of punctuation marks, removal of commonly used words using a stop-list, and tokenization of the text. The knowledge base contains prerequisite knowledge such as names of countries, names of sports/games, and commonly used words (stop words), together with rules of the form: word preceded by, word followed by, if_word_is, etc.

IF_PRESIDED_WITH                     IF_TEXT_IS                  THEN_VALUE
Mr or Ms or Miss or Mrs or Master    Not Name_of_Country         Name
or Shri or Smt or Ku.                Not Commonly_used_Word
                                     Not Numeric string
                                     Not date string
Any word                             Not Name_of_Country         May be Name
                                     Not Commonly_used_Word
                                     Not Numeric string
                                     Not date string

IF_TEXT_IS       IF_FOLLOWED_BY       THEN_VALUE
<Name>           Of <Country>         Country He/She representing

The rule base for other attributes such as Date of Birth, Country he/she representing, Playing hobby, Favorite food, Favorite movie, Favorite clothing, and Sports Record is created and made available with the system. The knowledge base used with the data mining engine is laced with domain and problem specific knowledge. The training phase also fills the knowledge base with relevant knowledge, and therefore a user can use the data mining tool without domain expertise.

This knowledge base is used for the extraction of values for the above-mentioned variables. The concept of using a knowledge base in a data mining application is new. Domain experts are required to create the initial knowledge base. The derived values for the different attributes are also stored in the knowledge base. This table is used to train the system and is used again in the next pass of the system.

The data mining engine is designed to apply different data mining algorithms for the extraction of values for the specific attributes. In the training phase of the system the association rule mining algorithm is used to generate knowledge such as the keywords associated with a game. With 5% support and 50% confidence, this algorithm generates rules that specify the set of keywords associated with a specific game. For example, the words associated with the game Cricket are: Runs, Wicket, Catch, Bold, Wicketkeeper, Batsman, Bowler, Fielder, etc.; the words associated with the game Hockey are: Goal, Goalkeeper, Penalty corner, Free hit, Hockey stick, etc. In the extraction of values of the profile attributes, N-gram analysis is used along with the association rules. 1-token, 2-token, and 3-token sequences are used interchangeably for the extraction of different attribute values. The interfaces available in this system give the user the opportunity to interact at each level of the data mining process. They are user friendly and designed for effective use by novice users.

V. Result and Discussion

The results obtained from the application developed using the proposed architecture are convincing. In the pilot application "Profile Generator" the values of a few attributes of a player profile are derived: Name, Date of Birth, Game he/she plays, Country he/she representing, other Sports he/she likes, favorite food, favorite movie, and favorite clothing. The system derived values successfully for these attributes. The results obtained are as given below:

Sr. No.   Attribute                      Value Retrieval Success Rate
1         Full Name of Player            86%
2         Date of Birth                  92%
3         Game He/She Plays              99%
4         Country He/She Representing    93%
5         Other Sports He/She Likes      84%
6         Favorite Food                  82%
7         Favorite Movie                 84%
8         Favorite Clothing              79%
9         Sports Record                  61%
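
The success rates in the results table can be summarized with a short calculation. The Python sketch below recomputes the 84.44% average quoted in the conclusion and maps each rate to the handling described in this section (above 90%: value used as retrieved; 80% to 90%: user prompted with a few options). The function name handling_mode, the mode labels, and the treatment of rates below 80% are assumptions made for illustration; they are not part of the original system.

```python
# Success rates (percent) copied from the results table.
SUCCESS_RATES = {
    "Full Name of Player": 86,
    "Date of Birth": 92,
    "Game He/She Plays": 99,
    "Country He/She Representing": 93,
    "Other Sports He/She Likes": 84,
    "Favorite Food": 82,
    "Favorite Movie": 84,
    "Favorite Clothing": 79,
    "Sports Record": 61,
}


def handling_mode(rate):
    """Map a success rate to the handling described in the discussion.

    The labels are hypothetical. The paper does not say what happens
    below 80%, so that band is only flagged for review here.
    """
    if rate > 90:
        return "accept"        # value is rarely wrong, used as retrieved
    if rate >= 80:
        return "ask_user"      # prompt the user with a few options
    return "needs_review"      # assumption: not specified in the paper


# Unweighted mean over the nine attributes, rounded to two decimals.
average = round(sum(SUCCESS_RATES.values()) / len(SUCCESS_RATES), 2)
modes = {name: handling_mode(rate) for name, rate in SUCCESS_RATES.items()}

print(average)
for name, mode in modes.items():
    print(f"{name}: {mode}")
```

The unweighted mean treats every attribute equally, which matches the reported figure; a mean weighted by the number of profiles per attribute would need data the paper does not give.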

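The preprocessing steps and the name rule described in Section IV, which produced the values evaluated here, can be illustrated with a minimal Python sketch. The word lists are tiny hypothetical stand-ins for the knowledge base, the function names are illustrative, and the date-string check of the rule table is omitted for brevity; this is not the authors' implementation.

```python
import string

# Hypothetical, very small stand-ins for the knowledge base contents.
STOP_WORDS = {"the", "a", "an", "is", "of", "for", "and", "in"}
COUNTRIES = {"india", "australia", "england"}
HONORIFICS = {"mr", "ms", "miss", "mrs", "master", "shri", "smt", "ku"}


def preprocess(text):
    """Lower-case, strip punctuation, tokenize, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


def is_candidate(token):
    """IF_TEXT_IS conditions: not a country, stop word, or numeric string.

    The "Not date string" condition is left out of this sketch.
    """
    return (token not in COUNTRIES
            and token not in STOP_WORDS
            and not token.isdigit())


def extract_names(tokens):
    """Apply the first rule: an honorific followed by a candidate word."""
    names = []
    for prev, cur in zip(tokens, tokens[1:]):
        if prev in HONORIFICS and is_candidate(cur):
            names.append(cur)
    return names


tokens = preprocess("Mr. Sachin plays cricket for India. Ms. Australia is not a name.")
names = extract_names(tokens)
print(tokens)
print(names)
```

In this example the word after "Ms." is rejected because it appears in the country list, which is the behaviour the IF_TEXT_IS column of the rule table asks for.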

For the attributes whose success rate is above 90%, the algorithm works well and very rarely retrieves a false value. For the attributes whose success rate is between 80% and 90%, the system prompts the user with a few options and asks for selection of the most correct value. In the extraction of the sports record the text document plays an important role; the algorithm requires modification to extract the sports record, as it has different parameters in different games. This architecture of a data mining engine uses domain and problem relevant knowledge. This knowledge is partially loaded by domain experts and partially generated in the training phase of the algorithm. Owing to the use of the knowledge base, this system can be used by novice users.

VI. Conclusion

The application developed using the proposed architecture showed an average success rate of 84.44%; therefore it can be comfortably used for data mining. The major advantage of this architecture is the presence of domain specific knowledge along with the data. This helps users who are not domain experts to perform data mining. Existing data mining tools require domain expertise and also training in using the tool. They are limited to extracting patterns or features from data, which need further interpretation by domain experts to generate knowledge. Using this architecture a system can generate the knowledge, but developing the knowledge base is the main issue in this context.

References

[1] Introduction to Data Mining and Knowledge Discovery, Third Edition, ISBN: 1-892095-02-5, Publication of Two Crows Corporation, 10500 Falls Road, Potomac, MD 20854 (U.S.A.), 1999.
[2] Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, ISBN 0-471-66657-2, John Wiley & Sons, Inc, 2005.
[3] Dunham, M. H., Sridhar S., Data Mining: Introductory and Advanced Topics, Pearson Education, New Delhi, ISBN: 81-7758-785-4, 1st Edition, 2006.
[4] Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer and Rüdiger Wirth, CRISP-DM 1.0: Step-by-step data mining guide, NCR Systems Engineering Copenhagen (USA and Denmark), DaimlerChrysler AG (Germany), SPSS Inc. (USA) and OHRA Verzekeringenen Bank Group B.V (The Netherlands), 2000.
[5] Abraham Bernstein and Foster Provost, An Intelligent Assistant for the Knowledge Discovery Process, Working Paper of the Center for Digital Economy Research, New York University and also presented at the IJCAI 2001 Workshop on Wrappers for Performance Enhancement in Knowledge Discovery in Databases.
[6] H. Baazaoui Zghal, S. Faiz, and H. Ben Ghezala, A Framework for Data Mining Based Multi-Agent: An Application to Spatial Data, volume 5, ISSN 1307-6884, Proceedings of World Academy of Science, Engineering and Technology, April 2005.
[7] Ralf Rantzau and Holger Schwarz, A Multi-Tier Architecture for High-Performance Data Mining, A Technical Project Report of ESPRIT project, The consortium of CRITIKAL project, Attar Software Ltd. (UK), Gehe AG (Denmark); Lloyds TSB Group (UK), Parallel Applications Centre, University of Southampton (UK), BWI, University of Stuttgart (Denmark), IPVR, University of Stuttgart (Denmark).
[8] J. A. Botia, M. y Garijo, J. R. Velasco, A. F. Skarmeta, A Generic Data mining System basic design and implementation guidelines, A Technical Project Report of CYCYT project of Spanish Government. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.53.1935
[9] Marcos M. Campos, Peter J. Stengard, Boriana L. Milenova, Data-Centric Automated Data Mining, Publication of Oracle Corporation, Web Site: www.oracle.com/technology/products/bi/odm/pdf/automated_data_mining_paper_1205.pdf
[10] J. Sirgo, A. Lopez, R. Janez, R. Blanco, N. Abajo, M. Tarrio, R. Perez, A Data Mining Engine based on Internet, Emerging Technologies and Factory Automation ETFA 2003, Proceedings of IEEE Conference, 16-19 Sept. 2003. Web Site: www.citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.11.8955
[11] Identification of foreign-accented French using data mining techniques, By Bianca Vieru-Dimulescu, Philippe Boula de
Mareüil and Martine Adda-Decker, Computer Sciences Laboratory for Mechanics and Engineering Sciences (LIMSI). Web Site: www.limsi.fr/Individu/bianca/article/Vieru&Boula&Madda_ParaLing07.pdf
[12] Halteren, H. van, Linguistic Profiling for Author Recognition and Verification, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics USA, Barcelona, Spain, Article No. 199, Year of Publication: 2004.
[13] Linguistic profiling of texts for the purpose of language verification, By Hans Van Halteren and Nelleke Oostdijk, The ILK research group, Tilburg centre for Creative Computing and the Department of Communication and Information Sciences of the Faculty of Humanities, Tilburg University, The Netherlands. Web Site: www.ilk.uvt.nl/~antalb/textmining/LingProfColingDef.pdf
[14] Application of Data Mining Techniques for Medical Image Classification, By Maria-Luiza Antonie, Osmar R. Zaiane, Alexandru Coman, Proceedings of the Second International Workshop on Multimedia Data Mining (MDM/KDD 2001) in conjunction with ACM SIGKDD conference, San Francisco, August 26, 2001.
[15] Data Mining: Medical and Engineering Case Studies, By A. Kusiak, K.H. Kernstine, J.A. Kern, K.A. McLaughlin, and T.L. Tseng, Proceedings of the Industrial Engineering Research 2000 Conference, Cleveland, Ohio, pp. 1-7, May 21-23, 2000.
[18] Data Warehousing and Data Mining System Applied to E-Learning, By R. Luis, J. Redol, D. Simoes, N. Horta, Proceedings of the II International Conference on Multimedia and Information & Communication Technologies in Education, Badajoz, Spain, December 3-6th 2003.
[19] Crime Data Mining: An Overview and Case Studies, By Hsinchun Chen, Wingyan Chung, Yi Qin, Michael Chau, Jennifer Jie Xu, Gang Wang, Rong Zheng, Homa Atabakhsh, A project under NSF Digital Government Programme, USA, "COPLINK Center: Information and Knowledge Management for Law Enforcement", July 2000 - June 2003.
[21] Rao, R. B., Krishnan, S. and Niculescu, R. S., Data Mining for Improved Cardiac Care, SIGKDD Explorations Volume 8, Issue 1.
[24] Kanellopoulos, Y., Dimopulos, T., Tjortjis, C., Makris, C., Mining Source Code Elements for Comprehending Object-Oriented Systems and Evaluating Their Maintainability, SIGKDD Explorations Volume 8, Issue 1.
[25] Schultz, M. G., Eskin, Eleazar, Zadok, Erez, and Stolfo, Salvatore, J., Data Mining Methods for Detection of New Malicious Executables, Proceedings of the 2001 IEEE Symposium on Security and Privacy, IEEE Computer Society Washington, DC, USA, ISSN: 1081-6011, 2001.
[26] Cai, W. and Li L., Anomaly Detection using TCP Header Information, STAT753 Class Project Paper, May 2004. Web Site: http://www.scs.gmu.edu/~wcai/stat753/stat753report.pdf.
[27] Comparative genomics using data mining tools, By Tannistha Nandi, Chandrika B-Rao and Srinivasan Ramchandran, Journal of Bio-Science, Indian Academy of Sciences, Vol. 27, No. 1, Suppl. 1, page No. 15-25, February 2002.
[29] A survey of data mining methods for linkage dis-equilibrium mapping, By Paivi Onkamo and Hannu Toivonen, Henry Stewart Publications 1473-9542. Human Genomics. VOL 2, NO 5, Page No. 336-340, MARCH 2006.
[30] Data Mining in Sports: Predicting Cy Young Award Winners, By Lloyd Smith, Bret Lipscomb, and Adam Simkins, Journal of Computer Science, Vol. 22, Page No. 115-121, April 2007.
[31] Deng, B., Liu, X., Data Mining in Quality Improvement, Proceedings of the Twenty-Seventh Annual SAS Users Group International Conference 2002 by SAS Institute Inc., Cary, NC, USA. ISBN 1-59047-061-3. Web Site: http://www2.sas.com/proceedings/sugi27/Proceed27.pdf
[32] Cohen, J. J., Olivia, C., Rud, P., Data Mining of Market Knowledge in The Pharmaceutical Industry, Proceeding of 13th Annual Conference of North-East SAS Users Group Inc., NESUG2000, Philadelphia, Pennsylvania, September 24-26 2000.
[33] Elovici, Y., Kandel, A., Last, M., Shapira, B., Zaafrany, O., Using Data Mining Techniques for Detecting Terror-Related Activities on the Web, Web Site: www.ise.bgu.ac.il/faculty/mlast/papers/JIW_Paper.pdf
[34] Data Mining in Sports: A Research Overview, By Osama K. Solieman, A Technical Report, MIS Masters Project, August 2006.
[36] Foster, D. P. and Stine, R. A., Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy, Journal of the American Statistical Association, Alexandria, VA, ETATS-UNIS, vol. 99, ISSN 0162-1459, pp. 303-313, January 15, 2004.
[37] Kraft, M. R., Desouza, K. C., Androwich, I., Data Mining in Healthcare Information Systems: Case Study of a Veterans' Administration Spinal Cord Injury Population, IEEE, Proceedings of the 36th Hawaii International Conference on System Sciences, 0-7695-1874-5/03, 2002.
[39] Integrating E-Commerce and Data Mining: Architecture and Challenges, By Suhail Ansari, Ron Kohavi, Llew Mason, and Zijian Zheng, IEEE International Conference on Data Mining, 2001.
[40] Jadhav, S. R., and Kumbargoudar, P., Multimedia Data Mining in Digital Libraries: Standards and Features, Proceedings of conference Recent advances in Information Science and Technology READIT - 2007, pp 54-59, Organized by Madras Library Association - Kalpakkam Chapter & Scientific Information Resource Division, Indira Gandhi Center for Atomic Research, Department of Atomic Energy, Kalpakkam, Tamilnadu, India. 12-13 July 2007.
[41] Anjewierden, A., Kollöffel, B., and Hulshof C., Towards educational data mining: Using data mining methods for automated chat analysis to understand and support inquiry learning processes, International Workshop on Applying Data Mining in e-Learning, ADML'07, Vol-305, Page No 23-32, Sissi, Lassithi - Crete, Greece, 18 September, 2007.
[42] Knowledge Discovery with Genetic Programming for Providing Feedback to Courseware Authors, By Cristobal Romero, Sebastian Ventura and Paul De Bra, Kluwer Academic Publishers, Printed in the Netherlands, 30/08/2004.
[43] Crime Data Mining: A General Framework and Some Examples, By Hsinchun Chen, Wingyan Chung, Jennifer Jie Xu, Gang Wang, Yi Qin, Michael Chau, Technical Report, Published by the IEEE Computer Society, 0018-9162/04, pp 50-56, April 2004.
[44] Chodavarapu Y., Using data-mining for effective (optimal) sports squad selections, Web Site: http://insightory.com/view/74//using_data-mining_for_effective_(optimal)_sports_squad_selections
[60] Vajirkar Pravin, Singh Sachin, and Lee Yugyung, Context-Aware Data Mining Framework for Wireless Medical Application, Lecture Notes in Computer Science (LNCS), Volume 2736, Springer-Verlag. ISBN 3-540-40806-1, pp. 381-391.
