A New Improved Algorithm for Distributed Databases



Published by: ijcsis on Nov 25, 2011
Copyright:Attribution Non-commercial

K. Karpagam
Assistant Professor, Dept of Computer Science, H.H. The Rajah’s College (Autonomous), (Affiliated to Bharathidasan University, Tiruchirappalli), Pudukkottai, Tamil Nadu, India.
Dr. R. Balasubramanian
Dean, Faculty of Computer Applications, EBET Knowledge Park, Tirupur, Tamil Nadu, India.
Abstract: The development of the web and of data stores drawn from disparate sources has contributed to the growth of very large data sources and distributed systems. Large amounts of data are stored in distributed databases, since it is difficult to store these data in a single place for reasons of communication, efficiency and security. Research on mining association rules in distributed databases is therefore increasingly relevant. As the need to mine patterns across distributed databases has grown, Distributed Association Rule Mining algorithms have gained importance. Research has been conducted on mining association rules in distributed database systems, and the classical Apriori algorithm has been extended on the basis of the transactional database system. The demands of association rule mining and data extraction from distributed sources, combined with the obstacles involved in creating and maintaining central repositories, motivate the need for effective distributed information extraction and mining techniques. We present a new distributed association rule mining algorithm for distributed databases (NIADD). Theoretical analysis reveals a lower error probability than that of a sequential algorithm. Unlike existing algorithms, NIADD requires knowledge of neither a global schema nor the distribution of data across the databases.
Keywords: Distributed Data Mining, Distributed Association Rules
I. INTRODUCTION
The essence of knowledge discovery in databases (KDD) is the acquisition of knowledge. Organizations need data mining because it is the process of non-trivial extraction of implicit, previously unknown and potentially useful information from historical data. Mining association rules is one of the most important aspects of data mining. Association Rule Mining (ARM) can predict occurrences of related items. Many applications use data mining for product rankings or data-based decisions. The main task of every ARM algorithm is to discover the sets of items that frequently appear together (frequent item sets). Many organizations are geographically distributed, and merging data from their locations into a centralized site has its own cost and time implications.
Parallel processing is important in the world of database computing. Databases often grow to enormous sizes and are accessed by more and more users. This volume strains the ability of single-processor systems. Many organizations are turning to parallel processing technologies for performance, scalability, and reliability. Much progress has also been made in parallelized algorithms, which have been effective in reducing the number of database scans required for the task. Many algorithms have been proposed that take advantage of the speed of the network, the memory, or parallel computers. Parallel computers are costly. The alternative is distributed algorithms, which can run on less expensive clusters of PCs. Algorithms suitable for such systems include the CD and FDM algorithms [2, 3], both parallelized versions of Apriori. The CD and FDM algorithms did not scale well as the number of clustered PCs increased [4].
II. DISTRIBUTED DATABASES
There are many reasons for organizations to implement a distributed database system. A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. Distributing databases over a network achieves the advantages of performance, reliability, availability and modularity that are inherent in distributed systems. Many organizations that use a relational database management system (RDBMS) have multiple databases. Organizations have their own reasons for using more than a single database in a distributed architecture, as in Figure 1. Distributed databases are used in scenarios where each database is associated with a particular business function, such as manufacturing. Databases may also be implemented along geographical boundaries, such as headquarters and branch offices.
The users accessing these databases access the same data in different ways. The relationship between multiple databases is part of a well-planned architecture, in which distributed databases are designed and implemented. A distributed database system helps organizations serve objectives such as availability, data collection, extraction and maintenance.
Oracle, an RDBMS, provides inter-database connectivity through SQL*Net. Oracle also supports distributed databases through advanced replication, or multi-master replication. Advanced replication is used to deliver high availability and involves numerous databases. Oracle’s parallel query option (PQO) is a technology that divides complicated or long-running queries into many small queries which are executed independently.
Figure 1. Distributed database system (diagram not reproduced; surviving labels: D, D, D, Loc-1, Loc-2)
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 10, October 2011, http://sites.google.com/site/ijcsis/, ISSN 1947-5500
 
III. BENEFITS OF DISTRIBUTED DATABASES
The separation of the various system components, especially the separation of application servers from database servers, yields tremendous benefits in terms of cost, management, and performance. A machine's optimal configuration is a function of its workload. Machines that house web servers, for example, need to service a high volume of small transactions, whereas a database server with a data warehouse has to service a relatively low volume of large transactions (i.e., complex queries). A distributed architecture is less drastic than an environment in which databases and applications are maintained on the same machine. Location transparency implies that neither applications nor users need to be concerned with the logistics of where data actually resides. Distributed databases allow various locations to share their data. The components of the distributed architecture are completely independent of one another, which means that every site can be maintained independently.
Oracle Database’s database links allow distributed databases to be linked together. For example:
    CREATE PUBLIC DATABASE LINK LOC1.ORG.COM USING 'hq.ORG.COM';
An example of a distributed query would be:
    SELECT employeename, Department
    FROM EmployeeTable E, DepartmentTable@hq.ORG.COM D
    WHERE E.empno = D.empno;
IV. PROBLEM DEFINITION
Association rule mining is an important data mining tool used in many applications. It finds interesting associations and/or correlation relationships among large sets of data. Association rules show attribute-value conditions that occur frequently together in a given dataset. A typical and widely used example of association rule mining is market basket analysis. For example, supermarkets collect data on a large number of transactions, and answering a question such as which sets of items are often purchased together is not easy. Association rules provide information of this type in the form of “if-then” statements. The rules computed from the data are based on probability. Association rules are one of the most common techniques of data mining for local-pattern discovery in unsupervised learning systems [5].
In one approach, a random sample of the database is used to predict all the frequent item sets, which are then validated in a single database scan. Because this approach is probabilistic, the scan counts not only the frequent item sets but also the negative border (an item set is in the negative border if it is not frequent but all its “neighbors” in the candidate item set are frequent). When the scan reveals that item sets in the negative border are frequent, a second scan is performed to discover whether any superset of these item sets is also frequent. The number of scans increases the time complexity, and more so in distributed databases.
The purpose of this paper is to introduce a new mining algorithm for distributed databases. A large number of parameters affect the performance of distributed queries. Relations involved in a distributed query may be fragmented and/or replicated. With many sites to access, query response time may become very high.
V. PREVIOUS WORK
Researchers and practitioners have been interested in distributed database systems since the 1970s. At that time, the main focus was on supporting distributed data management for large corporations and organizations that kept their data at different locations. Distributed data processing is both feasible and needed. Almost all major database system vendors offer products to support distributed data processing (e.g., IBM, Informix, Microsoft, Oracle, Sybase).
Since its introduction in 1993 [5], the ARM problem has been studied intensively. Many algorithms, representing several different approaches, have been suggested. Some algorithms, such as Apriori, Partition, DHP, DIC, and FP-growth [6, 7, 8, 9, 10], are bottom-up, starting from item sets of size 1 and working up. Others, like Pincer-Search [11], use a hybrid approach, trying to guess large item sets at an early stage. Most algorithms, including those cited above, adhere to the original problem definition, while others search for different kinds of rules [9, 12, 13]. Algorithms for distributed ARM can be viewed as parallelizations of sequential ARM algorithms. The CD, FDM, and DDM [2, 3, 14] algorithms parallelize Apriori [6], and PDM [15] parallelizes DHP [16]. The parallel algorithms use the architecture of the parallel machine, where shared memory is used [17].
VI. APRIORI ALGORITHM FOR FINDING FREQUENT ITEMSETS
This section explains the Apriori algorithm for finding frequent item sets. Let a k-item set be an item set that consists of k items. A frequent item set F_k is a k-item set with sufficient support, and the set of large k-item sets is denoted by L_k. Let C_k be a set of candidate k-item sets. The Apriori property is that, if an item set X is joined with an item set Y, then
    Support(X ∪ Y) ≤ min(Support(X), Support(Y))
The first iteration finds L_1, all single items with Support > threshold. The second iteration finds L_2 using L_1. The iterations continue until no more frequent k-item sets can be found. Each iteration i consists of two phases:
Candidate generation - Construct a candidate set of large item sets.
Counting and selection - Count the number of occurrences of each candidate item set and determine the large item sets based on the predetermined support.
The set L_k is defined as the set containing the frequent k-item sets, which satisfy Support > threshold. L_k * L_k is defined as:
    L_k * L_k = {X ∪ Y, where X, Y belong to L_k and |X ∩ Y| = k - 1}.
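The two phases and the join L_k * L_k described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the in-memory list of transactions, and the use of an absolute count as the support threshold are all assumptions made for illustration. The pruning step relies on the property that every subset of a frequent item set must itself be frequent.

```python
from itertools import combinations


def support_counts(transactions, candidates):
    """Counting phase: number of transactions containing each candidate."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


def join_step(large_k, k):
    """L_k * L_k: unions of two large k-item sets that share k-1 items."""
    return {x | y for x in large_k for y in large_k if len(x & y) == k - 1}


def apriori(transactions, threshold):
    """Return {item set: count} for all item sets with count > threshold.

    `threshold` is an absolute transaction count (an assumption; the text
    only says "Support > threshold").
    """
    transactions = [frozenset(t) for t in transactions]
    singles = {frozenset([i]) for t in transactions for i in t}
    counts = support_counts(transactions, singles)
    large = {c for c, n in counts.items() if n > threshold}  # L_1
    result = {c: counts[c] for c in large}
    k = 1
    while large:
        # Candidate generation: join, then drop any candidate that has a
        # k-item subset which is not large.
        candidates = {
            c for c in join_step(large, k)
            if all(frozenset(s) in large for s in combinations(c, k))
        }
        # Counting and selection.
        counts = support_counts(transactions, candidates)
        large = {c for c in candidates if counts[c] > threshold}  # L_(k+1)
        result.update({c: counts[c] for c in large})
        k += 1
    return result


if __name__ == "__main__":
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    for itemset, count in sorted(apriori(baskets, 1).items(), key=lambda p: sorted(p[0])):
        print(sorted(itemset), count)
```

Each pass of the `while` loop corresponds to one database scan, which is why the number of iterations matters so much once the transactions are spread over several sites.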
VII. DISTRIBUTED ALGORITHMS IN ASSOCIATION RULES
A. PARALLEL PROCESSING FOR DATABASES
Three issues drive the use of parallel processing in database environments: speed of performance, scalability and availability. As databases grow, queries become more complex, and organizations need to scale their systems effectively to match that growth. With the increasing use of the Internet, companies need to accommodate users 24 hours a day. Most parallel or distributed association rule algorithms parallelize either the data or the candidates. Other dimensions that differentiate parallel association rule algorithms are the load-balancing approach used and the architecture. Data-parallel algorithms require that the memory at each processor be large enough to store all candidates at each scan. Task-parallel algorithms adapt to the amount of available memory at each site, since the partitions of the candidates need not all be the same size. The only restriction is that the total size of all candidates be small enough to fit into the combined memory of all processors.
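The difference between parallelizing the data and parallelizing the candidates can be seen in a small single-process Python sketch. The function names and the round-robin split of candidates are assumptions chosen for illustration, and the loops over sites stand in for work that would run on separate machines. In the data-parallel version every site holds all candidates and counts them over its own partition; in the task-parallel version each site holds only a share of the candidates but must see all of the data.

```python
from collections import Counter


def local_counts(partition, candidates):
    """Count, within one partition, the transactions containing each candidate."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:
                counts[candidate] += 1
    return counts


def data_parallel_counts(partitions, candidates):
    """Every site counts ALL candidates on its OWN partition; counts are summed."""
    total = Counter()
    for partition in partitions:  # one loop iteration per site
        total.update(local_counts(partition, candidates))
    return total


def task_parallel_counts(partitions, candidates, n_sites):
    """Each site counts only ITS SHARE of the candidates, over ALL partitions."""
    ordered = sorted(candidates, key=sorted)  # deterministic order for splitting
    total = Counter()
    for site in range(n_sites):  # one loop iteration per site
        share = ordered[site::n_sites]  # hypothetical round-robin split
        for partition in partitions:
            total.update(local_counts(partition, share))
    return total
```

Both functions return the same global counts. What differs is what each site must keep in memory (all candidates versus a share of them) and what must travel over the network (counts versus data), which is the trade-off the paragraph above describes.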
B. FDM ALGORITHM
The FDM (Fast Distributed Algorithm for Data Mining) algorithm, proposed in (Cheung et al. 1996), has the following distinguishing characteristics:
Candidate set generation is Apriori-like.
After the candidate sets are generated, different types of reduction techniques are applied, namely a local reduction and a global reduction, to eliminate some candidates at each site.
The FDM algorithm is shown below.
Input: DB_i    // database partition at each site S_i
Output: L    // set of all globally large itemsets
Algorithm: Iteratively execute the following program fragment (for the k-th iteration) distributively at each site S_i. The algorithm terminates when either L(k) = ∅, or the set of candidate sets CG(k) = ∅.
    if k = 1 then
        T_i(1) = get_local_count(DB_i, ∅, 1)
    else {
        CG(k) = ∪(i=1..n) CG_i(k) = ∪(i=1..n) Apriori_gen(GL_i(k-1))
        T_i(k) = get_local_count(DB_i, CG(k), i) }
    for each X ∈ T_i(k) do
        if X.sup_i ≥ s × D_i then
            for j = 1 to n do
                if polling_site(X) = S_j then
                    insert ⟨X, X.sup_i⟩ into LL_i,j(k)
    for j = 1 to n do
        send LL_i,j(k) to site S_j
    for j = 1 to n do {
        receive LL_j,i(k)
        for each X ∈ LL_j,i(k) do {
            if X ∉ LP_i(k) then
                insert X into LP_i(k)
            update X.large_sites } }
    for each X ∈ LP_i(k) do
        send_polling_request(X);
    reply_polling_request(T_i(k))
    for each X ∈ LP_i(k) do {
        receive X.sup_j from sites S_j where S_j ∉ X.large_sites
        X.sup = Σ(i=1..n) X.sup_i
        if X.sup ≥ s × D then
            insert X into G_i(k) }
    broadcast G_i(k)
    receive G_j(k) from all other sites S_j, (j ≠ i)
    L(k) = ∪(i=1..n) G_i(k)
    divide L(k) into GL_i(k), (i = 1, ..., n)
    return L(k).
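The local and global reduction steps of the algorithm above can be simulated in a single process. The Python sketch below is a simplification under stated assumptions, not the FDM implementation: sites are plain lists of transactions, message passing is replaced by function calls, the candidates for the iteration are given as input, and `polling_site` is a hypothetical assignment rule that assumes string items. The pruning is safe because an item set that is globally large (X.sup ≥ s × D) must be locally large (X.sup_i ≥ s × D_i) at one site or more.

```python
def local_count(partition, itemset):
    """X.sup_i: number of transactions at one site that contain the item set."""
    return sum(1 for t in partition if itemset <= t)


def polling_site(itemset, n_sites):
    """Hypothetical rule assigning each candidate to one polling site.

    Assumes the items are strings; any deterministic rule would do.
    """
    return sum(map(ord, "".join(sorted(itemset)))) % n_sites


def fdm_iteration(partitions, candidates, s):
    """Simulate one FDM-style iteration; return {globally large item set: X.sup}."""
    n = len(partitions)
    total_size = sum(len(p) for p in partitions)  # D

    # Local reduction: site i forwards <X, X.sup_i> to X's polling site
    # only if X is locally large at site i.
    polled = {j: {} for j in range(n)}  # LP_j(k): X -> {site: X.sup_site}
    for i, partition in enumerate(partitions):
        for x in candidates:
            sup_i = local_count(partition, x)
            if sup_i >= s * len(partition):
                polled[polling_site(x, n)].setdefault(x, {})[i] = sup_i

    # Global reduction: each polling site requests the missing counts from
    # the sites where X was not locally large, then tests the global support.
    globally_large = {}
    for j in range(n):
        for x, known in polled[j].items():
            for i, partition in enumerate(partitions):
                if i not in known:
                    known[i] = local_count(partition, x)  # polling request
            sup = sum(known.values())
            if sup >= s * total_size:
                globally_large[x] = sup
    return globally_large
```

Candidates that are not locally large anywhere never reach a polling site, so no counts are exchanged for them. That saving in messages is the point of the local reduction.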
VIII. NIADD ALGORITHM
Parallel processing involves taking a large task, dividing it into several smaller tasks, and then working on each of those smaller tasks simultaneously. The goal of this divide-and-conquer approach is to complete the larger task in less time than it would have taken to do it in one large chunk. In parallel computing, computer hardware is designed to work with multiple processors and provides a means of communication between those processors. Application software has to break large tasks into multiple smaller tasks and perform them in parallel. NIADD is an algorithm that strives to take maximum advantage of RDBMS features such as parallel processing.