
International Journal of Applied Engineering Research ISSN 0973-4562 Volume 9, Number 24 (2014) pp. 30795-30812 © Research India Publications http://www.ripublication.com

Anomaly Detection Via Eliminating Data Redundancy and Rectifying Duplication in Uncertain Data Streams

M. Nalini 1* and S. Anbu 2

1* Research Scholar, Department of Computer Science and Engineering, St.Peter's University, Avadi, Chennai, India, email: nalinicseme@gmail.com
2 Professor, Department of Computer Science and Engineering, St.Peter's College of Engineering and Technology, Avadi, Chennai, India, email: anbuss16@gmail.com

Abstract

Anomaly detection is an important problem in emerging fields such as data warehousing and data mining. Various generic anomaly detection techniques have already been developed for common applications. Most real-time systems that rely on consistent data to offer high-quality services are affected by record duplication, quasi-replicas, or partially erroneous data, and both government and private organizations are developing procedures for eliminating replicas from their data repositories. To maintain the quality of the data, the database, and database-related applications, the data must be error-free and de-duplicated so that the accuracy of the queries posed against it improves. In this paper, we propose a Particle Swarm Optimization (PSO) approach to record de-duplication that combines several pieces of evidence extracted from the data content to identify a de-duplication method capable of deciding whether two entries in a data repository are replicas or not. In our experiments, the proposed approach outperforms the state-of-the-art methods reported in the literature. The approach is implemented in the .NET Framework (2010), and the results show that it is more efficient than the existing approaches.

Keywords: Data Duplication, Error Correction, DBMS, Data Mining, TPR, FNR.

INTRODUCTION

Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. These non-conforming patterns are referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities, or contaminants in different application domains. Of these, anomalies and outliers are the two terms used most commonly in the context of anomaly detection, sometimes interchangeably. Anomaly detection finds extensive use in a wide variety of applications, such as fraud detection for credit cards, insurance, or health care, intrusion detection for cyber-security, fault detection in safety-critical systems, and military surveillance for enemy activities. Its importance stems from the fact that anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains. For example, an anomalous traffic pattern in a computer network could mean that a hacked computer is sending sensitive data to an unauthorized destination; an anomalous MRI image may indicate the presence of malignant tumors; anomalies in credit card transaction data could indicate credit card or identity theft; and anomalous readings from a spacecraft sensor could signify a fault in some component of the spacecraft. Detecting outliers or anomalies in data has been studied in the statistics community since as early as the 19th century. Over time, a variety of anomaly detection techniques have been developed in several research communities; many have been developed for specific application domains, while others are more generic.

In this paper the data taken into account is uncertain, and its size is large. A PSO approach is applied to find and count duplications: it combines several pieces of evidence extracted from the data content to produce a de-duplication method that can identify whether any two entries in a repository are the same. In the same manner, more pieces of the data are combined as evidence and compared with the whole data, which is used as training data. This function is applied repeatedly over the whole data or over the repositories, and newly inserted data can be compared in the same way against the evidence to avoid replicas. A method applied to record de-duplication should accomplish individual but contradictory objectives: it should effectively increase the identification of replicated records while remaining efficient. Genetic programming (GP) [15] is chosen as the baseline approach, since it is suitable for finding accurate answers to a given problem without searching the whole data. For the record de-duplication problem, existing approaches [14, 16] apply genetic programming and provide good solutions. In this paper, the results of the existing system in [16] are taken for comparison with our PSO-based approach, which is able to automatically find more effective de-duplication methods. Moreover, the PSO-based approach can interoperate with the best existing de-duplication methods by adjusting the replica-identification limits used to classify a pair of records as a match or not. Our experiments use real datasets containing scientific article citations and hotel index records, as well as synthetically generated datasets that allow a controlled experimental environment. Our approach can be applied in all these scenarios.


Overall, the contributions of this paper, a PSO-based approach to find and count duplications, are as follows:

A solution with low computational time for duplication detection.

Reduced pairwise comparisons, using PSO to find the similarity values.

Selection of replicas by computing the TPR and FPR among the data.

Rectification of errors in the data entries.

RELATED WORKS

In [3] the author proposed an approach to data reduction; such data reduction functions are essential to machine learning and data mining, and an agent-based population algorithm is used to solve the reduction problem. Data reduction alone, however, is not sufficient to improve the quality of databases. Databases of various sizes are used to provide high-quality classification of the data in order to find anomalies. In [4], two kinds of algorithms, evolutionary and non-evolutionary, are applied and their results compared to find the algorithm best suited to anomaly detection. N-ary relations are computed to define patterns in the dataset in [5], which provides relations in one-dimensional data. DBLEARN and DBDISCOVER [6] are two systems developed to analyze relational DBMSs. The main objective of the data mining technique in [7] is to detect and classify data in a huge database without compromising the speed of the process; PCA is used for data reduction and SVM for data classification. In [8] a data redundancy method is explored using a mathematical representation. Software with safe, correct, and reliable operation has been developed for avionics and automobile database systems [9]. A statistical QA (Question-Answer) model is applied to develop a prototype that avoids web-based data redundancy [10]. For Geographic Data Warehouses (GDW) [11], Spatial On-Line Analytical Processing (SOLAP) is applied to the Gist database and other spatial databases for analysis, indexing, and generating various reports without error. In [12], an effective method was proposed for P2P data sharing in which duplicated data is removed during sharing. Web entity data extraction associated with the attributes of the data [13] can be obtained using a novel approach that exploits duplicated attribute-value pairs. De Carvalho et al. [1] used genetic programming to mark duplications and perform de-duplication, concentrating mainly on identifying whether entries in a repository are replicas; this approach outperformed earlier approaches by 6.2% in accuracy on the two data sets reported in [2]. Our proposed approach can be extended to various benchmark and real-time data, such as time series data, clinical data, and the 20 Newsgroups collection.

PARTICLE SWARM OPTIMIZATION GENERAL CONCEPTS

The natural selection process influences virtually all living things, and it has inspired the ideas behind evolutionary programming approaches. Particle swarm optimization (PSO) is one of the best-known evolutionary programming techniques. It is considered a heuristic approach and was initially applied to optimizing data properties and availability. PSO can also handle multi-objective problems with constrained environments. PSO and the other evolutionary approaches are widely known and applied in a variety of applications because of their good performance when searching over a large data space. Instead of processing a single point in the problem's search space, PSO creates a population of individuals. This behavior is the essential aspect of the PSO approach: it creates additional new solutions with new combinations of features and moves them forward, comparing them with the existing solutions in the search space.

PARTICLE OPERATIONS

PSO generates random particles representing individuals. In this paper, the particles model trees representing arithmetic functions, as illustrated in Figure-1. When using this tree representation in the PSO-based approach, the set of all inputs, variables, constants, and methods must be defined [8]. The nodes terminating the trees are called leaves. A collection of operators, statements, and methods is used in the PSO evolutionary process to manipulate the terminal values; all of these methods are placed in the internal nodes of the tree, as shown in Figure-1. In general, PSO models the social behavior of birds: in order to search for food, every bird in a flock is guided by a velocity based on its personal experience and on information collected by interacting with the other members of the flock. This is the basic idea behind PSO. Each particle denotes a bird, and its flight denotes searching the subspace of the optimization problem for the optimum solution. In PSO, the set of solutions within an iteration is called the swarm, and its size equals the population.

FIGURE-1: Tree Used For Mapping A Function [tree with internal operator nodes '+' and '/', leaf nodes x and z; example mapping Tree(a, b, c) = a + (b + b)]
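To make the tree representation of Figure-1 concrete, the following is a minimal illustrative sketch in Python (not the implementation used in this work); the Node class and the evaluate method are names introduced only for this example.

import operator

class Node:
    def __init__(self, value, children=()):
        self.value = value              # operator symbol or variable name
        self.children = list(children)

    def evaluate(self, env):
        ops = {"+": operator.add, "-": operator.sub,
               "*": operator.mul, "/": operator.truediv}
        if not self.children:           # leaf node: look up the variable
            return env[self.value]
        args = [child.evaluate(env) for child in self.children]
        return ops[self.value](*args)

# The example of Figure-1: Tree(a, b, c) = a + (b + b)
tree = Node("+", [Node("a"), Node("+", [Node("b"), Node("b")])])
print(tree.evaluate({"a": 1, "b": 2, "c": 5}))   # prints 5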


PROPOSED APPROACH

The proposed approach uses the PSO optimization method to find the difference between the entities of each record in a database. This difference gives a similarity index between two data entities and decides the duplication: if the distance between two data entities E_i and E_j is less than a threshold value τ, then E_i and E_j are declared duplicates. The PSO algorithm applied in this paper is given below, and a minimal illustrative sketch of the loop follows the listing.

1. Generate a random population P, each particle representing an individual data entry.
2. Assume a random feasible solution from the particles.
3. For i = 1 to P:
4. Evaluate all particles based on the objective function.
5. The objective function is f = dist(E[i], E[j]); the duplicate decision uses the probabilistic criterion Pr{·} ≥ α defined later in (7).
6. Gbest = the particle with the best solution.
7. Compute the velocity of the Gbest particles.
8. Update the current position of the best solution.
9. Next i
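A minimal, generic PSO loop that follows steps 1-9 is sketched below; it is illustrative only, and the objective function used in the example (a simple sum of squares) is a stand-in for the pairwise distance of step 5.

import random

def pso_minimize(objective, dim, n_particles=20, iters=50,
                 w=0.7, c1=1.4, c2=1.4, bounds=(-1.0, 1.0)):
    lo, hi = bounds
    # Steps 1-2: random population, each particle a candidate solution.
    pos = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):                        # Steps 3-9
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                # Step 7: velocity update towards personal and global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]            # Step 8: position update
            val = objective(pos[i])               # Steps 4-5: evaluate objective
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:               # Step 6: keep the global best
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Illustrative objective standing in for the pairwise distance of step 5.
best, best_val = pso_minimize(lambda p: sum(x * x for x in p), dim=3)
print(best, best_val)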

A database is a rectangular table consisting of a number of records:

D = {R_1, R_2, ..., R_n}   --- (1)

and each record R_i has a number of entities:

R_i = {E_i1, E_i2, ..., E_im}   --- (2)

where E_ij is the entity at row i and column j of the data; i indexes the rows and j the columns. In this paper the threshold value τ is a user-defined, very small value between 0 and 1.
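As an illustration of this record model, the following sketch represents a database D as a list of numeric records and flags a pair as duplicates when their distance falls below the threshold τ; the Euclidean distance, the tau value, and the example records are assumptions made only for this sketch.

import math

def distance(r1, r2):
    # Euclidean distance between two records of numeric entities (an assumption).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))

def find_duplicates(D, tau=0.05):
    # Flag a pair (R_i, R_j) as duplicates when dist(R_i, R_j) < tau.
    pairs = []
    for i in range(len(D)):
        for j in range(i + 1, len(D)):
            if distance(D[i], D[j]) < tau:
                pairs.append((i, j))
    return pairs

D = [[0.10, 0.20, 0.30],
     [0.10, 0.21, 0.30],   # near-replica of record 0
     [0.90, 0.10, 0.40]]
print(find_duplicates(D, tau=0.05))   # prints [(0, 1)]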

Fig.1: Proposed Approach [flow diagram: Load Data → Pre-process the Data → Divide Data into Windows → Normalize the Data → Find Similarity Ratio → Check Data Redundancy & Error → if yes, Mark Redundant Data; if no, Persist the Data → Anomaly Detection]

The overall functionality of the proposed approach is depicted in Fig.1. The database may be in any form, such as Oracle, SQL, MySQL, MS-Access, or Excel.

PREPROCESSING

Consider, as an example, the employee data of a multinational company whose branches are located all over the world. The entire data is read from the database and checked for '~', empty spaces, '#', '*', and other irrelevant characters placed as an entity in the database. For example, if an entity is numerical, it should contain only the digits 0 to 9; if it is a name, it should consist of alphabetic characters combined only with '.', '_', or '-'. When irrelevant characters are present in a dataset, the affected entities are treated as erroneous data and are corrected, removed, or replaced with relevant characters. If the data type of the field is a string, the preprocessing function assigns "NULL" to the corresponding entity; if the data type is numeric, it assigns zeros (according to the length of the numeric data type). Similarly, the preprocessing function replaces the entity with today's date if the data type is 'date', with '*' if the data type is 'character', and so on. Once the data is preprocessed, SQL queries return good results; otherwise errors are generated. For example, in Table-1 below, the first row gives the field names and the remaining rows contain records. In the first record the fourth field contains the irrelevant character '~', and in the second record the third field contains '##' instead of numbers. This causes an error when a query such as

Select City from EMP;

is passed against the table EMP [Table-1]. To avoid errors during query processing, the City and Age fields are corrected by verifying the original data sources. If this is not possible, "NULL" is used to replace and correct alphanumeric fields and "0" numeric fields. Records that cannot be corrected are marked ['*'] and moved to a separate pool area. A sketch of these preprocessing rules follows Table-1.

Table-1: Sample Error Records Pre-Processed and Marked ['*'] [EMP].

No     Name   Age  City     State  Comment
0001*  Kabir  45   ~ty             Employee
0002*  Ramu   ##   Chennai  TN     Employee
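The following is a minimal sketch of the preprocessing rules described above, using the field layout of Table-1; the regular expression and the schema dictionary are assumptions introduced only for illustration.

import re
from datetime import date

IRRELEVANT = re.compile(r"[~#*]")

def clean_field(value, field_type):
    # Replace an entity containing irrelevant characters by a type default.
    if not IRRELEVANT.search(str(value)):
        return value, False
    if field_type == "numeric":
        return 0, True                          # zeros for damaged numeric fields
    if field_type == "date":
        return date.today().isoformat(), True   # today's date for damaged dates
    return "NULL", True                         # NULL for damaged string fields

def preprocess(record, schema):
    cleaned, marked = {}, False
    for field, field_type in schema.items():
        cleaned[field], changed = clean_field(record.get(field, ""), field_type)
        marked = marked or changed
    if marked:                                  # mark the record ['*'] for the pool
        cleaned["No"] = str(cleaned.get("No", "")) + "*"
    return cleaned

schema = {"No": "string", "Name": "string", "Age": "numeric",
          "City": "string", "State": "string", "Comment": "string"}
print(preprocess({"No": "0002", "Name": "Ramu", "Age": "##",
                  "City": "Chennai", "State": "TN", "Comment": "Employee"}, schema))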

The entire data can be divided into sub-windows for easy and fast processing. Let the dataset be DB; it can be divided into sub-windows, shown in Fig.2 as DB1 and DB2, and each of DB1 and DB2 contains a number of windows 1, 2, ..., n.

DATA NORMALIZING

In general, an uncertain data stream is considered for anomaly detection. The main problem addressed in this paper is anomaly detection for any kind of data stream. Since the size of the data stream is huge, our approach divides the complete data into subsets of data streams. A data stream DS is divided into two uncertain data streams, DS1 and DS2, which are taken for our problem; both consist of a sequence of continuously arriving uncertain objects over various time intervals, denoted as

DS1 = {x[1], x[2], ..., x[t], ...}   --- (3)

DS2 = {y[1], y[2], ..., y[t], ...}   --- (4)

where x[i] (or y[i]) is a k-dimensional uncertain object at time interval i and t is the current time interval. To group nearest neighbors, the operator should retrieve close pairs of objects within a period. Thus a compartment window concept is adopted for the uncertain stream group (USG) operator. As Figure-2 shows, a USG operator always considers the most recent cw uncertain objects in each stream, that is

CW(DS1) = {x[t-cw+1], x[t-cw+2], ..., x[t]}   --- (5)

CW(DS2) = {y[t-cw+1], y[t-cw+2], ..., y[t]}   --- (6)

at the current time interval t.


In other words, when a new uncertain object x[t+1] (y[t+1]) arrives at the next time interval (t+1), it is appended to DS1 (DS2). At that point the old object x[t-cw+1] (y[t-cw+1]) expires and is ejected from memory. Thus, USG at time interval (t+1) is conducted on a new compartment window {x[t-cw+2], ..., x[t+1]} ({y[t-cw+2], ..., y[t+1]}) of size cw. A minimal sketch of this sliding update is given below.
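The sliding compartment-window update can be sketched as follows; the CompartmentWindow class is an assumed name and the toy stream values are illustrative only.

from collections import deque

class CompartmentWindow:
    def __init__(self, cw):
        self.cw = cw
        self.window = deque(maxlen=cw)   # the oldest object is ejected automatically

    def append(self, obj):
        self.window.append(obj)          # obj is one uncertain object (its samples)
        return list(self.window)         # current contents of CW(DS)

cw1 = CompartmentWindow(cw=3)
current = []
for obj in [[0.1], [0.2], [0.3], [0.4]]:   # stream x[1], x[2], x[3], x[4]
    current = cw1.append(obj)
print(current)   # prints [[0.2], [0.3], [0.4]]: x[1] has expired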

Fig.2: Data Set Divided as Sub-Windows [diagram: uncertain data streams DS1 and DS2, the compartment windows CW(DS1) and CW(DS2) at time interval t, the expired uncertain object at time interval (t+1), the newly arriving uncertain object y[t+1], and the USG answers]

To group the uncertain data streams, the two data streams DS1 and DS2, a distance threshold ε, and a probabilistic threshold α ∈ [0, 1] are given. A group over uncertain data streams continuously monitors pairs of uncertain objects x[i] and y[j] within the compartment windows CW(DS1) and CW(DS2), respectively, of size cw at the current time stamp t. Here, the data streams DS1 and DS2 are compared to find the similarity distance, which is obtained using PSO, such that

PSO: Pr{dist(x[i], y[j]) ≤ ε} ≥ α   --- (7)

holds, where t-cw+1 ≤ i, j ≤ t and dist(·, ·) is the Euclidean distance between two objects. To perform the USG of Equation (7), users register two parameters in PSO, the distance threshold ε and the probabilistic threshold α. Since each uncertain object at a timestamp consists of R samples, the grouping probability Pr{dist(x[i], y[j]) ≤ ε} in Inequality (7) can be rewritten via samples as

Pr{dist(x[i], y[j]) ≤ ε} = (1 / R^2) Σ_{x_a[i] ∈ x[i]} Σ_{y_b[j] ∈ y[j]} I( dist(x_a[i], y_b[j]) ≤ ε )   --- (8)

where the indicator I(·) is 1 when its condition holds and 0 otherwise.


Note that one straightforward method to perform USG directly over compartment windows is to follow the USG definition: for every object pair <x[i], y[j]> from the compartment windows CW(DS1) and CW(DS2), respectively, we compute the grouping probability that x[i] is within distance ε of y[j] (via samples) based on (8). If the resulting probability is greater than or equal to the probabilistic threshold α, the pair <x[i], y[j]> is reported as a USG answer; otherwise it is a false alarm and can be discarded. The number of false alarms is counted using PSO by repeating the procedure n times and generating particles in the search space for each individual data item. Window-based data makes comparison, verification, and other related tasks easy and fast for any DBMS; for example, a database of 1000 records can be divided into 4 sub-datasets of 250 records each. Data in the database can be normalized using any normal form for fast and accurate query processing. In this paper a user-defined normalization is also applied to improve efficiency, such as arranging the data in ascending or descending order according to the SQL query keywords. A sketch of the sample-based check follows.
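The sample-based check of (7)-(8) can be sketched as follows; the Euclidean distance, the equal weighting of samples, and the example sample values are assumptions of this illustration, not part of the reported implementation.

import math

def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def grouping_probability(x_samples, y_samples, eps):
    # Fraction of sample pairs within distance eps, as in Equation (8).
    hits = sum(1 for xs in x_samples for ys in y_samples if dist(xs, ys) <= eps)
    return hits / (len(x_samples) * len(y_samples))

def usg_pair(x_samples, y_samples, eps, alpha):
    # Report the pair as a USG answer when Inequality (7) holds.
    return grouping_probability(x_samples, y_samples, eps) >= alpha

x = [[0.10, 0.20], [0.11, 0.19], [0.12, 0.21]]   # R = 3 samples of x[i]
y = [[0.10, 0.22], [0.30, 0.40], [0.11, 0.20]]   # R = 3 samples of y[j]
print(grouping_probability(x, y, eps=0.05))       # about 0.67
print(usg_pair(x, y, eps=0.05, alpha=0.5))        # True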

PSO BASED SIMILARITY COMPUTATION

This paper applies a PSO-based comparison to decide whether two data items are similar or dissimilar. PSO uses a measurement between two data items in a database defined by appropriate features. Because it accounts for unequal variances as well as the correlation between features, it adequately evaluates the distance by assigning different weights, or importance factors, to the features of the data entities. In this way the inconsistency of data can be removed in real-time digital libraries. Assume two groups, G1 and G2, holding data about girls and boys in a school; a number of girls are categorized into the same sub-group of G1 since their attributes or characteristics are the same. It is computed by PSO as

S = (C)^(-1)   --- (9)

where C is the covariance matrix of the attribute values; S serves as the weighting matrix in the Similarity-Distance of (16).

The correlation among datasets is computed using the Similarity-Distance. Data entities are the main objects of data mining, and they are arranged in an order according to their attributes. A data entity with K attributes is considered a K-dimensional vector, represented as

X_i = (x_i1, x_i2, ..., x_iK)   --- (10)

N such data entities form a set

X = (X_1, X_2, ..., X_N)   --- (11)

which is known as the data set. X can be represented by an N × K matrix

X = [x_ij]   --- (12)


where x_ij is the j-th component of the data entity X_i. There are various methods used for data mining; numerous methods, for example nearest-neighbor classification techniques, cluster analysis, and multi-dimensional scaling, are based on measures of similarity between data. Instead of measuring similarity, measuring dissimilarity among the entities gives the same results. One parameter that can be used for measuring dissimilarity is distance; this category of measures is also known as separability, divergence, or discrimination measures. A distance metric is a real-valued function d(·, ·) such that for any data points x, y, and z:

d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y   --- (13)

d(x, y) = d(y, x)   --- (14)

d(x, z) ≤ d(x, y) + d(y, z)   --- (15)

The first property (13), positive definiteness, assures that the distance is non-negative and is zero only when the points are the same. The second property indicates the symmetric nature of the distance, and the third is the triangle inequality. Various distance formulas are available, such as the Euclidean, Manhattan, Lp-norm, and Similarity distances. In this paper the Similarity-Distance is taken as the main method to find the similarity distance between two data sets. The distance among a set of observed groups in an m-dimensional space determined by m variables is known as the Similarity-Distance; a small distance value says that the data in the groups are very close, and a large one that they are not. The mathematical formula of the Similarity-Distance for two data samples X and Y is written as

d(X, Y) = sqrt( (X - Y)^T S (X - Y) )   --- (16)

where S is the inverse covariance matrix. The similarity value between the sub-windows of dataset DB1 and dataset DB2 is computed, and the result is stored in a variable named score:

score[i] = d( W_i(DB1), W_i(DB2) )   ---- (17)

0 < score[i] ≤ τ : the windows are more or less similar
score[i] = 0     : the windows are exactly the same
score[i] > τ     : the windows are different   ---- (18)

The first line of (18) says that the data available in the two windows W_i(DB1) and W_i(DB2) are more or less similar, the second line that they are exactly the same, and the third line that they are different. Whenever the distance between the datasets satisfies score[i] = 0 or score[i] ≤ τ, both data are marked in the DB. The value of score[i] leads to two decisions:


TPR: if the similarity value lies above this boundary (in [-1, 1]), the records are considered replicas; TNR: if the similarity value lies below this boundary, the records are considered not to be replicas.

If the similarity value lies between the two boundaries, the records are classified as "possible matches", and human judgment is needed to decide the matching score. Most existing approaches to replica identification depend on several choices for setting their parameters, and these choices may not always be optimal. Setting these parameters requires accomplishing the following tasks:

Selecting the best proof to use as evidence: it takes more time to find the duplications, since more processing is needed to compute the similarity among the data.
Deciding how to merge the best evidence: some evidence may be more effective for duplication identification than others.
Finding the best boundary values to be used: bad boundaries may increase the number of identification errors (e.g., false positives and false negatives), nullifying the whole process.

Window 1 from DB1 is compared with window 1, window 2, window 3, and so on from DB2, which can be written as

score[i] = W_i(DB1) - Σ_j W_j(DB2)   ---- (19)

If score[i] = 0, then W_i(DB1) and W_j(DB2) are the same and the pair is marked as a duplicate; otherwise W_i(DB1) is compared with the next W_j(DB2). The objective of this paper is to improve the quality of the data in a DBMS so that it is error-free and can provide fast answers to any SQL query; the work also concentrates on de-duplication, where possible, in the data model. The removal of duplicates in government organizations is inefficient and difficult, whereas avoiding duplicate data provides high-quality retrieval from huge data sets such as banking. A sketch of the window-level similarity computation and classification is given below.
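The window-level similarity computation and the classification of (17)-(18) can be sketched as below, assuming a Mahalanobis-style form of the Similarity-Distance in (16) (the pooled inverse covariance as weighting matrix) and an arbitrary threshold τ; this is an illustration, not the reported implementation.

import numpy as np

def similarity_distance(w1, w2):
    # Mahalanobis-style distance between the mean vectors of two windows,
    # weighted by the pooled inverse covariance (pinv handles singular cases).
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    pooled = np.vstack([w1, w2])
    s_inv = np.linalg.pinv(np.cov(pooled, rowvar=False))
    diff = w1.mean(axis=0) - w2.mean(axis=0)
    return float(np.sqrt(diff @ s_inv @ diff))

def classify(score, tau=0.1):
    # Three-way decision of (18).
    if score == 0:
        return "exactly the same (mark as duplicate)"
    if score <= tau:
        return "more or less similar"
    return "different"

w1 = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9]]
w2 = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9]]
score = similarity_distance(w1, w2)
print(score, classify(score))   # identical windows give score 0.0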

DATA:

To evaluate the proposed approach, two real datasets commonly employed for evaluating record de-duplication are used; they are based on data gathered from web indexes. Additionally, further datasets were created using a synthetic data set generator. The first dataset is a collection of 1,295 distinct citations to 122 computer science papers taken from the Cora research paper search engine; these citations were separated into multiple attributes (author names, year, title, venue, pages, and other info) by an information mining system. The second real dataset is the Restaurants dataset, comprising 864 records of restaurant names and supplementary data, together with 112 replicas obtained by integrating records from the Fodor and Zagat guidebooks; the attributes used from this dataset are (restaurant) name, address, city, and specialty. The synthetic datasets were created using the Synthetic Data Set Generator (SDG) [32] available in the Febrl [26] package.


Since suitable real datasets, such as time series data, the 20 Newsgroups dataset, and customer data from OLX.in, are not always sufficient or easily accessible for the experiment, synthetic data were also generated. These contain fields such as name, age, city, address, and phone numbers (including identifiers like a social security number). Using SDG, errors and duplications can also be introduced into the data manually, and some modifications can be applied at the attribute level of the records. The data taken for the experiments are:

DATA-1: This data set contains four files of 1000 records (600 originals and 400 duplicates), with a maximum of five duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of two modifications in a single attribute and in the full record.

DATA-2: This data set contains four files of 1000 records (750 originals and 250 duplicates), with a maximum of five duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of two modifications in a single attribute and four in the full record.

DATA-3: This data set contains four files of 1000 records (800 originals and 200 duplicates), with a maximum of seven duplicates based on one original record (using a Poisson distribution of duplicate records) and with a maximum of four modifications in a single attribute and five in the full record. The duplication can be applied to each attribute of the data in the form of <attribute, similarity function> pairs, i.e., the evidence.

The experiment on the time series data was carried out in MATLAB, and the time complexity was compared with the existing system; the elapsed time taken by the proposed approach is 5.482168 seconds. The results obtained for all the functionalities defined in Fig.1 are depicted in Fig.3 to Fig.6.


Fig.3: Original Data Not Preprocessed

Fig.3 shows the data in its original form as taken from the web; it has errors, redundancy, and noise. The three lines show that the data DB is divided into DB1, DB2, and DB3. It is clear in the figure that DB1, DB2, and DB3 coincide and overlap in many places, which indicates data redundancy, and the zigzag form shows that the data is not preprocessed. In the time series data, 14 numerical entries are preprocessed (replaced with 0's), which was verified from the database.


Fig.4: Preprocessed Data


The preprocessed and normalized data is shown in Fig.4. The user-defined normalization arranges the data in an order for easy processing. Even within DB1, DB2, and DB3 there is overlapping data, which indicates that much of the data is similar; Fig.4 clearly shows that DB1 and DB2 have more similar, overlapping data. After the similarity index is found, such identical data can be removed, which simplifies processing and marks it as duplication.


Fig.5: Single Window Data in DB1

After normalization, the data is divided into windows, as shown in Fig.5, where the window size of 50 is defined by the developer; each window holds 50 data items for fast comparison. In order to confirm the behavior observed with real data, we conducted additional experiments using our synthetic data sets. The user-selected evidence setup used in this experiment was built from the following list of evidence:

<firstname, PSO>, <lastname, PSO>, <street number, string distance>,

<address1, PSO>, <address2, PSO>, <suburb, PSO>, <postcode, string distance>,

<state, PSO>, <date of birth, string distance>, <age, string distance>,

<phone number, string distance>, <social security number, string distance>.

This list of evidence, using the PSO similarity function for free-text attributes and a string distance function for numeric attributes, was chosen because it required less processing time in our initial tuning tests. A sketch of how such an evidence list can be combined into a single score is given below.
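The sketch below shows one way to represent such an evidence list and combine it into a single similarity score; the sequence-matcher similarity used here is a simple stand-in for the PSO similarity and string-distance functions, and the example records are invented for illustration only.

import difflib

def text_similarity(a, b):
    # Stand-in for the PSO similarity function on free-text attributes.
    return difflib.SequenceMatcher(None, str(a), str(b)).ratio()

def digit_similarity(a, b):
    # Stand-in for the string-distance evidence on numeric attributes.
    return difflib.SequenceMatcher(None, str(a), str(b)).ratio()

# Evidence list: <attribute, similarity function> pairs (a subset of the list above).
EVIDENCE = {
    "firstname": text_similarity, "lastname": text_similarity,
    "address1": text_similarity, "suburb": text_similarity,
    "postcode": digit_similarity, "phone number": digit_similarity,
}

def combined_score(rec_a, rec_b):
    # Average the per-attribute evidence into a single similarity score.
    values = [func(rec_a.get(attr, ""), rec_b.get(attr, ""))
              for attr, func in EVIDENCE.items()]
    return sum(values) / len(values)

a = {"firstname": "Kabir", "lastname": "Rao", "address1": "12 Main St",
     "suburb": "Avadi", "postcode": "600054", "phone number": "9876543210"}
b = {"firstname": "Kabir", "lastname": "Rao", "address1": "12 Main Street",
     "suburb": "Avadi", "postcode": "600054", "phone number": "9876543210"}
print(round(combined_score(a, b), 3))   # close to 1.0 for near-replicas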


Table-2: Original Data


Data set          Original Data  Good Data  Similar Data  Error Data
Time Series       1000           600        400           24%
Restaurant        1000           750        250           15%
Student Database  1000           800        200           12.4%
Cora              1000           700        300           19.2%

Table-3: Data Duplication Detection and De-Duplication

Data set     Original Data  Marked Duplication  De-Duplicated  Not De-Duplicated
Time Series  1000           400                 395            5
Restaurant   1000           250                 206            44
Student DB   1000           200                 146            54
Cora         1000           300                 244            46


Fig.6: Performance Evaluation of Proposed Approach

The performance of the proposed approach is evaluated by comparing the detection of duplications and errors, the marking of duplications, the number of de-duplications achieved, and the error correction for the various datasets. Fig.6 shows the performance evaluation of the proposed approach using the Similarity-Distance. According to the distance score, the duplicate and erroneous records are detected and marked. The Similarity-Distance rectifies errors of 24%, 15%, 12.4%, and 19.2% for the Time Series, Restaurant, Student, and Cora data respectively. The number of duplicate records detected by PSO is 400, 250, 200, and 300 for the Time Series, Restaurant, Student, and Cora data, and the de-duplicated records are 395, 206, 146, and 244 respectively. Because of complexity or errors in the data, de-duplication is not obtained 100%. Some performance metrics can be calculated to quantify the accuracy of our proposed approach:

TPR = (number of duplications found correctly) / (total number of data)

FPR = (number of duplications wrongly reported) / (total number of data to be identified)

TNR = TN / N and FNR = FN / P

Sensitivity = TP / P = 99%

Specificity = TN / N = 88.5%

Accuracy = (TP + TN) / (P + N) = 96.3%

where P = TP + FN and N = FP + TN. The proposed approach proves more efficient in terms of duplication detection, error detection, and de-duplication, with an accuracy of 96.3%; hence the Similarity-Distance based duplication detection is more efficient. A worked check of these metrics follows.
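As a worked check of the metric definitions, the counts below are illustrative values chosen only so that the formulas reproduce the reported 99%, 88.5%, and 96.3%; they are not the confusion-matrix counts of the actual experiments.

def metrics(tp, fp, tn, fn):
    p, n = tp + fn, fp + tn
    return {"sensitivity": tp / p,
            "specificity": tn / n,
            "accuracy": (tp + tn) / (p + n)}

print(metrics(tp=5148, fp=207, tn=1593, fn=52))
# sensitivity = 5148/5200 = 0.99, specificity = 1593/1800 = 0.885,
# accuracy = 6741/7000 = 0.963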

CONCLUSION

In this paper the PSO-based distance method is used as the main method for finding similarity (redundancy) in a database; the similarity score is computed for various databases and the performance is compared. The accuracy obtained with the proposed approach is 96.3% across four different databases: the time series data is in Excel form, the Cora data is in table form, the student data is in MS-Access, and the restaurant data is in an SQL table. The experimental results show that, with the proposed approach, anomaly detection and removal in terms of data redundancy and error is straightforward. In future work, reliability and scalability will be investigated with respect to data size and data variation.


REFERENCES


[1] Moisés G. de Carvalho, Alberto H.F. Laender, Marcos André Gonçalves, and Altigran S. da Silva, "A Genetic Programming Approach to Record Deduplication," IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 3, March 2012.

[2] M. Wheatley, "Operation Clean Data," CIO Asia Magazine, http://www.cio-asia.com, Aug. 2004.

[3] Ireneusz Czarnowski, Piotr Jędrzejowicz, "Data Reduction Algorithm for Machine Learning and Data Mining," Volume 5027, 2008, pp. 276-285.

[4] Jose Ramon Cano, Francisco Herrera, Manuel Lozano, "Strategies for Scaling Up Evolutionary Instance Reduction Algorithms for Data Mining," Soft Computing, Volume 163, 2005, pp. 21-39.

[5] Gabriel Poesia, Loïc Cerf, "A Lossless Data Reduction for Mining Constrained Patterns in n-ary Relations," Machine Learning and Knowledge Discovery in Databases, Volume 8725, 2014, pp. 581-596.

[6] Nick J. Cercone, Howard J. Hamilton, Xiaohua Hu, Ning Shan, "Data Mining Using Attribute-Oriented Generalization and Information Reduction," Rough Sets and Data Mining, 1997, pp. 199-22.

[7] Vikrant Sabnis, Neelu Khare, "An Adaptive Iterative PCA-SVM Based Technique for Dimensionality Reduction to Support Fast Mining of Leukemia Data," SocProS 2012.

[8] Paul Ammann, Dahlard L. Lukes, John C. Knight, "Applying data redundancy to differential equation solvers," Annals of Software Engineering, 1997, Volume 4, Issue 1, pp. 65-77.

[9] P. E. Ammann, "Data Redundancy for the Detection and Tolerance of Software Faults," Computing Science and Statistics, 1992, pp. 43-52.

[10] Rita Aceves-Pérez, Luis Villaseñor-Pineda, Manuel Montes-y-Gómez, "Towards a Multilingual QA System Based on the Web Data Redundancy," Lecture Notes in Computer Science, Volume 3528, 2005, pp. 32-37.

[11] Thiago Luís Lopes Siqueira, Cristina Dutra de Aguiar Ciferri, Valéria Cesário Times, Anjolina Grisi de Oliveira, Ricardo Rodrigues Ciferri, "The impact of spatial data redundancy on SOLAP query performance," Journal of the Brazilian Computer Society, June 2009, Volume 15, Issue 2, pp. 19-34.

[12] Ahmad Ali Iqbal, Maximilian Ott, Aruna Seneviratne, "Removing the Redundancy from Distributed Semantic Web Data," Database and Expert Systems Applications, Lecture Notes in Computer Science, Volume 6261, 2010, pp. 512-519.

[13] Yanxu Zhu, Gang Yin, Xiang Li, Huaimin Wang, Dianxi Shi, Lin Yuan, "Exploiting Attribute Redundancy for Web Entity Data Extraction," Digital Libraries: For Cultural Heritage, Knowledge Dissemination, and Future Creation, Lecture Notes in Computer Science, Volume 7008, 2011, pp. 98-107.

[14] M.G. de Carvalho, M.A. Gonçalves, A.H.F. Laender, and A.S. da Silva, "Learning to Deduplicate," Proc. Sixth ACM/IEEE CS Joint Conf. Digital Libraries, pp. 41-50, 2006.

[15] J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.

[16] M.G. de Carvalho, A.H.F. Laender, M.A. Gonçalves, and A.S. da Silva, "Replica Identification Using Genetic Programming," Proc. 23rd Ann. ACM Symp. Applied Computing (SAC), pp. 1801-1806, 2008.