
International Conference on Computer Science and Information Technology 2008

Handling Noisy Data using Attribute Selection and Smart Tokens

J. Jebamalar Tamilselvi
PhD Research Scholar
Department of Computer Application
Karunya University
Coimbatore – 641 114, Tamilnadu, INDIA
E-mail: jebamalar@karunya.edu

Dr. V. Saravanan
Associate Professor & HOD
Department of Computer Application
Karunya University
Coimbatore – 641 114, Tamilnadu, INDIA
E-mail: saravanan@karunya.edu

Abstract

Data cleaning is the process of identifying and determining the problems expected when integrating data from different sources or from a single source. Many problems can occur in a data warehouse while data is being loaded or integrated, and the main one is noisy data. Noisy data arises from the misuse of abbreviations, data entry mistakes, duplicate records and spelling errors. The proposed algorithm handles noisy data efficiently by expanding abbreviations, removing unimportant characters and eliminating duplicates. An attribute selection algorithm selects the relevant attributes before token formation. Together, the attribute selection algorithm and the token formation algorithm reduce the complexity of the data cleaning process and clean data flexibly and effortlessly, without confusion. This research work uses smart tokens to increase the speed of the mining process and to improve the quality of the data.
1. Introduction
The data cleaning process is used to improve the quality of data before the mining process. Many errors are introduced while integrating data warehouses or while loading a single data warehouse, largely through data entry problems. One of the main errors in a data warehouse is noisy data: a random error or variance in a measured variable. Noisy data results from the misuse of abbreviations, data entry mistakes, duplicate records and spelling errors [16].

Attribute selection is very important for reducing the time of the data cleaning process. An attribute selection algorithm is effective in reducing the number of attributes, removing irrelevant attributes, increasing the speed of the data cleaning process, and improving the clarity of the result. An intelligent attribute selection is therefore proposed as an initial step in data cleaning. Many approaches are available for selecting attributes for the mining process in order to reduce the dimensionality of the data warehouse; however, the recent increase in data dimensionality makes it difficult to keep the data cleaning process both efficient and effective. The efficiency and effectiveness of the attribute selection method are demonstrated through extensive comparisons with the proposed method using real-world data of high dimensionality [3], [5].
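The selection criteria are not listed at this point in the paper, so the following is only an illustrative sketch, assuming that an attribute is kept when it is mostly populated and not dominated by a single value. The column names, thresholds and the score_attributes helper are hypothetical, not the authors' algorithm.

```python
from typing import Dict, List


def score_attributes(rows: List[Dict[str, str]],
                     completeness_min: float = 0.8,
                     distinctness_min: float = 0.1) -> List[str]:
    """Illustrative attribute selection: keep columns that are mostly
    non-empty (completeness) and not dominated by one value (distinctness).
    The criteria and thresholds here are assumptions, not the paper's method."""
    if not rows:
        return []
    selected = []
    for attr in rows[0].keys():
        values = [r.get(attr, "").strip() for r in rows]
        non_empty = [v for v in values if v]
        completeness = len(non_empty) / len(values)
        distinctness = len(set(non_empty)) / max(len(non_empty), 1)
        if completeness >= completeness_min and distinctness >= distinctness_min:
            selected.append(attr)
    return selected


# Example: 'name' and 'address' would typically be selected for cleaning,
# while an almost-empty 'fax' column would be dropped.
records = [
    {"name": "J. Smith", "address": "12 Main St", "fax": ""},
    {"name": "Jane Smith", "address": "12 Main Street", "fax": ""},
]
print(score_attributes(records))  # -> ['name', 'address']
```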
The token-based approach is applied only to the selected attribute fields, which are chosen according to certain criteria mainly for the data cleaning process. A similarity function applied to long strings takes more time in the comparison process and also requires a multi-pass approach. The proposed token formation algorithm therefore forms a token for each selected attribute field. The token-based approach is proposed to reduce the time of the comparison process and to increase the speed of the data cleaning process [8], [9]. This paper deals with developing an approach to handle noisy data using attribute selection and smart tokens.
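As a rough illustration of token formation (a sketch only; the exact rules, the abbreviation table and the form_token name are assumptions rather than the published algorithm), a field value can be lower-cased, abbreviations expanded, unimportant characters removed, and the remaining words sorted, so that reordered or abbreviated variants of the same value produce the same smart token:

```python
import re

# Hypothetical abbreviation table; a real system would load a domain-specific one.
ABBREVIATIONS = {"st": "street", "rd": "road", "dr": "doctor", "univ": "university"}


def form_token(value: str) -> str:
    """Form a smart token from one selected attribute field:
    expand abbreviations, drop unimportant characters, and sort the words
    so that noisy variants of the same value compare equal cheaply."""
    words = re.split(r"[^a-z0-9]+", value.lower())
    expanded = [ABBREVIATIONS.get(w, w) for w in words if w]
    return " ".join(sorted(set(expanded)))


# Two noisy variants of the same address map to one token, so a single
# exact comparison replaces an expensive string-similarity pass.
print(form_token("12, Main St."))    # -> "12 main street"
print(form_token("Main Street 12"))  # -> "12 main street"
```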
2. Related Work

The large data file is sorted on designated fields to bring potentially identical records together. However, the sorting is based on "dirty" fields, which may fail to bring matching records together, and the comparison cost is quadratic in the number of records, so this technique is inefficient for large data files [BD, 83]. The merge/purge problem in a large database is solved by forming keys from selected fields, sorting the entire data set on the keys, clustering the sorted records, and using a scanning window of fixed size to reduce the number of comparisons [6], [7].
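The merge/purge idea, forming a key from selected fields, sorting on it, and then comparing only records inside a fixed-size sliding window, can be sketched as follows; the key rule, window size and match predicate are illustrative assumptions, not the exact method of [6], [7]:

```python
from typing import Callable, Dict, List, Tuple


def sorted_neighborhood(records: List[Dict[str, str]],
                        make_key: Callable[[Dict[str, str]], str],
                        match: Callable[[Dict[str, str], Dict[str, str]], bool],
                        window: int = 4) -> List[Tuple[int, int]]:
    """Sort records on a key built from selected fields, then compare each
    record only with its neighbours inside a fixed-size sliding window."""
    order = sorted(range(len(records)), key=lambda i: make_key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[pos + 1:pos + window]:
            if match(records[i], records[j]):
                pairs.append((i, j))
    return pairs


# Illustrative key: first letters of the name plus the postcode digits.
records = [
    {"name": "John Smith", "zip": "641114"},
    {"name": "Jon Smith", "zip": "641114"},
    {"name": "Mary Jones", "zip": "600001"},
]
key = lambda r: r["name"][:2].lower() + r["zip"]
same = lambda a, b: a["zip"] == b["zip"] and a["name"][0] == b["name"][0]
print(sorted_neighborhood(records, key, same))  # -> [(0, 1)]
```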
Several steps are used to clean the data warehouse [10]. The first step is to scrub dirty data fields; this step attempts to remove typographical errors.
