(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 8, 2010

Optimization of work flow execution in ETL using Secure Genetic Algorithm

Raman Kumar, Saumya Singla, Sagar Bhalla and Harshit Arora
Department of Computer Science and Engineering,
D A V Institute of Engineering and Technology, Jalandhar, Punjab, India.
er.ramankumar@aol.in
Abstract - Data Warehouses (DW) typically grow asynchronously, fed by a variety of sources which all serve a different purpose, resulting in, for example, different reference data. ETL is a key process for bringing heterogeneous and asynchronous source extracts into a homogeneous environment. The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis is recommended to identify the data conditions that will need to be managed by transformation rules and their specifications; this leads to the implementation of the ETL process. Extraction-Transformation-Loading (ETL) tools are a set of processes by which data is extracted from numerous databases, applications and systems, transformed as appropriate, and loaded into target systems, including, but not limited to, data warehouses, data marts and analytical applications. Usually ETL activity must be completed within a certain time frame, so there is a need to optimize the ETL process. A data warehouse contains multiple views accessed by queries, and one of the most important decisions in designing a data warehouse is selecting which views to materialize in order to support decision making efficiently. Heuristics have been used to search for an optimal solution, and evolutionary algorithms for materialized view selection based on multiple global processing plans for queries have also been implemented. ETL systems work on the theory of random numbers; this paper shows that the optimal solution for ETL systems can be reached in fewer stages using a genetic algorithm. Reaching the optimal solution early saves bandwidth and CPU time, which can then be used for other tasks. Therefore, the proposed scheme is secure and efficient against notorious conspiracy goals in information processing.
Keywords - Extract, Transform, Load, Data Warehouse (DW), Genetic Algorithm (GA), Architecture, Information Management System, Virtual Storage Access Method and Indexed Sequential Access Method
I. INTRODUCTION
Companies know they have valuable data lying around throughout their networks that needs to be moved from one place to another, such as from one business application to another or to a data warehouse for analysis. The only problem is that the data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. To integrate data into one warehouse for analysis, a tool is required which can integrate data from various systems. To solve the problem, companies use extract, transform and load (ETL) software. Usually ETL activity must be completed within a certain time frame, so there is a need to optimize the ETL process. Typical ETL activity consists of three major tasks: extraction, transformation and loading.

This paper studies the extraction and transformation stages. Data extraction can be seen as a reader-writer problem, which has been reformulated using multiple buffers instead of a single finite buffer. Transformation is a set of activities which convert the data from one form to another. This paper studies the use of a Genetic Algorithm to optimize the ETL workflow.
A. Basic concepts of ETL

a) Extract
The first part of an ETL process involves extracting the data from the source systems. Most data warehousing projects consolidate data from different source systems, and each separate system may also use a different data organization or format. Common data source formats are relational databases and flat files, but sources may also include non-relational database structures such as Information Management System (IMS), other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even data fetched from outside sources such as web spidering or screen-scraping. Extraction converts the data into a format suitable for transformation processing. An intrinsic part of the extraction involves parsing the extracted data, which checks whether the data meets an expected pattern or structure. If not, the data may be rejected entirely.
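As an illustration (not part of the original paper), the parse-and-check step described above can be sketched in Python. The file name, field names and expected pattern are assumptions made for the example.

import csv
import re

# Hypothetical expected pattern for one source: every row must carry an
# ISO date and a numeric amount; rows that fail the check are rejected.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def extract(path):
    """Read a flat-file source and keep only rows that meet the expected structure."""
    accepted, rejected = [], []
    with open(path, newline="") as src:
        for row in csv.DictReader(src):
            date_ok = DATE_RE.match(row.get("order_date", ""))
            amount_ok = row.get("amount", "").replace(".", "", 1).isdigit()
            if date_ok and amount_ok:
                accepted.append(row)
            else:
                rejected.append(row)   # rejected entirely, as described above
    return accepted, rejected

if __name__ == "__main__":
    good, bad = extract("orders.csv")  # "orders.csv" is a placeholder file name
    print(len(good), "rows accepted,", len(bad), "rows rejected")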
b) Transform
The transform stage applies a series of rules or functions to the data extracted from the source in order to derive the data for loading into the end target. Some data sources require very little or even no manipulation of the data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the end target (a brief illustrative sketch follows the list):
 
• Selecting only certain columns to load (or selecting null columns not to load).
• Translating coded values (for example, if the source system stores 1 for male and 2 for female, but the warehouse stores M for male and F for female); this calls for automated data cleansing; no manual cleansing occurs during ETL.
• Encoding free-form values (for example, mapping "Male" to "1" and "Mr" to "M").
• Deriving a new calculated value (for example, sale_amount = qty * unit_price).
• Filtering.
• Sorting.
• Joining data from multiple sources (for example, lookup, merge).
• Aggregation (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region).
• Generating surrogate-key values.
• Transposing or pivoting (turning multiple columns into multiple rows or vice versa).
• Splitting a column into multiple columns (for example, putting a comma-separated list specified as a string in one column into individual values in different columns).
• Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial or no rejection of the data, and thus none, some or all of the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.
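Here is a minimal sketch of a few of these transformation types (coded-value translation, a derived value and filtering) in Python; the field names and code mappings are assumptions for illustration, not taken from the paper.

# Hypothetical row layout: each row is a dict produced by the extract step.
GENDER_CODES = {"1": "M", "2": "F"}   # assumed source coding, as in the example above

def transform(rows):
    """Translate coded values, derive sale_amount, and filter out zero-quantity rows."""
    out = []
    for row in rows:
        if int(row["qty"]) == 0:          # filtering
            continue
        new_row = dict(row)
        new_row["gender"] = GENDER_CODES.get(row["gender"], "U")            # translating coded values
        new_row["sale_amount"] = int(row["qty"]) * float(row["unit_price"]) # derived calculated value
        out.append(new_row)
    return out

print(transform([{"qty": "3", "unit_price": "2.50", "gender": "1"}]))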
c) Load 
The load phase loads the data into the end target, usually the Data Warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative, updated data every week, while other DWs (or even other parts of the same DW) may add new data in a historized form, for example hourly. The timing and scope of replacing or appending data are strategic design choices that depend on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW. As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields), and they also contribute to the overall data quality performance of the ETL process.
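As a small illustration (not from the paper), the append-versus-overwrite choice can be sketched with Python's built-in sqlite3 module; the database file, table layout and the load_ts column used for history are assumptions.

import sqlite3
from datetime import datetime, timezone

def load(rows, append_history=True):
    """Load transformed rows into a target table, either appending historized
    records or overwriting the previous snapshot."""
    con = sqlite3.connect("warehouse.db")   # placeholder target database
    con.execute("""CREATE TABLE IF NOT EXISTS sales (
                       order_date TEXT NOT NULL,
                       sale_amount REAL NOT NULL,
                       load_ts TEXT NOT NULL)""")
    if not append_history:
        con.execute("DELETE FROM sales")    # overwrite: replace the previous snapshot
    ts = datetime.now(timezone.utc).isoformat()
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                    [(r["order_date"], r["sale_amount"], ts) for r in rows])
    con.commit()
    con.close()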
 
II. ETL REQUIREMENTS
A practical and secure optimization of workflow in ETL must satisfy the following basic requirements, which can be explored as follows [1], [2], [3]. ETL stands for extract, transform and load, the processes that enable companies to move data from multiple sources, reformat and cleanse it, and load it into another database, a data mart or a data warehouse for analysis, or onto another operational system to support a business process.

Companies know they have valuable data lying around throughout their networks that needs to be moved from one place to another, such as from one business application to another or to a data warehouse for analysis. The only problem is that the data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. For instance, a CRM (Customer Relationship Management) system may define a customer in one way, while a back-end accounting system may define the same customer differently. To solve the problem, companies use extract, transform and load (ETL) software, which includes reading data from its source, cleaning it up and formatting it uniformly, and then writing it to the target repository to be exploited. The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, an Excel spreadsheet, even a message queue. Extraction can be done via Java Database Connectivity, Microsoft Corporation's Open Database Connectivity technology, proprietary code, or by creating flat files. After extraction, the data is transformed, or modified, depending on the specific business logic involved, so that it can be sent to the target repository. There are a variety of ways to perform the transformation, and the work involved varies. The data may require reformatting only, but most ETL operations also involve cleansing the data to remove duplicates and enforce consistency. Part of what the software does is examine individual data fields and apply rules to consistently convert the contents to the form required by the target repository or application. In addition, the ETL process could involve standardizing name and address fields, verifying telephone numbers, or expanding records with additional fields containing demographic information or data from other systems. The transformation occurs when the data from each source is mapped, cleansed and reconciled so it can all be tied together, with receivables tied to invoices and so on. After reconciliation, the data is transported and loaded into the data warehouse for analysis of things such as cycle times and total outstanding receivables.

In the past, companies doing data warehousing projects often used homegrown code to support ETL processes. However, even those that had done successful implementations found that the source data file formats and the validation rules applying to the data evolved, requiring the ETL code to be modified and maintained. Companies also encountered problems as they added systems and the amount of data in them grew; lack of scalability has been a serious issue with homegrown ETL software. Providers of packaged ETL systems include Microsoft, which offers data transformation services bundled with its SQL Server database; Oracle, which has embedded some ETL capabilities in its database; and IBM, which offers a DB2 Information Integrator component for its warehouse offerings. More than half of all development work for data warehousing projects is typically dedicated to the design and implementation of ETL processes.
Poorly designed ETL processes are costly to maintain, change and update, so it is critical to make the right choices in terms of the technology and tools that will be used for developing the logic involved so that the data can be sent to the target repository. The basic steps used for the development of the ETL lifecycle are as follows (a brief orchestration sketch follows the list):

1. Cycle initiation.
2. Build reference data.
3. Extract (from sources).
4. Validate.
5. Transform (clean, apply business rules, check for data integrity, create aggregates).
6. Stage (load into staging tables, if used).
7. Audit reports (for example, on compliance with business rules; also, in case of failure, they help to diagnose and repair).
8. Publish (to target tables).
9. Archive.
10. Clean up.
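A compact sketch of how these lifecycle steps might be chained in Python; every step function here is a hypothetical placeholder, not an implementation from the paper.

# Hypothetical stubs for each lifecycle step; a real system would replace these
# with connector calls, validation rules, staging loads and audit reports.
def build_reference_data(): return {}
def extract_sources(): return [{"qty": "3", "unit_price": "2.50", "gender": "1"}]
def validate(rows): return rows
def transform(rows, ref): return rows
def stage(rows): pass
def audit(rows): pass
def publish(rows): pass
def archive(rows): pass
def clean_up(): pass

def run_cycle():
    """Run one ETL cycle in the order listed above."""
    ref = build_reference_data()          # build reference data
    rows = validate(extract_sources())    # extract, then validate
    rows = transform(rows, ref)           # transform
    stage(rows)                           # stage
    audit(rows)                           # audit reports
    publish(rows)                         # publish
    archive(rows)                         # archive
    clean_up()                            # clean up

run_cycle()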
1) Architecture of Connector
The basic architecture of this connector is shown in Figure 1. The following operations are performed by the connector:

• Reading data from the source into the Quick ETL buffer.
• Writing data to the target from the Quick ETL buffer.
• Managing the metadata.

Figure 1 - Architecture of Connector

The various components of the connector are listed below, followed by a small reader/writer buffering sketch:

• Reader.
• Writer.
• Client GUI.
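To illustrate the reader/writer view of extraction mentioned in the introduction (multiple buffers between a reader and a writer), here is a minimal producer/consumer sketch in Python using a bounded queue; the buffer size and the dummy row source are assumptions, and this is not the paper's Quick ETL implementation.

import queue
import threading

BUFFER_SLOTS = 4          # assumed number of buffers between reader and writer
buf = queue.Queue(maxsize=BUFFER_SLOTS)
DONE = object()           # sentinel marking the end of the source

def reader():
    """Read rows from the source and place them into the buffer."""
    for row in range(10):          # placeholder source: ten dummy rows
        buf.put(row)
    buf.put(DONE)

def writer():
    """Take rows from the buffer and write them to the target."""
    while True:
        row = buf.get()
        if row is DONE:
            break
        print("writing row", row)  # placeholder target write

t_read = threading.Thread(target=reader)
t_write = threading.Thread(target=writer)
t_read.start(); t_write.start()
t_read.join(); t_write.join()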
 
III. GENETIC ALGORITHMS
 
Melanie Mitchell, in the book "An Introduction to Genetic Algorithms", stated that, generally speaking, genetic algorithms are simulations of evolution of whatever kind. In most cases, however, genetic algorithms are nothing other than probabilistic optimization methods based on the principles of evolution. Mitchell further suggested that if there is a solvable problem, a definition of an appropriate programming language, and a sufficiently large set of representative test examples (correct input-output pairs), a genetic algorithm is able to find a program which (approximately) solves the problem [12].

Goldberg, D., in his book "Genetic Algorithms in Search, Optimization and Machine Learning", stated that crossover encourages information exchange among different individuals; it helps the propagation of useful genes in the population and the assembly of better individuals. In a lower-level evolutionary algorithm, crossover is implemented as a kind of cut-and-swap operator [6].

In 2003, Ulrich Bodenhofer suggested that, in solving problems with genetic algorithms, the following steps can be taken [4]:
Algorithm:
  t := 0;
  compute initial population B0;
  WHILE stopping condition not fulfilled DO
  BEGIN
    select individuals for reproduction;
    create offspring by crossing individuals;
    eventually mutate some individuals;
    compute new generation
  END
As is obvious from the above algorithm, the transition from one generation to the next consists of the following basic components:

Selection: the mechanism for selecting individuals (strings) for reproduction according to their fitness (objective function value).

Crossover: the method of merging the genetic information of two individuals; if the coding is chosen properly, two good parents produce good children.

Mutation: in real evolution, the genetic material can be changed randomly by erroneous reproduction or other deformations of genes, for example by gamma radiation. In genetic algorithms, mutation can be realized as a random deformation of the strings with a certain probability. The positive effect is the preservation of genetic diversity and, as a result, that local maxima can be avoided.
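A small, self-contained Python sketch of these three components, maximizing a toy fitness function over bit strings; the fitness function, string length and rates are illustrative assumptions, not the ETL workflow encoding used in the paper.

import random

STRING_LEN, POP_SIZE, GENERATIONS, MUTATION_RATE = 16, 20, 50, 0.02

def fitness(bits):
    """Toy objective: count of ones (stands in for an ETL workflow cost model)."""
    return sum(bits)

def select(pop):
    """Fitness-proportionate (roulette-wheel) selection of one parent."""
    return random.choices(pop, weights=[fitness(b) + 1 for b in pop], k=1)[0]

def crossover(a, b):
    """Single-point cut-and-swap of two parents' genetic information."""
    point = random.randrange(1, STRING_LEN)
    return a[:point] + b[point:]

def mutate(bits):
    """Random deformation of the string with a small probability per gene."""
    return [1 - g if random.random() < MUTATION_RATE else g for g in bits]

population = [[random.randint(0, 1) for _ in range(STRING_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]
best = max(population, key=fitness)
print("best fitness:", fitness(best))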