(IJCSIS) International Journal of Computer Science and Information Security, Vol.
calls for automated data cleansing; no manual cleansing occurs during ETL.
Encoding free-form values (for example, mapping "Male" to "1" and "Mr" to "M").
Deriving a new calculated value (for example, sale_amount = qty * unit_price).
Joining data from multiple sources (for example, lookup, merge).
Aggregation (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region).
Generating surrogate-key values.
Transposing or pivoting (turning multiple columns into multiple rows, or vice versa).
Splitting a column into multiple columns (for example, putting a comma-separated list specified as a string in one column into individual values in different columns).
Applying any form of simple or complex data validation. If validation fails, it may result in a full, partial, or no rejection of the data; thus none, some, or all of the data is handed over to the next step, depending on the rule design and exception handling. Many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data.
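Several of the transformation types listed above can be illustrated in a short sketch. The following is a minimal, illustrative example (not taken from the paper); the field names and encoding table are assumptions, and a record is represented as a plain dictionary:

```python
# Sketch of common ETL transformations applied to one record (a dict).
# Field names ("gender", "qty", "full_name", ...) are illustrative.

def transform(record):
    """Apply encoding, derivation, splitting, and validation rules."""
    out = dict(record)

    # Encoding free-form values: map "Male"/"Female" to codes.
    gender_codes = {"Male": "1", "Female": "2"}
    out["gender_code"] = gender_codes.get(record["gender"], "0")

    # Deriving a new calculated value.
    out["sale_amount"] = record["qty"] * record["unit_price"]

    # Splitting a comma-separated value into individual columns.
    parts = record["full_name"].split(",")
    out["last_name"] = parts[0].strip()
    out["first_name"] = parts[1].strip() if len(parts) > 1 else ""

    # Simple validation: reject the record if qty is not positive.
    if record["qty"] <= 0:
        raise ValueError("validation failed: qty must be positive")
    return out

row = {"gender": "Male", "qty": 3, "unit_price": 9.5, "full_name": "Doe, Jane"}
print(transform(row)["sale_amount"])  # 28.5
```

A record that fails validation raises an exception here; in a real ETL job the rule design decides whether that exception rejects the single row, the batch, or the whole load, as described above.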
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative, updated data every week, while other DWs (or even other parts of the same DW) may add new data in historical form, for example, hourly. The timing and scope of replacing or appending are strategic design choices that depend on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW. As the load phase interacts with a database, the constraints defined in the database schema, as well as in triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.

II.
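The contrast between overwriting and appending, and the role of schema constraints at load time, can be sketched as follows. This is an illustrative example only (the table and column names are assumptions), using an in-memory SQLite target whose primary-key constraint is enforced on load:

```python
import sqlite3

# Overwrite vs. append load strategies against a target whose schema
# constraint (a primary key) is enforced at load time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store_id INTEGER PRIMARY KEY, total REAL)")

def load_overwrite(rows):
    # Replace existing information with cumulative, updated data.
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def load_append(rows):
    # Add new data; the primary-key constraint rejects duplicate keys,
    # so schema constraints contribute to data quality during the load.
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

load_overwrite([(1, 100.0), (2, 250.0)])
load_overwrite([(1, 120.0)])            # earlier snapshot is replaced
load_append([(3, 75.0)])                # new key: accepted
try:
    load_append([(1, 99.0)])            # duplicate key: rejected
except sqlite3.IntegrityError as exc:
    print("rejected by schema constraint:", exc)
```

A historized (append-only) design would typically add a load timestamp to the key so that successive snapshots of the same store can coexist.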
A practical and secure optimization of workflow in ETL must satisfy the following basic requirements, which can be explored as follows. ETL stands for extract, transform, and load: the processes that enable companies to move data from multiple sources, reformat and cleanse it, and load it into another database, a data mart, or a data warehouse for analysis, or into another operational system to support a business process.
Companies know they have valuable data lying around throughout their networks that needs to be moved from one place to another, such as from one business application to another or to a data warehouse for analysis.
The only problem is that the data lies in all sorts of heterogeneous systems, and therefore in all sorts of formats. For instance, a CRM (Customer Relationship Management) system may define a customer in one way, while a back-end accounting system may define the same customer differently. To solve the problem, companies use extract, transform and load (ETL) software, which includes reading data from its source, cleaning it up and formatting it uniformly, and then writing it to the target repository to be exploited.

The data used in ETL processes can come from any source: a mainframe application, an ERP application, a CRM tool, a flat file, an Excel spreadsheet, or even a message queue. Extraction can be done via Java Database Connectivity (JDBC), Microsoft Corporation's Open Database Connectivity (ODBC) technology, proprietary code, or by creating flat files.

After extraction, the data is transformed, or modified, depending on the specific business logic involved, so that it can be sent to the target repository. There are a variety of ways to perform the transformation, and the work involved varies. The data may require reformatting only, but most ETL operations also involve cleansing the data to remove duplicates and enforce consistency. Part of what the software does is examine individual data fields and apply rules to consistently convert the contents to the form required by the target repository or application. In addition, the ETL process could involve standardizing name and address fields, verifying telephone numbers, or expanding records with additional fields containing demographic information or data from other systems.

The transformation occurs when the data from each source is mapped, cleansed, and reconciled so it can all be tied together, with receivables tied to invoices and so on. After reconciliation, the data is transported and loaded into the data warehouse for analysis of things such as cycle times and total outstanding receivables.
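The extract, transform, and load steps described above can be sketched end to end. The following is a minimal, illustrative pipeline (not from the paper): the flat-file source, field names, and cleansing rules are all assumptions, with an in-memory SQLite table standing in for the warehouse:

```python
import csv
import io
import sqlite3

# Extract from a flat-file source (a CSV held in memory here),
# transform (deduplicate and enforce consistent formatting),
# and load into a "warehouse" table.
source = io.StringIO(
    "customer,phone\n"
    "  alice  ,555-0100\n"
    "Bob,555-0101\n"
    "alice,555-0100\n"      # duplicate to be removed in transform
)

def extract(fh):
    return list(csv.DictReader(fh))

def transform(rows):
    seen, clean = set(), []
    for r in rows:
        name = r["customer"].strip().title()   # enforce consistent format
        key = (name, r["phone"])
        if key not in seen:                    # remove duplicates
            seen.add(key)
            clean.append({"customer": name, "phone": r["phone"]})
    return clean

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS customers (customer TEXT, phone TEXT)")
    conn.executemany("INSERT INTO customers VALUES (:customer, :phone)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(source)), conn)
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])  # 2
```

In practice the extract step would use JDBC, ODBC, or a bulk file export as the text describes, but the three-stage shape of the pipeline is the same.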
In the past, companies that were doing data warehousing projects often used homegrown code to support ETL processes. However, even those that had done successful implementations found that the source data file formats and the validation rules applying to the data evolved, requiring the ETL code to be modified and maintained. And companies encountered problems as they added systems and the amount of data in them grew. Lack of scalability has been a serious issue with homegrown ETL software.

Providers of packaged ETL systems include Microsoft, which offers data transformation services bundled with its SQL Server database. Oracle has embedded some ETL capabilities in its database, and IBM offers a DB2 Information Integrator component for its warehouse offerings. More than half of all development work for data warehousing projects is typically dedicated to the design and implementation of ETL processes. Poorly