
Data Transformation Overview

Contents
1. Data Transformation
   a. Introduction
   b. Input files
   c. Source Systems
   d. Mapping file
   e. Output file
   f. Output file structure
2. Process Steps
   a. Initial data assessment
   b. Pre-transformation
   c. Transformation
   d. Post-transformation
   e. Quality checks
   f. Common issues and challenges
3. Cartographer
   a. Introduction
   b. Current status
   c. Requested features

1. Data Transformation
a) Introduction
Data transformation is the process of converting data from one format to another. It includes cleansing, reformatting, standardization, joining data from multiple master files, and applying business rules, if any. The data is transformed from the client-specified format to a format that can run on the Spend Visibility platform.

Purpose

The purpose of Data Transformation is to convert the client data to the standard format so that the data is ready for data enrichment.

End User

The end user of the transformed data is the Data Enrichment team.

b) Input files
The input files are raw (flat) files that customers extract from their operational data sources (OLTP databases).

Types of input file formats: .xls, .xlsx, .csv, .txt, .dat

Denormalized data:
a) Multiple source systems are contained in a single raw file
b) Facts and Dimensions are created from the single raw file

Normalized data: Fact and Dimension tables are created from different raw files
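As an illustration of the denormalized case, the sketch below splits one flat file into a fact table and a dimension table. The column names (InvoiceId, SupplierId, SupplierName, Amount) and the sample rows are hypothetical and are not taken from the Data Acquisition Schema.

```python
def split_denormalized(rows):
    """Split flat rows into an invoice fact list and a supplier dimension list."""
    facts = []
    suppliers = {}  # SupplierId -> dimension row, one entry per supplier
    for row in rows:
        facts.append({
            "InvoiceId": row["InvoiceId"],
            "SupplierId": row["SupplierId"],  # link from fact to dimension
            "Amount": row["Amount"],
        })
        # Keep the first occurrence of each supplier (assumed to be consistent).
        suppliers.setdefault(row["SupplierId"], {
            "SupplierId": row["SupplierId"],
            "SupplierName": row["SupplierName"],
        })
    return facts, list(suppliers.values())


# Hypothetical denormalized rows: supplier attributes repeat on every invoice.
raw_rows = [
    {"InvoiceId": "INV1", "SupplierId": "S1", "SupplierName": "Acme", "Amount": "100.00"},
    {"InvoiceId": "INV2", "SupplierId": "S1", "SupplierName": "Acme", "Amount": "50.00"},
    {"InvoiceId": "INV3", "SupplierId": "S2", "SupplierName": "Globex", "Amount": "75.50"},
]
facts, supplier_dim = split_denormalized(raw_rows)
```

In the normalized case this split is not needed, because facts and dimensions already arrive in separate raw files.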

c) Source Systems
Source systems are the individual systems in which the transactional data is stored. We transform the data per source system, each with a separate mapping file.

d) Mapping File
Data mapping maps the data from the raw file to the standard schema tables. The mapping file is a document provided by the customer that describes how to perform the transformation, including the business rules to apply.
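A mapping can be thought of as a lookup from raw field names to schema field names, kept separately for each source system. The sketch below is only an illustration: the source system names (SRC_A, SRC_B), the raw field names, and the schema field names are all hypothetical, and real mapping files also carry business rules that are not modeled here.

```python
# Hypothetical mappings, one per source system: raw field -> schema field.
MAPPINGS = {
    "SRC_A": {"INV_NO": "InvoiceId", "VENDOR": "SupplierId", "AMT": "Amount"},
    "SRC_B": {"DocNumber": "InvoiceId", "SuppCode": "SupplierId", "Value": "Amount"},
}


def apply_mapping(raw_row, source_system):
    """Rename raw fields to schema fields; unmapped raw fields are dropped."""
    mapping = MAPPINGS[source_system]
    return {schema: raw_row.get(raw, "") for raw, schema in mapping.items()}


row_a = apply_mapping(
    {"INV_NO": "INV1", "VENDOR": "S1", "AMT": "100.00", "NOTES": "n/a"}, "SRC_A"
)
row_b = apply_mapping(
    {"DocNumber": "77", "SuppCode": "S9", "Value": "12.50"}, "SRC_B"
)
```

Both rows end up with the same schema field names even though the raw layouts differ, which is the point of keeping one mapping per source system.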

e) Output Files
Output should be the transformed Fact and Dimension files as per the standard Data Acquisition Schema:
- Data Acquisition Schema Guide_10s2.xls
- Data_Acquisition_Schema_Guide_10s3.xls

f) Output file structure


The transformed files contain normalized Fact and Dimension tables, with UTF-8 as the encoding.

Fact Tables: The fact table is the master table that contains all primary key data or main data.

Fact Tables
- Invoice2 / Invoice3
- PO2 / PO3

Dimension Tables: Dimension tables contain attributes that describe fact records in the fact table.

Dimension Tables
- Account
- Companysite
- Contract
- CostCenter
- CostCenterMgmt
- ERPCommodity
- FlexDimension1
- FlexDimension2
- FlexDimension3
- FlexDimension4
- FlexDimension5
- FlexDimension6
- FlexDimension7
- FlexDimension8
- FlexDimension9
- FlexDimension10
- FlexDimension11
- FlexDimension12
- FlexDimension13
- FlexDimension14
- Part
- User
- Supplier

2. Process Steps
The PM sends an email to the DT team about the availability of raw files. Raw files are generally made available in *** Analysis. The PM also sends a data worksheet with the raw file names, spend amount, and record count for each file.

a) Initial data assessment


In this phase we assess the quality of the raw data provided, identify which tables need to be created during transformation, estimate the time needed for transformation, and report the issues found in the data assessment to the customer at the initial stage. We also check for:
- Discrepancies in the number of raw files, record count, and amount
- Reference to the correct version of the mapping file
- Suitable encoding for special characters in data files
- Mappings in line with the raw files
- Sufficient data in lookup tables
- Raw data maximum length vs. SV schema field limit
- Format of the Amount field
- Data availability in mandatory fields
- Duplicate records
- Invalid date data
- Non-numeric data in numeric fields
- Accuracy of currency rates (only if the file is available)
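Several of the checks above can be automated per row. The following is a minimal sketch, assuming hypothetical field names, a YYYY-MM-DD date format, and a 255-character limit on one field; the actual limits come from the SV schema and are not reproduced here.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def assess_rows(rows, key_field, mandatory, date_field, amount_field, max_len):
    """Return a list of (row_number, issue) tuples found in the raw rows."""
    issues = []
    seen_keys = set()
    for number, row in enumerate(rows, start=1):
        key = row.get(key_field, "")
        if key in seen_keys:
            issues.append((number, "duplicate record"))
        seen_keys.add(key)
        for field in mandatory:
            if not row.get(field, "").strip():
                issues.append((number, f"missing mandatory field {field}"))
        try:
            # Assumed raw date format; strptime also rejects dates like Feb 30.
            datetime.strptime(row.get(date_field, ""), "%Y-%m-%d")
        except ValueError:
            issues.append((number, "invalid date"))
        try:
            Decimal(row.get(amount_field, ""))
        except InvalidOperation:
            issues.append((number, "non-numeric amount"))
        for field, limit in max_len.items():
            if len(row.get(field, "")) > limit:
                issues.append((number, f"{field} exceeds {limit} characters"))
    return issues


# Hypothetical raw rows; the second one repeats a key and has bad values.
sample_rows = [
    {"InvoiceId": "INV1", "AccountingDate": "2023-01-15", "Amount": "100.00", "Description": "ok"},
    {"InvoiceId": "INV1", "AccountingDate": "2023-02-30", "Amount": "12A", "Description": "ok"},
]
found = assess_rows(
    sample_rows,
    key_field="InvoiceId",
    mandatory=["InvoiceId", "Amount"],
    date_field="AccountingDate",
    amount_field="Amount",
    max_len={"Description": 255},
)
```

The resulting list of issues can be collected per file and reported back to the customer before transformation starts.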

b) Pre-Transformation
In this phase, we update the raw files and make them ready for transformation, i.e. format the dates, pull data from different lookup/master tables, update the fields as per the customer rules, etc.
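Two of these preparation steps, date formatting and pulling values from a lookup table, could look like the sketch below. The accepted input formats, the YYYY-MM-DD target format, and the field names are assumptions chosen for illustration; the real target format is whatever the Analysis standard prescribes.

```python
from datetime import datetime


def normalize_date(text, input_formats=("%d/%m/%Y", "%Y%m%d", "%Y-%m-%d")):
    """Return the date as YYYY-MM-DD, or '' when no assumed format matches."""
    for fmt in input_formats:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return ""


def enrich_from_lookup(row, lookup, key_field, target_field):
    """Copy a value from a lookup/master table into a copy of the row."""
    enriched = dict(row)
    enriched[target_field] = lookup.get(row[key_field], "")
    return enriched


# Hypothetical master table: cost center code -> cost center name.
cost_center_lookup = {"CC10": "Finance", "CC20": "Operations"}
prepared = enrich_from_lookup(
    {"InvoiceId": "INV1", "CostCenterId": "CC20"},
    cost_center_lookup,
    key_field="CostCenterId",
    target_field="CostCenterName",
)
```

Returning an empty string for unparseable dates keeps the bad rows visible so they can be raised as data issues rather than silently guessed.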

c) Transformation
In this phase, we proceed with transformation for the source systems whose data is clean and eligible for transformation, i.e. those that do not have data issues. We create the necessary Fact and Dimension tables according to the mapping file. We also apply the following rules to the Fact and Dimension files:
- Refer to the previous cycle's data issues file for recurrence (only for recurring cycles)
- Amount field format
- Negative spend representation
- Field Id and Field Name gap filling
- Check for duplicate records in both Facts and Dimensions
- Check for broken links between Facts and Dimensions
- Check for null records in mandatory fields
- Date format as per Analysis standards
- Default values
- Apply all special instructions
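The duplicate and broken-link checks in the list above lend themselves to a short sketch. The table contents and the SupplierId key below are hypothetical; a broken link is taken here to mean a fact row whose key has no matching row in the dimension.

```python
from collections import Counter


def find_duplicate_keys(rows, key_field):
    """Return the key values that occur more than once in a table."""
    counts = Counter(row[key_field] for row in rows)
    return sorted(key for key, count in counts.items() if count > 1)


def find_broken_links(facts, dimension, key_field):
    """Return fact key values that have no matching dimension row."""
    dimension_keys = {row[key_field] for row in dimension}
    return sorted({row[key_field] for row in facts} - dimension_keys)


# Hypothetical transformed tables.
invoice_facts = [
    {"InvoiceId": "INV1", "SupplierId": "S1"},
    {"InvoiceId": "INV2", "SupplierId": "S3"},  # S3 is missing from the dimension
]
supplier_dimension = [
    {"SupplierId": "S1", "SupplierName": "Acme"},
    {"SupplierId": "S2", "SupplierName": "Globex"},
    {"SupplierId": "S2", "SupplierName": "Globex Ltd"},  # duplicate key
]
duplicates = find_duplicate_keys(supplier_dimension, "SupplierId")
broken = find_broken_links(invoice_facts, supplier_dimension, "SupplierId")
```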

d) Post Transformation
- Fix the duplicates in both Facts and Dimensions, if any
- Fix the referential integrity (data integrity) between Facts and Dimensions, if any
- Fix the hierarchy issues, if any

e) Quality Checks
- Duplicate checks, to make sure that there is no data duplication in Facts and Dimensions
- Referential integrity checks, to ensure data integrity between Facts and Dimensions
- Additional checks:
  - Check that leading zeroes are intact
  - Make sure that all records have data in the Accounting Date column of the Invoice table and the Ordered Date column of the PO table
  - Check for incompatible data in date and amount columns
  - Check for null AmountCurrency
  - Check that the final and initial stats match
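Two of the additional checks, comparing final and initial stats and confirming that leading zeroes survived, can be sketched as follows. "Stats" is assumed here to mean record count and total amount, matching the data worksheet sent with the raw files; the field names and values are hypothetical.

```python
from decimal import Decimal


def file_stats(rows, amount_field):
    """Return (record count, total amount) for a list of rows."""
    total = sum((Decimal(row[amount_field]) for row in rows), Decimal("0"))
    return len(rows), total


def lost_leading_zeroes(raw_values, transformed_values):
    """Return raw values that appear in the output only with their leading zeroes stripped."""
    transformed = set(transformed_values)
    return [
        value for value in raw_values
        if value.startswith("0")
        and value not in transformed
        and value.lstrip("0") in transformed
    ]


# Hypothetical initial (raw) and final (transformed) invoice rows.
initial_rows = [{"SupplierId": "00123", "Amount": "100.00"}, {"SupplierId": "S2", "Amount": "-25.50"}]
final_rows = [{"SupplierId": "123", "Amount": "100.00"}, {"SupplierId": "S2", "Amount": "-25.50"}]

stats_match = file_stats(initial_rows, "Amount") == file_stats(final_rows, "Amount")
stripped = lost_leading_zeroes(
    [row["SupplierId"] for row in initial_rows],
    [row["SupplierId"] for row in final_rows],
)
```

Decimal is used instead of float so that the totals compare exactly.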

f) Common Issues and challenges

Some of the common issues and challenges faced during the entire process:

Issues during loading of data
We face issues like:
- Escape character for double quotes
- Truncation errors
- Header in more than one row
- Length of a field more than 255
- Number data type in the first few records and alphanumeric later
- Duplicate field names
- One record spanning more than one row
- Problems with special characters and encoding
- Data getting split into more than one field
We need to fix these issues manually and import the file without any errors.

Issues during Transformation
Some of the issues we face during transformation are:
- Empty PK records
- Amount in exponential notation
- Empty records in mandatory fields
- Amount in parentheses
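The two amount-related issues can be handled with a small parsing helper. This is a sketch under stated assumptions: a value in parentheses is treated as a negative amount (a common accounting convention, not something the source confirms for every customer), and commas are assumed to be thousands separators.

```python
from decimal import Decimal


def parse_amount(text):
    """Convert a raw amount string to a Decimal, treating (x) as negative."""
    cleaned = text.strip().replace(",", "")  # assumes comma = thousands separator
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    value = Decimal(cleaned)  # Decimal accepts exponential notation such as 1.5E+3
    return -value if negative else value


def to_plain_string(amount):
    """Render a Decimal without exponential notation."""
    return format(amount, "f")


parenthesized = parse_amount("(1,250.75)")
exponential = to_plain_string(parse_amount("1.5E+3"))
```

Values that are still not numeric after cleaning raise decimal.InvalidOperation, which makes them easy to route to the data issues file.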

3. Cartographer
a) Introduction
Cartographer is a data mapping tool, developed by Tim Pittman, that is used to map customer fields to the schema tables. Cartographer connects to MS SQL Server 2008 to perform the transformation. MS Access, the current tool used for transformation, is limited to holding up to 2 GB of data, so we came up with the idea of a transformation tool that can support huge data sets.

Version: 1.0.6.0