5. Enrichment
• rearrangement and simplification of individual fields to make them more useful for the data
warehouse environment
• one or more fields from the same input record may be used to create a better view of the
data for the data warehouse
Major Transformation Types
• Format Revisions
• changes to the data types and lengths of individual fields
• standardize and change the data type
• Decoding of Fields
• same data items described by a plethora of field values in source systems
• some systems may have cryptic codes to represent business values
• Calculated and Derived Values
• Ex. profit margin, average daily balances, operating ratios
• Splitting of Single Fields
• To improve the operating performance
• users may need to perform analysis by using individual components such as city, state, and
Zip Code
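As a rough illustration of decoding, derived values, and field splitting, the sketch below uses plain Python functions. The gender code table, the function names, and the "City, ST 12345" address layout are assumptions made for the example; they are not taken from any particular source system.

```python
# Hypothetical lookup table: different source systems use different codes
# for the same business value.
GENDER_CODES = {"M": "Male", "F": "Female", "1": "Male", "2": "Female"}


def decode_gender(raw_code):
    """Decoding of fields: map a cryptic source code to one standard value."""
    return GENDER_CODES.get(raw_code.strip().upper(), "Unknown")


def profit_margin(revenue, cost):
    """Calculated value: profit as a fraction of revenue."""
    if revenue == 0:
        return None  # margin is undefined when there is no revenue
    return (revenue - cost) / revenue


def split_city_state_zip(field):
    """Splitting of single fields: 'City, ST 12345' -> separate components."""
    city, _, rest = field.partition(",")
    state, _, zip_code = rest.strip().partition(" ")
    return {"city": city.strip(), "state": state, "zip": zip_code.strip()}
```

With the components stored separately, users can group or filter by city, state, or Zip Code without parsing the combined field each time.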
Major Transformation Types
• Merging of Information
• information about a single subject may come from multiple sources
• Character Set Conversion
• Ex. EBCDIC format to ASCII format
• Conversion of Units of Measurements
• Date/Time Conversion
• Summarization
• creating summaries to be loaded in the data warehouse
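The conversions and summarization listed above could look roughly like the following in Python. This is a sketch under assumptions: `cp037` is only one of several EBCDIC code pages, the `MM/DD/YYYY` source date format is invented for the example, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def ebcdic_to_ascii(raw):
    """Character set conversion: EBCDIC bytes (code page cp037 assumed) to ASCII bytes."""
    return raw.decode("cp037").encode("ascii")


def to_iso_date(value, source_format="%m/%d/%Y"):
    """Date conversion: source date string to ISO 8601 (YYYY-MM-DD)."""
    return datetime.strptime(value, source_format).date().isoformat()


def summarize_sales(rows):
    """Summarization: roll detail rows up to one total per (product, date)."""
    totals = defaultdict(float)
    for product, sale_date, amount in rows:
        totals[(product, sale_date)] += amount
    return dict(totals)
```

In practice the code page and the source date format have to be confirmed per source system; a wrong code page still decodes without error but yields wrong characters.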
Major Transformation Types
• Key Restructuring
• Deduplication
Typical data source environment
Data Integration and Consolidation - issues
• Entity Identification Problem
• Multiple source systems/departments may have their own unique key for customer
identification
• problem of identification - which of the records relate to the same subject
• design complex algorithms to match records from multiple sources and then review the
exceptions to automated procedures
• May be done in two phases
• First phase: assign unique identifiers to all records, irrespective of whether they are duplicates or not
• Second phase consists of reconciling the duplicates periodically through automatic
algorithms and manual verification
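The two phases can be sketched as follows. The matching rule used here (normalized name plus Zip Code) is a deliberately simple, hypothetical stand-in for the complex matching algorithms mentioned above, and the field names are assumptions.

```python
import itertools


def assign_surrogate_keys(records, start=1):
    """Phase 1: give every incoming record its own warehouse identifier,
    whether or not it later turns out to be a duplicate."""
    counter = itertools.count(start)
    return [dict(rec, warehouse_id=next(counter)) for rec in records]


def match_key(rec):
    """Hypothetical matching rule: normalized name plus zip code."""
    return (" ".join(rec["name"].lower().split()), rec["zip"])


def find_duplicate_groups(keyed_records):
    """Phase 2: group identifiers that appear to describe the same subject."""
    groups = {}
    for rec in keyed_records:
        groups.setdefault(match_key(rec), []).append(rec["warehouse_id"])
    # Only groups with more than one identifier need reconciliation or review.
    return [ids for ids in groups.values() if len(ids) > 1]


records = [
    {"name": "John  Smith", "zip": "10001", "source": "billing"},
    {"name": "john smith", "zip": "10001", "source": "orders"},
    {"name": "Ann Lee", "zip": "94105", "source": "orders"},
]
keyed = assign_surrogate_keys(records)
duplicate_groups = find_duplicate_groups(keyed)
```

The groups returned by the second phase are candidates only; the manual verification step decides whether they are really the same subject.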
Data Integration and Consolidation - issues
• Multiple Sources Problem
• single data element may have more than one source
• Ex. unit cost of products may be available from two systems
• standard costing application
• order processing system
• There could be slight variations in the cost figures from these two systems
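One way to make such variations visible is to compare the two sources and report products whose unit costs disagree. The sketch below assumes each source has already been reduced to a product-to-cost mapping; the function name and the `tolerance` threshold are hypothetical, and how a discrepancy is resolved is left open.

```python
def compare_unit_costs(costing, orders, tolerance=0.01):
    """Return products whose unit cost differs between the two sources
    by more than `tolerance` (an assumed threshold)."""
    discrepancies = {}
    # Only products present in both sources can be compared.
    for product in costing.keys() & orders.keys():
        diff = abs(costing[product] - orders[product])
        if diff > tolerance:
            discrepancies[product] = (costing[product], orders[product])
    return discrepancies


standard_costing = {"A": 10.00, "B": 5.50}
order_processing = {"A": 10.00, "B": 5.75, "C": 1.00}
report = compare_unit_costs(standard_costing, order_processing)
```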
Loading
Introduction - Loading
• Loading is taking the prepared data and applying it to the data warehouse
• loading the data warehouse may take an inordinate amount of time, so loads are generally
a cause for great concern
• During the loads, the data warehouse has to be offline
• loads should be scheduled in a time window that does not affect data warehouse users
• the whole load process may be divided into smaller chunks and only a few files may be
populated at a time
• Certain load images may fail to load, so procedures must be provided to handle them
• the quality of the loaded records must also be ensured
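The ideas of loading in smaller chunks and handling records that fail to load can be sketched with Python's built-in `sqlite3` module standing in for the warehouse database. The `sales` table, the chunk size, and the reject list are assumptions for illustration only.

```python
import sqlite3


def load_in_chunks(conn, rows, chunk_size=2):
    """Insert rows chunk by chunk; collect rows that fail instead of aborting."""
    rejected = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        for row in chunk:
            try:
                conn.execute("INSERT INTO sales (id, amount) VALUES (?, ?)", row)
            except sqlite3.IntegrityError:
                rejected.append(row)  # keep for a later correction procedure
        conn.commit()  # one commit per chunk
    return rejected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
# The second row has a missing amount and the fourth repeats an existing key.
rejected = load_in_chunks(conn, [(1, 9.5), (2, None), (3, 4.0), (1, 7.0)])
loaded = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Committing once per chunk keeps the amount of work lost on a failure small, and the reject list gives the follow-up procedure something concrete to process.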
Types of loads
1. Initial Load
• populating all the data warehouse tables for the very first time
2. Incremental Load
• applying ongoing changes as necessary in a periodic manner
3. Full Refresh
• completely erasing the contents of one or more tables and reloading with
fresh data
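The three load types can be contrasted in a small sketch, again using an in-memory `sqlite3` database as a stand-in for the warehouse. The `product` table is hypothetical, and the incremental load is shown as an insert-or-replace, which is only one possible way of applying ongoing changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT)")


def initial_load(conn, rows):
    """Populate an empty table for the very first time."""
    conn.executemany("INSERT INTO product VALUES (?, ?)", rows)
    conn.commit()


def incremental_load(conn, changed_rows):
    """Apply only the changes captured since the last load."""
    conn.executemany("INSERT OR REPLACE INTO product VALUES (?, ?)", changed_rows)
    conn.commit()


def full_refresh(conn, rows):
    """Erase the table contents and reload with fresh data."""
    conn.execute("DELETE FROM product")
    conn.executemany("INSERT INTO product VALUES (?, ?)", rows)
    conn.commit()


query = "SELECT id, name FROM product ORDER BY id"
initial_load(conn, [(1, "Pen"), (2, "Ink")])
incremental_load(conn, [(2, "Blue Ink"), (3, "Pad")])
after_incremental = conn.execute(query).fetchall()
full_refresh(conn, [(10, "Desk")])
after_refresh = conn.execute(query).fetchall()
```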
Staging area and warehouse links
• If the data staging area and the data warehouse database are on the same server, it saves the
effort of moving the load images to the data warehouse server
• When the staging area files and the data warehouse repository are on different servers, database
access methods are used
• The loading process involves consideration of the following:
• Speed of loading
• Heterogeneity
• bandwidth needed
• impact of the transmissions on the network
• data compression requirements
• contingency plans
Database access methods
• Web Services
• Allows a system to connect to an outside hosted system via a simple API call
• In general, a web service is a slow method of communicating for bulk data loads
• Database Links
• Provides a convenient way of connecting to a remote database, but this convenience comes
at the cost of performance
• does not allow parallel loading and therefore often becomes a bottleneck in the data
movement process
• Data Pump
• Allows specific columns and tables to be chosen, providing a very fast way of moving data between two or more Oracle
instances
Database access methods
• Flat Files
• one of the best performing means of moving large data volumes between databases
especially in a heterogeneous environment
• Transportable Tablespaces
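A flat-file transfer between two databases might be sketched as below. Two in-memory `sqlite3` databases stand in for the staging and warehouse servers, the `customer` table and the file name are hypothetical, and gzip compression is included only to illustrate the data compression consideration for files sent over a network.

```python
import csv
import gzip
import os
import sqlite3
import tempfile


def export_flat_file(conn, path):
    """Write the source table to a gzip-compressed, comma-delimited flat file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerows(conn.execute("SELECT id, name FROM customer ORDER BY id"))


def import_flat_file(conn, path):
    """Bulk-insert the flat file rows into the target table."""
    with gzip.open(path, "rt", newline="", encoding="utf-8") as f:
        # csv yields strings, so the key is converted back to an integer.
        rows = ((int(rec[0]), rec[1]) for rec in csv.reader(f))
        conn.executemany("INSERT INTO customer VALUES (?, ?)", rows)
    conn.commit()


source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for db in (source, target):
    db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Raj")])

path = os.path.join(tempfile.mkdtemp(), "customer.csv.gz")
export_flat_file(source, path)
import_flat_file(target, path)
copied = target.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
```

Because the file is plain delimited text, the same export can be loaded by a different database product, which is why flat files suit heterogeneous environments.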
Modes of applying data
1. Load
• If the target table to be loaded already exists and data exists in the table, the
load process wipes out the existing data and applies the data from the
incoming file
• If the table is already empty before loading, the load process simply applies
the data from the incoming file
2. Append
• If data already exists in the table, the append process unconditionally adds
the incoming data, preserving the existing data in the target table
• When an incoming record is a duplicate of an already existing record, you may
define how to handle an incoming duplicate
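The Load and Append modes can be sketched as follows, with `sqlite3` standing in for the warehouse. The `account` table is hypothetical, and rejecting incoming duplicates is only one of the policies that could be defined for the append case.

```python
import sqlite3


def load(conn, rows):
    """Load mode: wipe any existing rows, then apply the incoming data."""
    conn.execute("DELETE FROM account")
    conn.executemany("INSERT INTO account VALUES (?, ?)", rows)
    conn.commit()


def append(conn, rows):
    """Append mode: keep existing rows and add the incoming ones.
    Incoming duplicates are rejected here (one possible policy)."""
    rejected = []
    for row in rows:
        try:
            conn.execute("INSERT INTO account VALUES (?, ?)", row)
        except sqlite3.IntegrityError:
            rejected.append(row)
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
load(conn, [(1, 100.0), (2, 50.0)])
rejected = append(conn, [(2, 75.0), (3, 20.0)])
rows = conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall()
```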
Modes of applying data
3. Destructive Merge
• If the primary key of an incoming record matches with the key of an existing
record, update the matching target record
• If the incoming record is a new record without a match with any existing
record, add the incoming record to the target table
4. Constructive Merge
• If the primary key of an incoming record matches with the key of an existing
record, leave the existing record, add the incoming record, and mark the
added record as superseding the old record
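The difference between the two merge modes can be sketched with `sqlite3`. The table names and the `is_current` flag used to mark the superseding record are assumptions made for this example, not a prescribed design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_key INTEGER PRIMARY KEY, city TEXT)")
# `is_current` is a hypothetical flag that marks the superseding row.
conn.execute(
    "CREATE TABLE customer_hist (row_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "cust_key INTEGER, city TEXT, is_current INTEGER)"
)


def destructive_merge(conn, cust_key, city):
    """Update the matching record in place, or insert it if there is no match."""
    cur = conn.execute(
        "UPDATE customer SET city = ? WHERE cust_key = ?", (city, cust_key)
    )
    if cur.rowcount == 0:
        conn.execute("INSERT INTO customer VALUES (?, ?)", (cust_key, city))
    conn.commit()


def constructive_merge(conn, cust_key, city):
    """Keep the old row, add the new one, and flag the new row as current."""
    conn.execute(
        "UPDATE customer_hist SET is_current = 0 WHERE cust_key = ?", (cust_key,)
    )
    conn.execute(
        "INSERT INTO customer_hist (cust_key, city, is_current) VALUES (?, ?, 1)",
        (cust_key, city),
    )
    conn.commit()


# Apply the same two incoming versions of customer 7 in both modes.
for city in ("Boston", "Denver"):
    destructive_merge(conn, 7, city)
    constructive_merge(conn, 7, city)

current = conn.execute("SELECT cust_key, city FROM customer").fetchall()
history = conn.execute(
    "SELECT cust_key, city, is_current FROM customer_hist ORDER BY row_id"
).fetchall()
```

After both incoming records are applied, the destructive table holds only the latest value, while the constructive table keeps the earlier row alongside the superseding one.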
Modes of applying data