
TRA04.05 – Extract, Transform, and Load (ETL) Processing
Version 1.0
10/25/2016

Prepared by:
South Carolina Department of Health and Human Services (SCDHHS)
Enterprise Services (ES)
TRA04.05 – Extract, Transform, and Load (ETL) Processing

Table of Contents
1. Introduction.................................................................................................................................7
2. ETL in the MES Architecture........................................................................................................8
3. Two ETL Patterns—E-T-L and E-L-T............................................................................................10
3.1. E-T-L Pattern........................................................................................................................10
3.2. E–L–T Pattern......................................................................................................................11
4. ETL Process Setup......................................................................................................................11
4.1. Direct Database Connections..............................................................................................12
4.2. File-Based Exchanges..........................................................................................................13
5. ETL Process Standard.................................................................................................................14
5.1. Authentication and Authorization......................................................................................16
5.2. Data Access.........................................................................................................................17
5.2.1. Direct Database Exchanges..........................................................................................18
5.2.2. File-Based Exchanges...................................................................................................19
5.2.3. Metadata Repository....................................................................................................20
5.3. Data Extraction....................................................................................................................20
5.4. Data Validation....................................................................................................................21
5.5. Data Transformation...........................................................................................................22
5.6. Load Step.............................................................................................................................22
5.7. Common Infrastructure......................................................................................................23
5.7.1. Process Logging............................................................................................................23
5.7.2. Exception Handling.......................................................................................................24
5.7.3. Component Documentation.........................................................................................26
5.8. Operational Considerations................................................................................................27
5.8.1. High Availability and Failover Mechanisms..................................................................27
5.8.2. Causes of Infrastructure Failure...................................................................................27

10/25/2016 SCDHHS - Confidential Page 2 of 36



5.8.3. Clustering......................................................................................................................27
5.8.4. Restartability................................................................................................................28
5.9. Scheduling...........................................................................................................................28
5.9.1. Process Monitoring......................................................................................................29
5.9.2. Reporting......................................................................................................................29
5.9.3. Alerts and Notification.................................................................................................30
6. Use Cases...................................................................................................................................30
6.1. Initial Load...........................................................................................................................30
6.2. Incremental Loads / Deltas.................................................................................................30
6.3. Initial as Well as Ongoing Loads..........................................................................................32
6.4. Transformation Performed on Target Systems..................................................................32
6.4.1. Data Loads into Data Marts..........................................................................................32
6.4.2. Data Loads into NoSQL Databases...............................................................................33
7. Tools and Technologies.............................................................................................................34
8. Appendices................................................................................................................................34
8.1. Relevant Documentation....................................................................................................34
8.2. Revision History...................................................................................................................35
8.3. Acronyms............................................................................................................................35
8.4. Glossary...............................................................................................................................36

Table of Figures
Figure 1. ETL in the MES architecture..............................................................................................7
Figure 2. MES ETL architecture........................................................................................................9
Figure 3. E-T-L pattern steps..........................................................................................................10
Figure 4. E-L-T pattern steps..........................................................................................................11
Figure 5. Detailed ETL process.......................................................................................................15
Figure 6. MES ETL connection architecture...................................................................................18


Figure 7. Initial load use case.........................................................................................................30
Figure 8. Incremental loads/deltas use case.................................................................................31
Figure 9. Initial as well as ongoing loads use case.........................................................................32
Figure 10. Data loads into data mart use case..............................................................................33
Figure 11. Data loads into NoSQL database use case....................................................................33

Table of Tables
Table 1. Events and metrics to record in ETL process logs............................................................24
Table 2. Information logged in event of ETL process error...........................................................25
Table 3. Considerations for scheduling ETL process jobs..............................................................29

Table of Standards
TRA04.05-S1. The ETL team will establish failure response scenarios..........................................12
TRA04.05-S2. ETL processes will log exceptions in accordance with the enterprise standards...12
TRA04.05-S3. Changes affecting the ETL process will adhere to ITIL framework.........................12
TRA04.05-S4. Database connection details will be set up as an encrypted configuration of the
ETL tool..........................................................................................................................................13
TRA04.05-S5. Database table structures will be managed through the metadata repository.....13
TRA04.05-S6. The ETL process will comply with TRA04.06 – Managed File Transfer (MFT)........13
TRA04.05-S7. File schemas will be managed through the metadata repository..........................13
TRA04.05-S8. ETL processes will adhere to TRA01.02 – Access Control and Identity
Management.................................................................................................................................16
TRA04.05-S9. At each step in the ETL process, access to the data will be controlled..................16
TRA04.05-S10. The source systems will ensure the ETL processes have access to the source
data................................................................................................................................................18
TRA04.05-S11. The ETL tool will have read-only access to the source data.................................18
TRA04.05-S12. The ETL application will perform file validations..................................................19


TRA04.05-S13. The ETL process will validate the structure of the database or file......................21
TRA04.05-S14. The ETL process will validate that the data follows specified business rules.......21
TRA04.05-S15. The ETL tool will have read/write access to staging and destination databases. 22
TRA04.05-S16. The ETL process will log important events and metrics.......................................23
TRA04.05-S17. The ETL process will not write personal or protected data in logs.......................23
TRA04.05-S18. The ETL application will be able to increase/decrease log detail.........................23
TRA04.05-S19. The ETL application will adhere to the MES log retention policy.........................23
TRA04.05-S20. The ETL process will log all defined and undefined errors...................................25
TRA04.05-S21. The ETL process will anticipate failures and perform prevention procedures.....25
TRA04.05-S22. The ETL process will handle errors as defined during set-up...............................25
TRA04.05-S23. The ETL process will send an alert in the event of an exception..........................25
TRA04.05-S24. The ETL process will capture full error context....................................................25
TRA04.05-S25. Each ETL process will document its business purpose and technology
components...................................................................................................................................26
TRA04.05-S26. The ETL will be executed per the configurations in scheduler.............................28


SCDHHS expressly restricts the distribution of this document to include only those SCDHHS
staff, SCDHHS processing environment contractors, or any entity given explicit access to this
document with SCDHHS executive or management approval.
Intended audience
The SCDHHS Technical Reference Architecture (TRA) audience includes SCDHHS staff, trading
partners, and third-party vendors who will be implementing, integrating, managing, or
operating the SCDHHS Medicaid Enterprise System (MES). The SCDHHS TRA provides the
reference model by which technologies and components of the SCDHHS MES will be measured
for development and implementation.
Trademarks
Microsoft and Excel are registered trademarks of the Microsoft Corporation.


1. Introduction
Extract, transform, and load (ETL) processing broadly refers to processes that extract large
volumes of data from source systems, transform the data to fit into the schema of the
destination systems, and load the data into the target (destination) systems. Although ETL
refers to the three distinct steps (extract, transform, and load), these major steps can be
performed in both the E-T-L and E-L-T pattern, depending on the purpose of the process. For
simplification, unless explicitly specified, when this document uses the term 'ETL' the document
refers to the general ETL process in either the E-T-L or E-L-T pattern.
The ETL process involves more than just extracting, transforming, and loading data. For
example, the ETL process designer needs to understand the schema of the source database and
the schema of the target database. Other ETL components include data validation, process
logging, authentication, exception handling, etc. Section 5. ETL Process Standard describes the ETL process in detail, and Section 6. Use Cases presents sample use cases to load data into transactional databases and to perform transformation in the target systems.
The ETL process can be designed and executed in a number of technologies, for instance, database stored procedures, Java programs, and ETL tools. However, the South Carolina Department of Health and Human Services (SCDHHS) Medicaid Enterprise System (MES) uses a commercial off-the-shelf (COTS) ETL application. COTS ETL applications offer an integrated development environment (IDE) that gives developers drag-and-drop functionality to configure data transformations.

COTS ETL applications are ideal for dealing with large volumes of data and accessing data from more than one source. Figure 1. ETL in the MES architecture highlights integration points where ETL processes may be appropriate to move MES data through the Enterprise Data Management Pipeline (EDMP). The TRA05 – Enterprise Data Services (EDS) supplement describes the EDS and its sub-components in further detail.

Figure 1. ETL in the MES architecture


Bringing the dissimilar data residing in the MES modules into the EDS requires a well-planned strategy and design, as well as attention to daily operations and maintenance tasks. Using COTS ETL tools in the MES provides the following benefits:
 Connectors to common data sources such as databases, flat files, mainframe
systems, etc.
 Data transformations across disparate data sources, including filtering, reformatting,
sorting, joining, merging, aggregation, and other operations
 Scheduling and monitoring
 Version control
 Unified metadata management
 Integration with business intelligence (BI) tools
 Built-in support for establishing templates and enabling standards and reuse
The information in this supplement applies to all COTS ETL applications approved for use in the
MES architecture. See Section 7. Tools and Technologies for technology details and the TRA06
– Technology Products Portfolio (TPP) supplement for the approved COTS ETL technologies.

2. ETL in the MES Architecture


The MES uses a service-oriented architecture (SOA) with system functionality implemented as
loosely coupled components and subsystems (illustrated in Figure 1. ETL in the MES
architecture). See the TRA – Technical Reference Architecture for a description of the
architecture.
ETL processes extract data from the systems of record (e.g. legacy member management
system) and send that data to the EDS for ingestion into the raw data lake (RDL), as shown in
Figure 2. MES ETL architecture. ETL processes also can extract data from the operational data
store (ODS) to populate data marts.
The MES data transformed by ETL processes include:
 Business data, such as member, eligibility, and service records encoded in JavaScript
Object Notation (JSON) or eXtensible Markup Language (XML).
 Binary data, such as Portable Document Format (PDF) and Microsoft® Word and
Microsoft Excel® files.
 Metadata describing database schema, file formats, etc.
The ESB provides the infrastructure between the ETL and the EDS.


Figure 2. MES ETL architecture


3. Two ETL Patterns—E-T-L and E-L-T


The pattern a particular ETL interface follows depends on enterprise factors including the
following conditions:
 The target environment (server and database) has storage and processing capacity to
perform the necessary computations and transformations.
 The staging environment is more capable (faster, safer, and more reliable) of doing the
transformation than the target environment.
 The regulatory environment permits the process to create copies of data. That is, the
process is permitted to stage data outside the target database.

3.1. E-T-L Pattern


With the E-T-L pattern, the process stages and transforms the data in an intermediate database (neither the source nor the destination database) before loading the data into the destination.
Considerations for choosing the E-T-L
pattern include the following scenarios:
 The target location is not
optimized for transformation of
data and the demand for
additional processing resources
to transform data would
negatively affect performance.
 The target location is a file or
another medium that does not
allow for staging data.
The E-T-L pattern example (illustrated in Figure 3. E-T-L pattern steps) shows the steps to extract data from the source system and to load the data into the staging tables. Data transformation occurs according to the business rules of the process within the staging tables. Transformed data is loaded into the destination tables and the process is complete. Throughout the process, logs document the steps performed on the extracted records. If an error occurs at any point, the appropriate exception handling is executed and the error is logged.

Figure 3. E-T-L pattern steps

3.2. E–L–T Pattern


E-L-T patterns perform transformations
on the target platform and avoid the
intermediate transformation
environment. E-L-T patterns require
better infrastructure and greater
processing capability; however, using the
E-L-T pattern results in:
 Fewer data hops
 Reduced data transit times
 Reduced network traffic
 Faster data loads
The E-L-T pattern example (illustrated in Figure 4. E-L-T pattern steps) shows the steps to extract the data from the source system and to load the extracted data directly into the destination database. Once the data is loaded into the destination database, the transformation routines convert the data into the format required by the business rules. Process logs document the activities performed on the extracted records. If an error occurs, the appropriate exception handling is performed and an error log documents the event.

Figure 4. E-L-T pattern steps
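A minimal sketch of the load-then-transform sequence follows, with sqlite3 standing in purely for the target platform's own engine and the trim/uppercase rule standing in for a real business rule.

```python
import sqlite3

# Illustrative E-L-T: raw data lands in the destination first, then the
# transformation runs inside the target database itself.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE member (id INTEGER, name TEXT)")

# Load step: untransformed values are written straight to the destination.
db.executemany("INSERT INTO member VALUES (?, ?)", [(1, " doe "), (2, " roe ")])

# Transform step: the business rule is applied by the target's own engine,
# avoiding an intermediate staging hop.
db.execute("UPDATE member SET name = UPPER(TRIM(name))")

names = [r[0] for r in db.execute("SELECT name FROM member ORDER BY id")]
```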

4. ETL Process Setup


Direct database connections are preferable for ETL processing. However, when the source
and/or destination locations do not support direct data access, file-based exchanges can also be
used.
Some partnering applications may have pre-existing interfaces and practices, for instance, to
always extract required data into a file and to send the file to the requesting system.
Accommodating the partner’s practice allows uniformity across the partner’s enterprise and
allows the partner to leverage the existing code base and design experience. This may be especially true when partnering with legacy applications that are nearing their end-of-life, or with partners that have limited resources.

TRA04.05-S1. The ETL team will establish failure response scenarios.

The ETL team should agree on, and establish, failure response scenarios depending on the type of exception. For example, if certain records do not meet the structure requirements, those rows should be rejected and the process should continue; if the entire table is missing a field, the process should log the appropriate error code and halt further processing.
Some exception handling scenarios include the following:
 Detect an error, stop the process, and present the error code
 Detect an error and write the record to an error table with the corresponding code
 Detect an error, write the record to both the target and the error table with the error code, and flag the record as an error in the target table
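The row-level-reject versus halt-on-structural-error distinction can be sketched as follows; the field names and error codes are hypothetical, not actual MES codes.

```python
# Illustrative failure responses: reject bad rows and continue, but halt
# outright when the structure itself (a required field) is missing.
REQUIRED_FIELDS = {"member_id", "eligibility_date"}  # hypothetical schema

def process(rows):
    loaded, errors = [], []
    # Structural failure: a required field is absent, so halt with an error code.
    if rows and not REQUIRED_FIELDS <= rows[0].keys():
        raise ValueError("E-001: required field missing; halting")
    for row in rows:
        if row.get("member_id") is None:
            # Row-level failure: route to the error table and continue.
            errors.append({**row, "error_code": "E-002"})
        else:
            loaded.append(row)
    return loaded, errors

loaded, errors = process([
    {"member_id": "A1", "eligibility_date": "2016-01-01"},
    {"member_id": None, "eligibility_date": "2016-02-01"},
])
```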

TRA04.05-S2. ETL processes will log exceptions in accordance with the enterprise standards.

Processes for all scenarios must embed logging mechanisms that adhere to the standards for documenting process steps for process control and audits.
All potential and anticipated errors shall be assigned unique error codes and descriptions.
See also Section 5.7.2. Exception Handling.

TRA04.05-S3. Changes affecting the ETL process will adhere to ITIL framework.

Changes affecting the ETL process will be made according to the Information Technology
Infrastructure Library (ITIL) change management process and will be subject to version and
configuration management. Examples of changes affecting the ETL process include:
 File layout changes
 Database schema changes
 Addition or removal of data elements
The TRA09 – Project Delivery Framework supplement describes the SCDHHS implementation of
ITIL.

4.1. Direct Database Connections


With direct database connections, the source database team provides the following connection
details:
 Database host
 Database name
 Port
 Service ID
 Password
 Appropriate drivers
For the source system, the database access will be read-only and for the target system, the
database access will be read-write.

TRA04.05-S4. Database connection details will be set up as an encrypted configuration of the ETL tool.

The connection details will be set up as an encrypted configuration of the ETL tool and will under no circumstances be hard-coded or human-readable.
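The principle can be sketched as follows. The XOR "cipher" below is a stand-in only, used so the sketch stays self-contained; a real deployment relies on the ETL tool's built-in credential encryption, and the file path and field names are hypothetical.

```python
import json
import os
import tempfile

def load_connection_config(path, decrypt):
    """Read encrypted connection details from the ETL tool's configuration
    store and decrypt them in memory only; nothing at rest is hard-coded
    or human-readable. `decrypt` stands in for the tool's real cipher."""
    with open(path, "rb") as f:
        return json.loads(decrypt(f.read()).decode())

# Toy XOR cipher as a stand-in ONLY -- never a substitute for the ETL
# tool's credential encryption.
KEY = 0x5A
def toy_cipher(data):
    return bytes(b ^ KEY for b in data)  # XOR is its own inverse

cfg = {"host": "db.example.internal", "port": 1521, "service_id": "etl_ro"}
fd, path = tempfile.mkstemp(suffix=".cfg")
with os.fdopen(fd, "wb") as f:
    f.write(toy_cipher(json.dumps(cfg).encode()))  # encrypted at rest

loaded = load_connection_config(path, toy_cipher)
os.remove(path)
```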

TRA04.05-S5. Database table structures will be managed through the metadata repository.

The table structure of the source and target databases will be managed through the metadata
repository. (See Section 5.2.3. Metadata Repository.)

4.2. File-Based Exchanges


ETL processes often consume files as input or produce files as output.

TRA04.05-S6. The ETL process will comply with TRA04.06 – Managed File Transfer (MFT).

To ensure secure transmission of files across enterprise systems and between internal and
external partners, the ETL process will use a secure file transport mechanism (for both input
and output files). File transfers in the MES architecture conform to the standards described in
the TRA04.06 – Managed File Transfer (MFT) supplement.
The integrating applications agree on the file server and the directory to be used to transport
files based on the systems’ ability to transport the files in and out of the enterprise firewall.
In addition to the file format, the structure of fields in the file is also agreed upon. For example, in delimited files, the sequence of fields should be pre-defined and distributed among the parties, and in XML files, the nested relationships should be shared during the interface design stage.
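Parsing against such a pre-agreed field sequence can be sketched as follows; the field names and delimiter are hypothetical values that both parties would fix at interface design time.

```python
# Illustrative: parse a pipe-delimited record using the field sequence
# agreed upon at interface design time (field names are hypothetical).
FIELD_SEQUENCE = ["member_id", "last_name", "first_name", "dob"]

def parse_record(line, delimiter="|"):
    values = line.rstrip("\n").split(delimiter)
    if len(values) != len(FIELD_SEQUENCE):
        raise ValueError(
            f"expected {len(FIELD_SEQUENCE)} fields, got {len(values)}")
    # Field meaning is positional, so the agreed sequence is what makes
    # the record interpretable.
    return dict(zip(FIELD_SEQUENCE, values))

record = parse_record("A1001|Doe|Jane|1980-04-02")
```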

TRA04.05-S7. File schemas will be managed through the metadata repository.

The schemas of the files being exchanged shall be stored in the metadata repository. (See Section 5.2.3. Metadata Repository.)


5. ETL Process Standard


In practice, the extraction, transformation, and loading steps may overlap and repeat several
times within the execution of a business process. For example, an ETL process may extract
source data, load data into staging tables for cleaning, and then extract data again to load data
to target staging tables to apply transformation logic before final extraction and loading into
the final destination table. Figure 5. Detailed ETL process illustrates the ETL process.
The standard for ETL processes includes the following structure and considerations:
 Process steps
o Authentication and authorization
o Data access
o Data extract
o Data validation
o Data transformation
o Data load
 Common requirements (across each process step)
o Process logging
o Exception handling
o Component documentation
 Operational considerations
o High availability
o Scheduling
o Monitoring


Figure 5. Detailed ETL process


5.1. Authentication and Authorization


Authentication refers to the process by which the username and password (or authentication token) is validated. Authorization is the process by which the ETL application determines which activities the logged-in user is allowed to perform. See the TRA01.02 – Access Control and Identity Management supplement for more information about authentication and authorization.

TRA04.05-S8. ETL processes will adhere to TRA01.02 – Access Control and Identity Management.

During the execution of ETL process activities, the ETL processes access files, databases, and web services (referred to collectively as end-points). Authentication and access to end-points are performed according to the TRA01.02 – Access Control and Identity Management supplement.
The integrating systems shall provide the necessary login credentials and Uniform Resource
Identifier (URI) information to the ETL process.
The credentials shall be securely maintained per the guidelines laid out in the TRA01.02 – Access Control and Identity Management supplement.

TRA04.05-S9. At each step in the ETL process, access to the data will be controlled.

ETL processes may make multiple copies of data in staging areas, such as extracted flat files or
temporary relational tables. At each step, the data must be subjected to access control to avoid
inadvertent access.
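One common control over staged copies, sketched below with hypothetical paths, is to restrict file permissions on every intermediate file so only the ETL service account can read it.

```python
import os
import stat
import tempfile

# Illustrative: intermediate staging files created by the ETL process are
# restricted to the ETL service account, so copies of extracted data are
# not inadvertently readable by other accounts on the host.
def create_staging_file(directory):
    fd, path = tempfile.mkstemp(dir=directory, suffix=".stage")
    os.close(fd)
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # owner read/write only (0o600)
    return path

staging_dir = tempfile.mkdtemp()
path = create_staging_file(staging_dir)
mode = stat.S_IMODE(os.stat(path).st_mode)
os.remove(path)
os.rmdir(staging_dir)
```

The equivalent control for temporary relational tables is a grant restricted to the ETL service account.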

5.2. Data Access


Before extracting data from a source or loading data into the
target, the ETL process needs to connect to the data
source/target. ETL processes in the MES architecture must
support many kinds of data stores including, but not limited to,
the following:
 Structured files
 XML
 JSON
 Delimited files
 Fixed-width files
 Relational databases
 Non-relational databases
 NoSQL databases
Direct database connections are preferable for ETL processing. However, when the source
and/or destination locations do not support direct data access, file-based exchanges can also be
used. Figure 6. MES ETL connection architecture illustrates the MES ETL architecture for both database and file-based data exchanges. The use cases in Section 6. Use Cases provide details
about the enactment of this architecture.

Figure 6. MES ETL connection architecture


TRA04.05-S10. The source systems will ensure the ETL processes have access to the source data.

Each source system ensures that the ETL process has access to the source data. The level of access given to the ETL application should be the minimum required to reach only the data needed for the ETL process. A common alternative is to have the application support teams extract the data and provide the extracted data to the ETL process in the form of flat files or other staging formats.

TRA04.05-S11. The ETL tool will have read-only access to the source data.

The ETL tool extracts the data from the source database without updating any data. No other updates will be made to the source database during the read process.

5.2.1. Direct Database Exchanges


The preferred method for data exchange is to connect to the database (whether the source or
the target) using the ESB. The ETL application connects to the source database directly using a
Java Database Connectivity (JDBC) connection provided by the ESB and queries the data.
The ETL process identifies the tables with which to exchange data and imports the data structure of these tables into the metadata repository (see Section 5.2.3. Metadata Repository).
The table structure of the source and target databases will be managed through the metadata repository. (See Section 5.2.3. Metadata Repository.)
See also Section 4. ETL Process Setup for information about setting up direct database exchanges.
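JDBC itself is Java-side; as a language-neutral sketch of the same read-only extract pattern, sqlite3 stands in below for the ESB-provided source connection, and the table and data are hypothetical.

```python
import sqlite3

# Stand-in source database; in the MES this is a JDBC connection provided
# by the ESB, with sqlite3 used here purely for illustration.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE member (id INTEGER, name TEXT)")
src.executemany("INSERT INTO member VALUES (?, ?)", [(1, "Doe"), (2, "Roe")])

# Enforce read-only behavior for the extract session where the driver
# supports it (per TRA04.05-S11, the tool never updates source data).
src.execute("PRAGMA query_only = ON")
rows = src.execute("SELECT id, name FROM member ORDER BY id").fetchall()
```

With `query_only` set, any accidental write in the extract session fails rather than modifying the source.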

5.2.2. File-Based Exchanges


When direct database access (either at the source or destination) is not feasible, file-based
exchanges can be leveraged to integrate applications according to a pre-defined file format and
structure.
Common file formats include:
 Delimited flat files, including comma-separated value (CSV), tab-separated value (TSV), and other character-delimited files (such as pipe-delimited, |)
 Fixed-width flat files, where fields are defined by fixed character positions
 XML files
 JSON files
In addition to the file format, the sequence of fields should conform to the pre-defined structure agreed upon when setting up the ETL process. The metadata repository stores the schemas of the files being exchanged. (See Section 5.2.3. Metadata Repository.)
To ensure secure transmission of files across enterprise systems, the ETL process will use a
secure file transport mechanism (for both input and output files) as specified in the TRA04.06 – Managed File Transfer (MFT) supplement.


See also Section 4. ETL Process Setup for information about setting up file-based exchanges.

TRA04.05-S12. The ETL application will perform file validations.

File-based exchanges are sensitive to file structure changes and require validations to avoid
inadvertent data corruption. Hence, the ETL application will validate that the file format and file
encoding are as specified (when setting up the ETL process) before reading and staging the
data.
When validation fails, the ETL process will handle the exception gracefully and terminate execution (see Section 5.7.2. Exception Handling). Examples of file-based processing validation failures include:
 An ETL processor encounters a comma-separated file instead of a pipe-delimited file
 An ETL processor encounters a file with only 9 fields while it is expecting 10 fields
 A field that usually takes up 9 characters is extended to 10 characters, thereby invalidating all subsequent mapped fields
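Such structural checks can be sketched as follows; the delimiter and field count are hypothetical values that would in practice come from the metadata repository.

```python
# Illustrative file-structure validation before staging: the expected
# delimiter and field count come from the schema agreed at setup time.
EXPECTED_DELIMITER = "|"
EXPECTED_FIELDS = 10

def validate_line(line):
    if EXPECTED_DELIMITER not in line:
        raise ValueError("wrong delimiter: expected pipe-delimited input")
    n = len(line.rstrip("\n").split(EXPECTED_DELIMITER))
    if n != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, found {n}")

# A comma-separated line arriving where pipes were expected fails validation,
# so the bad file never reaches the staging tables.
try:
    validate_line("a,b,c,d,e,f,g,h,i,j")
    failed = False
except ValueError:
    failed = True
```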

5.2.3. Metadata Repository


Metadata is defined as “data about data”. Metadata typically describes database schemas, file
layouts, data types, transformation codes, and other descriptive definitions that allow for
dynamic execution and interpretation of data in flight. The metadata repository stores the
metadata associated with any data processing technologies, including the ETL tools. Mapping
between the metadata of the source and destination data sources occurs during the ETL design
process.
Internal logical and physical schemas for source systems (such as the member eligibility system)
are used to create the staging tables inside the ETL tool (such as Oracle Data Integrator, or ODI).
Metadata is necessary to extract the data from corresponding tables and fields and to ensure
the referential integrity of the extracted data.

5.3. Data Extraction


In the data extraction step, the ETL process retrieves data according to its business purpose.
The extracted data may vary from a full-database scan or a full-table extract to records that
were updated within a specific time period. The scope of data will be defined in the design
phase of the ETL process. For the sake of information security and processing performance,
only data

directly attributed to the business purpose of the ETL should be accessed and/or extracted.
The scope of data extraction is more of a consideration for direct database exchanges, where
filtering of the data is built into the ETL process. For file-based exchanges, data filtering is
outside the span of control of the ETL process and is designed into the source system batch
processes.
Metadata stored in the metadata repository is used to translate the structure of the data
source to intermediate structures, if any, for further processing.

5.4. Data Validation


The ETL process performs structural and business validation on
data at the source and prior to insertion into the target. These
validations depend upon the domain and are performed
according to the metadata in the metadata repository. ETL
tools perform some validations by default, such as checking to
see if the file is empty or if the file has a required file
extension (e.g., .txt, .doc, .xml). For other, slightly more
complex validations, the ETL developer will leverage the
capabilities of the ETL tool to code the validations according to
the purpose of the ETL process.

TRA04.05-S13. The ETL process will validate the structure of the database or file.

The ETL process will validate the structure of the database or files according to the data
exchange agreement and/or approved design. Records that fail structural validation are
rejected from further transformation and loading; the file shall be rejected after proper error
handling.

TRA04.05-S14. The ETL process will validate that the data follows specified business rules.

The ETL process will validate the data to ensure that the data follows certain business rules, for
instance:
 A phone number should have 10 characters with all of them being numeric.
 The value in the state field should be a valid state code.
 The email field should have one and only one @ character.
 A member address record contains a valid address according to the USPS.
 A valid ICD-10 code is provided in a claims record.
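Simple business-rule validations such as the first three listed above can be sketched as small, reusable checks. The state-code list below is abbreviated for illustration; a real implementation would draw valid codes from the metadata repository:

```python
# Illustrative business-rule checks; the state-code set is abbreviated.
import re

def valid_phone(value):
    """A phone number should have 10 characters, all of them numeric."""
    return bool(re.fullmatch(r"\d{10}", value))

VALID_STATE_CODES = {"SC", "NC", "GA"}  # abbreviated list for illustration

def valid_state(value):
    """The value in the state field should be a valid state code."""
    return value in VALID_STATE_CODES

def valid_email(value):
    """The email field should have one and only one @ character."""
    return value.count("@") == 1

assert valid_phone("8035551234")
assert not valid_phone("803-555-1234")   # hyphens fail the all-numeric rule
assert valid_state("SC")
assert valid_email("member@example.com")
assert not valid_email("a@b@c")
```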

5.5. Data Transformation

The transformation step processes data from the source data
schema to the target data schema. During this process,
additional data processing can take place, depending on the
business requirements of the ETL process.

TRA04.05-S15. The ETL tool will have read/write access to staging and destination databases.

The data transformation process may include converting data
into a common format, sorting, joining data, etc. Access to
target systems will be read-write and restricted to the purpose
of the update.
Activities in the transformation step include:
 Transporting extracted data to a staging area.
 Applying transformation according to the business rules of the ETL process design.
 Loading transformed data into temporary staging tables inside the ETL application
before loading the data into the target database.
Transformation functions include:
 Data mapping from source format to destination format
 Data aggregation, often used in data warehousing
 Lookup transformation for the purpose of data validation and data quality
improvements
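A minimal sketch of these transformation functions follows, using hypothetical source and destination field names (the real mappings would come from the metadata repository):

```python
# Illustrative transformation sketch; all field names and lookup values
# are hypothetical, not taken from the SCDHHS data model.
from collections import defaultdict

# Source-to-destination field mapping.
FIELD_MAP = {"mbr_id": "member_id", "pd_amt": "paid_amount"}

# Lookup table used for data validation and data quality improvement.
STATE_LOOKUP = {"45": "SC"}

def transform(record):
    """Map a source record to the destination format and resolve lookups."""
    out = {dest: record[src] for src, dest in FIELD_MAP.items()}
    out["state"] = STATE_LOOKUP.get(record["state_cd"], "UNKNOWN")
    return out

def aggregate(records):
    """Aggregate paid amounts per member, as a warehouse load might require."""
    totals = defaultdict(float)
    for r in records:
        totals[r["member_id"]] += float(r["paid_amount"])
    return dict(totals)

rows = [{"mbr_id": "M1", "pd_amt": "10.50", "state_cd": "45"},
        {"mbr_id": "M1", "pd_amt": "4.50", "state_cd": "45"}]
staged = [transform(r) for r in rows]
assert staged[0]["state"] == "SC"
assert aggregate(staged) == {"M1": 15.0}
```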

5.6. Load Step


During the load step of the ETL process, data is loaded directly
into the target database or file, depending on the requirement.
Once the data is loaded into the desired tables or files, these
loaded tables/files may in turn be the data source for a
different ETL process. This depends on the nature and
complexity of the applications in the enterprise.
Some of the extract step design considerations apply to the
loading step design, for example, understanding the target
schema and format, the method of connecting to the target,
and the means to transport the data.
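As a rough illustration of the load step, the sketch below loads staged rows into a target table inside a transaction. An in-memory SQLite database stands in for the real target; the table and column names are hypothetical:

```python
# Load-step sketch: SQLite stands in for the real target database,
# and the table/column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member (member_id TEXT PRIMARY KEY, state TEXT)")

staged_rows = [("M1", "SC"), ("M2", "NC")]

with conn:  # commit on success, roll back on error
    conn.executemany("INSERT INTO member (member_id, state) VALUES (?, ?)",
                     staged_rows)

count = conn.execute("SELECT COUNT(*) FROM member").fetchone()[0]
assert count == 2
```

Wrapping the insert in a transaction keeps a partially failed load from leaving the target in an inconsistent state, which also supports the restartability considerations in Section 5.8.4.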

5.7. Common Infrastructure


In addition to the standard ETL steps previously discussed, a common infrastructure spans the
ETL process end-to-end. These common infrastructure requirements define how process
logging, error logging, and exception handling are performed on each ETL process. These
requirements, at a minimum, will provide additional structure to the operational readiness
review of ETL processes.

5.7.1. Process Logging


ETL process logs provide process execution information to enterprise operations and
maintenance team members, such as IT helpdesk, system administrators, and developers. The
systems administrators and business users shall decide on an optimum logging level and
configure the application accordingly, taking into account technical and compliance
requirements.

TRA04.05-S16. The ETL process will log important events and metrics.

The ETL process shall have a mechanism to log important events and metrics before, during,
and after the execution of the process.

TRA04.05-S17. The ETL process will not write personal or protected data in logs.

No protected health information (PHI), personally identifying information (PII), or taxpayer
information should be written to log files.

TRA04.05-S18. The ETL application will be able to increase/decrease log detail.

The ETL application, or any program block used to execute the ETL process, should have a
customizable process logging mechanism that allows adjustment for the level of detail captured
in process log files.
Detailed logs assist investigations; however, creating and maintaining detailed logs is expensive
in terms of processing power and memory. Administrators should be able to adjust the ETL
process logging level.

TRA04.05-S19. The ETL application will adhere to the MES log retention policy.

MES shall have a policy to determine the retention time of the logs. A shorter log retention
period ensures that storage does not quickly fill up with the logs generated daily, while a
longer retention period helps analyze ETL loads and may fulfill regulatory requirements.
Systems operations and maintenance staff can use process logs to perform an initial analysis of
process execution. To ensure that appropriate, and consistent, data is available for analysis, the

following events will be captured consistently across all ETL processes.


Table 1. Events and metrics to record in ETL process logs

Event/Metric             Description

Process name             The process name, according to the Configuration Management
                         Database.

Start and stop events    The beginning and ending time stamps for the ETL process as a
                         whole, as well as for the individual steps, should be stored.

Status                   Process steps can succeed or fail individually, and as such, their
                         status (not started, running, succeeded, or failed) should be
                         logged individually.

Errors and other         While logging failures and anomalies often consumes the most
exceptions               time when building a logging infrastructure, these logs also
                         yield the most value during testing and troubleshooting. See
                         Section 5.7.2, Exception Handling, for failure response
                         scenarios, logging requirements, and information logged for
                         errors and exceptions.

Audit information        This can vary from simply capturing the number of rows loaded
                         in each process execution, to a full analysis of row count and
                         dollar value from source to destination.

Testing and debugging    This is particularly useful during the development and testing
information              phase, most notably for processes that are heavy on the
                         transformation part of ETL.

Security events          Security events, such as user login, login time, etc.

5.7.2. Exception Handling


Any application process, including the ETL, is bound to encounter unexpected errors. In
anticipation of such an event, the processes should be designed to handle the errors gracefully.
The process should either quit or execute an alternate routine based on the design and should
provide the systems administrators enough information regarding the occurrence to perform
routine fault-analysis.

TRA04.05-S20. The ETL process will log all defined and undefined errors.

The ETL process will log any and all defined or undefined errors occurring in all stages of the ETL
process.

TRA04.05-S21. The ETL process will anticipate failures and perform prevention procedures.

The ETL process shall contain routines to perform a sanity check on the data before processing
it. The routines should be designed to automatically correct some common data discrepancies
and error scenarios, for example, removing preceding zeros for certain data elements,
removing hyphens in a phone number, etc.
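The auto-correction routines mentioned above (removing preceding zeros and removing hyphens from a phone number) might look like the following sketch:

```python
# Illustrative sanity-check/auto-correction routines.

def normalize_phone(value):
    """Remove hyphens and spaces from a phone number before validation."""
    return value.replace("-", "").replace(" ", "")

def strip_leading_zeros(value):
    """Remove preceding zeros from an identifier-style data element."""
    return value.lstrip("0") or "0"   # keep a single zero for all-zero input

assert normalize_phone("803-555-1234") == "8035551234"
assert strip_leading_zeros("000123") == "123"
```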

TRA04.05-S22. The ETL process will handle errors as defined during set-up.

During set-up, the ETL team agrees on, and establishes, the error handling scenarios. See
Section 4, ETL Process Setup.

TRA04.05-S23. The ETL process will send an alert in the event of an exception.

The ETL process will send alerts reliably and consistently in the form of email in the event of an
exception. The list of individuals alerted should be customizable by the Enterprise Services (ES)
administrators.

TRA04.05-S24. The ETL process will capture full error context.

In the event of an error, the ETL process will capture and save the information listed in Table 2,
Information logged in event of ETL process error. All potential and anticipated errors shall be
assigned unique error codes and descriptions. Based on the error that occurred, the error
handling routine should be designed to capture the appropriate error code.
Table 2. Information logged in event of ETL process error

Element             Attribute           Comments

Process Info                            Details about the process logging the error
                    Process Name        Name of the process in which the error occurred
                    Process ID          The process ID assigned
                    Purpose             The purpose of the process
                    Number of Records   The number of records associated with the error code
                    Application Type    ODI, Oracle Service Bus (OSB), or service-oriented
                                        architecture (SOA)
                    Time Stamp          The timestamp at which the error occurred
Error Code                              A unique code for each potential error
Error Description                       A unique description associated with the error code
Error Message                           Actual error message or exception at a specific step
                                        in the process
Error Severity                          Critical, high, medium, or low
Environment                             Development, quality assurance, user acceptance
                                        testing, or production

5.7.3. Component Documentation


Component documentation is descriptive text that programmers and operations staff can use
to analyze code. All components must contain documentation blocks that describe the purpose,
description, dependencies, and version control history.

TRA04.05-S25. Each ETL process will document its business purpose and technology
components.

Each ETL process job will specify the supported business purpose and the technology
components that directly feed data to the ETL job or consume data directly from the ETL job.
ETL jobs will not be allowed to migrate to production without documentation of the business
purpose. In addition, each ETL process will include the following documentation:
 Programmer: Name of the programmer who last updated the code.
 Version: Version information and update history.
 Last update date: Date of the last code update.

 Dependency: Dependencies on other components, as well as which components are
dependent upon this process.

5.8. Operational Considerations


The operational environment and business criticality of ETL processes have to be carefully
considered prior to deploying the processes in the production environment.

5.8.1. High Availability and Failover Mechanisms


High availability systems continuously operate with as little down-time as possible. Highly
available systems maximize the amount of time that they are operational so that end users can
access and use the system despite temporary network or hardware failures. High availability is
accomplished by providing failover environments with load balancing and disaster recovery
technologies that can manage any disruptions that are encountered without the need for
human interaction.

5.8.2. Causes of Infrastructure Failure


The main causes of ETL process failures include:
 Errors in the ETL process/workflow
 Systemic failures
Errors in the ETL process/workflow include unexpected data in the source files, faulty logic
during the execution, and running out of disk space. These types of failures can be handled at
the code level to avoid system crashes. Section 5.7.2.. Exception Handling describes
considerations for handling such incidents.
The second scenario is not directly related to a code issue. Potential incidents include:
hardware failures, web server failures, and scheduler failures. These issues will be handled at
the systems level and the ETL applications will be configurable to handle such contingencies.

5.8.3. Clustering
One of the methods of handling systems-level failures and achieving high availability is
clustering: implementing a group of hosts that acts like a single system to provide continuous
uptime. Clustering is generally used for load balancing and failover purposes to help make the
system highly available.
The MES clustered environment for ETL processing has multiple web servers and schedulers.
This environment provides failover capabilities for load balancing and scheduling to handle

situations such as one of the nodes in the system going down. The ETL tool shall have a
mechanism to replicate/synchronize the configuration across the server nodes.

5.8.4. Restartability
Restartability is the ability to restart an ETL job if a processing step fails to execute properly.
This avoids the need for manual cleanup before a failed job can restart. Depending on
the design of individual jobs, the ETL design will address the ability to restart processing at the
step where it failed as well as the ability to restart the entire ETL session.
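One possible restartability sketch uses a checkpoint file to record each completed step, so a restart skips steps that already succeeded instead of re-running the entire session. The step names and checkpoint format here are assumptions, not part of this standard:

```python
# Restartability sketch: a checkpoint file records completed steps so a
# restarted job resumes at the failed step. Step names are hypothetical.
import json
import os
import tempfile

STEPS = ["extract", "transform", "load"]

def run_job(checkpoint_path, steps=STEPS, fail_at=None):
    """Run steps in order, checkpointing after each; skip completed steps."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for step in steps:
        if step in done:
            continue  # already completed in a previous run
        if step == fail_at:
            raise RuntimeError(f"step {step} failed")
        done.append(step)
        with open(checkpoint_path, "w") as f:
            json.dump(done, f)  # persist progress before the next step
    return done

path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run_job(path, fail_at="transform")   # first run fails mid-job
except RuntimeError:
    pass
assert run_job(path) == STEPS            # restart resumes at the failed step
```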

5.9. Scheduling
A job scheduler can be part of the ETL tool or a compatible application that enables
developers and systems administrators to control and monitor the execution of ETL processes
across the enterprise.
After design, development, and testing, ETL processes need to be scheduled to be executed on
the application server. These processes can be designed to be executed separately in a
sequence of simple standalone tasks or together as a combination of tasks.
Examples of standalone tasks include:
 Creating an extract file from a database
 Validating addresses against the USPS
An example of a sequence of tasks:
 Extracting data from source table and loading the records into a staging table
 Running a transformation logic on staged data using data maps

TRA04.05-S26. The ETL will be executed per the configurations in scheduler.

Schedules define when and how often the ETL process or group of processes need to be run.
ETL processes can run repeatedly based on an interval defined in the scheduler.
A common definition of the schedule contains the following information:
 The date and time when a process should begin to run
 The frequency in which the process should run
 The date and time when the process should end its runs
 The process identification details (e.g. the process name, process code, etc.)
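A schedule definition along these lines could be represented as follows; the process names, dates, and frequency are hypothetical values chosen for illustration:

```python
# Illustrative schedule definition mirroring the fields listed above;
# all names and values are hypothetical.
from datetime import datetime, timedelta

schedule = {
    "process_name": "member_ods_load",
    "process_code": "ETL-0042",
    "start": datetime(2016, 11, 1, 2, 0),   # date and time the runs begin
    "end": datetime(2017, 11, 1, 2, 0),     # date and time the runs stop
    "frequency": timedelta(hours=24),       # how often the process runs
}

def next_runs(sched, count):
    """Enumerate the first `count` run times defined by the schedule."""
    runs, t = [], sched["start"]
    while len(runs) < count and t <= sched["end"]:
        runs.append(t)
        t += sched["frequency"]
    return runs

runs = next_runs(schedule, 3)
assert runs[0] == schedule["start"]
assert runs[1] - runs[0] == timedelta(hours=24)
```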
Table 3. Considerations for scheduling ETL process jobs describes a number of things to consider
when scheduling ETL process.

Table 3. Considerations for scheduling ETL process jobs

Consideration Description

Sequence The sequence in which the process should run, for instance, reference
data must be loaded first.

Duration Processes are scheduled taking their end times into consideration. For
example, if a process must complete by 8 AM and might take 3 hours to
run, it needs to start before 5 AM.

CPU utilization CPU intensive processes should be scheduled at off-peak times.

Preconditions Events that must occur before running the process, for instance, a
particular process should be completed before the start of another
process.

5.9.1. Process Monitoring


A complex IT enterprise environment needs to execute ETL processes not only at scheduled
times, but also based on events (for instance, the arrival of a file) or on the success or failure
of other processes. The ETL scheduler should allow definition of the events, dependencies, time
schedules, and alerts. The scheduler should also provide a graphical view that enables the users
to monitor the progress of these processes by displaying various details such as start time,
completion status etc. The graphical console should also provide controls to stop or pause the
execution of processes.

5.9.2. Reporting
The scheduler should be able to generate a report daily and/or on demand covering all the
processes executed and details of their completion. Details to include in the reports:
 Process name and code
 Completion status
 Start time
 End time
 Duration
 Name of the server on which the task was run
 Exit code

5.9.3. Alerts and Notification


The scheduler should be configurable to send alerts and notifications to users based on process
execution progress. It should also allow users to associate a person or group of people with
each job so that only relevant messages are sent out.

6. Use Cases
This section provides a few examples of ETL processes in an operational environment that show
how standards are applied within the context of the MES.

6.1. Initial Load


Extracting the entire data set residing in a source database and then loading the data into a
target database is commonly known as an initial data load. In this scenario, the ETL process
takes place just once for the entire business requirement. This practice is followed in cases
such as:
 Sun-setting legacy applications
 Consolidation of two data sources
 Establishing an operational data store
 Establishing a data warehouse or data mart
Figure 5. Initial load use case explains the flow of data and tasks performed by various systems
in an initial load use case.

Figure 5. Initial load use case

6.2. Incremental Loads / Deltas


Incremental loads are the most common practice in data synchronization. If the data is updated
or new data is added to the system of record, the same changes must be reflected in other
related enterprise systems. For example, if new providers enroll in the provider management

system, the changes may have to be reflected in the claims and financial systems for invoices to
be processed correctly. In such cases, the provider management system becomes the source,
and the operational data store becomes the target. When the source data is being updated and
the target database needs to be in sync with the source database, the ETL process runs
periodically (at predefined intervals, such as hourly or monthly), bringing in data from
the source database to the target. Figure 6. Incremental loads/deltas use case illustrates this
incremental load use case.

Figure 6. Incremental loads/deltas use case


One major consideration while designing this interface is the mechanism to identify the new
data (also referred to as delta data or incremental data) to be extracted by the ETL application.
Identifying delta data is commonly done using one of the following:
 Time stamp on the relational database management system (RDBMS) data record in the
source database
 Transactional logs
 Process logs
 Daily files received from the source system
Incremental loads/delta use case examples include:
 Updating the ODS with member data from updated records in the eligibility system
 Updating eligibility history of a member in the ODS from updated data in the eligibility
system
 Receiving claims data from the Administrative Services Organization (ASO) and
uploading it to the ODS
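Timestamp-based delta identification, the first mechanism listed above, can be sketched as a filtered extract. The table and column names below are hypothetical, and SQLite stands in for the source RDBMS:

```python
# Timestamp-based delta extraction sketch; SQLite stands in for the
# source RDBMS and the table/column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE provider (id TEXT, updated_at TEXT)")
conn.executemany("INSERT INTO provider VALUES (?, ?)",
                 [("P1", "2016-10-01T08:00:00"),
                  ("P2", "2016-10-20T09:30:00")])

def extract_deltas(conn, last_run):
    """Extract only records updated since the previous ETL run."""
    cur = conn.execute(
        "SELECT id, updated_at FROM provider WHERE updated_at > ?",
        (last_run,))
    return cur.fetchall()

# Only records touched after the last run are picked up.
deltas = extract_deltas(conn, "2016-10-15T00:00:00")
assert [row[0] for row in deltas] == ["P2"]
```

After each successful run, the high-water-mark timestamp would be persisted so the next run extracts only the records changed since then.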

6.3. Initial as Well as Ongoing Loads


This use case combines the initial load and the incremental loads use cases, and may be
followed when:
 A new central data store is implemented while the legacy systems continue to
operate and store data in their own databases
 A new source is added to an existing data warehouse
In this pattern, the entire source data is loaded initially into the target database (described in
Section 6.1, Initial Load). After new data has been entered into these source databases, the
ETL application identifies the deltas and performs periodic incremental loads into the target
database (described in Section 6.2, Incremental Loads / Deltas).
Figure 7. Initial as well as ongoing loads use case illustrates the initial and ongoing loads use
case.

Figure 7. Initial as well as ongoing loads use case

6.4. Transformation Performed on Target Systems


This section lists examples of processes that follow E-L-T instead of E-T-L. The transformation
of data into the desired format is performed in conjunction with (or after) the loading step.

6.4.1. Data Loads into Data Marts


Data marts are smaller sets of analytical data (as opposed to the ODS or a traditional data
warehouse) that are specific to a defined business purpose and are used to facilitate

business analytics. Data from one or more operational systems needs to be extracted and
copied into the data mart following steps similar to those listed in Section 6.1, Initial Load.
However, the transformation step takes place in the target system. Figure 8. Data loads into
data mart use case shows the steps to load data into the data mart using an E-L-T pattern.

Figure 8. Data loads into data mart use case

6.4.2. Data Loads into NoSQL Databases


Figure 9. Data loads into NoSQL database use case shows a flow of data where the ETL process
occurs twice before the data is deposited in the final destination. The first ETL process
extracts, transforms, and loads data into the raw data lake. The second ETL process extracts
data from the raw data lake, loads the data into the ODS, and transforms the data to meet the
ODS standards.

Figure 9. Data loads into NoSQL database use case

7. Tools and Technologies


Purpose Tool

Authentication Liferay Identity Manager

File transfer Oracle MFT

ETL Oracle Data Integrator

Execution Scheduling ESS (Enterprise Scheduling Services)

8. Appendices
8.1. Relevant Documentation
Document name Dependency

TRA01.02 – Access Control Definition of identity management implementation across
and Identity Management portal, database, and service assets. Defines the use of
service accounts that make component-to-component calls.

TRA04.02 – Web Services Development guide for web services and integration into the
ESB.

TRA04.06 – Managed File Specifies the mechanism by which files are transported
Transfer (MFT) between source and destination locations.

TRA04.07 – Electronic Data Guidance on when to use the X12 standards and when other
Interchange (EDI) formats (XML, JSON) are allowed. Decoding, interpretation,
and encoding of X12 files.

TRA05 – The Enterprise Data Overview of the data architecture and the data flow.
Services (EDS) framework

TRA06 – Technology Overview of the technology products portfolio (TPP) and the
Products Portfolio process governing the TPP.

8.2. Revision History


Version Number Date Author(s) Description of Change

0.1 07/12/2016 Gerhard Ungerer Initial draft

0.5 10/10/2016 Murali Vaddi Draft Version

0.6 10/11/2016 Nancy Hobbs Edited copy

1.0 10/25/2016 Gerhard Ungerer Final Edits, version 1.

8.3. Acronyms
Term Meaning

BI Business Intelligence

COTS Commercial off-the-shelf products

CSV Comma-separated values file

EDI Electronic data interchange

EDS Enterprise Data Service

ELT Extract, load, and transform

ES Enterprise Services

ETL Extract, transform, and load

FTP File Transfer Protocol

JDBC Java Database Connectivity

JSON JavaScript Object Notation (lightweight data-interchange format)

MEDS Medicaid Eligibility Determination System

MES Medicaid Enterprise System

MFT Managed File Transfer

MMIS Medicaid Management Information System

NoSQL Not only SQL (a class of non-relational databases)

ODI Oracle Data Integrator

ODS Operational data store

OSB Oracle Service Bus

PHI Protected health information

PII Personally identifying information

RDBMS Relational database management system

RDL Raw data lake

SOA Service-oriented architecture

TRA Technical Reference Architecture

TSV Tab-separated values file

URI Uniform Resource Identifier

XML Extensible markup language

8.4. Glossary
Term Definition

Cúram/ACCESS An eligibility determination system that will eventually replace MEDS;
it currently contains the MAGI Medicaid population. ACCESS is the
brand name given by the state to the COTS product purchased by
SCDHHS.

OnBase Enterprise content management (ECM) and process management
software.
