
Architectures, Features, and

Details of Data Integration Tools

Architectures and Marketplace


Lesson Objectives
• Discuss motivation for data integration tools
• Explain the differences between the ETL and ELT
architectures
• Reflect on market summary dimensions of
execution and vision

Motivation for Data Integration Tools
• Support initial population and refresh processes
• Project failures partly due to lack of tools and
poor performance
• Improve software productivity
– Integrated development environments
– Graphical and visual specification
– Minimize custom coding
• Achieve high performance

ETL Architecture
The transformation engine is independent of the DBMS for data warehouse tables.

[Figure: data sources → Extract, Transform, Load (ETL engine) → DW tables]

ELT Architecture
ELT uses a relational DBMS to perform transformations after extraction and loading.

[Figure: data sources → (1) Extract, (2) Load, (3) Transform (relational DBMS) → DW tables]

ELT architecture supporters emphasize the superior optimization technology in relational DBMS engines.
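
The difference between the two architectures can be illustrated with a minimal sketch. It assumes an in-memory SQLite database standing in for the data warehouse DBMS and a tiny hypothetical source (`SOURCE_ROWS`); the function names and the simplified validity checks are illustrative only and do not come from any particular tool.

```python
import sqlite3

# Hypothetical source rows: (store, amount as text), as they might arrive from a file.
SOURCE_ROWS = [("s1", "10.50"), ("s2", "n/a"), ("s1", "4.50")]


def is_number(text):
    """Return True when text parses as a float (simplified validity check)."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def totals(dw):
    """Sum the loaded amounts per store."""
    return dw.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    ).fetchall()


def run_etl(rows):
    """ETL: transform outside the DBMS (the 'engine'), then load only clean rows."""
    clean = [(store, float(amount)) for store, amount in rows if is_number(amount)]
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE sales (store TEXT, amount REAL)")
    dw.executemany("INSERT INTO sales VALUES (?, ?)", clean)
    return totals(dw)


def run_elt(rows):
    """ELT: load raw rows into a staging table, then transform with SQL in the DBMS."""
    dw = sqlite3.connect(":memory:")
    dw.execute("CREATE TABLE staging (store TEXT, amount TEXT)")
    dw.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    dw.execute("CREATE TABLE sales (store TEXT, amount REAL)")
    # The DBMS filters and converts; GLOB '[0-9]*' is a simplified numeric test.
    dw.execute(
        "INSERT INTO sales "
        "SELECT store, CAST(amount AS REAL) FROM staging "
        "WHERE amount GLOB '[0-9]*'"
    )
    return totals(dw)
```

Both functions produce the same warehouse contents for this sample; what changes is where the transformation work runs.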
Architecture Evaluation
• Major advantages
– ETL: DBMS independence
– ELT: superior optimization technology in relational DBMS
• Other issues
– ETL: support for more complex operations in transformations
– ELT: less network bandwidth required
• Hybrid tools for data integration
– Combination of both architectures possible

Data Integration Marketplace Features
• Diverse: (1) proprietary and (2) open source products from DBMS vendors and third-party vendors, plus (3) hybrid products, e.g., open source base products with subscription services for extended products and support
• Vibrant: a developing marketplace with continuous development across all product types, substantial product development, and consolidations
• At the end of 2014, a Gartner report estimated the data integration market size at 2.5 billion, with an annual growth rate of 10% through 2019.


Data Integration Tool Vendors
• Traditional vendor products
– Database vendors: Oracle, IBM, Microsoft
– Other vendors: SAP, Informatica, SAS, Information
Builders
• Open source with subscription services
– Pentaho Data Integration
– Talend Open Studio for Data Integration
– CloverETL


Common Features of Data Integration Tools

Lesson Objectives
• Describe common features of data integration
tools
• List some common transformation processes
supported by data integration tools

Data Integration Tools Feature Overview
• Essential features
– Integrated development environment (IDE): supports complex software projects with a source code editor, visual specification tools, debugger, and code generator.
– Workflow and component specifications: use rule and graphical specifications within procedural coding to indicate workflow and transformations. Some tools can generate code that can be customized for more flexibility.
– Data source connectivity
– Repository
• Secondary features: provided by most products, but sometimes require an extended product with an additional license.
– Job management: involves scheduling and monitoring the execution of data integration workflows.
– Change data capture: typically uses a publish and subscribe model to control change data availability and notify subscribers about change data availability.
– Data profiling: helps a data warehouse administrator assess quality levels in data sources.

IDE Overview

This screenshot depicts the IDE in Pentaho Data Integration for graphical design of workflows, known as transformations in Pentaho.


Workflow and Component Specification
[Figure: Pentaho Data Integration screenshot with a workflow and a specification window]

This screenshot was taken from Pentaho Data Integration. The workflow graphically depicts data flow among data sources and components. The Merge Join window provides details about this component in the workflow.
Workflow Components
Talend Open Studio organizes components into these major categories, each with its own icon.

• Processing (e.g., filter rows and filter columns)
• Orchestration (e.g., for loop and combine flows)
• Business intelligence (e.g., slowly changing dimension algorithms and bar chart)
• Database (e.g., Oracle bulk load and stored procedure execution)
• Data quality (e.g., compliance check for columns)
• File processing (e.g., file comparison and file copy)

Repository
A repository is a design database for all features of a data integration tool.

• Stores design objects and relationships
• Maintains dependencies
• Provides documentation

Job Management
Job management supports scheduling and monitoring of jobs for
executing data integration workflows.

• Scheduling
– Setting a schedule
– Repetition
– Conditional execution of steps
– Execution: start, stop, pause
• Monitoring
– Logging
– Performance alerts
– Reports
Data Profiling
Data profiling helps a data warehouse administrator and data owners understand and approve the data quality of data sources. A data profiling tool can reduce unexpected delays and surprises in populating and refreshing a data warehouse.

The MSSQL data profiler provides the following functions to understand data quality:
• Descriptive statistics about the distribution of values in a column (such as min and max)
• Null value ratio
• Uniqueness, showing the number of distinct values for a column or combination of columns
• Pattern matching coverage: coverage for values matching regular expressions
• Field relationships
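
The profiling measures listed above can be approximated in a few lines. The following is a minimal sketch using only the Python standard library; the function names, the use of `None` for null values, and the sample column are assumptions for illustration and do not reflect how any specific profiling tool computes its results.

```python
import re
from statistics import mean


def profile_column(values, pattern):
    """Profile one column; None stands for a null value (an assumption of this sketch)."""
    if not values:
        raise ValueError("cannot profile an empty column")
    non_null = [v for v in values if v is not None]
    # Count values whose full text matches the regular expression.
    matches = [v for v in non_null if re.fullmatch(pattern, str(v))]
    return {
        "null_ratio": (len(values) - len(non_null)) / len(values),
        "distinct_count": len(set(non_null)),
        "pattern_coverage": len(matches) / len(non_null) if non_null else 0.0,
    }


def describe_numeric(values):
    """Descriptive statistics for the non-null values of a numeric column."""
    non_null = [v for v in values if v is not None]
    return {"min": min(non_null), "max": max(non_null), "mean": mean(non_null)}


# Hypothetical column of postal codes, profiled against a five-digit pattern.
zip_profile = profile_column(["80202", "80202", None, "8020"], r"\d{5}")
amount_stats = describe_numeric([10, 20, None, 30])
```

A low pattern coverage or a high null ratio in such a report is the kind of early warning that helps avoid surprises during warehouse population.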
Change Data Capture
[Figure: change data capture with a publisher and a subscriber. Labels: source tables, logs, triggers, log extraction processes, change tables, publishing processes]

Change data capture uses a publish and subscribe model to control change data availability and notify subscribers about change data availability.
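
The publish and subscribe idea can be sketched as follows. This is a simplified in-memory illustration, not the mechanism of any particular DBMS: the class name `ChangePublisher`, its methods, and the callback signature are all hypothetical. Captured changes stay hidden until the publisher releases them, at which point every subscriber to that table is notified.

```python
from collections import defaultdict


class ChangePublisher:
    """Holds captured changes per table and releases them to subscribers on publish."""

    def __init__(self):
        self._changes = defaultdict(list)      # table name -> pending change records
        self._subscribers = defaultdict(list)  # table name -> notification callbacks

    def subscribe(self, table, callback):
        """Register a callback to be notified when changes for table are published."""
        self._subscribers[table].append(callback)

    def capture(self, table, operation, row):
        """Record a change (for example from a trigger or a log reader)."""
        self._changes[table].append((operation, row))

    def publish(self, table):
        """Make pending changes available and notify subscribers; return the batch size."""
        batch = self._changes.pop(table, [])
        if batch:
            for callback in self._subscribers[table]:
                callback(table, batch)
        return len(batch)


received = []
publisher = ChangePublisher()
publisher.subscribe("orders", lambda table, batch: received.append((table, batch)))
publisher.capture("orders", "INSERT", {"id": 1})
publisher.capture("orders", "UPDATE", {"id": 1, "status": "shipped"})
first_batch_size = publisher.publish("orders")
second_batch_size = publisher.publish("orders")  # nothing new to publish
```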


Module 5

Talend Open Studio


Lesson Objectives
• List major features of Talend Open Studio for
data integration
• Gain familiarity with Talend features for jobs and
transformations
• Explore more details about Talend Open Studio

Talend Product Editions
• Community edition with a standard open source
license
• Enterprise editions with a subscription service for
technical support and extended features
• Several enterprise and platform editions for the Talend Data Integration product

Data Integration Features

• Graphical job design using components
• Palette of data transformation components
• Job execution with database connectivity
• Metadata repository

Talend IDE

[Figure: Talend IDE screenshot. Labels: repository pane, component palette, canvas with job design, design pane, job pane]

Palette with Components
Category Prominent Components
Data quality
Databases
ELT
Processing
Internet
XML
Big data

Simple Job Design
In this screenshot, the Excel data source (SSExcelSource) contains rows to be loaded into the SSSales table, the fact table of the Store Sales data warehouse.

Component Details
For example, the tMap component uses a graphical display to support join specification. In this slide, the fields in the Excel data source are mapped to columns in the SSTimeDim Oracle table. The columns in the top part (row1) are from the input file. The columns in the bottom part (row2) of the window are from the SSTimeDim table. A "drag and drop" method is used to match the columns in the data source and table.
Job Execution Example

• The Excel input file contains 12 rows. The tSchemaComplianceCheck component rejects two rows for null value or data type violations, passing 10 rows to the tMap component.
• The tMap component rejects two rows for FK violations, passing 8 rows to the tOracleOutput component.
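
The row counts in this example can be reproduced with a small sketch of the same two-stage filtering. It does not use Talend; the functions `schema_check` and `foreign_key_check`, the column names, and the sample rows are hypothetical stand-ins that only mimic the roles described for tSchemaComplianceCheck and tMap.

```python
VALID_TIME_KEYS = {1, 2, 3}  # hypothetical keys present in the time dimension table


def schema_check(rows):
    """Split rows into those with valid types and those with nulls or wrong types."""
    passed, rejected = [], []
    for row in rows:
        ok = isinstance(row.get("time_key"), int) and isinstance(
            row.get("amount"), (int, float)
        )
        (passed if ok else rejected).append(row)
    return passed, rejected


def foreign_key_check(rows, valid_keys):
    """Split rows into those whose time_key exists in the dimension and those that do not."""
    passed = [row for row in rows if row["time_key"] in valid_keys]
    rejected = [row for row in rows if row["time_key"] not in valid_keys]
    return passed, rejected


# 12 hypothetical input rows: 8 clean, 2 with schema problems, 2 with unknown keys.
input_rows = [{"time_key": k, "amount": 10.0} for k in (1, 2, 3, 1, 2, 3, 1, 2)] + [
    {"time_key": 1, "amount": None},    # null value
    {"time_key": 2, "amount": "abc"},   # data type violation
    {"time_key": 99, "amount": 5.0},    # FK violation
    {"time_key": 98, "amount": 5.0},    # FK violation
]

compliant, schema_rejects = schema_check(input_rows)
loadable, fk_rejects = foreign_key_check(compliant, VALID_TIME_KEYS)
```

Keeping the rejected rows from each stage, rather than discarding them, mirrors the reject flows that make it possible to inspect why rows were not loaded.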
Summary
• Prominent open source data integration tool
• Supports graphical specification of
transformations, components, and job
management
• Install and use Talend for more details

Talend uses Open Core Model
• Feature-limited core product under a standard open source license
• Commercial versions with proprietary extensions
• Paid support services
• First commercial open source vendor
• Becoming widespread


Pentaho Data Integration


Lesson Objectives
• List major features of Pentaho Data Integration
• Gain familiarity with Pentaho features for jobs
and transformations
• Gain experience with Pentaho on the practice
exercise and assignment

Pentaho Products
• Platform for data integration, business analytics,
and big data
• Open core business model
• Pentaho Data Integration
– Visual designer for transformations and jobs
• Pentaho Business Analytics
– Interactive visual analysis
– Geo-mapping, heat grids, scatter/bubble charts
– Visualization plugins
• Pentaho Big Data Analytics

Pentaho Data Integration
• Editions
– Subscription service from Pentaho website
– Community edition: Kettle (from SourceForge)
https://sourceforge.net/projects/pentaho/
• Basic concepts
– Transformation with data flow among steps connected by hops
– Job with data flow among transformations and external entities
• Tools
– Spoon: graphical design of transformations and jobs
– Pan: execution of transformations
– Kitchen: execution of jobs

Transformations
• Step: process in a data flow
– Input/Output
– Transform: sort, split, concatenate, …
– Flow: filter rows
– Lookup: existence of rows, tables, files, …
– Join: merge join, multiway merge, …
– Validation: credit card, mail, data
• Hop: directed connection between steps
• Database connections
• Distributed processing: partition, cluster, …
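
The step and hop concepts can be modeled in a short sketch. This is not Pentaho code: the function `run_transformation`, the step names, and the sample rows are hypothetical, and the sketch assumes a single linear chain of steps with no branches or cycles. Each step is a function over rows, and each hop is a directed pair naming which step feeds which.

```python
def run_transformation(steps, hops, rows):
    """Run a linear chain of steps connected by hops and return the final rows."""
    successors = dict(hops)                      # step name -> next step name
    targets = set(successors.values())
    starts = [name for name in steps if name not in targets]
    if len(starts) != 1:
        raise ValueError("this sketch expects exactly one starting step")
    current = starts[0]
    while current is not None:
        rows = steps[current](rows)              # the step processes its input rows
        current = successors.get(current)        # follow the hop, if any
    return list(rows)


# Hypothetical steps: a flow step (filter), a transform step (sort), and a projection.
steps = {
    "filter": lambda rows: (r for r in rows if r["qty"] > 0),
    "sort": lambda rows: sorted(rows, key=lambda r: r["qty"]),
    "project": lambda rows: [r["item"] for r in rows],
}
hops = [("filter", "sort"), ("sort", "project")]

result = run_transformation(
    steps,
    hops,
    [{"item": "b", "qty": 2}, {"item": "x", "qty": 0}, {"item": "a", "qty": 1}],
)
```

Separating the step definitions from the hops is what lets a graphical designer rewire a data flow without touching the logic inside each step.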

Spoon IDE

[Figure: Spoon IDE screenshot. Labels: view pane, execution controls, canvas]

Example Transformations

Merge Join Step
