Motivation for Data Integration Tools
• Support initial population and refresh processes
• Project failures partly due to lack of tools and
poor performance
• Improve software productivity
– Integrated development environments
– Graphical and visual specification
– Minimize custom coding
• Achieve high performance
ETL Architecture
The transformation engine is
independent of the DBMS for
data warehouse tables.
(Diagram: multiple data sources feed a separate transformation engine, which loads the data warehouse tables.)
ELT Architecture
ELT uses a relational DBMS to
perform transformations after
extraction and loading.
(Diagram: data sources feed extraction and loading; transformation then occurs inside the relational DBMS.)
ELT supporters emphasize the
superior optimization technology
in relational DBMS engines.
Architecture Evaluation
• Major advantages
– ETL: DBMS independence for transformations
– ELT: leverages superior optimization technology in
relational DBMS engines
• Other issues
– ETL supports more complex transformation operations
– ELT requires less network bandwidth
• Hybrid tools for data integration
– A combination of both architectures is possible
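The ELT division of labor can be sketched in a few lines. This is a minimal illustration with a hypothetical staging table and column names, using SQLite as a stand-in for the relational DBMS: raw rows are extracted and loaded first, and the transformation runs as SQL inside the engine (in an ETL tool, the cast and filter would instead run in the transformation engine before loading).

```python
import sqlite3

# Minimal ELT sketch (hypothetical tables/columns): load raw rows first,
# then let the DBMS perform the transformation with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_sales (amount TEXT)")
conn.executemany("INSERT INTO staging_sales VALUES (?)",
                 [("10.5",), ("3.25",), ("n/a",)])

# Transformation runs inside the DBMS (the "T" after "E" and "L"):
# cast text to numbers and discard rows that are not numeric.
conn.execute("""
    CREATE TABLE sales AS
    SELECT CAST(amount AS REAL) AS amount
    FROM staging_sales
    WHERE amount GLOB '[0-9]*'
""")
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 13.75
```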
Data Integration Marketplace Features
• Diverse: (1) proprietary and (2) open-source
products from DBMS vendors and third-party
vendors; (3) hybrid products, e.g., open-source
base products with subscription services for
extended products and support
• Vibrant: a developing marketplace with
substantial product development and
consolidation across all of these categories
• At the end of 2014, a Gartner report estimated
the data integration market size at $2.5 billion
Architectures, Features, and
Details of Data Integration Tools
Data Integration Tools Feature Overview
Secondary features are provided by most products, but sometimes require an extended product with an additional license.
Essential features
- Workflow and component specifications: use rule-based and graphical specifications, rather than procedural coding, to indicate workflow and transformations. Some tools can generate code that can be customized for more flexibility.
- Repository
Secondary features
- Change data capture: typically uses a publish and subscribe model to control change data availability and notify subscribers about it.
IDE Overview
This screen snapshot depicts the IDE in Pentaho Data Integration.
Repository
A repository is a design database for all features of a data integration tool.
Job Management
Job management supports scheduling and monitoring of jobs for
executing data integration workflows.
Scheduling
- Setting a schedule,
- Repetition
- Conditional execution of steps
- Execution: start, stop, pause
Monitoring
- Logging
- Performance alerts and reports
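The scheduling and monitoring ideas above can be sketched with Python's standard `sched` module. This is a minimal illustration with a hypothetical job name (`refresh_warehouse`): the scheduler repeats a job at fixed intervals and records each execution, standing in for logging.

```python
import sched
import time

# Minimal job-scheduling sketch (hypothetical job): run a data-integration
# job at fixed intervals and log each execution for monitoring.
log = []

def run_job(name):
    log.append(f"{name} executed")      # monitoring: logging step

s = sched.scheduler(time.monotonic, time.sleep)
for delay in (0.0, 0.01, 0.02):         # repetition: three scheduled runs
    s.enter(delay, 1, run_job, argument=("refresh_warehouse",))
s.run()                                 # execution: start

print(log)
```

A production job manager would add conditional execution of steps, pause/stop controls, and performance alerts on top of this core loop.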
Data Profiling
Data profiling helps a data warehouse administrator and
data owners understand and approve data quality and data sources.
A data profiling tool can reduce unexpected delays and
surprises in populating and refreshing a data warehouse.
The MSSQL data profiler provides the following
functions to understand data quality:
• Descriptive statistics (e.g., min, max) about the
distribution of values in a column
• Null value ratio
• Uniqueness: the number of distinct values for a
column or combination of columns
• Column pattern: coverage for values matching
regular expressions
• Field relationships
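Several of these profile measures are simple to compute directly. This is a minimal sketch with hypothetical column values: it derives the null value ratio, uniqueness, and column-pattern coverage for a single column.

```python
import re

# Minimal data-profiling sketch (hypothetical column values): compute the
# profile measures listed above for a single column.
values = ["2024-01-05", "2024-02-17", None, "2024-02-17", "bad-date"]

non_null = [v for v in values if v is not None]
null_ratio = 1 - len(non_null) / len(values)          # null value ratio
distinct = len(set(non_null))                         # uniqueness
pattern = re.compile(r"\d{4}-\d{2}-\d{2}$")           # column pattern
coverage = sum(bool(pattern.match(v)) for v in non_null) / len(non_null)

print(round(null_ratio, 2), distinct, coverage)   # 0.2 3 0.75
```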
Change Data Capture
(Diagram: triggers on source tables write changes to log tables; change extraction processes read the logs, and publishing processes notify subscribers.)
Change data capture uses a publish and subscribe model to control change data availability.
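The publish and subscribe flow can be sketched in a few lines. This is a minimal illustration with hypothetical table and callback names: an update plays the role of a trigger appending to the change log, and the publishing step notifies every subscriber that change data is available.

```python
# Minimal change-data-capture sketch (hypothetical schema): a trigger
# appends changes to a log, and publishing notifies subscribers.
change_log = []          # stands in for the log tables
subscribers = []

def subscribe(callback):
    subscribers.append(callback)

def on_update(table, row):               # plays the role of a trigger
    change_log.append((table, row))
    for notify in subscribers:           # publishing process
        notify(table)

received = []
subscribe(lambda table: received.append(table))
on_update("sales", {"id": 1, "amount": 10.5})

print(change_log)   # [('sales', {'id': 1, 'amount': 10.5})]
print(received)     # ['sales']
```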
Talend Product Editions
• Community edition with a standard open source
license
• Enterprise editions with a subscription service for
technical support and extended features
• Several enterprise and platform editions for
Talend Data Integration product
Data Integration Features
Talend IDE
(Screen snapshot: repository pane, component palette, design pane, and job pane.)
Palette with Components
Category Prominent Components
Data quality
Databases
ELT
Processing
Internet
XML
Big data
Simple Job Design
In this snapshot, the Excel data source (SSExcelSource) contains rows to be loaded
into the SSSales table, the fact table of the Store Sales data warehouse.
Component Details
For example, the tMap
component uses a graphical
display to support join
specification. In this slide, the
fields in the Excel data source are
mapped to columns in the
SSTimeDim Oracle table. The
columns in the top part (row1)
are from the input file. The
columns in the bottom part
(row2) of the window are the
SSTimeDim table. A “drag and
drop” method is used to match
the columns in the data source
and table.
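The effect of such a column mapping can be sketched in plain code. This is a minimal illustration with hypothetical field names (`SaleDate`, `TimeDate`, and so on), not the actual tMap specification: a mapping table matches input fields ("row1") to output columns ("row2"), much like the drag-and-drop matches in the tMap display.

```python
# Minimal tMap-style column-mapping sketch (hypothetical field names):
# each input field (row1) is matched to an output column (row2).
mapping = {"SaleDate": "TimeDate", "SaleYear": "TimeYear"}  # row1 -> row2

row1 = {"SaleDate": "2024-01-05", "SaleYear": 2024, "Unused": "x"}
row2 = {out: row1[src] for src, out in mapping.items()}

print(row2)   # {'TimeDate': '2024-01-05', 'TimeYear': 2024}
```

Unmapped input fields (like `Unused` here) simply do not appear in the output row.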
Job Execution Example
Talend uses Open Core Model
• Feature-limited core product under a standard
open source license
• Commercial versions with proprietary extensions
• Paid support services
• First commercial open source vendor
• Becoming widespread
Architectures, Features, and
Details of Data Integration Tools
Pentaho Products
• Platform for data integration, business analytics,
and big data
• Open core business model
• Pentaho Data Integration
– Visual designer for transformations and jobs
• Pentaho Business Analytics
- Interactive visual analysis
- Geo-mapping, heat grids, scatter/bubble charts
- Visualization plugins
• Pentaho Big Data Analytics
Pentaho Data Integration
• Editions
– Subscription service from Pentaho website
– Community edition: Kettle (from sourceforge)
https://sourceforge.net/projects/pentaho/
• Basic concepts
– Transformation with data flow among steps and hops
– Job with data flow among transformations and external
entities
• Tools:
– Spoon: graphical design of transformations and jobs
– Pan: execution of transformations
Spoon IDE
(Screen snapshot: view pane, execution controls, and canvas.)
Example Transformations
Merge Join Step
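A merge join step combines two inputs that are already sorted on the join key, advancing through both in a single pass. This is a minimal sketch of the inner-join case with hypothetical rows, not Pentaho's implementation:

```python
# Minimal merge-join sketch (hypothetical rows): both inputs must be
# sorted on the join key, as a merge join step requires.
left = [(1, "Jan"), (2, "Feb"), (4, "Apr")]
right = [(2, 200.0), (3, 150.0), (4, 75.0)]

joined, i, j = [], 0, 0
while i < len(left) and j < len(right):
    if left[i][0] < right[j][0]:
        i += 1                # advance the side with the smaller key
    elif left[i][0] > right[j][0]:
        j += 1
    else:                     # keys match: emit an inner-join row
        joined.append((left[i][0], left[i][1], right[j][1]))
        i += 1
        j += 1

print(joined)   # [(2, 'Feb', 200.0), (4, 'Apr', 75.0)]
```

Because each input is scanned once, the join runs in linear time, which is why merge-join steps insist on sorted inputs.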