
Big Data Essentials

Removing the Skills Barrier

Evolution of business, users, value, and technology:

1960s-1970s, Mainframe (e.g., OS/360): few employees (10^2 users); back-office automation
1980s, Client-Server: many employees (10^4 users); front-office productivity
1990s, Web: customers/consumers and business ecosystems (10^6 users); e-commerce
2007, Cloud: line-of-business users (10^7); self-service
2011, Social: communities & society (10^9 users); social engagement
2014, Internet of Things: devices & machines (10^11 sources); real-time optimization

Business is connecting innovation to Big Data


Media & Entertainment: online & in-game behavior, customer cross-sell/up-sell

Financial Services: risk & portfolio analysis, investment recommendations

Retail & Telco: proactive customer engagement, location-based services

Manufacturing: connected vehicle, predictive maintenance

Healthcare & Pharma: predicting patient outcomes, total cost of care, drug discovery

Public Sector: health insurance exchanges, public safety, tax optimization, fraud detection

IT is struggling with the cost of Big Data

- Lack of big data skills and expertise
- Growing data volume is quickly consuming capacity
- Need to onboard, store, & process new types of data
- 80% of the work in big data projects is data integration and data quality

Unleash the Power of Hadoop
Informatica Developers are Now Hadoop Developers

Sources (relational, mainframe; documents and emails; social media, web logs; machine/device, cloud) are loaded, replicated, streamed, or archived into Hadoop, where they are profiled, parsed, cleansed, matched, and transformed (ETL). Results are delivered to analytics teams, the data warehouse, services, event topics, analytics & operational dashboards, mobile apps, and real-time alerts.

The Value of a Virtual Data Machine (like Vibe):
Integration flexibility: same skills, multiple deployment modes

- Skills leverage
- Future-proof investment
- Development acceleration

Develop once on the desktop; deploy to a server, data virtualization/federation, the cloud, embedded data quality in apps, a data integration hub, or Hadoop.

PowerCenter Big Data Edition
The Safe On-Ramp to Big Data

Built on the Vibe™ virtual data machine, with no-code productivity, high-speed data ingestion and extraction, complex data parsing on Hadoop, universal data access, ETL on Hadoop, entity extraction and data classification on Hadoop, big data processing, profiling on Hadoop, business-IT collaboration, and unified administration.

Big Transaction Data
- Online transaction processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server
- Online analytical processing (OLAP) & DW appliances: Teradata, Red Brick, Essbase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel

Big Interaction Data
- Social media & web data: Facebook, Twitter, LinkedIn, YouTube; web applications, blogs, discussion forums, communities, partner portals
- Cloud: Salesforce.com, Concur, Google App Engine, Amazon
- Other interaction data: clickstream, image/text, scientific, genomic/pharma, medical; medical devices, sensors/meters, RFID tags, CDR/mobile

Changing the Analytics Equation:
Shift Effort from Data Preparation to Data Analysis

- Develop new products and services faster and cheaper
- Free up data scientists to focus on analysis
- Allow more available & affordable PowerCenter developers to handle data preparation

With hand coding, most of the effort goes into data preparation (parse, profile, cleanse, transform, match); with Informatica PowerCenter Big Data Edition, that time shifts to data analysis.


Implement a Lean Big Data Supply Chain

From P&L goals to business value: the analyst prioritizes goals, the data scientist generates insights and validates hypotheses, the developer makes them operational, and the business inspires action.

Big Data Supply Chain: Acquire & Store → Refine & Enrich → Explore & Curate → Distribute & Manage, running on data management & analytic systems.

Prioritize Your Business Goals
Sales and Marketing Example

- What is the optimal marketing campaign mix?
- How can I engage with our key influencers?
- How can I retain my most profitable customers?
- What are the best pricing models to increase sales?

Connect Business Goals to Data

Data: customer orders, social data, web logs, market data
Information: customers likely to churn, next best offers, optimal channels, optimal pricing models
Value: increase customer loyalty, build sustainable relationships, increase marketing ROI, increase market share
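
A "customers likely to churn" signal, for instance, falls out of joining order history with web logs. A minimal HiveQL sketch (the orders and web_sessions tables and their columns are illustrative, not from the slide):

-- Customers with no order in 90 days; recent site visits are counted
-- as a secondary churn indicator. Table/column names are hypothetical.
SELECT o.customer_id,
       MAX(o.order_date) AS last_order_date,
       SUM(CASE WHEN w.session_date >= DATE_SUB(CURRENT_DATE, 30)
                THEN 1 ELSE 0 END) AS web_visits_last_30d
FROM orders o
LEFT JOIN web_sessions w ON w.customer_id = o.customer_id
GROUP BY o.customer_id
HAVING MAX(o.order_date) < DATE_SUB(CURRENT_DATE, 90);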

Agile Analytics

- Data sources: transactions, OLTP, OLAP; documents and emails; social media, web logs; machine/device, scientific
- Data ingestion: batch load, replication, change data capture, data streaming, event-based processing, archiving
- Data management: data integration & data quality, data governance, data security, MDM/PIM, data warehouse, virtual data machine
- Data delivery: data integration hub, data virtualization
- Applications & visualization: advanced analytics, machine learning, analytics & operational dashboards, mobile apps, real-time alerts

Flexible architecture to support rapid changes

The Challenge: data volumes growing 3-5x over the next 2-3 years.

The Solution: a flexible data integration architecture (traditional grid, mainframe, RDBMS, and unstructured data feeding the EDW and data warehouses through data virtualization to business reports) to support changing business requirements in a heterogeneous data management environment.

The Result: managed data integration and load of 10+ billion records from multiple disparate data sources.

Large Government Agency

Minimize risk and grow digital business

The Challenge: grow digital business to 30% of revenue ($1.8B) and reduce fraud.

The Solution: PowerCenter Big Data Edition (profile, parse, ETL) integrating relational sources (SQL Server, Oracle, DB2, AS/400, mainframe), surveys & Net Promoter Scores (NPS), social media, web logs, JSON, and XML, plus machine and forensic data via Splunk, delivered to Netezza, SQL Server, Oracle, and SAS for BI/analytics, visualization & reporting.

The Result:
- A comprehensive data integration platform integrating large volumes of data from more than 18 systems
- Ability to use existing skill sets & make them more productive
- Lowest risk as industry leader

Large Global Financial Services and Communications Company

Reduce Costs & Increase Revenue
Consolidate Data on Hadoop & Provide a 360° View of the Customer

The Challenge: data increasing 20x every year, with costs rising from $17K per day to $50K per day within 6 months, and time to deliver information taking too long.

The Solution: PowerCenter Big Data Edition with B2B Data Exchange and data validation, consolidating transactions from 70 data centers, in-store POS data, and 172 TB of data from gaming consoles, TVs, tablets, readers, & clickstreams from 5,000 web sites onto Hadoop, alongside the traditional grid and data warehouse feeding business reports.

The Expected Result:
- A 360° view of customer behavior, increasing cross-sell & up-sell revenue
- Data storage costs reduced from $50K per day to $500 per day
- Time to deliver information to the business reduced from 48 hours to 15 minutes

Large Global Media & Entertainment Company

Maximize Your Return on Big Data
Hadoop complements your existing infrastructure

Data assets from operational systems (transactions, OLTP, OLAP; documents, email; social media, web logs; machine/device, scientific, & other NoSQL) flow through Hadoop into analytical systems and data products: data warehouse, data marts, ODS, and MDM.

Hadoop pipeline: Access & Ingest → Parse & Prepare → Discover & Profile → Transform & Cleanse → Extract & Deliver, managed end to end (security, performance, governance, collaboration).

Unleash the Power of Big Data
With high-performance Universal Data Access

- Messaging and web services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
- Relational and flat files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC, flat files, ASCII reports, HTML, RPG, ANSI, LDAP
- Mainframe and midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats
- Unstructured data and files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP
- MPP appliances: Pivotal, Vertica, Netezza, Teradata, Aster
- Packaged applications: JD Edwards, SAP NetWeaver, SAP NetWeaver BI, Lotus Notes, Oracle E-Business, SAS, PeopleSoft, Siebel
- SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
- Industry standards: EDI X12, EDIFACT, RosettaNet, HL7, HIPAA, AST, FIX, SWIFT, Cargo IMP, MVR
- XML standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
- Social media: Facebook, Twitter, LinkedIn, Kapow, Datasift

Vibe Data Stream for Machine Data

Sources (handhelds, smart meters, discrete data messages, Internet of Things sensor data, web servers, operations monitors, rsyslog, SLF4J, etc.) publish through VDS nodes onto the Ultra Messaging bus (publish/subscribe, coordinated via ZooKeeper). Subscribing VDS nodes deliver to targets: Hadoop HDFS and HBase, real-time analysis and complex event processing, and NoSQL databases such as Cassandra, Riak, and MongoDB.

Leverage a high-performance messaging infrastructure: publish with Ultra Messaging for global distribution without additional staging or landing. Management and monitoring are built in.

Parse and Prepare Data on Hadoop


The broadest coverage for Big Data
Parse and prepare any data (XML, interaction data, flat files & documents such as Name=Value pairs and delimited records, industry standards, device/sensor and scientific data) for any DI/BI architecture (Pig, EDW, MDM), with a visual parsing environment and predefined translations for productivity.

How the Data Transformation (DT) engine works:

- Invocation is via a shared library: the DT engine runs fully within the process of the calling application rather than as an external engine, removing the overhead of passing data between processes or across the network. The engine is dynamically invoked and does not need to be started up or maintained externally.
- The engine is thread-safe and re-entrant, so the calling application can invoke DT on multiple threads to increase throughput; a good example is support for PowerCenter partitioning to scale up processing.
- DT can be invoked in two general ways: (1) filenames can be passed to it, and DT will directly open the file(s) for processing; or (2) the calling application can buffer the data and send the buffers to DT for processing. On the output side, DT can likewise write directly to the filesystem or write back to in-memory buffers returned to the calling application. Though not shown in the slide's diagram, the engine fully supports multiple input and output files or buffers as needed by the transformation.
- Deployment: (1) the developer uses Studio to develop a transformation; (2) the developer deploys this service to a local repository (a directory), where the transformation logic is completely independent of any calling application; (3) to deploy the transformation service to the server, the folder is moved to the server via FTP, copy, script, etc. (if the server file system is mountable directly from the developer machine, step 2 deploys straight to the server); (4) the DT engine can immediately use the service to process data.
- The engine is fully embeddable and can be invoked from any of the APIs (Java, C++, C, .NET, web services); for simple integration, a command-line interface is available. Internal custom applications embed transformation services using these APIs. PowerCenter leverages DT via the Unstructured Data Transformation (UDT), a GUI widget in PowerCenter that wraps the DT API and engine. DT can also be embedded in other middleware: for some (WBIMB, webMethods, BizTalk) INFA provides similar GUI widgets (agents) for the respective design environments; for others, the API layer can be used directly.
- The payoff: develop a transformation once and leverage it in multiple environments simultaneously, reducing development and maintenance time and lowering the impact of change.
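
On Hadoop, this parse-and-prepare step can be pictured with Hive's built-in streaming TRANSFORM clause, which pipes rows through an external parser much as the UDT wraps the DT engine inside PowerCenter. A minimal sketch; parse_logs.py and the raw_logs table are hypothetical stand-ins, not Informatica components:

-- parse_logs.py (hypothetical) reads raw lines on stdin and emits
-- tab-separated (ts, user_id, url) records on stdout.
ADD FILE parse_logs.py;

SELECT TRANSFORM (line)
       USING 'python parse_logs.py'
       AS (ts STRING, user_id STRING, url STRING)
FROM raw_logs;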

PowerCenter developers are now Hadoop developers

No-code visual
development
environment

Preview results at
any point in the
data flow

Data Integration & Quality on Hadoop

1. The entire Informatica mapping is translated to Hive Query Language.
2. The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
3. Advanced mapping transformations are executed on Hadoop through User-Defined Functions using Vibe.
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME,
         customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY,
  count(ORDERKEY2) GROUP BY CUSTKEY;

Hive-QL

MapReduce
UDF

Accelerate Development
Reuse and Import PowerCenter Metadata

Import and validate existing PowerCenter mappings before running them on Hadoop.

Natural Language Processing
Entity Extraction & Data Classification

Train NLP to find and classify entities in unstructured data.

Hadoop Data Profiling Results

- Value and pattern frequency to isolate inconsistent/dirty data or unexpected patterns (e.g., CUSTOMER_ID and COUNTRY_CODE)
- Hadoop data profiling results exposed to anyone in the enterprise via a browser

1. Profiling stats (min/max values, NULLs, inferred data types, etc.) to identify outliers and anomalies in the data
2. Value & pattern analysis of Hadoop data
3. Drilldown analysis into Hadoop data: drill down into actual data values to inspect results across the entire data set, including potential duplicates
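
The step 1 statistics are the kind of aggregates that could equally be written by hand in HiveQL; the point of the tool is that it generates and scales them for you. A sketch for a single column, assuming a hypothetical customers table:

-- Basic column profile: row count, NULLs, cardinality, min/max.
SELECT COUNT(*) AS row_cnt,
       SUM(CASE WHEN country_code IS NULL THEN 1 ELSE 0 END) AS null_cnt,
       COUNT(DISTINCT country_code) AS distinct_cnt,
       MIN(country_code) AS min_val,
       MAX(country_code) AS max_val
FROM customers;

-- Value frequency, highest first, to surface dirty or unexpected values.
SELECT country_code, COUNT(*) AS freq
FROM customers
GROUP BY country_code
ORDER BY freq DESC;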

Hadoop Data Domain Discovery
Finding the functional meaning of data in Hadoop

Leverage INFA rules/mapplets to identify the functional meaning of Hadoop data:
- Sensitive data (e.g., SSN, credit card number, etc.)
- PHI: Protected Health Information
- PII: Personally Identifiable Information

Scalable to look for and discover ANY domain type. View/share a report of the data domains and sensitive data contained in Hadoop, with the ability to drill down to see suspect data values.
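
At its simplest, a domain rule is a pattern match over candidate columns. A HiveQL sketch of SSN-style discovery against a hypothetical raw_records table (Informatica's rules/mapplets are considerably richer than a single regex):

-- Fraction of values matching a US SSN shape (ddd-dd-dddd).
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN col1 RLIKE '^[0-9]{3}-[0-9]{2}-[0-9]{4}$'
                THEN 1 ELSE 0 END) AS ssn_like_rows
FROM raw_records;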

Single Data Privacy Solution for Selective Visibility
From the Apps to the DW to Hadoop

1. Load data sources via PowerCenter Big Data Edition
2. Look up the data security policy
3. Encrypt sensitive data
4. A data scientist accesses the Hadoop data store through the tool of choice
5. The DDM proxy intercepts HQL or MapReduce jobs and looks up the security policy
6. A function is dynamically injected to unencrypt and mask sensitive data
7. The result set is masked or cleared

The DDM proxy and its data security transformation sit between the BI & analytic layer (query, reporting, data mining, predictive analytics) and the Hadoop execution layer (MapReduce over the Hadoop data store), consulting the security policy and executing masking as a UDF.
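
The function injected in step 6 behaves like an ordinary Hive UDF wrapped around the sensitive column. A conceptual sketch; unmask_and_mask and customer_pii are hypothetical names, not Informatica's actual function or schema:

-- The DDM proxy rewrites the query so the sensitive column passes
-- through a masking UDF (hypothetical) before results are returned.
SELECT customer_name,
       unmask_and_mask(ssn) AS ssn   -- decrypt, then mask per policy
FROM customer_pii;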

Dynamic Data Masking
In-line Proxy Server Delivers a Seamless Security Layer for Hive and Hadoop

Role-based anonymization and real-time prevention:
- Business user application screens see the stored values: BLAKE, JONES, KING
- Application screens and tools used by production support, DBAs, or an outsourced/unauthorized workforce see masked values: BL****, JO****, KI****

The Dynamic Data Masking layer applies real-time HQL rewrites, for example SELECT SUBSTRING(name, 1, 2) || '***' FROM table1, to mask the returned result set while the private information stored in Hadoop remains unchanged.
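
The rewrite in context: the same user query returns different shapes depending on role. A sketch assuming the slide's table1 with a name column, using Hive's standard CONCAT for string concatenation:

-- As issued by the user:
SELECT name FROM table1;

-- As rewritten by the DDM proxy for an unauthorized role; only the
-- first two characters survive, the remainder is masked.
SELECT CONCAT(SUBSTRING(name, 1, 2), '***') AS name FROM table1;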

Archive to Hadoop
Compression Extends Hadoop Cluster Capacity

Without INFA optimized archive compression: 10 TB replicated 3x = 30 TB.
With INFA optimized archive (95% compression): 10 TB compressed 95% = 500 GB; replicated 3x = 1.5 TB.

- 20x less I/O bandwidth required
- 20 min vs. 1 min response time
- 8 hours vs. 24 min backup window
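
The capacity arithmetic checks out: 30 TB / 1.5 TB matches the claimed 20x I/O reduction. Written out as throwaway HiveQL expressions (assuming a Hive version that permits SELECT without a FROM clause):

-- Without compression: 10 TB x 3-way HDFS replication = 30 TB on disk.
-- With 95% compression: 10 TB x 0.05 = 0.5 TB; x 3 replication = 1.5 TB.
SELECT 10 * 3        AS raw_tb_on_disk,         -- 30
       10 * 0.05     AS compressed_tb,          -- 0.5 (500 GB)
       10 * 0.05 * 3 AS compressed_tb_on_disk;  -- 1.5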

Unified Administration
Single Place to Manage & Monitor

- Full traceability from workflow to MapReduce jobs
- View generated Hive scripts

Learn more at http://bit.ly/powercenterbde
