
Big Data Essentials

Removing the Skills Barrier

Evolution of business, users, value, and technology:

1960s-1970s, Mainframe (e.g., OS/360): few employees (10^2 users); back-office automation
1980s, Client-Server: many employees (10^4 users); front-office productivity
1990s, Web: customers/consumers and business ecosystems (10^6 users); e-commerce
2007, Cloud: line-of-business users (10^7); self-service
2011, Social: communities & society (10^9 users); social engagement
2014, Internet of Things: devices & machines (10^11 sources); real-time optimization

Business is connecting innovation to Big Data


Media & Entertainment: online & in-game behavior, customer cross-sell/up-sell

Financial Services: risk & portfolio analysis, investment recommendations

Retail & Telco: proactive customer engagement, location-based services

Manufacturing: connected vehicle, predictive maintenance

Healthcare & Pharma: predicting patient outcomes, total cost of care, drug discovery

Public Sector: health insurance exchanges, public safety, tax optimization, fraud detection

IT is struggling with the cost of Big Data

- Lack of big data skills and expertise
- Growing data volume is quickly consuming capacity
- Need to onboard, store, & process new types of data
- 80% of the work in big data projects is data integration and data quality

Unleash the Power of Hadoop
Informatica Developers are Now Hadoop Developers

Sources (relational, mainframe; documents and emails; social media, web logs; machine/device, cloud) are loaded, replicated, streamed, or archived into Hadoop, where they are profiled, parsed, cleansed, matched, and transformed (ETL). Results are delivered to analytics teams, the data warehouse, services, event topics, analytics & operational dashboards, mobile apps, and real-time alerts.

The Value of a Virtual Data Machine (like Vibe):
Integration flexibility: same skills, multiple deployment modes

- Skills leverage
- Future-proof investment
- Development acceleration

Develop once on the desktop; deploy to a server, data virtualization/federation, the cloud, embedded data quality in apps, a data integration hub, or Hadoop.

PowerCenter Big Data Edition
The Safe On-Ramp to Big Data

Built on the Vibe™ virtual data machine, with no-code productivity, high-speed data ingestion and extraction, complex data parsing on Hadoop, universal data access, ETL on Hadoop, entity extraction and data classification on Hadoop, big data processing, profiling on Hadoop, business-IT collaboration, and unified administration.

Big Transaction Data
- Online transaction processing (OLTP): Oracle, DB2, Ingres, Informix, Sybase, SQL Server
- Online analytical processing (OLAP) & DW appliances: Teradata, Red Brick, Essbase, Sybase IQ, Netezza, Exadata, HANA, Greenplum, DATAllegro, Aster Data, Vertica, ParAccel

Big Interaction Data
- Social media & web data: Facebook, Twitter, LinkedIn, YouTube; web applications, blogs, discussion forums, communities, partner portals
- Cloud: Salesforce.com, Concur, Google App Engine, Amazon
- Other interaction data: clickstream, image/text, scientific, genomic/pharma, medical; medical devices, sensors/meters, RFID tags, CDR/mobile

Changing the Analytics Equation:
Shift Effort from Data Preparation to Data Analysis

- Develop new products and services faster and cheaper
- Free up data scientists to focus on analysis
- Allow more available & affordable PowerCenter developers to handle data preparation

With hand coding, most of the effort goes into data preparation (parse, profile, cleanse, transform, match); with Informatica PowerCenter Big Data Edition, that time shifts to data analysis.


Implement a Lean Big Data Supply Chain

From P&L goals to business value: the analyst prioritizes goals, the data scientist generates insights and validates hypotheses, the developer makes them operational, and the business inspires action.

Big Data Supply Chain: Acquire & Store → Refine & Enrich → Explore & Curate → Distribute & Manage, running on data management & analytic systems.

Prioritize Your Business Goals
Sales and Marketing Example

- What is the optimal marketing campaign mix?
- How can I engage with our key influencers?
- How can I retain my most profitable customers?
- What are the best pricing models to increase sales?

Connect Business Goals to Data

Data: customer orders, social data, web logs, market data
Information: customers likely to churn, next best offers, optimal channels, optimal pricing models
Value: increase customer loyalty, build sustainable relationships, increase marketing ROI, increase market share
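
A "customers likely to churn" signal, for instance, falls out of joining order history with web logs. A minimal HiveQL sketch (the orders and web_sessions tables and their columns are illustrative, not from the slide):

-- Customers with no order in 90 days; recent site visits are counted
-- as a secondary churn indicator. Table/column names are hypothetical.
SELECT o.customer_id,
       MAX(o.order_date) AS last_order_date,
       SUM(CASE WHEN w.session_date >= DATE_SUB(CURRENT_DATE, 30)
                THEN 1 ELSE 0 END) AS web_visits_last_30d
FROM orders o
LEFT JOIN web_sessions w ON w.customer_id = o.customer_id
GROUP BY o.customer_id
HAVING MAX(o.order_date) < DATE_SUB(CURRENT_DATE, 90);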

Agile Analytics

- Data sources: transactions, OLTP, OLAP; documents and emails; social media, web logs; machine/device, scientific
- Data ingestion: batch load, replication, change data capture, data streaming, event-based processing, archiving
- Data management: data integration & data quality, data governance, data security, MDM/PIM, data warehouse, virtual data machine
- Data delivery: data integration hub, data virtualization
- Applications & visualization: advanced analytics, machine learning, analytics & operational dashboards, mobile apps, real-time alerts

Flexible architecture to support rapid changes

The Challenge: data volumes growing 3-5x over the next 2-3 years.

The Solution: a flexible data integration architecture (traditional grid, mainframe, RDBMS, and unstructured data feeding the EDW and data warehouses through data virtualization to business reports) to support changing business requirements in a heterogeneous data management environment.

The Result: managed data integration and load of 10+ billion records from multiple disparate data sources.

Large Government Agency

Minimize risk and grow digital business

The Challenge: grow digital business to 30% of revenue ($1.8B) and reduce fraud.

The Solution: PowerCenter Big Data Edition (profile, parse, ETL) integrating relational sources (SQL Server, Oracle, DB2, AS/400, mainframe), surveys & Net Promoter Scores (NPS), social media, web logs, JSON, and XML, plus machine and forensic data via Splunk, delivered to Netezza, SQL Server, Oracle, and SAS for BI/analytics, visualization & reporting.

The Result:
- A comprehensive data integration platform integrating large volumes of data from more than 18 systems
- Ability to use existing skill sets & make them more productive
- Lowest risk as industry leader

Large Global Financial Services and Communications Company

Reduce Costs & Increase Revenue
Consolidate Data on Hadoop & Provide a 360° View of the Customer

The Challenge: data increasing 20x every year, with costs rising from $17K per day to $50K per day within 6 months, and time to deliver information taking too long.

The Solution: PowerCenter Big Data Edition with B2B Data Exchange and data validation, consolidating transactions from 70 data centers, in-store POS data, and 172 TB of data from gaming consoles, TVs, tablets, readers, & clickstreams from 5,000 web sites onto Hadoop, alongside the traditional grid and data warehouse feeding business reports.

The Expected Result:
- A 360° view of customer behavior, increasing cross-sell & up-sell revenue
- Data storage costs reduced from $50K per day to $500 per day
- Time to deliver information to the business reduced from 48 hours to 15 minutes

Large Global Media & Entertainment Company

Maximize Your Return on Big Data
Hadoop complements your existing infrastructure

Data assets from operational systems (transactions, OLTP, OLAP; documents, email; social media, web logs; machine/device, scientific, & other NoSQL) flow through Hadoop into analytical systems and data products: data warehouse, data marts, ODS, and MDM.

Hadoop pipeline: Access & Ingest → Parse & Prepare → Discover & Profile → Transform & Cleanse → Extract & Deliver, managed end to end (security, performance, governance, collaboration).

Unleash the Power of Big Data
With high-performance Universal Data Access

- Messaging and web services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
- Relational and flat files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC, flat files, ASCII reports, HTML, RPG, ANSI, LDAP
- Mainframe and midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats
- Unstructured data and files: Word, Excel, PDF, StarOffice, WordPerfect, email (POP, IMAP), HTTP
- MPP appliances: Pivotal, Vertica, Netezza, Teradata, Aster
- Packaged applications: JD Edwards, SAP NetWeaver, SAP NetWeaver BI, Lotus Notes, Oracle E-Business, SAS, PeopleSoft, Siebel
- SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
- Industry standards: EDI X12, EDIFACT, RosettaNet, HL7, HIPAA, AST, FIX, SWIFT, Cargo IMP, MVR
- XML standards: XML, LegalXML, IFX, cXML, ebXML, HL7 v3.0, ACORD (AL3, XML)
- Social media: Facebook, Twitter, LinkedIn, Kapow, Datasift

Vibe Data Stream for Machine Data

Sources (handhelds, smart meters, discrete data messages, Internet of Things sensor data, web servers, operations monitors, rsyslog, SLF4J, etc.) publish through VDS nodes onto the Ultra Messaging bus (publish/subscribe, coordinated via ZooKeeper). Subscribing VDS nodes deliver to targets: Hadoop HDFS and HBase, real-time analysis and complex event processing, and NoSQL databases such as Cassandra, Riak, and MongoDB.

Leverage a high-performance messaging infrastructure: publish with Ultra Messaging for global distribution without additional staging or landing. Management and monitoring are built in.

Parse and Prepare Data on Hadoop


The broadest coverage for Big Data
Parse and prepare any data (XML, interaction data, flat files & documents such as Name=Value pairs and delimited records, industry standards, device/sensor and scientific data) for any DI/BI architecture (Pig, EDW, MDM), with a visual parsing environment and predefined translations for productivity.

How the Data Transformation (DT) engine works:

- Invocation is via a shared library: the DT engine runs fully within the process of the calling application rather than as an external engine, removing the overhead of passing data between processes or across the network. The engine is dynamically invoked and does not need to be started up or maintained externally.
- The engine is thread-safe and re-entrant, so the calling application can invoke DT on multiple threads to increase throughput; a good example is support for PowerCenter partitioning to scale up processing.
- DT can be invoked in two general ways: (1) filenames can be passed to it, and DT will directly open the file(s) for processing; or (2) the calling application can buffer the data and send the buffers to DT for processing. On the output side, DT can likewise write directly to the filesystem or write back to in-memory buffers returned to the calling application. Though not shown in the slide's diagram, the engine fully supports multiple input and output files or buffers as needed by the transformation.
- Deployment: (1) the developer uses Studio to develop a transformation; (2) the developer deploys this service to a local repository (a directory), where the transformation logic is completely independent of any calling application; (3) to deploy the transformation service to the server, the folder is moved to the server via FTP, copy, script, etc. (if the server file system is mountable directly from the developer machine, step 2 deploys straight to the server); (4) the DT engine can immediately use the service to process data.
- The engine is fully embeddable and can be invoked from any of the APIs (Java, C++, C, .NET, web services); for simple integration, a command-line interface is available. Internal custom applications embed transformation services using these APIs. PowerCenter leverages DT via the Unstructured Data Transformation (UDT), a GUI widget in PowerCenter that wraps the DT API and engine. DT can also be embedded in other middleware: for some (WBIMB, webMethods, BizTalk) INFA provides similar GUI widgets (agents) for the respective design environments; for others, the API layer can be used directly.
- The payoff: develop a transformation once and leverage it in multiple environments simultaneously, reducing development and maintenance time and lowering the impact of change.
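
On Hadoop, this parse-and-prepare step can be pictured with Hive's built-in streaming TRANSFORM clause, which pipes rows through an external parser much as the UDT wraps the DT engine inside PowerCenter. A minimal sketch; parse_logs.py and the raw_logs table are hypothetical stand-ins, not Informatica components:

-- parse_logs.py (hypothetical) reads raw lines on stdin and emits
-- tab-separated (ts, user_id, url) records on stdout.
ADD FILE parse_logs.py;

SELECT TRANSFORM (line)
       USING 'python parse_logs.py'
       AS (ts STRING, user_id STRING, url STRING)
FROM raw_logs;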

PowerCenter developers are now Hadoop developers

No-code visual
development
environment

Preview results at
any point in the
data flow

Data Integration & Quality on Hadoop

1. The entire Informatica mapping is translated to Hive Query Language.
2. The optimized HQL is converted to MapReduce and submitted to the Hadoop cluster (job tracker).
3. Advanced mapping transformations are executed on Hadoop through User-Defined Functions using Vibe.
FROM (
  SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME,
         customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
  FROM (
    SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
    FROM lineitem
    GROUP BY L_ORDERKEY
  ) T1
  JOIN orders ON (T1.ORDERKEY1 = orders.O_ORDERKEY)
  JOIN customer ON (orders.O_CUSTKEY = customer.C_CUSTKEY)
  JOIN nation ON (customer.C_NATIONKEY = nation.N_NATIONKEY)
  WHERE nation.N_NAME = 'UNITED STATES'
) T2
INSERT OVERWRITE TABLE TARGET1 SELECT *
INSERT OVERWRITE TABLE TARGET2 SELECT CUSTKEY,
  count(ORDERKEY2) GROUP BY CUSTKEY;

Hive-QL

MapReduce
UDF

Accelerate Development
Reuse and Import PowerCenter Metadata

Import and validate existing PowerCenter mappings before running them on Hadoop.

Natural Language Processing
Entity Extraction & Data Classification

Train NLP to find and classify entities in unstructured data.

Hadoop Data Profiling Results

- Value and pattern frequency to isolate inconsistent/dirty data or unexpected patterns (e.g., CUSTOMER_ID and COUNTRY_CODE)
- Hadoop data profiling results exposed to anyone in the enterprise via a browser

1. Profiling stats (min/max values, NULLs, inferred data types, etc.) to identify outliers and anomalies in the data
2. Value & pattern analysis of Hadoop data
3. Drilldown analysis into Hadoop data: drill down into actual data values to inspect results across the entire data set, including potential duplicates
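
The step 1 statistics are the kind of aggregates that could equally be written by hand in HiveQL; the point of the tool is that it generates and scales them for you. A sketch for a single column, assuming a hypothetical customers table:

-- Basic column profile: row count, NULLs, cardinality, min/max.
SELECT COUNT(*) AS row_cnt,
       SUM(CASE WHEN country_code IS NULL THEN 1 ELSE 0 END) AS null_cnt,
       COUNT(DISTINCT country_code) AS distinct_cnt,
       MIN(country_code) AS min_val,
       MAX(country_code) AS max_val
FROM customers;

-- Value frequency, highest first, to surface dirty or unexpected values.
SELECT country_code, COUNT(*) AS freq
FROM customers
GROUP BY country_code
ORDER BY freq DESC;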

Hadoop Data Domain Discovery
Finding the functional meaning of data in Hadoop

Leverage INFA rules/mapplets to identify the functional meaning of Hadoop data:
- Sensitive data (e.g., SSN, credit card number, etc.)
- PHI: Protected Health Information
- PII: Personally Identifiable Information

Scalable to look for and discover ANY domain type. View/share a report of the data domains and sensitive data contained in Hadoop, with the ability to drill down to see suspect data values.
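
At its simplest, a domain rule is a pattern match over candidate columns. A HiveQL sketch of SSN-style discovery against a hypothetical raw_records table (Informatica's rules/mapplets are considerably richer than a single regex):

-- Fraction of values matching a US SSN shape (ddd-dd-dddd).
SELECT COUNT(*) AS total_rows,
       SUM(CASE WHEN col1 RLIKE '^[0-9]{3}-[0-9]{2}-[0-9]{4}$'
                THEN 1 ELSE 0 END) AS ssn_like_rows
FROM raw_records;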

Single Data Privacy Solution for Selective Visibility
From the Apps to the DW to Hadoop

1. Load data sources via PowerCenter Big Data Edition
2. Look up the data security policy
3. Encrypt sensitive data
4. A data scientist accesses the Hadoop data store through the tool of choice
5. The DDM proxy intercepts HQL or MapReduce jobs and looks up the security policy
6. A function is dynamically injected to unencrypt and mask sensitive data
7. The result set is masked or cleared

The DDM proxy and its data security transformation sit between the BI & analytic layer (query, reporting, data mining, predictive analytics) and the Hadoop execution layer (MapReduce over the Hadoop data store), consulting the security policy and executing masking as a UDF.
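
The function injected in step 6 behaves like an ordinary Hive UDF wrapped around the sensitive column. A conceptual sketch; unmask_and_mask and customer_pii are hypothetical names, not Informatica's actual function or schema:

-- The DDM proxy rewrites the query so the sensitive column passes
-- through a masking UDF (hypothetical) before results are returned.
SELECT customer_name,
       unmask_and_mask(ssn) AS ssn   -- decrypt, then mask per policy
FROM customer_pii;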

Dynamic Data Masking
In-line Proxy Server Delivers a Seamless Security Layer for Hive and Hadoop

Role-based anonymization and real-time prevention:
- Business user application screens see the stored values: BLAKE, JONES, KING
- Application screens and tools used by production support, DBAs, or an outsourced/unauthorized workforce see masked values: BL****, JO****, KI****

The Dynamic Data Masking layer applies real-time HQL rewrites, for example SELECT SUBSTRING(name, 1, 2) || '***' FROM table1, to mask the returned result set while the private information stored in Hadoop remains unchanged.
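
The rewrite in context: the same user query returns different shapes depending on role. A sketch assuming the slide's table1 with a name column, using Hive's standard CONCAT for string concatenation:

-- As issued by the user:
SELECT name FROM table1;

-- As rewritten by the DDM proxy for an unauthorized role; only the
-- first two characters survive, the remainder is masked.
SELECT CONCAT(SUBSTRING(name, 1, 2), '***') AS name FROM table1;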

Archive to Hadoop
Compression Extends Hadoop Cluster Capacity

Without INFA optimized archive compression: 10 TB replicated 3x = 30 TB.
With INFA optimized archive (95% compression): 10 TB compressed 95% = 500 GB; replicated 3x = 1.5 TB.

- 20x less I/O bandwidth required
- 20 min vs. 1 min response time
- 8 hours vs. 24 min backup window
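
The capacity arithmetic checks out: 30 TB / 1.5 TB matches the claimed 20x I/O reduction. Written out as throwaway HiveQL expressions (assuming a Hive version that permits SELECT without a FROM clause):

-- Without compression: 10 TB x 3-way HDFS replication = 30 TB on disk.
-- With 95% compression: 10 TB x 0.05 = 0.5 TB; x 3 replication = 1.5 TB.
SELECT 10 * 3        AS raw_tb_on_disk,         -- 30
       10 * 0.05     AS compressed_tb,          -- 0.5 (500 GB)
       10 * 0.05 * 3 AS compressed_tb_on_disk;  -- 1.5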

Unified Administration
Single Place to Manage & Monitor

- Full traceability from workflow to MapReduce jobs
- View generated Hive scripts

Learn more at http://bit.ly/powercenterbde
