

Cloudera / Informatica White Paper

Lower Costs, Increase Productivity, and Accelerate Value

with Enterprise-Ready Hadoop
This document contains Confidential, Proprietary and Trade Secret Information (“Confidential
Information”) of Informatica Corporation and may not be copied, distributed, duplicated, or otherwise
reproduced in any manner without the prior written consent of Informatica.

While every attempt has been made to ensure that the information in this document is accurate and
complete, some typographical errors or technical inaccuracies may exist. Informatica does not accept
responsibility for any kind of loss resulting from the use of information contained in this document. The
information contained in this document is subject to change without notice.

The incorporation of the product attributes discussed in these materials into any release or upgrade of
any Informatica software product—as well as the timing of any such release or upgrade—is at the sole
discretion of Informatica.

Protected by one or more of the following U.S. Patents: 6,032,158; 5,794,246; 6,014,670; 6,339,775;
6,044,374; 6,208,990; 6,850,947; 6,895,471; or by the following pending U.S. Patents: 09/644,280;
10/966,046; 10/727,700.

This edition published March 2013


Table of Contents

Introduction
Hadoop’s Role in the Big Data Challenge
Cloudera: The Leading Hadoop Distribution
Informatica: Discover Insights and Innovate Faster on Hadoop
Data Warehouse and ETL Optimization with Cloudera and Informatica
eHarmony Embraces Big Data with Cloudera and Informatica
The Cloudera/Informatica Advantage
Summary

Introduction

Organizations increasingly recognize the potential of big data to transform their business—improving customer
retention and acquisition, increasing operational efficiencies, enabling better products and service delivery, and
generating new business insights.

Cost-effectively harnessing terabytes or petabytes of big data requires a new approach that extends current
technologies. The limitations of traditional data infrastructures render them unsuitable for the extreme scale
of big data processing and storage. The open source Hadoop framework and advanced data integration
technology are critical components in a growing number of big data initiatives, enabling organizations to
process and store data in Hadoop at dramatically lower costs.

This white paper outlines how organizations can realize big data’s promise by combining Cloudera Enterprise,
an open-source Hadoop distribution and associated tools and services, and the Informatica® Platform. The
Informatica Platform can access all types of data, move up to terabytes per hour into Hadoop, parse, cleanse
and transform data on Hadoop, and deliver insights from Hadoop at any latency across the enterprise.

Over several years, Cloudera and Informatica have collaborated at the engineering level to optimize
interoperability between their products. As the respective leaders in Hadoop products and services and in
enterprise data integration, Cloudera and Informatica can together equip your organization with proven
technology and services expertise to maximize your return on big data.

Hadoop’s Role in the Big Data Challenge
Growth in data volumes, variety, and velocity is hitting the limits of existing information management
infrastructures, forcing companies to invest in more hardware and costly upgrades of databases and
data warehouses.

In many cases, adding traditional data infrastructure is impractical because of high costs, scalability limitations
when dealing with hundreds of terabytes, and incompatibility of relational systems with unstructured big
data. Organizations are implementing innovative approaches to handling growth in both big transaction data
(data warehouses, ERP applications, and OLTP systems) and big interaction data (from social media, web
clickstreams, call detail records [CDRs], sensors and devices, and more). Beyond handling growth, they seek a
solution capable of integrating traditional structured, multistructured, and unstructured data to gain insights not
otherwise possible.

Enter Hadoop. Cloudera chief architect Doug Cutting founded the Apache Hadoop project to address the
inability of traditional systems to handle the explosion of data on the Web. It enables distributed, fault-tolerant,
parallel storage, processing, and analysis of huge amounts of multistructured data across highly available
clusters of inexpensive industry-standard servers. Hadoop is ideally suited for complex data analytics and
large-scale data storage and processing, often at 10 to 100 times less cost than traditional systems. Given its unique
strengths, many organizations are offloading between 20 percent and 50 percent of processing and storage to
Hadoop systems.


Cloudera: The Leading Hadoop Distribution
With customers including eBay, Samsung, Chevron, Nokia, and JP Morgan Chase & Co., Cloudera supplies
the industry’s leading Hadoop distribution as well as a comprehensive set of tools and services to effectively
operate Hadoop as a critical part of a technology infrastructure. Its Cloudera Enterprise offering includes:

• CDH: Cloudera’s 100 percent open source platform based on Apache Hadoop delivers the core elements
of Hadoop—scalable storage and distributed computing—plus capabilities for security, high availability, fault
tolerance, load balancing, compression, and integration with software and hardware solutions from partners
such as Informatica.

The CDH distribution is strengthened by a bundle of more than a dozen open source projects—including a
nonrelational database, workflow orchestration, cloud integration, and machine learning libraries—to help
maximize the performance and value of a Hadoop deployment.

• Cloudera Impala: As the industry’s first native real-time SQL query engine for Apache Hadoop, Impala is
the newest component of CDH. Impala completely changes the way organizations can benefit from
Hadoop, including:

•• Data processing workload acceleration, with data pipelines that complete in seconds instead of minutes
or hours, to meet tighter service-level agreement (SLA) specifications.

•• Interactive business intelligence with popular tools. This opens up real-time access to big data to
every analyst in the organization, without requiring any special Hadoop training, significantly lowering
the adoption risk of a big data project and accelerating return on investment (ROI).

•• Reduced overall cost of data management. Instead of replicating large amounts of data to a relational
database to get interactive SQL performance, Cloudera customers can obtain the same experience
without added cost or complexity.

• Cloudera Manager: Cloudera’s Hadoop management platform supplies a central point for administration
across a CDH cluster. The application automates installation to reduce deployment time from weeks to
minutes, provides a cluster-wide, real-time view of nodes and running services, enables configuration
changes from a single control console, and delivers reporting and diagnostic tools for troubleshooting
and resolving issues.

• Cloudera Support: Cloudera offers the industry’s highest quality technical support for Hadoop, with a team
of support engineers composed of contributors and committers for every component of CDH. No one
knows the Hadoop stack better or has more experience supporting large-scale clusters in production. With
Cloudera Support, customers experience more uptime, faster issue resolution, and better performance.

Informatica: Discover Insights and Innovate Faster on Hadoop
For all its advantages in data processing and storage, Hadoop stands to become another data silo without
data integration or other complementary technology to unlock the business value of big data. In a number of
early deployments, some enterprises resorted to time-consuming hand coding for a range of data processing
requirements, despite high costs and downstream maintenance headaches.

Informatica addresses the need for a codeless environment for extract, transform, and load (ETL) workloads
on Hadoop, with a range of innovative Informatica Platform technologies that enable organizations to use their
existing Informatica-trained professionals or find the requisite skills from a global pool of more than 100,000
developers trained on Informatica technology. Informatica capabilities for Hadoop include:

• GUI-based development: Most Hadoop development today is performed by hand in a manner very similar
to the way ETL code was developed a decade ago before ETL tools such as Informatica PowerCenter® were
created. Graphical codeless development has been shown to reduce development time by as much as
fivefold while catching data errors that hand-coded Hadoop jobs miss.

• Universal data access: Organizations use Hadoop to store and process a variety of diverse data sources
and often face challenges in combining and processing all relevant data from their legacy data sources and
new types of data. The Informatica Platform helps organizations achieve ease and reliability of
pre- and post-processing of data into and out of Hadoop.

• High-speed data ingestion: Access, load, transform, and extract big data between source and target
systems or directly into Hadoop or your data warehouse. Replicate hundreds of gigabytes to terabytes per
hour from source systems to Hadoop.

• Data archiving: Archive data directly to Hadoop. Informatica helps to automate complex partitioning based
on related tables or entities, not just individual tables, using the underlying database partitioning capabilities.
Archive inactive data from production databases and data warehouses to extend their capacity and avoid
costly upgrades.

• Data parsing and exchange: Hadoop excels at storing a diversity of data, but deriving meaning from it
across all relevant datatypes is a major challenge. Informatica technology helps
improve productivity for extracting greater value from unstructured data sources—including images, texts,
binaries, and industry standards.

• Comprehensive data transformations: The Informatica Platform provides an extensive library of prebuilt
transformation capabilities on Hadoop, including basic datatype conversions and string manipulations,
high-performance caching-enabled lookups, joiners, sorters, routers, aggregations, and many more. Perform
natural language processing to extract entities from unstructured data such as from emails, social data, and
documents used to enrich master data.


• Metadata management: Informatica supplies full metadata management capabilities, with data lineage and
auditability, and promotes standardization across heterogeneous data environments.

• Data quality and data governance: Many organizations use Hadoop for end-user reporting and analytics
that require high data quality. Informatica technology furnishes capabilities to profile, cleanse, and manage
data to better understand what data means, increase trust, and manage data growth effectively and securely.

• Data profiling: Profile data directly on Hadoop both through the Informatica developer tool and a
browser-based analyst tool. This makes data profiling faster and more scalable, and makes it easier for
developers, analysts, and data scientists to collaborate on data flow specifications and to validate mapping
transformations and rules logic.

• Data virtualization: Use data virtualization to provide a fine-grained, secure access layer that combines data
on Hadoop with other information management systems, such as your data warehouse, MDM system, or applications.
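Several of the transformation capabilities listed above, such as cached lookups and aggregations, can be pictured with a small plain-Python sketch. The data and field names here are hypothetical; Informatica expresses equivalent logic graphically and executes it as distributed Hadoop jobs rather than as handwritten code:

```python
from collections import defaultdict

# Hypothetical order events joined against a cached customer lookup,
# then aggregated by region -- the same shape as a
# lookup -> joiner -> aggregator flow in a graphical ETL mapping.
customers = {101: "EMEA", 102: "APAC", 103: "EMEA"}  # cached lookup table

orders = [
    {"customer_id": 101, "amount": "250.00"},
    {"customer_id": 102, "amount": "75.50"},
    {"customer_id": 103, "amount": "120.00"},
]

def transform(orders, customer_lookup):
    totals = defaultdict(float)
    for order in orders:
        # Lookup: resolve the region; unmatched keys go to a default bucket.
        region = customer_lookup.get(order["customer_id"], "UNKNOWN")
        # Datatype conversion plus aggregation.
        totals[region] += float(order["amount"])
    return dict(totals)

print(transform(orders, customers))  # {'EMEA': 370.0, 'APAC': 75.5}
```

In a graphical tool the same flow is assembled from prebuilt components, so the lookup caching, type conversion, and aggregation strategy are configured rather than coded.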

Data Warehouse and ETL Optimization with Cloudera and Informatica
Through technology and professional services, Cloudera and Informatica offer enterprises a fast, repeatable
process for optimizing data warehouse and ETL processing and storage, one that maximizes the ROI of
existing information management infrastructure while capturing the performance and cost benefits of
Hadoop. Four challenges motivate shifting data processing and data volumes to Hadoop:

• As data volumes and business complexity grow, ETL and ELT processing cannot keep up on
conventional relational database technology. Critical business processing windows are missed.

• Databases are designed primarily to load and query data, not transform it. Transforming data in the database
consumes valuable CPU cycles, making queries run slower and degrading BI users’ experience.

• Conventional databases are expensive to scale as data volumes grow. Therefore, most organizations are
unable to keep all the data they would like to analyze directly in the data warehouse. As a result, they end
up throwing away the data or moving data to more affordable off-line systems, such as a storage grid or tape
backup. It’s very common to hear: “We want to analyze three years of data but can only afford three months.”

• Traditional data management infrastructure is not flexible enough to adapt as data volumes grow and new
datatypes emerge (e.g., machine data, documents, and social media). Change requests to schemas and
reports can take weeks or even months, leaving the business to fend for itself. Hadoop provides the flexibility
to cost-effectively work with more data and more types of data and to perform more flexible analysis,
enabling the business and IT to be more agile.

Consulting and tools such as Informatica’s Data Warehouse Advisor, software that monitors how businesses
use data, can help organizations evaluate their current data storage costs, processing capacity, and
performance bottlenecks, and identify raw or dormant data that could be managed more cost-effectively in Hadoop.
The PowerCenter Big Data Edition supplies a visual no-code development environment to build and execute
ETL transformations on Hadoop. It also enables developers to do complex file parsing (e.g., Web logs, JSON,
and XML), data profiling, and entity extraction for unstructured text (e.g., natural language processing) on
Hadoop. The PowerCenter Big Data Edition includes connectivity to traditional relational databases, social data
from Facebook, Twitter, and LinkedIn, and many other capabilities.
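To illustrate the kind of file parsing such tooling automates, here is a hedged sketch of turning Apache-style web log lines into structured records in plain Python. The regular expression and field names are assumptions for illustration, not Informatica’s actual parser:

```python
import re

# Common Log Format: host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Parse one web log line into a dict, or return None so that
    unparseable lines can be routed to a reject stream."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec

line = '10.0.0.1 - - [10/Mar/2013:12:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
print(parse_line(line)["status"])  # 200
```

In a codeless environment this parsing logic is configured through a visual specification instead of a handwritten pattern, which is what makes the development-time savings possible.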

The Cloudera/Informatica solution helps organizations address the challenges of traditional environments
with virtually unlimited scalability, cost-effective performance, costs 10 to 100 times lower, and productivity
gains of up to five times.

Informatica technology enables developers to build and deploy data transformations and data flows on Hadoop
without hand coding, and offers a variety of data movement capabilities, including data replication, batch,
trickle feed, and streaming, with scalability to move up to terabytes per hour into and out of Hadoop.
Cloudera consultants provide expertise in configuring, managing, and tuning a CDH cluster, with knowledge
transfer to ensure sustainability and extensibility in the years to come.

eHarmony, the popular online dating site, is a good example of an enterprise capitalizing on the capabilities of
a joint Cloudera/Informatica solution.

eHarmony Embraces Big Data with Cloudera and Informatica
eHarmony, founded in 2000 and now credited with an average of 542 marriages a day in the United States,
deployed the Cloudera CDH Hadoop distribution as the analytics platform to run proprietary algorithms that
processed data to generate compatibility matches. The company’s problem was that reliance on Ruby scripting
to transform hierarchical JSON data in Hadoop for use by its data warehouse was time-consuming for both
script development and processing; it also could not scale to an expected fivefold increase in data volumes.

eHarmony turned to HParser, Informatica’s data transformation environment optimized for Hadoop, to take full
advantage of Cloudera CDH and cut data processing time fourfold. Replacing Ruby scripting to process
JSON data held in Hadoop, HParser introduced advanced data parsing capabilities into the CDH environment,
eliminating tedious script development while slashing big data processing time from 40 minutes to 10 minutes.
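The heart of that task, flattening hierarchical JSON into the flat rows a data warehouse can load, can be sketched as follows. The record structure and field names are hypothetical, and HParser defines such transformations visually rather than in code:

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested JSON objects into a single flat
    record with dotted column names, the shape a warehouse load expects."""
    row = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        else:
            row[name] = value
    return row

# Hypothetical hierarchical member record of the kind held in Hadoop.
record = json.loads(
    '{"member_id": 42, "profile": {"age": 29, "location": {"city": "LA"}}}'
)
print(flatten(record))
# {'member_id': 42, 'profile.age': 29, 'profile.location.city': 'LA'}
```

At scale, the same flattening logic runs in parallel across the cluster rather than record by record in a script, which is where the processing-time reduction comes from.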

With the move, eHarmony extended its existing investment in Informatica PowerCenter, which loaded up to 7
TB a day into the data warehouse from conventional sources, to add HParser’s capabilities to handle JSON,
XML, Omniture Web analytics data, log files, Word, Excel, PDF and other files, as well as industry-standard file
formats (e.g., SWIFT, NACHA, and HIPAA). The joint Cloudera/Informatica solution gives eHarmony greater
speed and agility in embracing big data to meet business demands—for instance, generating compatible
matches almost immediately after a new member joins.


The Cloudera/Informatica Advantage
A joint Cloudera/Informatica solution offers distinct advantages in enabling organizations to realize the promise
of big data:

• Accelerates adoption of Hadoop by leveraging existing Informatica skill sets, letting customers design in
Informatica, reuse existing work, and run on CDH

• Expands Hadoop’s connectivity and processing capabilities through a rich set of prepackaged data
integration functionality

• Lowers costs of data processing and storage by allowing Informatica tasks best suited for Hadoop to run
on CDH

• Increases developer productivity with a metadata-driven graphical environment on a flexible and scalable
data platform

• Enables unified monitoring and management of data integration across Hadoop and other systems using
Informatica’s unified administration and Cloudera Manager

• Allows data governance across all data assets including data on Hadoop

Summary

Effectively harnessing big data promises quantifiable benefits to organizations. Beyond offloading data
storage and preprocessing from expensive database and data warehouse platforms to Hadoop for staging
and ETL, financial services companies can improve fraud detection processes and risk and portfolio analysis.
Telcos can process massive volumes of CDRs to improve customer support and provide new location-based
services. Manufacturers can leverage big data from machine and device sensors to improve product quality
and predictive maintenance. Retailers can use big data to make next-best-offer recommendations to increase
customer up-sell and cross-sell.

An analytics-ready Hadoop platform and advanced data integration are critical technologies to take full
advantage of big data. With Cloudera and Informatica, enterprises have proven solutions and services to
maximize their big data returns by successfully leveraging Hadoop as one part of their overall data
integration infrastructure. Learn more at

About Informatica
Informatica Corporation (NASDAQ: INFA) is the world’s number one independent provider of data integration
software. Organizations around the world rely on Informatica for maximizing return on data to drive their top
business imperatives. Worldwide, over 4,630 enterprises depend on Informatica to fully leverage their
information assets residing on-premise, in the Cloud, and across social networks.


Worldwide Headquarters, 100 Cardinal Way, Redwood City, CA 94063, USA
phone: 650.385.5000 fax: 650.385.5500 toll-free in the US: 1.800.653.3871
© 2013 Informatica Corporation. All rights reserved. Printed in the U.S.A. Informatica, the Informatica logo, and The Data Integration Company are trademarks or
registered trademarks of Informatica Corporation in the United States and in jurisdictions throughout the world. All other company and product names may be trade
names or trademarks of their respective owners.