Hadoop Map Reduce Informatica


Published by: Ashwarya Gupta on Jun 29, 2013
Copyright: Attribution Non-commercial







White Paper


MapR — The Industry’s Most Dependable Hadoop Platform

Hadoop in the Enterprise:
Maximizing Big Data Benefits with MapR and Informatica

Table of Contents
Introduction
Hadoop: A Strategic Data Analytics Platform
Informatica with MapR
A Better Hadoop: Additional Enhancements in MapR’s Distribution
Summary

The volume, velocity and variety of data are all growing relentlessly. This growth leaves organizations struggling to find the tools, talent and time to get value from data cost-effectively. The need to integrate Big Transaction Data with Big Interaction Data, while leveraging Big Data Processing technologies like Hadoop, is particularly challenging. Informatica offers the industry’s leading independent data integration platform that uniquely enables organizations to maximize the return on Big Data and drive top business imperatives. Informatica is also integrated with Hadoop, which is purpose-built for processing Big Data effectively and affordably, and with MapR Technologies’ distribution for Hadoop, which improves performance, scalability, reliability and ease-of-use. This white paper outlines how the combination of Informatica’s Data Integration platform and MapR’s distribution for Hadoop offers powerful new capabilities for integrating and processing Big Data more efficiently and cost-effectively than ever before.

© 2012 MapR, Inc. Confidential. All Rights Reserved.

Hadoop: A Strategic Data Analytics Platform

Hadoop provides a way to capture, organize, store, search, share, and analyze disparate data sources across a large cluster of commodity servers. Hadoop is designed to scale up from dozens to thousands of servers, each offering local computation and storage.

MapR Technologies has advanced the Hadoop state-of-the-art with major enhancements that overcome significant limitations of other Hadoop distributions, making Hadoop more enterprise-class in its operation, performance, scalability and reliability, as well as easier to integrate into the enterprise.

One major enhancement MapR has made involves re-architecting the Hadoop Distributed File System (HDFS) to provide full random read/write semantics, high availability, and direct access through NFS. These innovations overcome the many limitations of HDFS, including its batch-oriented data management and movement, lack of random read/write file access by multiple users/processes, and the requirement to close files before new updates can be read. Lockless storage with random reads and writes enables simultaneous access to data in near real-time, substantially improving performance.

In addition to overcoming HDFS limitations and improving data protection, Direct Access NFS affords some other significant advantages. Any remote client can simply mount the cluster, and application servers can then write their data and log files directly into the cluster, rather than writing first to direct- or network-attached storage. Existing applications and workflows can use standard NFS to access the Hadoop cluster to manipulate data, and optionally take advantage of the MapReduce framework for parallel processing. And files in the cluster can be modified directly using ordinary text editors, command-line tools, and UNIX applications and utilities, or other development environments.

Informatica with MapR

The combination of Informatica’s Data Integration platform with MapR’s Distribution for Hadoop enables organizations to access, ingest, parse and process the full range of structured and unstructured data (including messaging streams) with greater performance, scale and dependability than ever before. Commercial integration of the Informatica Data Integration platform with MapR’s distribution for Hadoop includes:
• Bi-directional data integration with Informatica PowerExchange
• Near real-time and snapshot replication using Informatica Data Replication and Informatica FastClone
• Parallel parsing and transformation on MapR using Informatica HParser
• Data streaming using Informatica Ultra Messaging

Informatica’s Data Replication and FastClone provide high-performance transaction updates and data loading from different hardware platforms and data sources into the MapR cluster for analysis via MapReduce or Hive. The data is loaded into the MapR cluster in near real-time or on a scheduled basis, whereas other Hadoop distributions and database connectors provide much lower throughput and are limited to one-time table dumps and batch loading.

Leveraging MapR’s Direct Access NFS, Informatica’s Ultra Messaging can stream messages directly into the MapR cluster to be retained and processed via MapReduce. Due to the limitations of HDFS, all other distributions cannot support Ultra Messaging streaming. Both Ultra Messaging and MapR feature parallel architectures with HA (no single points of failure) and best-in-class performance, making the combination ideal for production deployments.
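
As a rough illustration of the Direct Access NFS description above, the sketch below shows an application writing a log file into a mounted directory with ordinary file I/O and then overwriting part of it in place (a random write). It is a minimal sketch, not MapR or Informatica code: the environment variable CLUSTER_NFS_MOUNT, the file name app.log and the helper functions are hypothetical names chosen for this example, and the script falls back to a local temporary directory so that it runs without any cluster.

```python
import os
import tempfile

# Hypothetical mount point: the real path depends on how the cluster is mounted.
# Falls back to a local temporary directory so the sketch runs anywhere.
MOUNT_POINT = os.environ.get("CLUSTER_NFS_MOUNT") or tempfile.mkdtemp()


def append_log(path, line):
    """Append one log line using ordinary file I/O."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")


def overwrite_at(path, offset, data):
    """Overwrite bytes in place at the given offset (a random write)."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)


def read_at(path, offset, length):
    """Read `length` bytes starting at `offset` (a random read)."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


log_path = os.path.join(MOUNT_POINT, "app.log")
append_log(log_path, "status=PENDING order=1001")
append_log(log_path, "status=PENDING order=1002")

# Update the first record in place instead of rewriting the whole file.
overwrite_at(log_path, len("status="), b"SHIPPED")
print(read_at(log_path, 0, 25))
```

Nothing in the script is specific to Hadoop. That is the point the paper makes about standard NFS access: existing applications and utilities would address files in the cluster the same way they address any other mounted file system.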

Informatica’s HParser Community Edition (included in the MapR distribution) helps create an easy-to-use integrated data environment (IDE) that enables customers to visually design, without the use of hand-coding, data parsing transformations for industry-standard formats (e.g. SWIFT, FIX, HL7, ACORD, EDI, and many more) and popular document formats (e.g. PDF, MS Office, etc.), as well as complex files (e.g. Omniture, logs, XML and JSON). These transformations can then be executed in parallel in the Hadoop cluster. The performance advantages of MapR, combined with the efficiency of HParser, allow users to perform data parsing and transformations with higher performance and lower hardware costs compared to other options.

PowerExchange for Hadoop makes it easier for non-programmers to move transaction and interaction data between a MapR cluster and other databases and data warehouses. MapR’s Direct Access NFS interface also enables users to leverage Informatica’s full range of data sources and transformations with the Hadoop environment.

A Better Hadoop: Additional Enhancements in MapR’s Distribution

Direct Access NFS also facilitates support for volumes, snapshots and mirroring for all data contained within the Hadoop cluster. Volumes make clustered data easier to both access and manage by grouping related files and directories into a single tree structure that can be more readily organized, administered and secured. Snapshots can be taken periodically to create drag-and-drop recovery points, and mirroring extends data protection to satisfy recovery time objectives. Local mirroring provides high performance for highly-accessed data, while remote mirroring provides business continuity across multiple data centers, as well as integration between on-premise and private clouds.

Another major enhancement MapR made was to eliminate single points of failure in the critical NameNode and JobTracker functions. MapR’s Distributed NameNode HA (High Availability) distributes the file metadata on ordinary DataNodes throughout the cluster. In the extreme, all DataNodes might store and serve a portion of the file metadata. Every portion is then persisted to disk (with the node’s data) and also replicated to at least two other nodes to increase tolerance to multiple simultaneous node failures. This eliminates the need with other distributions to continuously back up the Primary NameNode to either a Checkpoint Node (previously called the Secondary NameNode) or a Backup Node, further improving reliability without requiring any extraordinary measures. MapR’s JobTracker offers similar resiliency with the ability to continue all tasks with no interruption or data loss in the event of a failure. Without such transparent failover, it is necessary to restart the job(s) affected from the beginning.

MapR’s Distributed NameNode HA architecture also improves scalability and performance compared to configurations with a single, Primary NameNode. Even in a server configured with copious amounts of memory, a single NameNode is normally limited to only about 70 million files. With MapR’s Distributed NameNode HA architecture, by contrast, the cluster scales in a linear fashion with the number of DataNodes, and can therefore contain a virtually unlimited number of files. By distributing the file metadata across multiple DataNodes throughout the cluster, performance also scales in a linear fashion with the size of the cluster. The performance advantage derives from the elimination of a Primary NameNode, which can become a bottleneck even in relatively small clusters.
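
The general idea of spreading file metadata over ordinary nodes, with each portion also held by at least two other nodes, can be sketched as follows. This is an illustrative toy under stated assumptions, not MapR’s actual placement algorithm, which the paper does not describe: the hash-based placement, the node names and the functions placement and lookup are all hypothetical.

```python
import hashlib

REPLICAS = 3  # one copy plus "at least two other nodes", per the text


def placement(path, nodes, replicas=REPLICAS):
    """Pick `replicas` distinct nodes for a file's metadata by hashing its path.

    Assumes len(nodes) >= replicas so the chosen nodes are distinct.
    """
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]


def lookup(path, nodes, failed):
    """Return the first live node holding the metadata, or None if all are down."""
    for node in placement(path, nodes):
        if node not in failed:
            return node
    return None


# Hypothetical ten-node cluster; no single node holds all of the metadata.
nodes = [f"datanode-{i:02d}" for i in range(10)]
path = "/projects/sales/2012/q1.csv"

holders = placement(path, nodes)
# Simulate two simultaneous node failures among the holders.
survivor = lookup(path, nodes, failed=set(holders[:2]))
print(holders, "->", survivor)
```

With three copies, a lookup still succeeds after any two of the holding nodes fail, which is the tolerance the paragraph above describes. Because each node is responsible for only a share of the paths, adding nodes adds metadata capacity, in contrast to a single NameNode that must hold everything in one server’s memory.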

Summary

By using Informatica with MapR’s distribution for Hadoop, organizations are now able to achieve high-performance data integration, replication and messaging, as well as to parse and process a broad range of structured and unstructured data natively in Hadoop, all without coding. Together the two companies are pushing the limits of high-performance networks to move many terabytes per hour of transaction, interaction and streaming data into the MapR cluster. The combination also gives organizations a more affordable way to archive data in applications, data warehouses and/or legacy systems to Hadoop, or to archive data to Hadoop’s lower-cost storage. Together Informatica and MapR provide cost-effective, analytic-ready data storage and processing with enterprise-class high availability and business continuity.

To learn more, please visit either company on the Web at www.mapr.com or www.informatica.com.

MapR Technologies is the creator of the industry’s fastest, most dependable and easiest to use distribution for Apache Hadoop. MapR Technologies is dedicated to advancing the Hadoop platform and ecosystem to enable more businesses to harness the power of big data analytics for competitive advantage. For more information, please visit www.mapr.com or call 855-NOW-MAPR (855-669-6277).
