PROGRAM GUIDE

October 12, 2010 | Hilton New York Hotel

WELCOME

Greetings! Welcome to the 2nd annual Hadoop World conference. Attendance has nearly doubled from last year; clearly, interest in Hadoop continues to grow and is stronger than ever. The pace of innovation around the platform is truly amazing. You'll have a chance today to hear in detail, directly from practitioners, how and where Hadoop is making a real practical difference in the ways they capture, store and analyze data. They'll tell you how they make better decisions faster as a result.

We've tried hard to leave time between sessions and during breaks for you to have the hallway conversations that are the most valuable part of any technology conference. I'd like to thank the presenters and sponsors who are helping to make this event happen. The sponsors will be available throughout the day in the exhibit hall to answer your questions and show how their products work with Hadoop.

I'd also like to thank you personally for spending this day with us. Today's program is excellent, but it's the attendees who make the event worthwhile. I'm really excited by the quality of the talks and the speakers in today's sessions. I'm sure you'll find them interesting and useful!

Mike Olson, CEO, Cloudera


AGENDA

8:00am – 6:00pm   Registration (Grand Ballroom Foyer)
8:00am – 9:00am   Breakfast, sponsored by Pentaho (Grand Ballroom Foyer)
9:00am – 10:30am  Opening Keynotes: Mike Olson, CEO, Cloudera | Tim O'Reilly, Founder and CEO, O'Reilly Media (Grand Ballroom)
10:30am – 11:00am Break, sponsored by Greenplum (Grand Ballroom Foyer)

Breakout sessions run in five rooms: the Grand Ballroom (3rd Floor) and the Beekman Parlor, Sutton North, Sutton Center and Sutton South (all 2nd Floor). Room assignments appear with the session descriptions on the following pages.

11:00am – 11:30am
- The Business of Big Data | Abhishek Mehta, Bank of America
- T Hadoop Analytics: More Methods, Less Madness | Shevek Mankin, Karmasphere
- Hadoop Image Processing for Disaster Relief | Andrew Levine, TexelTek
- T Making Hadoop Security Work in Your IT Environment | Todd Lipcon, Cloudera
- T Search Analytics with Flume and HBase | Otis Gospodnetic, Sematext
- Advanced Analytics for the US Army Intelligence Cloud | Tim Estes, Digital Reasoning
- T The Explorys Network | Doug Meil, Explorys

11:35am – 12:05pm
- Hadoop at eBay | Anil Madan, eBay
- RDBMS and Hadoop: A Powerful Coexistence | Ben Werther, Greenplum (now part of EMC Corp.)
- Top 10 Lessons Learned from Deploying Hadoop and HBase | Rod Cope, OpenLogic
- T Using Hadoop for Indexing for Biometric Data, High Resolution Images, Voice/Audio Clips, and Video Clips | Lalit Kapoor, Booz Allen Hamilton

12:10pm – 12:40pm
- Hadoop: Best Practices and Real Experience Going from 5 to 500 Nodes | Phil Day, HP
- T Migrating to CDH and Streaming Data Warehouse Loading | Christopher Gillett, Visible Measures
- AOL's Data Layer | Ian Holsman, AOL
- Large Scale Web Analytics Utilizing AsterData and Hadoop | Will Duckworth, comScore

12:40pm – 1:45pm  Lunch, sponsored by Karmasphere (Grand Ballroom Foyer)

1:45pm – 2:15pm
- Hadoop and Hive at Orbitz | Jonathan Seidman, Orbitz
- SIFTing Clouds | Paul Burkhardt, SRA International, Inc.
- T HBase in Production at Facebook | Jonathan Gray, Facebook
- Business Analyst Tools & Applications for Hadoop | Amr Awadallah, Cloudera
- Better Ad, Offer, and Content Targeting using Membase with Hadoop | James Phillips, Membase, Inc.; Manu Mukerji, ShareThis; Pero Subasic, AOL

2:20pm – 2:50pm
- The Hadoop Ecosystem at Twitter | Kevin Weil, Twitter
- T SHARD: Storing and Querying Large-Scale SemWeb Data | Kurt Rohloff, BBN
- T ZooKeeper in Online Systems, Feed Processing and Cluster Management | Mahadev Konar, Yahoo!
- Scale In: Collecting Distributed Data via Flume and Querying Through Hive | Anurag Phadke, Mozilla
- T Exchanging Data with the Elephant: Connecting Hadoop and an RDBMS Using SQOOP | Guy Harrison, Quest

2:55pm – 3:25pm
- Millionfold Mashups | Philip Kromer, Infochimps
- Optimizing Hadoop Workloads | Nurcan Coskun, Intel Software and Services Group
- Cloudera Roadmap Review | Charles Zedlewski, Cloudera
- Multi-Channel Behavioral Analytics | Stefan Groschupf, Datameer

3:25pm – 4:00pm  Break, sponsored by Intel (Grand Ballroom Foyer)

4:00pm – 4:30pm
- Intelligent Text Information Processing System | Vaijanath Rao, AOL
- Sentiment Analysis Powered by Hadoop | Linden Hillenbrand, GE; Li Chen, GE
- Apache Hadoop in the Enterprise | Arun Murthy, Yahoo!
- Using R and Hadoop to Analyze VoIP Network Data for QoS | Saptarshi Guha, Purdue University

4:35pm – 5:05pm
- Hadoop: Lessons Learned from Deploying Enterprise Clusters | Shinichi Yamada, NTT Data Corporation
- A Fireside Chat: Using Hadoop to Tackle Big Data at comScore | Martin Hall, Karmasphere; Will Duckworth, comScore
- T Mixing Real-Time Needs and Batch Processing: How StumbleUpon Built an Advertising Platform Using HBase and Hadoop | Jean-Daniel Cryans, StumbleUpon
- T MapReduce and Parallel Database Systems: Complementary or Competitive Technology? | Daniel Abadi, Yale University

5:10pm – 5:40pm
- Managing Derivatives Data with Hadoop | Joshua Bennett, Chicago Mercantile Exchange
- T Putting Analytics in Big Data Analysis | Richard Daley, Pentaho
- "Productionizing" Hadoop: Lessons Learned | Eric Sammer, Cloudera
- T Techniques to Use Hadoop with Scientific Data | Jerome Rolia, HP Labs

5:45pm – 6:00pm  Closing Remarks: Mike Olson, CEO, Cloudera (Grand Ballroom)
6:00pm – 7:30pm  Networking Reception, sponsored by NTT Data Corporation (Mercury Ballroom)

T - Technical Session


BREAKOUT SESSIONS

11:00am – 11:30am

The Business of Big Data
Abhishek Mehta, Managing Director, Big Data & Analytics, Bank of America (Grand Ballroom)
How an organization with established and legacy infrastructure, technology and business processes adopts Hadoop technologies and processes to find groundbreaking solutions to known problems. The key: start with a business problem.

T Hadoop Analytics: More Methods, Less Madness
Shevek Mankin, Chief Technical Officer, Karmasphere (Beekman Parlor)
One of the biggest stumbling blocks to leveraging Big Data, or even cloud computing in general, is the amount of expertise it takes to get even simple tasks done. In this session, we'll discuss proven methods to quickly and effectively extract intelligence using Hadoop, including prioritizing when to use a lower-level language like Java versus Hive and SQL, Pig, Cascading, etc. We will discuss real-world use cases, illustrating what other enterprises do to leverage Hadoop without loads of additional training or testing time. This session will help analysts and developers alike understand the capabilities and compromises of alternative approaches.

Hadoop Image Processing for Disaster Relief
Andrew Levine, Software Developer, TexelTek Inc. (Sutton North)
The Open Cloud Consortium's Matsu Project is developing an open source system to process large amounts of image data and detect significant changes in order to assist disaster relief efforts. Processing of the source imagery focuses on making high resolution images highly available to disaster relief workers in a timely fashion. The effort also performs temporal comparison of geospatially identical areas to reveal change over time; for example, it can highlight fallen buildings and bridges or the progress of floods. The framework should work well for other types of image processing, such as anomaly detection and pattern identification.

T Search Analytics with Flume and HBase
Otis Gospodnetic, Founder, Sematext International (Sutton Center)
In this talk we will show how we use Flume to transport search and clickstream data to HBase, with the ultimate goal of producing Search Analytics reports from that data. We'll show how data flows through the system from the moment a query or click event is captured in the search application UI until it lands in HBase via Flume's HBase sink. We'll also share what this system looked like in the pre-Flume days. Finally, we'll demonstrate various reports the system ultimately produces and the insight we derive from them.

Advanced Analytics for the US Army Intelligence Cloud
Tim Estes, CEO, Digital Reasoning (Sutton South)
The US Army's mission has evolved to deal with understanding entity-level relationships from massive amounts of structured and unstructured data. To tackle this problem and support a new generation of entity-centric analytics, the US Army has adopted Hadoop and other cloud-scale analytic technologies to support mission-critical intelligence. At the heart of these analytic efforts is a new system for understanding and integrating structured and unstructured data called Synthesys. This talk will discuss the type and scale of analysis that the new Army Cloud is doing using Synthesys, and how Hadoop/CDH3 is a critical component of that infrastructure.

11:35am – 12:05pm

Hadoop at eBay
Anil Madan, Director of Engineering, Analytical Platform Development, eBay (Grand Ballroom)
This talk will illustrate how eBay is leveraging its data assets for advanced insights and analytics. Learn how eBay is sourcing huge volumes of data into the cluster and running clickstream and transactional data analysis for user behavior, search quality and research use cases.

RDBMS and Hadoop: A Powerful Coexistence
Ben Werther, Director of Product Management, Greenplum, now part of EMC Corp. (Beekman Parlor)
Today, all the data in an organization is important. What does it take to manage massive volumes of structured and unstructured data and meet the demand for timely business insight? Innovations like MPP, MapReduce, Hadoop and in-database analytics are redefining what is possible. Learn how Adknowledge applied these tools to analyze massive amounts of data from email and digital advertising campaigns to deliver actionable business insight, faster than ever before.
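The Search Analytics session above describes capturing query and click events and aggregating them into reports. As a rough illustration of that final aggregation step (this is not Sematext's actual pipeline; the event shape and field names here are invented), a pure-Python sketch:

```python
from collections import defaultdict

# Toy event stream: the kind of search/click records the talk describes
# landing in HBase via Flume. Fields are illustrative only.
events = [
    {"type": "query", "q": "hadoop"},
    {"type": "query", "q": "hadoop"},
    {"type": "click", "q": "hadoop"},
    {"type": "query", "q": "hbase"},
]

def ctr_report(events):
    """Aggregate raw query/click events into per-query click-through rates."""
    queries = defaultdict(int)
    clicks = defaultdict(int)
    for e in events:
        if e["type"] == "query":
            queries[e["q"]] += 1
        elif e["type"] == "click":
            clicks[e["q"]] += 1
    return {q: clicks[q] / queries[q] for q in queries}

print(ctr_report(events))  # {'hadoop': 0.5, 'hbase': 0.0}
```

At production scale the same aggregation would run as a MapReduce job over event tables in HBase rather than an in-memory loop.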

T Making Hadoop Security Work in Your IT Environment
Todd Lipcon, Cloudera; Aaron Myers, Cloudera (Sutton North)
The Apache Hadoop project has seen several recent advances in its security model, including the addition of authentication. This session will discuss the current state of Hadoop security and how compatible it is with different aspects of typical enterprise IT environments. Attendees will learn the details of the security integration in Hadoop, and we will also discuss the integration of security throughout the various projects included in Cloudera's Distribution for Hadoop (CDH).

T The Explorys Network
Doug Meil, Director of Engineering, Explorys (Sutton South)
Formed in partnership with Cleveland Clinic, Explorys addresses the national imperative to leverage electronic health records (EHR) across a network of healthcare providers and life sciences organizations for the improvement of care and drug safety. With three major healthcare providers already committed, and more expected this year, the Explorys network will go online in the summer of 2010. The Explorys healthcare cloud computing platform leverages Hadoop, HBase and MapReduce to search and analyze patient populations, treatment protocols and clinical outcomes. Already spanning a billion anonymized clinical records, Explorys provides uniquely powerful, HIPAA-compliant solutions for accelerating life-saving discovery.

Top 10 Lessons Learned from Deploying Hadoop and HBase
Rod Cope, CTO & Founder, OpenLogic, Inc. (Sutton Center)
Hadoop, HBase and friends are built from the ground up to support Big Data, but that doesn't make them easy. As with any relatively new and complex technology, there are rough edges and growing pains to manage. I've learned some hard lessons while deploying HBase tables containing billions of rows and dozens of terabytes on OpenLogic's Hadoop infrastructure. Come to this session to learn about some of the "gotchas" you might run into when deploying Hadoop and HBase in your own production environment, and how to avoid them.

12:10pm – 12:40pm

Hadoop: Best Practices and Real Experience Going from 5 to 500 Nodes
Phil Day, HP (Grand Ballroom)
The simple hardware requirements and pre-packaged distributions mean that standing up a small Hadoop prototype is easily achievable in many organizations. However, transitioning from this to a full proof of concept or operational cluster presents many technical and organizational challenges. In this talk we will discuss some of the issues we have encountered while working with customers who want to move beyond the prototype, and how we have helped overcome them. In particular we will cover the steps from hardware selection through build, deployment and configuration, and service management considerations.

T Migrating to CDH and Streaming Data Warehouse Loading
Christopher Gillett, Chief Software Architect, Visible Measures Corporation (Beekman Parlor)
We recently migrated from a legacy version of Apache Hadoop to a modern implementation using CDH. In parallel we moved from MySQL to Vertica. This talk focuses on the migration techniques used, including gradual grid decommissioning and buildup, data compression, and load balancing. I will also discuss how we refactored our data warehouse loading process to move from a traditional bulk-load model to a streaming approach. Finally, I will present performance numbers comparing CDH to legacy implementations of Hadoop.

AOL's Data Layer
Ian Holsman, CTO Relegence, AOL (Sutton North)
An overview of how we use Hadoop and other open source technologies to provide reporting and basic clustering services to AOL's websites.
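Many sessions in this guide assume familiarity with Hadoop's MapReduce model. As a reminder of the shape of that computation, here is a minimal pure-Python simulation of the map, shuffle and reduce phases (illustrative only; this is not Hadoop's actual Java API):

```python
from collections import defaultdict
from itertools import chain

def map_phase(records, mapper):
    # Each mapper call emits (key, value) pairs for one input record.
    return chain.from_iterable(mapper(r) for r in records)

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    # Each reducer call collapses one key's values into a result.
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Classic word count expressed in this model.
docs = ["hadoop scales", "hbase on hadoop"]
mapper = lambda line: [(w, 1) for w in line.split()]
reducer = lambda word, counts: sum(counts)

counts = reduce_phase(shuffle(map_phase(docs, mapper)), reducer)
print(counts["hadoop"])  # 2
```

In a real cluster, the map and reduce calls run in parallel across machines and the shuffle moves data over the network; the data flow, however, is exactly this.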


T Using Hadoop for Indexing for Biometric Data, High Resolution Images, Voice/Audio Clips, and Video Clips
Lalit Kapoor, Booz Allen Hamilton (Sutton Center)
As the types and volume of multimedia content and complex numeric data increase across the Internet, searching this data becomes inaccurate or prohibitively expensive. To help address this problem, we created Fuzzy Table, a distributed, low-latency, fuzzy-matching database built over Hadoop that enables fast fuzzy searching of content that cannot be easily indexed or ordered, such as biometric data, high resolution images, voice/audio clips, and video clips. In this presentation we will discuss scaling an application using Fuzzy Table over Amazon's EC2 service. We will present experiences, lessons learned, and performance metrics from building large scale systems over Hadoop.

Large Scale Web Analytics Utilizing AsterData and Hadoop
Will Duckworth, Vice President, Software Engineering, comScore (Sutton South)
In this session you will see how one company has leveraged AsterData and Cloudera's Distribution for Hadoop (CDH) to build an environment that supports processing over 500 billion rows of web log records in a syndicated production environment. The session will focus on how comScore applies its taxonomy of the web to categorize observed URLs, and on the methods used to leverage multiple large scale analytical systems. comScore's taxonomy currently classifies over 88% of all web pages observed on the internet.

1:45pm – 2:15pm

Hadoop and Hive at Orbitz
Jonathan Seidman, Lead Software Engineer, Orbitz Worldwide (Grand Ballroom)
Orbitz Worldwide's portfolio of global consumer travel brands processes millions of searches and transactions every day. Storing and processing the ever-growing volumes of data generated by this activity becomes increasingly difficult with traditional systems such as relational databases. This presentation details how Orbitz is using new tools such as Hadoop and Hive to meet these challenges. We'll discuss how Hadoop and Hive are being leveraged to provide data and analysis that allow us to optimize the products shown to consumers and to drive statistical analysis of macro trends.

SIFTing Clouds
Paul Burkhardt, SRA International, Inc. (Beekman Parlor)
Computer vision algorithms are ideal candidates for distributed computing given the compute-intensive nature of the algorithms and the increasing extent of image resolution and volume. We will describe our MapReduce implementations of the Scale-Invariant Feature Transform (SIFT) algorithm, a well-known computer vision algorithm used for object recognition. Our SIFT MapReduce application enables fast object identification in distributed image datasets. We will present our results and a new approach to internet image search.

T HBase in Production at Facebook
Jonathan Gray, Software Engineer, Open Source Advocate, Facebook (Sutton North)
A talk on how Facebook is using HBase in production to power both online and offline applications. Beginning with why we chose HBase, this presentation will cover the specifics of our use cases and how HBase fits in. Details will be shared about HBase usage for realtime serving applications as well as to augment existing Hadoop-based data warehousing.

Business Analyst Tools & Applications for Hadoop
Amr Awadallah, CTO, Cloudera (Sutton Center)
It is a widely held misconception that Hadoop is limited to programmers and others familiar with command line interfaces. While historically true, there has since been an explosion of analyst tools announced for Hadoop. This session will cover the different categories of Hadoop analyst tools, their capabilities, current maturity and applicable use cases.

Better Ad, Offer, and Content Targeting using Membase with Hadoop
James Phillips, Co-founder, Membase, Inc.; Manu Mukerji, Architect, ShareThis; Pero Subasic, Chief Architect, AOL (Sutton South)
Real-time ad, offer, and content targeting decisions must happen quickly. AOL Advertising and ShareThis describe how Membase and Hadoop combine in their environments to accelerate and improve targeting. Creating user profiles with Hadoop, then serving them from Membase, reduces profile read and write access to under a millisecond, leaving the bulk of the processing-time budget for improved targeting and customization.
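The Fuzzy Table session above centers on fuzzy matching of data that cannot be exactly indexed. The talk does not spell out its algorithm, but the general idea of scoring every candidate under a distance metric can be sketched in a few lines of Python (Hamming distance over fixed-length byte strings; the table contents are invented):

```python
def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def fuzzy_lookup(table, probe, max_dist):
    """Return the closest stored (key, distance) within max_dist bits, else None.

    Unlike an exact index, every candidate must be scored; a system like the
    one described in the talk distributes this scan across a cluster."""
    best = None
    for key, vec in table.items():
        d = hamming(vec, probe)
        if d <= max_dist and (best is None or d < best[1]):
            best = (key, d)
    return best

table = {"subject-1": b"\x0f\x0f", "subject-2": b"\xff\x00"}
print(fuzzy_lookup(table, b"\x0f\x0e", max_dist=4))  # ('subject-1', 1)
```

Real biometric or media features would be much longer vectors under domain-specific metrics, but the full-scan-plus-threshold structure is the same.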


2:20pm – 2:50pm

The Hadoop Ecosystem at Twitter
Kevin Weil, Analytics Lead, Twitter (Grand Ballroom)
Hadoop is rapidly becoming must-have infrastructure for companies of all kinds. But as word of mouth grows, so do questions about how one actually uses Hadoop to solve business problems. There are a number of excellent applications on top of Hadoop, like Pig, HBase, and Hive; how do those fit in? How does one get data into Hadoop, and then back out afterward? In this talk I'll discuss specifically how Twitter uses these tools to solve critical business and engineering problems.

T SHARD: Storing and Querying Large-Scale SemWeb Data
Kurt Rohloff, Scientist, BBN Technologies (Beekman Parlor)
Current Semantic Web data processing technologies are sufficient for generally small datasets, but current methodologies create horrible query processing bottlenecks in Semantic Web triple-stores. This contradicts the fundamentally Web-scale Semantic Web vision, and the resulting triple-store performance is probably one of the reasons there hasn't been broader uptake of SemWeb technologies. In this talk I will review SHARD, a proof-of-concept triple-store built on Hadoop. SHARD responds to SPARQL queries, stores triple data in HDFS and provides basic OWL reasoning capabilities. SHARD compares favorably in query performance to recent industrial triple-stores, but is much more scalable and robust.

T ZooKeeper in Online Systems, Feed Processing and Cluster Management
Mahadev Konar, Software Engineer, Yahoo! (Sutton North)
ZooKeeper has been in production for over three years. Its performance and reliability have made it a critical component in distributed systems, and its design has proven flexible enough to apply to a wide variety of distributed applications' needs. It has simplified the lives of service engineers and is easily applied to your project. In this talk we will review examples of applications that use ZooKeeper to show the breadth of solutions it enables: 1) an online ads system, where ZooKeeper is used for fault tolerance and service discovery; 2) a feed processing platform that uses ZooKeeper for fault tolerance, name service, service discovery and load balancing information; and 3) a crawling service, where ZooKeeper is used for cluster management, storing sharding information, name service and fault tolerance.

Scale In: Collecting Distributed Data via Flume and Querying Through Hive
Anurag Phadke, Senior Metrics Engineer, Mozilla Corporation (Sutton Center)
Socorro (a crash reporting system), Tinderbox and BuildBots (build systems) are a few of the distributed systems used at Mozilla. These systems are critical for stable product releases, and each build/deployment "run" emits tons of useful log information. With Flume, all of this information is collected in a single location, and Hive allows us to analyze the data in a fine-grained fashion. The presentation includes a technical overview of the Flume + Hive integration, our current architecture, optimizations and tradeoffs, and results pertaining to: a) Socorro: performance of a specific processor and of the whole cluster (efficiency, throughput), load graphs, etc.; b) Tinderbox: time taken for a specific build to complete, the most commonly occurring error messages, etc.

T Exchanging Data with the Elephant: Connecting Hadoop and an RDBMS Using SQOOP
Guy Harrison, Director, R&D Melbourne, Quest (Sutton South)
As Hadoop penetrates the enterprise, it will increasingly be called upon to integrate with more traditional enterprise datastores, and with Oracle in particular. To this end, Cloudera has provided the open source SQOOP utility to import and export data between any SQL database and Hadoop. Quest has partnered with Cloudera to provide OraOop, an enhanced utility that provides performance and functionality enhancements for those who wish to interoperate Oracle and Hadoop. This presentation will discuss the architecture of SQOOP and how its extensibility allows third-party providers like Quest to supply optimized drivers for specific SQL databases. We'll then discuss technical challenges in moving data between Oracle and Hadoop. Finally, we'll consider how Hadoop changes the landscape for enterprise data management and speculate on how the data centre of the future might leverage the best features of Oracle and Hadoop.

2:55pm – 3:25pm

Millionfold Mashups
Philip Kromer, President, infochimps (Grand Ballroom)
At infochimps, we're assembling a data repository containing thousands of public and commercial datasets, many at terabyte scale. Modern machine learning algorithms can provide insight into data by drawing only on its generic structure, even more so
when that data is organically embedded in a sea of linked datasets. I'll talk about the tools and algorithms we use to manage massive-scale, massive-numerosity data collections, and our bag of tricks for exploring the deep structure and new frontiers where these datasets meet.

Optimizing Hadoop Workloads
Nurcan Coskun, Intel Software and Services Group (Beekman Parlor)
Deploying a highly efficient Hadoop cluster requires careful attention not only to hardware but also to a multitude of configuration options in Hadoop, HDFS, and the software stack. Intel has devoted resources to Hadoop analysis and testing, both internally and with fellow travelers, to develop ways to improve the efficiency and performance of Hadoop clusters. This workshop will provide a brief introduction to Intel's analysis and some considerations for optimizing Hadoop for faster analysis and better efficiency. Intel's whitepaper on Hadoop optimization will also be available for more in-depth discussion.

Cloudera Roadmap Review
Charles Zedlewski, Sr. Director, Product Management, Cloudera (Sutton North)
In this session we will discuss recent updates to Cloudera's Distribution for Hadoop (CDH) and to Cloudera Enterprise. In addition, we will present the roadmap for the next 12 months, giving you valuable insight into development plans.

Multi-Channel Behavioral Analytics
Stefan Groschupf, Chief Technology Officer, Datameer (Sutton Center)
This presentation will focus on a use case of how a Fortune 500 company can leverage Hadoop to tackle the challenges of multi-channel behavioral analytics, including the large number of data sources, structured and unstructured data, and big data. For example, the demonstration will show the power of marrying clickstream data with customer demographic data from a CRM system and purchase history from an order management system to determine the promotional campaign most likely to succeed. Further, this session will explain how to bring in social media conversations so that companies can better identify how customers are influencing each other's buying decisions.

3:25pm – 4:00pm  Break, sponsored by Intel (Grand Ballroom Foyer)

4:00pm – 4:30pm

Intelligent Text Information Processing System
Vaijanath Rao, Technical Lead, AOL (Grand Ballroom)
Given the large amount of content available online, extracting information from it poses a great challenge. The first challenge is processing the huge volume of text; the second is extracting useful and important information from it. In this session, we describe our work on extracting keywords and events (location, date and time). The keywords include important and significant words or phrases that describe the content, which can be used for topic detection and modeling, summarization, etc. Our goal is to use them for contextual advertising by identifying relevant ads from the keywords. We pass them through a filtering module that identifies the mood of the content, and we restrict ads to positive moods only.

Sentiment Analysis Powered by Hadoop
Linden Hillenbrand, Product Manager, Hadoop Technologies, General Electric; Li Chen, Project Manager, New Media Technology, General Electric (Beekman Parlor)
At GE, our Digital Media and Hadoop teams built an interactive application for our Marketing & Communications functions. One of the application's capabilities is automated sentiment analysis, which gives our Marketing & Communications teams the ability to assess external perception of GE (positive, neutral, or negative) across our various campaigns. Hadoop powers the sentiment analysis aspect of the application. This is a highly intensive text mining use case for Hadoop, but through it we greatly reduce our processing time for sentiment analysis and enable our business leaders to complete their analyses quickly and accurately.

Apache Hadoop in the Enterprise
Arun Murthy, Yahoo! (Sutton North)
This session covers the strides taken by Hadoop in the last 12 months at Yahoo! to address the needs of the enterprise, including the multiple man-years of effort on strong security for Hadoop (both the file system and MapReduce), support for multiple organizations using Hadoop clusters in a multi-tenant, resilient manner, and operability enhancements to help run very large clusters cost-effectively with minimal human intervention. This talk also presents a brief survey of some of the business-critical applications enabled by these enhancements.
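GE's session above does not detail its sentiment model, but the basic shape of a word-list sentiment scorer, the kind of per-document step that Hadoop would parallelize across millions of documents, can be sketched as follows (the word lists here are invented and deliberately tiny):

```python
# Toy polarity lexicons; a real system would use large curated word lists
# or a trained model rather than these invented examples.
POSITIVE = {"great", "love", "innovative"}
NEGATIVE = {"poor", "slow", "broken"}

def sentiment(text: str) -> str:
    """Classify a snippet as positive / neutral / negative by word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("GE's new turbine is innovative and I love it"))  # positive
```

Because each document is scored independently, the work partitions cleanly across a cluster, which is what makes this a natural MapReduce use case.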


Using R and Hadoop to Analyze VoIP Network Data for QoS
Saptarshi Guha, Dept. of Statistics, Purdue University (Sutton Center)
RHIPE is an R package that integrates the R environment for statistics and data analysis with the Hadoop distributed computing framework. With RHIPE, the user can store and compute with large and complex data sets using R functions and programming idioms. In this talk, I will demonstrate the use of RHIPE to analyze 190GB of VoIP network data for QoS. The jitter between two consecutive packets is the deviation of the real inter-arrival time from the theoretical one. We show that jitter follows the desired properties and is negligible, which supports the assumption that the measured traffic is close to the offered traffic.

4:35pm – 5:05pm

Hadoop: Lessons Learned from Deploying Enterprise Clusters
Shinichi Yamada, EVP & CTO, NTT Data Corporation (Grand Ballroom)
NTT DATA has over three years of experience helping enterprise customers design, deploy and run Hadoop clusters in the range of 20 to over 1,000 nodes. In this presentation, we briefly introduce Hadoop business cases in Japan and how NTT DATA addresses the needs of enterprise users. Drawing on lessons learned from working with large enterprise clusters, we also discuss the typical reframing in design and operational economies that has made Hadoop deployments successful for users. To provide a use case example, we have invited a customer to present alongside us and explain how they have adopted Hadoop into their private cloud infrastructure.

A Fireside Chat: Using Hadoop to Tackle Big Data at comScore
Martin Hall, Co-founder and CEO, Karmasphere; Will Duckworth, VP Software Engineering, comScore (Beekman Parlor)
This session will present a commercial use case of Hadoop in a classic "fireside chat" format. Martin Hall, co-founder and CEO of Karmasphere, will talk informally with Will Duckworth, Vice President of Software Engineering at comScore, sharing insights into comScore's experience working with Hadoop to process significantly larger amounts of data for a new initiative. Recently, comScore faced the challenge of dealing with data from a new initiative that required its systems to support a daily volume increase in excess of 800% compared to a year ago. After a survey of potential solutions, Duckworth and his team settled on Hadoop as part of a larger solution. This fireside chat will delve into how they selected Hadoop, the trials and tribulations they experienced during the learning process, and their plans for the future. Any developer or analyst considering Hadoop for their own commercial application will find this session illuminating.

T Mixing Real-Time Needs and Batch Processing: How StumbleUpon Built an Advertising Platform Using HBase and Hadoop
Jean-Daniel Cryans, Database Engineer/HBase Committer, StumbleUpon (Sutton North)
StumbleUpon serves millions of recommendations to users every day, and includes a small portion of sponsored stumbles in these recommendations. Providing accurate metrics to the sponsors participating in the system combines the needs of a batch-processing system with the requirements of a real-time feedback loop to present comprehensive, up-to-the-minute data. HBase is the mass data storage foundation of this advertising platform, with Hadoop and Cascading used to support numeric analysis and other batch jobs in a flexible and extensible fashion.

T MapReduce and Parallel Database Systems: Complementary or Competitive Technology?
Daniel Abadi, Assistant Professor, Yale University (Sutton Center)
The MapReduce vs. parallel database system debate has finally been extinguished (for the most part), with the vast majority of people recognizing that each type of system has its own strengths, weaknesses, and ideal application areas. However, a new debate is emerging. Some believe that MapReduce and parallel database systems are entirely complementary technologies and will coexist in the enterprise over the long term. Others, while acknowledging that each has its own strengths and weaknesses, feel these differences are superficial and that the systems are on a collision course, with one eventually becoming dominant in the enterprise. In this session, the speaker will debate both sides of this argument against himself.
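The RHIPE session above defines jitter as the deviation of real inter-arrival times from the theoretical interval. Stripped of R and Hadoop, the core statistic is simple; a Python sketch (the timestamps are invented):

```python
def jitter(arrival_times, interval):
    """Per-packet jitter: deviation of each observed inter-arrival gap
    from the theoretical interval (e.g. 20 ms for many VoIP codecs)."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    return [g - interval for g in gaps]

# Packets expected every 20 ms; one arrives 2 ms late, the next 1 ms early.
times = [0.0, 20.0, 42.0, 61.0]
print(jitter(times, 20.0))  # [0.0, 2.0, -1.0]
```

The talk's contribution is doing this across 190GB of packet traces, where RHIPE lets the per-stream computation above run as distributed R jobs on Hadoop.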


5:10pm – 5:40pm

Managing Derivatives Data with Hadoop
Joshua Bennett, Technology Architect, Chicago Mercantile Exchange Group (Grand Ballroom)
In 2002, CME Group, the world's leading and most diverse derivatives marketplace, experienced exponential growth in volume, which has continued over subsequent years. In this presentation we will explore how technologies like Hadoop are leveraged to help cope with hundreds of millions of daily customer transactions.

T Putting Analytics in Big Data Analysis
Richard Daley, CEO, Pentaho (Beekman Parlor)
In this interactive session, we will present Pentaho for Hadoop, the latest offering from Pentaho, which integrates Pentaho Data Integration (also known as Kettle) with Hadoop and Hive to bring ETL, data warehousing and BI applications to the task of analyzing Big Data. This session will explore how Pentaho for Hadoop provides key data integration and transformation functionality for Hadoop data, how it can manage and control transformations and Hadoop jobs from the Pentaho management console, and how Hadoop data can be integrated with data from other sources to drive compelling reporting and analytics for today's massive volumes of data. The session will include a demonstration of the Pentaho for Hadoop solution.

"Productionizing" Hadoop: Lessons Learned
Eric Sammer, Solution Architect, Cloudera (Sutton Center)
Many Hadoop deployments start small, solving a single business problem, and then begin to grow as the organization finds more valuable use cases. Moving a Hadoop deployment from the proof-of-concept phase into a full production system presents challenges for IT operations teams looking to manage the growing deployment and maintain internal SLAs with their customers. In this session, Eric will review some of the key considerations the Cloudera Solutions Architect team has learned while working with customers to "productionize" Hadoop deployments.

T Techniques to Use Hadoop with Scientific Data
Jerome Rolia, HP Labs (Sutton North)
Platforms such as Hadoop are not designed specifically for science users, making it difficult to express certain analysis functions in a way that results in efficient execution. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space (e.g., clustering particles that represent a snapshot of a simulation of the universe, or extracting hurricanes from a satellite picture). These applications pose an especially interesting challenge in that they exhibit significant computational skew, where different partitions take vastly different amounts of time to run even if their input datasets are the same size. This talk gives examples of such algorithms and manual techniques for overcoming computational skew, and describes joint work with the University of Washington on the SkewReduce platform, which automatically partitions data to avoid computational skew.
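The scientific-data session above turns on computational skew: partitions of equal size with very unequal runtimes. A small sketch shows why splitting an expensive partition shortens the overall job, which is the effect SkewReduce automates (the costs and worker counts here are invented):

```python
def makespan(partition_costs, workers):
    """Greedy longest-processing-time assignment of partitions to workers;
    the job finishes when the busiest worker does."""
    loads = [0.0] * workers
    for cost in sorted(partition_costs, reverse=True):
        loads[loads.index(min(loads))] += cost
    return max(loads)

# Four equal-size partitions, but one hides 10x the computation:
skewed = [10.0, 1.0, 1.0, 1.0]
# Splitting the heavy partition into four pieces evens out the work:
split = [2.5, 2.5, 2.5, 2.5, 1.0, 1.0, 1.0]

print(makespan(skewed, 4))  # 10.0
print(makespan(split, 4))   # 3.5
```

With the heavy partition intact, three workers sit idle while one grinds through it; after the split, the same total work finishes almost three times faster.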

SPONSORS

Thank you to our sponsors.
Platinum

Gold

Silver
