

Enabling Apache Hadoop to be the next-generation enterprise data platform

February 2012

Hortonworks Inc. 2012

Page 1

Hortonworks Vision

We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop
How do we achieve that vision?
Enable an ecosystem around an enterprise-viable open source data platform.


What is Apache Hadoop?

Solution for big data
Deals with complexities of high volume, velocity & variety of data

Set of open source projects
Transforms commodity hardware into a service that:
Stores petabytes of data reliably
Allows huge distributed computations

Key attributes:
Redundant and reliable (no data loss)
Extremely powerful
Batch processing centric
Easy to program distributed apps
Runs on commodity hardware
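
To make "batch processing centric" and "easy to program distributed apps" concrete, here is a minimal sketch of the MapReduce style of computation, simulated in a single Python process. It does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. On a real cluster the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence (in-process simulation only)."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    sample = ["big data on commodity hardware", "big batch jobs"]
    print(word_count(sample))
```

The developer writes only the per-record map logic and the per-key reduce logic; distribution, grouping, and retries are the framework's job, which is the sense in which distributed apps become easy to program.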

One of the best examples of open source driving innovation and creating a market


Market Trends We're Seeing


Trend: Hadoop as a Data Refinery

The old way
Operational systems keep current records, short history
Analytics systems keep only conformed / cleaned / digested data
Unstructured data locked away in operational silos
Archives offline
Inflexible, new questions require system redesigns

The new trend

Keep all copies of multi-structured data (raw & refined) in Hadoop
Perform immediate transformations and data refining in Hadoop
Move refined data downstream for data discovery and BI/analytics
Agile outcome justifies new infrastructure


Agile Data Refinery w/Hadoop

Connecting All of Your Big Data
[Diagram labels: CRUD / Serving systems; Web apps; Traditional Data Warehouses, BI & Analytics (EDW, Data Marts, BI / ); Store, Transform, Refine, Archive all data, Custom Analytics; Serving Logs; Social Media; Sensor Data; Text Systems; Unstructured Systems]

Trend: Data-driven Development

Limited runtime logic driven by huge lookup tables
Data computed offline on Hadoop
Machine learning, other expensive computation offline
Personalization, classification, fraud, value analysis

Application development requires data science

Huge amounts of actually observed data key to modern services
Hadoop used as the science platform
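
The pattern described on this slide (expensive computation offline, thin lookup logic at runtime) can be sketched as follows. This is an illustrative assumption about how such a split might look, not a description of any specific production system: the function names, the `(segment, item)` click-log format, and the `top_n` parameter are all hypothetical. In practice the offline step would be a batch job on Hadoop and the table would be pushed to a serving store.

```python
from collections import Counter, defaultdict


def build_lookup_table(click_log, top_n=2):
    """Offline step: aggregate observed clicks into a per-segment ranking.

    click_log is an iterable of (segment, item) pairs (hypothetical format).
    """
    counts = defaultdict(Counter)
    for segment, item in click_log:
        counts[segment][item] += 1
    # Keep only the most-clicked items per segment; this is the "lookup table".
    return {
        segment: [item for item, _ in counter.most_common(top_n)]
        for segment, counter in counts.items()
    }


def recommend(table, segment, default=()):
    """Runtime step: a single dictionary lookup, no model evaluation."""
    return table.get(segment, list(default))


if __name__ == "__main__":
    log = [("sports", "a"), ("sports", "b"), ("sports", "a"), ("news", "c")]
    table = build_lookup_table(log, top_n=1)
    print(recommend(table, "sports"))
    print(recommend(table, "unknown", default=["editor-pick"]))
```

The design choice is that all the data science happens before a request arrives; the serving path stays simple and fast because it only reads precomputed results.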



Personalized for each visitor
Result: twice the engagement
Recommended links, News Interests, Top Searches
+160% clicks vs. one size fits all
+79% clicks vs. randomly selected
+43% clicks vs. editor selected

Copyright Yahoo 2011

Every Market Has Big Data

Digital data is personal, everywhere, increasingly accessible, and will continue to grow exponentially

Source: McKinsey & Company report. Big data: The next frontier for innovation, competition, and productivity. May 2011.


Trend: Specialization of Data Systems

Hadoop adds new capabilities to the enterprise, especially in scale-out situations
Does not replace existing systems

Specialization of traditional data components

Use Transactional systems for transactions
Use Analytics systems for interactive analysis

Hadoop has LOTS of bandwidth for storage and CPU

Pull data out of Transactional systems for storage and staging
Pull ELT out of Analytics systems


Hadoop and Transactional Systems

Online Transaction Processing
Mission critical
Manages transactions & serves reports

Hadoop used to Process Reports

Free up 50+% processing power for transaction processing system
Significant cost savings due to commodity nature of Hadoop

[Diagram labels: Web Site (x3); Transaction Processing Systems ($$$); Transaction Logs]


Hadoop and Analytics Systems

Fast loading, raw data staging, ELT & long-term archival (The Agile Data Zone)
High-value strategic and operational intelligence (Leverages huge ecosystem of tooling)





Social Online Archival Other logs

Ex. Historical Black Friday data


Hadoop as data refinery implies

Hadoop must be an open platform
Open Data APIs
ETL Integration / Data Ingest
Hadoop should work well with industry standard tools

Offsite Backups / DR
HDFS Snapshots, Cloud Backup, other tools

Object / Event-level Storage APIs
Non-Relational Data

HCatalog (for all 3 above)

Efficient / Low Cost Storage

Compression, RAID / Reed-Solomon

No Storage Limits
No file limits, scale beyond 10,000 computers / cluster
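
As a rough illustration of why the "Efficient / Low Cost Storage" item pairs RAID with Reed-Solomon coding, the arithmetic below compares raw storage overhead for plain replication against an erasure-coded layout. The slide gives no coding parameters, so the (10 data, 4 parity) layout used in the example is purely an assumption; 3 replicas is the HDFS default replication factor. The function names are hypothetical.

```python
def replication_overhead(replicas: int) -> float:
    """Raw bytes stored per logical byte when every block is copied `replicas` times."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return float(replicas)


def reed_solomon_overhead(data_blocks: int, parity_blocks: int) -> float:
    """Raw bytes stored per logical byte for a stripe of data plus parity blocks."""
    if data_blocks < 1 or parity_blocks < 0:
        raise ValueError("need at least one data block and non-negative parity")
    return (data_blocks + parity_blocks) / data_blocks


def raw_storage_tb(logical_tb: float, overhead: float) -> float:
    """Raw capacity needed to hold `logical_tb` of user data at a given overhead."""
    return logical_tb * overhead


if __name__ == "__main__":
    logical = 100.0  # hypothetical dataset size in TB
    # Assumed layouts: 3-way replication vs. a (10, 4) Reed-Solomon stripe.
    print(raw_storage_tb(logical, replication_overhead(3)))
    print(raw_storage_tb(logical, reed_solomon_overhead(10, 4)))
```

Under these assumed parameters, replication stores three raw bytes per logical byte while the erasure-coded layout stores about 1.4, at the cost of more expensive reconstruction when a block is lost.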


Open Platform Enables Ecosystem


Enabling a Broad Ecosystem


Open Platform Enables Ecosystem

Search, Index
SQL, NewSQL, NoSQL, xDBC
Integration (msg bus, )
ETL (basic & advanced)
Hortonworks Data Platform
Operational APIs
BI, Reporting, Visualization
Analytics, EDW
Algorithms, Data Science
Tools, Languages



Example: Teradata & Hortonworks

Online Customer Behavior Example
Fast loading, raw data staging, ELT & long-term archival
Frequent, iterative analysis (e.g. user behavior/response to promotions, pattern detection)
High-concurrency strategic & operational intelligence




Teradata Aster



Online Archival
Ex. Historical Black Friday data

Other Logs


Example: Talend & Hortonworks

Industry's first open source big data integration software
Feature-rich Job Designer
Rich palette of pre-built templates
Supports HDFS, Pig, Hive, HBase, Sqoop
Apache-licensed, bundled with HDP

Key benefits
Graphical development
Robust and scalable execution
Broadest connectivity to support all systems: 450+ components
Real-time debugging


Example: Microsoft & Hortonworks

Hadoop on Windows Server / Azure
Target most used Hadoop components
Patches flow into Apache open source
Hadoop 1.0, 0.23, and Trunk

JavaScript Developers

JavaScript Framework
Interactive JavaScript console for fast iterative development
Fluent data query API that translates JavaScript queries to server-side Pig Latin and HiveQL
Robust data visualization & charting

Open source client and server-side frameworks
Open source Hive ODBC Driver

Business Analysts

Enhanced Hive ODBC Driver

Move data from Hive into Microsoft Excel, PowerPivot, Power View, etc.
Analyze Hadoop data and build corporate BI solutions

Patches to open source components

Windows Server Admins


Hortonworks Data Platform


Balancing Innovation & Stability

Apache: Be aggressive - ship early and often
Projects need to keep innovating and visibly improve
Aim for big improvements
Make early buggy releases

Hortonworks: Be predictable - ship when stable

We need to ship stable, working releases
Make packaged binary releases available
We need to do regular sustaining engineering releases
HDP quarterly release trains sweep in stable Apache projects
Enables HDP to stay reasonably current and predictable while minimizing the risk of thrashing that coordinating a large number of Apache projects can cause


Hortonworks Data Platform (HDP)

Fully Integrated, Extensively Tested, Enterprise Supported
Challenge: Integrate, manage, and support changes across a wide range of open source projects that power the Hadoop platform, each with its own release schedules, versions, & dependencies.
Time intensive, Complex, Expensive
Solution: Hortonworks Data Platform
Integrated certified platform distributions
Extensive Q/A process: many apps across small, medium, & large clusters
Industry-leading Support with clear service levels for updates and patches
Continuity via multi-year Support and Maintenance Policy
Technical guidance support for Universe and Multiverse components

Hadoop Core

Pig, ZooKeeper, Hive, HCatalog


= New Version

Support & Distribution Model

Hortonworks Data Platform
Fully supported, integrated, tested, maintained
100% Apache license, or compatible: BSD, MIT/X11, NCSA, W3C Software license, X.Net

HDP Universe: Open Source Ecosystem

Validated & interoperable with HDP
Technical guidance support; work with OSS projects
100% OSI-compliant licenses
Optionally installed

HDP Multiverse: Commercial Ecosystem

Validated & interoperable with HDP
Technical guidance support; work with TSANet
3rd-party vendor licenses and support options
Optionally installed

Model and terminology conceptually similar to Ubuntu's model

Hortonworks Data Platform (HDP)

Key Components of Standard Hadoop Open Source Stack
Hortonworks Data Platform HDP Universe

(Cluster Coordination) (Columnar NoSQL Store) (Data Flow)


Ambari &
Other Monitoring & Management



(Distributed Programming Framework)

Workflow scheduling

(Table & Schema Management)

Sqoop &
Other Ingest, ETL tools

(Hadoop Distributed File System)

Mahout &
Other libraries


Hadoop Now, Next, and Beyond

Apache community, including Hortonworks, investing to improve Hadoop:
Make Hadoop an Open, Extensible, and Enterprise Viable Platform
Enable More Applications to Run on Apache Hadoop

Hadoop.Now (Hadoop 1.0), HDP1
Most stable Hadoop ever
HBase, security, WebHDFS
HCatalog data APIs

Hadoop.Next (Hadoop 2.0), HDP2
HA, Next-gen MapReduce
Extension & Integration APIs
Extended HCatalog data APIs

Hadoop.Beyond
Future investments


Hortonworks Data Platform Timeline

Q1 Q2 Q3 Q4









Hortonworks Data Platform 1




Hortonworks Data Platform 2

36-month support policy, from GA date


Thank You!
Questions & Answers
