
Publisher's special publication

Best practices

Hadoop! Coming soon to an enterprise data warehouse near you.


Deutsche Telekom's perspective on how the open-source Hadoop ecosystem delivers powerful innovation in storage, databases and business intelligence at a fraction of the cost of legacy systems.

If your business is like those of many of our customers, you are overwhelmed by the explosive growth in the volume and type of data available to your business. New data types such as machine-generated data (for example from sensors), clickstream data, sentiment data or, if you are a telecom service provider, call detail records (CDRs) overwhelm the capabilities of the traditional enterprise data warehouse. Yet many business intelligence departments are intrigued by the prospect of turning information into profit with the help of new Big Data technologies. It may sound like squaring the circle, but a fairly new and rapidly evolving open-source ecosystem called Hadoop holds the promise of storing and processing an unprecedented volume and variety of data.

Even if you have so far only heard of Hadoop at conferences or read about it, it is fair to say that it will come to an enterprise data warehouse near you in the very near future. In fact, I predict that by 2015, fully 80 percent of all new data will enter the typical enterprise on a Hadoop cluster first, making it the de facto enterprise-wide landing zone for large amounts of data. That is ample reason to take a good, hard and, most importantly, realistic look at what this bundle of technologies, offered by several leading vendors, can do and what it cannot. This article will highlight the tangible bottom-line benefits that Hadoop is bringing to telecommunications companies like Deutsche Telekom, allowing them to store and process data at an unprecedented price/performance ratio.

But first things first. What exactly is Hadoop? Searching for it on Google brings up close to ten million results. In essence, it is based on a total of 13 Apache projects, of which two form the core. It got its start in 2005, derived from two research papers that came out of Google laying out the basics of MapReduce and the Google File System (GFS). Both are intended to enable cost-effective storage and processing of massive amounts of data through a distributed architecture, as the small sketch below illustrates.

By Juergen Urbanski, VP Big Data Architectures & Technologies at T-Systems, the enterprise IT division of Deutsche Telekom
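To make the MapReduce model described above a little more concrete, here is a minimal word-count sketch in Python for Hadoop Streaming, which lets ordinary scripts act as the map and reduce steps. The file names (mapper.py, reducer.py) and the sample invocation are illustrative assumptions, not a description of any particular Deutsche Telekom deployment.

# mapper.py -- illustrative only: emits one (word, 1) pair per word.
# Hadoop Streaming feeds input splits to this script on stdin and
# sorts the emitted keys before they reach the reducer.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- illustrative only: sums the counts for each word.
# The framework delivers all values for a given key contiguously,
# in sorted order, on stdin.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A job along these lines would typically be submitted with the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <raw data in HDFS> -output <result directory>. The cluster takes care of distributing the work across nodes and re-running failed tasks.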

Hadoop is like a data warehouse, but it can store more data, more kinds of data, and perform more flexible analyses than comparable technologies. It is also crucial to note that Hadoop is open source and runs on industry-standard hardware, which makes it at least ten times more economical than conventional data warehouse solutions.

Let's take a closer look at the disruptive power of Hadoop. It can handle any data type, including unstructured data, which is quickly becoming the fastest growing and most important type of data. The good old relational database is unable to handle unstructured data. Whereas a relational database needs a pre-defined, fixed schema that is enforced when data is written, Hadoop only requires a schema when data is read. That means a business can store structured and unstructured data first and ask any type of question later, literally in the moment, as the short sketch below illustrates. Hadoop brings unheard-of flexibility to collecting, storing and analyzing large datasets, be they from manufacturing processes, sensors, customer transactions, mobile devices or social media platforms.

Equally important is Hadoop's ability to perform parallel processing and scale out rather than just scale up. In other words, it is nearly impossible to hit the ceiling when storing and transforming data in the Hadoop framework, which makes it ideal for competing in the age of Big Data. The final disruptive quality pertains to the hardware side. Unlike conventional architectures that rely on enterprise-grade, mission-critical components, Hadoop was designed to run on off-the-shelf, commodity servers, which offer much cheaper storage. In a nutshell, Hadoop makes it affordable to store any kind of data for unlimited periods of time, naturally within the bounds of what is legally permitted.
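The following sketch shows what schema-on-read means in practice: raw records are landed unchanged, and structure is imposed only at analysis time. The call-detail-record field layout used here is a made-up assumption for illustration, not an actual Deutsche Telekom format.

# schema_on_read.py -- illustrative only: apply structure to raw records
# at the moment they are read. A schema-on-write system would instead
# have forced the records into fixed columns before loading them.
import sys
from collections import defaultdict

def parse_cdr(line):
    # Hypothetical layout: caller;callee;start_timestamp;duration_seconds
    caller, callee, start_ts, duration = line.rstrip("\n").split(";")
    return {"caller": caller, "callee": callee,
            "start": start_ts, "duration": int(duration)}

# Ask a question that was not anticipated when the data was collected:
# total call minutes per caller.
minutes = defaultdict(float)
for raw in sys.stdin:
    record = parse_cdr(raw)  # the schema is applied here, on read
    minutes[record["caller"]] += record["duration"] / 60.0

for caller, total in sorted(minutes.items()):
    print(f"{caller}\t{total:.1f}")

Because the raw lines stay untouched in storage, tomorrow's question can parse the same data into a completely different set of fields without any reloading.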



It's no wonder, then, that large enterprises have already managed to realize substantial cost reductions by offloading some enterprise data warehouse (EDW), ETL and archiving workloads to a Hadoop cluster. Even better, they don't have to rip and replace their EDW; Hadoop complements it. For all those reasons, T-Systems believes in the disruptive power of Hadoop and already works with vendors like Hortonworks and Cloudera that are committed to driving open-source innovation. Fortunately, existing EDW users can access the transformative power of Hadoop because vendors like Teradata embed the Hortonworks distribution of Hadoop in the Teradata appliance, so users who only have SQL skills can run their queries on a Hadoop cluster without having to understand (almost) anything about Hadoop.

There are many horizontal and industry-specific use cases. In the business intelligence domain, a Hadoop cluster can be used either at the beginning of the data lifecycle, as a landing zone to offload data, or at the tail end, to archive data for later use. It's worth pointing out that Hadoop will not replace the EDW as we know it anytime soon. But it does free up capacity so customers can have the warehouse do higher value-added work, namely analysis. And they can grow into their installed EDW capacity, reducing their incremental EDW spend.

While Hadoop has some highly strategic uses, the cost savings in the EDW context alone are already very compelling. While storing a terabyte of data in a traditional enterprise data warehouse may cost $200,000, the same terabyte can be handled on a Hadoop cluster for less than a tenth of that cost. Of that amount, one quarter goes toward hardware, one quarter toward software, and the remaining half toward services. Drilling down into the services, we see that this particular data warehouse had accumulated several hundred thousand lines of SQL code. All the ETL was written in SQL and needed to be converted into MapReduce jobs. The company had developed its set of SQL queries over the course of six years, but was able to convert them to run on a Hadoop cluster in a six-month project.
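To give a flavour of what converting SQL-based ETL into MapReduce jobs involves, the miniature example below translates a simple, invented GROUP BY aggregation into a map step (emit the grouping key) and a reduce step (apply the aggregate). The table, column and file names are hypothetical and chosen purely for illustration.

# Hypothetical SQL ETL step to be offloaded:
#   SELECT region, SUM(revenue) FROM sales GROUP BY region;

# etl_mapper.py -- illustrative only: emit the GROUP BY key together
# with the value to be summed.
import sys

for line in sys.stdin:
    # Assumed raw layout: region;product;revenue
    region, _product, revenue = line.rstrip("\n").split(";")
    print(f"{region}\t{revenue}")

# etl_reducer.py -- illustrative only: apply the SUM over each key group.
import sys

current_region, total = None, 0.0
for line in sys.stdin:
    region, revenue = line.rstrip("\n").split("\t")
    if region != current_region:
        if current_region is not None:
            print(f"{current_region}\t{total:.2f}")
        current_region, total = region, 0.0
    total += float(revenue)
if current_region is not None:
    print(f"{current_region}\t{total:.2f}")

Real conversions are of course far larger, but the pattern is the same: joins, filters and aggregations expressed in SQL are re-expressed as chains of map and reduce steps (or, increasingly, run largely unchanged through SQL-on-Hadoop tools such as Hive).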

The results speak for themselves. Spend on the legacy enterprise data warehouse vendor decreased from $65m to $35m over a five-year horizon, with most of the remaining $35m devoted to maintenance of the legacy EDW deployment. Moreover, EDW performance went up fourfold, because the EDW was now able to focus on high-value interactive tasks while the rest of the work was offloaded to the Hadoop cluster.

Exhibit: Enterprise Data Warehouse Offload Use Case (Source: T-Systems)

The challenge: Many EDWs are at capacity, running out of budget before running out of relevant data, with older data archived in the dark and unavailable for exploration. In the typical warehouse shown, roughly 44 percent of capacity goes to operational workloads, 11 percent to analytics and 42 percent to ETL processing.

The solution: Use Hadoop for data storage and processing (parse, cleanse, apply structure and transform) at roughly one-tenth of the cost, freeing the EDW for valuable queries and retaining all data for analysis. The warehouse can then devote about 50 percent of its capacity to operational workloads and 50 percent to analytics, with storage and processing handled by the Hadoop cluster.

We are just beginning to see the full potential and capture the initial bottom-line benefits of Hadoop. I believe Hadoop represents the single biggest technological disruption of this decade. It will enable enterprises like Deutsche Telekom to move from small, departmental data puddles to enterprise-wide oceans of data. But adopting Hadoop to complement the traditional EDW is just the beginning. At T-Systems, we see potential for Hadoop in every industry and every functional area. Simply put, Hadoop allows any business to gain value from all its data by asking bigger questions. The promise of Hadoop is to deliver better insights to the business, which can contribute to profitable growth in any industry. The limit is your imagination, not how much storage you can afford or how long your analytic jobs take to complete. This gives business intelligence a whole new meaning, limited only by the intelligence of your team in asking bigger questions.

Contact
T-Systems International GmbH, Hahnstraße 43d, D-60528 Frankfurt am Main
Tel: 069 20060-0
Internet: www.t-systems.com
