Best practices
By Juergen Urbanski, VP Big Data Architectures & Technologies at T-Systems, the enterprise IT division of Deutsche Telekom
Hadoop is like a data warehouse, but it can store more data and more kinds of data, and it supports more flexible analyses than comparable technologies. Crucially, Hadoop is open source and runs on industry-standard hardware, which makes it at least ten times more economical than conventional data warehouse solutions.

Let's take a closer look at the disruptive power of Hadoop. It can handle any data type, including unstructured data, which is quickly becoming the fastest-growing and most important type of data. The good old relational database cannot handle unstructured data. Whereas a relational database requires pre-defined, fixed schemas on write, Hadoop only requires a schema on read. That means a business can store structured and unstructured data first and ask any type of question later, literally in the moment. Hadoop brings unheard-of flexibility to collecting, storing and analyzing large datasets, whether they come from manufacturing processes, sensors, customer transactions, mobile devices or social media platforms.

Equally important is Hadoop's ability to process data in parallel and to scale out rather than just scale up. In other words, it is near impossible to hit the ceiling when storing and transforming data in the Hadoop framework, which makes it ideal for competing in the age of Big Data.

The final disruptive quality pertains to the hardware side. Unlike conventional architectures that relied on enterprise-grade, mission-critical components, Hadoop was designed to run on off-the-shelf, commodity servers, which offer much cheaper storage. In a nutshell, Hadoop makes it affordable to store any kind of data for unlimited periods of time, naturally within the bounds of what is legally permitted.
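The schema-on-read idea can be illustrated with a minimal sketch in plain Python (no Hadoop involved). The records, field names and projection below are invented for illustration: raw data is stored exactly as it arrives, and structure is imposed only at query time.

```python
import csv
import io
import json

# Raw "landing zone": records are stored as-is; no schema is enforced on write.
raw_records = [
    '{"user": "alice", "amount": 42.5}',                      # JSON event
    '{"user": "bob", "amount": 17.0, "channel": "mobile"}',   # JSON with extra field
    'carol,99.9',                                             # legacy CSV line
]

def read_with_schema(record: str) -> dict:
    """Apply structure only when the data is read (schema-on-read)."""
    if record.lstrip().startswith("{"):
        doc = json.loads(record)
    else:
        user, amount = next(csv.reader(io.StringIO(record)))
        doc = {"user": user, "amount": amount}
    # The "schema" is simply the projection we ask for at query time.
    return {"user": doc["user"], "amount": float(doc["amount"])}

total = sum(read_with_schema(r)["amount"] for r in raw_records)
print(round(total, 1))  # 159.4
```

A relational database would have rejected the mixed formats at load time; here the question ("total amount across all sources") is decided after the data has already been collected.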
Verlags-Sonderveröffentlichung (publisher's special feature)
It's no wonder, then, that large enterprises have already realized substantial cost reductions by offloading some enterprise data warehouse (EDW), ETL and archiving workloads to a Hadoop cluster. Even better, they don't have to rip and replace their EDW: Hadoop complements it. For all those reasons, T-Systems believes in the disruptive power of Hadoop and already works with vendors like Hortonworks and Cloudera that are committed to driving open source innovation. Fortunately, existing EDW users can tap the transformative power of Hadoop because vendors like Teradata embed the Hortonworks distribution of Hadoop in the Teradata appliance, so users with only SQL skills can run their queries on a Hadoop cluster without having to understand (almost) anything about Hadoop.

There are many horizontal and industry-specific use cases. In the business intelligence domain, a Hadoop cluster can be used either at the beginning of the data lifecycle, as a landing zone to offload data, or at the tail end, to archive data for later use. It's worth pointing out that Hadoop will not replace the EDW as we know it anytime soon. But it does free up capacity, so customers can have the warehouse do higher value-added work, namely analysis. And they can grow into their installed EDW capacity and thereby reduce their incremental EDW spend.

While Hadoop has some highly strategic uses, the cost savings in the EDW context alone are already very compelling. Storing a terabyte of data in a traditional enterprise data warehouse may cost $200,000; the same terabyte can be handled on a Hadoop cluster for less than a tenth of that. Of that amount, one quarter goes toward hardware, one quarter toward software, and the remaining half to services. Drilling down into the services, we see that this particular data warehouse had accumulated several hundred thousand lines of SQL code. All the ETL was in SQL and needed to be converted into MapReduce jobs.
The company had developed its set of SQL queries over the course of six years, yet a six-month project was enough to convert them to run on a Hadoop cluster.
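To make the SQL-to-MapReduce conversion concrete, here is an illustrative sketch (not the company's actual jobs) of how a simple aggregation like SELECT customer, SUM(amount) FROM orders GROUP BY customer maps onto the MapReduce model; the table and field names are invented. On a real cluster, Hadoop would run the same map and reduce logic over files in HDFS and handle the sort/shuffle between the two phases.

```python
from itertools import groupby
from operator import itemgetter

# Sample input standing in for rows of an "orders" table.
orders = [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)]

def map_phase(records):
    # Map: emit one (key, value) pair per input row.
    for customer, amount in records:
        yield customer, amount

def reduce_phase(pairs):
    # The framework sorts and shuffles by key between the phases;
    # each reducer then sums the values for one key.
    for customer, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield customer, sum(amount for _, amount in group)

result = dict(reduce_phase(map_phase(orders)))
print(result)  # {'alice': 17.5, 'bob': 5.0}
```

The point of the exercise is that the declarative SQL disappears: each query becomes explicit map and reduce functions, which is why converting several hundred thousand lines of SQL took a dedicated project.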
The results speak for themselves. Spend on the legacy enterprise data warehouse vendor decreased from $65m to $35m over a five-year horizon, with most of the remaining $35m devoted to maintaining the legacy EDW deployment. Moreover, EDW performance went up fourfold, because the EDW could now focus on high-value interactive tasks while the rest of the work was offloaded to the Hadoop cluster.
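Put side by side, the figures quoted above amount to a simple back-of-the-envelope calculation; the sketch below only restates the article's numbers, it adds no new data.

```python
# Figures quoted in the text.
edw_cost_per_tb = 200_000                    # USD per TB in a traditional EDW
hadoop_cost_per_tb = edw_cost_per_tb / 10    # "less than a tenth" -> upper bound

# Split of the Hadoop per-terabyte cost.
breakdown = {
    "hardware": 0.25 * hadoop_cost_per_tb,   # one quarter
    "software": 0.25 * hadoop_cost_per_tb,   # one quarter
    "services": 0.50 * hadoop_cost_per_tb,   # the remaining half
}

# Legacy EDW vendor spend over a five-year horizon, before vs. after offload.
five_year_savings = 65_000_000 - 35_000_000

print(breakdown)          # {'hardware': 5000.0, 'software': 5000.0, 'services': 10000.0}
print(five_year_savings)  # 30000000
```

Even under the conservative upper bound of $20,000 per terabyte, services (the SQL conversion work) are the single largest cost item, which matches the emphasis the case study places on the migration project.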
The Challenge
- Many EDWs are at capacity
- Running out of budget before running out of relevant data
- Older data archived in the dark, not available for exploration
- Data warehouse workload: Operational 44%, Analytics 11%, ETL Processing 42%

The Solution (at 1/10th the cost)
- Hadoop for data storage and processing: parse, cleanse, apply structure and transform
- Free the EDW for valuable queries
- Retain all data for analysis!
- Data warehouse workload: Operational 50%, Analytics 50%, with storage and processing on Hadoop
Source: T-Systems
Exhibit: Enterprise Data Warehouse Offload Use Case

We are just beginning to see the full potential and capture the initial bottom-line benefits of Hadoop. I believe Hadoop represents the single biggest technological disruption of this decade. It will enable enterprises like Deutsche Telekom to move from small, departmental data puddles to enterprise-wide oceans of data. But adopting Hadoop to complement the traditional EDW is just the beginning. At T-Systems, we see potential for Hadoop in every industry and every functional area. Simply put, Hadoop allows any business to gain value from all its data by asking bigger questions. The promise of Hadoop is to deliver better insights to the business, which can contribute to profitable growth in any industry. The limit is your imagination, not how much storage you can afford or how long your analytic jobs take to complete. This gives business intelligence a whole new meaning, limited only by the intelligence of your team in asking bigger questions.
Contact
T-Systems International GmbH, Hahnstraße 43d, D-60528 Frankfurt am Main, Tel.: 069 20060-0, Internet: www.t-systems.com