
WHITE PAPER

Reducing the Total Cost of Ownership of Big Data

Abstract
This white paper shares best practices and strategies for lowering the Total Cost of Ownership (TCO) of Big Data solutions. It discusses the cost challenges associated with Big Data and examines the technology options available to address them.

Impetus Technologies, Inc. | www.impetus.com | May 2012


Table of Contents
Introduction
Using commodity hardware for Big Data
Using Open Source and cloud computing
The cost components of a Big Data Warehouse
  Entry cost
  Cost associated with migrating the data
  Other costs
Lowering the TCO of Big Data
Reducing the cost of storage
What technologies, where?
Big Data scenarios in OLAP
Analytics with Hadoop
  Indirect analytics over Hadoop
  Direct analytics over Hadoop
  Analytics over Hadoop with MPP Data warehouse
Selecting the right technologies
Opting for faster Map Reduce/Hadoop
NoSQL solutions
New era RDBMS versions
Impetus solutions and recommendations
Conclusion

Introduction
The volume of Big Data is growing at a rapid rate. According to IDC/EMC estimates, the cost of the computers, networks, and storage facilities that drive the digital universe currently stands at a whopping USD 6 trillion. An additional USD 650 billion is spent on data that is of little use: information overload adds to the cost of processing and storage, and much of that data eventually goes to waste.


Over the next few years, these numbers are expected to grow significantly. According to one study, the size of the digital universe doubles every 18 months. Although we have a rich pool of data at our disposal, we are still extracting poor information from it. There is a lot more that can be done to unearth better, actionable insights and intelligence from this Big Data. Currently, several solutions are available for cost-effective Big Data analytics. Let us examine some of the pros and cons of these offerings, beginning with commodity hardware.

Using commodity hardware for Big Data


At Impetus, we believe that the biggest advantage of commodity hardware is that you can build it yourself, and that it leaves many avenues open for innovation. Commodity hardware is readily available and easy to procure, so you have the option of customizing and optimizing it over and above the existing offering. In our experience, the cost of building reliable storage on commodity hardware is around USD 1 per gigabyte. This covers storage only and does not include the cost of managing, monitoring, and hosting the Big Data.

Using cloud computing for Big Data also has its advantages and disadvantages. The advantage is that you can rent resources over the cloud to handle your data and analytics. Prominent service providers include Amazon Web Services and Microsoft, with its Windows Azure platform, and you can select an offering appropriate to your needs from their portfolios. On the minus side, cloud storage is not always an economical option, particularly at Big Data volumes.
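The storage trade-off above can be made concrete with simple arithmetic. The sketch below compares a monthly cloud bill against an amortized on-premise estimate; the on-premise USD 1/GB figure echoes the commodity-hardware experience mentioned earlier, while all the other rates are hypothetical placeholders, not actual provider pricing.

```python
# Illustrative cost comparison. All rates are invented for the sketch and
# should be replaced with real quotes before drawing conclusions.

def monthly_cloud_cost(gb, storage_rate=0.10, transfer_gb=0.0, transfer_rate=0.09):
    """Monthly storage cost plus one-time upload transfer, in USD (hypothetical rates)."""
    return gb * storage_rate + transfer_gb * transfer_rate

def monthly_onprem_cost(gb, cost_per_gb=1.0, amortization_months=36, ops_rate=0.02):
    """Hardware cost spread over its lifetime, plus a per-GB operations cost."""
    return gb * cost_per_gb / amortization_months + gb * ops_rate

data_gb = 10_000  # 10 TB
cloud = monthly_cloud_cost(data_gb, transfer_gb=data_gb)  # first month includes upload
onprem = monthly_onprem_cost(data_gb)
print(f"cloud: ${cloud:,.0f}/month (first month), on-prem: ${onprem:,.0f}/month")
```

The one-time upload charge is what makes moving data that is not already in the cloud expensive, as discussed above.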

Using Open Source and cloud computing


Let us now weigh the pluses and minuses of using Open Source and cloud computing. Using free and Open Source software to store, manage, and analyze Big Data is clearly attractive: Hadoop, for example, can be leveraged to tackle large volumes of data while saving significantly on costs.


The cost components of a Big Data Warehouse


Having discussed the advantages and disadvantages of using commodity hardware, Open Source and cloud computing for Big Data, we will now talk about the cost components of a typical Big Data Warehouse.

Entry cost
The first element is entry cost. This is the cost that you will incur while experimenting with your data and identifying whether a particular Big Data solution meets your requirements.

Cost associated with migrating the data


If the solution works, the next cost relates to migrating the data to the new system, typically through Extract, Transform and Load (ETL) processes, which may require an expensive specialized tool.

Other costs
The other important cost component is performing analytics over the data stored in the system. A further cost arises from manageability: the system must remain easy to operate as it scales and under failure conditions. Ongoing maintenance is another cost factor that cannot be ignored; your Big Data Warehouse will always require monitoring and tuning as the data grows or changes are made.

Together, all of these factors increase the costs associated with Big Data analytics. Based on its extensive experience and expertise in Big Data, Impetus has identified some best practices that can help you reduce the Total Cost of Ownership (TCO) of your Big Data solutions. We categorize them here on the basis of hardware and software.

Lowering the TCO of Big Data


As far as hardware is concerned, we suggest focusing on the cost of storage and computation. On the software side, we have suggestions and options that can enable you to process more at a lower investment. The challenge is to speed things up while reducing cost; expensive commercial solutions remain a major pain point.


Reducing the cost of storage


We advise compressing the data to cut storage costs, as it reduces the space required to store it. Some solutions available today claim to compress data to 1/40th of its original size. However, you should ensure that read throughput is not compromised when the data is decompressed.

When analytics come into the picture, it may not be necessary to bring in all the data that has accumulated over the years. You may opt to run analytics over a specific subset of the data. Therefore, instead of retaining every bit for processing, you can choose to load only the required data and then process it. You can also create appropriate systems to store data and information based on principles very similar to Information Lifecycle Management (ILM); ILM is neither a new term nor a new practice.

Now let us talk about small data, which is another important factor.
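The savings from compression are easy to measure. The sketch below uses gzip from the Python standard library as a stand-in; real Big Data stacks typically pair compression codecs (e.g. Snappy or zlib) with a columnar file format, and actual ratios depend heavily on the data.

```python
# Sketch: measuring the storage saved by compressing data before archiving.
import gzip

# Highly repetitive data, such as log files, compresses extremely well.
records = b"2012-05-01 GET /index.html 200\n" * 10_000
compressed = gzip.compress(records)

ratio = len(records) / len(compressed)
print(f"{len(records)} bytes -> {len(compressed)} bytes ({ratio:.0f}x smaller)")
```

Verifying that `gzip.decompress(compressed)` round-trips back to the original bytes is the cheap part; as noted above, the real thing to benchmark is read throughput under decompression.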

We do not recommend using Big Data solutions for the storage and retrieval of small amounts of data, as the relative latency of each fetch will be higher. Better insights and more conclusive inferences can often be drawn from a small, well-chosen set of data. Therefore, our advice is not to take your eyes off Small Data; it is very important too.

What technologies, where?


A key factor that will help you reduce TCO is understanding the technologies available to you and where they can be used. With the advent of Big Data, several commercial, specialized hardware offerings and appliances have come into existence. They offer rich features such as fault tolerance, easy capacity scaling, and specialized management tools. The commodity hardware available today can also be harnessed for Big Data use cases by leveraging the Open Source stack. Latency is a critical factor as well, and the systems with the lowest latency are likely to be expensive; a specific niche market tackles latency as a business problem.


If you are considering the cloud as a potential solution for your Big Data, ask yourself whether moving to the cloud is the only solution for your data storage requirements. Such a move can be expensive, especially when the data is not already in the cloud; in that scenario, you will have to upload all of the data required for processing, adding to your cost.

Having covered which technologies are available, let us now talk about where to use them. You can broadly classify the use cases under two categories: Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP). If you are generating or working with large sets of data in an OLTP scenario, cost-effective NoSQL solutions will come to your rescue. On the other hand, in a typical data warehouse situation, which requires analytical processing, Map Reduce or MPP-based systems will be a good option. Let us look at the possible Big Data scenarios, where we largely deal with analytical processing.

Big Data scenarios in OLAP


Online Analytical Processing with respect to Big Data can be divided into three categories. The first and most common one is Big Input, Small Output. This is used to draw conclusions and prepare graphs or charts, or in scenarios where the top n elements in a data set are to be identified.

The second possibility is a small input set of data producing a big output (Small Input, Big Output). This typically happens in predictive analysis, where n outcomes are possible. It also applies when correlation-coefficient matrices must be populated from a given set of inputs: the inputs may be small, but the results can turn out to be very large.

The last possibility is Big Input, Big Output, as in ETL, where the magnitude of the output data is similar to that of the input data.

In real-world scenarios, when we summarize or concentrate data with respect to parameters such as data volume, latency, and cost, data volumes decrease. The recommended systems for operating on smaller volumes of data include small-data solutions such as Massively Parallel Processing (MPP) systems, traditional RDBMSs, or the newer NoSQL databases. These systems offer the lowest latency in such scenarios. However, as you move from smaller amounts of data to Big Data, system latency increases and the cost per gigabyte decreases.

We know that Hadoop systems are cost-effective. However, for small-data solutions where latency is the key factor, it is better to opt for customized, tailored solutions that enable quicker retrieval of data. The downside is that deploying these solutions takes the storage cost per GB higher. MPP, on the other hand, provides significant benefits, and there are good reasons for a warehouse to use MPP solutions: these systems provide relational stores while accommodating larger data sizes. In a typical scenario, you may need to deploy all or a combination of these systems at the same time. Let's look at some of them.
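The Big Input, Small Output pattern can be sketched in a few lines: a large stream of events is reduced to just the top-n items, so the output stays tiny regardless of the input size. The data here is synthetic, for illustration only.

```python
# Sketch of the "Big Input, Small Output" OLAP pattern: aggregate a large
# event stream, then keep only the top-n results.
import heapq
from collections import Counter

def top_n(events, n=3):
    counts = Counter(events)                    # aggregate (the "reduce" step)
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

clicks = ["home"] * 500 + ["search"] * 300 + ["cart"] * 120 + ["help"] * 40
print(top_n(clicks))  # [('home', 500), ('search', 300), ('cart', 120)]
```

At real Big Data scale, the same aggregate-then-rank shape is what a MapReduce job or an MPP query would compute in parallel across nodes.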

Analytics with Hadoop


Indirect analytics over Hadoop
In this approach, Hadoop is used to clean and transform the data into a structured form, which is then loaded into an RDBMS. This gives the end user both the parallel processing of Hadoop and a SQL interface over the summarized data. Compared with other options, this solution is not very expensive.
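A minimal sketch of the indirect approach, with the Hadoop stage simulated by an in-process cleaning and summarization pass, and SQLite standing in for the RDBMS. The field names and records are hypothetical.

```python
# Indirect analytics sketch: clean/summarize raw events ("Hadoop" stage),
# then load the small summary into an RDBMS and query it with plain SQL.
import sqlite3
from collections import defaultdict

raw_events = [
    "2012-05-01|US|purchase", "2012-05-01|US|purchase",
    "2012-05-01|UK|purchase", "corrupt-line", "2012-05-02|US|purchase",
]

# Cleaning/summarization stage: drop malformed lines, count per country.
summary = defaultdict(int)
for line in raw_events:
    parts = line.split("|")
    if len(parts) == 3:            # drop records that fail basic validation
        summary[parts[1]] += 1

# RDBMS stage: load the summarized data and expose it through SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (country TEXT, total INTEGER)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", summary.items())
for row in conn.execute("SELECT country, total FROM purchases ORDER BY total DESC"):
    print(row)
```

The key property shown here is that only the (much smaller) summary crosses into the relational store, which is what keeps this option inexpensive.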

Direct analytics over Hadoop


Applying analytics directly over a Hadoop system, without moving the data to an RDBMS, can be an effective way to analyze data in the Hadoop file system. This approach allows you to run batch and asynchronous analytics over the same data in place. It is very cost-effective, as it involves no expense in managing separate data sources beyond your existing Hadoop system, and it gives you the flexibility to scale your summarized data to any level.
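A direct-analytics job typically takes the classic mapper/reducer shape. The sketch below simulates that shape locally; with Hadoop Streaming, equivalent scripts would read from stdin and be launched with something like `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input ... -output ...` (the file names are hypothetical).

```python
# MapReduce-style word count, simulated locally: the mapper emits
# (key, value) pairs, the shuffle sorts them by key, and the reducer
# aggregates each key's values.
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):                      # pairs must arrive sorted by key
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (key, sum(v for _, v in group))

lines = ["big data big cost", "big data"]
shuffled = sorted(mapper(lines))         # stands in for Hadoop's shuffle/sort
print(dict(reducer(shuffled)))  # {'big': 3, 'cost': 1, 'data': 2}
```

Because the computation runs where the data already lives, there is no second data store to license, load, or keep in sync.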

Analytics over Hadoop with MPP Data warehouse


Today, a number of options in the market allow integration of MPP Data Warehouses with Hadoop. This is worth considering if you still have a large amount of data even after applying summarization.


However, the disadvantage of this approach can be cost. Most MPP data warehouses are expensive to acquire, and some also require high-end servers for deployment, which can be an expensive proposition. Having discussed the Big Data strategies that can be adopted to reduce TCO, let us once again talk about which technologies to use.

Selecting the right technologies


In order to choose the correct technology stack, you need to look at three factors to understand whether your use cases can be implemented:

1. Cost: The first is the cost you incur per terabyte of storage; consider whether it impacts your business. The next consideration is the cost related to business continuity and vendor lock-in. Understand whether your system is likely to change due to strategic decisions, and whether you might have to move to a different vendor.

2. Latency: The next factor to consider is your latency requirements. Is your use case satisfied by the throughput of the system? If your data is not especially large and the response time of the system is critical for you, MPP or RDBMS systems will probably be a good choice.

3. Dollar per terabyte: When your use case is driven by the dollar-per-terabyte factor, we advise an MPP solution. It offers a middle way between Hadoop and NoSQL-based solutions and can store large amounts of data without compromising on speed. If your requirements are volatile and the data or related strategies are likely to change frequently, then a vendor lock-in model will not be a good option.
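The three factors above can be folded into a rough rule of thumb. The thresholds and return values in this sketch are invented for illustration, not a definitive recommendation.

```python
# Illustrative decision helper encoding the cost / latency / lock-in factors.
# Thresholds are hypothetical; treat this as a starting point for discussion.

def suggest_stack(data_tb, latency_critical, lock_in_acceptable):
    if data_tb < 1 and latency_critical:
        return "RDBMS"                     # small data, fast response needed
    if latency_critical and lock_in_acceptable:
        return "MPP warehouse"             # speed at scale, at a price
    return "Hadoop"                        # cost-driven, latency-tolerant

print(suggest_stack(0.5, True, False))   # RDBMS
print(suggest_stack(50, True, True))     # MPP warehouse
print(suggest_stack(50, False, False))   # Hadoop
```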

Opting for faster Map Reduce/Hadoop


If your use case is driven by cost or business continuity, we advise you to opt for Hadoop. Hadoop will allow you to store all of your data, but unfortunately has relatively high latency. A few vendors have come up with faster Hadoop implementations or other parallel processing frameworks. Their solutions usually extend the standard Hadoop APIs and offer enhanced system performance, as well as better support for production environments.


NoSQL solutions
In an OLTP scenario you require fast reads and writes. Vendors in this area offer different kinds of underlying implementations, each suited to a typical business use case. If you require random, real-time read/write access to Bigtable-like data, you can use HBase; if you require faster writes, you might want to look at Cassandra. These are suited to industries like banking and finance. If your transaction data storage requirement is mostly for query purposes and you need to define indexes, you can use MongoDB or CouchDB. There are other databases as well, such as graph databases like Neo4j, which make data-heavy social media analytics problems easier to solve.
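The access pattern behind document stores such as MongoDB or CouchDB can be illustrated with a toy in-memory sketch: schemaless documents plus a secondary index on a field, so queries avoid a full scan. Real stores persist to disk and index far more efficiently; nothing here is actual MongoDB/CouchDB API.

```python
# Toy document store: a primary _id -> document map plus a secondary index.
from collections import defaultdict

docs = {}                                  # primary store: _id -> document
by_country = defaultdict(list)             # secondary index on "country"

def insert(doc):
    docs[doc["_id"]] = doc
    by_country[doc.get("country")].append(doc["_id"])

def find_by_country(country):              # index lookup, no full scan
    return [docs[i] for i in by_country[country]]

insert({"_id": 1, "user": "a", "country": "US"})
insert({"_id": 2, "user": "b", "country": "UK"})
insert({"_id": 3, "user": "c", "country": "US"})
print([d["user"] for d in find_by_country("US")])  # ['a', 'c']
```

Defining the right secondary indexes is precisely the capability that makes the document stores mentioned above attractive for query-heavy transaction data.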

New era RDBMS versions


New-era RDBMS engines for OLTP address the latency issues. In OLTP scenarios, people have been using SQL successfully for many years, and most business users still consider it the best tool for querying structured data. However, there is now an emerging set of technologies, and new versions of existing RDBMS engines, that are adept at handling large volumes of structured data. Therefore, in a scenario where you want to handle larger volumes of data and the data is in a structured format, you can use solutions such as MySQL Cluster, GridSQL, newer versions of MS SQL Server, and so on.

Impetus solutions and recommendations


We will now talk about how you can deal with the various cost components associated with a typical Big Data warehouse. To reduce the cost of moving data, we suggest opting for MapReduce for ETL instead of costly ETL tools. Management and provisioning tools are available with commercial Big Data solutions for easy management of systems. The Impetus offering Ankush can automatically provision multiple Hadoop clusters and also offers centralized management for these clusters.

For ongoing maintenance, our success mantra is: automate, automate, automate! If a task has to be carried out more than once, just automate it. This goes for monitoring and tuning too. As for dealing with changing capacity, you can keep adding hardware or look for alternatives that help speed things up; utilizing Graphics Processing Units for general-purpose computing can also help. We also recommend RainStor and similar solutions that compress the data and reduce the hardware cost required for data storage. Finally, faster, tailored MapReduce solutions will help you complete more tasks in less time.
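The "automate, automate, automate" advice can be as simple as a scheduled health check. The sketch below flags hosts whose disk usage crosses a threshold; the threshold and the alert action are placeholders for whatever your monitoring setup actually does.

```python
# Minimal automated health check, suitable for running from cron.
import shutil

def disk_ok(path="/", max_used_fraction=0.85):
    """True if the filesystem holding `path` is under the usage threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total <= max_used_fraction

def check_and_alert(paths):
    alerts = [p for p in paths if not disk_ok(p)]
    for p in alerts:
        print(f"ALERT: {p} is over the disk-usage threshold")  # or page on-call
    return alerts

check_and_alert(["/"])  # typically prints nothing on a healthy host
```

The same pattern extends naturally to the tuning checks mentioned above: anything you would look at twice by hand is a candidate for a scripted check.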

Conclusion
In summary, best practices and robust strategies can help you lower the TCO of your Big Data solutions and overcome the challenges associated with Big Data. Impetus has successfully dealt with Big Data problems, using the Hadoop ecosystem to overcome many of these challenges.

About Impetus
Impetus Technologies offers Product Engineering and Technology R&D services for software product development. With ongoing investments in research and application of emerging technology areas, innovative business models, and an agile approach, we partner with our client base comprising large-scale ISVs and technology innovators to deliver cutting-edge software products. Our expertise spans the domains of Big Data, SaaS, Cloud Computing, Mobility Solutions, Test Engineering, Performance Engineering, and Social Media, among others.

Impetus Technologies, Inc. 5300 Stevens Creek Boulevard, Suite 450, San Jose, CA 95129, USA
Tel: 408.213.3310 | Email: inquiry@impetus.com
Regional Development Centers - INDIA: New Delhi, Bangalore, Indore, Hyderabad
Visit: www.impetus.com

Disclaimers
The information contained in this document is the proprietary and exclusive property of Impetus Technologies Inc. except as otherwise indicated. No part of this document, in whole or in part, may be reproduced, stored, transmitted, or used for design purposes without the prior written permission of Impetus Technologies Inc.