
INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & MANAGEMENT INFORMATION SYSTEM (IJITMIS)
ISSN 0976-6405 (Print), ISSN 0976-6413 (Online)
Volume 5, Issue 2, May - August (2014), pp. 51-58
IAEME: http://www.iaeme.com/IJITMIS.asp
Journal Impact Factor (2014): 6.2217 (Calculated by GISI)
www.jifactor.com

LIMITATIONS OF DATAWAREHOUSE PLATFORMS AND ASSESSMENT OF HADOOP AS AN ALTERNATIVE

KULDEEP DESHPANDE 1, Dr. BHIMAPPA DESAI 2

1 (Ellicium Solutions, Pune, Maharashtra)
2 (Capgemini Consulting, Pune, Maharashtra)

ABSTRACT
The volume and complexity of data collected in datawarehouse systems are growing rapidly, posing challenges to traditional datawarehouse platforms. At the same time, the Hadoop ecosystem has opened new avenues for implementing datawarehouse systems on Hadoop and overcoming these challenges. In this paper we survey previous studies on the limitations of traditional datawarehouse platforms and discuss the opportunities Hadoop offers for datawarehouse implementation. This paper can give direction to future research on datawarehouse implementation on the Hadoop platform.
Keywords: Datawarehouse, Hadoop, Hive, Analytical, ETL
I. INTRODUCTION
The size of data sets being collected and analyzed in the industry for business intelligence is growing rapidly, making traditional warehousing solutions prohibitively expensive [5]. Data collected from web logs and social media has become an important component of analytical systems, but these data sources have also added complexity to datawarehouses. Post 2005, a richer set of analytical database management systems has been introduced [2]. However, the rate of growth of data volume and complexity is posing challenges to these analytical database systems as well. Companies like Yahoo and Facebook have been using Hadoop for processing large datasets [5], and in recent years there has been increased interest in evaluating Hadoop as a datawarehouse platform. In this paper we study the challenges to currently available datawarehouse platforms and the opportunities opened up by Hadoop, and explore areas that need research.

This work is organized as follows. In section 2 we conduct a meta-analysis of two surveys about issues with current datawarehouse platforms and summarize the top issues encountered by industry practitioners. In section 3 we give an overview of Hadoop and approaches for building a datawarehouse on it. In section 4 we discuss the various ways in which Hadoop can be used for building a datawarehouse, including its use for data archival, data staging and ETL processing. Hadoop is a new technology and has many limitations that need to be overcome to make it a complete datawarehouse platform; these shortcomings are discussed in section 5. Finally, in section 6 we discuss areas of research needed to make Hadoop a full-fledged datawarehouse platform.
II. LIMITATIONS OF DATAWAREHOUSE PLATFORMS
A datawarehouse platform is the most important component of a datawarehouse / analytical system. It is defined as a collection of hardware servers, an operating system, a database management system (DBMS) and data storage [1]. The different categories of datawarehouse platforms are as follows:
1. Traditional RDBMS databases
2. Column oriented databases
3. In memory databases
4. Software only appliances
5. Software and hardware appliances
6. Cloud based datawarehouse systems
7. Hadoop based datawarehouse platforms

In order to identify the limitations of traditional datawarehouse platforms and the challenges posed by growth in data volume and variety, we have referred to two previous studies.

In [1], a survey of 417 business and technical executives was conducted by The Datawarehousing Institute (TDWI). The objective of this survey was to help respondents understand the options available for datawarehouse platforms. From the survey results it is clear that poor query response, lack of support for advanced analytics and inadequate load speed are critical challenges faced by the industry with existing datawarehouse platforms.

In [2], a survey regarding the usage of analytical platforms was conducted. This study surveyed 223 respondents about their satisfaction with a traditional RDBMS as the platform for a datawarehouse and their plans for migrating from the RDBMS to an analytical platform. It found that 75% of respondents had migrated to analytical platforms or were in the process of migrating, and it asked respondents which issues led them to migrate away from the RDBMS.


The following table shows the weighted average response of these two studies.


TABLE 1: Meta-Analysis of Datawarehouse Platform Surveys

| Survey 1 (2009) Problem | % of respondents | Survey 2 (2010) Problem | % of respondents | Weighted average response |
|---|---|---|---|---|
| Poor query response | 45.0% | Query performance / response times | 60.7% | 50.5% |
| Can't support advanced analytics | 40.0% | Need for complex analysis | 61.3% | 47.4% |
| Inadequate data load speed | 39.0% | Load times | 31.5% | 36.4% |
| Cost of scaling up is too expensive | 33.0% | Hardware growth / cost | 22.0% | 29.2% |
| Poorly suited to real-time or on-demand workloads | 29.0% | Need for on-demand capacity | 49.4% | 36.1% |
| Can't support large concurrent user count | 20.0% | Growth in number of concurrent users | 38.1% | 26.3% |
| Inadequate high availability | 19.0% | Availability and fault tolerance | 19.0% | 19.0% |
| Other | 4.0% | Other | 6.0% | 4.7% |
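The weighted averages in Table 1 are consistent with weighting each survey's percentage by its respondent count (417 for Survey 1, 223 for Survey 2). A minimal sketch of the calculation:

```python
N1, N2 = 417, 223  # respondents in Survey 1 (2009) and Survey 2 (2010)

def weighted_avg(p1, p2):
    """Combine two survey percentages, weighted by respondent count."""
    return round((p1 * N1 + p2 * N2) / (N1 + N2), 1)

# "Poor query response" row: 45.0% (Survey 1), 60.7% (Survey 2)
print(weighted_avg(45.0, 60.7))  # 50.5, matching Table 1
```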

From the above analysis we conclude that the following are the main challenges with existing datawarehouse platforms.
2.1 Poor query performance / response
In both of the above-mentioned surveys, poor query response is cited as the most important challenge with existing datawarehouse platforms. SQL is the most common language used for analysis, and poor query response reflects slow execution of SQL. Over the last 10 years, approaches such as 64-bit computing, increased memory, MPP systems, and columnar databases have been applied to this problem, but poor query response remains the number one challenge for datawarehouses.
2.2 No support for advanced analytics
A lack of advanced analytics capabilities is cited as an important challenge for datawarehouse platforms, although there is debate over the exact definition of the term. From various studies it can be concluded that support for various forms of predictive algorithms, statistical analysis and geographic visualization can be grouped under advanced analytics. Traditional RDBMS-based datawarehouse platforms generally do not support these advanced analytic functions through SQL; the majority of organizations perform this kind of advanced analytics outside the RDBMS using hand-coded platforms or tools [2].
2.3 Slow data load speed
Traditional datawarehouses are batch oriented and are loaded weekly, daily or multiple times a day. However, the trend is to integrate the datawarehouse with transactional and operational applications such as fraud detection [1], which is making traditional daily-load-oriented datawarehouse applications obsolete. The majority of large corporations have datawarehouse load processes that run throughout the night and consume 10-12 hours. With increasing data volumes and complexity, data load times will keep growing, and this challenge needs a solution.

2.4 High hardware cost
Increased data volume, data complexity and user counts create the need to add hardware (disk space or processors) to the datawarehouse infrastructure to support growth. Additional hardware also requires additional cooling, space, power and management [2]. As per survey [1], due to the recession, 57% of respondents said their organizations had reduced the budget for datawarehousing. Thus, reducing the cost of hardware and support per additional unit of data volume, users and analysis complexity is an important requirement that datawarehouse platforms must satisfy.
2.5 No support for on-demand workload
With the increased dependency on data-driven decisions, the need for ad hoc, one-time, on-demand analysis is growing. Such analyses often require processing very large volumes of data, combining datawarehouse data with external data, and using archived data. All of this requires the datawarehouse to scale up and make additional memory and capacity available to specific analyses. With traditional RDBMS-based datawarehouse platforms, scaling up on demand requires adding costly hardware and memory, with long lead times to obtain the hardware. The inability to scale up on demand with minimal cost and ramp-up time is a major challenge for existing datawarehouse platforms.
III. HADOOP AS AN ALTERNATIVE DATAWAREHOUSE PLATFORM
In this section we discuss fundamentals of Hadoop and early usage of Hadoop as a
datawarehouse platform.
What is Hadoop?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to
run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on
low-cost hardware. HDFS provides high throughput access to application data and is suitable
for applications that have large data sets [14]. A typical file in HDFS is gigabytes to terabytes in size; thus, HDFS is tuned to support large files. HDFS applications need a write-once, read-many access model for files: a file once created, written, and closed need not be changed.
A computation requested by an application is much more efficient if it is executed
near the data it operates on. HDFS provides interfaces for applications to move themselves
closer to where the data is located [14]. MapReduce is a programming model on top of HDFS
for processing and generating large data sets [13].
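The MapReduce model described above can be illustrated with a single-process word count sketch. The names `map_fn` and `reduce_fn` are illustrative, not Hadoop API: the map function emits (key, value) pairs, the framework groups values by key (the "shuffle"), and the reduce function aggregates each group.

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: aggregate all values emitted for one key.
    return (word, sum(counts))

def map_reduce(lines, map_fn, reduce_fn):
    groups = defaultdict(list)
    for line in lines:                   # map phase
        for key, value in map_fn(line):
            groups[key].append(value)    # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce

print(map_reduce(["big data", "big warehouse"], map_fn, reduce_fn))
# {'big': 2, 'data': 1, 'warehouse': 1}
```

In Hadoop the map and reduce phases run in parallel across the cluster, with the shuffle moving data between nodes; the sketch only shows the data flow.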
MapReduce for Datawarehouse
MapReduce was developed as an abstraction of the map and reduce primitives present in many functional languages [10]. In [8], a datawarehouse framework developed at Turn Inc. is discussed. This framework uses MapReduce for processing the data and benefits from the massively parallel execution of programs and the scalability of MapReduce. The ability of MapReduce to process non-relational / unstructured data also adds value to the datawarehouse: problems such as text tokenization, indexing and search, data mining and machine learning, and analyses involving a high number of hops can be handled easily in MapReduce. The paper also proposes a specialized data model for a Hadoop-based datawarehouse: a virtual schema is created in which fact and dimension tables are pre-joined. This schema abstracts table joins from users and simplifies the query language; since the JOIN operator is complex to implement in MapReduce, this data model simplifies the MapReduce programs. The paper demonstrated that the Cheetah framework can process data at 1 GB/sec.
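The pre-joined virtual schema idea can be sketched as follows; the table and column names are hypothetical, used only to show how denormalizing dimension attributes into fact rows removes the join step at query time:

```python
# Dimension table: campaign_id -> descriptive attributes (illustrative names).
campaigns = {
    1: {"campaign_name": "spring_sale", "channel": "email"},
}

# Fact table rows reference the dimension by key.
impressions = [
    {"campaign_id": 1, "clicks": 10},
    {"campaign_id": 1, "clicks": 5},
]

def prejoin(facts, dim, key):
    """Denormalize dimension attributes into each fact row at load time."""
    return [{**fact, **dim[fact[key]]} for fact in facts]

wide_rows = prejoin(impressions, campaigns, "campaign_id")
# Each row now carries campaign_name and channel, so a query such as
# "total clicks per channel" is a plain scan with no join step.
```

The trade-off is storage: the wide rows repeat dimension values, which Hadoop's cheap storage makes acceptable.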
Hive
One of the first uses of Hadoop for datawarehousing was reported by the Facebook team. Hive is an open source datawarehousing solution built on top of Hadoop. It was built in 2007 by the Facebook team and open sourced in 2008 [5]. MapReduce is programming-intensive and not well suited for end users who want to query and analyze data; the purpose of Hive was to bring the concepts of tables, columns and SQL to Hadoop while retaining the flexibility of Hadoop [5]. In 2010, Facebook's Hive implementation had tens of thousands of tables, contained 700 TB of data and supported 200 users [5]. The Hive query language (HiveQL) is a subset of SQL, which makes it easy for SQL-oriented users to analyze data in Hadoop. Advanced analysis in the form of MapReduce programs can be plugged into HiveQL; this enables complex analysis such as text mining and pattern matching to be done in MapReduce, with the results explored through Hive using SQL. Hive compiles SQL into MapReduce batch jobs, so it was designed for complex, batch-oriented jobs and not for low latency queries [13]. Hive works well for complex datawarehouse analysis scenarios but does not perform well for dashboards and real-time analysis.
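As a rough illustration of how Hive compiles a SQL aggregate into map and reduce stages, consider a query like `SELECT region, SUM(sales) FROM orders GROUP BY region`. The column names are made up, and real Hive generates and schedules actual MapReduce jobs; the sketch only shows the stage structure:

```python
from collections import defaultdict

orders = [("east", 100), ("west", 40), ("east", 60)]

def map_stage(rows):
    # Map: emit (group-by key, aggregated column value).
    for region, sales in rows:
        yield region, sales

def reduce_stage(pairs):
    # Reduce: sum the values for each group-by key.
    totals = defaultdict(int)
    for region, sales in pairs:
        totals[region] += sales
    return dict(totals)

print(reduce_stage(map_stage(orders)))  # {'east': 160, 'west': 40}
```

Because each query becomes a batch job with startup and scheduling overhead, even a small aggregate carries latency, which is why Hive suits batch workloads rather than dashboards.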
Other examples of Hadoop based DW implementation
In the previous sections we discussed the use of Hadoop for data storage in datawarehouse applications where data volumes are very large or data is unstructured. A few other studies have reported the use of Hadoop for other specific use cases.

Sensors are widely applied in various fields, and the time series data they generate creates a large demand for data storage and analysis. Most currently available time series datawarehouse solutions use RDBMS systems. The use of HBase, a NoSQL database on Hadoop, for a time series datawarehouse is reported in [9]. This experimental study conducted stress tests on 400 million time series records and concluded that HBase has good read performance for time series data.
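One common row-key design for time series data in HBase (an assumption for illustration, not taken from [9]) composes the key from the sensor id and the timestamp. Because HBase stores rows sorted by key, all readings from one sensor become a contiguous range, which favors the scan-heavy reads the study measured:

```python
def row_key(sensor_id, epoch_seconds):
    """Composite key: fixed-width sensor id, then fixed-width timestamp."""
    return f"{sensor_id:>08}{epoch_seconds:012d}"

keys = sorted(row_key(s, t) for s, t in
              [("s1", 1000), ("s2", 999), ("s1", 500)])
# All "s1" keys sort together, ordered by time, ahead of "s2":
# a range scan over one sensor's history touches consecutive rows.
```

Fixed-width padding matters: without it, lexicographic ordering would interleave timestamps of different lengths.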
The use of Hadoop for a high performance datawarehouse and OLAP is demonstrated in [6]. In this experiment, a Hadoop cluster consisting of 18 nodes with 36 cores was constructed. The experiment used MapReduce for cube construction and provided an XMLA API so that standard BI tools could access the cube data.

The studies described so far concern the use of Hadoop for very large scale datawarehouses. However, midsized organizations may also benefit from Hadoop by reducing the data storage costs of their datawarehouse. In [13], an evaluation of Hadoop for small and medium sized datawarehouse systems was performed, comparing MySQL, Hadoop + MapReduce and Hive for data sizes ranging from 200 MB to 10 GB. The study found that up to 1 GB of data, MySQL performs better than Hadoop and Hive; between 1 and 2 GB, Hadoop + MapReduce outperforms Hive and MySQL; and beyond 2 GB, Hive outperforms Hadoop + MapReduce and MySQL. The work done in this study needs to be extended further: it did not run extensive analytic queries to confirm the conclusion, and low latency queries were not tested. Despite these issues, the study opens new opportunities for small and medium industries to leverage Hadoop for a low cost datawarehouse.


IV. DIFFERENT WAYS IN WHICH HADOOP CAN BE USED FOR DATAWAREHOUSE
In the previous section we discussed how Hadoop can be used as a data storage platform for a datawarehouse. There are other ways to use Hadoop in a datawarehouse implementation.

A lot of data processing happens in the data staging area to prepare data to be loaded into the datawarehouse [7], and the size of the staging area is often many times that of the datawarehouse itself. The literature reports the use of Hadoop for staging to take advantage of its low cost, linear scalability, facility with file-based data, and ability to manage unstructured data [7]. The datawarehouse servers are then used only for loading cleaned data and for end user access to data.
An enormous amount of data is gathered in the datawarehouse system over time. Not all of this data may be required for reporting and analytics, but legal requirements stipulate that organizations retain data beyond its useful life. In many cases the source data must also be retained as-is alongside the cleaned datawarehouse data. Retaining older datawarehouse data and raw source data for long periods can be expensive [7] and can put a burden on the datawarehouse server. As with datawarehouse staging, data archival can be migrated to Hadoop.
Migrating ETL processing to Hadoop can reduce both cost and processing times. Cheap storage, massive scalability, and the ability to handle complex logic and manage unstructured data are the reasons Hadoop can be an ideal ETL platform. It is recommended to identify the top 20% of ETL workloads for migration to Hadoop to achieve maximum benefits [4]. The following ETL jobs qualify for migration to Hadoop [4]:

- Relatively high elapsed processing times
- Very complex scripts, such as change data capture, joins and cursors
- File processing and semi-structured data
- ETLs causing high impact on resource utilization
- Unstable or error-prone code

Thus, apart from serving as a datawarehouse storage platform, Hadoop can also be used for data archival, data staging and ETL processing.
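The "top 20% of ETL workloads" selection suggested in [4] can be sketched as a simple ranking by elapsed processing time; the job names and times below are invented for illustration:

```python
# Elapsed processing time per ETL job, in minutes (hypothetical values).
jobs = {"cdc_orders": 540, "load_dim_date": 3, "parse_weblogs": 720,
        "load_fact_sales": 95, "dedupe_customers": 410}

def migration_candidates(jobs, fraction=0.2):
    """Return the top `fraction` of jobs by elapsed time, longest first."""
    ranked = sorted(jobs, key=jobs.get, reverse=True)
    top_n = max(1, round(len(jobs) * fraction))
    return ranked[:top_n]

print(migration_candidates(jobs))  # ['parse_weblogs']
```

In practice one would weigh the other criteria in the list above (complexity, resource impact, stability) alongside elapsed time, not elapsed time alone.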
V. LIMITATIONS OF HADOOP TO BE A DATAWAREHOUSE PLATFORM
Despite the advantages of Hadoop as a datawarehouse platform stated in the previous section, it has certain limitations. Most of these limitations are features that Hadoop lacks compared to a mature RDBMS. In this section we discuss the limitations of Hadoop as a datawarehouse platform.
1. Low latency data access and queries: MapReduce is a batch-oriented programming paradigm and hence is not well suited for real-time, fast queries. However, newer query engines like Impala and Apache Drill are providing faster query processing for data stored in Hadoop [16]. The future use of Hadoop as a datawarehouse platform will depend to a great extent on the maturity of these query engines and how fast they acquire RDBMS functionality.

2. Inserts and updates: Hadoop does not support ACID-compliant insert and update queries; even Hive does not currently support insert or update queries [5]. This makes it difficult to use Hadoop for dimension tables in a datawarehouse, which require updates for slowly changing dimensions.
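Because HDFS files are write-once, a common workaround (a sketch of general practice, not a method from the cited papers) is to handle slowly changing dimensions in Type 2 style: rewrite the table, closing the current row and appending a new version instead of updating in place:

```python
dim_customer = [
    {"customer_id": 7, "city": "Pune", "is_current": True},
]

def scd2_update(rows, customer_id, **changes):
    """Rewrite the table: close the current row, append the new version."""
    out = []
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            out.append({**row, "is_current": False})           # close old version
            out.append({**row, **changes, "is_current": True})  # append new one
        else:
            out.append(row)
    return out

dim_customer = scd2_update(dim_customer, 7, city="Mumbai")
# Two rows now exist for customer 7; only the new one is current.
```

On Hadoop this "rewrite" is a batch job that emits a new copy of the dimension file, which is workable for small dimensions but costly for large, frequently changing ones.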
3. Granular security: Row level or field level security, as in an RDBMS, is absent in Hadoop; only basic checks such as file permissions are present [7].
4. SQL based analytics: In any mature datawarehouse solution, end users write complex SQL for data analysis. Hadoop-based databases such as Hive and Impala have limited support for ANSI standard SQL [7]; for example, Hive does not support correlated subqueries, which are commonly used in traditional warehouse queries. However, Hadoop-based databases such as IBM BigSQL and Greenplum HAWQ aim to support ANSI SQL. The ability of Hadoop-based query engines to match ANSI SQL capabilities will speed up the adoption of Hadoop as a datawarehouse platform.
VI. DIRECTIONS FOR FUTURE RESEARCH
As discussed in the previous section, certain areas need further research to make Hadoop a viable alternative platform for datawarehouse implementations. In this direction, the following research is necessary:
1. Adoption of Hadoop for datawarehouse implementations will depend to a great extent on the maturity of SQL engines on Hadoop and their compliance with ANSI SQL. Further research is required to identify which ANSI SQL features are lacking in Hive (or Impala) and prevent these SQL engines from becoming mature datawarehouse platforms.
2. The suitability of Hadoop for managing large datasets is well established, and [13] has discussed its suitability for mid and small sized datasets. This aspect needs further exploration, and detailed guidelines need to be developed for analyzing the suitability of Hadoop for small datasets. This will make Hadoop a more feasible alternative to traditional DW platforms for smaller organizations.
3. The application of traditional dimensional or ER modeling techniques to datawarehouses on Hadoop needs to be studied. If these approaches are found unsuitable, an alternative modeling methodology needs to be developed for datawarehouses on Hadoop.
4. Various studies have proposed using Hadoop to process ETL transformations such as lookups and joins. However, detailed benchmarks on when these transformations benefit from a Hadoop-based implementation need to be developed.
5. A comprehensive set of guidelines or a framework needs to be developed for evaluating whether a datawarehouse will benefit from Hadoop.
VII. CONCLUSION
In this paper we analyzed the shortcomings of traditional datawarehouse platforms. We analyzed two surveys and conducted a meta-analysis to report the problems with current datawarehouse platforms. A survey of various experiments on Hadoop-based datawarehouses was also presented.


The main objective of this paper is to analyze the possibility of using Hadoop as a datawarehouse platform and the areas of research that will make Hadoop a strong alternative to traditional datawarehouse platforms. As reported in section 6, further research is required to make SQL engines on Hadoop more mature. There is also a need for a comprehensive framework to judge which datawarehouses will benefit from implementation on the Hadoop platform.
VIII. REFERENCES
[1] Philip Russom, Next Generation Datawarehouse Platforms, The Datawarehousing Institute, 2009.
[2] Merv Adrian and Colin White, Analytic Platforms: Beyond the Traditional Data Warehouse, BeyeNETWORK Custom Research Report, 2010.
[3] Philip Russom, Analytic Databases for Big Data, The Datawarehousing Institute, 2012.
[4] Offload Your Datawarehouse with Hadoop, Syncsort publication, 2014.
[5] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu and Raghotham Murthy, Hive - A Petabyte Scale Data Warehouse Using Hadoop, IEEE, 2010.
[6] Jinguo You, Jianqing Xi, Chuan Zhang, Gengqi Guo, HDW: A High Performance Large Scale Data Warehouse, Computer and Computational Sciences, IMSCCS '08, 2008.
[7] Philip Russom, Evolving Data Warehouse Architectures, The Datawarehousing Institute, 2014.
[8] Songting Chen, Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce, 36th International Conference on Very Large Data Bases, 2010.
[9] Wen-Yuan Ku, Tien-Yin Chou, Lan-Kun Chung, The Cloud-Based Sensor Data Warehouse, International Symposium on Grids & Clouds, 2011.
[10] T. K. Das and Arati Mohapatro, A Study on Big Data Integration with Data Warehouse, International Journal of Computer Trends and Technology (IJCTT), Volume 9, Number 4, March 2014.
[11] Charles Loboz, Slawek Smyl, Suman Nath, DataGarage: Warehousing Massive Performance Data on Commodity Servers, 36th International Conference on Very Large Data Bases, 2010.
[12] Sanjeev Khatiwada, Architectural Issues in Real-time Business Intelligence, 2012.
[13] Marissa Rae Hollingsworth, Hadoop and Hive as Scalable Alternatives to RDBMS: A Case Study, Boise State University, 2012.
[14] Dhruba Borthakur, The Hadoop Distributed File System: Architecture and Design, Apache Foundation, 2007.
[15] Donald Feinberg, DBMS Infrastructure for the Modern Data Warehouse, Business Intelligence Summit, 2010.
[16] http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
[17] Kuldeep Deshpande and Dr. Bhimappa Desai, A Critical Study of Requirement Gathering and Testing Techniques for Datawarehousing, International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 5, Issue 1, 2014, pp. 60-71, ISSN Print: 0976-6405, ISSN Online: 0976-6413.
