International Journal of Information Technology & Management Information System (IJITMIS), ISSN 0976-6405 (Print), ISSN 0976-6413 (Online), Volume 5, Issue 2, May - August (2014), pp. 51-58, IAEME
ABSTRACT
The volume and complexity of data collected in datawarehouse systems are growing
rapidly, posing challenges to traditional datawarehouse platforms. At the same time, the
Hadoop ecosystem has opened new avenues for implementing datawarehouse systems on
Hadoop and overcoming these challenges. In this paper we survey previous studies on the
limitations of traditional datawarehouse platforms and discuss the opportunities Hadoop
offers for datawarehouse implementation. This paper can give direction to future
research in the area of datawarehouse implementation on the Hadoop platform.
Keywords: Datawarehouse, Hadoop, Hive, Analytical, ETL
I. INTRODUCTION
The size of the data sets being collected and analyzed in industry for business
intelligence is growing rapidly, making traditional warehousing solutions prohibitively
expensive [5]. Data collected from web logs and social media has become an important
component of analytical systems. At the same time, these data sources have added
complexity to datawarehouses. Post 2005, a richer set of analytical database management
systems has been introduced [2]. However, the rate of growth of data volume and complexity
is posing challenges to these analytical database systems as well. Companies like Yahoo and
Facebook have been using Hadoop for processing large datasets [5], and in recent years
there has been increasing interest in evaluating Hadoop as a datawarehouse
platform. In this paper we study the challenges facing currently available datawarehouse
platforms and the opportunities opened up by Hadoop, and explore areas that need research.
II. CHALLENGES WITH EXISTING DATAWAREHOUSE PLATFORMS

Problem                              | Survey 1 weighted | Survey 1 % of | Survey 2 (2010)
                                     | average response  | respondents   | % of respondents
Query performance / response times   | 60.7              | 49.4%         | 45.0%
Need for complex analysis            | 50.5              | 36.1%         | 40.0%
Load times                           | 61.3              | 38.1%         | 39.0%
Hardware growth / cost               | 31.5              | 26.3%         | 33.0%
Need for on demand capacity          | 47.4              | 19.0%         | 29.0%
Growth in number of concurrent users | 36.4              | 6.0%          | 20.0%
Availability and fault tolerance     | 22.0              | 19.0%         | 19.0%
Other                                | 29.2              | 4.7%          | 4.0%
From the above analysis we conclude that the following are the main challenges with
existing datawarehouse platforms.
2.1 Poor query performance / response times
In both of the above surveys, poor query response is cited as the most important
challenge with existing datawarehouse platforms. SQL is the most common language used
for analysis, so poor query response reflects slow execution of SQL. Over the last 10
years, various approaches such as 64-bit computing, larger memories, MPP systems, and
columnar databases have been applied to this problem, but poor query response remains
the number one challenge for datawarehouses.
2.2 No support for advanced analytics
Lack of advanced analytics capabilities is cited as an important challenge for
datawarehouse platforms, although there is debate over the exact definition of the term.
From various studies it can be concluded that support for predictive algorithms,
statistical analysis, and geographic visualization can be clubbed under advanced analytics.
Traditional RDBMS based datawarehouse platforms generally do not support these advanced
analytic functions through SQL, so the majority of organizations perform this kind of
advanced analytics outside the RDBMS using hand-coded platforms or tools [2].
2.3 Slow data load speed
Traditional datawarehouses are batch oriented and are loaded weekly, daily, or multiple
times a day. However, the trend is to integrate the datawarehouse with transactional and
operational applications such as fraud detection [1], which is making daily-load oriented
datawarehouse applications obsolete. The majority of large corporations have datawarehouse
load processes that run throughout the night and consume 10-12 hours. With increasing data
volumes and complexity, data load times will keep growing, and this challenge needs a
solution.
Cheetah
In the Cheetah framework, a virtual view in which fact and dimension tables are
pre-joined is created. This schema abstracts the joins of tables from users and simplifies
the query language. Since the implementation of the JOIN operator in MapReduce is complex,
this data model also simplifies the MapReduce program. The paper demonstrated that the
Cheetah framework can process data at 1 GB/sec.
Hive
One of the first usages of Hadoop for datawarehousing was reported by the Facebook team.
Hive is an open source datawarehousing solution built on top of Hadoop. It was built in
2007 by the Facebook team and open sourced in 2008 [5]. MapReduce is programming-intensive
and not very suitable for end users who want to query and analyze data. The purpose of
Hive was to bring the concepts of tables, columns, and SQL to Hadoop while retaining the
flexibility of Hadoop [5]. In 2010, the Facebook Hive deployment had tens of thousands of
tables, contained 700 TB of data, and supported 200 users [5]. The Hive query language
(HiveQL) is a subset of SQL, which makes it easy for SQL-oriented users to analyze data in
Hadoop. Custom MapReduce programs can be plugged into HiveQL, enabling complex analysis
such as text mining and pattern matching to be done in MapReduce while the results are
explored using SQL through Hive. Hive compiles SQL into batch MapReduce jobs; it was thus
designed for complex, batch oriented workloads rather than low latency queries [13]. Hive
works well for complex datawarehouse analysis scenarios but does not perform well for
dashboards and real-time analysis.
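To make the compilation idea concrete, the following is a minimal Python sketch (not Facebook's implementation; the table and records are hypothetical) of what a query engine like Hive does with an aggregate query: a statement such as SELECT page, COUNT(*) FROM logs GROUP BY page becomes a map phase that emits key-value pairs, a shuffle that groups them by key, and a reduce phase that aggregates each group.

```python
from collections import defaultdict

# Hypothetical log records; in Hive these would be rows of a table on HDFS.
logs = [
    {"page": "/home", "user": "a"},
    {"page": "/home", "user": "b"},
    {"page": "/about", "user": "a"},
]

def map_phase(records):
    # Emit (key, 1) for each record, as a mapper would for one HDFS split.
    for rec in records:
        yield rec["page"], 1

def shuffle(pairs):
    # Group emitted values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum counts per key: the COUNT(*) ... GROUP BY page aggregation.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(logs)))
print(counts)  # {'/home': 2, '/about': 1}
```

Writing even this trivial aggregation by hand illustrates why a SQL layer over MapReduce lowered the barrier for analysts.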
Other examples of Hadoop based DW implementation
In the previous sections we discussed the usage of Hadoop for data storage in
datawarehouse applications where data volumes are very large or data is unstructured. A
few other studies have reported usage of Hadoop for other specific use cases.
Sensors have been widely applied in various fields, and the time series data they
generate creates a large demand for data storage and analysis. Most of the time series
datawarehouse solutions currently available use RDBMS systems. The usage of HBase, a
NoSQL database on Hadoop, for a time series datawarehouse is reported in [9]. This
experimental study conducted a stress test on 400 million time series records and
concluded that HBase has good read performance for time series data.
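One reason HBase reads time series well is its sorted row-key layout. The sketch below illustrates a common row-key design (an illustration only; [9] does not publish its schema, and the metric name and timestamps here are made up): the key concatenates a metric id with a zero-padded timestamp, so readings for one metric sort contiguously and a time range query becomes a simple key-range scan.

```python
# Common HBase row-key pattern for time series: metric id + padded timestamp.
def row_key(metric, epoch_seconds):
    return f"{metric}:{epoch_seconds:010d}"

# Simulate an HBase table as a sorted map of row key -> value.
table = {}
for t, value in [(1000, 21.5), (1060, 21.7), (1120, 21.9)]:
    table[row_key("sensor42.temp", t)] = value

def scan(table, start_key, stop_key):
    # HBase scans return rows with start_key <= key < stop_key in sorted order.
    return [(k, v) for k, v in sorted(table.items()) if start_key <= k < stop_key]

# Read the readings for sensor42.temp between t=1000 (inclusive) and t=1120
# (exclusive); only the first two rows fall in this range.
rows = scan(table, row_key("sensor42.temp", 1000), row_key("sensor42.temp", 1120))
print(rows)
```

Because a range of timestamps maps to a contiguous range of keys, reads touch only adjacent rows, which is consistent with the good read performance reported in [9].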
The usage of Hadoop for a high performance datawarehouse and OLAP has been
demonstrated in [6]. In this experiment, a Hadoop cluster consisting of 18 nodes with 36
cores was constructed. The experiment used MapReduce for cube construction and provided
an XMLA API so that standard BI tools can access the cube data.
All the studies described so far concern very large scale datawarehouses. However,
midsized organizations may also benefit from Hadoop by reducing datawarehouse storage
costs. In [13], an evaluation of Hadoop for small and medium sized datawarehouse systems
was performed, comparing MySQL, Hadoop + MapReduce, and Hive for data sizes ranging from
200 MB to 10 GB. The study found that up to 1 GB, MySQL performs better than Hadoop and
Hive; between 1 and 2 GB, Hadoop + MapReduce outperforms Hive and MySQL; and beyond 2 GB,
Hive outperforms Hadoop + MapReduce and MySQL. The work done in this study needs to be
extended further: first, it did not run extensive analytic queries to confirm the
conclusion; second, low latency queries were not tested. Despite these issues, the study
opens new opportunities for small and medium industries to leverage Hadoop for a low cost
datawarehouse.
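The crossover points reported in [13] can be summarized as a simple size-based rule of thumb. The sketch below encodes only those reported thresholds; the function name is hypothetical and the thresholds hold only for the workloads tested in that study.

```python
# Rule of thumb from the crossover points reported in [13] (200 MB - 10 GB):
# pick a platform based on data size alone. Purely illustrative.
def suggest_platform(data_size_gb):
    if data_size_gb <= 1.0:
        return "MySQL"              # up to 1 GB, MySQL was fastest
    if data_size_gb <= 2.0:
        return "Hadoop+MapReduce"   # between 1 and 2 GB
    return "Hive"                   # beyond 2 GB

print(suggest_platform(0.5))   # MySQL
print(suggest_platform(1.5))   # Hadoop+MapReduce
print(suggest_platform(10.0))  # Hive
```

As the paper notes, such thresholds would need revalidation with analytic and low latency queries before being used as guidance.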
IV. OTHER WAYS IN WHICH HADOOP CAN BE USED FOR DATAWAREHOUSE IMPLEMENTATION

In the previous section we discussed how Hadoop can be used as a data storage
platform for a datawarehouse. There are other means of using Hadoop for datawarehouse
implementation.
A lot of data processing happens in the data staging area to prepare data to be
loaded into the datawarehouse [7], and the size of the staging area is often many times
that of the datawarehouse itself. Moving staging to Hadoop has been reported in the
literature to take advantage of its low cost, linear scalability, facility with file-based
data, and ability to manage unstructured data [7]. The datawarehouse servers are then
utilized only for loading cleaned data and for end user access to data.
An enormous amount of data is gathered over time in a datawarehouse system. Not all
of this data may be required for reporting and analytics, yet legal requirements stipulate
that organizations retain data beyond its usage life. In many cases the source data also
has to be retained as-is alongside the cleaned datawarehouse data. Retaining older
datawarehouse data and raw source data for long periods can be expensive [7] and can
burden the datawarehouse server. Similar to datawarehouse staging, data archival can be
migrated to Hadoop.
Migrating ETL processing to Hadoop can reduce both cost and processing times. Cheap
storage, massive scalability, and the ability to handle complex logic and unstructured
data make Hadoop an ideal ETL platform. It is recommended to identify the top 20% of ETL
workloads for migration to Hadoop to achieve maximum benefits; [4] lists the categories of
ETL jobs that qualify for such migration. Thus, apart from using Hadoop as a datawarehouse
storage platform, it can also be used for data archival, data staging, and processing ETL
programs.
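A typical ETL transformation of the kind migrated to Hadoop is a dimension lookup. The sketch below (table names and fields are hypothetical, not drawn from [4]) shows the shape of a map-side hash join: on a real cluster the small lookup table would be shipped to every mapper, e.g. via the distributed cache, while the fact records are read from HDFS.

```python
# Dimension lookup as a map-side hash join: enrich each raw fact record
# with an attribute from a small in-memory dimension table.
customer_dim = {101: "ACME Corp", 102: "Globex"}   # small lookup table

raw_facts = [
    {"customer_id": 101, "amount": 250.0},
    {"customer_id": 102, "amount": 75.5},
    {"customer_id": 999, "amount": 10.0},          # no matching dimension row
]

def transform(fact):
    # Default unmatched keys to "UNKNOWN" rather than dropping the row,
    # so the load remains auditable.
    return {**fact, "customer_name": customer_dim.get(fact["customer_id"], "UNKNOWN")}

loaded = [transform(f) for f in raw_facts]
print(loaded[0]["customer_name"])  # ACME Corp
print(loaded[2]["customer_name"])  # UNKNOWN
```

Because every record is transformed independently, this kind of lookup parallelizes trivially across mappers, which is what makes it a good candidate for migration.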
V. LIMITATIONS OF HADOOP TO BE A DATAWAREHOUSE PLATFORM
Despite the advantages of Hadoop as a datawarehouse platform stated in the previous
section, it has certain limitations. Most of these limitations are features that Hadoop
lacks compared to a mature RDBMS. In this section we discuss the limitations of Hadoop as
a datawarehouse platform.
1. Low latency data access and queries: MapReduce is a batch oriented programming
paradigm and hence is not well suited for real time, fast queries. However, newer query
engines like Impala and Apache Drill are providing faster query processing over data
stored in Hadoop [16]. Future usage of Hadoop as a datawarehouse platform will depend to a
great extent on the maturity of these query engines and how quickly they acquire RDBMS
functionality.
2. Inserts and updates: Hadoop does not support ACID compliant insert and update
queries; even Hive does not currently support them [5]. This makes it difficult to use
Hadoop for dimension tables in a datawarehouse that require updates for slowly changing
dimensions.
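A common workaround on append-only storage (an illustration of standard SCD type 2 modeling, not a technique from the surveyed papers; the table and fields are hypothetical) is to version dimension rows instead of updating them: the current version is closed with a validity end date and a new version is appended. On HDFS this means rewriting the affected partition rather than updating in place.

```python
from datetime import date

# Slowly changing dimension, type 2: keep history as versioned rows.
dim_customer = [
    {"id": 101, "city": "Pune", "valid_from": date(2013, 1, 1), "valid_to": None},
]

def change_city(rows, cust_id, new_city, as_of):
    # "Close" the current version (on HDFS this rewrite replaces the
    # whole partition file, since rows cannot be updated in place)...
    for row in rows:
        if row["id"] == cust_id and row["valid_to"] is None:
            row["valid_to"] = as_of
    # ...and append the new version as the current row.
    rows.append({"id": cust_id, "city": new_city,
                 "valid_from": as_of, "valid_to": None})

change_city(dim_customer, 101, "Mumbai", date(2014, 5, 1))
current = [r for r in dim_customer if r["valid_to"] is None]
print(current[0]["city"])  # Mumbai
print(len(dim_customer))   # 2 versions retained for history
```

The modeling side-steps the missing UPDATE, but at the cost of rewriting partitions, which is why the lack of ACID updates remains a genuine limitation.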
3. Granular security: Row level or field level security as found in an RDBMS is absent
in Hadoop; only basic checks such as file permissions are present [7].
4. SQL based analytics: Any mature datawarehouse solution involves end users writing
complex SQL for data analysis. Hadoop based databases like Hive and Impala have limited
support for ANSI standard SQL [7]. For example, Hive does not support correlated
subqueries, which are commonly used in traditional warehouse queries. However, Hadoop
based databases such as IBM BigSQL and Greenplum HAWQ aim to support ANSI SQL. The ability
of Hadoop based query engines to match ANSI SQL capabilities will speed up the usage of
Hadoop as a datawarehouse platform.
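The correlated-subquery gap is often worked around by rewriting the query into constructs Hive does support. The Python sketch below (hypothetical data) computes the same result both ways for a query like "orders larger than their customer's average order": the correlated form becomes a GROUP BY aggregation followed by a plain join-and-filter.

```python
from collections import defaultdict

# Hypothetical orders: (customer, amount). The correlated SQL form is:
#   SELECT * FROM orders o
#   WHERE amount > (SELECT AVG(amount) FROM orders WHERE cust = o.cust)
orders = [("a", 10.0), ("a", 30.0), ("b", 5.0), ("b", 5.0)]

# Step 1: per-customer average (a GROUP BY, which Hive supports).
totals, counts = defaultdict(float), defaultdict(int)
for cust, amount in orders:
    totals[cust] += amount
    counts[cust] += 1
avg = {cust: totals[cust] / counts[cust] for cust in totals}

# Step 2: join the averages back and filter (a plain equi-join, also supported).
above_avg = [(cust, amount) for cust, amount in orders if amount > avg[cust]]
print(above_avg)  # [('a', 30.0)]
```

Such manual rewrites work, but they shift effort onto analysts, which is why ANSI SQL compliance of the Hadoop engines matters.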
VI. DIRECTIONS FOR FUTURE RESEARCH
As discussed in the previous section, certain areas need further research to make
Hadoop a viable alternative platform for datawarehouse implementations. In this direction,
the following research is necessary:
1. Adoption of Hadoop for datawarehouse implementations will depend to a great extent
   on the maturity of SQL engines on Hadoop and their compliance with ANSI SQL. Further
   research is required to identify which ANSI SQL features are lacking in Hive (or
   Impala) and would make these SQL engines mature datawarehouse platforms.
2. The suitability of Hadoop for managing large datasets is well established, and [13]
   has discussed its suitability for small and mid sized datasets. This aspect needs to
   be further explored, and detailed guidelines need to be developed for analyzing the
   suitability of Hadoop for small datasets. This will make Hadoop a more feasible
   alternative to traditional DW platforms for smaller organizations.
3. The application of traditional dimensional or ER modeling techniques to
   datawarehouses on Hadoop needs to be studied. If these approaches are found
   unsuitable, an alternative methodology for modeling datawarehouses on Hadoop needs to
   be developed.
4. Various studies have proposed using Hadoop to process ETL transformations such as
   lookups and joins. However, detailed benchmarks on when these transformations benefit
   from a Hadoop based implementation need to be developed.
5. A set of comprehensive guidelines / framework needs to be developed for evaluating
whether a datawarehouse will benefit from Hadoop.
VII. CONCLUSION
In this paper we analyzed the shortcomings of traditional datawarehouse platforms,
using a meta-analysis of two surveys to report the problems with current datawarehouse
platforms. We also surveyed various experiments on Hadoop based datawarehouses.