
2018 IEEE International Conference on Big Data (Big Data)

Big Data Augmentation with Data Warehouse: A Survey

Umar Aftab
Department of Computer Science
Quaid-i-Azam University
Islamabad, Pakistan
Email: umaraftab@cs.qau.edu.pk

Ghazanfar Farooq Siddiqui
Department of Computer Science
Quaid-i-Azam University
Islamabad, Pakistan
Email: ghazanfar@qau.edu.pk

Abstract – With dynamic changes in the world’s technology, increasing growth and adoption have been observed in the usage of social media, computer networks, the Internet of Things, and cloud computing. Research experiments are also generating huge amounts of data, which must be collected, managed, and analyzed. This huge data is known as “Big Data”. Research analysts have perceived an increase in data that contains both useful and useless entities. In extracting the useful information, the data warehouse finds it difficult to keep up with the increasing amount of data generated. With this paradigm shift, big data analytics has emerged as a promising area of research that supports business intelligence in terms of decision making. This paper provides a comprehensive survey on Big Data, Big Data problems, Big Data analytics, and the Big Data warehouse. In addition, it explains how the need for augmentation of big data and the data warehouse emerged from the perspective of decision making, comparing methods and research problems. It also elaborates on applications that support Big Data and the data warehouse, and their challenges.

Keywords—Data Warehouse, Big Data, Map Reduce, Augmentation, Data Lake, OLAP, CMM, D&M.

I. INTRODUCTION

Looking at the historical perspective of the growth cycle of data, a pattern emerges in how the trend has changed in recent times: 1) relational databases, 2) Online Analytical Processing (OLAP), 3) column-based data storage and cloud computing, and 4) big data applications [1]. A need arose for handling high-rate data streaming. Initially, it was addressed by data management and storage systems; in recent years, the rate of data streaming has increased to a point that is difficult to handle [2].

This paper is organized as follows: the next section discusses quality factors in the data warehouse. Section III covers the origin of knowledge sources and their acquisition. Section IV discusses performance evaluation measures for the data warehouse, and Section V discusses data pre-processing in the data warehouse. Sections VI through X discuss analytics, its perspective in the data warehouse over Big Data, and its paradigms. Sections XI and XII discuss competitive insights from the perspective of the industrial business implications of adopting a Big Data warehouse. Section XIII concludes and gives future work directions.

A. Introduction to Big Data

The huge amount of data that is generated (structured or unstructured), which is difficult to collect and analyze with traditional data storage, is called “Big Data” [1], [2], [4], [5]. The major reasons behind this rise are the tremendous amount of data being produced and the ease of its accessibility [2], [3], [4], [5]. The reasons behind the emergence of big data are as follows. First, social media (with Twitter as one of the biggest social media platforms) [2]: the most difficult and crucial job is to build over big data sets to observe and perceive different shifts in users’ behavioral patterns (sharing views, activities, opinions, likes and dislikes) in order to address different customer needs and requirements [2], [8]. Second, the introduction of the Internet of Things (IoT): we live in a fast, complex, changing environment with an increasing number of connections between humans and smart technologies, which leads us into a digital world. Data can be generated from multiple sources by multiple entities, and this becomes harder to manage as more devices are interconnected to exchange information [1], [2], [8]. Third, scientific computing, in which researchers perform various experiments that result in data generation; extracting useful information from such large repositories to support decision making is challenging for traditional databases [6]. Big data creates a space, through the provisioning of data, that strengthens different users’ ability to acquire valuable data, and it increases the probability of creating opportunities, through exploration, to support decision making in different social initiatives raised by different institutions (business and government organizations) [5], [7]. Lastly, big data has become part of the infrastructure for developed and under-developed countries competing globally amid changing technology paradigms [5]. In industry, the question is raised whether this data is useful for obtaining valuable information; it is addressed by exploring (big data) and exploiting (dark data, data initially considered useless) the data to gain an in-depth understanding and new insights [4].

Big data can be described either in terms of features or challenges. These are volume (‘data-at-rest’, the collection of data from different sources), variety (‘data-of-many-forms’, data in multiple formats, e.g. unstructured data), veracity (‘data-in-doubt’, related to the importance and trustworthiness of data from different sources), and velocity (‘data-in-motion’, data arriving at an unprecedented speed that needs to be managed). Later, three more features were identified: variability (‘data-in-highlight’, inconsistencies in data during collection), value (the usefulness of data from the perspective of extraction and usage), and visualization (a demonstration of data in different forms to

978-1-5386-5035-6/18/$31.00 ©2018 IEEE 2785


get a clear understanding) [4], [8]; together, these are commonly known as the 7 V’s. With respect to the growth of data, the different integrations and transformations among data incur new research challenges, which have received a lot of interest among research communities [6].

B. Big Data Problems

Big data, an unprecedented growth of data, is difficult to store, process, and manage. For analysis, we depend on effective analytical tools and technologies, but with such data growth it is difficult for traditional data management tools to cope in terms of scalability [4], [7]. The problem arises in extracting useful information that fulfills the business-case requirements from an organizational perspective, which are i) data volume, ii) data availability, iii) data format, iv) the level of granularity required, and v) business requirements [5].

C. Big Data Analytics

Data analytics is serving as an emerging horizon in business intelligence, and its importance is increasing in both the private and public sectors, where the insights sought depend upon the requirements [7]. Data visualization plays an important part in big data analytics: it comprises different techniques and reports that give a better picture and help in analyzing data [8]. Big data, coming from numerous dimensions, allows you to adapt a business model or consider building a new one. The major goal of adapting a traditional business model, or of building a new one, is to find and analyze new insights that will support different initiatives and incentives [7]. But supporting effective decision making depends on the market situation and on the business requirements for gaining an advantage over traditional business models [7]. In view of recent advancements, using the data warehouse and Online Analytical Processing (OLAP) to support analytics over big data has emerged as a new area of research: complex query analysis is performed to extract valuable information from variable data [6]. This focuses researchers on finding and extracting useful data insights.

D. Big Data Warehousing

Data analytics requires the extraction of valuable information to support decision making [7]. But there has been a misapprehension regarding the compatibility of the data warehouse architecture with big data. The primary focus of the data warehouse was to support processing on static data, or on data streaming at a very low rate; it was never meant to support real-time transactional data analysis or its alternatives [7]. In response, real-time business intelligence is now supported by real-time data warehouses. With changing environments, some researchers have perceived that there will be no further need for data warehouses and relational databases to support big data [4]. Yet the data warehouse has its own importance for processing some data formats, remains helpful in certain conditions, and some data warehouse tools provide support for big data [4]. To support the above discussion, some points are shown in Table 1.

Table 1: Difference between a Traditional Data Warehouse and a Data Warehouse on Big Data

Business requirements
  Traditional: understand user needs; assess/calculate the requirements of data for analysis; confirm data availability, security, and accessibility.
  Big Data: understand the architecture; understand the user needs; assess the need from a data perspective; confirm data availability, accessibility, and security.

Data Analysis
  Traditional: analyze in terms of data types, business rules, quality, and granularity; handle post-added requirements.
  Big Data: real-time interactive analysis; integrated data; integrated analysis history.

Data Modeling
  Traditional: construction of relational tables; definition of hierarchies; physical database; staging schema construction.
  Big Data: data architecture built on the basis of information; structural infrastructure based on the data architecture; any data format (unstructured, structured, etc.).

Data Movement
  Traditional: data ETL module.
  Big Data: ETL (e.g. Apache Spark), or ELT.

Data Quality
  Traditional: ensure data quality.
  Big Data: ensure data and module quality.

Data Transformation
  Traditional: data aggregation, summarization, encryption.
  Big Data: module data aggregation and summarization.

Data Visualization
  Traditional: predefined interface for the user, predefined rules.
  Big Data: interface based on user requirements.

Defining data in different dimensions according to different business models, and implementing it with defined methods, makes it possible to execute analytical processes over Big Data. But such analytics requires computational resources: 1) a system with high processing performance, 2) scalability sufficient to handle continuous data streaming, 3) elasticity to support storage management, and 4) compatible infrastructures, such as cloud computing, with the powerful and flexible support of data warehouse and OLAP analysis capabilities. This needs careful attention and an in-depth understanding of platform integration among the different paradigms. Furthermore, it addresses the initial misconception, perceived by some researchers, that there is no need for a data warehouse in the current Big Data landscape and that alternatives such as cloud computing (infrastructure as a service supporting intensive data streaming) suffice in terms of scalability, flexibility, and performance, along with other attributes, in supporting common goals [6]. In response to this misapprehension, the two architectures depend upon each other via continuous data streaming. In addition, they need a specialized sequential integration in which (i) the cloud infrastructure acts as a staging layer for the classical data warehouse, and (ii) the staging layer augments data management platforms for defining new analytical insights and enhancing performance, scalability, and adaptability across these paradigms [6]. The above discussion, however, does not cover all analytical requirements, as a

data warehouse is unable to answer the maximum number of questions. To discover and explore different new insights from continuous data streaming, we need a new direction [6].

Our survey proposes a new idea of Big Data augmentation with the data warehouse: how can big data be transformed and integrated with the existing data sitting within a data warehouse to provide an organization with an improved capability to make strategic and tactical decisions?

II. QUALITY FACTORS IN DATA WAREHOUSE

The data warehouse is an integral part of an information management system (e.g., heterogeneous data sources and analytical tools), and its use for analysis is gaining popularity across organizations. But a challenge arises from the scale of complexity in implementing and managing it under shifting business paradigms in order to gain benefits across organizations; in recent times, it has not always proved to be a significant improvement for organizations [10]. Researchers are therefore working to provide guidance for its successful implementation [11], and some studies have assessed the development, implementation, and critical success factors of the data warehouse [10]. However, some researchers have pointed out that the impact of success factors on the implementation of a data warehouse is still unclear. The theoretical model discussed here is the updated DeLone and McLean (D&M) model [11]. It comprises measures that impact the performance of the data warehouse, described as follows.

A. Factors Influencing Data Warehouse Implementation Success

Independent success factors regarding data warehouse implementation are:

1) Organizational factors
Organizational factors measure the association between stakeholders and their emphasis on certain factors during implementation [11]. In recent times, this has been supported by researchers from multiple perspectives, and research has shown organizational factors to matter in successful data warehouse implementation. Therefore, current research suggests precise organizational factors, namely i) independent authority for taking decisions, ii) strategic objectives, and iii) non-confrontation within the organization, that affect the success of the data warehouse during its implementation.

2) Technical factors
Technical factors measure the limitations, resource availability, and expertise needed to handle critical factors regarding the integration of heterogeneous data sources during the implementation of the data warehouse [11]. Therefore, current research suggests precise technical factors, namely a structural base, the data model, an effective methodology, and a critical data archive, that affect the success of the data warehouse during its implementation.

3) Project factors
Project factors measure the complexity of tasks, the management of roles, and capabilities regarding the project [11]; some tangible factors are also associated with the project. Therefore, current research suggests precise project factors, namely planning and management, human resource capabilities, technical experts, and external support, that affect the success of the data warehouse during implementation.

4) Environmental factors
Environmental factors measure the surroundings, with their dynamic business shifts and possibilities for uncertainty, in order to reduce uncertainty and utilize new information for competitive advantage. Therefore, current research suggests precise environmental factors, namely a competitive business environment, the master management vendor, and industry standards, that affect the success of the data warehouse during its implementation.

5) Infrastructure factors
Infrastructure factors measure the dynamic environment, with its shared and tangible information resources, for gaining a competitive advantage in enabling present and future business initiatives among organizations. Therefore, researchers suggest infrastructure factors, namely reliable information, advanced development applications, and hardware, that affect the success of the data warehouse during its implementation.

B. Factors Influencing Data Warehouse Success

Dependent success factors regarding the data warehouse are:

1) Information Quality
Information quality measures the incoming information, across different dimensions, used to define new innovations for gaining a competitive advantage in business [11]. Furthermore, it provides quantitative metrics for achieving benchmark performance against which the success ratio can be measured. Therefore, current research suggests information quality factors, namely valuable information, adequate information, and information integrity, that affect the success of the data warehouse.

2) System Quality
System quality measures the system’s competence and the strength of its different entities to produce competitive results [11]. Its success is assessed on the basis of its adoption, ease of use, and efficient operation, along with the production of new information to support decision making for both end users and researchers. Therefore, current research suggests precise factors, namely ease of use, easy manageability, data positioning, and ease of interaction.

3) Service Quality
Service quality measures the fulfillment of the assistance required for implementation of the information system. But,

service quality will also serve as a measure of information effectiveness [11]. Therefore, current research suggests service quality factors, namely training manuals, trustworthiness, and fast response.

4) Relationship Quality
Relationship quality is a measure of strengthening already built relationships and of transforming inattentive customers into devoted ones [11]. Relationship quality can be viewed in different contexts in different situations. In addition, the relationship between business objectives and information technology impacts the progress of the data warehouse in terms of gaining a competitive advantage. Therefore, current research suggests relationship quality metrics, namely effective trust, commitment, and user participation.

A blend of implementation success factors and quality success metrics has thus been acknowledged for the data warehouse. With a better understanding of them, the data warehouse can become more effective for analysis purposes.

III. ORIGIN OF KNOWLEDGE SOURCES AND ITS ACQUISITION

The data warehouse is one of the sources for knowledge acquisition. With the objective of steering analysis in the right direction and enhancing the effectiveness of decision making, researchers are investigating the role of the data warehouse and its impact on an organization [12]. Research studies have described the importance of the data warehouse as a source of valuable information supporting decision making in different business processes [12]. Recently, with the technology shift, the business market has become more competitive; as a result, information system experts are counting on progressive technology. In parallel with the evolving environment, this tangibly impacts the nature of decisions, and organizations intend to acquire valuable information ahead of time to gain a competitive advantage. To address this development, the integration of information is considered valuable only when its availability, quality, pre-defined data sequence, and source are known. In addition, to support effective decision making in the organization, information should be presentable in a pre-defined, understandable format to the organization’s management so that most business processes can be supported [12]. Moreover, it will be more effective if the computational resources and their capability for utilizing information meet business needs. The main objectives are to acquire valuable information, to support effective decision making, and to gain a competitive advantage across organizations [12]. However, recent research studies have highlighted some technical issues in the data warehouse, and the major reasons behind information system deployment failures are 1) psychological issues, 2) environmental issues, and 3) organizational issues. A survey conducted to measure the adoption of the data warehouse across organizations shows only minor adoption across companies [12].

IV. MEASURES FOR ASSESSING PERFORMANCE OF DATA WAREHOUSE

In recent times, technology infrastructure has improved a lot, and there has been increasing interest among the community in contributing towards infrastructure building. In line with this development, many narratives have emerged from different organizations according to their needs. A data warehouse is an important part of the infrastructure for data management, and its adoption requires a huge financial undertaking. Since a business organization’s strength can be judged on the basis of its financial standing, the competitive business environment makes data warehouse adoption a critical decision, keeping in view that it will help in providing valuable information to different end users. Despite these conditions, it can be adopted if it remains operational and its performance is accepted by its users. Hence, the concerned management requires a measure for assessing data warehouse performance. In light of research [13], the balanced scorecard approach provides a comprehensive view of performance from different perspectives (customer, organizational, development, innovation, and financial) and addresses the critical factors that affect the success of the data warehouse. These factors are described as follows:

1) User Perspective
The user perspective measures the factors that strengthen the relationship between the organization (DW managers and developers) and the analytical community [13]. Research suggests critical quality dimensions that influence decision making and contribute to achieving user satisfaction, namely i) service effectiveness, ii) service offering, iii) data consistency, iv) data compliance, and v) robustness. With these, user satisfaction in the adoption of the data warehouse can be ensured.

2) Organizational Business Significance Perspective
In line with the development described above, business organizations need financial undertakings to acquire resources or to shift technologies [13]. In view of financial standing, adopting a data warehouse requires a lot of investment, including various operational and maintenance expenses (hardware, software, licenses). Before adoption, top management assesses whether it will be beneficial for the organization in terms of lowering the annual deficit in investment. To accomplish this, data warehouse champions need to ensure its effectiveness in achieving business goals. The financial statement serves as a measure of the organizational strategic role, contributing towards successful implementation and execution to achieve a certain level of improvement. In recent times, research suggests some factors, which are i)

data warehouse architects (evaluation and identification of bottlenecks and inefficiencies in operations), data warehouse champions, a good negotiator (the chief technical officer (CTO) and the manager’s role in decreasing annual expenses), and risk aversion (hardware, software tools, licenses), all of which play a part in successful data warehouse adoption among organizations.

3) Internal Process Perspective
In line with the discussion above, top management needs a measurement method to assess data warehouse performance in the organization [13]. The balanced scorecard is regarded as a measure that provides managers with a comprehensive view of performance and helps them focus on critical operations to increase customer satisfaction. Some factors regarding the data warehouse’s internal operation need to be assessed, as they are considered critical: i) Extract-Transform-Load (ETL) code performance (stored procedures, views, report SQL), ii) batch cycle execution, iii) analytical reporting, iv) responsive progress, v) verification, vi) validation, vii) distribution, viii) data warehouse scalability and stability (the Capability Maturity Model (CMM) provides useful guidelines for improvement), and ix) data availability as per the DeLone and McLean (D&M) model [14].

4) Business Innovation and Growth
In line with the development described above, improvements in technology are nowadays getting attention in the research communities [13]. Technology and business are changing quite fast, and an organization’s strength can be seen in its ability to learn from gaps, innovate, and improve itself to compete with its environment. Innovation and growth in the data warehouse will provide better reporting capabilities and better business intelligence insights. Hence, research suggests some factors (technology leadership, short project life cycles, efficient test-driven development, process improvement, innovative query writing, emerging technology learning, and increased automation) that play a part in the successful adoption of the data warehouse among organizations.

In line with the development described above, Nayem Rahman [13] proposed a balanced method for measuring effectiveness in the different operational perspectives of data management. This approach provides a comprehensive view of operations to the concerned management and identifies the key operational areas important for implementation. Furthermore, some crucial measures for delivering operational excellence have been identified for data warehouse managers and architects, which are i) industrial worth formation, ii) budget loss prevention, iii) improvement, iv) requirement fulfillment, v) in-house processes, vi) effectiveness, and vii) consistency, all under consideration for organizational benefit.

V. DATA PRE-PROCESSING IN DATA WAREHOUSE

The data warehouse is a platform designed for multiple purposes. It comprises different steps, including pre-processing as a prerequisite before storing, scheduling, and management. Furthermore, complex queries are used for analyzing and summarizing the organization’s data in an Online Transaction Processing (OLTP) system. Anosh Fatima et al. [15] explore different data pre-processing techniques in the data warehouse, the conditions they apply to, and the problems raised.

A. Dynamic Study
The study presented in [15] comprises industrial experiment results. The motivation behind it was to categorize the precise selection of information management application platforms in terms of requirements, situation, and the benefits that can be gained [15]. The tools considered are Yale, Weka, and Alteryx.

1) Problems in Selection of Sources and Schemas
Two types of sources and schemas are considered here.

I. Single-Source Problems
At the architecture level, in the absence of principled limitations, the issues highlighted are that i) the uniqueness property is not enforced and ii) the referential integrity property is violated. At the instance level, the errors highlighted are i) misspelled words, ii) redundant words, iii) duplication, and iv) contradictory values.

II. Multi-Source Problems
At the schema level, the problems involve heterogeneous data models and schema design issues; one highlighted issue is structural type conflict. At the instance level, the problems involve overlapping and inconsistent data.

2) Data Pre-Processing
This study considers two data pre-processing approaches: i) classical pre-processing (cleaning, fusion, and structuring) and ii) advanced pre-processing (summarization) [15].

I. Data Cleaning Approaches
Data cleaning is a process that helps prepare data to make it useful. It includes data analysis, data definition, transformation, verification, mapping rules, and back data flow analysis [15]. Research suggests two appropriate approaches to data cleaning, 1) record matching and 2) data repairing, to improve the performance of the data warehouse [16]. Query optimization is another factor for improving the performance of decision making in the data warehouse [16]. To make data warehouse performance more effective, researchers suggest materialized views (pre-calculated end results of queries to improve time efficiency) and automated child views, but both approaches need more time to evolve enough to optimize execution.
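The single-source, instance-level problems listed above (misspelled words, duplication, contradictory values) can be sketched as a small cleaning routine. This is only an illustration in plain Python; the record fields and the spelling-correction table are hypothetical, not taken from [15]:

```python
# Sketch of instance-level cleaning for a single source: normalize known
# misspellings, drop exact duplicates, and flag contradictory values
# (the same key mapped to two different attribute values).
SPELLING_FIXES = {"Islambad": "Islamabad", "Lahor": "Lahore"}

def clean(records):
    seen = set()          # (id, city) pairs already emitted
    first_city = {}       # first city observed per id
    cleaned, conflicts = [], []
    for rec in records:
        # i) fix misspelled words via the lookup table
        rec = dict(rec, city=SPELLING_FIXES.get(rec["city"], rec["city"]))
        key = (rec["id"], rec["city"])
        if key in seen:   # iii) duplication: skip repeated records
            continue
        seen.add(key)
        # iv) contradictory values: same id, different city
        if rec["id"] in first_city and first_city[rec["id"]] != rec["city"]:
            conflicts.append(rec["id"])
        first_city.setdefault(rec["id"], rec["city"])
        cleaned.append(rec)
    return cleaned, conflicts

records = [
    {"id": 1, "city": "Islambad"},   # misspelled
    {"id": 1, "city": "Islamabad"},  # duplicate after normalization
    {"id": 2, "city": "Lahore"},
    {"id": 2, "city": "Karachi"},    # contradicts id 2
]
cleaned, conflicts = clean(records)
```

A real warehouse would resolve the flagged conflicts with record matching and data repairing rules rather than merely reporting them.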

a) Materialized Views (MVs)
Data analysis is considered an integral part of the data warehouse, and for analysis some factors are considered important: i) query processing time, ii) effective query writing, and iii) fast pre-processing. To accomplish these, research suggests the materialized views method, which stores the pre-computed outcomes of analytical queries to provide fast processing [16]. Usually, queries are executed with joins and aggregations, but a materialized view eliminates the joins and aggregates: results are stored in cached tables, which are continuously updated while processing continues, for quick access. The benefit is reduced query processing time, so frequent queries can be executed almost instantly. Moreover, the cached tables can be indexed to increase execution speed.

Outlier detection and elimination are other factors that play an important part in data cleaning [16]. Most past work on outlier detection concerns numeric data, but a current study proposes a hybrid approach for outlier detection [16] that combines two data mining techniques, i) weighted k-means and ii) a neural network, for cluster formation [16]. Another technique used for outlier detection is integrated semantic knowledge (SOF), which detects semantic outliers (points that behave differently from the others in a cluster) [16]; however, it still requires improvement in symbol modification for both text and numeric data.

II. Approaches for Missing Data
Data collected from heterogeneous data sources may have missing attribute values when training on the data, either because the missing values were not recorded properly or because they were omitted due to confidentiality constraints. In recent times, research suggests three methods for pre-processing missing data, which are elimination with constants, elimination with the attribute mean, and elimination with random values.

III. Framework
In recent times, research suggests some essential frameworks and the attributes that help in their evaluation. Data acquisition, data quality, and data cleaning are associated and depend on each other. In light of research [16], the data quality issues highlighted are i) completeness of data, ii) accuracy of data, iii) validation of business procedures and guidelines, iv) precision, v) non-duplication, vi) derivation integrity, vii) accessibility, and viii) timeliness. The data cleaning issues highlighted are i) data entry errors, ii) data integration errors, and iii) measurement errors. The data acquisition issues highlighted are i) data conversion from heterogeneous data sources, ii) redundant data removal, and iii) data transformation. In line with the above discussion, the proposed frameworks are Total Data Quality Management, Rule Base, and Log Management. The framework evaluation factors are i) integrity, ii) precision, iii) validation, iv) repeated-word elimination, v) openness, and vi) data access time.

VI. ANALYTICS OVER BIG DATA

In line with the emerging advancements in the data warehouse for improving analytics to support effective decision making, analytics over big data executes complex procedures to extract useful data. Executing analytics over big data requires a system with high performance and a scalable, elastic infrastructure. As a consequence of supporting such analytics, research challenges arise at the convergence between the data warehouse and cloud infrastructures [17]. The tools considered here are Hadoop, Hive, and cloud infrastructure.

A. Research Challenges
• Data preparation from variable data sources; as a consequence, data can be populated using OLAP [17].
• The issue of data cleaning and management in data extraction to support Business Intelligence (BI) with a complex analytics view for effective decision making [17].

B. Arguing Comments
In light of the research challenges discussed here, a number of questions are highlighted:
• “Considering data-intensive cloud infrastructures as today’s alternative of data warehouse”
• In response to the above argument, the research community has given a few suggestions in [17]:
  • The cloud will be considered as a data staging medium for the data warehouse [17].
  • “This staging area should populate the subsequent Data Warehousing layer and OLAP layer to define complex analytics systems that are more and more powerful, adaptable and scalable.”

C. Research Future Challenges
• Multi-structural support in Hadoop.
• Integration of live data streaming sources in Hadoop.
• Modification of the Hive Query Language (HiveQL) to support multi-structured data.
• Analysis in Hadoop over live streaming data.
• Multi-visualization representations supporting multi-structural analysis.

VII. ANALYTICAL PERSPECTIVE OF DATA WAREHOUSE OVER BIG DATA

What separates the data warehouse and big data is the size of the data, the data sources, the data dimensions, the data streaming rate, and the analytical method. To make them compatible with each other, a framework needs to be designed. Lihua Sun et al. [18] analyze and summarize three architectures, i) Parallel Database, ii) MapReduce

2790
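The MapReduce programming model underlying the second of these architectures can be sketched in miniature. This is an illustrative, in-memory Python sketch only, not Hadoop's actual API; the function names and the sample input are invented for the example:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # map: emit an intermediate (word, 1) pair for every word in every record
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # shuffle: group intermediate pairs by key, as the framework would
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # reduce: aggregate (here, sum) the values collected for each key
    for word, counts in grouped:
        yield word, sum(counts)

data = ["big data warehouse", "big data analytics"]
result = dict(reduce_phase(shuffle(map_phase(data))))
print(result)  # {'analytics': 1, 'big': 2, 'data': 2, 'warehouse': 1}
```

Hadoop distributes these three phases across a cluster; the batch-oriented character discussed in this section comes from the fact that the reduce phase cannot emit final results until the shuffle over the whole input is complete.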
The platforms considered here are Hadoop (MapReduce) and the Parallel Database (relational DB).
The reasons for considering the architectures discussed above are as follows:
• Data preparation and management of cost expenses.
• Multiple joins and aggregations.
• Data re-positioning according to the data model.
• Static data splitting.

A. Architectures Under Observation
First is the Parallel Database dominated architecture [18]. The output of Map-Reduce can be input to a query, and query output can be input to the Map-Reduce algorithm. The problems highlighted are fault tolerance and scalability.
Second is the Map-Reduce dominated architecture [18]. HiveQL is an interface for analytical operations over data, whereas the PigLatin language uses data operators to provide the interface for the data stream.
Third is the Nested architecture, whose representative systems are HadoopDB and Vertica [18]. It uses Hadoop for task distribution over the network and a relational database as the medium for query processing. Its problems include a scalability issue in Vertica regarding structured data and a performance issue in Hadoop.
In general, the problems highlighted are the low preprocessing speed of Map-Reduce, due to its batch-processing nature, and the dynamic data redistribution of the relational database.

B. Research Directions
• Pre-computing of multi-dimensional data
• Parallel analysis implementation
• Query sharing
• User interface
• Multi-dimensional indexing

VIII. EXPLORING THE PARADIGM OF BIG DATA WAREHOUSE
The Big Data Warehouse is a new software paradigm proposed in [19], which is quite different from traditional data warehouses in terms of schema, design methodology, principles, flexibility, and the factors regarding realization and adaptation to changing requirements. The main objective of exploring the big data warehouse is to find new models that yield new insights [19]. Furthermore, exploring and exploiting big data and dark data requires innovative data warehouse solutions. To do so, data analytics must be devoted to improving decision making. The areas discussed here are i) marketing analysis, ii) tweets and blogs, iii) the public sector (intelligence services), and iv) public administration.

A. Contribution in Analytics
An innovative methodological solution for the architecture of the big data warehouse is proposed in [19]. The model is based on a key-value approach for the representation of multidimensional definitions at the logical layer. It also discusses and motivates the approach of data analysis with new paradigms. Considering the incoming data management requirements, it performs well in comparison to the previous traditional method [19].

B. Proposed Approach
Francesco Di Tria et al. [19] proposed a method called the GrHyMM model. This is a key-value based model, preferred for unstructured data, without joins. To do so, the data warehouse needs to extend itself to new data sources. It comprises the following steps:
• Source integration
• Semantic ontology definition and its representation
• Building predicate logic for the vocabulary
• Entity description generation, i.e., logic-based descriptions according to the concepts of the database
• Similarity comparison, i.e., similarity comparison rules for analyzing the logical descriptions
• The conceptual definition of rule generation
• Requirement analysis, i.e., adopting the i* framework to represent business goals considering the actors: i) decision makers and ii) data warehouse
• Conceptual design, i.e., a multidimensional model for graph-oriented representation of the relational database
• Requirement analysis (attribute tree based on facts, reduced on derived constraints)
• Integrated schema (detecting relations for identifying facts)
• Re-engineering
• Logical design
• Incremental step (exploring new facts with changing requirements and data sources)
• Physical design

IX. DATA LAKE: DEFINITION
The data lake has emerged as a powerful architecture at a time when most organizations are shifting to mobile, cloud-based, or Internet of Things (IoT) platforms, in the context of increasing data volumes and market growth [20]. A data lake is a central repository for storing data regardless of its source, size, and format. The data might be structured or unstructured, and many storage and processing tools from the extended big data ecosystem are considered in order to gain efficient access to important data and to support organizational decisions [20]. Its difference from a traditional data warehouse is shown in Table 2.

Table 2: Difference between Traditional Data Warehouse and Data Lake

Traditional Data Warehouse | Data Lake
1. Pre-defined schema | No specific schema
2. Enterprise form | Native form
3. Pre-processing | No pre-processing
4. Schema-on-write | Schema-on-read
5. Conformed data model | No specific data model
6. Less flexible | More flexible
7. Limited questions answered | More questions answered
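The schema-on-write versus schema-on-read row of Table 2 can be made concrete with a small sketch. This is illustrative Python only; the two-column schema, the sample records, and the helper names are assumptions made for the example, not part of [20]:

```python
import json

# hypothetical warehouse schema: column name -> required type cast
SCHEMA = {"id": int, "name": str}

def write_to_warehouse(raw, table):
    # schema-on-write: validate and cast BEFORE storing;
    # a missing column or a bad value fails here, at load time
    row = {col: cast(raw[col]) for col, cast in SCHEMA.items()}
    table.append(row)

def write_to_lake(raw, lake):
    # schema-on-read: store the record as-is, in its native form
    lake.append(json.dumps(raw))

def read_from_lake(lake, wanted):
    # structure is imposed only at query time, per consumer
    for blob in lake:
        record = json.loads(blob)
        yield {col: record.get(col) for col in wanted}

table, lake = [], []
write_to_warehouse({"id": "1", "name": "alice"}, table)
write_to_lake({"id": 2, "name": "bob", "clicks": [3, 4]}, lake)
print(table)                                 # [{'id': 1, 'name': 'alice'}]
print(list(read_from_lake(lake, ["name"])))  # [{'name': 'bob'}]
```

The warehouse path rejects or casts data at load time, while the lake path stores the native record and leaves interpretation to each reader, which is what makes the lake more flexible but pushes validation to query time.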
A. Advantages of Data Lake
• Ability to derive valuable insights from heterogeneous data sources.
• Ability to acquire data in any format and store it in a general form in the data lake, e.g., from Customer Relationship Management (CRM) data to social media posts.
• Ability to refine the data as per understanding, requirements, and insight.
• Ability to explore complex query analysis on the data.
• Availability of tools to gain deep insights from the data.
• Elimination of data silos.
• Democratized data access via a single, unified view of data across the organization when using an effective data management platform.

B. Characteristics of Data Lake
• A single shared data repository (distributed file system).
• Includes orchestration and job scheduling (central operational management).
• Contains a set of applications and workflows to consume, process, or act upon data (data preservation).

X. COMPLEMENTING DATA LAKE WITH DATA WAREHOUSE
In recent times, the data lake has emerged as an essential development in data management [21]. It supports enterprise factors that are imperative in enterprise data management. In view of the competitive environment, research suggests a need to complement the enterprise data warehouse (EDW) with a data lake. This development will provide i) more flexibility, ii) fast data processing, iii) capture of continuous data streaming, and iv) freed-up bandwidth for Business Intelligence (BI) analytics.
In parallel with the above emergent narrative, a "managed" data lake uses a data lake management platform to perform operations such as i) ingestion, ii) applying metadata, and iii) enabling data governance. As a consequence, users can utilize the available information according to their competence.
Complementing with a data lake makes no change to the data warehouse [21]. Moreover, the data lake is also flexible with respect to data governance rules, based on the data ingestion approach.

A. Reasons behind Complementing Data Lake with Data Warehouse
• Blue sky: to be able to explore more about enhancing the data warehouse.
• Cut costs: reducing costs by leveraging commodity hardware.

B. Architecture Review
The reference architecture of the enterprise data warehouse is shown in Figure 1; its factors, proposed in [21], are described in the sections above.

Figure 1: Enterprise Data Warehouse Architecture

The significant difference between the enterprise data warehouse and the data lake architecture is the input data format: the data lake takes data in native format, while the data warehouse takes data in a pre-defined schema format [21]. Referencing the enterprise data warehouse shown in Figure 1, Figure 2 shows it complemented with a data lake.

Figure 2: Data Warehouse Complementing Data Lake

In line with this development, and to compete in the business environment, organizations procure a data lake to offload costly data processing [21] and to benefit in terms of data access and analysis, as shown in Figure 3.

Figure 3: Data Warehouse with Data Lake (offloading the processing)

XI. DEFINING COMPETITIVE INSIGHTS IN BUSINESS
In recent times, the emergence of big data has increased the competitiveness of the business environment. To compete with these challenges, organizations are shifting dimensions to sustain their current position in the market; to achieve something, one has to withstand difficulties [22]. An organization's competence is based on how it utilizes the opportunity of exploring and exploiting big data to gain a competitive benefit over others [22]. With respect to the business environment, organizations need to keep an eye on emerging challenges and ongoing customer demands regarding different insights [22].
For exploring new insights, first, the data needs to be relevant and consistent. With the adoption of changing technology infrastructure, benefit can be gained by integrating it with the current infrastructure to enhance organizational capabilities.

A. Guidelines for Effective Business Insights
• Initiate steps towards emerging technology for business growth.
• Enable multi-structural data streaming from variable sources.
• Establish a platform for supporting analytics, considering it a key performance indicator in achieving success.
• Ensure principles for exploiting information, data security, and confidentiality.
• Invest in human resource training for defining new business insights and making decision making effective.

XII. INDUSTRIAL PERSPECTIVE TOWARDS BIG DATA WAREHOUSE

A. Data and Analytics: Organizational Prospect
Data and analytics are an integral part of the innovative technology adopted by organizations. That said, some factors are important in the successful implementation of data and analytics, such as i) appropriate tools, ii) data champions, iii) the significance of organizational components, and iv) a clear strategy. Cultural resistance is also considered one of the reasons behind data and analytics failures across organizations [23]. In support of the above argument, Gartner Inc. has presented a calculated estimation that data and analytics projects are delayed most of the time. The reasons for project delay are i) organizational structure, ii) appropriate talent acquisition, and iii) non-alignment with business strategy.

B. Information Governance Rules
Data streaming has increased with the presence of different circles of information [24]. 'Dark data' is data that we have but do not know what to do with. Gartner Inc. defines 'dark data' as an "information asset that organization acquires and pre-process as per business activities, but fail to utilize it for some purpose". Dark data is useful if it is managed properly; otherwise, it will be problematic for the organization. To this end, six governance rules that organizations have to adopt to manage dark data properly are proposed in [24], as follows:
• Define and identify your dark data, i.e., determine whether the source of the data is user-generated or system-generated. This does not guarantee that processing will pay off in the end.
• Cost-benefit analysis of data usefulness (short retention time frame).
• What to keep and what to discard (business requirements).
• Data mapping during retention and disposition to justify deletion (addressing categorization during retention).
• Secure disposition plan execution.
• Determine the benefit value of dark data.

C. Query Processing for Big Data
In recent times, the big data environment has expanded with growing technology insights [25]. However, big data processing was initially based on Map-Reduce, which is batch-oriented processing. As a consequence, data processing is slow and limited to data-at-rest. To overcome this issue, organizations are shifting to SQL-based Hadoop [25]. These are analytical application tools that combine the SQL query style with the Hadoop framework to speed up analytical queries and make data integration simpler. Their execution consists of several steps, which are i) connectors for translating a SQL query into Map-Reduce format, ii) a 'push down' that forgoes Map-Reduce and executes SQL within the Hadoop cluster, and iii) job distribution, allocating SQL procedures among the different nodes of the data clusters. Premier Inc., a healthcare information system provider, has shifted from a data warehouse infrastructure to a Hadoop cluster [25]. Furthermore, because Map-Reduce is batch-oriented processing, it does not support a web-based business intelligence dashboard. To overcome this issue, Premier Inc. shifted to Impala, Cloudera Inc.'s SQL-based Hadoop engine, which provides faster Hadoop query performance with a framework similar to the previous Hadoop framework [25]. 'Using the SQL syntax optimizes our development work,' said a senior representative of the organization [25]. Overall, adopting SQL-based Hadoop saves much of the cost of higher-end processing platforms.

XIII. CONCLUSION AND FUTURE WORK
In this survey, Big Data, its problems, insights in analytics, the Big Data Warehouse, and its challenges have been reviewed. Besides, it explains the data warehouse and its related literature. Finally, it sums up with the approaches of big data augmentation with the data warehouse and its challenges in data and analytics to support effective decision making, along with the industrial perspective. The motivation for this literature is to support the research initiative: 'how can we integrate and transform big data with the data sitting inside the data warehouse to perform effective decision making'.
In future work, our focus will not only be restricted to the integration of data management platforms to improve decision making, but also on the business models that will have a certain significance in analytics. This survey also overviews the prospects that support business analytical dynamics. First are predictive analytical models, which are a source to steer towards effective strategies for business; they help in forecasting and predicting future results [26], [27]. With integration into analytical applications, they can improve business decision making [28]. Second is data quality: usually, business decision making is data-driven.
But, with an excessive amount of incoming data, it may lose its objectivity, so more measures need to be explored in order to prepare good data for analysis [29], [30].

REFERENCES
[1] Mehdi Gheisari, Guojun Wang, Md Zakirul Alam Bhuiyan, "A survey on deep learning in big data", presented at IEEE International Conference on Computational Science and Engineering (CSE) and IEEE International Conference on Embedded and Ubiquitous Computing (EUC), 2017.
[2] Gore Sumit Sureshrao, Ambulgekar H. P., "MapReduce-based warehouse systems: a survey", presented at IEEE International Conference on Advances in Engineering & Technology Research (ICAETR-2014), August 2014.
[3] A. Sunny Kumar, "Performance analysis of MySQL partition, Hive partition-bucketing, and Apache Pig", IEEE, 2016.
[4] M. Ptiček and B. Vrdoljak, "MapReduce research on warehousing of big data", MIPRO 2017, May 22-26, 2017.
[5] Hai-Fei Qin, Zhi-Ming Qian, Yong-Chao Zhao, "On the research of data warehouse in big data", presented at International Conference on Network and Information Systems for Computers, 2015.
[6] Alfredo Cuzzocrea, "Analytics over big data: exploring the convergence of data warehousing, OLAP, and data-intensive cloud infrastructures", presented at IEEE 37th Annual Computer Software and Applications Conference, 2013.
[7] Francesco Di Tria, Ezio Lefons, and Filippo Tangorra, "Design process for big data warehouses", presented at Data Science and Advanced Analytics (DSAA), International Conference, 2014.
[8] M. D. Anto Praveena, B. Bharathi, "A survey paper on big data analytics", presented at International Conference on Information, Communication & Embedded Systems (ICICES 2017).
[9] Nayem Rahman, "An empirical study of data warehouse implementation effectiveness", International Journal of Management Science and Engineering Management, 2017, vol. 12, no. 1, pp. 55-63.
[10] R. L. Hayen, C. D. Rutashobya, and D. E. Vetter, "An investigation of the factors affecting data warehousing success", International Association for Computer Information Systems (IACIS), vol. 8, pp. 547-553, 2007.
[11] A. AlMabhouh and A. Ahmad, "Identifying quality factors within the data warehouse", Proceedings of the Second International Conference on Computer Research and Development, IEEE, pp. 65-72.
[12] Moh'd Alsqour, Mieczyslaw L. Owoc, Abdulrahman S. Ahmad, "Data warehouse as a source of knowledge acquisition: an empirical study", Proceedings of the 2014 Federated Conference on Computer Science and Information Systems, pp. 1421-1430, ACSIS, vol. 2.
[13] Nayem Rahman, "Measuring performance for data warehouses - a balanced scorecard approach", IJCIT, ISSN 2078-5828 (print), ISSN 2218-5224 (online), vol. 04, issue 01, manuscript code 130701, 2013.
[14] W. DeLone and E. McLean, "The DeLone and McLean model of information systems success: a ten-year update", Journal of Management Information Systems, vol. 19, no. 4, pp. 9-30, 2003.
[15] P. Christen, "A survey of indexing techniques for scalable record linkage and deduplication", IEEE Transactions on Knowledge and Data Engineering, 24(9): 1537-1555, 2012.
[16] Anosh Fatima, Nosheen Nazir, Muhammad Gufran Khan, "Data cleaning in data warehouse: a survey of data pre-processing techniques and tools", I.J. Information Technology and Computer Science, 2017, 3, pp. 50-61, published online March 2017 in MECS (http://www.mecs-press.org/).
[17] Alfredo Cuzzocrea, "Analytics over big data: exploring the convergence of data warehousing, OLAP, and data-intensive cloud infrastructures", 2013 IEEE 37th Annual Computer Software and Applications Conference.
[18] Lihua Sun, Mu Hu, Kaiyin Ren, Mingming Ren, "Present situation and prospect of data warehouse architecture under the background of big data", 2013 International Conference on Information Science and Cloud Computing Companion.
[19] Francesco Di Tria, Ezio Lefons, Filippo Tangorra, "Design process for big data warehouses", Data Science and Advanced Analytics (DSAA), 2014 International Conference.
[20] Zaloni Inc., "Defining the data lake", white paper [online], available: https://resources.zaloni.com/white-papers/defining-the-data-lake-white-paper.
[21] Zaloni Inc., "Why your data warehouse needs a data lake and how to make them work together", white paper [online], available: https://resources.zaloni.com/white-papers/dw-augmentation-white-paper
[22] Capgemini, "Big & fast data: the rise of insight-driven business; insights at the point of action will redefine competitiveness" [online], available: https://www.capgemini.com/resources/big-fast-data-the-rise-of-insight-driven-business/
[23] Carl Carande, Paul Lipinski, Traci Gusher (2017, June), "How to integrate data and analytics into every part of your organization", available: https://hbr.org/2017/06/how-to-integrate-data-and-analytics-into-every-part-of-your-organization.
[24] Fred A. Pulzello (2014, September), "Six steps to 'dark data' information governance", ARMA International [online], available: http://searchcompliance.techtarget.com/tip/Six-steps-to-dark-data-information-governance.
[25] Craig Stedman (2016, April), "SQL engines boost Hadoop query processing for big data users", SearchBusinessAnalytics [online], available: http://searchbusinessanalytics.techtarget.com/feature/SQL-engines-boost-Hadoop-query-processing-for-big-data-users.
[26] A. Yu. Dorogov, "Technologies of predictive analytics for big data", 2015 XVIII International Conference on Soft Computing and Measurements (SCM).
[27] Mykola Pechenizkiy, "Predictive analytics on evolving data streams", 2015 International Conference on High Performance Computing & Simulation (HPCS).
[28] Parth Wazurkar, Robin Singh Bhadoria, Dhananjai Bajpai, "Predictive analytics in data science for business intelligence solutions", 2017 7th International Conference on Communication Systems and Network Technologies (CSNT).
[29] Pengcheng Zhang, Fang Xiong, Jerry Gao, Jimin Wang, "Data quality in big data processing: issues, solutions and open problems", 2017 IEEE SmartWorld (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI).
[30] Ikbal Taleb, Mohamed Adel Serhani, "Big data pre-processing: closing the data quality enforcement loop", 2017 IEEE International Congress on Big Data (BigData Congress).