
doi:10.1145/1978542.1978562

BI technologies are essential to running today's businesses, and this technology is going through sea changes.

By Surajit Chaudhuri, Umeshwar Dayal, and Vivek Narasayya

An Overview of Business Intelligence Technology

Business intelligence (BI) software is a collection of decision support technologies for the enterprise aimed at enabling knowledge workers such as executives, managers, and analysts to make better and faster decisions. The past two decades have seen explosive growth, both in the number of products and services offered and in the adoption of these technologies by industry. This growth has been fueled by the declining cost of acquiring and storing very large amounts of data arising from sources such as customer transactions in banking, retail as well as in e-businesses, RFID tags for inventory tracking, email, query logs for Web sites, blogs, and product reviews. Enterprises today collect data at a finer granularity, which is therefore of much larger volume. Businesses are leveraging their data asset aggressively by deploying and experimenting with more sophisticated data analysis techniques to drive business decisions and deliver new functionality such as personalized offers and services to customers. Today, it is difficult to find a successful enterprise that has not leveraged BI technology for its business. For example, BI technology is used in manufacturing for order shipment and customer support, in retail for user profiling to target grocery coupons during checkout, in financial services for claims analysis and fraud detection, in transportation
for fleet management, in telecommunications for identifying reasons for customer churn, in utilities for power usage analysis, and in health care for outcomes analysis.

key insights

- The cost of data acquisition and data storage has declined significantly. This has increased the appetite of businesses to acquire very large volumes of data in order to extract as much competitive advantage from it as possible.
- New massively parallel data architectures and analytic tools go beyond traditional parallel SQL data warehouses and OLAP engines.
- The need to shorten the time lag between data acquisition and decision making is spurring innovations in business intelligence technologies.

A typical architecture for supporting BI within an enterprise is shown in Figure 1 (the shaded boxes are technology that we focus on in this article). The data over which BI tasks are performed often comes from different sources—typically from multiple operational databases across departments within the organization, as well as external vendors. Different sources contain data of varying quality and use inconsistent representations, codes, and formats, which have to be reconciled. Thus the problems of integrating, cleansing, and standardizing data in preparation for BI tasks can be rather challenging. Efficient data loading is imperative for BI. Moreover, BI tasks usually need to be performed incrementally as new data arrives, for example, last month's sales data. This makes efficient and scalable data loading and refresh capabilities imperative for enterprise BI. These back-end technologies for preparing the data for BI are collectively referred to as Extract-Transform-Load (ETL) tools.

Figure 1. Typical business intelligence architecture. (Data sources—operational databases and external sources—feed data movement and streaming engines: Extract-Transform-Load (ETL) and Complex Event Processing engines. These load data warehouse servers—relational DBMSs and MapReduce engines—which are complemented by mid-tier servers: OLAP servers, enterprise search engines, data mining and text analytic engines, and reporting servers. Front-end applications include search, spreadsheets, dashboards, and ad hoc query tools.)
Increasingly, there is a need to support BI tasks in near real time, that is, to make business decisions based on the operational data itself. Specialized engines referred to as Complex Event Processing (CEP) engines have emerged to support such scenarios.

The data over which BI tasks are performed is typically loaded into a repository called the data warehouse that is managed by one or more data warehouse servers. A popular choice of engines for storing and querying warehouse data is relational database management systems (RDBMS). Over the past two decades, several data structures, optimizations, and query processing techniques have been developed primarily for executing complex SQL queries over large volumes of data—a key requirement for BI. An example of such an ad hoc SQL query is: find customers who have placed an order during the past quarter whose amount exceeds the average order amount by at least 50%. Large data warehouses typically deploy parallel RDBMS engines so that SQL queries can be executed over large volumes of data with low latency.
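To make such a query concrete, here is a minimal Python sketch that evaluates the example over a handful of in-memory order records; the field names, dates, and rows are hypothetical, and a real warehouse would run the equivalent SQL over billions of rows using the parallel techniques discussed later.

    from datetime import date

    # Hypothetical order rows; a warehouse would store billions of these.
    orders = [
        {"custid": "C1", "amount": 900.0, "order_date": date(2011, 5, 10)},
        {"custid": "C2", "amount": 120.0, "order_date": date(2011, 5, 12)},
        {"custid": "C1", "amount": 300.0, "order_date": date(2011, 6, 1)},
    ]

    # Restrict to the past quarter.
    q_start, q_end = date(2011, 4, 1), date(2011, 6, 30)
    in_quarter = [o for o in orders if q_start <= o["order_date"] <= q_end]

    # Average order amount over the quarter.
    avg_amount = sum(o["amount"] for o in in_quarter) / len(in_quarter)

    # Customers with an order exceeding the average by at least 50%.
    result = {o["custid"] for o in in_quarter if o["amount"] >= 1.5 * avg_amount}
    print(result)  # {'C1'}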
As more data is born digital, there is increasing desire to architect low-cost data platforms that can support much larger data volume than that traditionally handled by RDBMSs. This is often described as the "Big Data" challenge. Driven by this goal, engines based on the MapReduce9 paradigm—originally built for analyzing Web documents and Web search query logs—are now being targeted for enterprise analytics. Such engines are currently being extended to support complex SQL-like queries essential for traditional enterprise data warehousing scenarios.

Data warehouse servers are complemented by a set of mid-tier servers that provide specialized functionality for different BI scenarios. Online analytic processing (OLAP) servers efficiently expose the multidimensional view of data to applications or users and enable the common BI operations such as filtering, aggregation, drill-down, and pivoting. In addition to traditional OLAP servers, newer "in-memory BI" engines are appearing that exploit today's large main memory sizes to dramatically improve the performance of multidimensional queries. Reporting servers enable definition, efficient execution, and rendering of reports—for example, report total sales by region for this year and compare with sales from last year. The increasing availability and importance of text data such as product reviews, email, and call center transcripts for BI brings new challenges. Enterprise search engines support the keyword search paradigm over text and structured data in the warehouse (for example, find email messages, documents, and the history of purchases and support calls related to a particular customer), and have become a valuable tool for BI over the past decade. Data mining engines enable in-depth analysis of data that goes well beyond what is offered by OLAP or reporting servers, and provide the ability to build predictive models to help answer questions such as: which existing customers are likely to respond to my upcoming catalog mailing campaign? Text analytic engines can analyze large amounts of text data (for example, survey responses or comments from customers) and extract valuable information that would otherwise require significant manual effort, for example, which products are mentioned in the survey responses and the topics that are frequently discussed in connection with those products.

There are several popular front-end applications through which users perform BI tasks: spreadsheets, enterprise portals for searching, performance management applications that enable decision makers to track key performance indicators of the business using visual dashboards, tools that allow users to pose ad hoc queries, viewers for data mining models, and so on. Rapid, ad hoc visualization of data can enable dynamic exploration of patterns and outliers, and help uncover relevant facts for BI.

In addition, there are other BI technologies (not shown in Figure 1) such as Web analytics, which enables understanding of how visitors to a company's Web site interact with the pages; for example, which landing pages are likely to encourage the visitor to make a purchase. Likewise, vertical packaged applications such as customer relationship management (CRM) are widely used. These applications often support built-in analytics; for example, a CRM application might provide functionality to segment customers into those most likely and least likely to repurchase a particular product.

Another nascent but important area is mobile BI, which presents opportunities for enabling novel and rich BI applications for knowledge workers on mobile devices.

In this short article, we are not able to provide comprehensive coverage of all technologies used in BI (see Chaudhuri et al.5 for additional details on some of these technologies). We therefore chose to focus on technology where research can play, or has historically played, an important role. In some instances, these technologies are mature but challenging research problems still remain—for example, data storage, OLAP servers, RDBMSs, and ETL tools. In other instances, the technology is relatively new with several open research challenges, for example, MapReduce engines, near real-time BI, enterprise search, data mining and text analytics, and cloud data services.

Data Storage
Access structures. Decision support queries require operations such as filtering, join, and aggregation. To efficiently support these operations, special data structures (not typically required for OLTP queries) have been developed in RDBMSs, described here. Access structures used in specialized OLAP engines that do not use RDBMSs are discussed later.

Index structures. An index enables associative access based on values of a particular column. When a query has one or more filter conditions, the selectivities of these conditions can be exploited through index scans (for example, an index on the StoreId column can help retrieve all sales for StoreId = 23) and index intersection (when multiple conditions exist). These operations can significantly reduce, and in some cases eliminate, the need to access the base tables, for example, when the index itself contains all columns required to answer the query. Bitmap indexes support efficient index operations such as union and intersection. A bitmap index on a column uses one bit per record for each value in the domain of that column. To process a query of the form column1 = val1 AND column2 = val2 using bitmap indexes, we identify the qualifying records by taking the bitwise AND of the respective bit vectors. While such representations are very effective for low cardinality domains (for example, gender), they can also be used for higher cardinality domains using bitmap compression.
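As a minimal illustration of how bitmap indexes answer conjunctive filters, the following Python sketch builds one bit vector per distinct value of two hypothetical columns (StoreId and ShipMode) and intersects them with a bitwise AND; arbitrary-precision integers stand in for the packed bit vectors a real engine would use.

    store_id = [23, 7, 23, 23, 7]                      # column1, one value per row
    ship_mode = ["AIR", "AIR", "SHIP", "AIR", "AIR"]   # column2, one value per row

    def build_bitmap_index(column):
        # One bit vector per distinct value; bit i is set if row i has the value.
        index = {}
        for row, value in enumerate(column):
            index[value] = index.get(value, 0) | (1 << row)
        return index

    store_idx = build_bitmap_index(store_id)
    mode_idx = build_bitmap_index(ship_mode)

    # StoreId = 23 AND ShipMode = 'AIR': bitwise AND of the two bit vectors.
    qualifying = store_idx[23] & mode_idx["AIR"]
    print([r for r in range(len(store_id)) if qualifying & (1 << r)])  # [0, 3]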
Materialized views. Reporting queries often require summary data, for example, aggregate sales of the most recent quarter and the current fiscal year. Hence, precomputing and materializing summary data (also referred to as materialized views) can help dramatically accelerate many decision support queries. The greatest strength of a materialized view is its ability to specifically target certain queries by effectively caching their results. However, this very strength can also limit its applicability, that is, for a slightly different query it may not be possible to use the materialized view to answer that query. This is in contrast to an index, which is a much more general structure, but whose impact on query performance may not be as dramatic as a materialized view. Typically, a good physical design contains a judicious mix of indexes and materialized views.
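The caching intuition behind materialized views can be sketched in a few lines of Python: an aggregate is precomputed once and reused by any query it matches exactly, while other queries must fall back to the base data. The sales rows and grouping here are hypothetical.

    # Hypothetical base fact rows: (product, quarter, amount).
    sales = [("Milk", "Q1", 10.0), ("Soap", "Q1", 5.0), ("Milk", "Q2", 7.0)]

    # "Materialized view": precomputed total sales per (product, quarter).
    view = {}
    for product, quarter, amount in sales:
        view[(product, quarter)] = view.get((product, quarter), 0.0) + amount

    def total_sales(product, quarter):
        # A matching query is answered from the view without scanning sales.
        return view.get((product, quarter), 0.0)

    print(total_sales("Milk", "Q1"))  # 10.0, served from the view

A query that groups by month rather than quarter would not match this view and would have to scan the base table (or use a different view), which is precisely the generality trade-off against indexes noted above.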
Partitioning. Data partitioning can be used to improve both performance (discussed later) and manageability. Partitioning allows tables and indexes to be divided into smaller, more manageable units. Database maintenance operations such as loading and backup can be performed on partitions rather than an entire table or index. The common types of partitioning supported today are hash and range. Hybrid schemes that first partition by range, followed by hash partitioning within each range partition, are also common.

Column-oriented storage. Traditional commercial relational database engines store data in a row-oriented manner, that is, the values of all columns for a given row in a table are stored contiguously. The Sybase IQ product30 pioneered the use of column-oriented storage, where all values of a particular column are stored contiguously. This approach optimizes for "read-mostly" workloads of ad hoc queries. The column-oriented representation has two advantages. First, significantly greater data compression is possible than in a row-oriented store, since data values within a column are typically much more repetitive than across columns. Second, only the columns accessed in the query need to be scanned. In contrast, in a row-oriented store, it is not easy to skip columns that are not accessed in the query. Together, this can result in reduced time for scanning large tables.


Finally, we note that in the past decade, major commercial database systems have added automated physical design tools that can assist database administrators (DBAs) in choosing appropriate access structures (see Chaudhuri and Narasayya7 for an overview) based on workload information, such as queries and updates executed on the system, and constraints, for example, the total storage allotted to access structures.

Data compression. Compression can have significant benefits for large data warehouses. First, it can reduce the amount of data that needs to be scanned, and hence the I/O cost of the query. Second, since compression reduces the amount of storage required for a database, it can also lower storage and backup costs. A third benefit is that compression effectively increases the amount of data that can be cached in memory, since pages can be kept in compressed form and decompressed only on demand. Fourth, certain common query operations (for example, equality conditions, duplicate elimination) can often be performed on the compressed data itself, without having to decompress the data. Finally, compressing data that is transferred over the network effectively increases the available network bandwidth; this is important for parallel DBMSs where data must be moved across nodes. Data compression plays a key role not just in relational DBMSs, but also in other specialized engines, for example, in OLAP.

There are different compression techniques used in relational DBMSs. Null suppression leverages the fact that several commonly used data types in DBMSs are fixed length (for example, int, bigint, datetime, money), and significant compression is possible if they are treated as variable length for storage purposes: only the non-null part of the value is stored, along with the actual length of the value. Dictionary compression identifies repetitive values in the data and constructs a dictionary that maps such values to more compact representations. For example, a column that stores the shipping mode for an order may contain string values such as 'AIR', 'SHIP', 'TRUCK'; each value can be represented using two bits by mapping the values to 0, 1, and 2, respectively. Finally, unlike compression schemes in row-oriented stores, where each instance of a value requires an entry (potentially with fewer bits), in column-oriented stores other compression techniques such as run-length encoding (RLE) can become more effective. In RLE compression, a sequence of k instances of value v is encoded by the pair (v,k). RLE is particularly attractive when long runs of the same value occur; this can happen for columns with relatively few distinct values, or when the column values are sorted.
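The following Python sketch illustrates dictionary compression and RLE on a hypothetical ShipMode column; sorting the values first creates the long runs that make RLE effective.

    # Hypothetical column values, sorted so equal values form runs.
    ship_mode = ["AIR", "AIR", "AIR", "SHIP", "SHIP", "TRUCK"]

    # Dictionary compression: map each distinct string to a small integer code.
    dictionary = {v: code for code, v in enumerate(sorted(set(ship_mode)))}
    encoded = [dictionary[v] for v in ship_mode]   # [0, 0, 0, 1, 1, 2]

    # Run-length encoding: a run of k instances of value v becomes (v, k).
    def rle(values):
        runs, count = [], 1
        for prev, cur in zip(values, values[1:]):
            if cur == prev:
                count += 1
            else:
                runs.append((prev, count))
                count = 1
        runs.append((values[-1], count))
        return runs

    print(rle(encoded))  # [(0, 3), (1, 2), (2, 1)]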
There are several interesting technical challenges in data compression. First, new compression techniques suitable for large data warehouses, and incurring an acceptable trade-off with decompression and update costs, are important. Second, even for known compression techniques, important open problems remain; for example, for RLE, the choice of sort order of the table can significantly affect the amount of compression possible, and determining the best sort order to use is a non-trivial optimization problem. Finally, the decision of whether to compress access structures is workload dependent. Thus, there is a need for automated physical design tools to also recommend which access structures should be compressed, and how, based on workload information.

Figure 2. Multidimensional data. (Measures such as sales sit in a Product × City × Date cube; dimensional hierarchies relate City to Region to Country, Product to Category to Industry, and Date to Week, Month, Quarter, and Year.)

Query Processing
A popular conceptual model used for BI tasks is the multidimensional view of data, as shown in Figure 2. In a multidimensional data model, there is a set of numeric measures that are the objects of analysis. Examples of such measures are sales, budget, revenue, and inventory. Each of the numeric measures is associated with a set of dimensions, which provide the context for the measure. For example, the dimensions associated with a sale amount can be the Product, the City, and the Date when the sale was made. Thus, a measure can be viewed as a value in the multidimensional space of dimensions. Each dimension is described by a set of attributes; for example, the Product dimension may consist of the following attributes: the category, industry, model number, and year of its introduction. The attributes of a dimension may be related via a hierarchy of relationships. For example, a product is related to its category and industry attributes through a hierarchical relationship (Figure 2). Another distinctive feature of the conceptual model is its stress on aggregation of measures by one or more dimensions; for example, computing and ranking the total sales by each country for each year.
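The multidimensional view and a rollup along a dimension hierarchy can be sketched in Python as follows; the cube cells and the City-to-Country mapping are hypothetical stand-ins for the dimensions of Figure 2.

    # Sales measure indexed by (product, city, month) coordinates.
    cube = {
        ("Milk", "New York", "Jan"): 10.0,
        ("Milk", "L.A.", "Jan"): 4.0,
        ("Soap", "New York", "Feb"): 6.0,
    }

    # One level of the City -> Country dimensional hierarchy.
    country_of = {"New York": "USA", "L.A.": "USA"}

    # Roll up: aggregate the measure by country, collapsing other dimensions.
    totals = {}
    for (product, city, month), amount in cube.items():
        country = country_of[city]
        totals[country] = totals.get(country, 0.0) + amount
    print(totals)  # {'USA': 20.0}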


OLAP servers. Online analytic processing (OLAP) supports operations such as filtering, aggregation, pivoting, rollup, and drill-down on the multidimensional view of the data. OLAP servers are implemented using either a multidimensional storage engine (MOLAP), a relational DBMS engine (ROLAP) as the backend, or a hybrid combination called HOLAP.

MOLAP servers. MOLAP servers directly support the multidimensional view of data through a storage engine that uses the multidimensional array abstraction. They typically precompute large data cubes to speed up query processing. Such an approach has the advantage of excellent indexing properties and fast query response times, but provides relatively poor storage utilization, especially when the data set is sparse. To better adapt to sparse data sets, MOLAP servers identify dense and sparse regions of the data, and store/index these regions differently. For example, dense sub-arrays of the cube are identified and stored in array format, whereas the sparse regions are compressed and stored separately.

ROLAP servers. In ROLAP, the multidimensional model and its operations have to be mapped into relations and SQL queries. ROLAP servers rely on the data storage techniques described earlier to speed up relational query processing. They may also need to implement functionality not supported in SQL, for example, extended aggregate functions such as median, mode, and time-window-based moving average. The database designs used in ROLAP are optimized for efficiency in querying and in loading data. Most ROLAP systems use a star schema to represent the multidimensional data model. The database consists of a single fact table and a single table for each dimension. Each row in the fact table consists of a pointer (a.k.a. foreign key) to each of the dimensions that provide its multidimensional coordinates, and stores the numeric measures for those coordinates. Each dimension table consists of columns that correspond to attributes of the dimension. Star schemas do not explicitly provide support for attribute hierarchies. Snowflake schemas (shown in Figure 3) provide a refinement of star schemas where the dimensional hierarchy is explicitly represented by normalizing the dimension tables. This leads to advantages in maintaining the dimension tables.

HOLAP servers. The HOLAP architecture combines ROLAP and MOLAP by splitting the storage of data between a MOLAP store and a relational store. Splitting the data can be done in different ways. One method is to store the detailed data in an RDBMS as ROLAP servers do, and to precompute aggregated data in MOLAP. Another method is to store more recent data in MOLAP to provide faster access, and older data in ROLAP. Since MOLAP performs better when the data is reasonably dense and ROLAP performs better for sparse data, HOLAP servers, like MOLAP servers, perform density analysis to identify sparse and dense sub-regions of the multidimensional space. All major data warehouse vendors today offer OLAP servers (for example, IBM Cognos,15 Microsoft SQL Server Analysis Services,17 and Oracle Hyperion23).
In-memory BI engines. Technology trends are providing an opportunity for a new class of OLAP engines focused on exploiting large main memory to make response times for ad hoc queries interactive. First, the ratio of the time to access data on disk vs. data in memory is increasing. Second, with 64-bit operating systems becoming common, very large addressable memory sizes (for example, 1TB) are possible. Third, the cost of memory has dropped significantly, which makes servers with large amounts of main memory affordable. Unlike traditional OLAP servers, in-memory BI engines (for example, QlikView24) rely on a different set of techniques for achieving good performance. First, since the detailed data is memory resident, they avoid the expensive I/Os required to access data cubes, indexes, or materialized views. Second, they use data structures that would not be suitable for disk-based access but are very effective for in-memory access. For example, consider a query that computes the total sales for each customer in a particular state. When the data is initially loaded into the system, the engine can associate pointers from each state to the customers in that state, and similarly pointers from a customer to all the order detail records for that customer. This allows the fast associative access required to answer the query quickly, and is reminiscent of approaches used by object-oriented databases as well as of optimizations in traditional DBMSs such as join indices. Third, in-memory BI engines can significantly increase the effective data sizes over which they can operate in memory by using data organization techniques such as column-oriented storage and data compression. Because of data decompression costs, in-memory BI engines are best suited for read-mostly data without in-place updates, where new data arrives primarily in the form of incremental batch inserts.
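The pointer-based organization described above can be mimicked with ordinary Python references; the state, customer, and order records here are hypothetical.

    # "Pointers" built at load time: state -> customers, customer -> order amounts.
    customers_in_state = {"WA": ["C1", "C2"]}
    orders_of = {"C1": [250.0, 100.0], "C2": [75.0]}

    # Total sales per customer in state 'WA' by direct pointer traversal,
    # with no disk-resident indexes, cubes, or materialized views involved.
    result = {c: sum(orders_of[c]) for c in customers_in_state["WA"]}
    print(result)  # {'C1': 350.0, 'C2': 75.0}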
Figure 3. Snowflake schema. (OrderDetails is the central fact table, with foreign keys to normalized dimension tables: Order (OrderNo, OrderDate) linked to Date (DateKey, Date, Month) and Month (Month, Year); SalesPerson (SalesPersonID, Name, City, Quota); City (CityName, State); Customer (CustomerNo, Name, Address, City); and Product (ProductNo, Name, Description, Category, UnitPrice, QOH) linked to Category (CategoryName, Description).)

Relational servers. Relational database servers (RDBMSs) have traditionally served as the backend of large data warehouses. Such data warehouses need to be able to execute complex SQL queries as efficiently as possible against very large databases. The first key technology needed to achieve this is query optimization, which takes a complex query and compiles it into an execution plan. To ensure that the execution plan can scale well to
large databases, data partitioning and parallel query processing are leveraged extensively (see Graefe13 for an overview of query processing techniques). We therefore discuss two pieces of key technology—query optimization and parallel query processing.

Query optimization technology has been a key enabler for BI. The query optimizer is responsible for selecting an execution plan for answering a query. The execution plan is a composition of physical operators (such as Index Scan, Hash Join, Sort) that, when evaluated, generates the results of the query. The performance of a query crucially depends on the ability of the optimizer to choose a good plan from a very large space of alternatives. The difference in execution time between a good and a bad plan for such complex queries can be several orders of magnitude (for example, days instead of minutes). This topic has been of keen interest in database research and industry (an overview of the field appears in Chaudhuri4). Following the pioneering work done in the System R optimizer from IBM Research in the late 1970s, the next major architectural innovation came about a decade later: extensible optimizers. This allowed system designers to "plug in" new rules and extend the capabilities of the optimizer. For example, a rule could represent an equivalence in relational algebra (for example, pushing down an aggregation below a join). Application of such rules can potentially transform the execution plan into one that executes much faster. Extensible optimizers allowed many important optimizations developed by industry and research over the years to be incorporated relatively easily, without having to repeatedly modify the search strategy of the optimizer.

Despite the success of query optimization and the crucial role it plays in BI, many fundamental challenges still remain. The optimizer needs to address the inherently difficult problem of estimating the cost of a plan, that is, the total work (CPU, I/O, among others) done by the plan. However, constrained by the requirement to impose only a small overhead, the optimizer typically uses limited statistical information such as histograms describing a column's data distribution. Such approximations sometimes result in brittleness, since large inaccuracies can lead to the generation of very poor plans. There has been research on leveraging feedback from query execution to overcome errors made by the query optimizer, by observing actual query execution behavior (for example, the actual result size of a query expression) and adjusting the execution plan if needed. However, collecting and exploiting feedback at low overhead is also challenging, and much more work is needed to realize the benefits of this approach.

Parallel processing and appliances. Parallelism plays a significant role in processing queries over massive databases. Relational operators such as selection, projection, join, and aggregation present many opportunities for parallelism. The basic paradigm is data parallelism, that is, to apply relational operators in parallel on disjoint subsets of data (partitions), and then combine the results. The article by DeWitt and Gray10 provides an overview of work in this area. For several years now, all major vendors of database management systems have offered data partitioning and parallel query processing technology. There are two basic architectures for parallelism: shared disk, where each processor has private memory but shares disks with all other processors; and shared nothing, where each processor has private memory and disk and is typically a low-cost commodity machine. Interestingly, while these architectures date back about two decades, neither has yet emerged as a clear winner in the industry, and successful implementations of both exist today.

In shared disk systems, all nodes have access to the data via shared storage, so there is no need to a priori partition the data across nodes as in the shared nothing approach. During query processing, there is no need to move data across nodes. Moreover, load balancing is relatively simple since any node can service any request. However, there are a couple of issues that can affect the scalability of shared disk systems. First, the nodes need to communicate in order to ensure data consistency. Typically this is implemented via a distributed lock manager, which can incur non-trivial overhead. Second, the network must support the combined I/O bandwidth of all processors, and can become a bottleneck. Shared disk systems are relatively cost effective for small- to medium-sized data warehouses.

In shared nothing systems (for example, Teradata31), data needs to be distributed across nodes a priori. They have the potential to scale to much larger data sizes than shared disk systems. However, the decision of how to effectively distribute the data across nodes is crucial for performance and scalability. This is important both from the standpoint of leveraging parallelism and to reduce the amount of data that needs to be transferred over the network during query processing. Two key techniques for data distribution are partitioning and cloning. For example, consider a large database with the schema shown in Figure 3. Each of the two large fact tables, Orders and OrderDetails, can be hash partitioned across all nodes on the OrderNo attribute, that is, on the attribute on which the two tables are joined. All other dimension tables, which are relatively small, could be cloned (replicated) on each node. Now consider a query that joins Customers, Orders, and OrderDetails. This query can be processed by issuing one query per node, each operating on a subset of the fact data and joining with the entire dimension table. As a final step, the results of each of these queries are sent over the network to a single node that combines them to produce the final answer to the query.
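A toy Python sketch of this partitioning-plus-cloning strategy follows; the two-node "cluster", the rows, and the modulo hash are hypothetical simplifications of what a parallel DBMS does.

    # Fact rows: (order_no, customer_no, amount); the dimension is cloned everywhere.
    order_details = [(1, "C1", 10.0), (2, "C2", 20.0), (3, "C1", 5.0)]
    customers = {"C1": "Ann", "C2": "Bob"}   # small dimension table, one copy per node

    NODES = 2
    # Hash-partition the fact table across nodes on the join attribute.
    partitions = [[] for _ in range(NODES)]
    for row in order_details:
        partitions[row[0] % NODES].append(row)

    # Each node joins its fact partition with its full local dimension copy.
    def local_join(partition):
        return [(customers[cust], amount) for (_, cust, amount) in partition]

    # Final step: one node combines the per-node results.
    result = [r for partition in partitions for r in local_join(partition)]
    print(result)  # [('Bob', 20.0), ('Ann', 10.0), ('Ann', 5.0)]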
Data warehouse appliances. Recently a new generation of parallel DBMSs referred to as data warehouse appliances (for example, Netezza19) has appeared. An appliance is an integrated set of server and storage hardware, operating system, and DBMS software, specifically pre-installed and pre-optimized for data warehousing. These appliances have gained impetus from the following trends. First, since DW appliance vendors control the full hardware/software stack, they can offer the more attractive one-service-call model. Second, some appliances push part of the query processing into specialized hardware, thereby speeding up queries. For example, Netezza uses FPGAs (field-programmable gate arrays) to evaluate selection and projection operators on a table in the storage layer itself. For typical decision support queries, this can significantly reduce the amount of data that needs to be processed in the DBMS layer.
Distributed systems using the MapReduce paradigm. Large-scale data processing engines based on the MapReduce paradigm9 were originally developed to analyze Web documents, query logs, and click-through information for index generation and for improving Web search quality. Platforms based on a distributed file system and using the MapReduce runtime (or its variants such as Dryad16) have been successfully deployed on clusters with an order of magnitude more nodes than traditional parallel DBMSs. Also, unlike parallel DBMSs, where the data must first be loaded into a table with a predefined schema before it can be queried, a MapReduce job can be executed directly on schema-less input files. Furthermore, these data platforms are able to automatically handle important issues such as data partitioning, node failures, managing the flow of data across nodes, and heterogeneity of nodes.
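The programming model itself is small: a user supplies a map function that emits key-value pairs and a reduce function that folds the values for each key. The Python sketch below runs a sales-by-product aggregation through a toy, single-process stand-in for a distributed MapReduce runtime such as Hadoop.

    from itertools import groupby
    from operator import itemgetter

    def map_fn(record):
        product, amount = record
        yield (product, amount)          # emit (key, value)

    def reduce_fn(key, values):
        return (key, sum(values))        # fold all values for one key

    def run_mapreduce(records):
        # Toy runtime: map, shuffle (sort/group by key), then reduce.
        pairs = [kv for r in records for kv in map_fn(r)]
        pairs.sort(key=itemgetter(0))
        return [reduce_fn(key, [v for _, v in group])
                for key, group in groupby(pairs, key=itemgetter(0))]

    print(run_mapreduce([("Milk", 10.0), ("Soap", 5.0), ("Milk", 7.0)]))
    # [('Milk', 17.0), ('Soap', 5.0)]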
Data platforms based on the MapReduce paradigm and its variants have attracted strong interest in the context of the "Big Data" challenge in enterprise analytics, as described in the introduction. Another factor that makes such platforms attractive is the ability to support analytics on unstructured data such as text documents (including Web crawls), image, and sensor data by enabling execution of custom Map and Reduce functions in a scalable manner. Recently, these engines have been extended to support features necessary for enterprise adoption (for example, Cloudera8). While serious enterprise adoption is still in its early stages compared to mature parallel RDBMS systems, exploration using such platforms is growing rapidly, aided by the availability of the open source Hadoop14 ecosystem. Driven by the goal of improving programmer productivity while still exploiting the advantages noted here, there have been recent efforts to develop engines that can take a SQL-like query and automatically compile it to a sequence of jobs on a MapReduce engine (for example, Thusoo et al.32). The emergence of analytic engines based on MapReduce is having an impact on parallel DBMS products and research. For example, some parallel DBMS vendors (for example, Aster Data2) allow invocation of MapReduce functions over data stored in the database as part of a SQL query. The MapReduce function appears in the query as a table, which allows its results to be composed with other SQL operators in the query. Many other DBMS vendors provide utilities to move data between MapReduce-based engines and their relational data engines. A primary use of such a bridge is to ease the movement of structured data distilled from data analysis on the MapReduce platform into the SQL system.

Near real-time BI. The competitive pressure of today's businesses has led to an increased need for near real-time BI. The goal of near real-time BI (also called operational BI or just-in-time BI) is to reduce the latency between when operational data is acquired and when analysis over that data is possible. Consider an airline that tracks its most profitable customers. If a high-value customer has a lengthy delay for a flight, proactively alerting the ground staff can help the airline ensure that the customer is potentially rerouted. Such near real-time decisions can increase customer loyalty and revenue.
Figure 4. Complex event processing server architecture. (Event sources such as devices, Web servers, database servers, and a stock ticker feed events into a CEP server that evaluates standing queries; results are delivered to sinks such as a pager, a trading station, a KPI dashboard, and a database server.)

A class of systems that enables such near real-time BI is Complex Event Processing (CEP) engines (for example, Streambase29). Businesses can specify the patterns or temporal trends that they wish to detect over streaming operational data (referred to as events), and take appropriate actions when those patterns occur. The genesis of CEP engines was in the financial domain, where they were used for applications such as algorithmic stock trading, which requires detecting patterns over stock ticker data. However, they are now being used in other domains as well to make decisions in real time, for example, clickstream analysis or manufacturing process monitoring (for example, over RFID sensor data). CEP is different from traditional BI, since operational data does not need to be first loaded into a warehouse before it can be analyzed (see Figure 4). Applications define declarative queries that can contain operations over streaming data such as filtering, windowing, aggregations, unions, and joins. The arrival of events in the input stream(s) triggers processing of the query. These are referred to as "standing" or "continuous" queries, since computation may be continuously performed as long as events continue to arrive in the input stream or until the query is explicitly stopped. In general, there could be multiple queries defined on the same stream; thus one of the challenges for the CEP engine is to effectively share computation across queries when possible. These engines also need to handle situations where the streaming
data is delayed, missing, or out-of-order; such situations raise both semantic and efficiency challenges.
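To give a feel for a standing query, here is a minimal Python sketch that applies filtering, windowing, and aggregation to ticks as they arrive; the symbols, prices, and three-tick window are hypothetical, and a real CEP engine would compile a declarative query into such an operator pipeline and share it across queries.

    from collections import deque

    def standing_query(stream, window_size=3):
        # For symbol 'ABC', emit the moving average over the last
        # window_size ticks; each arriving event triggers computation.
        window = deque(maxlen=window_size)
        for symbol, price in stream:
            if symbol != "ABC":              # filtering
                continue
            window.append(price)             # windowing
            yield sum(window) / len(window)  # aggregation

    ticks = [("ABC", 25.50), ("DEF", 33.60), ("ABC", 26.50), ("ABC", 27.50)]
    print(list(standing_query(ticks)))  # [25.5, 26.0, 26.5]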
There are several open technical problems in CEP; we touch upon a few of them here. One important challenge is to handle continuous queries that reference data in the database (for example, a query that references a table of customers stored in the database) without affecting near real-time requirements. The problem of optimizing query plans over streaming data has several open challenges. In principle, the benefit of an improved execution plan for the query is unlimited, since the query executes "forever." This opens up the possibility of more thorough optimization than is feasible in a traditional DBMS. Moreover, the ability to observe the execution of operators in the execution plan over an extended period of time can be potentially valuable in identifying suboptimal plans. Finally, the increasing importance of real-time analytics implies that many traditional data mining techniques may need to be revisited in the context of streaming data. For example, algorithms that require multiple passes over the data are no longer feasible for streaming data.

Enterprise Search
BI tasks often require searching over different types of data within the enterprise. For example, a salesperson who is preparing for a meeting with a customer would like to know relevant customer information before the meeting. This information is today siloed in different sources: CRM databases, email, documents, and spreadsheets, both on enterprise servers and on the user's desktop. Increasingly, a large amount of valuable data is present in the form of text, for example, product catalogs, customer emails, annotations by sales representatives in databases, survey responses, blogs, and reviews. In such scenarios, the ability to retrieve and rank the required information using the keyword search paradigm is valuable for BI. Enterprise search focuses on supporting the familiar keyword search paradigm over text repositories and structured enterprise data. These engines typically exploit structured data to enable faceted search. For example, they might enable filtering and sorting over structured attributes of documents in the search results, such as authors, last modification date, document type, and companies (or other entities of interest) referenced in the documents. Today, a number of vendors (for example, FAST Enterprise Search11 and Google Search Appliance12) provide enterprise search capability.
Figure 5. Enterprise search architecture (integrated model). (An application issues search queries to a query engine, which answers them from a central content index. An indexing engine builds the index by crawling data sources such as Web sites, business data, email, and network shares, guided by configuration data.)

A popular architecture for enterprise search engines is the integrated model, shown in Figure 5. The search engine crawls each data source and stores the data into a central content index using an internal representation that is suitable for fast querying. The configuration data controls what objects to index (for example, a crawl query that returns objects from a database) as well as what objects to return in response to a user query (for example, a serve query to run against the database when the query keywords match a crawled object). Several technical challenges need to be addressed by enterprise search engines. First, crawling relies on the availability of appropriate adapters for each source. Achieving a high degree of data freshness requires specialized adapters that can efficiently identify and extract data changes at the source. Second, ranking results across data sources is non-trivial, since there may be no easy way to compare relevance across sources. Unlike ranking in Web search, links across documents in an enterprise are much sparser and thus not as reliable a signal. Similarly, query logs and click-through information are typically not available at sufficient scale to be useful for ranking. Finally, deploying enterprise search can involve manually tuning the relevance, for example, by adjusting the weight of each source.

Extract-Transform-Load Tools
The accuracy and timeliness of reporting, ad hoc queries, and predictive analysis depend on being able to efficiently get high-quality data into the data warehouse from operational databases and external data sources. Extract-Transform-Load (ETL) refers to a collection of tools that play a crucial role in helping discover and correct data quality issues and efficiently load large volumes of data into the warehouse.

Data quality. When data from one or more sources is loaded into the warehouse, there may be errors (for example, a data entry error may lead to a record with State = 'California' and Country = 'Canada'), inconsistent representations for the same value (for example, 'CA', 'California'), and missing information in the data. Therefore, tools that help detect data quality issues and restore data integrity in the warehouse can have a high payoff for BI. Data profiling tools enable identification of data quality issues by detecting violations of properties that are expected to hold in the data.
For example, consider a database of customer names and addresses. In a clean database, we might expect that (Name, Address) combinations are unique. Data profiling tools verify whether this uniqueness property holds, and can quantify the degree to which it is violated; for example, this might happen if Name or Address information is missing. Data profiling tools can also discover rules or properties that hold in a given database. For example, consider an external data source that needs to be imported into a data warehouse. It is important to know which columns (or sets of columns) are keys (unique) for the source. This can help in matching the incoming data against existing data in the warehouse. For efficiency, these tools often use techniques such as sampling when profiling large databases.
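A profiling check of the (Name, Address) uniqueness property can be sketched in Python as follows; the customer rows are hypothetical.

    # Hypothetical customer records, some with missing address information.
    customers = [
        ("Ann Smith", "12 Elm St"),
        ("Ann Smith", None),         # missing address
        ("Ann Smith", "12 Elm St"),  # duplicate (Name, Address) pair
        ("Bob Jones", "9 Oak Ave"),
    ]

    # Quantify how strongly (Name, Address) behaves as a key.
    seen, violations = set(), 0
    for key in customers:
        if key in seen:
            violations += 1
        seen.add(key)
    print(f"{violations} of {len(customers)} rows violate uniqueness")  # 1 of 4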
Accurately extracting structure from a string can play an important role in improving data quality in the warehouse. For example, consider a shopping Web site that stores MP3 player product data with attributes such as Manufacturer, Brand, Model, Color, and Storage Capacity, and receives a data feed for a product as text, for example, "Coby MP3 512MB MP-C756 – Blue." Being able to robustly parse the structured information present in the text into the appropriate attributes in the data warehouse is important, for example, for answering queries on the Web site. Vendors have developed extensive sets of parsing rules for important verticals such as products and addresses. The survey article by Sarawagi28 discusses techniques for the broader area of information extraction.

Another important technology that can help improve data quality is de-duplication: identifying groups of approximately duplicate entities (for example, customers). This can be viewed as a graph clustering problem where each node is an entity and an edge exists between two nodes if the degree of similarity between the two entities is sufficiently high. The function that defines the degree of similarity between two entities is typically based on string similarity functions such as edit distance (for example, 'Robert' and 'Robet' have an edit distance of one) as well as on domain-specific rules (for example, 'Bob' and 'Robert' are synonymous). Thus, the ability to efficiently perform such approximate string matching across many pairs of entities (also known as fuzzy matching) is important for de-duplication. Most major vendors support fuzzy matching and de-duplication as part of their ETL suite of tools. An overview of tools for merging data from different sources can be found in Bernstein.3
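The edit distance mentioned above is a classic dynamic program over string prefixes, and it sits at the core of many fuzzy matching similarity functions; a minimal Python version follows.

    def edit_distance(a, b):
        # Levenshtein distance: minimum number of single-character
        # insertions, deletions, and substitutions turning a into b.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # delete from a
                               cur[j - 1] + 1,              # insert into a
                               prev[j - 1] + (ca != cb)))   # substitute
            prev = cur
        return prev[-1]

    print(edit_distance("Robert", "Robet"))  # 1

A de-duplication pass might link two entities when this distance falls below a threshold, combined with domain-specific synonym rules of the kind noted above.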
Data load and refresh. Data load and refresh utilities are responsible for moving data from operational databases and external sources into the data warehouse quickly and with as little performance impact as possible at both ends. There are two major challenges. First, there is a need to efficiently capture data at the sources, that is, identify and collect the data to be moved to the data warehouse. Triggers are general-purpose constructs supported by SQL that allow rows modified by an insert/update SQL statement to be identified. However, triggers are a relatively heavyweight mechanism and can impose non-trivial overheads on the operational database running OLTP queries. A more efficient way of capturing changed data is to sniff the transaction log of the database. The transaction log is used by the database system to record all changes so that the system can recover in case of a crash. Some utilities allow pushing filters when processing transaction log records, so that only relevant changed data is captured; for example, only changed data pertaining to a particular department within the organization.

The second aspect relates to techniques for efficiently moving captured data into the warehouse. Over the years, database engines have developed specialized, performance-optimized APIs for bulk-loading data rather than using standard SQL. Partitioning the data at the warehouse helps minimize disruption of queries at the data warehouse server. The data is loaded into a partition, which is then switched in using a metadata operation only. This way, queries referencing that table are blocked only for the very short duration required for the metadata operation, rather than during the entire load time. Finally, load utilities also typically checkpoint the operation so that in case of a failure the entire work does not need to be redone. Using the techniques discussed above for capturing changed data and efficient loading, utilities these days are able to approach refresh rates of a few seconds (for example, Oracle GoldenGate22). Thus, it is potentially possible to even serve some near real-time BI scenarios, as discussed earlier.

Other BI Technology
Here, we discuss two areas we think are becoming increasingly important and where research plays a key role.

Data Mining and Text Analytics. Data mining enables in-depth analysis of data, including the ability to build predictive models. The set of algorithms offered by data mining goes well beyond what is offered as aggregate functions in relational DBMSs and in OLAP servers. Such analysis includes decision trees, market basket analysis, linear and logistic regression, neural networks, and more (see the survey6). Traditionally, data mining technology has been packaged separately by statistical software companies, for example, SAS26 and SPSS.27 The approach is to select a subset of data from the data warehouse, perform sophisticated data analysis on the selected subset of data to identify key statistical characteristics, and to then build predictive models. Finally, these predictive models are deployed in the operational database. For example, once a robust model to offer a room upgrade to a customer has been identified, the model (such as a decision tree) must be integrated back into the operational database to be actionable. This approach leads to several challenges: data movement from the warehouse to the data mining engine, and potential performance and scalability issues at the mining engine (or implied limitations on the amount of data used to build a model). To be practical, such models need to be efficient to apply when new data arrives. Increasingly, the trend is toward "in-database analytics," that is, integrating the data mining functionality into the backend data-warehouse architecture so that these limitations may be overcome (for example, Netz et al.20 and Oracle Data Mining21).


Text analytics. Consider a company making portable music players that conducts a survey of its products. While many survey questions are structured (for example, demographic information), other open-ended survey questions (for example, "Enter other comments here") are often free text. Based on such survey responses, the company would like to answer questions such as: Which products are referenced in the survey responses? What topics about the product are people mentioning? In these scenarios, the challenge is to reduce the human cost of having to read through large amounts of text data such as surveys, Web documents, blogs, and social media sites in order to extract the structured information necessary to answer these queries. This is the key value of text analytic engines. Today's text analysis engines (for example, FAST11 and SAS26) primarily extract structured data that can be broadly categorized as follows. Named entities are references to known objects such as locations, people, products, and organizations. Concepts/topics are terms that are frequently referenced in a collection of documents. For example, in the above scenario of portable music players, terms such as "battery life," "appearance," and "accessories" may be important concepts/topics that appear in the survey. Such information can potentially be used as a basis for categorizing the results of the survey. Sentiment analysis associates labels such as "positive," "neutral," or "negative" with each text document (or part of a document such as a sentence). This analysis can help answer questions such as which product received the most negative feedback.
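As a deliberately naive illustration of sentiment labeling (production engines use trained statistical models rather than fixed word lists), consider this Python sketch; the word lists and example comments are hypothetical.

    # Hypothetical signal words; real systems learn these from labeled data.
    POSITIVE = {"great", "love", "excellent"}
    NEGATIVE = {"poor", "broken", "bad"}

    def sentiment(text):
        words = {w.strip(".,!?") for w in text.lower().split()}
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    print(sentiment("Battery life is excellent, I love it"))  # positive
    print(sentiment("The headphone jack was broken"))         # negative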
10. DeWitt D.J. and Gray J. Parallel database systems:
locations, people, products, and orga- resource intensive. Thus, the ability The future of high -performance database systems
nizations. Concepts/topics are terms in to provide performance Service Level Commun. ACM 35, 6 (June 1992).
11. FAST Enterprise Search; http://www.fastsearch.com
the documents that are frequently ref- Agreements (SLAs) to tenants and ju- 12. Google Search Appliance; http://www.google.com/
erenced in a collection of documents. diciously allocate system resources enterprise/gsa
13. Graefe, G. Query evaluation techniques for large
For example, in the above scenario of across tenant queries becomes im- databases. ACM Computing Surveys 25, 2 (June 1993).
portable music players, terms such as portant. Third, many of the technical 14. Hadoop; http://hadoop.apache.org
15. IBM Cognos; http://www.ibm.com
“battery life,” “appearance,” and “ac- challenges of traditional “in-house” 16. Isard et al. Dryad: Distributed data-parallel programs
cessories” may be important concepts/ BI such as security and fine grained from sequential building blocks. In Proceedings of
Eurosys 2001.
topics that appear in the survey. Such access control become amplified in 17. Microsoft SQL Server Analysis Services; http://www.
information can potentially be used as the context of cloud data services. For microsoft.com
18. Microsoft SQL Azure; http://www.microsoft.com
a basis for categorizing the results of example, techniques for processing 19. Netezza; http://www.netezza.com
the survey. Sentiment analysis produc- queries on encrypted data become 20. Netz, A., Chaudhuri, S., Fayyad, U. and Bernhardt, J.
Integrating data mining with SQL databases. OLE DB
es labels such as “positive,” “neutral,” more important in public clouds. For for Data Mining, 2001.
or “negative” with each text document these reasons, an intermediate step 21. Oracle Data Mining; http://www.oracle.com
22. Oracle GoldenGate; http://www.oracle.com
(or part of a document such as a sen- in adoption of BI technologies may be 23. Oracle Hyperion; http://www.oracle.com
tence). This analysis can help answer in private clouds, which hold promise 24. QlikView; http://www.qlikview.com
25. Pentaho; http://www.pentaho.com
questions such as which product re- similar to public clouds but with more 26. SAS: Business Analytics and Business Intelligence
ceived the most negative feedback. control over aspects such as security. Software; http://www.sas.com
27. SPSS: Data Mining, Statistical Analysis, Predictive
Cloud Data Services. Managing en- Analytics, Decision Support Systems; http://www.
spss.com
terprise BI today requires handling Conclusion 28. Sarawagi, S. Information extraction. Foundations and
tasks such as hardware provisioning, The landscape of BI in research and Trends in Databases 1, 3 (2008), 261-377.
29. Streambase; http://www.streambase.com
availability, and security patching. industry is vibrant today. Data acquisi- 30. Sybase IQ; http://www.sybase.com
Cloud virtualization technology (for tion is becoming easier and large data 31. Teradata; http://www.teradata.com
32. Thusoo, A. et al. Hive—A warehousing solution over a
example, Amazon EC21) allows a server warehouses with 10s to 100s of tera- MapReduce framework. VLDB 2009
to be hosted in the cloud in a virtual bytes or more of relational data are
machine, and enables server consoli- becoming common. Text data is also Surajit Chaudhuri (surajitc@microsoft.com) is a principal
dation through better utilization of being exploited as a valuable source researcher at Microsoft Research, Redmond, WA.
hardware resources. Hosted servers for BI. Changes in the hardware tech- Umeshwar Dayal (umeshwar.dayal@hp.com) is an HP
also offer the promise of reduced cost nology such as decreasing cost of Fellow in the Intelligent Enterprise Technology Labs at
Hewlett-Packard Lab, Palo Alto, CA.
by offloading manageability tasks, and main memory are impacting how the
Vivek Narasayya (viveknar@microsoft.com) is a principal
leveraging the pay-as-you-go pricing backend of large data-warehouses are researcher at Microsoft Research, Redmond, WA.
model to only pay for services that are architected. Moreover, as cloud data
actually used. The success of hardware services take root, more changes in © 2011 ACM 0001-0782/11/08 $10.00

98 communications of th e acm | au g ust 2 0 1 1 | vo l . 5 4 | n o. 8

