
MapReduce and the Data Scientist

Colin White, BI Research, January 2012


Copyright 2012 BI Research, All Rights Reserved.


Big data is big news and so too is analytics on big data. Technologies for analyzing big data are evolving rapidly and there is significant interest in new analytic approaches such as Hadoop MapReduce and Hive, and MapReduce extensions to existing relational DBMSs. This paper examines the benefits of big data and the role of the data scientist in deploying big data solutions. It discusses the benefits of Hadoop MapReduce, Hive, and relational DBMSs that have added MapReduce capabilities. It presents a set of scenarios outlining how Hadoop and relational DBMS technology can coexist to provide the benefits of big data and big data analytics to the business and to data scientists. As an example of the use of MapReduce in a relational DBMS, the paper also reviews the implementation of MapReduce in the Aster Database.


Big data is a valuable term despite the hype

The topic of big data is receiving significant press coverage and gathering increasing interest from both business users and IT staff. The result is that big data is becoming an overhyped and overused marketing buzzword. It remains, however, a valuable term, because from an analytics perspective it still represents analytic workloads and data management solutions that could not previously be supported because of cost considerations and/or technology limitations. These solutions can bring significant benefits to the business by enabling smarter and faster decision making, and by allowing organizations to achieve faster time to value from their investments in analytical processing technology and products.

Analytics on multi-structured data enable smarter decisions

Smarter decision making comes from the ability to analyze new sources of data to enhance existing analytics and predictive models created from structured data in operational systems and data warehouses. There is strong emphasis in big data products, for example, on the analysis of multi-structured 1 data from sensors, system and web logs, social computing web sites, text documents, and so forth. Hitherto, these types of data have been difficult to process using traditional analytical processing technologies.

Faster decisions are enabled because big data solutions support the rapid analysis of high volumes of detailed data. The analysis of large data stores has been difficult to date because it is often impossible to implement in a timely or cost-effective manner. To overcome this issue, organizations have had to aggregate or sample the detailed data before it could be analyzed, which not only increases data latency, but also reduces the value of the results.

Rapid analysis of detailed data provides faster decisions

1 Multi-structured data is data that has unknown or ill-defined schemas, i.e., it is usually unmodeled. The term unstructured is also used to refer to this type of data.


Analyzing non-EDW data offers faster time to value

Faster time to value is possible because organizations can now process and analyze data that is outside of the enterprise data warehouse. It is often not practical or cost effective, for example, to integrate large volumes of machine-generated data from sensors and system and web logs into the enterprise data warehouse for analysis. In many cases, however, analyzing machine-generated data helps identify smaller subsets of high-value information that can be integrated into the enterprise data warehouse. To realize the full value of big data, organizations need to build an investigative computing platform 2 that enables business users such as data scientists to ingest, structure and analyze big data to extract useful business information that is not easily discoverable in its raw native form.


Data scientists turn big data into big value

A role or job frequently associated with big data is that of the data scientist. Daniel Tunkelang, Principal Data Scientist at LinkedIn, states, "Data scientists turn big data into big value, delivering products that delight users, and insight that informs business decisions." 3 Data scientists use a combination of their business and technical skills to investigate big data, looking for ways to improve current business analytics and predictive analytical models, and also for possible new business opportunities. One of the biggest differences between a data scientist and a business intelligence (BI) user such as a business analyst is that a data scientist investigates and looks for new possibilities, while a BI user analyzes existing business situations and operations.

Organizations will need to build data science teams

Data scientists require a wide range of skills:
- Business domain expertise and strong analytical skills
- Creativity and good communications
- Knowledgeable in statistics, machine learning and data visualization
- Able to develop data analysis solutions using modeling/analysis methods and languages such as MapReduce, R, SAS, etc.
- Adept at data engineering, including discovering and mashing/blending large amounts of data

People with this wide range of skills are rare, and this explains why data scientists are in short supply. In most organizations, rather than looking for individuals with all of these capabilities, it will be necessary instead to build a team of people that collectively has these skills.

2 Teradata Corporation, the sponsor of this paper, uses the term data discovery platform to describe such an environment.
3 LinkedIn's Daniel Tunkelang On What Is a Data Scientist, Dan Woods, October 2011. See: linkedins-daniel-tunkelang-on-what-is-a-data-scientist/.

Data scientists use an investigative computing platform (see Figure 1) to bring unmodeled data, e.g., multi-structured data, into an investigative data store for experimentation. This data store is separate from, but connected to, the traditional data warehousing environment. In some cases, existing modeled data from the data warehouse may also be added to the data store.

Figure 1. Investigative computing

Data analysis tools are used to process data in the investigative data store and experiment with different analytical solutions. The final results may be used to improve the existing data warehousing and business intelligence (BI/DW) environment, or to deploy a new standalone business area solution that is kept separate from the existing BI/DW environment for business, technology, performance, or cost reasons.

The concepts behind data science are not new

Most of the concepts behind data science and investigative computing are not new. For example, experts in statistics and data mining have been building predictive models for many years for risk analysis, fraud detection, and so forth. What is new about data science is that the use of big data and associated data analysis technologies enables a broader set of business solutions to be addressed, which brings what has hitherto been viewed as a backroom and specialized skill set into the business limelight. For data science to succeed, however, it will be important for vendors to focus on making big data analysis technologies (such as MapReduce) more approachable to a broader set of business experts.

Big data helps broaden the business scope of investigative computing in three areas:
- New sources of data: supports access to multi-structured data.
- New and improved analysis techniques: enables sophisticated analytical processing of multi-structured data using techniques such as MapReduce and in-database analytic functions.

Big data enhances investigative computing


- Improved data management and performance: provides improved price/performance for processing multi-structured data using non-relational systems such as Hadoop, relational DBMSs, and integrated hardware/software appliances.

The following sections of this paper discuss these approaches and technologies in more detail.

MapReduce is a technique popularized by Google that distributes the processing of very large multi-structured data files across a large cluster of machines. High performance is achieved by breaking the processing into small units of work that can be run in parallel across the hundreds, potentially thousands, of nodes in the cluster. To quote the seminal paper on MapReduce:

MapReduce is a programming model for automating parallel computing

"MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system." 4

The key point to note from this quote is that MapReduce is a programming model, not a programming language, i.e., it is designed to be used by programmers, rather than business users. The easiest way to describe how MapReduce works is through the use of an example; see the Colored Square Counter in Figure 2.

Figure 2. Colored Square Counter

[Figure 2 depicts the MapReduce flow: the input set of colored squares is split into pieces, map programs group the squares in each split by color, a shuffle/sort step merges the map outputs, and a reduce program sums the number of squares of each color.]
The input to the MapReduce process in Figure 2 is a set of colored squares. The objective is to count the number of squares of each color. The programmer in this example is responsible for coding the map and reduce programs; the remainder of the

4 MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Google, Inc. See: content/untrusted_dlcp/

processing is handled by the software system implementing the MapReduce programming model.

Multiple map and reduce programs run in parallel on the nodes of the cluster

The MapReduce system first reads the input file and splits it into multiple pieces. In this example, there are two splits, but in a real-life scenario, the number of splits would typically be much higher. These splits are then processed by multiple map programs running in parallel on the nodes of the cluster. The role of each map program in this case is to group the data in a split by color. The MapReduce system then takes the output from each map program and merges (shuffle/sort) the results for input to the reduce program, which calculates the sum of the number of squares of each color. In this example, only one copy of the reduce program is used, but there may be more in practice.

To optimize performance, programmers can provide their own shuffle/sort program and can also deploy a combiner that combines local map output files to reduce the number of output files that have to be remotely accessed across the cluster by the shuffle/sort step.
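The split/map/shuffle/reduce flow described above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop code, and the squares below are made up rather than taken from Figure 2:

```python
from collections import defaultdict

# Single-process sketch of the Figure 2 flow (illustration only, not Hadoop
# code): split -> map -> shuffle/sort -> reduce.

def map_colors(split):
    # Each map program emits a (color, 1) pair for every square in its split.
    return [(color, 1) for color in split]

def shuffle(map_outputs):
    # The shuffle/sort step merges the map outputs, grouping pairs by color.
    grouped = defaultdict(list)
    for output in map_outputs:
        for color, count in output:
            grouped[color].append(count)
    return grouped

def reduce_colors(grouped):
    # The reduce program sums the number of squares of each color.
    return {color: sum(counts) for color, counts in grouped.items()}

squares = ["R", "R", "P", "B", "G", "B", "R", "G", "P", "R"]
splits = [squares[:5], squares[5:]]            # two splits, as in Figure 2
map_outputs = [map_colors(s) for s in splits]  # maps run in parallel in Hadoop
totals = reduce_colors(shuffle(map_outputs))
assert totals == {"R": 4, "P": 2, "B": 2, "G": 2}
```

A combiner, as mentioned above, would simply apply the same summing step to each map program's local output before the shuffle, reducing the volume of data moved across the cluster.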


MapReduce aids organizations in processing and analyzing large volumes of multi-structured data. Application examples include indexing and search, graph analysis, text analysis, machine learning, data transformation, and so forth. These types of applications are often difficult to implement using the standard SQL employed by relational DBMSs.

Prebuilt MapReduce functions can be used by non-programmers

The procedural nature of MapReduce makes it easily understood by skilled programmers. It also has the advantage that developers do not have to be concerned with implementing parallel computing; this is handled transparently by the system. Although MapReduce is designed for programmers, non-programmers can exploit the value of prebuilt MapReduce applications and function libraries. Both commercial and open source MapReduce libraries are available that provide a wide range of analytic capabilities. Apache Mahout, for example, is an open source machine-learning library of algorithms for clustering, classification and batch-based collaborative filtering 5 that are implemented using MapReduce.


MapReduce programs are usually written in Java, but they can also be coded in languages such as C++, Perl, Python, Ruby, R, etc. These programs may process data stored in different file and database systems. At Google, for example, MapReduce was implemented on top of the Google File System (GFS).

Hadoop supports MapReduce development using high-level languages such as Hive and Pig

One of the main deployment platforms for MapReduce is the open source Hadoop distributed computing framework provided by the Apache Software Foundation. Hadoop supports MapReduce processing on several file systems, including the Hadoop Distributed File System (HDFS), which was motivated by GFS. Hadoop also provides Hive and Pig, which are high-level languages that generate MapReduce programs. Several vendors offer open source and commercially supported Hadoop distributions; examples include Cloudera, DataStax, Hortonworks (a spinoff from



Yahoo) and MapR. Many of these vendors have added their own extensions and modifications to the Hadoop open source platform.

Some relational DBMSs support the use of MapReduce SQL functions

Another direction of vendors is to support MapReduce processing in relational DBMSs. These are implemented as in-database analytic functions that can be used in SQL statements. These functions are run inside the database system, which enables them to benefit from the parallel processing capabilities of the DBMS. The Aster Database, a component of the Teradata Aster MapReduce Platform, provides a number of built-in MapReduce functions for use with SQL. It also includes an interactive development environment, Aster Developer Express, that programmers can use to create their own MapReduce functions.

MapReduce Development Using Hadoop

Organizations have developed their own non-relational systems to handle big data

Organizations with large amounts of multi-structured data find it difficult to use traditional relational DBMS technology for processing and analyzing this data. This is especially true for Web-based companies such as Google, Yahoo, Facebook, and LinkedIn, who need to process vast volumes, often petabytes, of data in a timely and cost-effective manner. To overcome this issue, several of these organizations developed their own non-relational systems. Google, for example, developed MapReduce and the Google File System. It also built a DBMS system known as BigTable. There are many different types of non-relational systems, each suited to different types of data and different kinds of processing. This paper focuses specifically on Hadoop and its implementation of MapReduce for analytical processing.

It is important to note that non-relational systems are not new. Many existed even before the advent of relational DBMS technology.
What is new is that many of the more recent systems are designed to exploit commodity hardware in a large-scale distributed computing environment and have been made available to the open source community.

An Introduction to Hadoop

Apache Hadoop consists of several components. The ones that are of interest from a database and analytical processing perspective are:

Data availability in HDFS is achieved by replicating the data

- Hadoop Distributed File System (HDFS): A distributed file system that stores and replicates large files across multiple machine nodes.
- MapReduce: A programming model for distributing the processing of large data files (usually HDFS files) across a large cluster of machines.
- Pig: A high-level data flow language (Pig Latin) and compiler that create MapReduce jobs for analyzing large data files.
- Hive: An SQL-like language (HiveQL) and optimizer that create MapReduce jobs for analyzing large data files.
- HBase: A distributed DBMS modeled after Google BigTable.
- Sqoop: A tool for moving data between a relational DBMS and Hadoop.

There are several different types of non-relational systems; Hadoop with MapReduce is one example

HDFS can be a source or target file system for MapReduce programs. It is best suited to a small number of very large files. Data availability is achieved in HDFS using data replication. This of course increases the amount of storage required to manage


the data. By default the replication factor is three. The first copy of the data is written to the node creating the file, the second copy to a node in the same rack, and the third copy to a node in a different rack in the machine cluster. If a node fails during application processing, the required data is simply retrieved by HDFS from another node.

The Hadoop MapReduce framework attempts to distribute the map program processing so that the required HDFS data is local to the program wherever possible. Reduce program processing involves more internode data access and movement, because it has to process all of the output files created by the mapping process. During execution, map and reduce programs write work data to the local file system to avoid the overhead of HDFS replication. If a particular copy of a map or reduce program fails, it is simply rerun by the system.

Existing HDFS data cannot be updated

HDFS supports multiple readers and one writer. The writer can only add data at the end of a file, i.e., it cannot modify existing data within the file. HDFS does not provide an index mechanism, which means that it is best suited to read-only applications that need to scan and read the complete contents of a file (such as MapReduce programs). The actual location of the data within an HDFS file is transparent to applications and external software. Software built on top of HDFS therefore has little control over data placement or knowledge of data location, which can make it difficult to optimize performance.

Developers code Hadoop map and reduce programs using low-level procedural interfaces. Hadoop provides an API for use by Java programmers. C++ programmers employ the Hadoop Pipes facility, which uses a sockets interface to Hadoop MapReduce and HDFS. Programs written in other languages use Hadoop Streaming, which allows developers to create and run MapReduce jobs using any program or script.
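The Hadoop Streaming contract can be sketched in Python: a mapper and a reducer are ordinary programs that consume lines and emit tab-separated key/value lines, with Hadoop performing the sort between the two stages. Word count is the canonical example; this is a local simulation, not a submitted Hadoop job:

```python
from itertools import groupby

# Sketch of the Hadoop Streaming contract (illustration only): a mapper and
# a reducer are ordinary programs that read lines and emit tab-separated
# key/value lines. In a real job, Hadoop wires them to stdin/stdout and
# performs the shuffle/sort between the two stages.

def mapper(lines):
    # Emit one "word<TAB>1" line per word (classic word count).
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Streaming reducers see their input sorted by key, so all lines for
    # the same word arrive together and can be summed in a single pass.
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the job locally: map, then sort (Hadoop's shuffle), then reduce.
lines = ["the quick brown fox", "the lazy dog"]
result = list(reducer(sorted(mapper(lines))))
assert result == ["brown\t1", "dog\t1", "fox\t1", "lazy\t1", "quick\t1", "the\t2"]
```

In an actual streaming job these two functions would be separate executables reading stdin and writing stdout, submitted via the Hadoop Streaming utility.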
These latter jobs read input (line by line) from stdin and write the output to stdout. MapReduce jobs are run in parallel and distributed across the nodes of the machine cluster. The Hadoop MapReduce framework handles the scheduling, running and recovery of MapReduce jobs.

Although Hadoop MapReduce offers a powerful capability for processing large amounts of data, coding map and reduce programs using low-level procedural interfaces is time-consuming and restricts MapReduce development to skilled programmers. Ashish Thusoo of Facebook highlights this issue in his June 2009 article explaining why Facebook developed the Hive high-level language for MapReduce:

Hive was created to improve the productivity of MapReduce development

"Hadoop was not easy for end users, specially for the ones who were not familiar with map/reduce. Hadoop lacked the expressibility of popular query languages like SQL and as a result users ended up spending hours (if not days) to write programs for typical analysis. It was very clear to us that in order to really empower the company to analyze this data more productively, we had to improve the query capabilities of Hadoop. Bringing this data closer to users is what inspired us to build Hive." 6

Map and reduce programs are developed using low-level APIs

6 Ashish Thusoo, Hive - A Petabyte Scale Data Warehouse Using Hadoop. See:

Using Hive for MapReduce Development

Hive is one of several high-level languages for developing MapReduce applications. Two others are Pig and JAQL. 7 Pig was developed by Yahoo and is now a component of the Apache Hadoop project. It includes Pig Latin, which is a high-level data flow and manipulation language for building MapReduce applications. It has both procedural (like a programming language) and declarative (as in SQL) statements, and is somewhat similar to (but less powerful than) early fourth-generation languages such as Focus, Nomad and SAS. JAQL is a query language designed for JavaScript Object Notation (JSON), a data format that has become popular because of its simplicity. JAQL was developed by IBM for its BigInsights product. Although IBM has placed JAQL in the open source community, it has gained little traction outside of the IBM environment.

Hive provides an SQL-like query language (HiveQL)

This paper focuses on Hive because it has gained the most acceptance in the industry and also because its SQL-like syntax makes it easy to use by non-programmers who are comfortable using SQL. Many of the conclusions in this paper about Hive, however, are equally applicable to Pig, JAQL and other high-level languages used for deploying MapReduce solutions.

The main components of Hive are illustrated in Figure 3. HiveQL statements can be entered using a command line or Web interface, or may be embedded in applications that use ODBC and JDBC interfaces to the Hive system. The Hive Driver converts the query statements into a series of MapReduce jobs.

Figure 3. Hive components (source: Facebook)

7 A useful comparison of the computational power, conciseness and performance of Hive, Pig and JAQL can be found in the paper Comparing High Level MapReduce Query Languages, by Stewart, Trinder and Loidl of Heriot-Watt University. See:


A Hive table can be partitioned into separate HDFS files

Data files in Hive are seen in the form of tables (and views), with columns to represent fields and rows to represent records. Tables can be horizontally partitioned based on the values of one or more table columns. The data for each partition is stored in a separate HDFS file. Data is not validated against the partition definition during the loading of data into the HDFS file. Partitions can be further split into buckets by hashing the data values of one or more table columns. Buckets are stored in separate HDFS files and are used for sampling and for building a Hive index on an HDFS file. Hive tables do not support the concept of primary or foreign keys, or constraints of any type. All definitions are maintained in the Hive metastore, which is a relational database such as MySQL.

Data types supported by Hive include primitive types such as integers, floating point numbers, strings and booleans. A timestamp data type is provided in Hive 0.8.0. Hive also supports complex types such as arrays (indexed lists), maps (key/value pairs), structs (structures) and user-defined types.

External HDFS files can be defined to Hive as external tables. Hive also allows access to data stored in other file and database systems such as HBase. Access to these data stores is enabled via storage handlers that present the data to Hive as non-native tables. The HiveQL support (and restrictions) for these non-native tables is broadly the same as that for native tables.
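The hash-bucketing scheme described above can be sketched as follows. The hash function here is illustrative only (Hive's own hashing differs), but the principle is the same: each row is routed deterministically to one of a fixed number of bucket files based on a chosen column value:

```python
import zlib
from collections import defaultdict

# Sketch of hash bucketing (the hash function is illustrative; Hive's own
# hashing differs): each row is routed to one of a fixed number of bucket
# "files" based on a hash of a chosen column value.

NUM_BUCKETS = 4

def bucket_for(value, num_buckets=NUM_BUCKETS):
    # Deterministic hash of the column value, modulo the bucket count.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def bucketize(rows, key):
    # Group rows into their bucket "files".
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[key])].append(row)
    return buckets

rows = [{"user_id": uid} for uid in range(100)]
buckets = bucketize(rows, "user_id")

# Every row lands in exactly one bucket, so a roughly 1-in-4 sample can be
# taken by reading a single bucket file instead of scanning the whole table.
assert sum(len(b) for b in buckets.values()) == len(rows)
```

Because the same value always hashes to the same bucket, reading one bucket file yields a repeatable sample, which is why buckets are useful for sampling and indexing.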

HiveQL supports a subset of SQL SELECT

HiveQL supports a subset of the SQL SELECT statement operators and syntax, including:
- Joins (equality, outer, and left semi-joins are supported)
- Union
- Subqueries (supported in the FROM clause only)
- Relational, arithmetic, logical and complex type operators
- Arithmetic, string, date, XPath, and user-defined functions, including aggregate and table functions (UDFs, UDAFs, UDTFs)

Custom map and reduce scripts can be embedded in HiveQL

The Hive MAP and REDUCE operators can be used to embed custom MapReduce scripts in HiveQL queries. An INSERT statement is provided, but it can only be used to load or replace a complete table or table partition. The equivalents of the SQL UPDATE and DELETE statements are not supported.

When comparing Hadoop Hive to a relational DBMS employing SQL, two areas have to be considered: query language syntax and query performance. Query language syntax is a moving target in both Hadoop Hive and relational DBMS products. Although Hive provides a useful subset (and superset) of SQL functionality, it is highly probable that existing SQL-based user applications and vendor software would need to be modified and simplified to work with Hive.

The most important comparison between Hive and SQL is performance

Perhaps the most important comparison between Hadoop Hive and the relational DBMS environment concerns performance. Such comparisons should consider traditional short and/or ad-hoc SQL-like queries running on Hive versus a relational DBMS, and also MapReduce performance on Hive compared with using SQL MapReduce relational DBMS functions for querying and analyzing large volumes of


multi-structured data. Knowledge of the way Hive and relational DBMSs process queries is useful when discussing the performance of the two approaches.

Hive provides an SQL wrapper on top of Hadoop HDFS (and other storage systems). It has no control or knowledge of the placement or location of data in HDFS files. The Hive optimizer uses rules to convert HiveQL queries into a series of MapReduce jobs for execution on a Hadoop cluster. Hints in HiveQL queries can aid the optimization process, for example, to improve the performance of join processing.

Hive partitions and buckets reduce the amount of data scanned

Hive-generated MapReduce jobs sequentially scan the HDFS files used to store the data associated with a Hive table. The Hive optimizer is partition and bucket aware, and so table partitions and buckets can be used to reduce the amount of data scanned by a query. Hive supports compact indexes (in 0.7.0) and bitmapped indexes (in 0.8.0), which aid lookup and range queries, and also enable certain queries to be satisfied by index-only access. Since Hive has no knowledge of the actual physical location of the data in an HDFS file, the Hive indexes contain data rather than pointers to data. A table index will need to be rebuilt if the data in a table partition is refreshed. Hive does, however, support partitioned indexes. Hive indexes aid the performance of traditional SQL-like queries, rather than MapReduce queries, which by their nature involve sequential processing.

The primary use case for Hadoop Hive is the same as that for Hadoop MapReduce, which is the sequential processing of very large multi-structured data files such as Web logs. It is not well suited to ad-hoc queries where the user expects fast response times. The positioning of Hive is aptly described on the Apache Hive Wiki: Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates.
It is best used for batch jobs over large sets of append-only data (like Web logs). What Hive values most is scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats. 8

The main benefit of Hive is that it dramatically improves the simplicity and speed of MapReduce development. The Hive optimizer also makes it easier to process interrelated files as compared with hand-coding MapReduce procedural logic to do this. The Hive optimizer, however, is still immature and is not fully insulated from the underlying file system, which means that for more complex queries, the Hive user is still frequently required to aid the optimizer through hints and certain HiveQL language constructions.

There are many other tools for improving the usability of Hadoop, e.g., Informatica HParser for data transformation, and Karmasphere Studio, Pentaho Business Analytics and Revolution RevoConnectR for analytical processing. Most of these tools are front ends to Hadoop MapReduce and HDFS, and so many of the considerations discussed above for Hive apply equally to these products.

Relational DBMS MapReduce Development

Before discussing how relational DBMSs support MapReduce, it is useful to first compare how Hadoop Hive and a relational DBMS process queries. This process is illustrated in Figure 4.

Hive is suited to batch MapReduce processing, rather than short ad-hoc queries

Hive reduces the time it takes to deploy MapReduce solutions



The left side of the diagram shows the operation of Hadoop Hive outlined in the previous section. The Hive rules-driven optimizer processes HiveQL statements and generates one or more MapReduce jobs to carry out the required query operations. These jobs are passed to the Hadoop system, where they are scheduled and managed by Hadoop's job and task tracker capabilities. The MapReduce jobs read and process HDFS data and pass the results back to the Hive application.

Relational DBMSs provide SQL users with physical data independence

With the SQL application shown on the right side of the diagram, the relational mapping layer provides an interface between the relational logical view of data and the data management software used to create and manipulate physical data in the storage subsystem. In general, the logical and physical views of data in a relational system are completely isolated from each other in order to provide physical data independence. As discussed earlier, the isolation in Hadoop Hive is not as complete, which reduces data independence, but may give the developer more control for certain complex queries.

The data independence of a relational DBMS has the advantage that vendors can extend or add runtime and data storage engines without affecting existing applications. The open source MySQL relational DBMS, for example, offers several data storage engines. Commercial vendors have also extended and added new data storage engines to support XML data, online analytical processing (OLAP), and so forth. The Teradata Aster Database includes a runtime engine for MapReduce processing, as well as a data store that enables both row and columnar data storage.

Figure 4. Hadoop Hive and relational DBMS query processing

Relational DBMSs manage the complete data flow from disk to the user

Another advantage of the flexibility of relational DBMS runtime and storage engines is that the software vendor controls the complete data flow from the disk to the application. This enables the DBMS to control data placement and location, and to add performance features such as indexing, columnar storage, hybrid storage, buffering management and workload management, without affecting existing applications. Most vendors also provide an extensive set of administration tools and


features to handle security, backup and recovery, availability (without the need for replication), and so forth. This is different from Hadoop Hive, since Hive has little control over the use of HDFS, which makes performance optimization more difficult. Hive is also less mature and, therefore, has fewer administration tools.

Programmers dislike SQL because they cannot control access to the data

Although data independence makes life easier for non-programmers, the downside of relational DBMS data independence is that experienced developers have little or no control over how data is accessed and processed. Instead, they have to rely on the relational optimizer to make the right decisions about how data is to be accessed. The developer also has to deal with converting relational result sets to the record-at-a-time nature of procedural programming. This is why many expert programmers dislike declarative languages, such as SQL, that identify what data is required, but not how it should be accessed. Programmers prefer to use procedural programming approaches and to have more control over access to the data. To overcome this issue, several relational DBMS vendors are now adding to their products the ability to code user-defined functions in procedural languages (Java, for example) that can employ techniques such as MapReduce.

Relational database vendors have spent many years of effort and significant resources improving the quality of SQL optimization. Most relational optimizers today use sophisticated cost-based optimization and statistics about the data to determine the most appropriate access paths to the data. Many also dynamically rewrite user SQL statements to improve performance. Despite all this effort, relational optimization is not perfect. There are some types of queries and some types of data that are difficult to optimize for good performance. Queries involving the processing of financial time series information and social network graph analysis are examples.
Rick van der Lans of R/20 Consultancy outlines many more in his paper, Using SQL-MapReduce for Advanced Analytical Queries. 9

To overcome optimization issues and extend the scope of SQL, many relational DBMS products provide the ability for developers to add user-defined procedures and functions to the DBMS. These may be written in scripting languages (some of which are proprietary) or in programming languages such as Java or C++. Some products include interactive development environment tools to aid in the development of these procedures and functions.

Most relational DBMS vendors provide a range of prebuilt functions with their products. Third-party vendors, such as Fuzzy Logix and Revolution Analytics, also market libraries of analytic functions for several relational DBMS products. The benefit of both programmer-developed and vendor-provided functions is that SQL users, such as data scientists, only need to understand what a function does and how to use it. They do not need to know how to develop the functions. This reduces the skill level required of data scientists, and also provides faster time to value.

Certain types of query are difficult for relational optimizers

SQL functions improve the analytic capabilities of relational DBMSs

9 Using SQL-MapReduce for Advanced Analytical Queries, Rick van der Lans, September 2011.

Copyright 2012 BI Research, All Rights Reserved.

MapReduce and the Data Scientist

Important to understand how a relational DBMS executes a function at run time

User-defined functions in SQL statements are usually black boxes to the SQL optimizer. All the optimizer needs to do is create an execution plan that passes data to the function and handles any result data. It is important, therefore, to understand how the run-time engine implements these functions. Does the DBMS or an external run-time engine execute the function? Does the function exploit the parallel processing capabilities of the DBMS? Is the function run in a separate memory space from the DBMS in order to prevent errant functions from affecting DBMS performance and availability?

The increasing interest in and popularity of MapReduce has led some relational DBMS vendors to support MapReduce functions inside the DBMS. This capability not only offers the benefits outlined above for deploying user-defined functions, but also adds the advantages of MapReduce to the relational DBMS environment, i.e., the ability to process multi-structured data using SQL. It also brings the maturity of relational DBMS technology to MapReduce processing. The Teradata Aster Database is an example of a product that supports MapReduce, and this product will be used to explain how MapReduce is supported using relational technology.

Teradata Aster MapReduce Platform

Some relational DBMSs support MapReduce functions

MapReduce functions enable SQL users to exploit the power of MapReduce

The Aster Database product is a leader in the use of MapReduce in a relational DBMS environment. Its SQL-MapReduce capability couples SQL with MapReduce to provide a framework for running sophisticated analyses on a diverse set of both structured and multi-structured data. At the same time, SQL-MapReduce preserves the declarative and storage-independence benefits of SQL, while exploiting the power of the MapReduce procedural approach to extend SQL's analytic capabilities.

One of the key objectives of the Aster Database is to make it easier for less experienced users to exploit the analytical capabilities of existing and packaged MapReduce functions without the need to know how these functions are programmed. SQL-MapReduce functions are run by a separate execution engine within the Aster Database environment and are automatically executed in parallel across the nodes of the machine cluster.

The SQL-MapReduce facility includes a library of prebuilt analytic functions to speed the deployment of analytic applications. Functions are provided for path, pattern, statistical, graph, text and cluster analysis, and for data transformation. Custom functions can be written in a variety of languages (C, C++, C#, Java, Python, Perl, R, etc.) for use in both batch and interactive environments. An interactive development environment tool (Aster Developer Express) helps reduce the effort required to build and test custom functions. This tool can also be used to import existing Java MapReduce programs for deployment in the Aster Database system. These latter programs may need to be modified to fit into the Aster MapReduce implementation.

The SQL query in Figure 5 shows the use of the prepackaged Aster Database nPath function to process Web log clickstream data to determine the number of users who start at the home page (page_id=50) and exclusively access the business pages (page_cat_id=90) of the Web site.

The Aster Database provides some 50 prebuilt MapReduce functions and a MapReduce development tool

MapReduce clickstream analysis example


Results from MapReduce processing can be used by other SQL operators

The results from functions such as nPath are the equivalent of those produced by an SQL subquery, i.e., they form a relational table. They can, therefore, be used like any other table in an SQL query: joined to rows from other tables, filtered using the WHERE clause, grouped with the GROUP BY clause, and so on. SQL-MapReduce queries can be entered, and the results visualized, using any third-party tool (for example, Tableau) that supports the building of custom SQL statements.

Figure 5. SQL example showing the use of the Aster nPath function

SELECT B_count
FROM nPath (
    ON clickstream
    PARTITION BY user_id
    ORDER BY ts
    MODE (NONOVERLAPPING)
    PATTERN ('H.B*')
    SYMBOLS (page_id = 50 AS H, page_cat_id = 90 AS B)
    RESULT (COUNT(*) OF ANY(B) AS B_count)
)
ORDER BY B_count DESC;
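For intuition, the logic of this nPath call can be approximated in ordinary Python. The sample rows are invented, this is not how the Aster engine executes the function, and the sketch assumes a match must cover a user's whole session, matching the "exclusively business pages" intent of the query:

```python
import re
from itertools import groupby

# Invented clickstream rows: (user_id, ts, page_id, page_cat_id)
clicks = [
    (1, 1, 50, 0), (1, 2, 99, 90), (1, 3, 98, 90),  # home, then business pages
    (2, 1, 50, 0), (2, 2, 97, 10),                  # home, then a non-business page
]

def symbol(row):
    _, _, page_id, page_cat_id = row
    if page_id == 50:
        return "H"      # SYMBOLS: page_id=50 AS H (home page)
    if page_cat_id == 90:
        return "B"      # SYMBOLS: page_cat_id=90 AS B (business page)
    return "x"          # anything else

results = []
clicks.sort(key=lambda r: (r[0], r[1]))          # PARTITION BY user_id ORDER BY ts
for user, rows in groupby(clicks, key=lambda r: r[0]):
    s = "".join(symbol(r) for r in rows)
    if re.fullmatch("HB*", s):                   # PATTERN ('H.B*') over the session
        results.append((user, s.count("B")))     # COUNT(*) OF ANY(B) AS B_count

results.sort(key=lambda r: -r[1])                # ORDER BY B_count DESC
print(results)  # [(1, 2)]
```

User 1's session matches the pattern with two business pages; user 2 visits a non-business page and is excluded.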

SQL-MapReduce functions are self-describing and support late binding, which means they can be written without any knowledge of the structure of the data that will be processed by the function. The formats of the input and output data are determined programmatically by a function at query run time. This approach is ideally suited to analyzing unmodeled, multi-structured data. Functions also manage and control their own memory and file structures, and are run by a separate process to reduce the likelihood of errant functions damaging system operations.

Aster Database supports both map and reduce processing

The equivalent of a map program in SQL-MapReduce is a row function; a reduce program is known as a partition function. These functions can be chained together to provide the same capabilities as a series of MapReduce jobs, but without the overheads of job scheduling. Multiple instances of a function are run in parallel across nodes and across the threads within a node. The processing of input rows is distributed among the threads, and output rows are collected from all threads. With a row function, each row from the input table is handled by exactly one instance of the function. With a partition function, each group of rows defined by the PARTITION BY clause (see the nPath example above) is handled by exactly one instance of the function. The run-time engine therefore manages parallelism at either the row level or the partition level, depending on the type of function being used.

SQL-MapReduce functions can also read and write data from/to external data sources. A function or set of functions could be used, for example, to read, transform and load external data into a DBMS table for processing and analysis, and to write the results to an external output file. 10
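The row-function (map) and partition-function (reduce) split described above can be illustrated with a toy, single-threaded Python pipeline. The log data and function names are invented; a real SQL-MapReduce run would execute many instances of each function in parallel:

```python
from itertools import groupby

def row_function(row):
    """Map-style: each input row is handled by exactly one instance."""
    user, url = row
    return (user, 1)

def partition_function(key, rows):
    """Reduce-style: each PARTITION BY group is handled by one instance."""
    return (key, sum(count for _, count in rows))

log = [("ann", "/a"), ("bob", "/b"), ("ann", "/c")]

mapped = [row_function(r) for r in log]           # row-level processing
mapped.sort(key=lambda kv: kv[0])                 # group rows by partition key
reduced = [partition_function(k, list(g))
           for k, g in groupby(mapped, key=lambda kv: kv[0])]
print(reduced)  # [('ann', 2), ('bob', 1)]
```

Chaining the two stages directly, as here, mirrors how SQL-MapReduce avoids the job-scheduling overhead of running a series of separate MapReduce jobs.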

A MapReduce function can read/ write external data


10 More detailed information about SQL-MapReduce operation and usage can be found in SQL-MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions, by Friedman, Pawlowski and Cieslewicz.



Non-programmers prefer declarative languages

The debate about when to use Hadoop and when to use an RDBMS is reminiscent of the object versus relational debates of the 1980s, and the reasons are similar. Programmers prefer procedural, programmatic approaches to accessing and manipulating data, i.e., Hadoop MapReduce, whereas non-programmers prefer declarative manipulation languages, i.e., relational DBMSs and SQL. The availability of an SQL-like language in Hadoop Hive and the incorporation of MapReduce functions in relational DBMSs, however, complicate the debate, since Hadoop gains the benefits of a declarative manipulation language, HiveQL, while relational DBMSs gain the advantages of MapReduce.

Figure 6 positions the roles of relational DBMSs and Hadoop Hive. As the figure shows, relational DBMSs are suited to both high- and low-latency SQL queries that need to analyze many terabytes of structured data, whereas Hadoop Hive is suited to handling high-latency queries that process petabytes of multi-structured data.

Figure 6. Relational DBMS versus Hadoop Hive

Using Hive to handle traditional SQL-like queries extends its use to structured data and lower-latency results. In this mode of operation, however, Hive cannot substitute for the functionality, usability, performance and maturity of a relational DBMS. Adding MapReduce to a relational DBMS extends its use to multi-structured data. The volume of multi-structured data that can be processed in a cost-effective manner using MapReduce functions with SQL will be product dependent. As the data volume approaches the petabyte level, a proof-of-concept benchmark will almost certainly be required.


Key considerations are analytical functionality, usability, and workload performance

Figure 6 does not take into account the developer or the user of the investigative computing platform. The discussion at the beginning of this paper focused on the benefits of big data to organizations and how data scientists could exploit this data. The main requirements identified in this discussion were the need for organizations to easily analyze large volumes of multi-structured data with good price/performance and the requirement to make the technologies for doing those analyses more usable by data scientists. When choosing an analytical solution, therefore, analytical functionality and usability also need to be considered in addition to workload performance. The table below looks at these factors from the perspective of the Aster Database and Hadoop Hive.

Comparison tables are by their nature subjective. The table above does, however, identify some important differences between the two environments that should be considered when evaluating solutions for developing and deploying new analytic capabilities. Adding MapReduce to a relational DBMS brings the benefits of this style of processing to those data scientists who find relational technology more familiar and easier to use than Hadoop. The relational DBMS environment (by virtue of its maturity) also has better administration and workload management capabilities.

Many organizations will use both Hadoop and relational DBMSs

Given market interest and the benefits of Hadoop, it is inevitable that most organizations will use a combination of both Hadoop and relational DBMS technologies. It is important for those organizations to develop coexistence strategies to simplify product selection and application deployment decisions.


There are several possible scenarios for using a combination of Hadoop and relational DBMS technologies:

1. Use Hadoop for storing and archiving multi-structured data. A connector to a relational DBMS can then be used to extract required data from Hadoop for analysis by the relational DBMS. If the relational DBMS supports MapReduce functions, these functions can be used to do the extraction. The Aster-Hadoop adaptor, for example, uses SQL-MapReduce functions to provide fast, two-way data loading between HDFS and the Aster Database. Data loaded into the Aster Database can then be analyzed using both SQL and MapReduce.

2. Use Hadoop for filtering, transforming and/or consolidating multi-structured data. A connector such as the Aster-Hadoop adaptor can be used to extract the results of Hadoop processing to the relational DBMS for analysis.

3. Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results to the traditional data warehousing environment, a shared workgroup data store, or a common user interface.

4. Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform. Data scientists can employ the relational DBMS (the Aster Database system, for example) to analyze a combination of structured data and multi-structured data (loaded from Hadoop) using a mixture of SQL processing and MapReduce analytic functions.

5. Use a front-end query tool to access and analyze data that is stored in both Hadoop and the relational DBMS.

Direction is toward integrated Hadoop and relational DBMS appliances

The scenarios above support an environment where the Hadoop and relational DBMS systems are separate from each other and connectivity software is used to exchange data between the two systems.
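The extract-and-load flow of the first scenario can be sketched in miniature, with line-delimited JSON records standing in for multi-structured data archived in HDFS and SQLite standing in for the analytic DBMS. The records and schema are invented; a real deployment would use a connector such as the Aster-Hadoop adaptor rather than hand-written parsing:

```python
import json
import sqlite3

# Invented records standing in for data archived in Hadoop/HDFS.
hdfs_extract = [
    '{"user": "ann", "page_cat_id": 90}',
    '{"user": "bob", "page_cat_id": 10}',
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clicks (user TEXT, page_cat_id INTEGER)")

# "Extract" step: parse each multi-structured record and load it into the
# relational side, where it can be analyzed with ordinary SQL.
for line in hdfs_extract:
    rec = json.loads(line)
    con.execute("INSERT INTO clicks VALUES (?, ?)",
                (rec["user"], rec["page_cat_id"]))

business_clicks = con.execute(
    "SELECT COUNT(*) FROM clicks WHERE page_cat_id = 90").fetchone()[0]
print(business_clicks)  # 1
```

Once loaded, the data participates in joins, filters and aggregations like any other relational table, which is the point of extracting it from the Hadoop archive.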
The direction of the industry over the next few years will be to couple Hadoop and relational DBMS technology more tightly together, both through software such as APIs and metadata integration, and in a single hardware and software appliance that will allow applications to exploit the benefits of both technologies in a single system. Such integration provides many benefits: it eliminates the need to install and maintain multiple systems, reduces data movement, provides a single metadata store for application development, and offers a single interface for both business users and analytical tools.

Impossible to manage all data in an EDW

Big data and associated technologies can bring significant benefits to the business, but as the use of these technologies grows, it will become difficult for organizations to tightly control all of the many forms of data used for analysis and investigation. It will certainly become impossible for organizations to centralize all of the data in a single enterprise data warehouse (EDW), and it will become necessary to distinguish between data that can, and data that cannot, be consolidated into the EDW.

Several solutions for analyzing non-EDW data

For data that remains outside of the EDW environment (for performance or cost reasons, for example), BI developers and data scientists should evaluate the best approach for filtering and/or analyzing the data. Options are to use a relational DBMS optimized for analytic processing, such as the Aster Database; to use a non-relational system, such as Hadoop with Hive; or to analyze in-motion data as it flows through enterprise systems using data streaming technology.

Data sandbox can be used for investigative computing

For EDW data it will be necessary to determine which data needs to be extracted into an underlying data mart or data cube for analysis and reporting by BI users, or into an investigative data sandbox for investigation by data scientists. This latter data sandbox will likely also contain multi-structured data, possibly extracted from a Hadoop system.

Data warehouse framework needs to be extended to support big data

Given these many different approaches and components, organizations must extend their existing EDW infrastructure to support them. This new infrastructure can be thought of as an extended data warehouse. Such an infrastructure is essential if organizations are to reap the full benefits of big data, and to enable data scientists to investigate ways of improving the business and to identify new business opportunities.

I would like to thank Teradata for its support in the publication of this paper.

Note that brand and product names mentioned in this paper may be the trademarks or registered trademarks of their respective owners.

About BI Research

BI Research is a research and consulting company whose goal is to help companies understand and exploit new developments in business intelligence, data management, and collaborative computing.

SQL-MapReduce, Teradata, and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and worldwide. EB-6512 > 0112
