MapReduce and the Data Scientist

Colin White, BI Research, January 2012


Copyright © 2012 BI Research, All Rights Reserved.

MAPREDUCE AND THE DATA SCIENTIST

Big data is big news, and so too is analytics on big data. This paper examines the benefits of big data and the role of the data scientist in deploying big data solutions. It discusses the benefits of Hadoop MapReduce, Hive, and MapReduce extensions to existing relational DBMSs. As an example of the use of MapReduce in a relational DBMS, the paper also reviews the implementation of MapReduce in the Aster Database. It presents a set of scenarios outlining how Hadoop and relational DBMS technology can coexist to provide the benefits of big data and big data analytics to the business and to data scientists.

THE IMPORTANCE OF BIG DATA

Big data is a valuable term despite the hype

The topic of big data is receiving significant press coverage and gathering increasing interest from both business users and IT staff. The result is that big data is becoming an overhyped and overused marketing buzzword. It remains, however, a valuable term, because from an analytics perspective it still represents analytic workloads and data management solutions that could not previously be supported because of cost considerations and/or technology limitations.

There is strong emphasis in big data products on the analysis of multi-structured¹ data from sensors, system and web logs, social computing web sites, text documents, and so forth. Hitherto, these types of data have been difficult to process using traditional analytical processing technologies. Technologies for analyzing big data are evolving rapidly, and there is significant interest in new analytic approaches such as Hadoop MapReduce and Hive, and relational DBMSs that have added MapReduce capabilities. These solutions can bring significant benefits to the business by enabling smarter and faster decision making, and by allowing organizations to achieve faster time to value from their investments in analytical processing technology and products.

Analytics on multi-structured data enable smarter decisions

Smarter decision making comes from the ability to analyze new sources of data to enhance existing analytics and the predictive models created from structured data in operational systems and data warehouses.

Rapid analysis of detailed data provides faster decisions

Faster decisions are enabled because big data solutions support the rapid analysis of high volumes of detailed data. The analysis of large data stores has been difficult to date because it is often impossible to implement in a timely or cost-effective manner. To overcome this issue, organizations have had to aggregate or sample the detailed data before it could be analyzed, which not only increases data latency, but also reduces the value of the results.

¹ Multi-structured data is data that has unknown or ill-defined schemas, i.e., it is usually unmodeled. The term unstructured is also used to refer to this type of data.

Analyzing non-EDW data offers faster time to value

Faster time to value is possible because organizations can now process and analyze data that is outside of the enterprise data warehouse. It is often not practical or cost effective, for example, to integrate large volumes of machine-generated data from sensors and system and web logs into the enterprise data warehouse for analysis. In many cases, analyzing machine-generated data helps identify smaller subsets of high-value information that can be integrated into the enterprise data warehouse.

To realize the full value of big data, organizations need to build an investigative computing platform² that enables business users such as data scientists to ingest, structure and analyze big data to extract useful business information that is not easily discoverable in its raw native form.

THE ROLE OF THE DATA SCIENTIST

"Data scientists turn big data into big value…"

A role or job frequently associated with big data is that of the data scientist. "Data scientists turn big data into big value, delivering products that delight users, and insight that informs business decisions," states Daniel Tunkelang, Principal Data Scientist at LinkedIn.³ Data scientists use a combination of their business and technical skills to investigate big data, looking for ways to improve current business analytics and predictive analytical models, and also for possible new business opportunities. One of the biggest differences between a data scientist and a business intelligence (BI) user – such as a business analyst – is that a data scientist investigates and looks for new possibilities, while a BI user analyzes existing business situations and operations.

Data scientists require a wide range of skills:

• Business domain expertise and strong analytical skills
• Creativity and good communications
• Knowledgeable in statistics, machine learning and data visualization
• Able to develop data analysis solutions using modeling/analysis methods and languages such as MapReduce, R, SAS, and so forth
• Adept at data engineering, including discovering and mashing/blending large amounts of data

Organizations will need to build data science teams

People with this wide range of skills are rare, and this explains why data scientists are in short supply. In most cases, rather than looking for individuals with all of these capabilities, it will be necessary instead to build a team of people that collectively has these skills.

² Teradata Corporation, the sponsor of this paper, uses the term data discovery platform to describe such an environment.
³ "LinkedIn's Daniel Tunkelang On 'What Is a Data Scientist?,'" Dan Woods, Forbes, October 2011. See: www.forbes.com/…/linkedins-daniel-tunkelang-on-what-is-a-data-scientist/.

e. but connected to. What is new about data science is that the use of big data and associated data analysis technologies enables a broader set of business solutions to be addressed.. it will be important for vendors to focus on making big data analysis technologies (such as MapReduce) more approachable to a broader set of business experts. technology. experts in statistics and data mining have been building predictive models for many years for risk analysis. fraud detection. 3 .g. or to deploy a new standalone business area solution that is kept separate from the existing BI/DW environment for business. The final results may be used to improve the existing data warehousing and business intelligence (BI/DW) environment. The concepts behind data science are not new Most of the concepts behind data science and investigative computing are not new. For example. which brings what hitherto has been viewed as a backroom and specialized skill set into the business limelight. Figure 1.MapReduce and the Data Scientist Data scientists use an investigative computing platform (see Figure 1) to bring unmodeled data. existing modeled data from the data warehouse may also be added to the data store. Big data helps broaden the business scope of investigative computing in three areas: • • New sources of data – supports access to multi-structured data. multi-structured data. the traditional data warehousing environment. All Rights Reserved. or cost reasons. Investigative computing Data analysis tools are used to process data in the investigative data store and experiment with different analytical solutions. however. and so forth. For data science to succeed. performance. This data store is separated. into an investigative data store for experimentation. New and improved analysis techniques – enables sophisticated analytical processing of multi-structured data using techniques such as MapReduce and indatabase analytic functions. 
Big data enhances investigative computing Copyright © 2012 BI Research. In some cases.

e. relational DBMSs. To quote the seminal paper on MapReduce: MapReduce is a programming model for automating parallel computing “MapReduce is a programming model and an associated implementation for processing and generating large data sets. The objective is to count the number of squares of each color.” 4 The key point to note from this quote is that MapReduce is a programming model. of nodes in the cluster. Copyright © 2012 BI Research. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity not a programming content/untrusted_dlcp/research.pdf. High performance is achieved by breaking the processing into small units of work that can be run in parallel across the hundreds. The following sections of this paper discuss these approaches and technologies in more detail. the remainder of the “MapReduce: Simplified Data Processing on Large Clusters.googleusercontent. it is designed to be used by programmers.. rather than business users. Colored Square Counter R R P R R P B G B B P G O O R G B P R O split G B B P map R P G B R P R R P B B shuffle/sort P G B G G O B B O B O 3 O 3 reduce R R P R P G 4 3 3 G O O R G P R O map G O P G O O R R The input to the MapReduce process in Figure 2 is a set of colored squares. potentially thousands. Google. i. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. The easiest way to describe how MapReduce works is through the use of an example – see the Colored Square Counter in Figure 2. 4 4 .” Jeffrey Dean and Sanjay Ghemawat. Figure See: static. and integrated hardware/software appliances.MapReduce and the Data Scientist • Improved data management and performance – provides improved price/performance for processing multi-structured data using non-relational systems such as Hadoop. Inc. All Rights Reserved. 
The programmer in this example is responsible for coding the map and reduce programs. WHAT IS MAPREDUCE? MapReduce is a technique popularized by Google that distributes the processing of very large multi-structured data files across a large cluster of machines. Prebuilt MapReduce functions can be used by nonprogrammers The procedural nature of MapReduce makes it easily understood by skilled programmers. there are two splits. Hadoop supports MapReduce processing on several file systems. programmers can provide their own shuffle/sort program and can also deploy a combiner that combines local map output files to reduce the number of output files that have to be remotely accessed across the cluster by the shuffle/sort step. Hadoop also provides Hive and Pig. only one copy of the reduce program is used.MapReduce and the Data Scientist processing is handled by the software system implementing the MapReduce programming model. including the Hadoop Distributed File System (HDFS). These types of applications are often difficult to implement using the standard SQL employed by relational DBMSs. In this example. Multiple map and reduce programs run in parallel on the nodes of the cluster The MapReduce system first reads the input file and splits it into multiple pieces. At Google. These splits are then processed by multiple map programs running in parallel on the nodes of the cluster. . Application examples include indexing and search. and so forth. MapReduce was implemented on top of the Google File System (GFS). Hadoop supports MapReduce development using high-level languages as Hive and Pig One of the main deployment platforms for MapReduce is the open source Hadoop distributed computing framework provided by Apache Software Foundation. WHY USE MAPREDUCE? MapReduce aids organizations in processing and analyzing large volumes of multistructured data. 5 Copyright © 2012 BI Research. These programs may process data stored in different file and database systems. 
machine learning. data transformation. All Rights Reserved. In this example. DataStax. graph analysis. Hortonworks (a spinoff from 5 Source: www. is an open source machine-learning library of “algorithms for clustering. Several vendors offer open source and commercially supported Hadoop distributions. but there may be more in practice. which are high-level languages that generate MapReduce programs. for example. for example. but in a real life scenario. THE HOW OF MAPREDUCE: APPLICATION DEVELOPMENT MapReduce programs are usually written in Java. The role of each map program in this case is to group the data in a split by color. Perl. etc. non-programmers can exploit the value of prebuilt MapReduce applications and function libraries.apache. the number of splits would typically be much higher. but they can also be coded in languages such as C++. To optimize performance. which calculates the sum of the number of squares of each color. Although MapReduce is designed for programmers. The MapReduce system then takes the output from each map program and merges (shuffle/sort) the results for input to the reduce program. It also has the advantage that developers do not have to be concerned with implementing parallel computing – this is handled transparently by the system. examples include Cloudera. Both commercial and open source MapReduce libraries are available that provide a wide range of analytic capabilities.mahout. which was motivated by GFS. R. classification and batch-based collaborative filtering” 5 that are implemented using MapReduce. text analysis. Ruby. Apache Mahout.
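The split/map/shuffle/reduce flow of the Colored Square Counter can be sketched in a few lines of Python. The 20-square input string is arbitrary (one letter per colored square; it is not meant to reproduce the exact counts in Figure 2), and the sequential loops stand in for work that a real MapReduce system would run in parallel across the cluster.

```python
from collections import defaultdict

# Input: a stream of colored squares, encoded as one-letter color codes.
squares = list("RRPRRPBGBBPGOORGBPRO")

def split_input(data, num_splits):
    """Split the input into pieces; each split goes to one map task."""
    size = (len(data) + num_splits - 1) // num_splits
    return [data[i:i + size] for i in range(0, len(data), size)]

def map_task(split):
    """Map: group the squares in one split by color -> (color, count) pairs."""
    counts = defaultdict(int)
    for color in split:
        counts[color] += 1
    return counts.items()

def shuffle(map_outputs):
    """Shuffle/sort: merge map outputs so all pairs for a color arrive together."""
    merged = defaultdict(list)
    for output in map_outputs:
        for color, count in output:
            merged[color].append(count)
    return merged

def reduce_task(color, partial_counts):
    """Reduce: sum the partial counts for each color."""
    return color, sum(partial_counts)

splits = split_input(squares, 2)                      # two splits, as in Figure 2
shuffled = shuffle(map_task(s) for s in splits)
totals = dict(reduce_task(c, v) for c, v in shuffled.items())
print(totals)
```

Note that the programmer supplies only `map_task` and `reduce_task`; the splitting and shuffle/sort plumbing is exactly the part a MapReduce framework provides for free.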

Many of these vendors have added their own extensions and modifications to the Hadoop open source platform.

Some relational DBMSs support the use of MapReduce SQL functions

Another direction of vendors is to support MapReduce processing in relational DBMSs. Supported in the Teradata Aster MapReduce Platform, for example, the Aster Database provides a number of built-in MapReduce functions for use with SQL. These are implemented as in-database analytic functions that can be used in SQL statements. These functions are run inside the database system, which enables them to benefit from the parallel processing capabilities of the DBMS. The platform also includes an interactive development environment, Aster Developer Express, for programmers to create their own MapReduce functions.

MapReduce Development Using Hadoop

Organizations have developed their own non-relational systems to handle big data

Organizations with large amounts of multi-structured data find it difficult to use traditional relational DBMS technology for processing and analyzing this data. This is especially true for Web-based companies such as Google, Yahoo, Facebook, and LinkedIn, who need to process vast volumes, often petabytes, of data in a timely and cost-effective manner. To overcome this issue, several of these organizations developed their own non-relational systems. Google, for example, developed MapReduce and the Google File System. It also built a DBMS system known as BigTable.

There are several different types of non-relational systems – Hadoop with MapReduce is one example

It is important to note that non-relational systems are not new. Many existed even before the advent of relational DBMS technology. What is new is that many of the more recent systems are designed to exploit commodity hardware in a large-scale distributed computing environment and have been made available to the open source community. There are many different types of non-relational systems. This paper focuses specifically on Hadoop and its implementation of MapReduce for analytical processing.

An Introduction to Hadoop

Apache Hadoop consists of several components, each suited to different types of data and different kinds of processing. The ones that are of interest from a database and analytical processing perspective are:

• Hadoop Distributed File System (HDFS): A distributed file system that stores and replicates large files across multiple machine nodes.
• MapReduce: A programming model for distributing the processing of large data files (usually HDFS files) across a large cluster of machines.
• Hive: An SQL-like language (HiveQL) and optimizer that create MapReduce jobs for analyzing large data files.
• Pig: A high-level data flow language (Pig Latin) and compiler that create MapReduce jobs for analyzing large data files.
• HBase: A distributed DBMS modeled after Google BigTable.
• Sqoop: A tool for moving data between a relational DBMS and Hadoop.

HDFS can be a source or target file system for MapReduce programs.

Data availability in HDFS is achieved by replicating the data

Data availability is achieved in HDFS using data replication. This of course increases the amount of storage required to manage the data.
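The replica placement that makes this availability possible can be sketched in Python. The policy shown is the one this paper describes for a replication factor of three (first copy on the writing node, second copy on another node in the same rack, third copy on a node in a different rack); the cluster topology and node names are invented for illustration.

```python
# Sketch of HDFS-style replica placement with a replication factor of three.
# Losing any single node (or even a whole rack) still leaves a copy readable.

def place_replicas(writer_node, topology):
    """topology maps node name -> rack name; returns the three chosen nodes."""
    writer_rack = topology[writer_node]
    # Second copy: another node in the writer's rack.
    same_rack = next(n for n, r in topology.items()
                     if r == writer_rack and n != writer_node)
    # Third copy: any node in a different rack.
    other_rack = next(n for n, r in topology.items() if r != writer_rack)
    return [writer_node, same_rack, other_rack]

topology = {
    "node1": "rack1", "node2": "rack1",
    "node3": "rack2", "node4": "rack2",
}
replicas = place_replicas("node1", topology)
print(replicas)  # ['node1', 'node2', 'node3']
```

The storage cost is visible directly: every block is stored three times, tripling the raw capacity needed for a given data volume.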

By default the replication factor is three. The first copy of the data is written to the node creating the file, the second copy to a node in the same rack, and the third copy to a node in a different rack in the machine cluster. If a node fails during application processing, the required data is simply retrieved by HDFS from another node.

MapReduce jobs are run in parallel and distributed across the nodes of the machine cluster. The Hadoop MapReduce framework handles the scheduling, running and recovery of MapReduce jobs. If a particular copy of a map or reduce program fails, it is simply rerun by the system. The framework attempts to distribute the map program processing so that the required HDFS data is local to the program wherever possible. Reduce program processing involves more internode data access and movement, because it has to process all of the output files created by the mapping process. During processing, map and reduce programs write work data to the local file system to avoid the overhead of HDFS replication.

Existing HDFS data cannot be updated

HDFS supports multiple readers and one writer. The writer can only add data at the end of a file, i.e., it cannot modify existing data within the file. This means that HDFS is best suited to read-only applications that need to scan and read the complete contents of a file (such as MapReduce programs). HDFS does not provide an index mechanism, and the actual location of the data within an HDFS file is transparent to applications and external software. Software built on top of HDFS therefore has little control over data placement or knowledge of data location, which can make it difficult to optimize performance.

Map and reduce programs are developed using low-level APIs

Developers code Hadoop map and reduce programs using low-level procedural interfaces. Hadoop provides an API for use by Java programmers. C++ programmers employ the Hadoop Pipes facility, which uses a sockets interface to Hadoop MapReduce and HDFS. Programs written in other languages use Hadoop Streaming, which allows developers to create and run MapReduce jobs using any program or script. These latter jobs read input (line by line) from stdin and write the output to stdout.

Hive was created to improve the productivity of MapReduce development

Although Hadoop MapReduce offers a powerful capability for processing large amounts of data, coding map and reduce programs using low-level procedural interfaces is time-consuming and restricts MapReduce development to skilled programmers. Ashish Thusoo of Facebook highlights this issue in his June 2009 article explaining why Facebook developed the Hive high-level language for MapReduce:

"Hadoop was not easy for end users, specially for the ones who were not familiar with map/reduce. … Hadoop lacked the expressibility of popular query languages like SQL and as a result users ended up spending hours (if not days) to write programs for typical analysis. It was very clear to us that in order to really empower the company to analyze this data more productively, we had to improve the query capabilities of Hadoop. Bringing this data closer to users is what inspired us to build Hive."⁶

⁶ "Hive – A Petabyte Scale Data Warehouse Using Hadoop," Ashish Thusoo. See: www.facebook.com/note.php?note_id=89508453919.
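The Hadoop Streaming contract described above (the mapper reads lines from stdin and writes tab-separated key/value pairs to stdout, the framework sorts them, and the reducer reads the sorted pairs) can be sketched in Python. The web-log format, with the requested URL in the last field of each line, is invented for illustration.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map side: emit one tab-separated (url, 1) pair per log line."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[-1]}\t1"

def reducer(sorted_pairs):
    """Reduce side: sum the counts for each URL in the sorted pair stream."""
    split_pairs = (pair.rstrip("\n").split("\t") for pair in sorted_pairs)
    for url, group in groupby(split_pairs, key=lambda kv: kv[0]):
        yield f"{url}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as 'map' or 'reduce' under the streaming framework:
    # input arrives on stdin, output goes to stdout.
    stage = mapper if sys.argv[1] == "map" else reducer
    sys.stdout.writelines(line + "\n" for line in stage(sys.stdin))
```

Everything between the mapper and the reducer (splitting the input, sorting, moving data between nodes) is handled by the framework; the script itself only sees text streams, which is why any language that can read stdin can be used.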

Using Hive for MapReduce Development

Hive is one of several high-level languages for developing MapReduce applications. Two others are Pig and JAQL.⁷ Pig was developed by Yahoo and is now a component of the Apache Hadoop project. It includes Pig Latin, which is a high-level data flow and manipulation language for building MapReduce applications. Pig Latin has both procedural (like a programming language) and declarative (as in SQL) statements, and is somewhat similar to (but less powerful than) early fourth-generation languages such as Focus, Nomad and SAS. JAQL was developed by IBM for its BigInsights product. It is a query language designed for JavaScript Object Notation (JSON), a data format that has become popular because of its simplicity. Although IBM has placed JAQL in the open source community, it has gained little traction outside of the IBM environment.

Hive provides an SQL-like query language (HiveQL)

This paper focuses on Hive because it has gained the most acceptance in the industry, and also because its SQL-like syntax makes it easy to use by non-programmers who are comfortable using SQL. Many of the conclusions in this paper about Hive are equally applicable to Pig, JAQL and other high-level languages used for deploying MapReduce solutions.

The main components of Hive are illustrated in Figure 3. HiveQL statements can be entered using a command line or Web interface, or may be embedded in applications that use ODBC and JDBC interfaces to the Hive system. The Hive Driver system converts the query statements into a series of MapReduce jobs.

Figure 3. Hive components (source: Facebook)

⁷ A useful comparison of the computational power, conciseness and performance of Hive, Pig and JAQL can be found in the paper "Comparing High Level MapReduce Query Languages" by Trinder and Loidl of Heriot-Watt University. See: www.macs.hw.ac.uk/….pdf.

A Hive table can be partitioned into separate HDFS files

Data files in Hive are seen in the form of tables (and views), with columns to represent fields and rows to represent records. Tables can be vertically partitioned based on one or more table columns. The data for each partition is stored in a separate HDFS file. Data is not validated against the partition definition during the loading of data into the HDFS file. Partitions can be further split into buckets by hashing the data values of one or more partition columns. Buckets are stored in separate HDFS files and are used for sampling and for building a Hive index on an HDFS file. Hive tables do not support the concepts of primary or foreign keys or constraints of any type. All definitions are maintained in the Hive metastore, which is a relational database such as MySQL.

Data types supported by Hive include primitive types such as integers, floating point numbers, strings and boolean. A timestamp data type is provided in Hive 0.8.0. Hive also supports complex types such as arrays (indexed lists), maps (key value pairs), structs (structures) and user-defined types.

Hive also allows access to data stored in other file and database systems, such as HBase. Access to these data stores is enabled via storage handlers that present the data to Hive as non-native tables. External HDFS files can be defined to Hive as external tables. The HiveQL support (and restrictions) for these non-native tables is broadly the same as that for native tables.

HiveQL supports a subset of SQL SELECT

HiveQL supports a subset of the SQL SELECT statement operators and syntax, including:

• Join (equality, outer, and left semi-joins are supported)
• Union
• Subqueries (supported in the FROM clause only)
• Relational, arithmetic, logical and complex type operators
• Arithmetic, string, date, XPath, and user-defined functions, including aggregate and table functions (UDFs, UDAFs, UDTFs)

Custom map and reduce scripts can be embedded in HiveQL

The Hive MAP and REDUCE operators can be used to embed custom MapReduce scripts in HiveQL queries. An INSERT statement is provided, but it can only be used to load or replace a complete table or table partition. The equivalents of the SQL UPDATE and DELETE statements are not supported.

Although Hive provides a useful subset (and superset) of SQL functionality, it is highly probable that existing SQL-based user applications and vendor software would need to be modified and simplified to work with Hive. When comparing Hadoop Hive to a relational DBMS employing SQL, two areas have to be considered: query language syntax and query performance. Query language syntax is a moving target in both Hadoop Hive and relational DBMS products.

The most important comparison between Hive and SQL is performance

Perhaps the most important comparison between Hadoop Hive and the relational DBMS environment concerns performance. Such comparisons should consider traditional short and/or ad-hoc SQL-like queries running on Hive versus a relational DBMS, and also MapReduce performance on Hive compared with using SQL MapReduce relational DBMS functions for querying and analyzing large volumes of multi-structured data.
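The partitioning and bucketing scheme described above can be sketched in Python. The layout mirrors Hive's column=value directory convention; the table name, the columns, and the use of Python's `hash()` (standing in for Hive's own hash function) are illustrative assumptions, not Hive internals.

```python
NUM_BUCKETS = 4  # assumed bucket count for the sketch

def file_for_row(table, row):
    """Partition directory (col=value) plus a hash-assigned bucket file."""
    partition = f"dt={row['dt']}"
    bucket = hash(row["user_id"]) % NUM_BUCKETS
    return f"/warehouse/{table}/{partition}/bucket_{bucket:05d}"

def prune(all_partitions, column, value):
    """Keep only the partition directories a query predicate needs to scan."""
    return [p for p in all_partitions if f"{column}={value}" in p]

# Loading a row: it lands in exactly one partition directory and one bucket.
row = {"dt": "2012-01-15", "user_id": 42}
path = file_for_row("clicks", row)
print(path)  # /warehouse/clicks/dt=2012-01-15/bucket_00002

# Querying with a predicate on the partition column: only one of the three
# partition directories has to be scanned by the generated MapReduce jobs.
partitions = [f"/warehouse/clicks/dt=2012-01-{d:02d}" for d in (13, 14, 15)]
to_scan = prune(partitions, "dt", "2012-01-14")
print(to_scan)
```

This is the mechanism behind the performance discussion that follows: because queries otherwise scan HDFS files sequentially, partition and bucket pruning is the main lever Hive has for reducing the data read.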

Hive partitions and buckets reduce the amount of data scanned

Hive-generated MapReduce jobs sequentially scan the HDFS files used to store the data associated with a Hive table. In effect, Hive provides an SQL wrapper on top of Hadoop HDFS (and other storage systems), and it has no control or knowledge of the placement or location of data in HDFS files. The Hive optimizer is, however, partition and bucket aware, and so table partitions and buckets can be used to reduce the amount of data scanned by a query and to improve the performance of join processing.

Hive supports compact indexes (in 0.7.0) and bitmapped indexes (in 0.8.0), which aid lookup and range queries, and also enable certain queries to be satisfied by index-only access. Unlike typical relational indexes, the Hive indexes contain data rather than pointers to data. Hive does, however, support partitioned indexes; a table index will need to be rebuilt if the data in a table partition is refreshed. Hive indexes aid the performance of traditional SQL-like queries, rather than MapReduce queries, which by their nature involve sequential processing.

The Hive optimizer uses rules to convert HiveQL queries into a series of MapReduce jobs for execution on a Hadoop cluster. Hints in HiveQL queries can aid the optimization process, and the optimizer also makes it easier to process interrelated files as compared with hand-coding MapReduce procedural logic to do this. The Hive optimizer, however, is still immature and is not fully insulated from the underlying file system, which means that for more complex queries the Hive user is still frequently required to aid the optimizer through hints and certain HiveQL language constructions.

Hive is suited to batch MapReduce processing, rather than short ad-hoc queries

The positioning of Hive is aptly described on the Apache Hive Wiki: "Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like Web logs). What Hive values most is scalability (scale out with more machines added dynamically to the Hadoop cluster), extensibility (with MapReduce framework and UDF/UDAF/UDTF), fault-tolerance, and loose-coupling with its input formats."⁸

The primary use case for Hadoop Hive is the same as that for Hadoop MapReduce, which is the sequential processing of very large multi-structured data files. It is not well suited to ad-hoc queries where the user expects fast response times.

Hive reduces the time it takes to deploy MapReduce solutions

The main benefit of Hive is that it dramatically improves the simplicity and speed of MapReduce development. There are many other tools for improving the usability of Hadoop, for example, Informatica HParser for data transformation, Pentaho Business Analytics and Revolution RevoConnectR for analytical processing, and Karmasphere Studio. Most of these tools are front ends to Hadoop MapReduce and HDFS, and so many of the considerations discussed above for Hive apply equally to these products.

Relational DBMS MapReduce Development

Before discussing how relational DBMSs support MapReduce, it is useful to first compare how Hadoop Hive and a relational DBMS process queries. Knowledge of the way Hive and relational DBMSs process queries is useful when discussing the performance of the two approaches.

⁸ Source: cwiki.apache.org.

The left side of the diagram in Figure 4 shows the operation of Hadoop Hive outlined in the previous section. The Hive rules-driven optimizer processes HiveQL statements and generates one or more MapReduce jobs to carry out the required query operations. These jobs are passed to the Hadoop system, where they are scheduled and managed by Hadoop's job and task tracker capabilities. The MapReduce jobs read and process HDFS data and pass the results back to the Hive application.

Figure 4. Hadoop Hive and relational DBMS query processing

Relational DBMSs provide SQL users with physical data independence

With the SQL application shown on the right side of the diagram, the relational mapping layer provides an interface between the relational logical view of data and the data management software used to create and manipulate physical data in the storage subsystem. In general, the logical and physical views of data in a relational system are completely isolated from each other in order to provide physical data independence. This enables the DBMS to control data placement and location, and to add performance features such as indexing, columnar storage, buffer management and workload management, without affecting existing applications. This is different from Hadoop Hive, since Hive has little control of the use of HDFS. As discussed earlier, the isolation in Hadoop Hive is not as complete, which reduces data independence, but may give the developer more control for certain complex queries.

Relational DBMSs manage the complete data flow from disk to the user

Another advantage of the flexibility of relational DBMS runtime and storage engines is that the software vendor controls the complete data flow from the disk to the application. The data independence of a relational DBMS also means that vendors can extend or add runtime and data storage engines without affecting existing applications. The open source MySQL relational DBMS, for example, offers several data storage engines. Commercial vendors have also extended and added new data storage engines to support XML data, online analytical processing (OLAP), hybrid storage, and so forth. The Teradata Aster Database includes a runtime engine for MapReduce processing, as well as a data store that enables both row and columnar data storage.

Relational DBMSs also provide features to handle security, availability (without the need for replication), backup and recovery, and so forth. This is different from Hadoop Hive, since Hive has little control over the use of HDFS, which makes performance optimization more difficult. Hive is also less mature and has fewer administration tools.

Although data independence makes life easier for non-programmers, the downside of relational DBMS data independence is that experienced developers have little or no control over how data is accessed and processed. Instead, they have to rely on the relational optimizer to make the right decisions about how data is to be accessed. This is why many expert programmers dislike declarative languages, such as SQL, that identify what data is required, but not how it should be accessed. Programmers prefer to use procedural programming approaches and to have more control over access to the data. The developer also has to deal with converting relational result sets to the record-at-a-time nature of procedural programming.

Relational database vendors have spent many years of effort and significant resources improving the quality of SQL optimization. Most relational optimizers today use sophisticated cost-based optimization and use statistics about the data to determine the most appropriate access paths to the data. Many also dynamically rewrite user SQL statements to improve performance. Despite all this effort, however, relational optimization is not perfect. There are some types of queries and some types of data that are difficult to optimize for good performance. Queries involving the processing of financial time series information and social network graph analysis are examples.

To overcome optimization issues and extend the scope of SQL, many relational DBMS products provide the ability for developers to add user-defined procedures and functions to the DBMS. These may be written in scripting languages (some of which are proprietary) or programming languages such as Java or C++. Several relational DBMS vendors are now adding to their products the ability to code user-defined functions in procedural languages (Java, for example) that can employ techniques such as MapReduce. Some products include interactive development environment tools to aid in the development of these procedures and functions.

Most relational DBMS vendors provide a range of prebuilt functions with their products. Third-party vendors, such as FuzzyLogix and Revolution Analytics, also market libraries of analytic functions for several relational DBMS products. Rick van der Lans of R/20 Consultancy outlines many more in his paper, "Using SQL-MapReduce® for Advanced Analytical Queries."9 The benefit of both programmer-developed and vendor-provided functions is that SQL users, such as data scientists, only need to understand what a function does and how to use it. They do not need to know how to develop the functions. This reduces the skill level requirements for data scientists, and also provides faster time to value.

9 "Using SQL-MapReduce® for Advanced Analytical Queries," Rick van der Lans, R/20 Consultancy, September 2011. See: www.asterdata.com.
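The style of processing that such MapReduce-capable user-defined functions encapsulate can be illustrated without any DBMS at all. The sketch below is a minimal, single-process Python simulation, not any vendor's API; the record layout and function names are invented for illustration. A map step emits key/value pairs from each record, a shuffle groups them by key, and a reduce step aggregates each group.

```python
from collections import defaultdict

# Hypothetical detail records, standing in for rows scanned by the DBMS.
records = [
    {"user": "u1", "page": "home"},
    {"user": "u1", "page": "products"},
    {"user": "u2", "page": "home"},
    {"user": "u1", "page": "checkout"},
]

def map_fn(record):
    """Map step: emit zero or more (key, value) pairs per input record."""
    yield (record["user"], 1)

def reduce_fn(key, values):
    """Reduce step: aggregate all values emitted for one key."""
    return (key, sum(values))

def run_mapreduce(rows, map_fn, reduce_fn):
    # Shuffle phase: group the mapped values by key.
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            groups[key].append(value)
    # Reduce phase: one call per key; each group is independent,
    # which is what makes the model easy to parallelize.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_mapreduce(records, map_fn, reduce_fn))  # → {'u1': 3, 'u2': 1}
```

Because each key's group is reduced independently, the reduce calls could run on separate nodes — the property that both Hadoop and in-database MapReduce implementations exploit for parallelism.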

User-defined functions in SQL statements are usually black boxes to the SQL optimizer. All the optimizer needs to do is create an execution plan that passes data to the function and handle any result data. It is important, therefore, to understand how the run-time engine implements these functions. Does the DBMS or an external runtime engine execute the function? Does the function exploit the parallel processing capabilities of the DBMS? Is the function run in a separate memory space from the DBMS in order to prevent errant functions from affecting DBMS performance and availability?

Teradata Aster MapReduce Platform

The increasing interest and popularity of MapReduce has led some relational DBMS vendors to support MapReduce functions inside the DBMS. This capability not only offers the benefits outlined above for deploying user-defined functions, i.e., the ability to process multi-structured data using SQL, but also adds the advantages of MapReduce to the relational DBMS environment. The Teradata Aster Database is an example of a product that supports MapReduce, and is a leader in the use of MapReduce in a relational DBMS environment. This product will be used to explain how MapReduce is supported using relational technology.

The SQL-MapReduce capability couples SQL with MapReduce to provide a framework for running sophisticated analyses on a diverse set of both structured and multi-structured data. SQL-MapReduce preserves the declarative and storage-independence benefits of SQL, while exploiting the power of the MapReduce procedural approach to extend SQL's analytic capabilities. At the same time, it brings the maturity of relational DBMS technology to MapReduce processing.

SQL-MapReduce functions are run by a separate execution engine within the Aster Database environment and are automatically executed in parallel across the nodes of the machine cluster. Custom functions can be written in a variety of languages (C, C++, C#, Java, Python, R, Perl, etc.) for use in both batch and interactive environments. An interactive development environment tool (Aster Developer Express) helps reduce the effort to build and test custom functions. This tool can also be used to import existing Java MapReduce programs for deployment in the Aster Database system. These latter programs may need to be modified to fit into the Aster MapReduce implementation.

The SQL-MapReduce facility also includes a library of some 50 prebuilt analytic functions to speed the deployment of analytic applications. Functions are provided for path, pattern, statistical, graph, text and cluster analysis, and for data transformation. One of the key objectives of the Aster Database is to make it easier for less experienced users to exploit the analytical capabilities of existing and packaged MapReduce functions without the need to know how these functions are programmed.

The SQL query in Figure 5 shows the use of the prepackaged Aster Database nPath function to process Web log clickstream data to determine the number of users who start at the home page (page_id=50) and exclusively access the business pages (page_cat_id=90) of the Web site.
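The behavior of the nPath query just described can be approximated in ordinary code. The sketch below is plain Python, not Aster's nPath implementation; the sample rows and the helper name are invented, and only the page_id=50 (home, H) and page_cat_id=90 (business, B) conditions are taken from the example.

```python
from itertools import groupby
from operator import itemgetter

# Invented clickstream rows: (user_id, ts, page_id, page_cat_id).
clickstream = [
    ("u1", 1, 50, 10), ("u1", 2, 61, 90), ("u1", 3, 62, 90),
    ("u2", 1, 50, 10), ("u2", 2, 70, 20),   # u2 leaves the business pages
    ("u3", 1, 50, 10), ("u3", 2, 63, 90),
]

def npath_home_then_business(rows):
    """Rough analogue of PATTERN('H.B*'): per user, a session that
    starts at the home page (H) and then touches only business pages (B).
    Returns the B-page count for each matching user, descending."""
    counts = []
    rows = sorted(rows, key=itemgetter(0, 1))           # by user, then ts
    for user, session in groupby(rows, key=itemgetter(0)):
        session = list(session)
        head, rest = session[0], session[1:]
        is_h = head[2] == 50                             # page_id = 50 → H
        all_b = all(r[3] == 90 for r in rest)            # page_cat_id = 90 → B
        if is_h and all_b:
            counts.append(len(rest))                     # Count(*) of ANY(B)
    return sorted(counts, reverse=True)                  # ORDER BY B_count DESC

print(npath_home_then_business(clickstream))  # → [2, 1]
```

The point of the packaged function is that an analyst writes only the SQL invocation; the sessionization and pattern matching shown procedurally here are hidden inside the function and parallelized by the engine.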

Figure 5. SQL example showing the use of the Aster nPath function:

SELECT B_count
FROM npath (ON clickstream
            PARTITION BY user_id
            ORDER BY ts
            MODE (NONOVERLAPPING)
            PATTERN ('H.B*')
            SYMBOLS (page_id=50 as H, page_cat_id=90 as B)
            RESULT Count(*) of ANY(B) as B_count)
ORDER BY B_count desc;

The results from functions such as nPath are the equivalent of those produced by an SQL subquery, i.e., they are a relational table. They can, therefore, be used like any other table in an SQL query – joined to rows from other tables, filtered using the WHERE clause, grouped with the GROUP BY clause, and so on. These functions can be chained together to provide the same capabilities as a series of MapReduce jobs, but without the overheads of job scheduling. SQL-MapReduce queries can be entered and the results visualized using any third-party tool that supports the building of custom SQL statements, Tableau, for example.

The equivalent of a map program in SQL-MapReduce is a row function, and a reduce program is known as a partition function. With a row function, each row from the input table is handled by exactly one instance of the function. With a partition function, each group of rows defined by the PARTITION BY clause (see the nPath example above) will be handled by exactly one instance of the function. Multiple instances of a function are run in parallel across nodes and the threads within a node. The processing of input rows is distributed among the threads, and output rows are collected from all of them. The runtime engine therefore manages parallelism at either the row level or the partition level, depending on the type of function being used.

SQL-MapReduce functions are self-describing and support late-binding, which means they can be written without any knowledge of the structure of the data that will be processed by the function. The formats of the input and output data are determined programmatically by a function at query runtime. This approach is ideally suited to analyzing unmodeled, multi-structured data. Functions also manage and control their own memory and file structures, and are run by a separate process to reduce the likelihood of errant functions damaging system operations.

SQL-MapReduce functions can also read and write data from/to external data sources. A function or set of functions could be used, for example, to read, transform and load external data into a DBMS table for processing and analysis, or to write the results to an external output file.10

10 More detailed information about SQL-MapReduce operation and usage can be found in "SQL-MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions," by Friedman, Pawlowski and Cieslewicz. See: www.asterdata.com.
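The row/partition distinction can be sketched as follows. This is a minimal single-process Python analogue, not the Aster execution engine; the row layout, the 20% markup, and both function names are invented for illustration. A row function sees one input row per invocation; a partition function sees one whole PARTITION BY group per invocation.

```python
from collections import defaultdict

# Invented input rows: (user_id, amount).
rows = [("u1", 10.0), ("u2", 5.0), ("u1", 2.5)]

def row_function(row):
    """Analogue of a map-style row function: exactly one input row
    per invocation, emitting zero or more output rows."""
    user, amount = row
    yield (user, round(amount * 1.2, 2))   # e.g. add a 20% markup

def partition_function(key, part_rows):
    """Analogue of a reduce-style partition function: one invocation
    per PARTITION BY group, seeing all of that group's rows."""
    yield (key, sum(amount for _, amount in part_rows))

# The engine would run many instances in parallel; here we run serially.
mapped = [out for r in rows for out in row_function(r)]

partitions = defaultdict(list)           # PARTITION BY user_id
for r in mapped:
    partitions[r[0]].append(r)

reduced = [out for k in sorted(partitions)
           for out in partition_function(k, partitions[k])]
print(reduced)  # → [('u1', 15.0), ('u2', 6.0)]
```

Because each row (and each partition) is processed by exactly one independent instance, the engine is free to distribute instances across nodes and threads, which is the parallelism the text describes.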

WHICH TO USE WHEN?

The debate about when to use Hadoop and when to use an RDBMS is reminiscent of the object versus relational debates of the 1980s, and the reasons are similar. Programmers prefer procedural programmatic approaches for accessing and manipulating data, i.e., Hadoop MapReduce, whereas non-programmers prefer declarative manipulation languages, i.e., relational DBMSs and SQL. The availability of an SQL-like language in Hadoop Hive and the incorporation of MapReduce functions in relational DBMSs, however, complicate the debate, since Hadoop gains the benefits of a declarative manipulation language, i.e., HiveQL, while relational DBMSs gain the advantages of MapReduce.

Figure 6 positions the roles of relational DBMSs and Hadoop Hive. As the figure shows, relational DBMSs are suited to both high- and low-latency SQL queries that need to analyze many terabytes of structured data, whereas Hadoop Hive is suited to handling high-latency queries that process petabytes of multi-structured data.

Figure 6. Relational DBMS versus Hadoop Hive

Using Hive to handle traditional SQL-like queries extends its use to structured data and lower-latency results. In this mode of operation, however, Hive cannot substitute for the functionality, usability, performance and maturity of a relational DBMS. Adding MapReduce to a relational DBMS extends its use to multi-structured data. The volume of multi-structured data that can be processed in a cost-effective manner using MapReduce functions with SQL will be product dependent. As the data volume approaches the petabyte level, a proof of concept benchmark will almost certainly be required.

Figure 6 does not take into account the developer or the user of the investigative computing platform. Adding MapReduce to a relational DBMS brings the benefits of this style of processing to those data scientists who find relational technology more familiar and easier to use than Hadoop. When choosing an analytical solution, therefore, analytical functionality and usability also need to be considered in addition to workload performance. The table below looks at these factors from the perspective of the Aster Database and Hadoop Hive.

Comparison tables are by their nature subjective. The table above does, however, identify some important differences between the two environments that should be considered when evaluating solutions for developing and deploying new analytic capabilities. The relational DBMS environment (by virtue of its maturity) also has better administration and workload management capabilities.

The discussion at the beginning of this paper focused on the benefits of big data to organizations and how data scientists could exploit this data. The main requirements identified in this discussion were the need for organizations to easily analyze large volumes of multi-structured data with good price/performance and the requirement to make the technologies for doing those analyses more usable by data scientists. Given market interest and the benefits of Hadoop, it is inevitable that most organizations will use a combination of both Hadoop and relational DBMS technologies. It is important for those organizations to develop coexistence strategies to simplify product selection and application deployment decisions.

COEXISTENCE STRATEGIES

There are several possible scenarios for using a combination of Hadoop and relational DBMS technologies:

1. Use Hadoop for storing and archiving multi-structured data. A connector to a relational DBMS can then be used to extract required data from Hadoop for analysis by the relational DBMS. The Aster-Hadoop adaptor, for example, uses SQL-MapReduce functions to provide fast, two-way data loading between HDFS and the Aster Database. Data loaded into the Aster Database can then be analyzed using both SQL and MapReduce.

2. Use Hadoop for filtering, transforming and/or consolidating multi-structured data. A connector such as the Aster-Hadoop adaptor can be used to extract the results from Hadoop processing to the relational DBMS for analysis. If the relational DBMS supports MapReduce functions, these functions can be used to do the extraction.

3. Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results to the traditional data warehousing environment.

4. Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform. Data scientists can employ the relational DBMS (the Aster Database system, for example) to analyze a combination of structured data and multi-structured data (loaded from Hadoop) using a mixture of SQL processing and MapReduce analytic functions.

5. Use a front-end query tool to access and analyze data that is stored in both Hadoop and the relational DBMS.

The scenarios above support an environment where the Hadoop and relational DBMS systems are separate from each other and connectivity software is used to exchange data between the two systems. The direction of the industry over the next few years will be to more tightly couple both Hadoop and relational DBMS technology together – both through software such as APIs and metadata integration, and through a single hardware and software appliance that will allow applications to exploit the benefits of both technologies in a single system. Such integration provides many benefits – eliminating the need to install and maintain multiple systems, reducing data movement, providing a single metadata store for application development, and providing a single interface for both business users and analytical tools.

CONCLUSIONS

Big data and associated technologies can bring significant benefits to the business, but as the use of these technologies grows, it will become difficult for organizations to tightly control all of the many forms of data used for analysis and investigation. It will certainly become impossible for organizations to centralize all of the data in a single enterprise data warehouse (EDW), and it will become necessary to distinguish between data that can, and that cannot, be consolidated into the EDW.

For data that remains outside of the EDW environment (for performance or cost reasons, for example), BI developers and data scientists should evaluate the best approach to use for filtering and/or analyzing the data. Options are to use a relational DBMS optimized for analytic processing such as the Aster Database, a non-relational system such as Hadoop with Hive, or to analyze in-motion data as it flows through enterprise systems using data streaming technology.
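The filter-then-analyze coexistence pattern (scenario 2 above) can be sketched end to end with standard-library tools. This is a toy stand-in, not a connector product: a regular expression plays the role of the Hadoop filtering job, Python's bundled SQLite plays the role of the analytic DBMS, and the log lines are invented.

```python
import re
import sqlite3

# Raw multi-structured log lines — stand-ins for data held in Hadoop.
raw_logs = [
    'u1 GET /products/42 200',
    'malformed line ###',
    'u2 GET /home 404',
    'u1 GET /checkout 200',
]

# Step 1 (Hadoop-style filter/transform): keep only lines that match a
# known layout and parse them into structured tuples.
pattern = re.compile(r'^(\w+) (GET|POST) (\S+) (\d{3})$')
parsed = [m.groups() for line in raw_logs if (m := pattern.match(line))]

# Step 2 (connector-style load): move the structured results into a
# relational table; an in-memory SQLite database stands in for the DBMS.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE clicks (user_id TEXT, method TEXT, url TEXT, status INT)')
db.executemany('INSERT INTO clicks VALUES (?, ?, ?, ?)', parsed)

# Step 3: analyze the loaded data with ordinary SQL.
rows = db.execute(
    'SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id ORDER BY user_id'
).fetchall()
print(rows)  # → [('u1', 2), ('u2', 1)]
```

The design point the scenario makes is the division of labor: the cheap, schema-tolerant system absorbs and filters the raw multi-structured data, and only the structured residue is loaded into the relational engine for SQL analysis.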

For EDW data, it will be necessary to determine the data that needs to be extracted into an underlying data mart or data cube for analysis and reporting by BI users, or into an investigative data sandbox for investigation by data scientists. This latter data sandbox will likely also contain multi-structured data, possibly extracted from a Hadoop system.

Given these many different approaches and components, organizations must extend their existing EDW infrastructure to support them. This new infrastructure can be thought of as an extended data warehouse. Such an infrastructure is essential if organizations are to reap the full benefits of big data.

I would like to thank Teradata for its support in the publication of this paper.

About BI Research

BI Research is a research and consulting company whose goal is to help companies understand and exploit new developments in business intelligence, data management, and collaborative computing.

Note that brand and product names mentioned in this paper may be the trademarks or registered trademarks of their respective owners. Teradata, SQL-MapReduce, and the Teradata logo are registered trademarks of Teradata Corporation and/or its affiliates in the U.S. and worldwide.

EB-6512 > 0112
