1 Introduction
Increasingly, businesses are collecting data about their operations, often without even
being aware of it. From customer sales data to employee performance data, the typical business is
not taking full advantage of the deep insights that this data can provide. There is a silver lining,
however, as businesses have recently begun to recognize the value of the data sitting within their
organizations. IDC estimates that the worldwide market for business analytics is worth
$25 billion in 2009, a growth of 4 percent over 2008 (IBM, 2009). IT service providers attuned to
their customers' needs are picking up on the industry's growing interest by providing solutions for
the various functional areas of a business. Companies are using analytics in functions such as
supply chain, customer relationship management and pricing. Amazon is a good example of a
company that has leveraged the power of data analytics to provide customized recommendations
to its customers.
In the recent Global CIO Study 2009 by IBM, a survey of global CIOs, eighty-three percent of
respondents identified business intelligence and analytics as a competitive advantage for their
organizations. In addition, IBM CIO Pat Toole commented that CIOs are investing in business
analytics capabilities to improve decision making, and that such capabilities can be key to entering
new growth markets (IBM, 2009). In The McKinsey Quarterly (2007), the consulting firm identified
analytics as one of eight business technology trends to watch in the next decade. Davenport (2006)
highlighted that as firms in many industries offer similar products using comparable technologies,
business processes are among the last remaining points of differentiation, and analytics lets
competitors extract the most value from those processes by making the best decisions at every
level of the firm. Now that businesses are waking up to this new reality, they are looking for
solutions that can help them build this capability. MapReduce is one framework for doing analytics
on large amounts of data, and Hadoop is a solution that implements this framework. Other
solutions include building data warehouses and implementing database clusters, both of which
require a large capital investment and high operational costs. In this paper, we use Hadoop, an
Apache Foundation project written in the Java programming language, for doing data analytics on
a large dataset.
This paper builds on a collaboration started two years ago between the Singapore Management
University School of Information Systems (SMU SIS) and a large taxi company. The goal was to gain
better insights into the dynamics of a taxi network by applying data analytics to the GPS location
data that the company provides. Leveraging the ongoing efforts with the company, this project uses
the MapReduce model to accelerate the study of these dynamics. The current process for
discovering these dynamics is to query a large database, which is extremely slow given the size of
the dataset. The dataset currently stands at a couple of hundred gigabytes, and a typical query
covers only a day's worth of data yet takes about half a day to complete.
MapReduce is one of the methods used in distributed data analysis, first made popular through its
use by Google for crawling and indexing the web. In 2004, Google published a paper titled
"MapReduce: Simplified Data Processing on Large Clusters", which sparked off a series of events
that led to the formation of Hadoop. Hadoop forms the basis of the data analytics used in this
project.
2 Hadoop
Hadoop has saved tremendous amounts of time and money in several published use cases. The New
York Times used 100 machines on the Amazon Elastic Compute Cloud to convert 4 terabytes of
scanned archives into PDFs within 24 hours. In a contest of speed, Hadoop broke the world record in
2008 for the fastest system to sort a terabyte of data, with a time of 209 seconds. In the following
year, it took just 62 seconds to perform the same feat.
The Hadoop project consists of various sub-projects (Figure 1), each with a specific use of
MapReduce. The use cases introduced later will make use of the Hadoop Core sub-project, which
allows a developer to create programs that do data crunching on the Hadoop cluster. The
MapReduce API (Figure 2) sets up the mapping and reduction process for the developer. All the
developer has to do is write code within the map and reduce functions to determine the operations
that will be executed on the incoming data. This will become clearer in the use cases later.
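As a rough sketch of this division of labour, the map and reduce steps can be simulated in plain Java without the Hadoop API. The class and method names below are illustrative stand-ins, not Hadoop's own; the driver plays the role the framework normally plays (grouping pairs by key between the two phases), and the developer-supplied logic lives only in map() and reduce().

```java
import java.util.*;

// Illustrative, non-Hadoop sketch of the MapReduce contract.
public class MapReduceSketch {

    // map(): emit one <key, value> pair per input record.
    // Here each line maps to <line, 1>, as in a simple frequency count.
    static Map.Entry<String, Integer> map(String line) {
        return Map.entry(line.trim(), 1);
    }

    // reduce(): combine all values that share a key into one result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Driver standing in for the framework: run map() over the input,
    // group the emitted pairs by key, then run reduce() on each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> pair = map(line);
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

In real Hadoop the grouping and shuffling happen across machines, but the developer's view is the same: fill in the two functions and let the framework do the rest.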
Hadoop provides an administration interface (Figure 4) to query the status and details of a job. The
detailed view includes various performance indicators, such as bytes read from the HDFS shared
filesystem. It also keeps a history of all the jobs that have been submitted previously.
Figure 4: The admin interface
3 Use Cases
Two use cases will be presented: the first generates secondary data from the primary data source,
and the second generates a frequency analysis of the GPS locations. These two use cases are chosen
because the results require a sweep of the entire dataset, something Hadoop is particularly good at.
This raises the question of when to use a traditional RDBMS and when to use MapReduce. Broadly,
an RDBMS favors interactive access and data that is updated many times, whereas MapReduce
favors batch jobs over data that is written once and read many times, with no fixed schema. In both
of these use cases, we simply need to output a single file that contains the results, which fits the
write-once update characteristic. As you will see later, the raw data is cleaned and processed before
being operated on, though this is not strictly necessary, as Hadoop accommodates a flexible,
dynamic schema and operates well on plain text files.
The raw dataset provided by the taxi company is in the following format:
Date time, vehicle no., driver ID, long, lat, speed, status
E.g.:
01/03/2009 00:00:00, SH1234S, 1809481,103.94063, 1.32617, 0, PAYMENT
This dataset is cleaned of errors and processed for anonymity; the cleaned dataset was provided to
us, and the cleaning process itself is not part of this paper. The final record format that the use
cases operate on is as follows:
LogSerialNo,Datetime,vehicleID,driverID,long,lat,speed,status,week,DayOfWeek,day,hour
E.g.:
20090301000000000,2009/03/01 00:00:00,454,1809481,103.94063,1.32617,0,3,1520,0,01,00
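For illustration, a record in this final format can be split into its twelve fields with a small helper. The field names follow the layout above; the class itself is our sketch, not part of the project's actual code.

```java
// Illustrative parser for the cleaned record format described above.
public class TaxiRecord {
    public final String logSerialNo, datetime, vehicleId, driverId;
    public final double lng, lat, speed;
    public final String status, week, dayOfWeek, day, hour;

    private TaxiRecord(String[] f) {
        logSerialNo = f[0]; datetime = f[1]; vehicleId = f[2]; driverId = f[3];
        lng = Double.parseDouble(f[4]);
        lat = Double.parseDouble(f[5]);
        speed = Double.parseDouble(f[6]);
        status = f[7]; week = f[8]; dayOfWeek = f[9]; day = f[10]; hour = f[11];
    }

    // Parse one comma-separated line; the format has exactly 12 fields.
    public static TaxiRecord parse(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != 12)
            throw new IllegalArgumentException(
                "expected 12 fields, got " + fields.length);
        return new TaxiRecord(fields);
    }
}
```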
The hardware configuration of each node in the Beowulf-5 cluster is the same and is as follows:
Processor: 2 x AMD Opteron 250
RAM: 4 GB
1. Mapper. Each line in the file is read and presented as a <key, value> pair to the Mapper. In
this instance, the key is the byte position of the start of the line and the value is the line itself.
We parse the line, extract the vehicle ID and log serial number, and combine the two into a
custom key. A <VehicleSnPair, value> pair is then written out, where the value is the original
line.
2. The GroupComparator sorts the records by vehicle ID. The output is the same key-value pair
as above.
3. The SortComparator then sorts by log serial number within each vehicle ID. The output is the
same key-value pair as above.
4. The output from the SortComparator usually goes straight to the Reducer, but in this use
case a custom Partitioner is implemented to determine which machine each vehicle ID goes to.
5. The Reducer writes the relevant information to the output.
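The ordering that steps 2 and 3 produce can be sketched with a plain-Java version of the composite key from step 1. In the real job this would be a Hadoop WritableComparable with separate grouping and sorting comparators; here, as a simplification, both criteria are folded into a single compareTo(), so that sorting a list of keys reproduces the combined effect.

```java
// Plain-Java stand-in for the custom composite key: records are ordered
// first by vehicle ID (the grouping criterion), then by log serial number
// (the sorting criterion within each vehicle).
public class VehicleSnPair implements Comparable<VehicleSnPair> {
    final int vehicleId;
    final long logSerialNo;

    public VehicleSnPair(int vehicleId, long logSerialNo) {
        this.vehicleId = vehicleId;
        this.logSerialNo = logSerialNo;
    }

    @Override
    public int compareTo(VehicleSnPair other) {
        // GroupComparator's role: cluster records of the same vehicle.
        int byVehicle = Integer.compare(vehicleId, other.vehicleId);
        if (byVehicle != 0) return byVehicle;
        // SortComparator's role: order by serial number within a vehicle.
        return Long.compare(logSerialNo, other.logSerialNo);
    }

    @Override
    public String toString() {
        return vehicleId + "/" + logSerialNo;
    }
}
```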
The algorithm is executed on varying input file sizes, from 16MB to 30GB. The results are shown in
two separate graphs below, from 16MB to 1GB (Figure 7) and from 1GB to 30GB (Figure 8). In Figure
7, there is an increasing marginal return as the file size grows from 16MB to 64MB, due to the file
split size of 64MB. Beyond 64MB, the time taken grows roughly linearly with file size, which is also
observed in Figure 8. The jobs are executed on a 5-slave-node configuration (denoted n=5). Each
job is executed three times and the average time is taken.
[Chart: average completion time in seconds versus file size; measured times rise from 29.00 s at the smallest sizes through 37.67, 44.00, 45.00, 48.67, 61.00 and 87.00 s to 98.67 s at 1024 MB]
Figure 7: Average job completion time on 5 slave nodes for file sizes between 16MB to 1024MB
Avg time, n=5, size=1GB to 30GB
[Chart: average completion time rises from 98.67 s at 1 GB through 325.00, 671.67, 1,022.33, 1,314.67 and 1,717.00 s to 2,792.67 s at 30 GB]
Figure 8: Average job completion time on 5 slave nodes for file sizes between 1GB to 30GB
Figures 9 and 10 show the same job executed on 3 and 4 slave nodes to understand how the
number of nodes affects job completion time. Figure 9 shows the overall trend, which is relatively
linear. Figure 10 shows a smaller range of file sizes. There is no significant change in the pattern,
other than the increased time taken with fewer slave nodes.
Figure 9: Average job completion time over 3, 4 and 5 nodes for file sizes 16MB to 8096MB
Figure 10: Average job completion time over 3, 4 and 5 nodes for file sizes 16MB to 1024MB
From the above graphs we note that there is an increasing marginal return as the file size increases.
The point at which the increase stops tells us the optimum file size to work on. Figure 11 shows that
from a file size of 4096MB onwards, throughput is constant and there is no further increase in
marginal returns from larger file sizes. The results are, however, limited, as we would normally
expect to see a point of diminishing marginal returns. Given more time, it may be possible to find
the point at which that happens, though Figure 8 does show that performance scales almost
linearly up to 30GB.
Figure 11: Throughput over 3, 4 and 5 nodes for file sizes 16MB to 8096MB
As the size of the analysis required for gathering insights into the taxi data increases, Hadoop has
been shown to scale linearly to a month's worth of data. Further experiments will be needed to
determine whether Hadoop can scale to a year's worth of data, because patterns observed over
that time period would prove useful for answering research questions.
3.3 Use Case 2
The second use case uses the GPS data to plot a colored frequency map of Singapore. The taxi
company has divided Singapore into 86 different zones, most of them bordering neighborhood
estates. The colored maps help identify the zones with the highest frequency of taxi activity; in this
use case, the color represents the number of times a particular record appears in the log file. The
following illustration (Figure 12) shows a representation of the algorithm.
Much simpler than the previous use case, this one consists of only a Mapper and a Reducer. In
addition, however, it uses the JTS Topology Suite for simple GIS queries.
1. Mapper. It goes through each zone to check whether the current record falls within it. The
zone array consists of polygon definitions for each of the 86 zones (shown in Figure 13). If a
zone is found, it writes a <key, value> pair where the key is the zone number; if no zone is
found, possibly due to anomalous data, the key is -1.
2. Reducer. It simply counts the number of records for each zone.
The output file is then fed into a JSP web application where it is parsed and colored using the Google
Maps API (Figure 14).
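The mapper's point-in-zone test and the reducer's count can be sketched in plain Java. Here a standard ray-casting point-in-polygon test stands in for the GIS library's polygon query, the zones are indexed by array position rather than by the company's zone numbers, and the two-zone map in the usage example is made up, not the real 86-zone definition.

```java
import java.util.*;

// Illustrative zone-frequency count mirroring the Mapper/Reducer pair above.
public class ZoneFrequency {

    // Classic ray-casting test: cast a ray from (x, y) and count how many
    // polygon edges it crosses; an odd count means the point is inside.
    static boolean contains(double[][] polygon, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = polygon.length - 1; i < polygon.length; j = i++) {
            double xi = polygon[i][0], yi = polygon[i][1];
            double xj = polygon[j][0], yj = polygon[j][1];
            if ((yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi) {
                inside = !inside;
            }
        }
        return inside;
    }

    // "Mapper": find the zone index of each point (-1 if no zone matches);
    // "Reducer": count the points that fall in each zone.
    static Map<Integer, Integer> countByZone(double[][][] zones,
                                             double[][] points) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (double[] p : points) {
            int zone = -1;
            for (int z = 0; z < zones.length; z++) {
                if (contains(zones[z], p[0], p[1])) { zone = z; break; }
            }
            counts.merge(zone, 1, Integer::sum);
        }
        return counts;
    }
}
```

In the actual job these two halves run as separate Hadoop phases, with the framework grouping the mapper's <zone, 1> pairs by zone before the reducer sums them.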
Figure 13: Map of Singapore divided into the 86 zones
This use case displays the potential for using Hadoop not simply as an end in itself but as a means to
an end, where that end could be the visualization of data. Colored maps showing historical
passenger demand could prove useful to taxi drivers, as they could influence driver behavior and
thus increase the overall efficiency of the taxi system.
4 Alternatives
The alternatives to Hadoop include parallel databases such as those by Oracle and IBM.
Increasingly, column-oriented databases are an excellent alternative for similar workload profiles.
Various research papers have been published comparing these alternatives. There are also other
MapReduce solutions under research, such as Dryad by Microsoft Research and Clustera by the
University of Wisconsin-Madison. These frameworks are still under heavy development.
5 Related Work
Prior work on MapReduce and Hadoop has mostly focused on their inner mechanisms, such as
schedulers, performance debugging and file systems. In Improving MapReduce Performance in
Heterogeneous Environments, Zaharia, Konwinski, Joseph, Katz, and Stoica (2008) look at improving
the scheduler for heterogeneous compute environments.
As mentioned in the previous section on comparisons between MapReduce and its alternatives,
Pavlo et al. (2009) compared two approaches to large-scale data analysis, the MapReduce model
and parallel databases. They tested Hadoop against two parallel DBMSs, Vertica and another
system from a major relational database vendor.
Other work studying the use of Hadoop on real datasets includes Loebman et al. (2009), who
compared Hadoop and a commercial relational database on astrophysical simulations, and Cary et
al. (2009), who explored the use of MapReduce for solving spatial problems.
6 Conclusion
With the results from the use cases above, the taxi project now has the capability of analyzing data
that spans multiple months or years and of receiving results much more quickly than before. This
saves faculty members time by getting answers to their questions faster, which could result in
better quality research and better funding. In addition, the findings and efficiencies could flow back
to the taxi company and its drivers, reaping large economic and environmental benefits such as
reduced fuel costs, increased driver revenue and lower carbon emissions.
The beginning of this paper referenced the business use of analytics and how it creates a new
dimension for competition. Hadoop has democratized data analytics; however, there is still some
way to go before it is easy enough for businesses to take full advantage of its capabilities. With
heavy development still continuing, there is certainly potential for Hadoop to be improved upon for
business analytics.
7 Future Work
We saw in the first use case that scaling with dataset size is largely linear past the input split size.
But where does it end? Storage space limitations on the Beowulf-5 cluster caused us to stop at the
30GB file size. With the dataset at a couple of hundred gigabytes, there is enough data to
experiment on if those limitations are eased. One ongoing effort is to get approval to use the Open
Cirrus cloud computing testbed for this project, to enable faster and larger analyses.
The second use case provides a glimpse of what could become an interactive visual representation
of the data on a web front end, driven by a Hadoop back-end. Even though the typical job is batch
in nature, higher-level projects such as HBase, Pig and Hive give developers the ability to drive
near-real-time analytics while making their lives easier.
Bibliography
1. Abadi, D. J., Madden, S. R., & Hachem, N. (2008). Column-stores vs. row-stores: how different
are they really? SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on
Management of data, (pp. 967-980).
2. Cary, A., Sun, Z., Hristidis, V., & Rishe, N. (2009). Experiences on Processing Spatial Data with
MapReduce. 21st International Conference on Scientific and Statistical Database Management
(pp. 302-319). Springer.
3. Davenport, T. H. (2006). Competing on Analytics. Harvard Business Review, 84(1), 98-107.
4. IBM. (2009, July 28). IBM to Acquire SPSS Inc. to Provide Clients Predictive Analytics Capabilities.
Retrieved Nov 2009, from IBM: http://www-03.ibm.com/press/us/en/pressrelease/27936.wss
5. IBM. (2009, Sep 10). New IBM Study Highlights Analytics As Top Priority For Today's CIO.
Retrieved November 2009, from IBM: http://www-
03.ibm.com/press/us/en/pressrelease/28314.wss
6. Loebman, S., Nunley, D., Kwon, Y., Howe, B., Balazinska, M., & Gardner, J. P. (2009). Analyzing
Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help? Workshop on
Interfaces and Architectures for Scientific Data Storage.
7. Manyika, J. M., Roberts, R. P., & Sprague, K. L. (2007). Eight Business Technology Trends to
Watch. The McKinsey Quarterly , 1-11.
8. Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., et al. (2009). A
Comparison of Approaches to Large-Scale Data Analysis. SIGMOD '09: Proceedings of the 35th
SIGMOD international conference on Management of data , (pp. 165-178). New York.
9. Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., et al. (2005).
C-Store: A Column-oriented DBMS. VLDB '05: Proceedings of the 31st international conference on
Very Large Data Bases (pp. 553-564). VLDB Endowment.
10. Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., & Stoica, I. (2008). Improving MapReduce
Performance in Heterogeneous Environments. 8th Symposium on Operating Systems Design and
Implementation (pp. 29-42). USENIX Association.