

Proposal for a 
Thesis in the Field of 
Information Technology 
In Partial Fulfillment of the Requirements 
For a Master of Liberal Arts Degree 

Harvard University 
Extension School 

April 07, 2014 

Luis F. Montoya 
13926 NW 21st Lane 
Gainesville, FL 32606 
Proposed Start Date: April 2014 
Anticipated Date of Graduation: After June 2014 
Thesis Director: TBD 


1. Thesis Title

Implementation of a Hadoop Cluster with the Necessary Tools for a Small- to Mid-Size Enterprise Using Off-the-Shelf Computers and the Ubuntu Operating System, and Determination of the Data Size That Justifies Its Usage over Relational Databases.

2. Abstract

Since the announcement of Google's invention of the MapReduce programming model for the processing of large data, new models keep emerging and offer performance improvements for some specific tasks. One of these models is Hadoop, but sometimes it is more efficient to use Relational Databases instead, if the size of the data is not large enough. In other words, there should be a threshold of data size or data structure that merits the migration from Relational Databases to Hadoop/MapReduce. Several web publications and web blogs have discussed this topic (Stucchio; Cross; O'Grady), but none of these sources makes a conscientious analysis of where the mentioned threshold lies. This project intends to create a small Hadoop cluster and study a way to find that threshold for both structured and unstructured (doesn't fit into tables) data, both in terms of data size and data structures.

3. Description of the Project

3.1 Introduction to Hadoop

Apache Hadoop is an open source project used for distributed processing of large data sets across clusters of commodity servers (What is Hadoop?). The purpose of this approach was to use lower-cost hardware and rely on the software's ability to detect and handle failures. Hadoop is a collection of projects to solve large and complex data problems, and Apache Hadoop has two main subprojects: MapReduce and HDFS. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster; the work is broken down into mapper and reducer tasks that manipulate data stored across a cluster of servers. HDFS, or Hadoop Distributed File System, allows applications to be run across multiple servers: data in a Hadoop cluster is broken down into blocks and distributed throughout the cluster. In this way, the map and reduce functions can be executed on smaller subsets of the larger data sets, thus providing the scalability required for Big Data processing.

The intention of this project is to implement a Hadoop cluster using low-price personal computers with a Linux OS installed on them, as required by small businesses trying to have their own cloud infrastructure. At the time of writing, Amazon Web Services (AWS) has just slashed prices for their Cloud Computing services; if the local machines approach is utilized, however, these computers can be obtained locally and relatively inexpensively, which makes it probably more practical from the hardware maintenance point of view, especially for small cluster sizes. The cluster will have anywhere between 4 and 10 of these servers. Needless to say, the software to be installed has to be open source to minimize costs. The Implementation section will discuss the hardware and software requirements in more detail.

3.2 The Data Sources and Data Manipulation

Since this project will use data of various sizes, the idea is to obtain data from sources that generate information continuously, such as weather data, Wikipedia Data Dumps, Google Developers' Data Dumps, or the Public Data Sets on Amazon Web Services. After the data is obtained, it needs to be merged (if necessary), cleaned, analyzed, and finally presented in a meaningful way to the final user. To accomplish these tasks, the R programming language will be used. R is indicated when data from different sources is obtained, because programming with R will present the data in a more usable form. Since the MapReduce technique is challenging (O'Grady), one or both of the most widely used projects, Hive and Pig, will be examined and/or used; these projects provide SQL-like interfaces that complement the MapReduce functionality. Finally, if one wants to have almost real-time queries on large data, one of the tools to be used is Impala from Cloudera. This tool outperforms Hive in certain types of data queries, and the cluster implemented here will have Impala installed on it.
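The mapper/reducer decomposition described above can be sketched in a few lines. The sketch below is illustrative Python rather than Hadoop's native Java API (the names mapper, shuffle, and reducer are this sketch's own, not Hadoop calls), but it shows how a word count decomposes into map tasks over independent blocks of data, a shuffle that groups values by key, and reduce tasks that combine them:

```python
from collections import defaultdict
from itertools import chain

def mapper(block):
    # Emit one (key, value) pair per word found in this block of text.
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    # Group all emitted values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

# Each "block" would live on a different node in a real cluster.
blocks = ["big data is big", "data is data"]
mapped = chain.from_iterable(mapper(b) for b in blocks)
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result)  # {'big': 2, 'data': 3, 'is': 2}
```

In a real cluster the blocks live in HDFS on different servers and the shuffle happens over the network, but the logic is the same.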

3.3 Software Tools to be Used

The software distribution to be used for this project is the Cloudera Distribution System (CDS) (Cloudera Downloads), and the servers will have installed all the open source programs required to run this project successfully, such as MySQL, R, Hive, Impala, and MongoDB. CDS installs Hive and Impala as part of its distribution package. One of the most important aspects of the data to consider when switching to Hadoop is the ETL (extraction, transformation, and loading) step when handling the data. If the data is even of medium size but unstructured, use of Hadoop integrated with R might provide fast and interactive queries. Other tools necessary for the success of this project will be studied as well, and will be installed as the need arises.
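As a small illustration of the ETL step mentioned above, the sketch below walks through extract, transform, and load using Python's built-in csv and sqlite3 modules. These stand in for the proposal's actual tools (MySQL and R) purely to keep the example self-contained, and the weather readings are made-up sample data:

```python
import csv, io, sqlite3

# Extract: read raw records (an in-memory CSV standing in for a real source).
raw = io.StringIO("station,temp_f\nGNV, 86 \nOCA,\nJAX,77\n")
rows = list(csv.DictReader(raw))

# Transform: drop incomplete records and convert Fahrenheit to Celsius.
clean = [
    {"station": r["station"].strip(),
     "temp_c": round((float(r["temp_f"]) - 32) * 5 / 9, 1)}
    for r in rows if r["temp_f"].strip()
]

# Load: insert the cleaned records into a relational table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE weather (station TEXT, temp_c REAL)")
db.executemany("INSERT INTO weather VALUES (:station, :temp_c)", clean)
print(db.execute("SELECT station, temp_c FROM weather").fetchall())
```

The same three stages apply regardless of scale; what changes with Big Data is that the transform stage becomes a MapReduce (or Hive/Pig) job instead of an in-memory list comprehension.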

4. Work Plan

4.1 Create a Hadoop cluster of about 4 to 10 nodes in AWS and run some of the standard experiments to verify that the cluster is operating normally.

4.2 Obtain some of the data mentioned in section 3.2 and time/run several queries using the Hadoop cluster. Repeat the tests with different data sizes until the queries fail to produce results in a timely manner.

4.3 On the data obtained, investigate which is suitable for queries using the Relational Database Management System (RDBMS) MySQL, and run the same type of queries as in 4.2 using MySQL installed in an AWS instance. Also, investigate possible ways to manipulate the unstructured data to make it more friendly to an RDBMS.

4.4 In parallel to the implementation using AWS, a small Hadoop cluster will be implemented using off-the-shelf computers with at least 1 GB of memory and 500 GB of hard drive storage. The same experiments will be tried on this homebrew cluster, and a comparison of the results will be analyzed using R and related tools.

5. Implementation

5.1 Create an AWS EC2 virtual server running CentOS or Red Hat Linux, of a size (m1.medium or m1.large) capable of handling Hadoop. Install Hadoop on it and create an image of this instance. (More details of the installation process will be provided in the final document.)

5.2 Create several instances of the image created in 5.1.

5.3 Configure the Hadoop cluster (White, T.).

5.4 Run the experiments described in chapter 4.

5.5 Repeat the same implementation/experiments using the homebrew Hadoop cluster.

6. Results

TBD

7. Conclusion

TBD
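The repeat-until-too-slow procedure of steps 4.2 and 4.3 can be sketched as a small timing harness. Everything here is hypothetical scaffolding: find_threshold and toy_query are this sketch's own names, and in the real experiments toy_query would be replaced by an actual Hive/Impala or MySQL query issued at each data size:

```python
import time

def find_threshold(run_query, sizes, cutoff_seconds):
    """Return the first data size at which run_query exceeds the cutoff,
    or None if every size completes within it."""
    for size in sizes:
        start = time.perf_counter()
        run_query(size)
        elapsed = time.perf_counter() - start
        print(f"size={size:>10,} rows: {elapsed:.3f}s")
        if elapsed > cutoff_seconds:
            return size
    return None

# Stand-in workload: a simple scan, in place of a real database query.
def toy_query(size):
    sum(range(size))

threshold = find_threshold(toy_query, [10**4, 10**5, 10**6],
                           cutoff_seconds=30.0)
print("threshold:", threshold)
```

Running the same harness against both the Hadoop cluster and MySQL, at matching data sizes, yields the per-system timing curves whose crossover is the threshold this thesis seeks.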

8. References

8.1 Cross. "What Factors Justify the Use of Apache Hadoop".
8.2 "What is Hadoop?" IBM Software.
8.3 Amazon Web Services (AWS). Extracted from https://aws.amazon.com
8.4 Cloudera Downloads. Extracted from http://www.cloudera.com
8.5 O'Grady, S. "Big Data Is Less About Size And More About Freedom". Retrieved from http://redmonk.com
8.6 Big Data Now. O'Reilly Radar Team. O'Reilly Media, 2012.
8.7 Planning for Big Data. O'Reilly Radar Team. O'Reilly Media, 2012.
8.8 Stucchio, C. "Don't use Hadoop - your data isn't that big". Extracted from http://www.chrisstucchio.com
8.9 White, Tom. Hadoop: The Definitive Guide, 3rd Edition. O'Reilly Media, 2012.
8.10 Szegedi, I. "Integrating R with Cloudera Impala for Real-Time Queries on Hadoop", 2013. Extracted from https://bighadoop.wordpress.com