
HADOOP VS SQL

A COMPARATIVE INDEPENDENT STUDY

SUBMITTED BY
LAXMAN PANDRAMISH

INDEPENDENT STUDY HADOOP VS SQL



TABLE OF CONTENTS

What’s the Study About?

HADOOP – Open Source Project

SQL – Structured Query Language

HADOOP Processing

SQL Processing

Traditional Differences

Practical Differences

Overview

References


WHAT’S THE STUDY ABOUT?

Hadoop is replacing relational database management systems (RDBMS) in many cases, especially in data warehousing, business intelligence reporting, and other analytical processing. Complex reporting in these applications becomes a real challenge as the size of the data grows exponentially, while customers demand increasingly complex analysis and reporting on that data. Hadoop versus a SQL database is therefore a pertinent question when you select the data storage and processing framework for your next project.

Many people are concerned about this question: IS SQL BETTER? Or IS HADOOP BETTER?

This study briefly explains SQL and Hadoop and compares them based on execution and outputs.

It covers both the traditional differences and the practical differences, the latter based on a real-world project example executed in both Hadoop and SQL.


HADOOP – OPEN SOURCE PROJECT

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

The base Apache Hadoop framework is composed of the following modules:

• Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
• Hadoop YARN – introduced in 2012, a platform responsible for managing computing resources in clusters and using them for scheduling users' applications; and
• Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.


Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on MapReduce and the Google File System. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program.
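
To make the "map" and "reduce" parts concrete, here is a minimal single-machine sketch of the model using plain Scala collections. It does not use the Hadoop API, and all names (WordCountSketch, mapPhase, shuffle, reducePhase) are illustrative assumptions rather than anything from this study's project.

```scala
// A single-machine sketch of the MapReduce model; no Hadoop classes are used.
object WordCountSketch {
  // "map" phase: emit a (word, 1) pair for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // "shuffle" phase: group the emitted pairs by key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // "reduce" phase: sum the counts collected for one key.
  def reducePhase(counts: Seq[Int]): Int = counts.sum

  // Full pipeline: map every line, shuffle by word, reduce each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapPhase)).map { case (word, counts) =>
      (word, reducePhase(counts))
    }
}
```

On a real cluster the map and reduce steps run on different nodes and the shuffle moves data across the network; the sketch only shows the shape of the computation.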

Benefits of Hadoop

• Scalability and Performance – distributed processing of data local to each node in a cluster enables Hadoop to store, manage, process and analyze data at petabyte scale.
• Reliability – large computing clusters are prone to failure of individual nodes in the cluster. Hadoop is fundamentally resilient – when a node fails, processing is redirected to the remaining nodes in the cluster and data is automatically re-replicated in preparation for future node failures.
• Flexibility – unlike traditional relational database management systems, you don’t have to create structured schemas before storing data. You can store data in any format, including semi-structured or unstructured formats, and then parse and apply a schema to the data when it is read.
• Low Cost – unlike proprietary software, Hadoop is open source and runs on low-cost commodity hardware.


SQL – STRUCTURED QUERY LANGUAGE

SQL (Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS).

Originally based upon relational algebra and tuple relational calculus, SQL consists of many types of statements, which may be informally classed as sublanguages, commonly: a data query language (DQL), a data definition language (DDL), a data control language (DCL), and a data manipulation language (DML). The scope of SQL includes data query, data manipulation (insert, update and delete), data definition (schema creation and modification), and data access control.

The SQL language is subdivided into several language elements, including:

• Clauses, which are constituent components of statements and queries. (In some cases, these are optional.)
• Expressions, which can produce either scalar values, or tables consisting of columns and rows of data.
• Predicates, which specify conditions that can be evaluated to SQL three-valued logic (true/false/unknown) or Boolean truth values and are used to limit the effects of statements and queries, or to change program flow.
• Queries, which retrieve the data based on specific criteria. This is an important element of SQL.
• Statements, which may have a persistent effect on schemata and data, or may control transactions, program flow, connections, sessions, or diagnostics.
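
The three-valued logic mentioned for predicates can be sketched in a few lines of Scala. This is an illustration only: it models UNKNOWN as None, and the names (ThreeValuedLogic, keepsRow, greaterThan) are assumptions made for the example, not part of any SQL engine.

```scala
object ThreeValuedLogic {
  // None stands for SQL's UNKNOWN (for example, a comparison involving NULL).
  type Truth = Option[Boolean]

  def and(a: Truth, b: Truth): Truth = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false) // FALSE dominates AND
    case (Some(true), Some(true))            => Some(true)
    case _                                   => None
  }

  def or(a: Truth, b: Truth): Truth = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)    // TRUE dominates OR
    case (Some(false), Some(false))        => Some(false)
    case _                                 => None
  }

  def not(a: Truth): Truth = a.map(!_)                      // NOT UNKNOWN stays UNKNOWN

  // A WHERE clause keeps a row only when the predicate evaluates to TRUE.
  def keepsRow(p: Truth): Boolean = p.contains(true)

  // A comparison yields UNKNOWN when either operand is NULL (None here).
  def greaterThan(x: Option[Int], y: Option[Int]): Truth =
    for (a <- x; b <- y) yield a > b
}
```

The practical consequence is that a row whose compared column is NULL is filtered out both by a condition and by its negation, because neither evaluates to TRUE.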

Advantages of SQL
• SQL queries can retrieve large numbers of records from a database quickly and efficiently.
• SQL views let users look at data without storing it in a separate object.
• SQL can join two or more tables and present the result to the user as a single object.
• SQL databases follow a long-established standard adopted by ANSI and ISO. Non-SQL databases do not adhere to any comparably clear standard.
• With standard SQL it is easier to manage database systems without having to write a substantial amount of code.
• SQL can restrict access to a table so that unauthorized users cannot insert rows into it.


HADOOP Processing

To differentiate Hadoop and SQL processing, a project named Banking Analysis was selected. It uses an Excel data set of approximately 45,000 rows, which was processed on both the Hadoop and SQL platforms.

Data set showing bank data details


Hadoop processing was carried out on an Oracle VM VirtualBox virtual machine, with Spark initialized.

Oracle VirtualBox Installation

Oracle VirtualBox is installed to run Hadoop on the system currently in use.

In Phase 1 of this project, the data set is loaded into the virtual machine and processed using the Hadoop installation that comes pre-installed on it.

Oracle VirtualBox

The virtual machine runs alongside the host PC with the same network privileges; it has Eclipse, Java, and Hadoop pre-installed.


Data Set Analysis

Start the virtual machine.

The data set selected for analysis is a Portuguese bank data set.

It contains approximately 45,000 rows and must be organized before any analysis can be performed on it.

Data Frame Creation

The data in the Excel sheet must be organized before it is analyzed.

The data is therefore converted to a data frame so that it can be analyzed and the necessary operations can be performed on it.


To create the data frame, first we must start Hadoop from the terminal:

$ hadoop

Then copy the file from the local file system to the Hadoop cluster:

$ hadoop fs -mkdir /project

$ hadoop fs -copyFromLocal final.csv /project

$ hadoop fs -ls /project

Then start the Spark shell with the Databricks spark-csv package:

$ spark-shell --packages com.databricks:spark-csv_2.10:1.4.0

Code for data frame creation:

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").option("delimiter","_").load("/project/final.csv")


After successful creation of the data frame, the data is organized and displayed below.

Data Frame creation

Now filter the data based on the required conditions (success and failure rates):

val success = df.filter($"poutcome" === "success")
val s = success.count()
val r = df.count() // total count

val successrate = s.toDouble / r // share of rows with a successful outcome

val failure = df.filter($"poutcome" === "failure")
val f = failure.count()

val failurerate = f.toDouble / r // r is the total count computed above

The success and failure rates are shown below.

Success and failure rates
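
The rate calculation can also be sketched without Spark. The example below assumes that a rate means the number of matching rows divided by the total number of rows, expressed as a fraction between 0 and 1; the object name and the sample values are hypothetical and not taken from the bank data set.

```scala
object OutcomeRates {
  // Share of records whose outcome equals `label`, as a fraction between 0 and 1.
  // Converting to Double before dividing avoids integer division, which would
  // truncate any rate below 1 to 0.
  def rate(outcomes: Seq[String], label: String): Double =
    if (outcomes.isEmpty) 0.0
    else outcomes.count(_ == label).toDouble / outcomes.size
}
```

For a sample of four outcomes with one "success" and two "failure" values, this gives a success rate of 0.25 and a failure rate of 0.5.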


Feature Engineering on the Data Set

scala> df.groupBy("age","y").count().sort($"count".desc).show

Here the data is grouped by age and by outcome (y), and the group counts are shown in descending order.

The average age of the people who say yes and of those who say no can be found by aggregation:

scala> df.groupBy("y").agg(avg($"age")).show
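
What the grouped aggregation computes can be shown on a handful of rows with plain Scala collections. This is a sketch under assumptions: the Customer case class and the sample rows are hypothetical, and only the age and y columns are modelled.

```scala
object GroupAverage {
  // Hypothetical row type holding only the two columns used here.
  final case class Customer(age: Int, y: String)

  // Average age per value of y, mirroring groupBy("y") followed by avg("age").
  def averageAgeByOutcome(rows: Seq[Customer]): Map[String, Double] =
    rows.groupBy(_.y).map { case (y, group) =>
      (y, group.map(_.age).sum.toDouble / group.size)
    }
}
```

Spark performs the same grouping, but partitions the rows across the cluster and combines partial sums and counts from each partition.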


Different Ways of Processing the Data

Impact of marital status and age on the outcome:

scala> df.groupBy("y","marital").agg(avg($"age")).show


Creating Temporary Tables and Calculating the Median

scala> df.registerTempTable("BankDetails")
scala> sqlContext.sql("select percentile(balance, 0.5) as median, avg(balance) as average from BankDetails").show
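
The difference between the two statistics requested in the query can be illustrated with a small stand-alone sketch. The function names and the sample balances are assumptions for illustration; the sketch does not reproduce how the percentile function is implemented internally.

```scala
object MedianSketch {
  // Arithmetic mean of a non-empty sequence.
  def mean(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "mean of an empty sequence is undefined")
    xs.sum / xs.size
  }

  // Median of a non-empty sequence; averages the two middle values when the
  // number of elements is even.
  def median(xs: Seq[Double]): Double = {
    require(xs.nonEmpty, "median of an empty sequence is undefined")
    val sorted = xs.sorted
    val mid = sorted.size / 2
    if (sorted.size % 2 == 1) sorted(mid) else (sorted(mid - 1) + sorted(mid)) / 2.0
  }
}
```

With balances of 100, 200, 300 and 10000, the median is 250 while the average is 2650, which is why the median is often the more representative figure for skewed values such as account balances.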


SQL Processing

The same data set was processed in SQL Server.

The first step was to transfer the data from the Excel sheet to SQL Server using the Microsoft SQL Server Import and Export Wizard. Before the transfer can start, the Microsoft Access Database Engine must be installed. The transferred data is loaded directly into a table and must then be organized using a few commands.


First the database is created, and then the table is created by the Import and Export Wizard.

Steps to transfer data from Excel to SQL


Processing the data by SQL queries


Traditional Differences

Practical Differences


Traditional Differences

Hadoop Vs SQL Comparison Table

Characteristics   Traditional SQL                  Hadoop
Data Size         Gigabytes                        Petabytes
Access            Interactive & Batch              Batch
Updates           Read and write multiple times    Write once, read multiple times
Structure         Static Schema                    Dynamic Schema
Integrity         High                             Low
Scaling           Non-Linear                       Linear

The differences listed above are the basic ones.

Elementary description

FUNCTIONAL PROGRAMMING
Hadoop supports functional-style programming in languages such as Java, Scala, and Python. In a traditional RDBMS, user-defined functions (UDFs) are typically confined to the database's own procedural dialect, which increases the complexity of writing SQL. Moreover, data stored in HDFS can be accessed by the whole Hadoop ecosystem, including Hive, Pig, Sqoop and HBase, so the same data and logic can be shared across these applications. This improves the performance and supportability of the system.


DATA STORAGE

A crucial principle of relational databases is that data is stored in tables with a relational structure characterized by defined rows and columns. Moreover, the data is stored in interrelated tables.

In Hadoop, data can begin in any shape; for processing, it is eventually transformed into key-value pairs. Once data enters Hadoop, it is replicated across multiple nodes in the Hadoop Distributed File System (HDFS). This may seem like a waste of storage space, but it is a primary reason behind Hadoop’s massive scalability.

ARCHITECTURE

Hadoop is meant for Big Data solutions, and a Hadoop architecture usually consists of a large number of servers. Suppose one of those servers goes down or runs into problems while processing data. In this case, data processing does not halt: because each data block is replicated on several nodes, processing continues without interruption and the data stays consistent. As a result, the Hadoop architecture is highly reliable for data.

On the other hand, a distributed SQL database needs complete consistency across all participating systems before it releases anything to the user. The protocol commonly used to achieve this is called a two-phase commit.
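
The idea of a two-phase commit can be sketched as follows. This is a simplified, in-memory illustration under assumptions: the Participant trait and InMemoryParticipant class are hypothetical, and the sketch ignores timeouts, logging and crash recovery, which real database systems must handle.

```scala
object TwoPhaseCommitSketch {
  // A participant votes in phase 1 and applies the coordinator's decision in phase 2.
  trait Participant {
    def prepare(): Boolean // phase 1: true means "ready to commit"
    def commit(): Unit     // phase 2, when every vote was yes
    def rollback(): Unit   // phase 2, when at least one vote was no
  }

  // Hypothetical in-memory participant used only for illustration.
  final class InMemoryParticipant(canCommit: Boolean) extends Participant {
    var state: String = "initial"
    def prepare(): Boolean = {
      state = if (canCommit) "prepared" else "refused"
      canCommit
    }
    def commit(): Unit = state = "committed"
    def rollback(): Unit = state = "rolled back"
  }

  // Coordinator: commit only if every participant votes yes, otherwise roll back all.
  def run(participants: Seq[Participant]): Boolean = {
    val votes = participants.map(_.prepare()) // phase 1: collect every vote
    val allYes = votes.forall(identity)
    // phase 2: apply the same decision everywhere
    if (allYes) participants.foreach(_.commit())
    else participants.foreach(_.rollback())
    allYes
  }
}
```

Nothing becomes visible as committed until every participant has agreed, which gives strong consistency at the cost of waiting for the slowest participant.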

COST FACTOR

Cost-effectiveness is always a concern for companies looking to adopt new technologies. When implementing Hadoop, companies need to do their homework to make sure that the realized benefits of a Hadoop deployment outweigh the costs. Otherwise it would be best to stick with a traditional database to meet data storage and analytics needs.

All things considered, big data using Hadoop has a number of things going for it that make implementation more cost-effective than companies may realize.


Practical Differences

The three main differences found are:

Usage of Delimiter
In Hadoop, while organizing the data set before execution, a delimiter was used to separate the columns of data, which made creating the data frame very easy. The delimiter lets the Spark cluster organize and process the data efficiently.
In SQL, by contrast, data must already be in tabular format before it can be processed. Delimiters play no role once the data is in SQL: tables, columns, and rows are what SQL queries operate on.
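
How a delimiter turns a raw text line into columns can be sketched in plain Scala. The object and function names are illustrative assumptions, the sample header and line are made up, and the sketch does not handle quoted fields the way a full CSV reader does.

```scala
import java.util.regex.Pattern

object DelimitedRows {
  // Split one text line into column values on a literal delimiter.
  // Pattern.quote keeps characters such as "|" or "." from being read as regex
  // syntax; the -1 limit preserves trailing empty columns.
  def splitLine(line: String, delimiter: String): Vector[String] =
    line.split(Pattern.quote(delimiter), -1).toVector

  // Pair the header names with the values of one data line.
  def toRecord(header: String, line: String, delimiter: String): Map[String, String] =
    splitLine(header, delimiter).zip(splitLine(line, delimiter)).toMap
}
```

For example, with the delimiter "_", the header "age_job_y" and the line "58_management_no" produce a record mapping age to 58, job to management and y to no.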


Offline and Online Processing

Hadoop is designed for offline processing and analysis of large-scale data. It doesn’t work well for random reading and writing of a few records, which is the type of load seen in online transaction processing. In fact, as of this writing (and in the foreseeable future), Hadoop is best used as a write-once, read-many-times type of data store. In this respect it is similar to data warehouses in the SQL world.
While processing the data sets, Spark did not function when the system was offline: the server was not started when the network was not connected. SQL worked even in offline mode. The reason Spark did not function might be that the virtual machine was not working due to the lack of a network connection.

Functional Programming vs Queries

SQL is fundamentally a high-level declarative language. You query data by stating the result you want and let the database engine figure out how to derive it. Under MapReduce you specify the actual steps in processing the data, which is more analogous to an execution plan for a SQL engine. Under SQL you have query statements; under MapReduce you have scripts and code. MapReduce allows you to process data in a more general fashion than SQL queries. For example, you can build complex statistical models from your data or reformat your image data. SQL is not well designed for such tasks.
In this project, SQL offered direct and simple queries to process, extract, and store data, while Hadoop required more complex programming statements. Of the two, SQL was the more user-friendly.


OVERVIEW

Overall, Hadoop steps ahead of traditional SQL in terms of cost, time, performance, reliability, supportability, and availability of data to a very large user group. To handle efficiently the tremendous amount of data generated every day, the Hadoop framework helps in capturing, storing, processing, and filtering data in a timely manner and finally consolidating it in a centralized place.


REFERENCES
