
Index

1. About the Workshop


2. The Objectives of Big Data
3. Introduction to Big Data
3.1 What is Big Data
3.2 Why Big Data
3.3 The 3 Vs of Big Data: Volume, Velocity and Variety
4. Big Data: Basics of Big Data Architecture
4.1 What is Hadoop?
4.2 What is MapReduce?
4.3 What is HDFS?
5. Why Learn Big Data?
6. Overview of Big Data Analytics
7. Relationship between Small Data and Big Data
8. Social Media analysis including sentiment analysis in Big Data
9. Applications of Big Data to Security, DHS Web, Social Networks, Smart Grid
10. Tools for Big Data
11. Projects on Big Data
12. Future of Big Data

About The Workshop


The Big Data workshop is designed to provide the knowledge and skills needed to become a successful Hadoop developer. Big data is no longer just an industry buzzword, and the analytics we glean from it is a necessity for success. Big data analytics is the process of examining big data to uncover hidden patterns, unknown correlations and other useful information that can be used to make better decisions. The five-day faculty development programme on Big Data aims:

(i) to understand the challenges in the architecture needed to store and access Big Data;

(ii) to perform analytics on Big Data for data-intensive applications;

(iii) to analyze Hadoop and other related tools that provide SQL-like access to unstructured data (NoSQL) in terms of their critical features, data consistency and ability to scale to extreme volumes;

(iv) to introduce applications of Big Data science in various challenging areas.

It paves the way for academicians and professionals to uncover research issues in the storage and analysis of huge volumes of data from various sources. Let us discover how big Big Data is!

Course Objectives
After the completion of the 'Big Data and Hadoop' course, you should be able to:

Master the concepts of the Hadoop Distributed File System and the MapReduce framework.
Understand Hadoop 2.x architecture: HDFS Federation and NameNode High Availability.
Set up a Hadoop cluster.
Understand data loading techniques using Sqoop and Flume.
Program in MapReduce.
Learn to write complex MapReduce programs.
Perform data analytics using Pig and Hive.
Implement HBase, MapReduce integration, advanced usage and advanced indexing.
Implement best practices for Hadoop development.
Implement a Hadoop project.
Work on a real-life project on Big Data analytics and gain hands-on project experience.

Who should go for this course?


Big Data is not only an industry buzzword but also a hot research topic, directed towards understanding the numerous techniques for deriving structured data from unstructured text and analyzing it. This course is designed for professionals aspiring to make a career in Big Data analytics using the Hadoop framework. Software professionals, analytics professionals, ETL developers, project managers and testing professionals are the key beneficiaries of this course. Other professionals who are looking to acquire a solid foundation in Hadoop architecture can also opt for it. Big Data has also been introduced into the curriculum, and we therefore feel that this workshop, delivered by eminent faculty from IITs and NITs and by industry experts, will be an eye-opener for all faculty. It will provide scope for faculty members who are doing, or are willing to do, research in the area, as well as for those who will teach the subject and guide major projects in it.

Introduction to Big Data

What is Big Data?


We want to learn Big Data, but we have no clue where and how to start learning about it. Does Big Data really mean that the data is big? What tools and software do we need to know to learn Big Data? We often have these questions in our minds. They are good questions and, honestly, when we search online it is hard to find authoritative and authentic answers.
In the next five days we will understand what is so big about Big Data.
Every day we create 2.5 quintillion bytes of data; so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is big data.

Big Data: Big Thing!


Big Data is becoming one of the most talked about technology trends nowadays. The real challenge for big organizations is to get the maximum out of the data already available and to predict what kind of data to collect in the future. How to take existing data and make it meaningful, so that it provides accurate insight into the past, is one of the key discussion points in many executive meetings. With the explosion of data, the challenge has moved to the next level, and Big Data is now becoming a reality in many organizations.

What does Big Data trigger?

Big Data: A Rubik's Cube


Compare big data with a Rubik's cube and you will see that they have many similarities. Just like a Rubik's cube, it has many different solutions. Imagine a Rubik's cube solving challenge in which many experts participate. If you take five Rubik's cubes, scramble them the same way, and give them to five different experts to solve, it is quite possible that all five will solve their cube in a matter of seconds; but if you watch closely, you will notice that even though the final outcome is the same, the route taken to solve the cube is not. Every expert will start at a different place and will try to solve it with different methods. Some will solve one color first and others another. Even though they follow the same kind of algorithm to solve the puzzle, they will start and end at different places, and their moves will differ on many occasions. It is nearly impossible for two experts to take exactly the same route.

The 3 Vs of Big Data: Volume, Velocity and Variety


Data is forever. Think about it: it is indeed true. Are you using any application, as it is, that was built 10 years ago? Are you using any piece of hardware that was built 10 years ago? The answer is most certainly no. However, if I ask whether you are using any data that was captured 50 years ago, the answer is most certainly yes. For example, look at the history of our nation. I am from India, and we have documented history that goes back over thousands of years. Or just look at our birthday data: we are still using it today. Data never gets old; it is going to stay forever. The applications which interpret and analyze the data have changed, but the data has remained in its purest form in most cases.
As organizations have grown, the data associated with them has also grown exponentially, and today there is a lot of complexity in that data. Most big organizations have data in multiple applications and in different formats. The data is also spread out so much that it is hard to categorize it with a single algorithm or logic. The mobile revolution which we are experiencing right now has completely changed how we capture data and build intelligent systems. Big organizations are indeed facing challenges in keeping all their data on a platform which gives them a single, consistent view. This unique challenge of making sense of all the data coming in from different sources, and deriving useful, actionable information out of it, is the revolution the Big Data world is facing.

Defining Big Data


The 3Vs that define Big Data are Variety, Velocity and Volume.

Volume
We currently see exponential growth in data storage because data is now much more than text. We find data in the form of videos, music and large images on our social media channels. It is very common for enterprises to have storage systems of terabytes and even petabytes. As the database grows, the applications and architecture built to support the data need to be re-evaluated quite often. Sometimes the same data is re-evaluated from multiple angles, and even though the original data is the same, the newly found intelligence creates an explosion of data. This big volume indeed represents Big Data.
Velocity
The data growth and social media explosion have changed how we look at data. There was a time when we believed that yesterday's data was recent; as a matter of fact, newspapers still follow that logic. However, news channels and radio have changed how fast we receive the news. Today, people rely on social media to keep them updated with the latest happenings. On social media, a message that is even a few seconds old (a tweet, a status update, etc.) is often no longer of interest to users; they discard old messages and pay attention to recent updates. Data movement is now almost real time, and the update window has been reduced to fractions of a second. This high-velocity data represents Big Data.
Variety
Data can be stored in multiple formats: for example in a database, in Excel, CSV or Access files, or, for that matter, in a simple text file. Sometimes the data is not even in a traditional format as we assume; it may be in the form of video, SMS, PDF or something we might not have thought about. It is the organization's job to arrange it and make it meaningful. This would be easy if all the data were in the same format, but that is rarely the case. The real world has data in many different formats, and that is the challenge we need to overcome with Big Data. This variety of the data represents Big Data.
Big Data in Simple Words
Big Data is not just about lots of data; it is actually a concept providing an opportunity to find new insight into your existing data, as well as guidelines to capture and analyze your future data. It makes any business more agile and robust, so it can adapt and overcome business challenges.
Data in Flat File

In earlier days data was stored in flat files, and there was no structure in a flat file. If any data had to be retrieved from a flat file, it was a project in itself. There was no possibility of retrieving the data efficiently, and data integrity was just a term that was discussed, without any modeling or structure around it. A database residing in a flat file had more issues than we would like to discuss in today's world. It was more like a nightmare when there was any data processing involved in the application. Though applications developed at that time were also not that advanced, the need for data was always there, and so was the need for proper data management.

Edgar F. Codd and the 12 Rules


Edgar Frank Codd was a British computer scientist who, while working for IBM, invented the relational model for database management, the theoretical basis for relational databases. He presented 12 rules for the relational database, and suddenly the chaotic world of databases began to see discipline in those rules. The relational database was a promised land for all the users of unstructured databases. It brought relationships between data as well as improved performance of data retrieval. The database world immediately saw a major transformation, and vendors and database users alike started to adopt the relational database model.
Relational Database Management Systems
After Edgar F. Codd proposed his 12 rules for the RDBMS, many different vendors started to build applications and tools to support the relational model. This was indeed a learning curve for many developers who had never worked with database modeling before. However, as time passed, pretty much everybody accepted the relational view of data and started to evolve products that perform best within the boundaries of the RDBMS concepts. This was the best era for databases, and it gave the world extreme experts as well as some of the best products. The Entity-Relationship model also evolved at the same time. In software engineering, an Entity-Relationship model (ER model) is a data model for describing a database in an abstract way.
Enormous Data Growth
Well, everything was going fine with the RDBMS in the database world. As there were no major challenges, the adoption of RDBMS applications and tools was pretty much universal. At times there was a race to make the developer's life easier with RDBMS management tools.
Due to their extreme popularity and ease of use, pretty much all data was stored in RDBMS systems. New-age applications were built and social media took the world by storm. Every organization was feeling pressure to provide the best experience for its users based on the data it had. While all this was going on, the data itself was growing in pretty much every organization and application.

Data Warehousing
The enormous data growth now presented a big challenge for organizations who wanted to build intelligent systems based on the data and provide a near-real-time, superior user experience to their customers. Various organizations immediately started building data warehousing solutions where the data was stored and processed.
Business intelligence became an everyday need. Data was received from the transaction systems and processed overnight to build intelligent reports from it. Though this is a great solution, it has its own set of challenges. The relational database model and data warehousing concepts were all built with traditional relational database modeling in mind, and they still face many challenges when unstructured data is present.
Interesting Challenge
Every organization had expertise in managing structured data, but the world had already changed to unstructured data. There was intelligence in the videos, photos, SMS messages, text, social media messages and various other data sources. All of these now needed to be brought to a single platform to build a uniform system which does what businesses need.
The way we do business has also changed. There was a time when users only got the features that technology supported; now users ask for a feature and the technology is built to support it. The need for real-time intelligence from fast-paced data flows is now becoming a necessity.
A large amount (Volume) of diverse (Variety), high-speed (Velocity) data: these are the properties of this new data. The traditional database system has limits in resolving the challenges this new kind of data presents; hence the need for Big Data science. We need innovation in how we handle and manage data. We need creative ways to capture data and present it to users. Big Data is a reality!

Big Data: Basics of Big Data Architecture


We will now look at the basics of Big Data architecture.
Big Data Cycle
Just like every other database-related application, a Big Data project has its own development cycle. The three Vs certainly play an important role in deciding the architecture of Big Data projects. Just like every other project, a Big Data project also goes through similar phases of capturing, transforming, integrating and analyzing the data, and building actionable reporting on top of it.
While the process looks almost the same, due to the nature of the data the architecture is often totally different. Here are a few of the questions which everyone should ask before going ahead with a Big Data architecture.
Questions to Ask
How big is your total database?
What is your reporting requirement in terms of time: real time, semi-real time, or at frequent intervals?
How important is data availability, and what is the plan for disaster recovery?
What are the plans for network and physical security of the data?
What platform will be the driving force behind the data, and what are the different service level agreements for the infrastructure?
These are just basic questions, but based on your application and business need you should come up with a custom list of questions to ask. As mentioned earlier, these questions may look quite simple, but the answers will not be. When we are talking about a Big Data implementation, there are many other important aspects which we have to consider when we decide on the architecture.
Building Blocks of Big Data Architecture
It is absolutely impossible to discuss and nail down the most optimal architecture for any Big Data solution in a single page; however, we can discuss the basic building blocks of a big data architecture and how they work together. Big data can be stored, acquired, processed, and analyzed in many ways. Every big data source has different characteristics, including the frequency, volume, velocity, type, and veracity of the data. When big data is processed and stored, additional dimensions come into play, such as governance, security, and policies. Choosing an architecture and building an appropriate big data solution is challenging because so many factors have to be considered.

This "Big data architecture and patterns" series presents a structured and pattern-based approach
to simplify the task of defining an overall big data architecture. Because it is important to assess
whether a business scenario is a big data problem, we include pointers to help determine which
business problems are good candidates for big data solutions.

In a Big Data architecture, the various components are closely associated with each other. Many different data sources are part of the architecture, hence extraction, transformation and integration form one of the most essential layers. Most of the data is stored in relational as well as non-relational data marts and data warehousing solutions. As per the business need, the data is processed and converted into proper reports and visualizations for end users. Just like the software, the hardware is almost the most important part of the Big Data architecture: in a big data architecture the hardware infrastructure is extremely important, and failover instances as well as redundant physical infrastructure are usually implemented.
NoSQL in Data Management
NoSQL is a very famous buzzword and it really means 'not only SQL' (or 'not relational SQL'). This is because in a Big Data architecture the data can be in any format: it can be unstructured, relational, or in any other format, and come from any data source. To bring all the data together, relational technology alone is not enough; hence new tools, architectures and algorithms have been invented which take care of all kinds of data. These are collectively called NoSQL.

What is NoSQL?
NoSQL stands for 'not relational SQL' or 'not only SQL'. Lots of people think that NoSQL means there is no SQL, which is not true; the terms sound the same but the meaning is totally different. NoSQL does use SQL, but it uses more than SQL to achieve its goal. As per Wikipedia's definition, 'A NoSQL database provides a mechanism for storage and retrieval of data that uses looser consistency models than traditional relational databases.'

Why use NoSQL?


A traditional relational database usually deals with predictable, structured data. As the world has moved forward with unstructured data, we often see the limitations of the traditional relational database in dealing with it. For example, nowadays we have data in the form of SMS messages, wave files, photos and videos. It is a bit difficult to manage these with a traditional relational database. People often use a BLOB field to store such data; a BLOB can store the data, but when we have to retrieve or process it, it is extremely slow for unstructured data. A NoSQL database is a type of database that can handle the unstructured, unorganized and unpredictable data that our business needs.
Along with the support for unstructured data, the other advantages of a NoSQL database are high performance and high availability.
Eventual Consistency
Additionally, note that a NoSQL database may not provide 100% ACID (Atomicity, Consistency, Isolation, Durability) compliance. Though NoSQL databases may not support ACID, they provide eventual consistency: over a period of time all updates can be expected to propagate through the system, and the data will become consistent.
Taxonomy
Taxonomy is the practice of classifying things or concepts according to a set of principles. The NoSQL taxonomy covers column stores, document stores, key-value stores, and graph databases. We will discuss the taxonomy in detail in later sessions. Here are a few examples of each NoSQL category:

Column: HBase, Cassandra, Accumulo

Document: MongoDB, Couchbase, Raven

Key-value: Dynamo, Riak, Azure, Redis, Cache, GT.m

Graph: Neo4j, Allegro, Virtuoso, Bigdata
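To make the difference between these categories a little more concrete, here is a minimal, purely illustrative sketch in Python that uses plain dictionaries as stand-ins for real stores; the keys, field names and values are made up for illustration and do not correspond to any particular product.

    # Illustrative sketch only: plain Python dictionaries standing in for NoSQL stores.
    # The keys, field names and values below are hypothetical examples, not a real schema.

    # Key-value style: the store only understands opaque keys and values.
    key_value_store = {
        "user:1001": '{"name": "Asha", "city": "Chennai"}',   # value is just a blob/string
    }

    # Document style: each record is a self-describing document; documents in the
    # same collection may have different fields (flexible schema).
    document_store = [
        {"_id": 1001, "name": "Asha", "city": "Chennai", "interests": ["cricket", "music"]},
        {"_id": 1002, "name": "Ravi", "phone": "98400-00000"},   # different fields, still valid
    ]

    # Column-family and graph stores organise the same information around
    # column groups and around nodes/edges respectively.
    print(key_value_store["user:1001"])
    print([d["name"] for d in document_store])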

What is Hadoop?
Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful distributed platform to store and manage Big Data. It is licensed under the Apache v2 license. It runs applications on large clusters of commodity hardware and can process thousands of terabytes of data across thousands of nodes. Hadoop is inspired by Google's MapReduce and Google File System (GFS) papers. The major advantage of the Hadoop framework is that it provides reliability and high availability.

What are the core components of Hadoop?


There are two major components of the Hadoop framework, and each of them performs one of its two most important tasks.

Hadoop MapReduce is the method used to split a larger data problem into smaller chunks and distribute them to many different commodity servers. Each server has its own set of resources and processes its chunk locally. Once a commodity server has processed its data, it sends the result back to the main server. This is effectively a process by which we handle large data effectively and efficiently. (We will look at this in more detail later in the workshop.)

Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between HDFS and other file systems: when we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually three) for fault tolerance and high availability. (We will also look at this in more detail later in the workshop.)

Besides the above two core components, the Hadoop project also contains the following modules:

Hadoop Common: common utilities for the other Hadoop modules

Hadoop YARN: a framework for job scheduling and cluster resource management

There are a few other related projects (like Pig and Hive) as well, which we will gradually explore later in the workshop.
A Multi-node Hadoop Cluster Architecture
Now let us quickly look at the architecture of a multi-node Hadoop cluster.

A small Hadoop cluster includes a single master node and multiple worker (slave) nodes. As discussed earlier, the entire cluster contains two layers: one is the MapReduce layer and the other is the HDFS layer, and each of these layers has its own relevant components. The master node consists of a JobTracker, a TaskTracker, a NameNode and a DataNode. A slave or worker node consists of a DataNode and a TaskTracker. It is also possible for a slave or worker node to be a data-only or compute-only node; as a matter of fact, that is a key feature of Hadoop.
Why Use Hadoop?
There are many advantages of using Hadoop. Let me quickly list them here:

Robust and scalable: we can add new nodes as needed, as well as modify them.
Affordable and cost-effective: we do not need any special hardware to run Hadoop; we can just use commodity servers.
Adaptive and flexible: Hadoop is built keeping in mind that it will handle structured and unstructured data.
Highly available and fault tolerant: when a node fails, the Hadoop framework automatically fails over to another node.

Why Is Hadoop Named Hadoop?

In 2005 Hadoop was created by Doug Cutting and Mike Cafarella while working at Yahoo. Doug Cutting named Hadoop after his son's toy elephant.
What is MapReduce?
MapReduce was designed by Google as a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Though MapReduce was originally Google proprietary technology, it has become quite a generalized term in recent times.
MapReduce comprises a Map() and a Reduce() procedure. The Map() procedure performs filtering and sorting operations on the data, whereas the Reduce() procedure performs a summary operation on it. This model is based on modified concepts of the map and reduce functions commonly available in functional programming. Libraries providing the Map() and Reduce() machinery have been written in many different languages. The most popular free implementation of MapReduce is Apache Hadoop, which we explore in the following sections.
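Since the model borrows from the map and reduce functions of functional programming, here is a minimal, purely illustrative sketch in Python using the built-in map() and functools.reduce(); the word list is a made-up example and this is not Hadoop code.

    # A minimal functional-programming illustration of the idea behind MapReduce,
    # using Python's built-in map() and functools.reduce(). It only shows the
    # "map then summarise" pattern on a tiny in-memory example.
    from functools import reduce

    words = ["big", "data", "big", "hadoop", "data", "big"]   # hypothetical input

    # Map step: emit a (key, value) pair for every word.
    pairs = list(map(lambda w: (w, 1), words))

    # Reduce step: summarise the pairs into a count per key.
    def add_pair(counts, pair):
        key, value = pair
        counts[key] = counts.get(key, 0) + value
        return counts

    word_counts = reduce(add_pair, pairs, {})
    print(word_counts)   # {'big': 3, 'data': 2, 'hadoop': 1}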

Advantages of MapReduce Procedures


The MapReduce framework usually consists of distributed servers and runs various tasks in parallel. There are various components which manage the communication between the various nodes and provide high availability and fault tolerance. Programs written in the MapReduce functional style are automatically parallelized and executed on commodity machines. The MapReduce framework takes care of the details of partitioning the data and executing the processes on the distributed servers at run time. During this process, if there is any failure, the framework provides high availability, and other available nodes take over the responsibility of the failed node.
As you can clearly see, the MapReduce framework provides much more than just the Map() and Reduce() procedures; it provides scalability and fault tolerance as well. A typical implementation of the MapReduce framework processes many petabytes of data on thousands of processing machines.
How Does the MapReduce Framework Work?
A typical MapReduce deployment contains petabytes of data and thousands of nodes. Here is a basic explanation of the MapReduce procedures that use this massive pool of commodity servers.
Map() Procedure
There is always a master node in this infrastructure which takes an input. Right after taking the input, the master node divides it into smaller sub-inputs or sub-problems. These sub-problems are distributed to worker nodes. A worker node then processes them and performs the necessary analysis. Once the worker node completes the work on its sub-problem, it returns the result to the master node.
Reduce() Procedure
All the worker nodes return the answers to the sub-problems assigned to them to the master node. The master node collects these answers and aggregates them into the answer to the original big problem that was assigned to it.
The MapReduce framework performs the Map() and Reduce() procedures in parallel and independently of each other. All the Map() procedures can run in parallel, and once each worker node has completed its task it sends its result back to the master node, which compiles everything into a single answer. This particular procedure can be very effective when it is implemented on a very large amount of data (Big Data). (A small word-count sketch following the dataflow list below makes these steps concrete.)
The MapReduce framework has five different steps:

Preparing the Map() input
Executing the user-provided Map() code
Shuffling the Map output to the Reduce processors
Executing the user-provided Reduce() code
Producing the final output

Here is the Dataflow of MapReduce Framework:

Input Reader
Map Function
Partition Function
Compare Function
Reduce Function
Output Writer
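To make this dataflow concrete, here is a minimal word-count sketch in Python in the spirit of Hadoop Streaming, where the mapper and reducer would normally be small scripts reading standard input and writing standard output. Here they are plain functions plus a tiny local driver that simulates the shuffle step, so the example runs on its own; the sample input is made up.

    # Word count in the MapReduce style: map emits (word, 1) pairs, the shuffle
    # groups them by key, and reduce sums the counts for each word.
    import itertools

    def mapper(line):
        # Map: emit a (word, 1) pair for every word in the input line.
        for word in line.strip().lower().split():
            yield (word, 1)

    def reducer(word, counts):
        # Reduce: summarise all counts that arrived for one word.
        return (word, sum(counts))

    def run_local(lines):
        # Shuffle: group the mapper output by key before reducing,
        # which is what the framework does between the two phases.
        mapped = [pair for line in lines for pair in mapper(line)]
        mapped.sort(key=lambda kv: kv[0])
        for word, group in itertools.groupby(mapped, key=lambda kv: kv[0]):
            yield reducer(word, (count for _, count in group))

    sample = ["big data is big", "hadoop processes big data"]   # hypothetical input
    print(list(run_local(sample)))

On a real cluster the framework, not our driver, performs the partitioning, shuffling and output writing described in the steps above.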

MapReduce in a Single Statement


Loosely speaking, MapReduce is equivalent to the SELECT and GROUP BY operations of a relational database, applied to a very large database.
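As a hedged side-by-side illustration of that statement, the SQL a relational database would run for the word count above is shown as a comment, followed by the same grouping done in memory in Python; the table and column names are hypothetical.

    # The word count above is, in effect, the following SQL over a (very large)
    # table of words -- table and column names here are hypothetical:
    #
    #   SELECT word, COUNT(*) AS occurrences
    #   FROM words
    #   GROUP BY word;
    #
    # The same computation in memory, to underline the equivalence:
    from collections import Counter

    words_table = ["big", "data", "big", "hadoop"]   # stand-in for the 'words' table
    print(Counter(words_table))                      # grouping + counting, i.e. GROUP BY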
What is HDFS?
HDFS stands for Hadoop Distributed File System, and it is the primary storage system used by Hadoop. It provides high-performance access to data across Hadoop clusters. It is usually deployed on low-cost commodity hardware, and in commodity hardware deployments server failures are very common. For this reason HDFS is built to have high fault tolerance. The data transfer rate between compute nodes in HDFS is very high, which leads to a reduced risk of failure.
HDFS creates smaller pieces of the big data and distributes them across different nodes. It also copies each smaller piece multiple times onto different nodes. Hence, when any node holding the data crashes, the system is automatically able to use the data from a different node and continue the process. This is the key feature of the HDFS system.
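The splitting-and-replication idea can be illustrated with a toy simulation in Python. This is not the real HDFS API; the block size and node names are made-up values chosen only for illustration.

    # Toy simulation of the HDFS idea described above -- NOT the real HDFS API.
    # A file's bytes are cut into fixed-size blocks and each block is assigned to
    # three DataNodes.
    from itertools import cycle

    BLOCK_SIZE = 16            # real HDFS defaults to much larger blocks; tiny here for illustration
    REPLICATION = 3
    data_nodes = ["node1", "node2", "node3", "node4", "node5"]   # hypothetical names

    def split_into_blocks(data, block_size=BLOCK_SIZE):
        return [data[i:i + block_size] for i in range(0, len(data), block_size)]

    def place_replicas(blocks, nodes, replication=REPLICATION):
        # Round-robin placement; the real NameNode uses rack-aware placement rules.
        placement = {}
        node_cycle = cycle(nodes)
        for index, _ in enumerate(blocks):
            placement[index] = [next(node_cycle) for _ in range(replication)]
        return placement

    file_bytes = b"this is a small file pretending to be big data"
    blocks = split_into_blocks(file_bytes)
    print(place_replicas(blocks, data_nodes))

If any one node in a block's replica list fails, the same block can still be read from the other nodes in its list, which is exactly the fault-tolerance property described above.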

Architecture of HDFS
The architecture of HDFS is a master/slave architecture. An HDFS cluster always consists of a single NameNode. This single NameNode is the master server, and it manages the file system as well as regulating access to the various files. In addition to the NameNode there are multiple DataNodes; there is one DataNode for each data server. In HDFS, a big file is split into one or more blocks, and those blocks are stored in a set of DataNodes.
The primary task of the NameNode is to open, close or rename files and directories and to regulate access to the file system, whereas the primary task of the DataNode is to read from and write to the file system. The DataNode is also responsible for the creation, deletion or replication of data based on instructions from the NameNode.
In reality, the NameNode and DataNode are software, written in Java, designed to run on commodity machines.
Visual Representation of HDFS Architecture

Let us understand how HDFS works with the help of the diagram. The client application (HDFS client) connects to the NameNode as well as to the DataNodes. The client application's access to the DataNodes is regulated by the NameNode, which permits the client to connect to the appropriate DataNodes directly. A big data file is divided into multiple data blocks (let us assume those data blocks are A, B, C and D). The client application then writes the data blocks directly to the DataNodes. The client does not have to write to every node; it just has to write to any one of them, and the NameNode decides on which other DataNodes the data will have to be replicated. In our example the client application writes directly to DataNode 1 and DataNode 3; the data blocks are then automatically replicated to other nodes. All the information about which data block is placed on which DataNode is written back to the NameNode.

High Availability During Disaster


Now, as multiple DataNodes hold the same data blocks, if any DataNode faces a disaster the entire process will continue, because another DataNode will assume the role of serving the specific data blocks that were on the failed node. This system provides very high tolerance to disaster and provides high availability.
If you notice, there is only a single NameNode in our architecture. If that node fails, our entire Hadoop application will stop performing, as it is the single node where we store all the metadata. As this node is very critical, it is usually replicated on another cluster as well as on another data rack. Though that replicated node is not operational in the architecture, it has all the necessary data to take over the task of the NameNode in case the NameNode fails.
The entire Hadoop architecture is built to function smoothly even when there are node failures or hardware malfunctions. It is built on the simple premise that the data is so big that it is impossible to come up with a single piece of hardware which can manage it properly. We need lots of commodity (cheap) hardware to manage our big data, and hardware failure is part and parcel of commodity servers. To reduce the impact of hardware failure, the Hadoop architecture is built to overcome the limitation of non-functioning hardware.
Big Questions

Here are a few questions often received since the beginning of this Big Data series:

Does the relational database have no place in the story of Big Data?
Is the relational database no longer relevant as Big Data evolves?
Is the relational database not capable of handling Big Data?
Is it true that one no longer has to learn about relational data if Big Data is the final destination?

We hear, time and again, that a person who wants to learn about Big Data is no longer interested in learning about relational databases. To be very clear: anyone who aspires to become a Big Data scientist or Big Data expert should also learn about relational databases.
NoSQL Movement
The NoSQL movement of recent times has been driven by two important advantages of NoSQL databases:
1. Performance
2. Flexible schema
In my personal experience I have found both of the above advantages when using a NoSQL database. There are instances when I have found a relational database too restrictive, because my data was unstructured or contained data types which my relational database does not support, and likewise instances where a NoSQL solution performed much better than a relational database. I must say that I am a big fan of NoSQL solutions in recent times, but I have also seen occasions and situations where a relational database is still a perfect fit, even though the database is growing rapidly and has all the symptoms of big data.
Situations in Which the Relational Database Outperforms
Ad-hoc reporting is one of the most common scenarios where NoSQL does not have an optimal solution. Reporting queries often need to aggregate on columns which are not indexed and which are chosen only while the report is running; in this kind of scenario NoSQL databases (document stores, distributed key-value stores) often do not perform well. In the case of ad-hoc reporting I have often found it much easier to work with relational databases.
SQL is one of the most popular computer languages of all time. We have been using it for well over 10 years, and we feel we will be using it for a long time to come. There are plenty of tools, connectors and general awareness of the SQL language in the industry. Pretty much every programming language has drivers written for SQL, and most developers learned the language during their school or college days. In many cases, writing a query in SQL is much easier than writing a query in a NoSQL-supported language. I believe this is the current situation, but in the future this could reverse if NoSQL query languages become equally popular.

ACID (Atomicity, Consistency, Isolation, Durability): not all NoSQL solutions offer ACID compliance. There are always situations (for example banking transactions, e-commerce shopping carts, etc.) where, without ACID, operations can be invalid and database integrity can be at risk. Even though the data volume may indeed qualify as Big Data, there are always operations in the application which absolutely need a mature, ACID-compliant language.
The Mixed Bag
We have often heard the argument that all the big social media sites have nowadays moved away from relational databases. Actually, this is not entirely true. While researching Big Data and relational databases, I have found that many of the popular social media sites use Big Data solutions along with relational databases. Many use relational databases to deliver results to the end user at run time, and many still use a relational database as their major backbone.
Here are a few examples:

Facebook uses MySQL to display the timeline.
Twitter uses MySQL.
Tumblr uses sharded MySQL.
Wikipedia uses MySQL for data storage.

There are many more prominent organizations running large-scale applications that use relational databases along with various Big Data frameworks to satisfy their various business needs.
I believe that the RDBMS is like vanilla ice cream: everybody loves it and everybody has it. NoSQL and other solutions are like chocolate or custom ice cream: there is a huge base which loves them and wants them, but not every ice cream maker can make them just right for everyone's taste. No matter how fancy an ice cream store is, plain vanilla ice cream is always available there. In just the same way, there are always cases and situations in the Big Data story where the traditional relational database is part of the whole picture. In real-world scenarios there will always be cases where relational database concepts and their ideology are needed. It is extremely important to accept the relational database as one of the key components of Big Data instead of treating it as a substandard technology.
A Ray of Hope: NewSQL
In this module we discussed that there are places where we need ACID compliance from our Big Data application, and NoSQL will not support that out of the box. A new term has been coined for the applications and tools which support most of the properties of the traditional RDBMS while also supporting Big Data infrastructure: NewSQL.

What is NewSQL?
NewSQL stands for the new class of scalable, high-performance SQL database vendors. The products sold by NewSQL vendors are horizontally scalable. NewSQL is not a kind of database; rather, it is about vendors who support emerging data products with relational database properties (such as ACID and transactions) along with high performance. Products from NewSQL vendors usually rely on in-memory data for speedy access and offer immediate scalability. The term NewSQL was coined by 451 Group analyst Matthew Aslett.
On the definition of NewSQL, Aslett writes: 'NewSQL is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as ScalableSQL to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term NewSQL in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.'
In other words, NewSQL incorporates the concepts and principles of Structured Query Language (SQL) and NoSQL. It combines the reliability of SQL with the speed and performance of NoSQL.
Categories of NewSQL
There are three major categories of NewSQL:
New architectures: in this framework each node owns a subset of the data, and queries are split into smaller queries that are sent to the nodes to process the data. E.g. NuoDB, Clustrix, VoltDB.
MySQL engines: highly optimized storage engines for SQL with the interface of MySQL. E.g. InnoDB, Akiban.
Transparent sharding: systems which automatically split a database across multiple nodes. E.g. ScaleArc.

Why Learn Big Data?


Big Data! A Worldwide Problem?
Billions of Internet users and machine-to-machine connections are causing a tsunami of data
growth. Utilizing big data requires transforming your information infrastructure into a more
flexible, distributed, and open environment. Intel, for example, offers a choice of big data solutions based on industry-standard chips, servers, and the Apache Hadoop framework.
According to Wikipedia, "Big data is a collection of large and complex data sets which becomes difficult to process using on-hand database management tools or traditional data processing applications." In simpler terms, Big Data is a term given to the large volumes of data that organizations store and process. However, it is becoming very difficult for companies to store, retrieve and process this ever-increasing data. If a company gets a hold on managing its data well, nothing can stop it from becoming the next BIG success!
Big data is going to change the way you do things in the future, how you gain insight and make decisions (the change isn't going to be a replacement, but rather a synergy and an extension).
The problem lies in the use of traditional systems to store enormous data. Though these systems were a success a few years ago, with the increasing amount and complexity of data they are fast becoming obsolete. The good news is that Hadoop, which is nothing less than a panacea for all those companies working with BIG DATA in a variety of applications, has become an integral part of storing, handling, evaluating and retrieving hundreds of terabytes or even petabytes of data.
Apache Hadoop! A Solution for Big Data!
Hadoop is an open source software framework that supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license, and is therefore generally known as Apache Hadoop. Hadoop was developed based on a paper originally written by Google on the MapReduce system, and it applies concepts of functional programming. Hadoop is written in the Java programming language and is a top-level Apache project being built and used by a global community of contributors. Hadoop was developed by Doug Cutting and Michael J. Cafarella. And don't overlook the charming yellow elephant in the logo, which is named after Doug's son's toy elephant!
Some of the top companies using Hadoop:
The importance of Hadoop is evident from the fact that many global MNCs are using Hadoop and consider it an integral part of their functioning, companies such as Yahoo! and Facebook. On February 19, 2008, Yahoo! Inc. established the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and generates data that is now used in every Yahoo! Web search query.

Facebook, a $5.1 billion company, had over 1 billion active users in 2012, according to Wikipedia. Storing and managing data of such magnitude could have been a problem, even for a company like Facebook. But thanks to Apache Hadoop, Facebook is able to keep track of each and every profile it hosts, as well as all the data related to those profiles, such as images, posts, comments, videos, etc.
Opportunities for Hadoopers!
Opportunities for Hadoopers are infinite: from Hadoop developer, to Hadoop tester, to Hadoop architect, and so on. If cracking and managing Big Data is your passion, then think no more and carve a niche for yourself in this field! Happy Hadooping!

Overview of Big Data Analytics


Big data analytics is the process of examining large amounts of data of a variety of types to
uncover hidden patterns, unknown correlations, and other useful information. Such information
can provide competitive advantages over rival organizations and result in business benefits, such
as more effective marketing and increased revenue. New methods of working with big data, such
as Hadoop and MapReduce, offer alternatives to traditional data warehousing.
Big Data analytics with R and Hadoop focuses on techniques for integrating R and Hadoop through tools such as RHIPE and RHadoop. With these, a powerful data analytics engine can be built which can run analytics algorithms over a large-scale dataset in a scalable manner; this is implemented through the data analytics operations of R and the MapReduce and HDFS components of Hadoop.

The primary goal of big data analytics is to help companies make more informed business decisions by enabling data scientists, predictive modelers and other analytics professionals to analyze large volumes of transaction data, as well as other forms of data that may be untapped by conventional business intelligence (BI) programs.
That could include Web server logs and Internet clickstream data, social media content and social
network activity reports, text from customer emails and survey responses, mobile-phone call

detail records and machine data captured by sensors connected to the Internet of Things. Some
people exclusively associate big data with semi-structured and unstructured data of that sort, but
consulting firms like Gartner Inc. and Forrester Research Inc. also consider transactions and
other structured data to be valid components of big data analytics applications.
Big data can be analyzed with the software tools commonly used as part of advanced
analytics disciplines such as predictive analytics, data mining, text analytics and statistical
analysis. Mainstream BI software and data visualization tools can also play a role in the analysis
process. But the semi-structured and unstructured data may not fit well in traditional data
warehouses based on relational databases. Furthermore, data warehouses may not be able to
handle the processing demands posed by sets of big data that need to be updated frequently or
even continually -- for example, real-time data on the performance of mobile applications or of
oil and gas pipelines. As a result, many organizations looking to collect, process and analyze big data have turned to a newer class of technologies that includes Hadoop and related tools such as YARN, MapReduce, Spark, Hive and Pig, as well as NoSQL databases. Those technologies form the core of an open source software framework that supports the processing of large and diverse data sets across clustered systems.
are being used as landing pads and staging areas for data before it gets loaded into a data
warehouse for analysis, often in a summarized form that is more conducive to relational
structures. Increasingly though, big data vendors are pushing the concept of a Hadoop data lake
that serves as the central repository for an organization's incoming streams of raw data. In such
architectures, subsets of the data can then be filtered for analysis in data warehouses and
analytical databases, or it can be analyzed directly in Hadoop using batch query tools, stream
processing software and SQL on Hadoop technologies that run interactive, ad hoc queries written
in SQL.
Potential pitfalls that can trip up organizations on big data analytics initiatives include a lack of
internal analytics skills and the high cost of hiring experienced analytics professionals. The
amount of information that's typically involved, and its variety, can also cause data management
headaches, including data quality and consistency issues. In addition, integrating Hadoop
systems and data warehouses can be a challenge, although various vendors now offer software
connectors between Hadoop and relational databases, as well as other data integration tools with
big data capabilities.

Applications
Twitter Data Analysis: Twitter data analysis is used to understand the hottest trends by delving into Twitter data. Using Flume, data is fetched from Twitter into Hadoop in JSON format. Using a JSON SerDe, the Twitter data is read and fed into Hive tables so that we can run different analyses with Hive queries, for example finding the top 10 most popular tweets or hashtags. (A small stand-alone sketch of this kind of hashtag count appears after this list of examples.)
Stack Exchange Ranking and Percentile Dataset: Stack Exchange is a place where you will find enormous amounts of data from the multiple websites of the Stack group (such as Stack Overflow), all of it open sourced. The site is a gold mine for people who want to build proofs of concept and are searching for suitable datasets; there you may query out the data you are interested in, which will contain more than 50,000-odd records. For example, you can download the Stack Overflow rank and percentile data and find out the top 10 rankers.
Loan Dataset: this project is designed to classify good and bad URL links based on the reviews given by users. The primary data is highly unstructured. Using MapReduce jobs the data is transformed into structured form and then pumped into Hive tables, after which we can query out the information very easily using Hive queries. In phase two we feed another dataset, containing the corresponding cached web pages of the URLs, into HBase. Finally, the entire project is showcased in a UI where you can check the ranking of a URL and view its cached page.
Datasets by Government: these datasets could be, for example, the Worker Population Ratio (per 1000) for persons aged 15-59 years according to the current weekly status approach for each state/UT.
Machine Learning Datasets, such as the Badges dataset: such a dataset is used for a system that learns to encode names, for example a +/- label followed by a person's name.

Weather Dataset: this has the details of weather over a period of time, from which you may find the highest, lowest or average temperature.
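As mentioned in the Twitter example above, here is a minimal local stand-in written in Python rather than Hive: it assumes tweets are stored one JSON object per line with a "text" field (the file name and field layout are assumptions about the feed, not a fixed format) and counts the ten most frequent hashtags.

    # Minimal local stand-in for the Hive-based hashtag analysis in the Twitter
    # example above. It assumes one JSON object per line with a "text" field;
    # the file name and field name are assumptions, not a fixed format.
    import json
    from collections import Counter

    def top_hashtags(path, n=10):
        counts = Counter()
        with open(path, encoding="utf-8") as tweets:
            for line in tweets:
                line = line.strip()
                if not line:
                    continue
                tweet = json.loads(line)
                for token in tweet.get("text", "").split():
                    if token.startswith("#"):
                        counts[token.lower().rstrip(".,!?")] += 1
        return counts.most_common(n)

    # Example usage (hypothetical file): print(top_hashtags("tweets.json"))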

Social Media analysis including sentiment analysis in Big Data


Today's consumers are heavily involved in social media, with users having accounts on multiple
social media services. Social media gives users a platform to communicate effectively with
friends, family, and colleagues, and also gives them a platform to talk about their favorite (and least favorite) brands. This unstructured conversation can give businesses valuable insight into
how consumers perceive their brand, and allow them to actively make business decisions to
maintain their image.
Social Media
Historically, unstructured data has been very difficult to analyze using traditional data warehousing technologies. New cost-effective solutions, such as Hadoop, are changing this and allowing data of high volume, velocity, and variety to be analyzed much more easily. Hadoop is a massively parallel technology designed to be cost-effective by running on commodity hardware. Today businesses can use Microsoft's Hadoop implementation, HDInsight, together with SQL Server 2012 to effectively understand and analyze unstructured data, such as social media feeds, alongside existing key performance indicators.
Where do I start?
Posts can be downloaded and loaded into Hadoop using familiar tools like SQL Server Integration Services, or purpose-built tools like Apache Flume. Data can often be gathered for free directly from a social media service's public application programming interfaces, though sometimes there are limitations, or from an aggregation service, such as DataSift, which pulls many sources together into a standard format.
Social networks like Twitter and Facebook manage hundreds of millions of interactions each day.
Because of this large volume of traffic, the first step in analyzing social media is to understand
the scope of data that needs to be collected for analysis. Quite often the data can be limited to certain hashtags, accounts, and keywords.
MapReduce
Once the data is loaded into Hadoop the next step is to transform it into a format that can be used
for analysis. Data transformation in Hadoop is completed using a process called MapReduce.
MapReduce jobs can be written in a number of programming languages, including .NET, Java, Python, and Ruby, or can be generated by tools such as Hive (a SQL-like language for Hadoop that many data analysts would be immediately comfortable with) or Pig (a procedural scripting language for Hadoop).
MapReduce allows us to take unstructured data and transform (map) it to something meaningful,
and then aggregate (reduce) for reporting. All of this happens in parallel across all nodes in the
Hadoop cluster.
A simple example of MapReduce could map social media posts to a list of words and a count of their occurrences, and then reduce that list to a count of occurrences of each word per day. In a more complex example, we could use a dictionary in the map process to cleanse the social media posts, and then use a statistical model to determine the tone of each individual post.
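Here is a sketch of the dictionary-based map step just described, assuming a tiny made-up sentiment dictionary; a production system would use a proper lexicon or a trained statistical model, as noted above.

    # Sketch of the dictionary-based approach described above. The sentiment
    # dictionary here is a tiny made-up sample, and the posts are hypothetical.
    SENTIMENT = {"love": 1, "great": 1, "happy": 1, "hate": -1, "awful": -1, "slow": -1}

    def map_post(post):
        # Map: clean the post and emit a (date, sentiment_score) pair.
        words = [w.strip(".,!?").lower() for w in post["text"].split()]
        score = sum(SENTIMENT.get(w, 0) for w in words)
        return (post["date"], score)

    def reduce_scores(pairs):
        # Reduce: aggregate scores per day for reporting.
        totals = {}
        for date, score in pairs:
            totals[date] = totals.get(date, 0) + score
        return totals

    posts = [   # hypothetical posts
        {"date": "2015-03-01", "text": "I love this brand, great service!"},
        {"date": "2015-03-01", "text": "Delivery was awful and slow."},
    ]
    print(reduce_scores(map_post(p) for p in posts))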
So Now What?

Once MapReduce has done its magic, the meaningful data now stored in Hadoop can be loaded into an existing enterprise business intelligence (BI) platform or analyzed directly using powerful self-service tools like PowerPivot and Power View. Customers utilizing SQL Server as their enterprise BI platform have a variety of options for accessing their Hadoop data, including Sqoop, SQL Server Integration Services, and PolyBase (SQL Server PDW 2012 only).
Hadoop and SQL Server
Having social media data loaded into an existing enterprise BI platform allows dashboards to be created that give at-a-glance information on how customers feel about a brand. Imagine how powerful it would be to have the ability to visualize how customer sentiment is affecting top-line sales over time! This type of powerful analysis gives businesses the insight needed to adapt quickly, and it is all made possible through Hadoop.

Relationship between Small Data and Big Data

Small Data = Big Opportunity


According to a recent KPMG survey of 144 CIOs, 69% stated that data and analytics were crucial or very important to their business. However, 85% also said that they don't know how to analyze the data they have already collected, and 54% said their greatest barrier to success was an inability to identify the data worth collecting.
In short, many businesses have simply bitten off more than they can chew when it comes to Big Data. And with all the hype surrounding Big Data analytics, small data, data that is small enough in size for each of us to understand, often gets overlooked. While there is no debate that big data analytics can provide a wealth of valuable information to a business or organization, it is the small data, the actionable data, that provides the real opportunity for businesses.
Think of it this way: Big Data analytics can provide overall trends, like how many people are purchasing x between 5 and 8 pm, while small data is more focused on what you are buying between those same hours.
According to IBM, Big Data spans four dimensions: Volume, Velocity, Variety, and Veracity.
Volume, of course, defines the amount of data.
Velocity defines real-time processing and recognition of the data.
Variety is the type of data, both structured and unstructured, and from multiple sources.
Veracity is the authenticity and/or accuracy of the data.
However, small data adds a fifth dimension: Value. Small data is more focused on the end user: what they do, what they need and what they can do with this information.

Small Data vs. Big Data


The key differentiator between big data and small data is the targeted nature of the information and the fact that it can be easily and quickly acted upon.
Individuals leave a significant amount of digital traces in a single day: check-ins, Facebook likes and comments, tweets, web searches, Instagram and Pinterest postings, reviews, email sign-ups, and so on. Meanwhile, many businesses already collect significant amounts of small data directly from the customer, such as sales receipts, surveys and customer loyalty (rewards) cards.
The key is to use this information in a way that is actionable and, more importantly, adds value to both the end user and the business.
Small Data is at the heart of CRM
All of this collected small data is at the heart of CRM (customer relationship management). By combining insights from all of this data, businesses can create a rich profile of each customer and better inform, motivate and connect with them.
According to Digital Clarity Group, there are four key principles for using small data:

Make it Simple: keep it as singular in focus as possible and use pictures, charts and infographics to convey the information.
Make it Smart: make sure results are repeatable and trusted.
Be Responsive: provide customers with the information they need, wherever they are.
Be Social: make sure information can be shared socially.

For instance, Road Runner Sports has a 'Shoe Dog' app on its website which asks a few questions and provides recommendations on which running shoes I should consider. Unfortunately, it provides a lot of options with no real way to distinguish which shoes are considered my best option.
In this example, small data can make a difference if the app takes into account my previous shoe purchases and my reviews of those shoes before providing a recommendation. Additionally, if I tend to buy the same brand and their customer service experts believe I would benefit from another brand, perhaps they could offer me a discount as an incentive. Finally, it could limit the number of options by only showing me the highest-rated shoes, or only those in a specific price range.
Wearables also provide some interesting opportunities, particularly those related to fitness. Combining exercise data with nutrition information could yield better results for users, which in turn could keep a user better engaged with both the device and the app. For instance, I am currently training for a marathon; it would be great if, after a run, my device (based on my workout metrics: distance, speed, heart rate, calories, etc.) offered me some suggestions related to nutrition, such as 'be sure to consume x amount of water, protein, carbs, etc. within 30 minutes'. Or perhaps, based on my food log, it could make sure I am consuming the right amount of food for my level of exercise.
Finally, how great would it be if, every time you called into a customer help desk, they had a record of your past calls, tweets, or Facebook messages, so the CSR could quickly say, "I see that you have called about x issue multiple times; are you having the same issue, or is this a new one?" This simple step could easily defuse an angry customer and go a long way toward building trust.
Small data is personal. Small data is local. The goal is to turn all of this readily available information into action and improve the customer experience. The opportunities are endless and apply across all industry segments, with no business being too small to use data analytics. And remember that bigger is not always better.
Big Data Analysis Tools
1. Hadoop
You simply can't talk about big data without mentioning Hadoop. The Apache distributed data
processing software is so pervasive that often the terms "Hadoop" and "big data" are used
synonymously. The Apache Foundation also sponsors a number of related projects that extend
the capabilities of Hadoop, and many of them are mentioned below. In addition, numerous
vendors offer supported versions of Hadoop and related technologies. Operating System:
Windows, Linux, OS X.

2. MapReduce
MapReduce was originally developed by Google. The project website describes it as "a programming model
and software framework for writing applications that rapidly process vast amounts of data in
parallel on large clusters of compute nodes." It's used by Hadoop, as well as many other data
processing applications. Operating System: OS Independent.
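To make the programming model concrete, here is a minimal word-count sketch written in the spirit of Hadoop Streaming (the mapper emits key-value pairs, the framework sorts them by key, and the reducer aggregates). It is a generic Python illustration, not code from any particular distribution; the small in-process driver at the bottom merely stands in for Hadoop's shuffle so the example runs on its own.

    # Minimal word-count sketch in the MapReduce style.
    from itertools import groupby

    def mapper(lines):
        # Emit (word, 1) for every word seen in the input.
        for line in lines:
            for word in line.split():
                yield word.lower(), 1

    def reducer(pairs):
        # Pairs must arrive sorted by key, as the framework guarantees
        # after the shuffle; sum the counts for each word.
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            yield word, sum(count for _, count in group)

    if __name__ == "__main__":
        text = ["Big Data needs big tools", "big data big clusters"]
        shuffled = sorted(mapper(text))   # stands in for the shuffle/sort phase
        for word, total in reducer(shuffled):
            print(word, total)

In a real cluster the mapper and reducer would run on many nodes in parallel, with Hadoop handling the splitting, shuffling and fault tolerance.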
3. GridGain
GridGain offers an alternative to Hadoop's MapReduce that is compatible with the Hadoop Distributed File System. It offers in-memory processing for fast analysis of real-time data. You can download the open source version from GitHub or purchase a commercially supported version from GridGain. Operating System: Windows, Linux, OS X.
4. HPCC
Developed by LexisNexis Risk Solutions, HPCC is short for "high performance computing
cluster." It claims to offer superior performance to Hadoop. Both free community versions and
paid enterprise versions are available. Operating System: Linux.

5. Storm
Now owned by Twitter, Storm offers distributed real-time computation capabilities and is often
described as the "Hadoop of realtime." It's highly scalable, robust, fault-tolerant and works with
nearly all programming languages. Operating System: Linux.
Databases/Data Warehouses
6. Cassandra
Originally developed by Facebook, this NoSQL database is now managed by the Apache
Foundation. It's used by many organizations with large, active datasets, including Netflix,
Twitter, Urban Airship, Constant Contact, Reddit, Cisco and Digg. Commercial support and
services are available through third-party vendors. Operating System: OS Independent.
7. HBase
Another Apache project, HBase is the non-relational data store for Hadoop. Features include
linear and modular scalability, strictly consistent reads and writes, automatic failover support and
much more. Operating System: OS Independent.
8. MongoDB
MongoDB was designed to support humongous databases. It's a NoSQL database with
document-oriented storage, full index support, replication and high availability, and more.
Commercial support is available through 10gen. Operating system: Windows, Linux, OS X,
Solaris.
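As a brief illustration of the document-oriented model (assuming a local MongoDB instance and the pymongo driver; the database, collection and fields below are made up), storing and querying a JSON-like record looks like this:

    # Minimal sketch: document storage and a secondary index in MongoDB.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    collection = client["retail"]["purchases"]

    # Document-oriented storage: records are JSON-like documents, no fixed schema.
    collection.insert_one({
        "customer": "C1042",
        "items": ["running shoes", "socks"],
        "total": 142.50,
    })

    # Full index support: build an index on a field and query against it.
    collection.create_index("customer")
    for doc in collection.find({"customer": "C1042"}):
        print(doc["items"], doc["total"])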

9. Neo4j
The "worlds leading graph database," Neo4j boasts performance improvements up to 1000x or
more versus relational databases. Interested organizations can purchase advanced or enterprise
versions from Neo Technology. Operating System: Windows, Linux.
10. CouchDB
Designed for the Web, CouchDB stores data in JSON documents that you can access via the Web
or query using JavaScript. It offers distributed scaling with fault-tolerant storage. Operating
system: Windows, Linux, OS X, Android.
11. OrientDB
This NoSQL database can store up to 150,000 documents per second and can load graphs in just
milliseconds. It combines the flexibility of document databases with the power of graph
databases, while supporting features such as ACID transactions, fast indexes, native and SQL
queries, and JSON import and export. Operating system: OS Independent.
12. Terrastore
Based on Terracotta, Terrastore boasts "advanced scalability and elasticity features without
sacrificing consistency." It supports custom data partitioning, event processing, push-down
predicates, range queries, map/reduce querying and processing and server-side update functions.
Operating System: OS Independent.
13. FlockDB

Best known as Twitter's database, FlockDB was designed to store social graphs (i.e., who is
following whom and who is blocking whom). It offers horizontal scaling and very fast reads and
writes. Operating System: OS Independent.
14. Hibari
Used by many telecom companies, Hibari is a key-value, big data store with strong consistency,
high availability and fast performance. Support is available through Gemini Mobile. Operating
System: OS Independent.
15. Riak
Riak humbly claims to be "the most powerful open-source, distributed database you'll ever put
into production." Users include Comcast, Yammer, Voxer, Boeing, SEOMoz, Joyent, Kiip.me,
Formspring, the Danish Government and many others. Operating System: Linux, OS X.
16. Hypertable
This NoSQL database offers efficiency and fast performance that result in cost savings versus
similar databases. The code is 100 percent open source, but paid support is available. Operating
System: Linux, OS X.

Big Data Analytics for Security


This section explains how Big Data is changing the analytics landscape. In particular, Big Data
analytics can be leveraged to improve information security and situational awareness. For
example, Big Data analytics can be employed to analyze financial transactions, log files, and
network traffic to identify anomalies and suspicious activities, and to correlate multiple sources
of information into a coherent view.
Data-driven information security dates back to bank fraud detection and anomaly-based intrusion
detection systems. Fraud detection is one of the most visible uses for Big Data analytics. Credit
card companies have conducted fraud detection for decades. However, the custom-built
infrastructure to mine Big Data for fraud detection was not economical to adapt for other fraud
detection uses. Off-the-shelf Big Data tools and techniques are now bringing attention to
analytics for fraud detection in healthcare, insurance, and other fields.

In the context of data analytics for intrusion detection, the following evolution is anticipated:
1st generation: Intrusion detection systems. Security architects realized the need for layered security (e.g., reactive security and breach response) because a system with 100% protective security is impossible.
2nd generation: Security information and event management (SIEM). Managing alerts from different intrusion detection sensors and rules was a big challenge in enterprise settings. SIEM systems aggregate and filter alarms from many sources and present actionable information to security analysts.
3rd generation: Big Data analytics in security (2nd generation SIEM). Big Data tools have the potential to provide a significant advance in actionable security intelligence by reducing the time for correlating, consolidating, and contextualizing diverse security event information, and also for correlating long-term historical data for forensic purposes.
Analyzing logs, network packets, and system events for forensics and intrusion detection has
traditionally been a significant problem; however, traditional technologies fail to provide the
tools to support long-term, large-scale analytics for several reasons:
1. Storing and retaining a large quantity of data was not economically feasible. As a result, most
event logs and other recorded computer activity were deleted after a fixed retention period (e.g.,
60 days).
2. Performing analytics and complex queries on large, structured data sets was inefficient
because traditional tools did not leverage Big Data technologies.
3. Traditional tools were not designed to analyze and manage unstructured data. As a result,
traditional tools had rigid, defined schemas. Big Data tools (e.g., Pig Latin scripts and regular
expressions) can query data in flexible formats.
4. Big Data systems use cluster computing infrastructures. As a result, the systems are more
reliable and available, and provide guarantees that queries on the systems are processed to
completion.
New Big Data technologies, such as databases related to the Hadoop ecosystem and stream
processing, are enabling the storage and analysis of large heterogeneous data sets at an
unprecedented scale and speed. These technologies will transform security analytics by:
(a) Collecting data at a massive scale from many internal enterprise sources and external sources
such as vulnerability databases;
(b) Performing deeper analytics on the data;
(c) Providing a consolidated view of security-related information; and
(d) Achieving real-time analysis of streaming data.
It is important to note that Big Data tools still
require system architects and analysts to have a deep knowledge of their system in order to
properly configure the Big Data analysis tools.
Network Security
In a recently published case study, Zions Bancorporation announced that it is using Hadoop clusters and business intelligence tools to parse more data more quickly than with traditional SIEM tools. In their experience, the quantity of data and the frequency of analysis of events are too much for traditional SIEMs to handle alone. In their traditional systems, searching among a month's worth of data could take between 20 minutes and an hour. In their new Hadoop system running queries with Hive, they get the same results in about one minute.
The security data warehouse driving this implementation enables users to mine meaningful security information not only from sources such as firewalls and security devices, but also from website traffic, business processes and other day-to-day transactions. This incorporation
of unstructured data and multiple disparate data sets into a single analytical framework is one of
the main promises of Big Data.
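To give a flavour of the kind of query involved (the firewall_logs table, its columns and the date range below are hypothetical, not Zions' actual schema), a month-long scan could be submitted to Hive from a Python script as follows; the hive -e option simply runs a single HiveQL statement, and Hive turns the scan into MapReduce jobs on the cluster.

    # Minimal sketch, assuming a Hive table named firewall_logs exists and the
    # "hive" command-line client is on the PATH.
    import subprocess

    query = """
        SELECT src_ip, COUNT(*) AS hits
        FROM firewall_logs
        WHERE log_date BETWEEN '2013-01-01' AND '2013-01-31'
          AND action = 'DENY'
        GROUP BY src_ip
        ORDER BY hits DESC
        LIMIT 20;
    """

    # Run the query and let Hive print the top talkers being denied.
    subprocess.run(["hive", "-e", query], check=True)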

Enterprise Events Analytics


Enterprises routinely collect terabytes of security relevant data (e.g., network events, software
application events, and people action events) for several reasons, including the need for
regulatory compliance and post-hoc forensic analysis. Unfortunately, this volume of data
quickly becomes overwhelming. Enterprises can barely store the data, much less do anything
useful with it. For example, it is estimated that an enterprise as large as HP currently (in 2013)
generates 1 trillion events per day, or roughly 12 million events per second. These numbers will
grow as enterprises enable event logging in more sources, hire more employees, deploy more
devices, and run more software. Existing analytical techniques do not work well at this scale and
typically produce so many false positives that their efficacy is undermined. The problem
becomes worse as enterprises move to cloud architectures and collect much more data. As a
result, the more data that is collected, the less actionable information is derived from the data.
The goal of a recent research effort at HP Labs is to move toward a scenario where more data
leads to better analytics and more actionable information (Manadhata, Horne, & Rao,
forthcoming). To do so, algorithms and systems must be designed and implemented in order to
identify actionable security information from large enterprise data sets and drive false positive
rates down to manageable levels. In this scenario, the more data that is collected, the more value
can be derived from the data. However, many challenges must be overcome to realize the true
potential of Big Data analysis.
Among these challenges are the legal, privacy, and technical issues regarding scalable data
collection, transport, storage, analysis, and visualization.
Despite the challenges, the group at HP Labs has successfully addressed several Big Data
analytics for security challenges, some of which are highlighted in this section. First, a large-scale graph inference approach was introduced to identify malware-infected hosts in an
enterprise network and the malicious domains accessed by the enterprise's hosts. Specifically, a
host-domain access graph was constructed from large enterprise event data sets by adding edges
between every host in the enterprise and the domains visited by the host. The graph was then
seeded with minimal ground truth information from a black list and a white list, and belief
propagation was used to estimate the likelihood that a host or domain is malicious. Experiments
on a 2 billion HTTP request data set collected at a large enterprise, a 1 billion DNS request data
set collected at an ISP, and a 35 billion network intrusion detection system alert data set
collected from over 900 enterprises worldwide showed that high true positive rates and low false
positive rates can be achieved with minimal ground truth information (that is, having limited
data labeled as normal events or attack events used to train anomaly detectors).
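A rough sketch of the graph-construction and seeding steps (the log format, file name and seed lists below are assumptions, and the belief propagation step itself is only indicated in a comment) might look like this in Python with networkx:

    # Minimal sketch: build a bipartite host-domain access graph from proxy
    # logs, then seed it with minimal ground truth (a black list and a white
    # list of domains).
    import csv
    import networkx as nx

    graph = nx.Graph()
    with open("proxy_log.csv") as f:          # each row: host, domain (assumed)
        for host, domain in csv.reader(f):
            graph.add_edge(("host", host), ("domain", domain))

    blacklist = {"evil.example.com"}          # known-bad domains
    whitelist = {"hp.com"}                    # known-good domains

    for node in graph.nodes():
        kind, name = node
        if kind == "domain" and name in blacklist:
            graph.nodes[node]["prior"] = 0.99   # likely malicious
        elif kind == "domain" and name in whitelist:
            graph.nodes[node]["prior"] = 0.01   # likely benign
        else:
            graph.nodes[node]["prior"] = 0.5    # unknown
    # Belief propagation would then push these priors along the edges to
    # estimate how likely each unlabeled host or domain is to be malicious.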
Second, terabytes of DNS events consisting of billions of DNS requests and responses collected
at an ISP were analyzed. The goal was to use the rich source of DNS information to identify
botnets, malicious domains, and other malicious activities in a network. Specifically, features
that are indicative of maliciousness were identified. For example, malicious fast-flux domains
tend to last for a short time, whereas good domains such as hp.com last much longer and resolve
to many geographically-distributed IPs. A varied set of features were computed, including ones
derived from domain names, time stamps, and DNS response time-to-live values. Then,

classification techniques (e.g., decision trees and support vector machines) were used to identify
infected hosts and malicious domains. The analysis has already identified many malicious
activities from the ISP data set.
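To make the classification step concrete (the features, toy training values and labels below are purely illustrative, not those used in the study), a decision tree could be trained on per-domain features as follows:

    # Minimal sketch: train a decision tree on per-domain features such as
    # domain lifetime, number of resolved IPs, and mean DNS TTL.
    from sklearn.tree import DecisionTreeClassifier

    # Each row: [lifetime_days, distinct_ips, mean_ttl_seconds]
    X_train = [
        [2,    40,    60],    # short-lived, many IPs, low TTL: fast-flux-like
        [3,    55,    30],
        [900,   4, 86400],    # long-lived, few IPs, high TTL: benign-like
        [1200,  6, 43200],
    ]
    y_train = [1, 1, 0, 0]    # 1 = malicious, 0 = benign

    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(X_train, y_train)

    # Score a previously unseen domain.
    candidate = [[5, 35, 45]]
    print(clf.predict(candidate))   # e.g. [1] means flag for investigation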
Netflow Monitoring to Identify Botnets
This section summarizes the BotCloud research project, which leverages the MapReduce paradigm for analyzing enormous quantities of Netflow data to identify infected hosts participating in a botnet (François et al., 2011, November). The rationale for
using MapReduce for this project stemmed from the large amount of Netflow data collected for
data analysis. 720 million Netflow records (77GB) were collected in only 23 hours. Processing
this data with traditional tools is challenging. However, Big Data solutions like MapReduce
greatly enhance analytics by enabling an easy-to-deploy distributed computing paradigm.
BotCloud relies on BotTrack, which examines host relationships using a combination of
PageRank and clustering algorithms to track the command-and-control (C&C) channels in the
botnet (François et al., 2011, May). Botnet detection is divided into the following steps: dependency graph creation, the PageRank algorithm, and DBSCAN clustering. The dependency graph
was constructed from Netflow records by representing each host (IP address) as a node. There is
an edge from node A to B if, and only if, there is at least one Netflow record having A as the
source address and B as the destination address. PageRank will discover patterns in this graph
(assuming that P2P communications between bots have similar characteristics since they are
involved in same type of activities) and the clustering phase will then group together hosts
having the same pattern. Since PageRank is the most resource-consuming part, it is the only one
implemented in MapReduce.
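As a rough sketch of how a single PageRank iteration maps onto this paradigm (a generic Python formulation, not the actual BotCloud code), the map step spreads each node's score over its outgoing edges and the reduce step sums the incoming shares:

    # One PageRank iteration expressed as map and reduce steps. The tiny
    # in-process "shuffle" at the bottom is only there so the sketch runs.
    from collections import defaultdict

    DAMPING = 0.85

    def map_step(node, score, out_edges):
        # Re-emit the adjacency list so it survives the iteration, plus a
        # share of the current score for every destination the node talks to.
        yield node, ("edges", out_edges)
        if out_edges:
            share = score / len(out_edges)
            for dest in out_edges:
                yield dest, ("score", share)

    def reduce_step(node, values, num_nodes):
        # Sum incoming score shares and apply the damping factor.
        out_edges, total = [], 0.0
        for kind, payload in values:
            if kind == "edges":
                out_edges = payload
            else:
                total += payload
        new_score = (1 - DAMPING) / num_nodes + DAMPING * total
        return node, new_score, out_edges

    # Tiny example graph: host A talks to B and C, B talks to C, C talks to A.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    scores = {n: 1.0 / len(graph) for n in graph}

    intermediate = defaultdict(list)
    for node, out_edges in graph.items():
        for key, value in map_step(node, scores[node], out_edges):
            intermediate[key].append(value)        # stands in for the shuffle

    for node, values in intermediate.items():
        print(reduce_step(node, values, len(graph)))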
BotCloud used a small Hadoop cluster of 12 commodity nodes (11 slaves + 1 master): 6 Intel
Core 2 Duo 2.13GHz nodes with 4 GB of memory and 6 Intel Pentium 4 3GHz nodes with 2GB
of memory. The dataset contained about 16 million hosts and 720 million Netflow records. This
leads to a dependency graph of 57 million edges.
The number of edges in the graph is the main parameter affecting the computational complexity.
Since scores are propagated through the edges, the number of intermediate MapReduce key-value pairs depends on the number of links. Figure 5 shows the time to complete an iteration for different numbers of edges and cluster sizes.
Figure 5. Average execution time for a single PageRank iteration.
The results demonstrate that the time for analyzing the complete dataset (57 million edges) was
reduced by a factor of seven by this small Hadoop cluster. Full results (including the accuracy of
the algorithm for identifying botnets) are described in François et al. (2011, May).
Advanced Persistent Threats Detection
An Advanced Persistent Threat (APT) is a targeted attack against a high-value asset or a
physical system. In contrast to mass-spreading malware, such as worms, viruses, and Trojans,
APT attackers operate in "low-and-slow" mode: low mode maintains a low profile in the network, and slow mode allows for a long execution time. APT attackers often leverage stolen
user credentials or zero-day exploits to avoid triggering alerts. As such, this type of attack can
take place over an extended period of time while the victim organization remains oblivious to
the intrusion.
The 2010 Verizon data breach investigation report concludes that in 86% of the cases, evidence
about the data breach was recorded in the organization's logs, but the detection mechanisms failed
to raise security alarms (Verizon, 2010).
APTs are among the most serious information security threats that organizations face today. A
common goal of an APT is to steal intellectual property (IP) from the targeted organization, to
gain access to sensitive customer data, or to access strategic business information that could be
used for financial gain, blackmail, embarrassment, data poisoning, illegal insider trading, or disrupting the organization's business. APTs are operated by highly skilled, well-funded and
motivated attackers targeting sensitive information from specific organizations and operating
over periods of months or years. APTs have become very sophisticated and diverse in the
methods and technologies used, particularly in the ability to use an organization's own employees
to penetrate the IT systems by using social engineering methods. They often trick users into
opening spear-phishing messages that are customized for each victim (e.g., emails, SMS, and
PUSH messages) and then downloading and installing specially crafted malware that may
contain zero-day exploits (Verizon, 2010; Curry et al., 2011; and Alperovitch, 2011).
Today, detection relies heavily on the expertise of human analysts to create custom signatures
and perform manual investigation. This process is labor-intensive, difficult to generalize, and not
scalable. Existing anomaly detection proposals commonly focus on obvious outliers (e.g.,
volume-based), but are ill-suited for stealthy APT attacks and suffer from high false positive
rates.
Big Data analysis is a suitable approach for APT detection. A challenge in detecting APTs is the
massive amount of data to sift through in search of anomalies. The data comes from an ever-increasing number of diverse information sources that have to be audited. This massive volume
of data makes the detection task look like searching for a needle in a haystack (Giura & Wang,
2012). Due to the volume of data, traditional network perimeter defense systems can become
ineffective in detecting targeted attacks and they are not scalable to the increasing size of
organizational networks. As a result, a new approach is required. Many enterprises collect data
about users' and hosts' activities within the organization's network, as logged by firewalls, web proxies, domain controllers, intrusion detection systems, and VPN servers. While this data is
typically used for compliance and forensic investigation, it also contains a wealth of information
about user behavior that holds promise for detecting stealthy attacks.
Beehive: Behavior Profiling for APT Detection
At RSA Labs, the observation about APTs is that, however subtle the attack might be, the
attacker's behavior (in attempting to steal sensitive information or subvert system operations) should cause the compromised user's actions to deviate from their usual pattern. Moreover,
since APT attacks consist of multiple stages (e.g., exploitation, command-and-control, lateral
movement, and objectives), each action by the attacker provides an opportunity to detect
behavioral deviations from the norm.
Correlating these seemingly independent events can reveal evidence of the intrusion, exposing
stealthy attacks that could not be identified with previous methods.
These detectors of behavioral deviations are referred to as anomaly sensors, with each sensor
examining one aspect of the host's or user's activities within an enterprise's network. For
instance, a sensor may keep track of the external sites a host contacts in order to identify unusual
connections (potential command-and-control channels), profile the set of machines each user
logs into to find anomalous access patterns (potential pivoting behavior in the lateral
movement stage), study users' regular working hours to flag suspicious activities in the middle
of the night, or track the flow of data between internal hosts to find unusual sinks where large
amounts of data are gathered (potential staging servers before data exfiltration).
While the triggering of one sensor indicates the presence of a singular unusual activity, the
triggering of multiple sensors suggests more suspicious behavior. The human analyst is given
the flexibility of combining multiple sensors according to known attack patterns (e.g.,
command-and-control communications followed by lateral movement) to look for abnormal
events that may warrant investigation, or to generate behavioral reports of a given user's activities across time.
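As an illustration of the sensor-combination idea (the individual sensors and thresholds below are hypothetical, not Beehive's actual detectors), per-host sensor outputs can be combined into a simple suspicion score:

    # Minimal sketch: each "sensor" flags one kind of behavioral deviation for
    # a host; several sensors firing together raises the host's suspicion.
    def unusual_destinations(host):
        return len(host["new_external_domains"]) > 5

    def off_hours_activity(host):
        return any(h < 6 or h > 22 for h in host["login_hours"])

    def anomalous_logins(host):
        return len(host["machines_accessed"]) > 3 * host["baseline_machines"]

    SENSORS = [unusual_destinations, off_hours_activity, anomalous_logins]

    def suspicion_score(host):
        # One sensor alone is merely unusual; several together warrant review.
        return sum(1 for sensor in SENSORS if sensor(host))

    host = {
        "new_external_domains": ["a.example", "b.example", "c.example",
                                 "d.example", "e.example", "f.example"],
        "login_hours": [9, 14, 23],
        "machines_accessed": ["srv1", "srv2", "srv3", "srv4"],
        "baseline_machines": 1,
    }
    if suspicion_score(host) >= 2:
        print("flag host for analyst review")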
The prototype APT detection system at RSA Labs is named Beehive. The name refers to the
multiple weak components (the sensors) that work together to achieve a goal (APT detection),
just as bees with differentiated roles cooperate to maintain a hive. Preliminary results showed
that Beehive is able to process a day's worth of data (around a billion log messages) in an hour
and identified policy violations and malware infections that would otherwise have gone
unnoticed (Yen et al., 2013).
In addition to detecting APTs, behavior profiling also supports other applications, including IT
management (e.g., identifying critical services and unauthorized IT infrastructure within the
organization by examining usage patterns), and behavior-based authentication (e.g.,
authenticating users based on their interaction with other users and hosts, the applications they
typically access, or their regular working hours). Thus, Beehive provides insights into an organization's environment for security and beyond.

Using Large-Scale Distributed Computing to Unveil APTs


Although an APT itself is not a large-scale exploit, the detection method should use large-scale
methods and close-to-target monitoring algorithms in order to be effective and to cover all
possible attack paths. In this regard, a successful APT detection methodology should model the
APT as an attack pyramid, as introduced by Giura & Wang (2012). An attack pyramid should
have the possible attack goal (e.g., sensitive data, high-ranking employees, and data servers) at the
top and lateral planes representing the environments where the events associated with an attack
can be recorded (e.g., user plane, network plane, application plane, or physical plane). The
detection framework proposed by Giura & Wang groups all of the events recorded in an
organization that could potentially be relevant for security using flexible correlation rules that
can be redefined as the attack evolves. The framework implements the detection rules (e.g.,
signature based, anomaly based, or policy based) using various algorithms to detect possible
malicious activities within each context and across contexts using a MapReduce paradigm.
There is no doubt that the data used as evidence of attacks is growing in volume, velocity, and variety, making attacks increasingly difficult to detect. In the case of APTs, there is no known bad item
that IDS could pick up or that could be found in traditional information retrieval systems or
databases. By using a MapReduce implementation, an APT detection system has the possibility
to more efficiently handle highly unstructured data with arbitrary formats that are captured by
many types of sensors (e.g., Syslog, IDS, Firewall, NetFlow, and DNS) over long periods of
time. Moreover, the massive parallel processing mechanism of MapReduce could use much
more sophisticated detection algorithms than the traditional SQL-based data systems that are
designed for transactional workloads with highly structured data. Additionally, with
MapReduce, users have the power and flexibility to incorporate any detection algorithms into
the Map and Reduce functions. The functions can be tailored to work with specific data and
make the distributed computing details transparent to the users. Finally, exploring the use of
large-scale distributed systems has the potential to help to analyze more data at once, to cover
more attack paths and possible targets, and to reveal unknown threats in a context closer to the
target, as is the case in APTs.
The WINE Platform for Experimenting with Big Data Analytics in Security
The Worldwide Intelligence Network Environment (WINE) provides a platform for conducting
data analysis at scale, using field data collected at Symantec (e.g., anti-virus telemetry and file
downloads), and promotes rigorous experimental methods (Dumitras & Shoue, 2011). WINE
loads, samples, and aggregates data feeds originating from millions of hosts around the world
and keeps them up-to-date. This allows researchers to conduct open-ended, reproducible
experiments in order to, for example, validate new ideas on real-world data, conduct empirical
studies, or compare the performance of different algorithms against reference data sets archived
in WINE. WINE is currently used by Symantec's engineers and by academic researchers.

WINE Analysis Example: Determining the Duration of Zero-Day Attacks


A zero-day attack exploits one or more vulnerabilities that have not been disclosed publicly.
Knowledge of such vulnerabilities enables cyber criminals to attack any target undetected, from
Fortune 500 companies to millions of consumer PCs around the world. The WINE platform was
used to measure the duration of 18 zero-day attacks by combining the binary reputation and
anti-virus telemetry data sets and by analyzing field data collected on 11 million hosts
worldwide (Bilge & Dumitras, 2012). These attacks lasted between 19 days and 30 months, with
a median of 8 months and an average of approximately 10 months (Figure 6). Moreover, 60% of
the vulnerabilities identified in this study had not been previously identified as exploited in
zero-day attacks. This suggests that such attacks are more common than previously thought.
These insights have important implications for future security technologies because they focus
attention on the attacks and vulnerabilities that matter most in the real world.

Figure 6. Analysis of zero-day attacks that go undetected.


The outcome of this analysis highlights the importance of Big Data techniques for security
research. For more than a decade, the security community suspected that zero-day attacks are
undetected for long periods of time, but past studies were unable to provide statistically
significant evidence of this phenomenon. This is because zero-day attacks are rare events that
are unlikely to be observed in honeypots or in lab experiments. For example, most of the zero-day attacks in the study showed up on fewer than 150 hosts out of the 11 million analyzed. Big
Data platforms such as WINE provide unique insights about advanced cyber attacks and open
up new avenues of research on next-generation security technologies.

Projects on Big Data

Areas of Big Data Applications in Action:

Consumer product companies and retail organizations are monitoring social media
like Facebook and Twitter to get an unprecedented view into customer behavior,
preferences, and product perception.
Manufacturers are monitoring minute vibration data from their equipment, which
changes slightly as it wears down, to predict the optimal time to replace or maintain.
Replacing it too soon wastes money; replacing it too late triggers an expensive work
stoppage.
Manufacturers are also monitoring social networks, but with a different goal than
marketers: They are using it to detect aftermarket support issues before a warranty
failure becomes publicly detrimental.
Financial Services organizations are using data mined from customer interactions to
slice and dice their users into finely tuned segments. This enables these financial
institutions to create increasingly relevant and sophisticated offers.
Advertising and marketing agencies are tracking social media to understand
responsiveness to campaigns, promotions, and other advertising mediums.
Insurance companies are using Big Data analysis to see which home insurance
applications can be immediately processed, and which ones need a validating in-person visit from an agent.
By embracing social media, retail organizations are engaging brand advocates,
changing the perception of brand antagonists, and even enabling enthusiastic
customers to sell their products.
Hospitals are analyzing medical data and patient records to predict which patients are likely to be readmitted within a few months of discharge. The hospital can
then intervene in hopes of preventing another costly hospital stay.
Web-based businesses are developing information products that combine data
gathered from customers to offer more appealing recommendations and more
successful coupon programs.
The government is making data public at the national, state, and city levels for users to develop new applications that can generate public good.
Sports teams are using data for tracking ticket sales and even for tracking team
strategies.

Future of Big Data

According to global research firm Gartner, by 2015 nearly 4.4 million new jobs will be created globally by the demand for Big Data, and only one-third of them will be filled. India, along with China, could be one of the biggest suppliers of manpower for the Big Data industry. This is one of the Top Predictions for 2013: Balancing Economics, Risk, and Opportunity & Innovation that Gartner released in Chennai.
Big Data spending is expected to exceed $130 billion by 2015, generating jobs. Companies and
professionals should seek to acquire relevant skills to deal with high-volume, real-time data.
Every day, 2.5 quintillion bytes of data are created; so much that 90 per cent of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. Specialists are required to analyse this Big Data in a company, either to take corrective actions or to predict future trends.
Advanced information management/analytical skills and business expertise are growing in importance.

Thank you

You might also like