Big Data Book
(i) to understand the challenges in architecture to store and access Big Data
(ii) to analyze Hadoop and other related tools that provide SQL-like access to unstructured data
(iii) to evaluate NoSQL databases for their critical features: data consistency and the ability to scale to extreme volumes
(iv) to introduce applications of Big Data science in various challenging areas.
It paves the way for academicians and professionals to uncover research issues in the storage and analysis of
huge volumes of data from various sources. Let us discover how big Big Data is!
Course Objectives
After completing the 'Big Data and Hadoop' course, you should be able to:
Master the concepts of the Hadoop Distributed File System and the MapReduce framework.
Understand Hadoop 2.x Architecture -- HDFS Federation, NameNode High
Availability.
Setup a Hadoop Cluster.
Understand Data Loading Techniques using Sqoop and Flume.
Program in MapReduce.
Learn to write Complex MapReduce programs.
Perform Data Analytics using Pig and Hive.
Implement HBase, MapReduce Integration, Advanced Usage and Advanced Indexing.
Implement best Practices for Hadoop Development.
Implement a Hadoop Project.
Work on a Real Life Project on Big Data Analytics and gain Hands on Project
Experience.
Volume
We currently see exponential growth in data storage because data is now more than text.
We find data in the form of videos, music and large images on our social media
channels. It is very common for enterprises to have terabytes and even petabytes of storage.
As the database grows, the applications and architecture built to support the data
need to be re-evaluated quite often. Sometimes the same data is re-evaluated from multiple
angles, and even though the original data is the same, the newfound intelligence
creates an explosion of data. This big volume indeed represents Big Data.
Velocity
Data growth and the social media explosion have changed how we look at data. There was a
time when we believed that yesterday's data was recent. As a matter of fact, newspapers
still follow that logic. However, news channels and radio have changed how fast we
receive the news. Today, people rely on social media to keep them updated with the latest happenings.
On social media, a message that is only a few seconds old (a tweet, a status update, etc.) may
no longer interest users. They often discard old messages and pay attention to recent updates.
Data movement is now almost real time, and the update window has been reduced to fractions of
a second. This high-velocity data represents Big Data.
Variety
Data can be stored in multiple formats: a database, Excel, CSV, Access or, for that matter,
a simple text file. Sometimes the data is not even in a traditional
format; it may be a video, an SMS, a PDF or something we might not have
thought about. It is the organization's job to arrange it and make it meaningful. That would be
easy if all the data were in the same format, but this is rarely the case. The
real world has data in many different formats, and that is the challenge we need to overcome
with Big Data. This variety of data represents Big Data.
Big Data in Simple Words
Big Data is not just about lots of data; it is a concept that provides an opportunity to find
new insight into your existing data, as well as guidelines to capture and analyze your future data. It
makes any business more agile and robust, so it can adapt to and overcome business challenges.
Data in Flat File
In earlier days data was stored in flat files, and a flat file has no structure. Retrieving any
data from a flat file was a project by itself. There was no way to
retrieve data efficiently, and data integrity was just a term discussed without any
modeling or structure around it. A database residing in a flat file had more issues than we would
like to discuss in today's world. Any data processing in an application was more or less a
nightmare. Though applications developed at that time were not that advanced either, the need
for data was always there, and so was the need for proper data management.
Data Warehousing
Enormous data growth presented a big challenge for organizations that wanted to
build intelligent systems based on the data and provide a near-real-time, superior user experience to
their customers. Various organizations immediately started building data warehousing solutions
where the data was stored and processed.
Business intelligence became an everyday need. Data was received from the
transaction systems and processed overnight to build intelligent reports. Though this
was a great solution, it had its own set of challenges. The relational database model and data
warehousing concepts were built with traditional relational database modeling in
mind, and they still face many challenges when unstructured data is present.
Interesting Challenge
Every organization had expertise in managing structured
data, but the world had already shifted to unstructured
data. There was intelligence in videos, photos, SMS,
text, social media messages and various other data
sources. All of these now needed to be brought to a single
platform, to build a uniform system that does what
businesses need.
The way we do business has also changed. There
was a time when users only got the features that technology supported; now users ask
for a feature and the technology is built to support it. Real-time
intelligence from fast-paced data flow is becoming a necessity.
A large amount (Volume) of diverse (Variety), high-speed (Velocity) data: these are the properties of
the new data. The traditional database system has limits in resolving the challenges this new kind of
data presents. Hence the need for Big Data science. We need innovation in how we handle
and manage data. We need creative ways to capture data and present it to users. Big Data is
reality!
This "Big data architecture and patterns" series presents a structured and pattern-based approach
to simplify the task of defining an overall big data architecture. Because it is important to assess
whether a business scenario is a big data problem, we include pointers to help determine which
business problems are good candidates for big data solutions.
The image above gives a good overview of how the various components of a Big Data architecture are
associated with each other. Many different data sources are part of the architecture,
hence extract, transform and integration are among the most essential layers of the architecture.
Most of the data is stored in relational as well as non-relational data marts and data warehousing
solutions. As the business needs, data is processed and converted into proper reports
and visualizations for end users. Just like the software, the hardware is an essential part
of the Big Data architecture: the hardware infrastructure is extremely
important, and failover instances as well as redundant physical infrastructure are usually
implemented.
NoSQL in Data Management
NoSQL is a famous buzzword, and it really means Not Relational SQL or Not Only SQL.
This is because in a Big Data architecture the data can be in any format: unstructured,
relational, or from any other data source. Relational technology alone is not enough to bring
all this data together, so new tools, architectures and algorithms were
invented to take care of every kind of data. These are collectively called NoSQL.
What is NoSQL?
NoSQL stands for Not Relational SQL or Not Only SQL. Many people think that NoSQL
means there is no SQL, which is not true; the two sound the same but mean something totally
different. NoSQL does use SQL, but it uses more than SQL to achieve its goal.
As per Wikipedia's NoSQL database definition: "A NoSQL database provides a mechanism
for storage and retrieval of data that uses looser consistency models than traditional relational
databases."
What is Hadoop?
Apache Hadoop is an open-source, free, Java-based software framework that offers a powerful
distributed platform to store and manage Big Data. It is licensed under the Apache v2 license. It
runs applications on large clusters of commodity hardware and processes thousands of
terabytes of data on thousands of nodes. Hadoop was inspired by Google's MapReduce and
Google File System (GFS) papers. The major advantage of the Hadoop framework is that it provides
reliability and high availability.
Hadoop MapReduce is a method to split a large data problem into smaller chunks and
distribute them to many different commodity servers. Each server has its own set of resources and
processes its chunk locally. Once a commodity server has processed its data, it sends the result
back to the main server. This lets us process large data effectively and efficiently.
The Hadoop Distributed File System (HDFS) is a virtual file system, and there is a big difference
between it and any other file system. When we move a file onto HDFS, it is automatically
split into many small pieces. These small chunks of the file are replicated and stored on other
servers (usually three) for fault tolerance and high availability.
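As a minimal toy illustration of this idea (not the real HDFS implementation: the block size, server names and round-robin placement are invented for the sketch), a file can be split into fixed-size blocks and each block assigned to several servers:

```python
BLOCK_SIZE = 8          # bytes here; real HDFS uses 64 MB or 128 MB blocks
REPLICATION = 3
SERVERS = ["server1", "server2", "server3", "server4"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split raw file content into fixed-size chunks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, servers=SERVERS, replication=REPLICATION):
    """Assign each block to `replication` servers, round-robin style."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [servers[(b + r) % len(servers)] for r in range(replication)]
    return placement

data = b"a very small file pretending to be big data"
blocks = split_into_blocks(data)
placement = place_replicas(len(blocks))
for block_id, replica_servers in placement.items():
    print(block_id, replica_servers)
```

Losing any one server still leaves two copies of every block, which is the essence of the fault tolerance described above.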
Besides these two core components, the Hadoop project also contains several other modules,
and there are a few other related projects (like Pig and Hive) which we will
gradually explore later.
A Multi-node Hadoop Cluster Architecture
Now let us quickly look at the architecture of a multi-node Hadoop cluster.
A small Hadoop cluster includes a single master node and multiple worker (slave) nodes. As
discussed earlier, the cluster contains two layers: the MapReduce layer
and the HDFS layer. Each layer has its own relevant components. The master
node consists of a JobTracker, a TaskTracker, a NameNode and a DataNode. A slave or worker node
consists of a DataNode and a TaskTracker. It is also possible for a slave or worker node to be
a data-only or compute-only node; in fact, that is a key feature of Hadoop.
Why Use Hadoop?
There are many advantages to using Hadoop. Let me quickly list them here:
Robust and Scalable: we can add new nodes as needed, as well as modify them.
Affordable and Cost Effective: we do not need any special hardware to run
Hadoop; we can just use commodity servers.
Adaptive and Flexible: Hadoop is built keeping in mind that it will handle structured
and unstructured data.
Highly Available and Fault Tolerant: when a node fails, the Hadoop framework
automatically fails over to another node.
MapReduce splits a large problem into many small tasks, processes them in parallel, and then
combines the partial results into a single answer. This procedure can be very effective when it
is implemented on a very large amount of data (Big Data).
The MapReduce framework has six different steps:
Input Reader
Map Function
Partition Function
Compare Function
Reduce Function
Output Writer
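The steps above can be sketched as a tiny in-process pipeline. This is a minimal sketch, not the Hadoop API: the function names simply mirror the step names, and the job is a word count over a couple of invented lines.

```python
from itertools import groupby
from operator import itemgetter

def input_reader(lines):
    """Input Reader: yield one record (line) at a time."""
    for line in lines:
        yield line

def map_function(record):
    """Map Function: emit (key, value) pairs -- here (word, 1)."""
    for word in record.split():
        yield (word.lower(), 1)

def partition_function(key, num_partitions):
    """Partition Function: decide which reducer handles a key."""
    return hash(key) % num_partitions

def compare_function(pair):
    """Compare Function: sort key used to group pairs before reducing."""
    return pair[0]

def reduce_function(key, values):
    """Reduce Function: combine all values seen for one key."""
    return (key, sum(values))

def output_writer(results, sink):
    """Output Writer: persist the final (key, value) pairs."""
    for key, value in results:
        sink.append((key, value))

def run_job(lines, num_partitions=2):
    partitions = [[] for _ in range(num_partitions)]
    for record in input_reader(lines):
        for key, value in map_function(record):
            partitions[partition_function(key, num_partitions)].append((key, value))
    sink = []
    for part in partitions:
        part.sort(key=compare_function)
        for key, group in groupby(part, key=itemgetter(0)):
            output_writer([reduce_function(key, (v for _, v in group))], sink)
    return dict(sink)

print(run_job(["big data is big", "data is data"]))
# {'big': 2, 'data': 3, 'is': 2}
```

In real Hadoop each partition would live on a different node; here they are just lists, but the data flow through the six steps is the same.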
Architecture of HDFS
HDFS has a master/slave architecture. An HDFS cluster always consists of a
single NameNode. This NameNode is a master server that manages the file system and
regulates access to the files. In addition to the NameNode there are multiple DataNodes,
one for each data server. In HDFS a big file is split into one or more
blocks, and those blocks are stored in a set of DataNodes.
The primary tasks of the NameNode are to open, close or rename files and directories and to regulate
access to the file system, whereas the primary task of a DataNode is to read from and write to the file
system. A DataNode is also responsible for the creation, deletion or replication of data based
on instructions from the NameNode.
In reality, the NameNode and the DataNode are software designed to run on commodity machines,
written in Java.
Visual Representation of HDFS Architecture
Let us understand how HDFS works with the help of the diagram. A client application (HDFS Client)
connects to the NameNode as well as to DataNodes. The client application's access to a DataNode is
regulated by the NameNode, which allows the client to connect to the DataNode
directly. A big data file is divided into multiple data
blocks (let us assume those blocks are A, B, C and D). The client application later writes data
blocks directly to a DataNode; it does not have to write to all the nodes.
It just has to write to any one node, and the NameNode decides on which other DataNodes the
data will have to be replicated. In our example, the client writes directly to DataNode 1 and
DataNode 3, and the data chunks are then automatically replicated to other nodes. All the information
about which data block is placed on which DataNode is written back to the NameNode.
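This write path can be sketched as a toy metadata server. It is an illustrative simulation only (the class, node names and random replica choice are invented, not HDFS code): the client names the first DataNode it wrote to, and the "NameNode" picks the remaining replicas and records the block-to-node mapping.

```python
import random

class NameNode:
    """Toy metadata server: tracks which DataNodes hold which block."""
    def __init__(self, datanodes, replication=3):
        self.datanodes = list(datanodes)
        self.replication = replication
        self.block_map = {}          # block_id -> [datanode, ...]

    def choose_replicas(self, block_id, first_node):
        """The client wrote to `first_node`; pick the remaining replicas."""
        others = [n for n in self.datanodes if n != first_node]
        replicas = [first_node] + random.sample(others, self.replication - 1)
        self.block_map[block_id] = replicas   # metadata written back to NameNode
        return replicas

nn = NameNode(["dn1", "dn2", "dn3", "dn4"])
for block_id, first in [("A", "dn1"), ("B", "dn3")]:
    print(block_id, nn.choose_replicas(block_id, first))
```

Note that only metadata flows through the NameNode; in HDFS the block contents travel directly between the client and the DataNodes.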
Here are a few questions often received since the beginning of this Big Data series:
Does the relational database have no place in the Big Data story?
Is the relational database no longer relevant as Big Data evolves?
Is the relational database not capable of handling Big Data?
Is it true that one no longer has to learn about relational data if Big Data is the final
destination?
We hear these every time someone says they want to learn about Big Data and are no
longer interested in learning about relational databases.
It is very clear that anyone who aspires to become a Big Data scientist or Big Data expert
should still learn about relational databases.
NoSQL Movement
The reason for the recent NoSQL movement is two important
advantages of NoSQL databases:
1. Performance
2. Flexible Schema
In my personal experience, I have found both of the advantages listed above when using a
NoSQL database. There are instances when I found the relational database
too restrictive because my data was unstructured or used data types my
relational database did not support, and cases where a NoSQL solution
performed much better than a relational database. I must say that I am a big fan of NoSQL
solutions these days, but I have also seen occasions and situations where the relational
database is still a perfect fit, even though the database is growing rapidly and has all the
symptoms of big data.
Situations Where the Relational Database Outperforms
Ad-hoc reporting is one of the most common scenarios where NoSQL does not have an
optimal solution. Reporting queries often need to aggregate on columns
that are not indexed and are chosen only while the report is being built; in this kind of scenario
NoSQL databases (document stores, distributed key-value stores) often do not
perform well. For ad-hoc reporting I have often found it much easier to work
with relational databases.
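As a minimal illustration, using Python's built-in sqlite3 as the relational stand-in (the table and figures are invented), an ad-hoc aggregation over an unindexed column is a single GROUP BY, whereas a key-value store would typically require a custom scan or MapReduce job:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")  # no index on region
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# Ad-hoc report: aggregate on a column nobody planned to report on.
report = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(report)   # [('east', 150.0), ('west', 250.0)]
```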
SQL is the most popular computer language of all time. We have been using it for decades,
and we will probably keep using it for a long time. There are plenty of
tools, connectors and industry awareness of the SQL language. Pretty much every
programming language has drivers for SQL, and most developers
learned the language during their school or college time. In many cases, writing a query in
SQL is much easier than writing the equivalent in a NoSQL-supported language. I believe this is the
current situation, but in the future it could reverse if NoSQL query languages
become equally popular.
ACID (Atomicity, Consistency, Isolation, Durability): not all NoSQL solutions offer ACID
compliance. There are always situations (for example, banking transactions or
e-commerce shopping carts) where, without ACID, operations can become invalid and
database integrity can be at risk. Even when the data volume indeed qualifies as Big Data,
there are always operations in the application that absolutely need a mature, ACID-compliant
language.
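The banking example can be made concrete with a small sketch (sqlite3 here, with an invented accounts table): atomicity means a transfer either applies both updates or neither, so a failure mid-transfer cannot leave money missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 20)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: either both updates happen or neither does."""
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

print(transfer(conn, "alice", "bob", 30))   # succeeds: both rows updated
print(transfer(conn, "bob", "alice", 999))  # fails: debit is rolled back
```

Without the transaction, the failed second transfer would have debited bob's account and never credited alice's, which is exactly the integrity risk described above.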
The Mixed Bag
We often hear the argument that all the big social media sites have nowadays moved away
from relational databases. This is not entirely true. While researching Big Data
and relational databases, I have found that many of the popular social media sites use Big Data
solutions along with relational databases. Many use relational databases to deliver
results to the end user at run time, and many still use a relational database as their major
backbone.
Here are a few examples: there are many prominent organizations running large-scale
applications that use relational databases along with various Big Data frameworks to satisfy
their various business needs.
I believe that the RDBMS is like vanilla ice cream: everybody loves it and everybody has it.
NoSQL and other solutions are like chocolate or custom ice cream: there is a huge
base that loves and wants them, but not every ice cream maker can make them just right for
everyone's taste. No matter how fancy an ice cream store is, plain vanilla ice
cream is always available there. In the same way, there are always cases in the Big Data
story where the traditional relational database is part of the whole story. In real-world
scenarios there will always be a need for relational database
concepts and their ideology. It is extremely important to accept the relational database as one of the
key components of Big Data instead of treating it as a substandard technology.
Ray of Hope: NewSQL
In this module we discussed that there are places where we need ACID compliance from our Big
Data application, and NoSQL will not support that out of the box. A new term has been coined for
applications and tools that support most of the properties of the traditional RDBMS while also
supporting Big Data infrastructure: NewSQL.
What is NewSQL?
NewSQL stands for the new scalable, high-performance
SQL database vendors. The products sold by NewSQL
vendors are horizontally scalable. NewSQL is not a kind
of database; it is about vendors who support
emerging data products with relational database
properties (like ACID and transactions) along with high
performance. Products from NewSQL vendors usually
rely on in-memory data for speedy access and offer
immediate scalability. The term NewSQL was
coined by 451 Group analyst Matthew Aslett.
On the definition of NewSQL, Aslett writes: "NewSQL
is our shorthand for the various new scalable/high performance SQL database vendors. We have
previously referred to these products as ScalableSQL to differentiate them from the incumbent
relational database products. Since this implies horizontal scalability, which is not necessarily a
feature of all the products, we adopted the term NewSQL in the new report. And to clarify, like
NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is
the vendor, not the SQL."
In other words, NewSQL incorporates the concepts and principles of Structured Query
Language (SQL) and of NoSQL systems. It combines the reliability of SQL with the speed and
performance of NoSQL.
Categories of NewSQL
There are three major categories of NewSQL:
New Architectures: each node owns a subset of the data, and queries are split
into smaller queries sent to the nodes that own the data. E.g. NuoDB, Clustrix, VoltDB.
MySQL Engines: highly optimized storage engines for SQL with the MySQL interface.
E.g. InnoDB, Akiban.
Transparent Sharding: these systems automatically split the database across multiple nodes.
E.g. ScaleArc.
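The core idea behind transparent sharding can be sketched in a few lines (an illustrative hash-partitioning toy, not any vendor's actual algorithm): a deterministic hash maps each row key to a shard, so the routing layer can hide the split from the application.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Deterministically map a row key to one of `num_shards` nodes."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Route some (invented) user ids across three shards.
shards = {i: [] for i in range(3)}
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[shard_for(user_id, 3)].append(user_id)
print(shards)
```

Because the mapping is deterministic, any node in the routing tier computes the same shard for the same key, which is what makes the sharding "transparent" to the application.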
Facebook, a $5.1 billion company, had over 1 billion active users in 2012,
according to Wikipedia. Storing and managing data of such magnitude could have been a
problem, even for a company like Facebook, but thanks to Apache Hadoop it is not. Facebook uses
Hadoop to keep track of every profile it hosts, as well as all the related data
such as images, posts, comments and videos.
Opportunities for Hadoopers!
Opportunities for Hadoopers are infinite: from Hadoop Developer to Hadoop Tester or
Hadoop Architect, and so on. If cracking and managing Big Data is your passion in life, then
think no more, join Edureka's Hadoop online course, and carve a niche for yourself! Happy
Hadooping!
The primary goal of big data analytics is to help companies make more informed business
decisions by enabling data scientists, predictive
modelers and other analytics professionals to analyze large volumes of transaction data, as well
as other forms of data that may be untapped by conventional business intelligence (BI) programs.
That could include Web server logs and Internet clickstream data, social media content and social
network activity reports, text from customer emails and survey responses, mobile-phone call
detail records and machine data captured by sensors connected to the Internet of Things. Some
people exclusively associate big data with semi-structured and unstructured data of that sort, but
consulting firms like Gartner Inc. and Forrester Research Inc. also consider transactions and
other structured data to be valid components of big data analytics applications.
Big data can be analyzed with the software tools commonly used as part of advanced
analytics disciplines such as predictive analytics, data mining, text analytics and statistical
analysis. Mainstream BI software and data visualization tools can also play a role in the analysis
process. But the semi-structured and unstructured data may not fit well in traditional data
warehouses based on relational databases. Furthermore, data warehouses may not be able to
handle the processing demands posed by sets of big data that need to be updated frequently or
even continually -- for example, real-time data on the performance of mobile applications or of
oil and gas pipelines. As a result, many organizations looking to collect, process and analyze big
data have turned to a newer class of technologies that includes Hadoop and related tools such
as YARN, MapReduce, Spark, Hive and Pig, as well as NoSQL databases. Those technologies
form the core of an open source software framework that supports the processing of large and
diverse data sets across clustered systems. In some cases, Hadoop clusters and NoSQL systems
are being used as landing pads and staging areas for data before it gets loaded into a data
warehouse for analysis, often in a summarized form that is more conducive to relational
structures. Increasingly though, big data vendors are pushing the concept of a Hadoop data lake
that serves as the central repository for an organization's incoming streams of raw data. In such
architectures, subsets of the data can then be filtered for analysis in data warehouses and
analytical databases, or it can be analyzed directly in Hadoop using batch query tools, stream
processing software and SQL on Hadoop technologies that run interactive, ad hoc queries written
in SQL.
Potential pitfalls that can trip up organizations on big data analytics initiatives include a lack of
internal analytics skills and the high cost of hiring experienced analytics professionals. The
amount of information that's typically involved, and its variety, can also cause data management
headaches, including data quality and consistency issues. In addition, integrating Hadoop
systems and data warehouses can be a challenge, although various vendors now offer software
connectors between Hadoop and relational databases, as well as other data integration tools with
big data capabilities.
Applications
Twitter Data Analysis: Twitter data analysis is used to understand the hottest trends by delving
into Twitter data. Using Flume, data is fetched from Twitter into Hadoop in JSON format. Using
a JSON SerDe, the Twitter data is read and fed into HIVE tables so that we can run different analyses
with HIVE queries, e.g. the top 10 popular tweets.
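The shape of that analysis can be shown without the Flume/Hive pipeline. This is a pure-Python stand-in with invented tweets, where each line is one tweet in JSON, just as Flume would land it in HDFS, and we count the most popular hashtags:

```python
import json
from collections import Counter

# Each line is one tweet in JSON, as Flume would deliver it.
raw_lines = [
    '{"text": "learning #hadoop today", "user": "a"}',
    '{"text": "#hadoop and #hive rock", "user": "b"}',
    '{"text": "more #hive tips", "user": "c"}',
]

tags = Counter()
for line in raw_lines:
    tweet = json.loads(line)           # the SerDe step: JSON -> record
    for word in tweet["text"].split():
        if word.startswith("#"):
            tags[word] += 1

print(tags.most_common(2))
```

A Hive query over a tweets table (GROUP BY hashtag, ORDER BY count, LIMIT 10) expresses the same aggregation declaratively.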
Stack Exchange Ranking and Percentile dataset: Stack Exchange hosts enormous, open-sourced
data from multiple websites of the Stack group (like Stack Overflow). The place is a gold mine
for people who want to build proofs of concept and are searching for suitable datasets.
There you may query out the data you are interested in, which
can contain more than 50,000-odd records. For example, you can download the Stack Overflow rank and
percentile data and find the top 10 rankers.
Loan Dataset: this project is designed to find the good and bad URL links based on the reviews
given by users. The primary data is highly unstructured. Using MapReduce jobs the data is
transformed into structured form and then pumped into HIVE tables, where Hive queries let us
query out the information very easily. In phase two we feed another dataset, which
contains the corresponding cached web pages of the URLs, into HBASE. Finally, the entire
project is showcased in a UI where you can check the ranking of a URL and view the cached
page.
Datasets by Government: these datasets could be, for example, the worker population ratio (per 1,000)
for persons of age 15-59 years according to the current weekly status approach for each
state/UT.
Machine Learning Datasets, like the Badges dataset: such a dataset is for a system to encode names,
for example a +/- label followed by a person's name.
Big data analytics is the process of examining large amounts of data of a variety of types to
uncover hidden patterns, unknown correlations, and other useful information. Such information
can provide competitive advantages over rival organizations and result in business benefits, such
as more effective marketing and increased revenue. New methods of working with big data, such
as Hadoop and MapReduce, offer alternatives to traditional data warehousing.
Big Data Analytics with R and Hadoop focuses on techniques for integrating R and Hadoop
with tools such as RHIPE and RHadoop. A powerful data analytics engine can be built
that processes analytics algorithms over a large-scale dataset in a scalable manner. This can
be implemented through the data analytics operations of R, and the MapReduce and HDFS
components of Hadoop.
Weather Dataset: it has all the details of the weather over a period of time, from which you
may find the highest, lowest or average temperature.
The data stored in Hadoop can be queried using HIVE (a SQL-like language that many data analysts
would be immediately comfortable with) or PIG (a procedural scripting language for Hadoop).
MapReduce allows us to take unstructured data and transform (map) it into something meaningful,
and then aggregate (reduce) it for reporting. All of this happens in parallel across all nodes in the
Hadoop cluster.
A simple example of MapReduce could map social media posts to a list of words and a count of
their occurrences, and then reduce that list to a count of occurrences of each word per day. In a more
complex example, we could use a dictionary in the map process to cleanse the social media posts
and then use a statistical model to determine the tone of each individual post.
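The simple word-per-day example can be sketched in a few lines of plain Python (the posts and dates are invented, and real MapReduce would distribute the two phases across nodes rather than run them in one process):

```python
from collections import defaultdict

# Invented sample posts: (day, text)
posts = [
    ("2024-05-01", "great product great price"),
    ("2024-05-01", "terrible support"),
    ("2024-05-02", "great support today"),
]

# Map phase: each post becomes ((day, word), 1) pairs.
mapped = [((day, word), 1) for day, text in posts for word in text.split()]

# Reduce phase: sum the counts for each (day, word) key.
counts = defaultdict(int)
for key, one in mapped:
    counts[key] += one

print(counts[("2024-05-01", "great")])   # 2
```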
So Now What?
Once MapReduce has done its magic, the meaningful data now stored in Hadoop can be loaded
into an existing enterprise business intelligence (BI) platform or analyzed directly using powerful
self-service tools like PowerPivot and Power View. Customers using SQL Server as their
enterprise BI platform have a variety of options for accessing their Hadoop data, including
Sqoop, SQL Server Integration Services and PolyBase (SQL Server PDW 2012 only).
Hadoop and SQL Server
Having social media data loaded into an existing enterprise BI platform allows dashboards to be
created that give at-a-glance information on how customers feel about a brand. Imagine how
powerful it would be to visualize how customer sentiment is affecting top-line
sales over time! This type of powerful analysis gives businesses the insight needed to
quickly adapt, and it's all made possible through Hadoop.
For instance, Road Runner Sports has a "shoe dog" app on its website which asks a few
questions and provides recommendations on which running shoes I should consider.
Unfortunately, it provides a lot of options with no real way to distinguish which shoes are
considered my best option.
In this example, small data can make a difference if the app takes into account my previous
shoe purchases and my reviews of those shoes before providing a recommendation. Additionally,
if I tend to buy the same brand and their customer service experts believe I would benefit from
another brand, perhaps they could offer me a discount as an incentive.
Finally, it could limit the number of options by only showing me the highest-rated shoes, or only
those in a specific price range.
Wearables also provide some interesting opportunities, particularly those related to fitness.
Combining exercise with nutrition information could yield better results for users, which
in turn could keep a user better engaged with both the device and the app. For instance, I am
currently training for a marathon; it would be great if, after a run, my device (based on my
workout metrics: distance, speed, heart rate, calories, etc.) offered me some suggestions related
to nutrition, such as "be sure to consume x amount of water, protein and carbs within 30
minutes." Or perhaps, based on my food log, it could make sure I am consuming the right amount of
food for my level of exercise.
Finally, how great would it be if, every time you called into a customer help desk, they had a
record of your past calls, tweets and Facebook messages, so the CSR could quickly say, "I see
that you have called about x issue multiple times; are you having the same issue, or is this a
new issue?" This simple step could easily defuse an angry customer and go a long way toward
building trust.
Small data is personal. Small data is local. The goal is to turn all of this information that is
readily available into action and improve the customer experience. The opportunities are endless
and apply across all industry segments with no business being too small to use data analytics.
And remember that bigger is not always better.
2. MapReduce
Originally developed by Google, MapReduce is described on its website as "a programming model
and software framework for writing applications that rapidly process vast amounts of data in
parallel on large clusters of compute nodes." It is used by Hadoop, as well as many other data
processing applications. Operating System: OS Independent.
3. GridGain
GridGain offers an alternative to Hadoop's MapReduce that is compatible with the Hadoop
Distributed File System. It offers in-memory processing for fast analysis of real-time data. You
can download the open source version from GitHub or purchase a commercially supported
version from the link above. Operating System: Windows, Linux, OS X.
4. HPCC
Developed by LexisNexis Risk Solutions, HPCC is short for "high performance computing
cluster." It claims to offer superior performance to Hadoop. Both free community versions and
paid enterprise versions are available. Operating System: Linux.
5. Storm
Now owned by Twitter, Storm offers distributed real-time computation capabilities and is often
described as the "Hadoop of realtime." It's highly scalable, robust, fault-tolerant and works with
nearly all programming languages. Operating System: Linux.
Databases/Data Warehouses
6. Cassandra
Originally developed by Facebook, this NoSQL database is now managed by the Apache
Foundation. It's used by many organizations with large, active datasets, including Netflix,
Twitter, Urban Airship, Constant Contact, Reddit, Cisco and Digg. Commercial support and
services are available through third-party vendors. Operating System: OS Independent.
7. HBase
Another Apache project, HBase is the non-relational data store for Hadoop. Features include
linear and modular scalability, strictly consistent reads and writes, automatic failover support and
much more. Operating System: OS Independent.
8. MongoDB
MongoDB was designed to support humongous databases. It's a NoSQL database with
document-oriented storage, full index support, replication and high availability, and more.
Commercial support is available through 10gen. Operating system: Windows, Linux, OS X,
Solaris.
9. Neo4j
The "world's leading graph database," Neo4j boasts performance improvements of up to 1000x or
more versus relational databases. Interested organizations can purchase advanced or enterprise
versions from Neo Technology. Operating System: Windows, Linux.
10. CouchDB
Designed for the Web, CouchDB stores data in JSON documents that you can access via the Web
or query using JavaScript. It offers distributed scaling with fault-tolerant storage. Operating
system: Windows, Linux, OS X, Android.
11. OrientDB
This NoSQL database can store up to 150,000 documents per second and can load graphs in just
milliseconds. It combines the flexibility of document databases with the power of graph
databases, while supporting features such as ACID transactions, fast indexes, native and SQL
queries, and JSON import and export. Operating system: OS Independent.
12. Terrastore
Based on Terracotta, Terrastore boasts "advanced scalability and elasticity features without
sacrificing consistency." It supports custom data partitioning, event processing, push-down
predicates, range queries, map/reduce querying and processing and server-side update functions.
Operating System: OS Independent.
13. FlockDB
Best known as Twitter's database, FlockDB was designed to store social graphs (i.e., who is
following whom and who is blocking whom). It offers horizontal scaling and very fast reads and
writes. Operating System: OS Independent.
14. Hibari
Used by many telecom companies, Hibari is a key-value, big data store with strong consistency,
high availability and fast performance. Support is available through Gemini Mobile. Operating
System: OS Independent.
15. Riak
Riak humbly claims to be "the most powerful open-source, distributed database you'll ever put
into production." Users include Comcast, Yammer, Voxer, Boeing, SEOMoz, Joyent, Kiip.me,
Formspring, the Danish Government and many others. Operating System: Linux, OS X.
16. Hypertable
This NoSQL database offers efficiency and fast performance that result in cost savings versus
similar databases. The code is 100 percent open source, but paid support is available. Operating
System: Linux, OS X.
In the context of data analytics for intrusion detection, the following evolution is anticipated:
1st generation: Intrusion detection systems. Security architects realized the need for layered
security (e.g., reactive security and breach response) because a system with 100% protective
security is impossible.
2nd generation: Security information and event management (SIEM). Managing alerts from
different intrusion detection sensors and rules was a big challenge in enterprise settings. SIEM
systems aggregate and filter alarms from many sources and present actionable information to
security analysts.
3rd generation: Big Data analytics in security (2nd-generation SIEM). Big Data tools have
the potential to provide a significant advance in actionable security intelligence by reducing the
time for correlating, consolidating, and contextualizing diverse security event information, and
also for correlating long-term historical data for forensic purposes.
Analyzing logs, network packets, and system events for forensics and intrusion detection has
traditionally been a significant problem; however, traditional technologies fail to provide the
tools to support long-term, large-scale analytics for several reasons:
1. Storing and retaining a large quantity of data was not economically feasible. As a result, most
event logs and other recorded computer activity were deleted after a fixed retention period (e.g.,
60 days).
2. Performing analytics and complex queries on large, structured data sets was inefficient
because traditional tools did not leverage Big Data technologies.
3. Traditional tools were not designed to analyze and manage unstructured data. As a result,
traditional tools had rigid, defined schemas. Big Data tools (e.g., Pig Latin scripts and regular
expressions) can query data in flexible formats.
4. Big Data systems use cluster computing infrastructures. As a result, the systems are more
reliable and available, and provide guarantees that queries on the systems are processed to
completion.
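The flexible-format querying mentioned in point 3 can be illustrated with a small Python sketch. The log lines, patterns, and field names below are invented for illustration; the point is that regular expressions can extract the same fields from differently structured records without a fixed schema:

```python
import re

# Hypothetical log lines in two different, loosely structured formats.
logs = [
    "2024-01-15 10:03:22 sshd[311]: Failed password for root from 10.0.0.5",
    "Jan 15 10:04:01 host app: login ok user=alice ip=10.0.0.9",
]

# One pattern per format; extraction succeeds whichever shape the line has.
patterns = [
    re.compile(r"Failed password for (?P<user>\S+) from (?P<ip>\S+)"),
    re.compile(r"login ok user=(?P<user>\S+) ip=(?P<ip>\S+)"),
]

def extract(line):
    """Return the first pattern's named fields, or None if no format matches."""
    for pattern in patterns:
        match = pattern.search(line)
        if match:
            return match.groupdict()
    return None

events = [extract(line) for line in logs]
print(events[0])  # {'user': 'root', 'ip': '10.0.0.5'}
```

A Pig Latin script would express the same idea at cluster scale, applying such patterns to terabytes of retained logs rather than a handful of lines.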
New Big Data technologies, such as databases related to the Hadoop ecosystem and stream
processing, are enabling the storage and analysis of large heterogeneous data sets at an
unprecedented scale and speed. These technologies will transform security analytics by:
(a) Collecting data at a massive scale from many internal enterprise sources and external sources
such as vulnerability databases;
(b) Performing deeper analytics on the data;
(c) Providing a consolidated view of security-related information; and
(d) Achieving real-time analysis of streaming data. It is important to note that Big Data tools still
require system architects and analysts to have a deep knowledge of their system in order to
properly configure the Big Data analysis tools.
Network Security
In a recently published case study, Zions Bancorporation announced that it is using Hadoop
clusters and business intelligence tools to parse more data more quickly than with traditional
SIEM tools. In their experience, the volume of data and the frequency of event analysis are too
much for traditional SIEMs to handle alone. In their traditional systems, searching a month's
load of data could take between 20 minutes and an hour. In their new Hadoop system
running queries with Hive, they get the same results in about one minute.
The security data warehouse driving this implementation enables users to mine meaningful
security information not only from sources such as firewalls and security devices, but also
from website traffic, business processes, and other day-to-day transactions. This incorporation
of unstructured data and multiple disparate data sets into a single analytical framework is one of
the main promises of Big Data.
Classification techniques (e.g., decision trees and support vector machines) were used to identify
infected hosts and malicious domains. The analysis has already identified many malicious
activities from the ISP data set.
Netflow Monitoring to Identify Botnets
This section summarizes the BotCloud research project (François et al., 2011, November),
which leverages the MapReduce paradigm for analyzing enormous quantities of Netflow data to
identify infected hosts participating in a botnet. The rationale for using MapReduce for this
project stemmed from the large amount of Netflow data collected for data analysis: 720 million
Netflow records (77 GB) were collected in only 23 hours. Processing this data with traditional
tools is challenging. However, Big Data solutions like MapReduce greatly enhance analytics by
enabling an easy-to-deploy distributed computing paradigm.
BotCloud relies on BotTrack, which examines host relationships using a combination of
PageRank and clustering algorithms to track the command-and-control (C&C) channels in the
botnet (François et al., 2011, May). Botnet detection is divided into the following steps:
dependency graph creation, the PageRank algorithm, and DBSCAN clustering. The dependency
graph was constructed from Netflow records by representing each host (IP address) as a node.
There is an edge from node A to B if, and only if, there is at least one Netflow record having A
as the source address and B as the destination address. PageRank will discover patterns in this
graph (assuming that P2P communications between bots have similar characteristics since they
are involved in the same type of activities), and the clustering phase will then group together
hosts having the same pattern. Since PageRank is the most resource-consuming part, it is the
only step implemented in MapReduce.
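The graph-construction and PageRank steps described above can be sketched in Python. This is a single-process toy with invented flow records, standing in for BotCloud's actual MapReduce implementation; the damping factor and node names are illustrative assumptions:

```python
from collections import defaultdict

# Hypothetical Netflow-style records: (source IP, destination IP) pairs.
flows = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]

# Dependency graph: an edge A -> B whenever at least one flow has A as
# the source address and B as the destination address.
graph = defaultdict(set)
for src, dst in flows:
    graph[src].add(dst)

ranks = {node: 1.0 for node in graph}

# One PageRank iteration expressed in map/reduce style.
def map_phase(node, rank, neighbors):
    # Each node emits an equal share of its rank to every neighbor.
    share = rank / len(neighbors)
    return [(dst, share) for dst in neighbors]

def reduce_phase(contributions, damping=0.85):
    # Sum the shares arriving at each node, then apply damping.
    totals = defaultdict(float)
    for node, share in contributions:
        totals[node] += share
    return {node: (1 - damping) + damping * total
            for node, total in totals.items()}

contribs = [pair for node, nbrs in graph.items()
            for pair in map_phase(node, ranks[node], nbrs)]
ranks = reduce_phase(contribs)
print(max(ranks, key=ranks.get))  # C receives the most incoming rank
```

In the real system, the map and reduce phases run across the Hadoop cluster and the iteration repeats until the scores converge; hosts whose final scores cluster together are candidate bots.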
BotCloud used a small Hadoop cluster of 12 commodity nodes (11 slaves + 1 master): 6 Intel
Core 2 Duo 2.13 GHz nodes with 4 GB of memory and 6 Intel Pentium 4 3 GHz nodes with 2 GB
of memory. The dataset contained about 16 million hosts and 720 million Netflow records,
which led to a dependency graph of 57 million edges.
The number of edges in the graph is the main parameter affecting the computational complexity.
Since scores are propagated through the edges, the number of intermediate MapReduce
key-value pairs depends on the number of links. Figure 5 shows the time to complete an
iteration for different numbers of edges and cluster sizes.
The results demonstrate that the time for analyzing the complete dataset (57 million edges) was
reduced by a factor of seven by this small Hadoop cluster. Full results (including the accuracy of
the algorithm for identifying botnets) are described in François et al. (2011, May).
Advanced Persistent Threats Detection
An Advanced Persistent Threat (APT) is a targeted attack against a high-value asset or a
physical system. In contrast to mass-spreading malware, such as worms, viruses, and Trojans,
APT attackers operate in "low-and-slow" mode: low mode maintains a low profile in the
network, and slow mode allows for a long execution time. APT attackers often leverage stolen
user credentials or zero-day exploits to avoid triggering alerts. As such, this type of attack can
take place over an extended period of time while the victim organization remains oblivious to
the intrusion.
The 2010 Verizon data breach investigation report concludes that in 86% of cases, evidence
about the data breach was recorded in the organization's logs, but the detection mechanisms
failed to raise security alarms (Verizon, 2010).
APTs are among the most serious information security threats that organizations face today. A
common goal of an APT is to steal intellectual property (IP) from the targeted organization, to
gain access to sensitive customer data, or to access strategic business information that could be
used for financial gain, blackmail, embarrassment, data poisoning, illegal insider trading, or
disrupting an organization's business. APTs are operated by highly skilled, well-funded and
motivated attackers targeting sensitive information from specific organizations and operating
over periods of months or years. APTs have become very sophisticated and diverse in the
methods and technologies used, particularly in the ability to use an organization's own
employees to penetrate IT systems through social engineering. Attackers often trick users into
opening spear-phishing messages that are customized for each victim (e.g., emails, SMS, and
PUSH messages) and then downloading and installing specially crafted malware that may
contain zero-day exploits (Verizon, 2010; Curry et al., 2011; Alperovitch, 2011).
Today, detection relies heavily on the expertise of human analysts to create custom signatures
and perform manual investigation. This process is labor-intensive, difficult to generalize, and not
scalable. Existing anomaly detection proposals commonly focus on obvious outliers (e.g.,
volume-based), but are ill-suited for stealthy APT attacks and suffer from high false positive
rates.
Big Data analysis is a suitable approach for APT detection. A challenge in detecting APTs is the
massive amount of data to sift through in search of anomalies. The data comes from an
ever-increasing number of diverse information sources that have to be audited. This massive
volume of data makes the detection task look like searching for a needle in a haystack (Giura &
Wang, 2012). Due to the volume of data, traditional network perimeter defense systems can
become ineffective in detecting targeted attacks, and they are not scalable to the increasing size
of organizational networks. As a result, a new approach is required. Many enterprises collect
data about user and host activities within the organization's network, as logged by firewalls, web
proxies, domain controllers, intrusion detection systems, and VPN servers. While this data is
typically used for compliance and forensic investigation, it also contains a wealth of information
about user behavior that holds promise for detecting stealthy attacks.
Beehive: Behavior Profiling for APT Detection
At RSA Labs, the observation about APTs is that, however subtle the attack might be, the
attacker's behavior (in attempting to steal sensitive information or subvert system operations)
should cause the compromised user's actions to deviate from their usual pattern. Moreover,
since APT attacks consist of multiple stages (e.g., exploitation, command-and-control, lateral
movement, and objectives), each action by the attacker provides an opportunity to detect
behavioral deviations from the norm.
Correlating these seemingly independent events can reveal evidence of the intrusion, exposing
stealthy attacks that could not be identified with previous methods.
These detectors of behavioral deviations are referred to as anomaly sensors, with each sensor
examining one aspect of a host's or user's activities within an enterprise's network. For
instance, a sensor may keep track of the external sites a host contacts in order to identify unusual
connections (potential command-and-control channels), profile the set of machines each user
logs into to find anomalous access patterns (potential pivoting behavior in the lateral
movement stage), study users' regular working hours to flag suspicious activities in the middle
of the night, or track the flow of data between internal hosts to find unusual sinks where large
amounts of data are gathered (potential staging servers before data exfiltration).
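One of the sensors described above, flagging activity outside a user's regular working hours, might look like the following sketch. The profile data, user name, and function name are hypothetical illustrations, not RSA's actual implementation:

```python
from datetime import datetime

# Hypothetical per-user profile: the hours (0-23) during which
# activity is considered normal, learned from historical logs.
profiles = {"alice": set(range(8, 19))}  # alice normally works 08:00-18:59

def working_hours_sensor(user, timestamp):
    """Return True when an event falls outside the user's usual hours."""
    normal_hours = profiles.get(user, set())
    return timestamp.hour not in normal_hours

daytime = datetime(2024, 1, 15, 14, 30)
midnight = datetime(2024, 1, 15, 2, 10)
print(working_hours_sensor("alice", daytime))   # False
print(working_hours_sensor("alice", midnight))  # True
```

Each sensor in such a system is deliberately simple; the detection power comes from running many of them over the full log stream and correlating their outputs.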
While the triggering of one sensor indicates the presence of a singular unusual activity, the
triggering of multiple sensors suggests more suspicious behavior. The human analyst is given
the flexibility of combining multiple sensors according to known attack patterns (e.g.,
command-and-control communications followed by lateral movement) to look for abnormal
events that may warrant investigation, or to generate behavioral reports of a given user's
activities across time.
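Combining sensor outputs as described could be sketched as follows. The sensor names, the simple count-based score, and the attack pattern are all hypothetical illustrations of the idea, not the actual design:

```python
# Hypothetical sensor outputs for one host over one day: True = triggered.
sensor_hits = {
    "unusual_external_site": True,   # potential C&C channel
    "anomalous_logins": True,        # potential lateral movement
    "off_hours_activity": False,
    "unusual_data_sink": True,       # potential staging server
}

# A known attack pattern: C&C communication followed by lateral movement.
attack_pattern = ["unusual_external_site", "anomalous_logins"]

def matches_pattern(hits, pattern):
    """True when every sensor named in the pattern has triggered."""
    return all(hits.get(sensor, False) for sensor in pattern)

# A crude suspiciousness score: the number of sensors that fired.
score = sum(hits for hits in sensor_hits.values())
print(score)                                         # 3
print(matches_pattern(sensor_hits, attack_pattern))  # True
```

A single triggered sensor yields a low score and is likely noise; a high score, or a match against a known multi-stage pattern, is what gets escalated to the analyst.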
The prototype APT detection system at RSA Labs is named Beehive. The name refers to the
multiple weak components (the sensors) that work together to achieve a goal (APT detection),
just as bees with differentiated roles cooperate to maintain a hive. Preliminary results showed
that Beehive is able to process a day's worth of data (around a billion log messages) in an hour
and identified policy violations and malware infections that would otherwise have gone
unnoticed (Yen et al., 2013).
In addition to detecting APTs, behavior profiling also supports other applications, including IT
management (e.g., identifying critical services and unauthorized IT infrastructure within the
organization by examining usage patterns) and behavior-based authentication (e.g.,
authenticating users based on their interactions with other users and hosts, the applications they
typically access, or their regular working hours). Thus, Beehive provides insights into an
organization's environment for security and beyond.
Consumer product companies and retail organizations are monitoring social media
like Facebook and Twitter to get an unprecedented view into customer behavior,
preferences, and product perception.
Manufacturers are monitoring minute vibration data from their equipment, which
changes slightly as it wears down, to predict the optimal time to replace or maintain it.
Replacing it too soon wastes money; replacing it too late triggers an expensive work
stoppage.
Manufacturers are also monitoring social networks, but with a different goal than
marketers: they are using them to detect aftermarket support issues before a warranty
failure becomes publicly detrimental.
Financial Services organizations are using data mined from customer interactions to
slice and dice their users into finely tuned segments. This enables these financial
institutions to create increasingly relevant and sophisticated offers.
Advertising and marketing agencies are tracking social media to understand
responsiveness to campaigns, promotions, and other advertising mediums.
Insurance companies are using Big Data analysis to see which home insurance
applications can be immediately processed, and which ones need a validating in-person
visit from an agent.
By embracing social media, retail organizations are engaging brand advocates,
changing the perception of brand antagonists, and even enabling enthusiastic
customers to sell their products.
Hospitals are analyzing medical data and patient records to predict those patients that
are likely to seek readmission within a few months of discharge. The hospital can
then intervene in hopes of preventing another costly hospital stay.
Web-based businesses are developing information products that combine data
gathered from customers to offer more appealing recommendations and more
successful coupon programs.
Governments are making data public at the national, state, and city levels for
users to develop new applications that can generate public good.
Sports teams are using data for tracking ticket sales and even for tracking team
strategies.
According to global research firm Gartner, by 2015 nearly 4.4 million new jobs will be created globally
by Big Data demand, and only one-third of them will be filled. India, along with China, could be one
of the biggest suppliers of manpower for the Big Data industry. This is one of the "Top Predictions for
2013: Balancing Economics, Risk, Opportunity & Innovation" that Gartner released in Chennai.
Big Data spending is expected to exceed $130 billion by 2015, generating jobs. Companies and
professionals should seek to acquire relevant skills to deal with high-volume, real-time data.
Every day 2.5 quintillion bytes of data are created; so much that 90 per cent of the data in the world
today has been created in the last two years alone. This data comes from everywhere: sensors used to
gather climate information, posts to social media sites, digital pictures and videos, purchase transaction
records, and cell phone GPS signals, to name a few. Specialists are required to analyse this Big Data in
a company to either take corrective actions or predict future trends.
Advanced information management/analytical skills and business expertise are growing in importance.