
CHAPTER 1

INTRODUCTION

The amount of data generated every day is expanding at a drastic rate. Big Data is a popular term used to describe data whose volume is measured in zettabytes. Governments, companies and many other organisations try to acquire and store data about their citizens and customers in order to know them better and to predict customer behaviour. Social networking websites generate new data every second, and handling such data is one of the major challenges companies are facing. Data stored in data warehouses in raw format causes problems, because proper analysis and processing must be carried out before usable information can be produced from it. Big Data has to deal with large and complex datasets that can be structured, semi-structured, or unstructured and that typically do not fit into memory for processing. They have to be processed in place, which means that computation has to be done where the data resides. Big data challenges include analysis, capture, search, sharing, storage, transfer, visualization, and privacy violations. The trend towards larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on".

Big Data usually includes datasets whose sizes are beyond the ability of commonly used systems to capture, manage, and process within the time frame mandated by the business. Big Data volumes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single dataset.


New tools are being used to handle such large amounts of data in a short time. Apache Hadoop is a Java-based programming framework used for processing large data sets in a distributed computing environment. Hadoop is used on systems where multiple nodes are present, which together can process terabytes of data. Hadoop uses its own file system, HDFS, which facilitates fast transfer of data, can sustain node failures, and avoids failure of the system as a whole. Hadoop uses the MapReduce model, which breaks big data down into smaller chunks and performs operations on them in parallel. The Hadoop framework is used by many big companies such as Yahoo! and IBM for applications such as search engines, advertising, and information gathering and processing.
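To make the MapReduce idea concrete, the following is a minimal word-count sketch written against Hadoop's Java MapReduce API. The class names are illustrative; in practice the mapper and reducer are packaged into a jar and submitted to the cluster.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each chunk (input split) is processed independently;
// every line is broken into words and emitted as (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: all counts for the same word arrive together and are summed.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}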

Various technologies work hand in hand to accomplish this task, such as the Spring Hadoop Data Framework for the basic foundations and running of the MapReduce jobs, Apache Maven for building the code, REST web services for communication, and lastly Apache Hadoop for distributed processing of the huge dataset. The volume of data that one has to deal with has exploded to unimaginable levels in the past decade, and at the same time, the price of data storage has systematically reduced. Private companies and research institutions capture terabytes of data about their users' interactions, business, social media, and also sensors in devices such as mobile phones and automobiles. The challenge of this era is to make sense of this sea of data. This is where big data analytics comes into the picture. Big Data Analytics largely involves collecting data from different sources, preparing it in such a way that it becomes available to be consumed by analysts, and finally delivering data products useful to the organization's business. The process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful for organizations forms the core of Big Data Analytics.

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. The term often refers simply to the use of predictive analytics or other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reductions and reduced risk. Data sets grow in size in part because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, aerial sensors (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes of data were created every day. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

CHAPTER 2

BIG DATA

Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world

today has been created in the last two years alone. This data comes from everywhere: sensors

used to gather climate information, posts to social media sites, digital pictures and videos,

purchase transaction records, and cell phone GPS signals to name a few. This data is big data.

The data lying on the servers of a company was just data until yesterday, sorted and filed. Suddenly, the term Big Data became popular, and now the data in a company is Big Data. The term covers each and every piece of data an organization has stored till now. It includes data stored in clouds and even the URLs that have been bookmarked. A company might not have digitized all the data, and it may not have structured all of it yet. But all the digital and paper records, structured and unstructured, held by the company are now Big Data.

Big data is an all-encompassing term for any collection of data sets so large and complex that it

becomes difficult to process using traditional data processing applications. It refers to the large

amounts, at least terabytes, of poly-structured data that flows continuously through and around

organizations, including video, text, sensor logs, and transactional records. The business

benefits of analyzing this data can be significant. According to a recent study by the MIT Sloan

School of Management, organizations that use analytics are twice as likely to be top performers

in their industry as those that don’t.

Big data burst upon the scene in the first decade of the 21st century, and the first organizations

to embrace it were online and startup firms. In a nutshell, Big Data is your data. It's the
information owned by your company, obtained and processed through new techniques to

produce value in the best way possible.

Companies have sought for decades to make the best use of information to improve their

business capabilities. However, it's the structure (or lack thereof) and size of Big Data that

makes it so unique. Big Data is also special because it represents both significant information, which can open new doors, and the way this information is analyzed to help open those doors. The analysis goes hand in hand with the information, so in this sense "Big Data" represents both a noun ("the data") and a verb ("combing the data to find value"). The days of keeping

company data in Microsoft Office documents on carefully organized file shares are behind us,

much like the bygone era of sailing across the ocean in tiny ships. That 50 gigabyte file share in

2002 looks quite tiny compared to a modern-day 50 terabyte marketing database containing

customer preferences and habits.

Some of the popular organizations that hold Big Data are as follows:

1. Facebook: It has 40 PB of data and captures 100 TB/day

2. Yahoo!: It has 60 PB of data

3. Twitter: It captures 8 TB/day

4. eBay: It has 40 PB of data and captures 50 TB/day

How much data is considered Big Data differs from company to company. Though it is true that one company's Big Data is another's small data, there is something in common: the data does not fit in the memory or on the disk of a single machine, there is a rapid influx of data that needs to be processed, and the organisation would benefit from a distributed software stack. For some companies, 10 TB of data would be considered Big Data, and for others 1 PB would be Big Data. So only you can determine whether your data is really Big Data. It is sufficient to say that it would start in the low terabyte range.


2.1 Attributes of Big Data:

As far back as 2001, industry analyst Doug Laney (currently with Gartner) articulated the now

mainstream definition of big data as the three Vs of big data: volume, velocity and variety.

1. Volume: The quantity of data that is generated is very important in this context. It is the size of the data which determines the value and potential of the data under consideration, and whether it can actually be considered Big Data or not. The name 'Big Data' itself contains a term related to size, hence this characteristic. Many factors contribute to the increase in data volume: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.

1. It is estimated that 2.5 quintillion bytes of data are generated every day.

2. 40 zettabytes of data will be created by 2020, an increase of 30 times from 2005.

3. 6 billion people around the world are using mobile phones.

2. Velocity: The term 'velocity' in this context refers to the speed at which data is generated and processed to meet the demands and the challenges which lie ahead in the path of growth and development. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.

1. 20 hours of video are uploaded every minute.

2. 2.9 million emails are sent every second.

3. The New York Stock Exchange captures 1 TB of information during every trading session.

3. Variety: The next aspect of Big Data is its variety. The category to which Big Data belongs is also a very essential fact that needs to be known by the data analysts. This helps the people who closely analyze the data, and are associated with it, to use the data effectively to their advantage, thus upholding the importance of Big Data. Data today comes in all types of formats: structured, numeric data in traditional databases; information created by line-of-business applications; and unstructured text documents, email, video, audio, stock ticker data and financial transactions.

1. The global volume of health-care data was about 150 exabytes as of 2011.

2. 30 billion pieces of content are shared on Facebook every month.

3. 4 billion hours of video are watched on YouTube each month.

4. Veracity: In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so when unstructured data is involved. This is a factor which can be a problem for those who analyse the data: poor data quality costs the US economy around $3.1 trillion a year.

5. Complexity: Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information that they are supposed to convey. This situation is therefore termed the 'complexity' of Big Data.

6. Volatility: Big data volatility refers to how long data is valid and how long it should be stored. In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis.


2.2 Big Data Applications:

1. ADVERTISING: Big data analytics helps companies like Google and other advertising companies to identify the behaviour of a person and to target ads accordingly, enabling more personal and targeted advertising.

2. ONLINE MARKETING: Big data analytics is used by online retailers such as Amazon, eBay and Flipkart to identify their potential customers, give them offers, vary the price of products according to trends, and so on.

3. HEALTH CARE: The average amount of data per hospital is expected to increase from 167 TB to 665 TB by 2015. With Big Data, medical professionals can improve patient care and reduce costs by extracting relevant clinical information.

4. CUSTOMER SERVICE: Service representatives can use data to gain a more holistic view

of their customers, understanding their likes and dislikes in real time.


CHAPTER 3

BIG DATA ANALYTICS

Big data is difficult to work with using most relational database management systems and

desktop statistics and visualization packages, requiring instead "massively parallel software

running on tens, hundreds, or even thousands of servers".

Rapidly ingesting, storing, and processing big data requires a cost-effective infrastructure that

can scale with the amount of data and the scope of analysis. Most organizations with traditional

data platforms—typically relational database management systems (RDBMS) coupled to

enterprise data warehouses (EDW) using ETL tools—find that their legacy infrastructure is

either technically incapable or financially impractical for storing and analyzing big data. A

traditional ETL process extracts data from multiple sources, then cleanses, formats, and loads it

into a data warehouse for analysis. When the source data sets are large, fast, and unstructured,

traditional ETL can become the bottleneck, because it is too complex to develop, too expensive

to operate, and takes too long to execute.

By most accounts, 80 percent of the development effort in a big data project goes into data

integration and only 20 percent goes toward data analysis. Furthermore, a traditional EDW

platform can cost upwards of USD 60K per terabyte. Analyzing one petabyte—the amount of

data Google processes in 1 hour—would cost USD 60M. Clearly “more of the same” is not a

big data strategy that any CIO can afford. So we require more efficient analytics for Big Data.
Big Analytics delivers competitive advantage in two ways compared to the traditional

analytical model. First, Big Analytics describes the efficient use of a simple model applied to

volumes of data that would be too large for the traditional analytical environment. Research

suggests that a simple algorithm with a large volume of data is more accurate than a

sophisticated algorithm with little data. The algorithm is not the competitive advantage; the

ability to apply it to huge amounts of data—without compromising performance—generates

the competitive edge.

3.1 Challenges of Big Data Analytics:

For most organizations, big data analysis is a challenge. Consider the sheer volume of data and

the many different formats of the data (both structured and unstructured data) collected across

the entire organization and the many different ways different types of data can be combined,

contrasted and analyzed to find patterns and other useful information.

1. Meeting the need for speed: In today’s hypercompetitive business environment, companies

not only have to find and analyze the relevant data they need, they must find it quickly.

Visualization helps organizations perform analyses and make decisions much more rapidly, but

the challenge is going through the sheer volumes of data and accessing the level of detail

needed, all at a high speed. One possible solution is hardware. Some vendors are using

increased memory and powerful parallel processing to crunch large volumes of data extremely

quickly. Another method is to put data in memory and use a grid computing approach, where many machines are used to solve a problem.

2. Understanding the data: It takes a lot of understanding to get data in the right shape so that

you can use visualization as part of data analysis. For example, if the data comes from social

media content, you need to know who the user is in a general sense – such as a customer using
a particular set of products – and understand what it is you’re trying to visualize out of the data.

One solution to this challenge is to have the proper domain expertise in place. Make sure the

people analyzing the data have a deep understanding of where the data comes from, what

audience will be consuming the data and how that audience will interpret the information.

3. Addressing data quality: Even if you can find and analyze data quickly and put it in the

proper context for the audience that will be consuming the information, the value of data for

decision-making purposes will be jeopardized if the data is not accurate or timely. This is a

challenge with any data analysis, but when considering the volumes of information involved in

big data projects, it becomes even more pronounced. Again, data visualization will only prove

to be a valuable tool if the data quality is assured. To address this issue, companies need to

have a data governance or information management process in place to ensure the data is clean.

4. Displaying meaningful results: Plotting points on a graph for analysis becomes difficult

when dealing with extremely large amounts of information or a variety of categories of

information. For example, imagine you have 10 billion rows of retail SKU data that you’re

trying to compare. The user trying to view 10 billion plots on the screen will have a hard time

seeing so many data points. One way to resolve this is to cluster data into a higher-level view

where smaller groups of data become visible. By grouping the data together, or “binning,” you

can more effectively visualize the data.
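As a simple illustration of binning, the sketch below groups a list of numeric values into fixed-width buckets and counts the members of each bucket; the bucket width and sample values are arbitrary and purely for illustration.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class BinningExample {
    public static void main(String[] args) {
        // Example values to be binned (e.g. sales amounts); purely illustrative.
        List<Double> values = List.of(3.2, 7.9, 12.5, 14.1, 21.0, 22.7, 48.3);
        double binWidth = 10.0;

        // Map each value to the lower bound of its bin and count bin members.
        Map<Double, Long> histogram = values.stream()
                .collect(Collectors.groupingBy(
                        v -> Math.floor(v / binWidth) * binWidth,
                        TreeMap::new,
                        Collectors.counting()));

        // Plotting these few counts, rather than every raw point,
        // shows the shape of the data at a higher level.
        histogram.forEach((bin, count) ->
                System.out.printf("[%.1f, %.1f): %d%n", bin, bin + binWidth, count));
    }
}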


3.2 Objectives of Big Data Analytics:

1. Cost Reduction from Big Data Technologies: Some organizations pursuing big data

believe strongly that MIPS and terabyte storage for structured data are now most cheaply

delivered through big data technologies like Hadoop clusters. Organizations that were focused

on cost reduction made the decision to adopt big data tools primarily within the IT organization

on largely technical and economic criteria.

2. Time Reduction from Big Data: The second common objective of big data technologies

and solutions is time reduction. Many companies make use of big data analytics to generate

real-time decisions to save a lot of time. Another key objective involving time reduction is to

be able to interact with the customer in real time, using analytics and data derived from the

customer experience.

3. Developing New Big Data-Based Offerings: One of the most ambitious things an

organization can do with big data is to employ it in developing new product and service

offerings based on data. Many of the companies that employ this approach are online firms,

which have an obvious need to employ data-based products and services.


CHAPTER 4

APACHE HADOOP

Doug Cutting and Mike Cafarella helped create Apache Hadoop in 2005 out of necessity, as data from the web exploded and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data.

Apache Hadoop is an open source distributed software platform for storing and processing

data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-

attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands

of servers while scaling performance cost-effectively by merely adding inexpensive nodes to

the cluster.

Apache Hadoop is 100% open source, and pioneered a fundamentally new way of storing and

processing data. Instead of relying on expensive, proprietary hardware and different systems to

store and process data, Hadoop enables distributed parallel processing of huge amounts of data

across inexpensive, industry-standard servers that both store and process the data, and can scale

without limits. With Hadoop, no data is too big. And in today’s hyper-connected world where

more and more data is being created every day, Hadoop’s breakthrough advantages mean that

businesses and organizations can now find value in data that was recently considered useless.

HDFS: Self-healing, high-bandwidth, clustered storage.

MapReduce: Distributed, fault-tolerant resource management, coupled with scalable data processing.

YARN : YARN is the architectural center of Hadoop that allows multiple data processing

engines such as interactive SQL, real-time streaming, data science and batch processing to

handle data stored in a single platform, unlocking an entirely new approach to analytics. It is

the foundation of the new generation of Hadoop and is enabling organizations everywhere to

realize a modern data architecture.

YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central

platform to deliver consistent operations, security, and data governance tools across the Hadoop cluster. HDFS is Hadoop's own rack-aware filesystem, which is a UNIX-based data storage

layer of Hadoop. HDFS is derived from concepts of Google filesystem. An important

characteristic of Hadoop is the partitioning of data and computation across many (thousands of)

hosts, and the execution of application computations in parallel, close to their data. On HDFS,

data files are replicated as sequences of blocks in the cluster. A Hadoop cluster scales

computation capacity, storage capacity, and I/O bandwidth by simply adding commodity

servers. HDFS can be accessed from applications in many different ways. Natively, HDFS

provides a Java API for applications to use.
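The snippet below is a small sketch of that Java API, writing a file into HDFS and reading it back through the org.apache.hadoop.fs.FileSystem abstraction; the NameNode address and file path are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");

        // Write a file; HDFS splits it into blocks and replicates them across nodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back as a stream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}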

The Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data,

with the largest Hadoop cluster being 4,000 servers. Also, one hundred other organizations

worldwide are known to use Hadoop.

HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works

closely with MapReduce. HDFS will “just work” under a variety of physical and systemic
circumstances. By distributing storage and computation across many servers, the combined

storage resource can grow with demand while remaining economical at every size.

HDFS supports parallel reading and writing and is optimized for streaming reads and writes; the bandwidth scales linearly with the number of nodes. HDFS provides a block replication factor, normally 3, i.e. every block is replicated three times on different nodes. This provides higher fault tolerance.
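As a small illustrative sketch (the file path and replication values are arbitrary), the replication factor can be set cluster-wide through configuration or per file through the same Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, equivalent to dfs.replication in hdfs-site.xml.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Override the replication factor for one existing file.
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 2);
        fs.close();
    }
}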

These specific features ensure that the Hadoop clusters are highly functional and highly

available:

1. Rack awareness allows consideration of a node's physical location when allocating storage and scheduling tasks

2. Minimal data motion. MapReduce moves compute processes to the data on HDFS and

not the other way around. Processing tasks can occur on the physical node where the

data resides.

3. Utilities diagnose the health of the file system and can rebalance the data across different nodes

4. Rollback allows system operators to bring back the previous version of HDFS after an

upgrade, in case of human or system errors

5. Standby NameNode provides redundancy and supports high availability

6. Highly operable. Hadoop handles different types of cluster failures that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of thousands of nodes.

CHAPTER 5

BIG DATA FRAMEWORKS

1. Apache Hadoop

Apache Hadoop is an open source, scalable and fault-tolerant framework written in Java. It is a processing framework that exclusively provides batch processing, and it efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but a platform for both storing and processing large volumes of data.

Modern versions of Hadoop are composed of several components or layers that work together

to process batch data. These are listed below.

1. HDFS (Hadoop Distributed File System): This is the distributed file system layer that

coordinates storage and replication across the cluster nodes. HDFS ensures that data

remains available in spite of inevitable host failures. It is used as the source of data, to

store intermediate processing results, and to persist the final calculated results.

2. YARN: This stands for Yet Another Resource Negotiator. It is the cluster coordinating

component of the Hadoop stack, and is responsible for coordinating and managing the

underlying resources and scheduling jobs that need to be run.

3. MapReduce: This is Hadoop’s native batch processing engine.
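To show how these layers fit together, here is a minimal sketch of a driver that submits the word-count mapper and reducer from the earlier sketch as a MapReduce job scheduled by YARN; the input and output paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);   // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // HDFS input and output locations; placeholders for illustration.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}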

2. Apache Storm

Apache Storm is a stream processing framework that focuses on extremely low latency and is

perhaps the best option for workloads that require near real-time processing. It can handle very

large quantities of data and deliver results with less latency than other solutions. Storm is

simple, can be used with any programming language, and is also a lot of fun.
Storm has many use cases: real-time analytics, online machine learning, continuous

computation, distributed RPC, ETL, and more. It is fast—a benchmark clocked it at over a

million tuples processed per second per node. It is also scalable, fault-tolerant, guarantees your

data will be processed, and is easy to set up and operate.
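For a sense of Storm's programming model, here is a minimal topology sketch using Storm's Java API; SentenceSpout and WordCountBolt are hypothetical user-defined components named only for illustration.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        // Wire a spout (stream source) to a bolt (processing step).
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);   // hypothetical spout
        builder.setBolt("counts", new WordCountBolt(), 2)        // hypothetical bolt
               .shuffleGrouping("sentences");

        Config conf = new Config();
        conf.setDebug(false);

        // Run locally for testing; a real deployment would use StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", conf, builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}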

3. Apache Samza

Apache Samza is a stream processing framework that is tightly tied to the Apache Kafka

messaging system. While Kafka can be used by many stream processing systems, Samza is

designed specifically to take advantage of Kafka’s unique architecture and guarantees. It uses

Kafka to provide fault tolerance, buffering and state storage.

Samza uses YARN for resource negotiation. This means that, by default, a Hadoop cluster is

required (at least HDFS and YARN). It also means that Samza can rely on the rich features

built into YARN.

4. Apache Spark

Apache Spark is a general-purpose and lightning-fast cluster computing system. It provides high-level APIs in Java, Scala, Python and R, along with an engine for running Spark applications. It can be up to 100 times faster than Hadoop MapReduce when data is processed in memory, and about ten times faster when accessing data from disk. It can be integrated with Hadoop and can process existing Hadoop HDFS data.

Apache Spark is a next generation batch processing framework with stream processing

capabilities. Built using many of the same principles of Hadoop’s MapReduce engine, Spark

focuses primarily on speeding up batch processing workloads by offering full in-memory

computation and processing optimisation.
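A minimal sketch of Spark's Java API illustrates this in-memory model; the HDFS paths are placeholders, and the local master setting is only for testing.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs Spark in-process for testing; a cluster job would run on YARN.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");   // placeholder path

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);   // intermediate results stay in memory

        counts.saveAsTextFile("hdfs:///user/demo/output");                // placeholder path
        sc.stop();
    }
}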


5. Apache Flink

Apache Flink is an open source platform; it is a streaming data flow engine that provides

communication, fault tolerance and data distribution for distributed computations over data

streams. It is a scalable data analytics framework that is fully compatible with Hadoop. Flink

can execute both stream processing and batch processing easily.

While Spark performs batch and stream processing, its streaming is not appropriate for many

use cases because of its micro-batch architecture. Flink’s stream-first approach offers low

latency, high throughput, and real entry-by-entry processing.
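As a sketch of Flink's stream-first DataStream API in Java (the element values and job name are illustrative), counting words entry by entry as records arrive might look like this:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkStreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A tiny bounded source for illustration; a real job would read from a socket or Kafka.
        DataStream<String> text = env.fromElements("big data", "stream processing", "big streams");

        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new Tokenizer())        // split lines into (word, 1) pairs
                .keyBy(value -> value.f0)        // group by the word
                .sum(1);                         // running count, updated entry by entry

        counts.print();
        env.execute("Flink Streaming WordCount");
    }

    // Splits each line on whitespace and emits (word, 1) tuples.
    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}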
