
CHAPTER 1

INTRODUCTION

The amount of data generated every day is expanding at a drastic rate. Big Data is a popular term used to describe data whose volume is measured in zettabytes. Governments, companies and many other organisations try to acquire and store data about their citizens and customers in order to know them better and to predict customer behaviour. Social networking websites generate new data every second, and handling such data is one of the major challenges companies are facing. Data stored in data warehouses in raw format causes problems, because proper analysis and processing must be carried out before usable information can be produced from it. Big Data has to deal with large and complex datasets that can be structured, semi-structured, or unstructured and that typically do not fit into memory for processing. They have to be processed in place, which means that computation has to be done where the data resides. Big data challenges include analysis, capture, search, sharing, storage, transfer, visualization, and privacy violations. The trend towards larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, prevent diseases, combat crime and so on".

Big Data usually includes datasets whose sizes are beyond the ability of commonly used systems to capture, manage, and process within the time frame mandated by the business. Big Data volumes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single dataset.


New tools are being used to handle such large amounts of data in a short time. Apache Hadoop is a Java-based programming framework used for processing large data sets in a distributed computing environment. Hadoop is used on systems where multiple nodes are present, which together can process terabytes of data. Hadoop uses its own file system, HDFS, which facilitates fast transfer of data, can sustain node failures, and avoids failure of the system as a whole. Hadoop uses the MapReduce model, which breaks big data down into smaller chunks and performs operations on them in parallel. The Hadoop framework is used by many big companies such as Yahoo! and IBM for applications such as search engines, advertising, and information gathering and processing.
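To make the MapReduce idea concrete, the following is a minimal word-count sketch written against Hadoop's Java MapReduce API. The class names are illustrative; in practice the mapper and reducer are packaged into a jar and submitted to the cluster.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: each chunk (input split) is processed independently;
// every line is broken into words and emitted as (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: all counts for the same word arrive together and are summed.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}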

Various technologies work hand in hand to accomplish this task, such as the Spring Hadoop Data Framework for the basic foundations and running of the MapReduce jobs, Apache Maven for building the code, REST web services for communication, and lastly Apache Hadoop for distributed processing of the huge dataset. The volume of data that one has to deal with has exploded to unimaginable levels in the past decade, and at the same time, the price of data storage has systematically reduced. Private companies and research institutions capture terabytes of data about their users' interactions, business, social media, and also sensors in devices such as mobile phones and automobiles. The challenge of this era is to make sense of this sea of data. This is where big data analytics comes into the picture. Big Data Analytics largely involves collecting data from different sources, preparing it in such a way that it becomes available to be consumed by analysts, and finally delivering data products useful to the organization's business. The process of converting large amounts of unstructured raw data, retrieved from different sources, into a data product useful for organizations forms the core of Big Data Analytics.

Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. The term often refers simply to the use of predictive analytics or other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reductions and reduced risk. Data sets grow in size in part because they are increasingly being gathered by cheap and numerous information-sensing mobile devices, aerial sensors (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes of data were created every day. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

CHAPTER 2

BIG DATA

Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world

today has been created in the last two years alone. This data comes from everywhere: sensors

used to gather climate information, posts to social media sites, digital pictures and videos,

purchase transaction records, and cell phone GPS signals to name a few. This data is big data.

The data lying on the servers of a company was just data until yesterday, sorted and filed. Suddenly, the term Big Data became popular, and now the data in a company is Big Data. The term covers each and every piece of data an organization has stored till now. It includes data stored in clouds and even the URLs that have been bookmarked. A company might not have digitized all the data, and it may not have structured all of it yet. But all the digital and paper records, structured and unstructured, held by the company are now Big Data.

Big data is an all-encompassing term for any collection of data sets so large and complex that it

becomes difficult to process using traditional data processing applications. It refers to the large

amounts, at least terabytes, of poly-structured data that flows continuously through and around

organizations, including video, text, sensor logs, and transactional records. The business

benefits of analyzing this data can be significant. According to a recent study by the MIT Sloan

School of Management, organizations that use analytics are twice as likely to be top performers

in their industry as those that don’t.

Big data burst upon the scene in the first decade of the 21st century, and the first organizations

to embrace it were online and startup firms. In a nutshell, Big Data is your data. It's the
information owned by your company, obtained and processed through new techniques to

produce value in the best way possible.

Companies have sought for decades to make the best use of information to improve their

business capabilities. However, it's the structure (or lack thereof) and size of Big Data that

makes it so unique. Big Data is also special because it represents both significant information, which can open new doors, and the way this information is analyzed to help open those doors. The analysis goes hand in hand with the information, so in this sense "Big Data" represents both a noun ("the data") and a verb ("combing the data to find value"). The days of keeping

company data in Microsoft Office documents on carefully organized file shares are behind us,

much like the bygone era of sailing across the ocean in tiny ships. That 50 gigabyte file share in

2002 looks quite tiny compared to a modern-day 50 terabyte marketing database containing

customer preferences and habits.

Some of the popular organizations that hold Big Data are as follows:

1. Facebook: It has 40 PB of data and captures 100 TB/day

2. Yahoo!: It has 60 PB of data

3. Twitter: It captures 8 TB/day

4. eBay: It has 40 PB of data and captures 50 TB/day

How much data is considered Big Data differs from company to company. Though it is true that one company's Big Data is another's small data, there is something in common: the data does not fit in the memory or on the disk of a single machine, there is a rapid influx of data that needs to be processed, and the organisation would benefit from a distributed software stack. For some companies, 10 TB of data would be considered Big Data, and for others 1 PB would be Big Data. So only you can determine whether your data is really Big Data. It is sufficient to say that it would start in the low terabyte range.


2.1 Attributes of Big Data:

As far back as 2001, industry analyst Doug Laney (currently with Gartner) articulated the now

mainstream definition of big data as the three Vs of big data: volume, velocity and variety.

1. Volume: The quantity of data that is generated is very important in this context. It is the size of the data which determines the value and potential of the data under consideration, and whether it can actually be considered Big Data or not. The name 'Big Data' itself contains a term related to size, hence this characteristic. Many factors contribute to the increase in data volume: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue. But with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.

1. It is estimated that 2.5 quintillion bytes of data are generated every day.

2. 40 zettabytes of data will be created by 2020, an increase of 30 times from 2005.

3. 6 billion people around the world are using mobile phones.

2. Velocity: The term 'velocity' in this context refers to the speed at which data is generated and processed to meet the demands and the challenges which lie ahead in the path of growth and development. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.

1. 20 hours of video are uploaded every minute.

2. 2.9 million emails are sent every second.

3. The New York Stock Exchange captures 1 TB of information during every trading session.

3. Variety: The next aspect of Big Data is its variety. The category to which Big Data belongs is also a very essential fact that needs to be known by the data analysts. This helps the people who closely analyze the data, and are associated with it, to use the data effectively to their advantage, thus upholding the importance of Big Data. Data today comes in all types of formats: structured, numeric data in traditional databases; information created by line-of-business applications; and unstructured text documents, email, video, audio, stock ticker data and financial transactions.

1. The global volume of health-care data was about 150 exabytes as of 2011.

2. 30 billion pieces of content are shared on Facebook every month.

3. 4 billion hours of video are watched on YouTube each month.

4. Veracity: In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so when unstructured data is involved. This is a factor which can be a problem for those who analyse the data: poor data quality costs the US economy around $3.1 trillion a year.

5. Complexity: Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information that they are supposed to convey. This situation is therefore termed the 'complexity' of Big Data.

6. Volatility: Big data volatility refers to how long data is valid and how long it should be stored. In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis.


2.2 Big Data Applications:

1. ADVERTISING: Big data analytics helps companies like Google and other advertising companies to identify the behaviour of a person and to target ads accordingly, enabling more personal and targeted advertising.

2. ONLINE MARKETING: Big data analytics is used by online retailers such as Amazon, eBay and Flipkart to identify their potential customers, give them offers, vary the price of products according to trends, and so on.

3. HEALTH CARE: The average amount of data per hospital is expected to increase from 167 TB to 665 TB by 2015. With Big Data, medical professionals can improve patient care and reduce costs by extracting relevant clinical information.

4. CUSTOMER SERVICE: Service representatives can use data to gain a more holistic view

of their customers, understanding their likes and dislikes in real time.


CHAPTER 3

BIG DATA ANALYTICS

Big data is difficult to work with using most relational database management systems and

desktop statistics and visualization packages, requiring instead "massively parallel software

running on tens, hundreds, or even thousands of servers".

Rapidly ingesting, storing, and processing big data requires a cost-effective infrastructure that

can scale with the amount of data and the scope of analysis. Most organizations with traditional

data platforms—typically relational database management systems (RDBMS) coupled to

enterprise data warehouses (EDW) using ETL tools—find that their legacy infrastructure is

either technically incapable or financially impractical for storing and analyzing big data. A

traditional ETL process extracts data from multiple sources, then cleanses, formats, and loads it

into a data warehouse for analysis. When the source data sets are large, fast, and unstructured,

traditional ETL can become the bottleneck, because it is too complex to develop, too expensive

to operate, and takes too long to execute.

By most accounts, 80 percent of the development effort in a big data project goes into data

integration and only 20 percent goes toward data analysis. Furthermore, a traditional EDW

platform can cost upwards of USD 60K per terabyte. Analyzing one petabyte—the amount of

data Google processes in 1 hour—would cost USD 60M. Clearly “more of the same” is not a

big data strategy that any CIO can afford. So we require more efficient analytics for Big Data.
Big Analytics delivers competitive advantage in two ways compared to the traditional

analytical model. First, Big Analytics describes the efficient use of a simple model applied to

volumes of data that would be too large for the traditional analytical environment. Research

suggests that a simple algorithm with a large volume of data is more accurate than a

sophisticated algorithm with little data. The algorithm is not the competitive advantage; the

ability to apply it to huge amounts of data—without compromising performance—generates

the competitive edge.

3.1 Challenges of Big Data Analytics:

For most organizations, big data analysis is a challenge. Consider the sheer volume of data and

the many different formats of the data (both structured and unstructured data) collected across

the entire organization and the many different ways different types of data can be combined,

contrasted and analyzed to find patterns and other useful information.

1. Meeting the need for speed: In today’s hypercompetitive business environment, companies

not only have to find and analyze the relevant data they need, they must find it quickly.

Visualization helps organizations perform analyses and make decisions much more rapidly, but

the challenge is going through the sheer volumes of data and accessing the level of detail

needed, all at a high speed. One possible solution is hardware. Some vendors are using

increased memory and powerful parallel processing to crunch large volumes of data extremely

quickly. Another method is to put data in memory and use a grid computing approach, where many machines are used to solve a problem.

2. Understanding the data: It takes a lot of understanding to get data in the right shape so that

you can use visualization as part of data analysis. For example, if the data comes from social

media content, you need to know who the user is in a general sense – such as a customer using
a particular set of products – and understand what it is you’re trying to visualize out of the data.

One solution to this challenge is to have the proper domain expertise in place. Make sure the

people analyzing the data have a deep understanding of where the data comes from, what

audience will be consuming the data and how that audience will interpret the information.

3. Addressing data quality: Even if you can find and analyze data quickly and put it in the

proper context for the audience that will be consuming the information, the value of data for

decision-making purposes will be jeopardized if the data is not accurate or timely. This is a

challenge with any data analysis, but when considering the volumes of information involved in

big data projects, it becomes even more pronounced. Again, data visualization will only prove

to be a valuable tool if the data quality is assured. To address this issue, companies need to

have a data governance or information management process in place to ensure the data is clean.

4. Displaying meaningful results: Plotting points on a graph for analysis becomes difficult

when dealing with extremely large amounts of information or a variety of categories of

information. For example, imagine you have 10 billion rows of retail SKU data that you’re

trying to compare. The user trying to view 10 billion plots on the screen will have a hard time

seeing so many data points. One way to resolve this is to cluster data into a higher-level view

where smaller groups of data become visible. By grouping the data together, or “binning,” you

can more effectively visualize the data.
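As a simple illustration of binning, the sketch below groups a list of numeric values into fixed-width buckets and counts the members of each bucket; the bucket width and sample values are arbitrary and purely for illustration.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class BinningExample {
    public static void main(String[] args) {
        // Example values to be binned (e.g. sales amounts); purely illustrative.
        List<Double> values = List.of(3.2, 7.9, 12.5, 14.1, 21.0, 22.7, 48.3);
        double binWidth = 10.0;

        // Map each value to the lower bound of its bin and count bin members.
        Map<Double, Long> histogram = values.stream()
                .collect(Collectors.groupingBy(
                        v -> Math.floor(v / binWidth) * binWidth,
                        TreeMap::new,
                        Collectors.counting()));

        // Plotting these few counts, rather than every raw point,
        // shows the shape of the data at a higher level.
        histogram.forEach((bin, count) ->
                System.out.printf("[%.1f, %.1f): %d%n", bin, bin + binWidth, count));
    }
}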


3.2 Objectives of Big Data Analytics:

1. Cost Reduction from Big Data Technologies: Some organizations pursuing big data

believe strongly that MIPS and terabyte storage for structured data are now most cheaply

delivered through big data technologies like Hadoop clusters. Organizations that were focused

on cost reduction made the decision to adopt big data tools primarily within the IT organization

on largely technical and economic criteria.

2. Time Reduction from Big Data: The second common objective of big data technologies

and solutions is time reduction. Many companies make use of big data analytics to generate

real-time decisions to save a lot of time. Another key objective involving time reduction is to

be able to interact with the customer in real time, using analytics and data derived from the

customer experience.

3. Developing New Big Data-Based Offerings: One of the most ambitious things an

organization can do with big data is to employ it in developing new product and service

offerings based on data. Many of the companies that employ this approach are online firms,

which have an obvious need to employ data-based products and services.


CHAPTER 4

APACHE HADOOP

Doug Cutting and Mike Cafarella helped create Apache Hadoop in 2005 out of necessity, as data from the web exploded and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by Google outlining its approach to handling an avalanche of data, and has since become the de facto standard for storing, processing and analyzing hundreds of terabytes, and even petabytes, of data.

Apache Hadoop is an open source distributed software platform for storing and processing

data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-

attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands

of servers while scaling performance cost-effectively by merely adding inexpensive nodes to

the cluster.

Apache Hadoop is 100% open source, and pioneered a fundamentally new way of storing and

processing data. Instead of relying on expensive, proprietary hardware and different systems to

store and process data, Hadoop enables distributed parallel processing of huge amounts of data

across inexpensive, industry-standard servers that both store and process the data, and can scale

without limits. With Hadoop, no data is too big. And in today’s hyper-connected world where

more and more data is being created every day, Hadoop’s breakthrough advantages mean that

businesses and organizations can now find value in data that was recently considered useless.

HDFS: Self-healing, high-bandwidth, clustered storage.

MapReduce: Distributed, fault-tolerant resource management, coupled with scalable data processing.

YARN : YARN is the architectural center of Hadoop that allows multiple data processing

engines such as interactive SQL, real-time streaming, data science and batch processing to

handle data stored in a single platform, unlocking an entirely new approach to analytics. It is

the foundation of the new generation of Hadoop and is enabling organizations everywhere to

realize a modern data architecture.

YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central

platform to deliver consistent operations, security, and data governance tools across the Hadoop cluster. HDFS is Hadoop's own rack-aware filesystem, which is a UNIX-based data storage

layer of Hadoop. HDFS is derived from concepts of Google filesystem. An important

characteristic of Hadoop is the partitioning of data and computation across many (thousands of)

hosts, and the execution of application computations in parallel, close to their data. On HDFS,

data files are replicated as sequences of blocks in the cluster. A Hadoop cluster scales

computation capacity, storage capacity, and I/O bandwidth by simply adding commodity

servers. HDFS can be accessed from applications in many different ways. Natively, HDFS

provides a Java API for applications to use.
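The snippet below is a small sketch of that Java API, writing a file into HDFS and reading it back through the org.apache.hadoop.fs.FileSystem abstraction; the NameNode address and file path are placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/user/demo/hello.txt");

        // Write a file; HDFS splits it into blocks and replicates them across nodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back as a stream.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}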

The Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of application data,

with the largest Hadoop cluster being 4,000 servers. Also, one hundred other organizations

worldwide are known to use Hadoop.

HDFS was designed to be a scalable, fault-tolerant, distributed storage system that works

closely with MapReduce. HDFS will “just work” under a variety of physical and systemic
circumstances. By distributing storage and computation across many servers, the combined

storage resource can grow with demand while remaining economical at every size.

HDFS supports parallel reading and writing and is optimized for streaming reads and writes; the bandwidth scales linearly with the number of nodes. HDFS provides a block replication factor, normally 3, i.e. every block is replicated three times on different nodes. This provides higher fault tolerance.
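As a small illustrative sketch (the file path and replication values are arbitrary), the replication factor can be set cluster-wide through configuration or per file through the same Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, equivalent to dfs.replication in hdfs-site.xml.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        // Override the replication factor for one existing file.
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 2);
        fs.close();
    }
}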

These specific features ensure that the Hadoop clusters are highly functional and highly

available:

1. Rack awareness allows consideration of a node's physical location when allocating storage and scheduling tasks

2. Minimal data motion. MapReduce moves compute processes to the data on HDFS and

not the other way around. Processing tasks can occur on the physical node where the

data resides.

3. Utilities diagnose the health of the file system and can rebalance the data across different nodes

4. Rollback allows system operators to bring back the previous version of HDFS after an

upgrade, in case of human or system errors

5. Standby NameNode provides redundancy and supports high availability

6. Highly operable. Hadoop handles different types of cluster failures that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of thousands of nodes.

CHAPTER 5

BIG DATA FRAMEWORKS

1. Apache Hadoop

Apache Hadoop is an open source, scalable and fault-tolerant framework written in Java. It is a processing framework that exclusively provides batch processing, and it efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but a platform for both storing and processing large volumes of data.

Modern versions of Hadoop are composed of several components or layers that work together

to process batch data. These are listed below.

1. HDFS (Hadoop Distributed File System): This is the distributed file system layer that

coordinates storage and replication across the cluster nodes. HDFS ensures that data

remains available in spite of inevitable host failures. It is used as the source of data, to

store intermediate processing results, and to persist the final calculated results.

2. YARN: This stands for Yet Another Resource Negotiator. It is the cluster coordinating

component of the Hadoop stack, and is responsible for coordinating and managing the

underlying resources and scheduling jobs that need to be run.

3. MapReduce: This is Hadoop’s native batch processing engine.
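To show how these layers fit together, here is a minimal sketch of a driver that submits the word-count mapper and reducer from the earlier sketch as a MapReduce job scheduled by YARN; the input and output paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // mapper from the earlier sketch
        job.setReducerClass(WordCountReducer.class);   // reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // HDFS input and output locations; placeholders for illustration.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}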

2. Apache Storm

Apache Storm is a stream processing framework that focuses on extremely low latency and is

perhaps the best option for workloads that require near real-time processing. It can handle very

large quantities of data and deliver results with less latency than other solutions. Storm is

simple, can be used with any programming language, and is also a lot of fun.
Storm has many use cases: real-time analytics, online machine learning, continuous

computation, distributed RPC, ETL, and more. It is fast—a benchmark clocked it at over a

million tuples processed per second per node. It is also scalable, fault-tolerant, guarantees your

data will be processed, and is easy to set up and operate.
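For a sense of Storm's programming model, here is a minimal topology sketch using Storm's Java API; SentenceSpout and WordCountBolt are hypothetical user-defined components named only for illustration.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        // Wire a spout (stream source) to a bolt (processing step).
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);   // hypothetical spout
        builder.setBolt("counts", new WordCountBolt(), 2)        // hypothetical bolt
               .shuffleGrouping("sentences");

        Config conf = new Config();
        conf.setDebug(false);

        // Run locally for testing; a real deployment would use StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", conf, builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}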

3. Apache Samza

Apache Samza is a stream processing framework that is tightly tied to the Apache Kafka

messaging system. While Kafka can be used by many stream processing systems, Samza is

designed specifically to take advantage of Kafka’s unique architecture and guarantees. It uses

Kafka to provide fault tolerance, buffering and state storage.

Samza uses YARN for resource negotiation. This means that, by default, a Hadoop cluster is

required (at least HDFS and YARN). It also means that Samza can rely on the rich features

built into YARN.

4. Apache Spark

Apache Spark is a general-purpose and lightning-fast cluster computing system. It provides high-level APIs in Java, Scala, Python and R, along with an engine for running Spark applications. It can be up to 100 times faster than Hadoop MapReduce when data is processed in memory, and about ten times faster when accessing data from disk. It can be integrated with Hadoop and can process existing Hadoop HDFS data.

Apache Spark is a next generation batch processing framework with stream processing

capabilities. Built using many of the same principles of Hadoop’s MapReduce engine, Spark

focuses primarily on speeding up batch processing workloads by offering full in-memory

computation and processing optimisation.
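A minimal sketch of Spark's Java API illustrates this in-memory model; the HDFS paths are placeholders, and the local master setting is only for testing.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] runs Spark in-process for testing; a cluster job would run on YARN.
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");   // placeholder path

        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);   // intermediate results stay in memory

        counts.saveAsTextFile("hdfs:///user/demo/output");                // placeholder path
        sc.stop();
    }
}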


5. Apache Flink

Apache Flink is an open source platform; it is a streaming data flow engine that provides

communication, fault tolerance and data distribution for distributed computations over data

streams. It is a scalable data analytics framework that is fully compatible with Hadoop. Flink

can execute both stream processing and batch processing easily.

While Spark performs batch and stream processing, its streaming is not appropriate for many

use cases because of its micro-batch architecture. Flink’s stream-first approach offers low

latency, high throughput, and real entry-by-entry processing.
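As a sketch of Flink's stream-first DataStream API in Java (the element values and job name are illustrative), counting words entry by entry as records arrive might look like this:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkStreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A tiny bounded source for illustration; a real job would read from a socket or Kafka.
        DataStream<String> text = env.fromElements("big data", "stream processing", "big streams");

        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new Tokenizer())        // split lines into (word, 1) pairs
                .keyBy(value -> value.f0)        // group by the word
                .sum(1);                         // running count, updated entry by entry

        counts.print();
        env.execute("Flink Streaming WordCount");
    }

    // Splits each line on whitespace and emits (word, 1) tuples.
    public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
                out.collect(new Tuple2<>(word, 1));
            }
        }
    }
}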
