
1. What are some examples of big data analytics?

 Social media,
 cloud applications,
 machine sensor data

2. Traditional data and Big data

 Traditional data is generated at the enterprise level; big data is generated both inside and outside the enterprise.
 Traditional data volume ranges from gigabytes to terabytes; big data volume ranges from petabytes to exabytes or even zettabytes.
 Traditional database systems deal with structured data; big data systems deal with structured, semi-structured, and unstructured data.
 Traditional data is generated per hour or per day; big data is generated far more frequently, often every second.
 Traditional data sources are centralized and managed in a centralized form; big data sources are distributed and managed in a distributed form.
 Integrating traditional data is very easy; integrating big data is very difficult.
 Traditional data is comparatively small in size; big data is far larger.

3. What are the three types of big data?

 Structured Data,
 Unstructured Data,
 Semi-Structured Data.

4. What are some of the challenges of working with big data?

 Sharing and Accessing Data


 Privacy and Security
 Analytical Challenges
 Quality of data
 Fault tolerance
 Scalability

5. Difference Between Reporting and Analytics.

 Reporting shows WHAT is happening; analytics explains WHY it is happening.
 Reporting involves organizing, formatting, and summarizing; analytics involves exploring, questioning, and interpreting.
 Reporting translates data into information; analytics translates data into recommendations that drive action.
 Reporting is defined and specific; analytics is full of possibilities and potential.
 Reporting results are pushed to users; with analytics, users pull results and interpret the answers to their questions.

6. Why do we use a Distributed File System?

The main purpose of the Distributed File System (DFS) is to allow users of physically
distributed systems to share their data and resources through a common file system. A
collection of workstations and mainframes connected by a Local Area Network (LAN) is a
typical configuration for a Distributed File System. A DFS is implemented as a part of the operating system.

7. Define a Large-Scale Distributed File System.

 To exploit cluster computing, files must look and behave somewhat differently from the
conventional file systems found on single computers
 Files can be enormous, possibly a terabyte in size. If you have only small files, there is no
point using a DFS for them.
 Files are rarely updated. Rather, they are read as data for some calculation, and possibly
additional data is appended to files from time to time.

8. What is HDFS and how it works?

The Hadoop Distributed File System (HDFS) is the storage system for Hadoop. It spreads data
out over multiple machines as a means to reduce cost and increase reliability.

HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a
file is split into one or more blocks and these blocks are stored in a set of DataNodes. The
NameNode executes file system namespace operations like opening, closing, and renaming files
and directories.
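To make the idea of block splitting concrete, here is a small, purely illustrative Python sketch (not actual HDFS code) that divides a file into fixed-size blocks the way HDFS hands them out to DataNodes; 128 MB is the default block size in Hadoop 2 and later, and the file name is a placeholder.

    BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual HDFS default block size

    def split_into_blocks(path, block_size=BLOCK_SIZE):
        """Yield (block_number, data) chunks, mimicking how HDFS splits a file."""
        with open(path, "rb") as f:
            block_number = 0
            while True:
                data = f.read(block_size)
                if not data:
                    break
                yield block_number, data
                block_number += 1

    # A 300 MB file would yield blocks 0, 1 and 2 (128 MB, 128 MB, 44 MB);
    # HDFS stores and replicates each such block on a set of DataNodes.
    for number, data in split_into_blocks("sample.txt"):  # placeholder file name
        print("block", number, ":", len(data), "bytes")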

9. What is MapReduce execution?

A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage − The map or mapper's job is to process the input data. Generally the input data is in
the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS).
Shuffle stage − The mapper output is sorted and grouped by key before being handed to the reducers.
Reduce stage − The reducer combines the grouped values for each key to produce the final output, which is written back to HDFS.
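As a rough sketch of these stages, the classic word-count example can be written as two small Python scripts in the Hadoop Streaming style (the script names and layout are just for the example): the mapper emits key-value pairs, the framework shuffles and sorts them by key, and the reducer sums the values for each key.

    # --- mapper.py (map stage): read raw lines, emit "word<TAB>1" pairs ---
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    # --- reducer.py (reduce stage): input arrives sorted by key after the
    # shuffle stage, so counts for the same word are adjacent and can be
    # summed in a single pass ---
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

Such scripts are typically submitted to the cluster with the Hadoop Streaming jar, which runs the mapper and reducer on the nodes where the data blocks reside.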

10. Define Matrix Vector Multiplication.

Matrix-vector multiplication is an operation between a matrix and a vector that produces a new
vector. Notably, matrix-vector multiplication is only defined between a matrix and a vector
where the length of the vector equals the number of columns of the matrix.
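For example, a minimal sketch in Python with NumPy (the matrix and vector values are arbitrary):

    import numpy as np

    # A 2x3 matrix times a length-3 vector gives a length-2 vector:
    # each output entry is the dot product of one matrix row with the vector.
    A = np.array([[1, 2, 3],
                  [4, 5, 6]])
    x = np.array([7, 8, 9])

    y = A @ x        # same as np.dot(A, x)
    print(y)         # [ 50 122]

In a MapReduce setting, the same computation is expressed by having each mapper emit (i, m_ij * v_j) pairs and having the reducer sum the values for each row index i.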

11.a) Describe the Evolution of analytic scalability.


Evolution of Analytical Scalability
The amount of data organizations must process continues to increase, so technologies such as
Massively Parallel Processing (MPP) and MapReduce are used.
History of Scalability
• 1900s − Analytics by manual computation
• 1970s − Calculators
• 1980s − Mainframes
• 2000s − Databases
• Today, the sources of big data generate terabytes to petabytes of data in hours, days, or weeks.
Convergence of the Analytic and Data Environments
In the traditional analytic architecture, all data is pulled together into a separate analytics
environment for analysis, and the heavy processing occurs in that analytic environment.

Analysts perform merge operations on data sets, which consist of rows and columns.

The columns represent information about the customers such as name, spending level, or status.

In merge or join, two or more data sets are combined together. They are typically merged/joined
so that specific rows of one data set or table are combined with specific rows of another.
Analysts also do data preparation. Data preparation is made up of joins, aggregations,
derivations, and transformations. In this process, they pull data from various sources and merge
it all together to create the variables required for analysis.
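As a small illustration of these join and aggregation steps, here is a sketch using pandas; the tables, column names (customer_id, amount, status, and so on), and values are invented for the example.

    import pandas as pd

    # Hypothetical customer attributes and transaction records.
    customers = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "name": ["Asha", "Ben", "Chitra"],
        "status": ["gold", "silver", "gold"],
    })
    transactions = pd.DataFrame({
        "customer_id": [1, 1, 2, 3, 3, 3],
        "amount": [120.0, 80.0, 45.0, 300.0, 25.0, 60.0],
    })

    # Join: combine specific rows of one table with matching rows of the other.
    merged = transactions.merge(customers, on="customer_id", how="inner")

    # Aggregation and derivation: build per-customer variables for analysis.
    features = (merged.groupby(["customer_id", "name", "status"], as_index=False)
                      .agg(total_spend=("amount", "sum"),
                           num_purchases=("amount", "count")))
    print(features)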

The Massively Parallel Processing (MPP) system is the most mature, proven, and widely
deployed mechanism for storing and analyzing large amounts of data.

An MPP database breaks the data into independent pieces managed by independent storage and
central processing unit (CPU) resources.

MPP systems build in redundancy to make recovery easy.

MPP systems have resource management tools that:

1. Manage CPU and disk space

2. Optimize queries (query optimizer)

b) What is Big Data analytics? Discuss Big Data analytic processes and tools.
Big Data analytics is a process used to extract meaningful insights, such as hidden patterns,
unknown correlations, market trends, and customer preferences. Big Data analytics provides
various advantages—it can be used for better decision making, preventing fraudulent activities,
among other things.
Big Data is a massive amount of data sets that cannot be stored, processed, or analyzed using
traditional tools.
There are millions of data sources that generate data at a very rapid rate. These data sources are
present across the world. Some of the largest sources of data are social media platforms and
networks. This data includes pictures, videos, messages, and more.

Data also exists in different formats, like structured data, semi-structured data, and unstructured
data. For example, in a regular Excel sheet, data is classified as structured data—with a definite
format. All this data combined makes up Big Data.

Uses and Examples of Big Data Analytics

There are many different ways that Big Data analytics can be used in order to improve
businesses and organizations. Here are some examples:

 Using analytics to understand customer behavior in order to optimize the customer experience

 Predicting future trends in order to make better business decisions

 Improving marketing campaigns by understanding what works and what doesn't

 Increasing operational efficiency by understanding where bottlenecks are and how to fix them

 Detecting fraud and other forms of misuse sooner

These are just a few examples — the possibilities are really endless when it comes to
Big Data analytics. It all depends on how you want to use it in order to improve your business.

Benefits and Advantages of Big Data Analytics

1. Risk Management: Organizations leverage Big Data analytics to narrow down a list of suspects or the root causes of problems.

2. Product Development and Innovations: Manufacturers use Big Data analytics to analyze how efficient their designs (for example, engine designs) are and whether any improvements are needed.

3. Quicker and Better Decision Making Within Organizations: Companies analyze several different factors, such as population, demographics, and accessibility of a location, before making decisions such as where to open a new outlet.

4. Improve Customer Experience: For example, an airline can identify negative tweets and do what is necessary to remedy the situation; publicly addressing these issues and offering solutions helps the airline build good customer relations.
Different Types of Big Data Analytics

1. Descriptive Analytics: This summarizes past data into a form that people can easily read. This
helps in creating reports, like a company’s revenue, profit, sales, and so on. Also, it helps in the
tabulation of social media metrics.

2. Diagnostic Analytics: This is done to understand what caused a problem in the first place.
Techniques like drill-down, data mining, and data recovery are all examples. Organizations use
diagnostic analytics because they provide an in-depth insight into a particular problem.

3. Predictive Analytics: This type of analytics looks into the historical and present data to make
predictions of the future. Predictive analytics uses data mining, AI, and machine learning to
analyze current data and make predictions about the future. It works on predicting customer
trends, market trends, and so on.

4. Prescriptive Analytics: This type of analytics prescribes the solution to a particular problem.

Prescriptive analytics works with both descriptive and predictive analytics. Most of the time, it
relies on AI and machine learning.

Big Data Analytics Tools

Here are some of the key Big Data analytics tools:

 Hadoop - helps in storing and analyzing data

 MongoDB - used on datasets that change frequently

 Talend - used for data integration and management

 Cassandra - a distributed database used to handle chunks of data

 Spark - used for real-time processing and analyzing large amounts of data

 STORM - an open-source real-time computational system

 Kafka - a distributed streaming platform that is used for fault-tolerant storage
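As an illustration of the kind of processing Spark is used for, here is a minimal PySpark word-count sketch; the input path is a placeholder.

    from pyspark.sql import SparkSession

    # Start (or reuse) a Spark session; on a cluster this work is distributed
    # across many nodes.
    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Read a text file (placeholder path), split lines into words,
    # and count the occurrences of each word in parallel.
    lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    for word, count in counts.take(10):
        print(word, count)

    spark.stop()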

b) What is Big Data Analytics and why is it important?

What is Big Data Analytics?


Big Data analytics is a process used to extract meaningful insights, such as hidden patterns,
unknown correlations, market trends, and customer preferences. Big Data analytics provides
various advantages—it can be used for better decision making, preventing fraudulent activities,
among other things.

What is Big Data?

Big Data is a massive amount of data sets that cannot be stored, processed, or analyzed using
traditional tools.

Today, there are millions of data sources that generate data at a very rapid rate. These data
sources are present across the world. Some of the largest sources of data are social media
platforms and networks. Let’s use Facebook as an example—it generates more than 500
terabytes of data every day. This data includes pictures, videos, messages, and more.

Data also exists in different formats, like structured data, semi-structured data, and unstructured
data. For example, in a regular Excel sheet, data is classified as structured data—with a definite
format. In contrast, emails fall under semi-structured, and your pictures and videos fall under
unstructured data. All this data combined makes up Big Data.

Uses and Examples of Big Data Analytics

There are many different ways that Big Data analytics can be used in order to improve
businesses and organizations. Here are some examples:

 Using analytics to understand customer behavior in order to optimize the customer experience

 Predicting future trends in order to make better business decisions

 Improving marketing campaigns by understanding what works and what doesn't

 Increasing operational efficiency by understanding where bottlenecks are and how to fix them

 Detecting fraud and other forms of misuse sooner

These are just a few examples — the possibilities are really endless when it comes to Big Data
analytics. It all depends on how you want to use it in order to improve your business.

History of Big Data Analytics


The history of Big Data analytics can be traced back to the early days of computing, when
organizations first began using computers to store and analyze large amounts of data. However,
it was not until the late 1990s and early 2000s that Big Data analytics really began to take off, as
organizations increasingly turned to computers to help them make sense of the rapidly growing
volumes of data being generated by their businesses.

Today, Big Data analytics has become an essential tool for organizations of all sizes across a
wide range of industries. By harnessing the power of Big Data, organizations are able to gain
insights into their customers, their businesses, and the world around them that were simply not
possible before.

As the field of Big Data analytics continues to evolve, we can expect to see even more amazing
and transformative applications of this technology in the years to come.

Benefits and Advantages of Big Data Analytics

1. Risk Management

Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify
fraudulent activities and discrepancies. The organization leverages it to narrow down a list of
suspects or root causes of problems.

2. Product Development and Innovations

Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and armed
forces across the globe, uses Big Data analytics to analyze how efficient the engine designs are
and if there is any need for improvements.

3. Quicker and Better Decision Making Within Organizations

Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example, the
company leverages it to decide if a particular location would be suitable for a new outlet or not.
They will analyze several different factors, such as population, demographics, accessibility of the
location, and more.

4. Improve Customer Experience

Use Case: Delta Air Lines uses Big Data analysis to improve customer experiences. They
monitor tweets to find out their customers’ experience regarding their journeys, delays, and so
on. The airline identifies negative tweets and does what’s necessary to remedy the situation. By
publicly addressing these issues and offering solutions, it helps the airline build good customer
relations.

The Lifecycle Phases of Big Data Analytics

Now, let’s review how Big Data analytics works:

 Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a business
case, which defines the reason and goal behind the analysis.

 Stage 2 - Identification of data - Here, a broad variety of data sources are identified.

 Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to
remove corrupt data.

 Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then
transformed into a compatible form.

 Stage 5 - Data aggregation - In this stage, data with the same fields across different datasets
are integrated.

 Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to discover
useful information.

 Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data
analysts can produce graphic visualizations of the analysis.

 Stage 8 - Final analysis result - This is the last step of the Big Data analytics lifecycle, where
the final results of the analysis are made available to business stakeholders who will take
action.
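To make a few of these stages concrete, here is a short, hypothetical pandas sketch (the datasets and column names are invented) covering data filtering (stage 3), aggregation of datasets with the same fields (stage 5), and a simple analysis (stage 6).

    import pandas as pd

    # Hypothetical order data pulled from two sources (stage 2), with some corrupt rows.
    source_a = pd.DataFrame({"order_id": [1, 2, 3],
                             "region": ["north", "south", None],
                             "amount": [100.0, 250.0, None]})
    source_b = pd.DataFrame({"order_id": [4, 5],
                             "region": ["south", "north"],
                             "amount": [80.0, 120.0]})

    # Stage 3 - data filtering: drop corrupt or incomplete records.
    clean_a = source_a.dropna()

    # Stage 5 - data aggregation: integrate datasets that share the same fields.
    orders = pd.concat([clean_a, source_b], ignore_index=True)

    # Stage 6 - data analysis: a simple statistical summary per region.
    summary = orders.groupby("region")["amount"].agg(["count", "sum", "mean"])
    print(summary)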

Different Types of Big Data Analytics

Here are the four types of Big Data analytics:

1. Descriptive Analytics

This summarizes past data into a form that people can easily read. This helps in creating reports,
like a company’s revenue, profit, sales, and so on. Also, it helps in the tabulation of social media
metrics.

Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization
across its office and lab space. Using descriptive analytics, Dow was able to identify
underutilized space. This space consolidation helped the company save nearly US $4 million
annually.

2. Diagnostic Analytics

This is done to understand what caused a problem in the first place. Techniques like drill-down,
data mining, and data recovery are all examples. Organizations use diagnostic analytics
because they provide an in-depth insight into a particular problem.

Use Case: An e-commerce company’s report shows that their sales have gone down, although
customers are adding products to their carts. This can be due to various reasons like the form
didn’t load correctly, the shipping fee is too high, or there are not enough payment options
available. This is where you can use diagnostic analytics to find the reason.

3. Predictive Analytics

This type of analytics looks into the historical and present data to make predictions of the future.
Predictive analytics uses data mining, AI, and machine learning to analyze current data and make
predictions about the future. It works on predicting customer trends, market trends, and so on.

Use Case: PayPal determines what kind of precautions they have to take to protect their clients
against fraudulent transactions. Using predictive analytics, the company uses all the historical
payment data and user behavior data and builds an algorithm that predicts fraudulent activities.

4. Prescriptive Analytics

This type of analytics prescribes the solution to a particular problem. Prescriptive analytics works
with both descriptive and predictive analytics. Most of the time, it relies on AI and machine
learning.

Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type of
analytics is used to build an algorithm that will automatically adjust the flight fares based on
numerous factors, including customer demand, weather, destination, holiday seasons, and oil
prices.

Big Data Analytics Tools


Here are some of the key Big Data analytics tools:

 Hadoop - helps in storing and analyzing data

 MongoDB - used on datasets that change frequently

 Talend - used for data integration and management

 Cassandra - a distributed database used to handle chunks of data

 Spark - used for real-time processing and analyzing large amounts of data

 STORM - an open-source real-time computational system

 Kafka - a distributed streaming platform that is used for fault-tolerant storage

Big Data Industry Applications

Here are some of the sectors where Big Data is actively used:

 Ecommerce - predicting customer trends
 Marketing
 Education
 Healthcare
 Media and entertainment
 Banking
 Telecommunications
 Government

12.a) How does MapReduce work in big data?

MapReduce is a programming model for writing applications that can process Big Data in
parallel on multiple nodes. MapReduce provides analytical capabilities for analyzing huge
volumes of complex data.
Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. Big Data is not only about scale and volume; it also involves one or more of the
following aspects − Velocity, Variety, Volume, and Complexity.
Why MapReduce?

Traditional Enterprise Systems normally have a centralized server to store and process data. The
following illustration depicts a schematic view of a traditional enterprise system. The traditional
model is certainly not suitable for processing huge volumes of scalable data, which cannot be
accommodated by standard database servers. Moreover, the centralized system creates too much
of a bottleneck while processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers. Later, the results are collected at one
place and integrated to form the result dataset.

How Does MapReduce Work?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their significance.
 Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.
 Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar data from the map
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
applies a user-defined code to aggregate the values in a small scope of one mapper. It is
not a part of the main MapReduce algorithm; it is optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups the
equivalent keys together so that their values can be iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
 Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record
writer.
Let us try to understand the two tasks, Map and Reduce, with the help of a small diagram.
MapReduce Example

Let us take a real-world example to comprehend the power of MapReduce. Twitter receives
around 500 million tweets per day, which is nearly 3000 tweets per second. The following
illustration shows how Twitter manages its tweets with the help of MapReduce.

As shown in the illustration, the MapReduce algorithm performs the following actions −
 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key-value pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.
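A small in-memory Python sketch of this pipeline (the sample tweets and the stop-word list are invented for the illustration):

    from collections import Counter

    # Toy input standing in for a stream of tweets.
    tweets = [
        "big data is big",
        "mapreduce makes big data processing simple",
        "data data everywhere",
    ]
    stop_words = {"is", "makes"}  # unwanted words to filter out

    # Tokenize: split each tweet into (token, 1) key-value pairs.
    pairs = [(word, 1) for tweet in tweets for word in tweet.split()]

    # Filter: drop the unwanted words.
    pairs = [(word, one) for word, one in pairs if word not in stop_words]

    # Count: generate a token counter per word.
    counts = Counter()
    for word, one in pairs:
        counts[word] += one

    # Aggregate counters: reduce the counts to a small, manageable summary.
    print(counts.most_common(3))  # e.g. [('data', 4), ('big', 3), ...]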
The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class


 The reduce task is done by means of Reducer Class.
The Mapper class takes the input, tokenizes it, maps it, and sorts it. The output of the Mapper class is used
as input by the Reducer class, which in turn searches for matching pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the
Map & Reduce tasks to appropriate servers in a cluster.

b) Explain Hadoop YARN architecture and types

HADOOP YARN:

Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management
technology. YARN is one of the key features in the second-generation Hadoop 2 version of the
Apache Software Foundation's open source distributed processing framework. Originally
described by Apache as a redesigned resource manager, YARN is now characterized as a large-
scale, distributed operating system for big data applications.

Sometimes called MapReduce 2.0, YARN is a software rewrite that decouples
MapReduce's resource management and scheduling capabilities from the data processing
component, enabling Hadoop to support more varied processing approaches and a broader array
of applications. For example, Hadoop clusters can now run interactive querying and streaming
data applications simultaneously with MapReduce batch jobs.

YARN combines a central resource manager that reconciles the way applications use Hadoop
system resources with node manager agents that monitor the processing operations of individual
cluster nodes. Running on commodity hardware clusters, Hadoop has attracted particular interest
as a staging area and data store for large volumes of structured and unstructured data intended for
use in analytics applications. Separating HDFS from MapReduce with YARN makes the Hadoop
environment more suitable for operational applications that can't wait for batch jobs to finish.
Apache Hadoop Yarn – Concepts & Applications:
As previously described, YARN is essentially a system for managing distributed
applications. It consists of a central ResourceManager, which arbitrates all available cluster
resources, and a per-node NodeManager, which takes direction from the ResourceManager and
is responsible for managing resources available on a single node.
Resource Manager
In YARN, the ResourceManager is, primarily, a pure scheduler. In essence, it’s strictly
limited to arbitrating available resources in the system among the competing applications –
a market maker if you will. It optimizes for cluster utilization against various constraints such as
capacity guarantees, fairness, and SLAs. To allow for different policy constraints the
ResourceManager has a pluggable scheduler that allows for different algorithms such as capacity
and fair scheduling to be used as necessary.
ApplicationMaster
The ApplicationMaster is, in effect, an instance of a framework-specific library and is
responsible for negotiating resources from the ResourceManager and working with the
NodeManager(s) to execute and monitor the containers and their resource consumption. It has
the responsibility of negotiating appropriate resource containers from the ResourceManager,
tracking their status and monitoring progress.
The ApplicationMaster allows YARN to exhibit the following key characteristics:

 Scale: The Application Master provides much of the functionality of the traditional
ResourceManager so that the entire system can scale more dramatically. In tests, we’ve already
successfully simulated 10,000 node clusters composed of modern hardware without significant
issue. This is one of the key reasons that we have chosen to design the ResourceManager as
a pure scheduler i.e. it doesn’t attempt to provide fault-tolerance for resources. We shifted that to
become a primary responsibility of the ApplicationMaster instance. Furthermore, since there is
an instance of an ApplicationMaster per application, the ApplicationMaster itself isn’t a common
bottleneck in the cluster.
 Open: Moving all application framework specific code into the ApplicationMaster generalizes
the system so that we can now support multiple frameworks such as MapReduce, MPI and Graph
Processing.

Resource Model
YARN supports a very general resource model for applications. An application can request
resources with highly specific requirements such as:
 Resource-name (hostname, rackname – we are in the process of generalizing this further
to support more complex network topologies with YARN-18).
 Memory (in MB)

 CPU (cores, for now)

 In future, expect us to add more resource-types such as disk/network I/O, GPUs etc.
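As a purely illustrative toy sketch (this is not YARN's real API; actual requests go through the ApplicationMaster protocol), a resource request along these lines and a naive allocation could be modelled like this:

    # Toy model of a YARN-style resource request (illustrative only).
    request = {
        "resource_name": "*",   # any host or rack
        "memory_mb": 2048,
        "vcores": 1,
        "num_containers": 4,
    }

    # Free capacity reported by the NodeManagers (invented numbers).
    nodes = {
        "node1": {"memory_mb": 8192, "vcores": 4},
        "node2": {"memory_mb": 4096, "vcores": 2},
    }

    # Naive first-fit allocation: place each container on the first node
    # that still has enough free memory and cores.
    allocations = []
    for _ in range(request["num_containers"]):
        for name, free in nodes.items():
            if (free["memory_mb"] >= request["memory_mb"]
                    and free["vcores"] >= request["vcores"]):
                free["memory_mb"] -= request["memory_mb"]
                free["vcores"] -= request["vcores"]
                allocations.append(name)
                break

    print(allocations)  # with these numbers: ['node1', 'node1', 'node1', 'node1']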

Hadoop YARN architecture:

Fig: Hadoop YARN Architecture

 In classic Hadoop MapReduce (before YARN), the JobTracker is responsible for resource management (managing the worker nodes, i.e.
TaskTrackers), tracking resource consumption/availability and also job life-cycle
management (scheduling individual tasks of the job, tracking progress, providing fault-
tolerance for tasks etc).
 The TaskTracker has simple responsibilities – launch/teardown tasks on orders from the
JobTracker and provide task-status information to the JobTracker periodically.

How Yarn Works

YARN’s original purpose was to split up the two major responsibilities of the
JobTracker/TaskTracker into separate entities:

 a global ResourceManager

 a per-application ApplicationMaster

 a per-node slave NodeManager

 a per-application Container running on a NodeManager

The ResourceManager and the NodeManager formed the new generic system for
managing applications in a distributed manner. The ResourceManager is the ultimate authority
that arbitrates resources among all applications in the system. The ApplicationMaster is a
framework-specific entity that negotiates resources from the ResourceManager and works with
the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a scheduler, which is responsible for allocating resources to
the various applications running in the cluster, according to constraints such as queue capacities
and user limits. The scheduler schedules based on the resource requirements of each application.

Each ApplicationMaster has responsibility for negotiating appropriate resource containers
from the scheduler, tracking their status, and monitoring their progress. From the system
perspective, the ApplicationMaster runs as a normal container.

The NodeManager is the per-machine slave, which is responsible for launching the
applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and
reporting the same to the ResourceManager.
