Big Data comes from sources such as professional documents, culture documents, social media, cloud applications, and machine sensor data. It falls into three categories: structured data, unstructured data, and semi-structured data.
Reporting vs. Analytics:
Reporting shows WHAT is happening; analytics explains WHY it is happening.
Reporting involves organizing, formatting, and summarizing; analytics involves exploring, questioning, and interpreting.
Reporting translates data into information; analytics translates data into recommendations to drive action.
Reporting is defined and specific; analytics is full of possibilities and potential.
Reporting results are pushed to users; analytics users need to pull results and interpret the answers to their questions.
The main purpose of a Distributed File System (DFS) is to allow users of physically distributed systems to share their data and resources by using a common file system. A collection of workstations and mainframes connected by a Local Area Network (LAN) is a typical configuration of a Distributed File System. A DFS is implemented as a part of the operating system. To exploit cluster computing, files must look and behave somewhat differently from the conventional file systems found on single computers:
Files can be enormous, possibly a terabyte in size. If you have only small files, there is no point in using a DFS for them.
Files are rarely updated. Rather, they are read as data for some calculation, and possibly
additional data is appended to files from time to time.
Hadoop Distributed File System (HDFS): the storage system for Hadoop spread out over
multiple machines as a means to reduce cost and increase reliability.
HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a
file is split into one or more blocks and these blocks are stored in a set of DataNodes. The
NameNode executes file system namespace operations like opening, closing, and renaming files
and directories.
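The block-based layout described above can be illustrated with a small sketch. The tiny block size and round-robin placement here are simplified assumptions for illustration, not HDFS's actual implementation (HDFS defaults to 128 MB blocks and replicates each block):

```python
BLOCK_SIZE = 4  # bytes; tiny for illustration (HDFS defaults to 128 MB)

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Split a byte string into fixed-size blocks (the last block may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, datanodes):
    """Assign each block to a DataNode round-robin (real HDFS also replicates)."""
    return {i: datanodes[i % len(datanodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"hello world!")
placement = place_blocks(blocks, ["dn1", "dn2", "dn3"])
print(blocks)      # [b'hell', b'o wo', b'rld!']
print(placement)   # {0: 'dn1', 1: 'dn2', 2: 'dn3'}
```

The NameNode's role corresponds to keeping the `placement` map: it records which DataNodes hold which blocks, while the DataNodes hold the block contents themselves.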
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage − The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS).
Matrix-vector multiplication is an operation between a matrix and a vector that produces a new
vector. Notably, matrix-vector multiplication is only defined between a matrix and a vector
where the length of the vector equals the number of columns of the matrix.
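As a minimal sketch of this definition in plain Python (no libraries assumed), with the dimension check made explicit:

```python
def mat_vec(matrix, vector):
    """Multiply an m-by-n matrix (a list of rows) by a length-n vector."""
    n_cols = len(matrix[0])
    # The product is only defined when the vector length equals the column count.
    assert len(vector) == n_cols, "vector length must equal number of matrix columns"
    # Each output entry is the dot product of one matrix row with the vector.
    return [sum(row[j] * vector[j] for j in range(n_cols)) for row in matrix]

A = [[1, 2],
     [3, 4],
     [5, 6]]       # a 3x2 matrix
x = [10, 1]        # a length-2 vector
print(mat_vec(A, x))  # [12, 34, 56]
```

Note that the result has one entry per matrix row, so a 3x2 matrix times a length-2 vector yields a length-3 vector.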
Analysts perform the merge operation on data sets that contain rows and columns. The columns represent information about the customers, such as name, spending level, or status. In a merge or join, two or more data sets are combined, typically so that specific rows of one data set or table are combined with specific rows of another.
Analysts also do data preparation. Data preparation is made up of joins, aggregations,
derivations, and transformations. In this process, they pull data from various sources and merge
it all together to create the variables required for analysis.
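A minimal sketch of such a join plus a derived variable, using plain Python dictionaries (the customer fields, key names, and the spending threshold are hypothetical examples):

```python
customers = [
    {"id": 1, "name": "Ana", "status": "gold"},
    {"id": 2, "name": "Ben", "status": "silver"},
]
spending = [
    {"id": 1, "total_spend": 1200},
    {"id": 2, "total_spend": 300},
]

# Join: combine rows of the two data sets that share the same "id" key.
spend_by_id = {row["id"]: row for row in spending}
merged = [
    {**cust, **spend_by_id[cust["id"]]}
    for cust in customers
    if cust["id"] in spend_by_id
]

# Derivation: create a new variable required for analysis from existing columns.
for row in merged:
    row["high_spender"] = row["total_spend"] > 1000

print(merged[0]["high_spender"])  # True
```

In practice an analyst would do the same thing with a tool such as a SQL JOIN or a data-frame merge; the dictionary lookup above is the same idea in miniature.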
A Massively Parallel Processing (MPP) system is the most mature, proven, and widely deployed mechanism for storing and analyzing large amounts of data.
An MPP database breaks the data into independent pieces managed by independent storage and
central processing unit (CPU) resources.
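The idea of breaking data into independent pieces can be sketched as a simple hash-partitioning scheme (a simplified illustration; real MPP databases use more sophisticated distribution strategies):

```python
def partition(rows, key, n_workers):
    """Split rows into independent pieces, one per worker, by hashing a key.

    A simplified sketch of MPP-style distribution, not any vendor's actual scheme.
    """
    pieces = [[] for _ in range(n_workers)]
    for row in rows:
        pieces[hash(row[key]) % n_workers].append(row)
    return pieces

rows = [{"id": i, "value": i * 10} for i in range(6)]
pieces = partition(rows, "id", 3)
# Each worker scans only its own piece, so the pieces can be processed in parallel.
print([len(p) for p in pieces])  # [2, 2, 2]
```

Because every row lands in exactly one piece, each CPU-and-storage pair can work on its share without coordinating with the others.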
b) What is Big Data analytics? Discuss Big Data analytics processes and tools.
Big Data analytics is the process of extracting meaningful insights from data, such as hidden patterns, unknown correlations, market trends, and customer preferences. Big Data analytics provides various advantages: it can be used for better decision making and for preventing fraudulent activities, among other things.
Big Data refers to massive data sets that cannot be stored, processed, or analyzed using traditional tools.
Today, there are millions of data sources that generate data at a very rapid rate. These data
sources are present across the world. Some of the largest sources of data are social media
platforms and networks. Let’s use Facebook as an example—it generates more than 500
terabytes of data every day. This data includes pictures, videos, messages, and more.
Data also exists in different formats, like structured data, semi-structured data, and unstructured
data. For example, in a regular Excel sheet, data is classified as structured data—with a definite
format. In contrast, emails fall under semi-structured, and your pictures and videos fall under
unstructured data. All this data combined makes up Big Data.
There are many different ways that Big Data analytics can be used in order to improve
businesses and organizations. Here are some examples:
Using analytics to understand customer behavior in order to optimize the customer experience
Increasing operational efficiency by understanding where bottlenecks are and how to fix them
These are just a few examples — the possibilities are really endless when it comes to Big Data
analytics. It all depends on how you want to use it in order to improve your business.
Today, Big Data analytics has become an essential tool for organizations of all sizes across a
wide range of industries. By harnessing the power of Big Data, organizations are able to gain
insights into their customers, their businesses, and the world around them that were simply not
possible before.
As the field of Big Data analytics continues to evolve, we can expect to see even more amazing
and transformative applications of this technology in the years to come.
1. Risk Management
Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify fraudulent activities and discrepancies. The organization leverages it to narrow down a list of suspects or root causes of problems.
2. Product Development and Innovations
Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines and armed forces across the globe, uses Big Data analytics to analyze how efficient its engine designs are and whether there is any need for improvements.
3. Quicker and Better Decision Making Within Organizations
Use Case: Starbucks uses Big Data analytics to make strategic decisions. For example, the company leverages it to decide whether a particular location would be suitable for a new outlet. It analyzes several factors, such as population, demographics, and accessibility of the location.
4. Improve Customer Experience
Use Case: Delta Air Lines uses Big Data analytics to improve customer experiences. It monitors tweets to learn about its customers' experiences regarding journeys, delays, and so on. The airline identifies negative tweets and does what's necessary to remedy the situation. By publicly addressing these issues and offering solutions, the airline builds good customer relations.
Stage 1 - Business case evaluation - The Big Data analytics lifecycle begins with a business
case, which defines the reason and goal behind the analysis.
Stage 2 - Identification of data - Here, a broad variety of data sources are identified.
Stage 3 - Data filtering - All of the identified data from the previous stage is filtered here to
remove corrupt data.
Stage 4 - Data extraction - Data that is not compatible with the tool is extracted and then
transformed into a compatible form.
Stage 5 - Data aggregation - In this stage, data with the same fields across different datasets
are integrated.
Stage 6 - Data analysis - Data is evaluated using analytical and statistical tools to discover
useful information.
Stage 7 - Visualization of data - With tools like Tableau, Power BI, and QlikView, Big Data
analysts can produce graphic visualizations of the analysis.
Stage 8 - Final analysis result - This is the last step of the Big Data analytics lifecycle, where
the final results of the analysis are made available to business stakeholders who will take
action.
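Stages 3 through 6 of the lifecycle above can be sketched as a tiny pipeline. The record layout and the corrupt-record rule (a missing sales value) are hypothetical illustrations:

```python
records = [
    {"region": "north", "sales": 120},
    {"region": "north", "sales": None},   # corrupt record: missing value
    {"region": "south", "sales": 80},
    {"region": "north", "sales": 40},
]

# Stage 3 - Data filtering: remove corrupt records.
clean = [r for r in records if r["sales"] is not None]

# Stage 5 - Data aggregation: integrate records sharing the same field.
totals = {}
for r in clean:
    totals[r["region"]] = totals.get(r["region"], 0) + r["sales"]

# Stage 6 - Data analysis: a simple statistic over the aggregated data.
best_region = max(totals, key=totals.get)
print(totals)       # {'north': 160, 'south': 80}
print(best_region)  # north
```

The visualization and final-result stages would then present `totals` and `best_region` to stakeholders through a tool such as Tableau or Power BI.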
1. Descriptive Analytics
This summarizes past data into a form that people can easily read. This helps in creating reports,
like a company’s revenue, profit, sales, and so on. Also, it helps in the tabulation of social media
metrics.
Use Case: The Dow Chemical Company analyzed its past data to increase facility utilization
across its office and lab space. Using descriptive analytics, Dow was able to identify
underutilized space. This space consolidation helped the company save nearly US $4 million
annually.
2. Diagnostic Analytics
This is done to understand what caused a problem in the first place. Techniques like drill-down, data mining, and data recovery are all examples. Organizations use diagnostic analytics because it provides in-depth insight into a particular problem.
Use Case: An e-commerce company’s report shows that their sales have gone down, although
customers are adding products to their carts. This can be due to various reasons like the form
didn’t load correctly, the shipping fee is too high, or there are not enough payment options
available. This is where you can use diagnostic analytics to find the reason.
3. Predictive Analytics
This type of analytics uses historical and present data to make predictions about the future. Predictive analytics applies data mining, AI, and machine learning to analyze current data and forecast what may happen. It is used to predict customer trends, market trends, and so on.
Use Case: PayPal determines what kind of precautions they have to take to protect their clients
against fraudulent transactions. Using predictive analytics, the company uses all the historical
payment data and user behavior data and builds an algorithm that predicts fraudulent activities.
4. Prescriptive Analytics
This type of analytics prescribes the solution to a particular problem. Prescriptive analytics works with both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning.
Use Case: Prescriptive analytics can be used to maximize an airline’s profit. This type of
analytics is used to build an algorithm that will automatically adjust the flight fares based on
numerous factors, including customer demand, weather, destination, holiday seasons, and oil
prices.
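A toy sketch of such rule-based fare adjustment follows. All factors, weights, and the base fare are hypothetical; a real prescriptive system would learn these from data rather than hard-code them:

```python
def adjust_fare(base_fare, demand, oil_price_index, holiday_season):
    """Prescribe a fare from a base price and a few illustrative factors."""
    fare = base_fare
    fare *= 1.0 + 0.5 * demand                    # demand in [0, 1]: up to +50%
    fare *= 1.0 + 0.2 * (oil_price_index - 1.0)   # fuel-cost pass-through
    if holiday_season:
        fare *= 1.10                              # holiday premium
    return round(fare, 2)

print(adjust_fare(100.0, demand=0.8, oil_price_index=1.0, holiday_season=True))  # 154.0
```

The prescriptive part is that the output is an action (charge this fare), not just a description or a forecast.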
Among Big Data analytics tools, Spark is used for real-time processing and analyzing large amounts of data.
Here are some of the sectors where Big Data is actively used:
Marketing
Education
Healthcare
Banking
Telecommunications
Government
MapReduce is a programming model for writing applications that can process Big Data in
parallel on multiple nodes. MapReduce provides analytical capabilities for analyzing huge
volumes of complex data.
Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. Big Data is not only about scale and volume; it also involves one or more of the following aspects − velocity, variety, volume, and complexity.
Why MapReduce?
Traditional enterprise systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of a traditional enterprise system. This traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers. Later, the results are collected at one
place and integrated to form the result dataset.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data
tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their significance.
Input Phase − Here we have a Record Reader that translates each record in an input file
and sends the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and
processes each one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as
intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map
phase into identifiable sets. It takes the intermediate keys from the mapper as input and
applies a user-defined code to aggregate the values in a small scope of one mapper. It is
not a part of the main MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads
the grouped key-value pairs onto the local machine, where the Reducer is running. The
individual key-value pairs are sorted by key into a larger data list. The data list groups the
equivalent keys together so that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a
Reducer function on each one of them. Here, the data can be aggregated, filtered, and
combined in a number of ways, and it requires a wide range of processing. Once the
execution is over, it gives zero or more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final
key-value pairs from the Reducer function and writes them onto a file using a record
writer.
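The phases above can be demonstrated in miniature with a word-count example. Plain Python stands in for a real MapReduce framework here, and the shuffle-and-sort step is simulated with a sort-and-group over the intermediate pairs:

```python
from itertools import groupby

def mapper(record):
    """Map: emit a (word, 1) key-value pair for every word in the record."""
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    """Reduce: combine the values for one key into a smaller result."""
    return (key, sum(values))

records = ["big data is big", "data is everywhere"]

# Map phase: run the mapper over every input record.
pairs = [kv for r in records for kv in mapper(r)]

# Shuffle and sort: group the intermediate key-value pairs by key.
pairs.sort(key=lambda kv: kv[0])
grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=lambda kv: kv[0])]

# Reduce phase: apply the reducer to each group of equivalent keys.
counts = dict(reducer(k, vs) for k, vs in grouped)
print(counts)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

In real Hadoop the mapper and reducer run on different nodes and the framework performs the shuffle over the network, but the data flow is exactly this: map, group by key, reduce.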
Let us try to understand the two tasks Map & Reduce with the help of a small diagram −
MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 6,000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as
key-value pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small
manageable units.
MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the Map and Reduce tasks to appropriate servers in a cluster.
HADOOP YARN:
YARN combines a central resource manager that reconciles the way applications use Hadoop
system resources with node manager agents that monitor the processing operations of individual
cluster nodes. Running on commodity hardware clusters, Hadoop has attracted particular interest
as a staging area and data store for large volumes of structured and unstructured data intended for
use in analytics applications. Separating HDFS from MapReduce with YARN makes the Hadoop
environment more suitable for operational applications that can't wait for batch jobs to finish.
Apache Hadoop Yarn – Concepts & Applications:
As previously described, YARN is essentially a system for managing distributed
applications. It consists of a central ResourceManager, which arbitrates all available cluster
resources, and a per-node NodeManager, which takes direction from the ResourceManager and
is responsible for managing resources available on a single node.
Resource Manager
In YARN, the ResourceManager is, primarily, a pure scheduler. In essence, it’s strictly
limited to arbitrating available resources in the system among the competing applications –
a market maker if you will. It optimizes for cluster utilization against various constraints such as
capacity guarantees, fairness, and SLAs. To allow for different policy constraints the
ResourceManager has a pluggable scheduler that allows for different algorithms such as capacity
and fair scheduling to be used as necessary.
ApplicationMaster
The ApplicationMaster is, in effect, an instance of a framework-specific library and is
responsible for negotiating resources from the ResourceManager and working with the
NodeManager(s) to execute and monitor the containers and their resource consumption. It has
the responsibility of negotiating appropriate resource containers from the ResourceManager,
tracking their status and monitoring progress.
The ApplicationMaster allows YARN to exhibit the following key characteristics:
Scale: The Application Master provides much of the functionality of the traditional
ResourceManager so that the entire system can scale more dramatically. In tests, we’ve already
successfully simulated 10,000 node clusters composed of modern hardware without significant
issue. This is one of the key reasons that we have chosen to design the ResourceManager as
a pure scheduler i.e. it doesn’t attempt to provide fault-tolerance for resources. We shifted that to
become a primary responsibility of the ApplicationMaster instance. Furthermore, since there is
an instance of an ApplicationMaster per application, the ApplicationMaster itself isn’t a common
bottleneck in the cluster.
Open: Moving all application framework specific code into the ApplicationMaster generalizes
the system so that we can now support multiple frameworks such as MapReduce, MPI and Graph
Processing.
Resource Model
YARN supports a very general resource model for applications. An application can request
resources with highly specific requirements such as:
Resource-name (hostname, rackname – we are in the process of generalizing this further
to support more complex network topologies with YARN-18).
Memory (in MB)
In future, expect us to add more resource-types such as disk/network I/O, GPUs etc.
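The request shape described above can be sketched as a small data structure. The field names here are illustrative stand-ins, not YARN's actual API:

```python
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    """Illustrative model of a YARN-style resource request (not the real API)."""
    resource_name: str   # hostname or rack name where the container should run
    memory_mb: int       # requested memory, in MB
    num_containers: int  # how many containers of this shape are wanted

# An ApplicationMaster might ask for two 1 GB containers on a specific rack.
req = ResourceRequest(resource_name="/rack1", memory_mb=1024, num_containers=2)
print(req.memory_mb * req.num_containers)  # 2048 (total MB requested)
```

The ResourceManager's scheduler matches such requests against available node capacity, subject to the capacity, fairness, and SLA constraints described above.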
The JobTracker is responsible for resource management (managing the worker nodes i.e.
TaskTrackers), tracking resource consumption/availability and also job life-cycle
management (scheduling individual tasks of the job, tracking progress, providing fault-
tolerance for tasks etc).
The TaskTracker has simple responsibilities – launch/teardown tasks on orders from the
JobTracker and provide task-status information to the JobTracker periodically.
YARN’s original purpose was to split up the two major responsibilities of the
JobTracker/TaskTracker into separate entities:
a global ResourceManager
a per-application ApplicationMaster
The ResourceManager and the NodeManager formed the new generic system for
managing applications in a distributed manner. The ResourceManager is the ultimate authority
that arbitrates resources among all applications in the system. The ApplicationMaster is a
framework-specific entity that negotiates resources from the ResourceManager and works with
the NodeManager(s) to execute and monitor the component tasks.
The NodeManager is the per-machine slave, which is responsible for launching the
applications’ containers, monitoring their resource usage (cpu, memory, disk, network) and
reporting the same to the ResourceManager.