
1. Introduction

What is Data?

Data can be defined as a representation of facts, concepts, or instructions in a formalized manner.

In the information age, data are a large set of bits encoding numbers, texts, images, sounds, videos,
and so on. Unless we add information to data, they are meaningless. When we add information,
giving a meaning to them, these data become knowledge. But before data become knowledge,
typically, they pass through several steps where they are still referred to as data, despite being a bit
more organized; that is, they have some information associated with them.

Table 1.1 - Characteristics of Data

• Data is a set of instances of information in the form of large sets of bits encoding numbers, text, images, sounds, videos, and so on.

• Unless we generate a coherent and comprehensive view of the data instances, in the form of information, data is meaningless.

• When we add information, giving meaning to the data, it becomes knowledge.

• Data goes through several steps in the process of generating knowledge.

Data is typically generated through various sources such as sensors, surveys, transactions, social
media, and web interactions. It can be structured, semi-structured, or unstructured. Structured data
is organized in a predefined format, like a database table with rows and columns, while unstructured
data has no predefined structure, like text from social media posts.

What can we do with data?

Data can be used to-

Gain insights: Analyzing data can help us understand patterns and trends that would be difficult or
impossible to detect otherwise. By identifying trends and patterns, we can gain valuable insights that
can be used to inform decisions and strategies. Data is processed and analyzed using various
techniques such as statistical analysis, machine learning, and artificial intelligence. The insights
gained from data can be used to improve decision-making, identify trends, and optimize processes in
various fields like business, healthcare, finance, and more.

Improve decision-making: Data can help us make more informed decisions by providing the information we need to evaluate different options and choose the best course of action. Using data for decision-making is a structured approach that involves collecting, analyzing, and interpreting data to inform decisions. The key is to ensure that the data is relevant, accurate, and unbiased, and that the insights gained from the analysis actually feed into the decision-making process.

Using data for decision-making involves the following steps:

Identify the problem or decision to be made: The first step is to clearly define the problem or
decision to be made. This involves identifying the goal, the stakeholders involved, and the scope of
the decision.

Collect relevant data: Once the problem has been defined, the next step is to collect data that is
relevant to the problem or decision. This may involve gathering data from various sources, such as
internal databases, external sources, or surveys.

Clean and organize the data: Before the data can be analyzed, it needs to be cleaned and organized.
This involves removing any duplicates, correcting errors, and formatting the data in a way that
makes it easy to analyze.
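
As a minimal illustration of this cleaning step, the short Python sketch below uses the pandas library to remove duplicates, drop records with missing values, and standardize formats. The table and column names are hypothetical, not taken from any particular system.

```python
# A minimal cleaning sketch using pandas; the table and column names are hypothetical.
import pandas as pd

orders = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ravi ", None, "Meena"],
    "amount": ["250", "250", "120", "90", "310"],
})

clean = (
    orders
    .drop_duplicates()              # remove duplicate records
    .dropna(subset=["customer"])    # drop rows with a missing customer name
    .assign(
        customer=lambda df: df["customer"].str.strip(),   # trim stray whitespace
        amount=lambda df: df["amount"].astype(float),     # store amounts as numbers
    )
)

print(clean)
```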

Analyze the data: The next step is to analyze the data to identify patterns and insights that are
relevant to the decision or problem. This may involve using statistical analysis, machine learning
algorithms, or other techniques to identify correlations and trends in the data.

Interpret the results: Once the data has been analyzed, the next step is to interpret the results and
identify any implications for the decision or problem at hand. This may involve identifying potential
risks or opportunities, evaluating different options, and assessing the impact of different decisions.

Make the decision: Based on the insights gained from the data analysis, a decision can be made. It's
important to consider the limitations of the data and any potential biases that may have influenced
the analysis.

Monitor and evaluate the decision: After the decision has been made, it's important to monitor and
evaluate the results to determine if the decision was effective and if any adjustments need to be
made in the future.

Further, data analysis can be used for-

Optimizing processes: By analyzing data, we can identify inefficiencies in processes and systems, and
take steps to improve them. This can lead to increased efficiency, cost savings, and improved
performance.

Personalization: By analyzing data about individual customers, we can personalize our products and
services to better meet their needs and preferences.

Predictive modeling: Data can be used to create predictive models that can be used to anticipate
future events, identify risks, and inform decision-making.

Automation: Data can be used to automate processes and decision-making, reducing the need for
human intervention and improving efficiency.

The science that analyzes raw data to extract useful knowledge (patterns) from it is called analytics.
Differences between Small Data, Medium Data and Big Data

Data can be small, medium or big.

Small data is data in a volume and format that makes it accessible, informative and actionable. Small
data refers to data sets that are relatively small in size, usually ranging from a few hundred to a few
thousand data points. These data sets can be easily managed and analyzed using traditional data
analysis tools and techniques, such as spreadsheets or simple statistical software. Small data
typically come from traditional sources such as surveys, experiments, or individual observations.

Medium data refers to data sets that are too large to fit on a single machine but don’t require enormous clusters of thousands of machines. Medium data sets are larger and more complex than small data,
but still manageable using traditional data analysis tools and techniques. Medium data sets can
contain millions or billions of data points and can come from a variety of sources, such as
transactional data, social media, and web logs. Analyzing medium data sets requires more advanced
techniques, such as machine learning algorithms.

Big data refers to extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions. These data sets are so large and complex that they cannot be easily managed or analyzed using traditional data analysis tools and techniques. Big data sets can contain trillions of data points and come from a variety of
sources, such as sensors, IoT devices, and large-scale simulations. Analyzing big data sets requires
specialized tools and techniques, such as distributed computing systems, parallel processing, and
advanced machine learning algorithms.

Define Big Data

The standard definition of big data is often framed using the "3 Vs": Volume, Velocity, and Variety. Specifically, big data is characterized by:

Volume: Big data refers to extremely large data sets that are too large to be processed using
traditional data processing methods. The data sets can range from terabytes to petabytes and
beyond, and can come from a variety of sources, including sensors, social media, web logs, and
more.

Velocity: Big data is characterized by the speed at which data is being produced, collected, and
processed. With the rise of real-time data processing, big data sets are often generated at a very
high velocity, making it difficult to store and analyze them using traditional methods.

Variety: Big data sets are characterized by the variety of data types and formats that exist within
them. Data can come in structured, semi-structured, and unstructured formats, and can include text,
audio, video, and more.

Overall, big data refers to data sets that are large, complex, and varied, and require specialized tools
and techniques for processing, storage, and analysis. The insights gained from big data analysis can
be used to drive decision-making, identify patterns and trends, and develop new products and
services.
Data and its sources in the Big Data age.

In the context of big data, there are three types of data: structured, semi-structured, and
unstructured.

Structured Data:

Structured data is organized in a well-defined manner, where the data is presented in a structured
format. It is typically stored in tables with clearly defined columns and rows, and it can be easily
analyzed using traditional data processing tools like relational databases. Examples of structured
data include data from transactional systems like financial transactions, customer information in a
CRM system, and inventory data in a supply chain management system.

Semi-structured Data:

Semi-structured data contains some structure, but not enough to fit into a traditional relational
database. It may contain data that is partially structured or inconsistent. Semi-structured data can
be analyzed using NoSQL databases or other big data technologies. Examples of semi-structured
data include XML or JSON data, log files, and sensor data.
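
For example, the small Python sketch below parses a hypothetical JSON log record with the standard json module. The nested keys give the record some structure, but different records may carry different or missing fields, which is what makes such data only semi-structured.

```python
import json

# A hypothetical semi-structured log record: the fields are nested and optional,
# and different records may carry different keys.
record = """
{
  "timestamp": "2023-04-01T10:15:00Z",
  "user": {"id": 42, "country": "IN"},
  "events": ["login", "search"],
  "device": null
}
"""

data = json.loads(record)
print(data["user"]["id"])     # 42
print(len(data["events"]))    # 2
print(data["device"])         # None - the key exists but holds no value
```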

Unstructured Data:

Unstructured data refers to data that has no predefined structure, and it can be in any format such
as text, audio, or video. It is challenging to process and analyze unstructured data using traditional
database tools, and it requires advanced techniques like natural language processing (NLP) or
machine learning to derive insights from it. Examples of unstructured data include social media
feeds, emails, videos, and images.

In summary, structured data is organized in a well-defined manner, semi-structured data contains some structure, and unstructured data has no predefined structure. Understanding the different types of data is essential in designing big data solutions that can handle the complexities and volume of data in modern applications.

Data coming from different sources.

In big data, there are various sources of data, including:

Enterprise Applications: Enterprise applications such as customer relationship management (CRM), enterprise resource planning (ERP), and supply chain management (SCM) systems generate large amounts of structured data.

Social Media: Social media platforms such as Twitter, Facebook, Instagram, and LinkedIn generate
massive amounts of unstructured data in the form of posts, comments, likes, and shares.

Internet of Things (IoT): IoT devices such as sensors, smart meters, and wearables generate vast
amounts of semi-structured or unstructured data that can be used to monitor and optimize
industrial processes, public utilities, and personal health.

Web and Mobile Applications: Web and mobile applications generate structured, semi-structured,
and unstructured data in the form of user clicks, searches, reviews, and feedback.

Machine and Sensor Data: Machine and sensor data are generated by machines such as
manufacturing equipment, vehicles, and aircraft, as well as sensors such as temperature sensors,
GPS sensors, and motion sensors.

Government and Public Data: Government agencies, non-profit organizations, and research
institutions generate a wealth of structured, semi-structured, and unstructured data that can be
used for research, analysis, and decision-making.

Transactional Data: Transactional data, such as financial transactions and online purchases,
generates vast amounts of structured data that can be analyzed to gain insights into customer
behavior, market trends, and fraud detection.

These are just some of the many sources of data in big data, and the volume, variety, and velocity of data generated by these sources continue to grow exponentially, making it increasingly important to develop big data solutions that can handle and process these vast amounts of data.

What is Analytics?

Data analytics is the process of examining and interpreting large sets of data in order to discover
patterns, trends, and other meaningful insights. The goal of data analytics is to gain a better
understanding of data and use that understanding to inform business decisions, improve
performance, and drive innovation.

Data analytics involves several steps and generically includes:

 Learning the application domain
 Data Collection and creating the target dataset
 Organization and data storage
 Pre-processing/Cleaning
 Transformation, data reduction and projection
 Choosing the data mining function and algorithm
 Data mining and Modeling
 Interpretation and knowledge delivery

It can be performed using a variety of techniques, such as statistical analysis, data mining, machine
learning, and natural language processing. Data analytics can be applied to structured data, which is
highly organized and easy to analyze, as well as unstructured data, such as text, audio, and video,
which requires specialized tools and techniques for analysis.

There are several different types of data analytics, some of the most commonly used analytics
methodologies are:

Descriptive Analytics: Descriptive analytics is the most basic type of analytics methodology, which
involves analyzing data to understand what has happened in the past. It provides a summary of
historical data in the form of metrics, charts, and graphs. This type of analytics can be used to
identify trends, patterns, and anomalies in the data. Descriptive analytics can be used for a variety of
applications, such as understanding customer behavior, optimizing business processes, and tracking
key performance indicators (KPIs). By providing a summary of historical data, descriptive analytics
can help organizations identify areas for improvement and make data-driven decisions.

Descriptive analytics involves several key steps, including:

Data Collection: The first step in descriptive analytics is collecting the relevant data. This data can
come from a variety of sources, such as surveys, customer transactions, and website analytics.

Data Cleaning: Once the data is collected, it must be cleaned and prepared for analysis. This involves
removing any errors, inconsistencies, or missing data points.

Data Exploration: The next step in descriptive analytics is exploring the data to gain a better
understanding of its characteristics. This can be done using visualizations, such as histograms, scatter
plots, and heat maps.

Data Analysis: After exploring the data, the next step is to analyze it to identify trends, patterns, and
anomalies. This can be done using statistical analysis techniques, such as mean, median, and
standard deviation.
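
As a brief illustration of this step, the Python sketch below uses pandas to compute the summary statistics mentioned above for a small, made-up series of daily sales figures.

```python
import pandas as pd

# Hypothetical daily sales figures
sales = pd.Series([120, 135, 128, 150, 90, 210, 142])

print("mean:", sales.mean())
print("median:", sales.median())
print("standard deviation:", sales.std())
print(sales.describe())   # count, mean, std, min, quartiles and max in one call
```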

Data Visualization: Finally, the results of the analysis are presented using visualizations, such as bar
charts, line graphs, and pie charts. These visualizations can help stakeholders understand the
insights gained from the analysis and make informed decisions.

Diagnostic Analytics: Diagnostic analytics involves analyzing data to understand why something has
happened in the past. This type of analytics focuses on the causes and correlations of events and can
be used to identify the root cause of a problem or opportunity.

Diagnostic analytics can be used for a variety of applications, such as identifying the reasons for low
customer satisfaction, understanding the factors driving sales growth, or diagnosing the causes of
operational inefficiencies. By providing insights into the underlying causes of events, diagnostic
analytics can help organizations make more informed decisions and take actions to address the root
cause of problems or capitalize on opportunities.

Diagnostic analytics involves several key steps, including:

Data Collection: The first step in diagnostic analytics is collecting the relevant data. This data can
come from a variety of sources, such as customer feedback, sales data, or website analytics.

Data Cleaning: Once the data is collected, it must be cleaned and prepared for analysis. This involves
removing any errors, inconsistencies, or missing data points.

Data Exploration: The next step in diagnostic analytics is exploring the data to gain a better
understanding of its characteristics. This can be done using visualizations, such as scatter plots or
heat maps.

Data Analysis: After exploring the data, the next step is to analyze it to identify patterns and
relationships. This can be done using statistical analysis techniques, such as regression analysis,
correlation analysis, or hypothesis testing.
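
A minimal sketch of one such technique, correlation analysis, is shown below using pandas on two hypothetical columns (advertising spend and sales); a strong correlation would be one signal to examine further in the root cause step.

```python
import pandas as pd

# Hypothetical weekly data for a diagnostic question:
# did advertising spend move together with sales?
df = pd.DataFrame({
    "ad_spend": [10, 12, 9, 15, 18, 20, 14],
    "sales":    [200, 230, 190, 260, 300, 320, 250],
})

corr = df["ad_spend"].corr(df["sales"])   # Pearson correlation coefficient
print(f"correlation between ad spend and sales: {corr:.2f}")
```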

Root Cause Analysis: Finally, the results of the analysis are used to identify the root cause of a
problem or opportunity. This involves examining the relationships between variables and
determining which variables are driving the outcome of interest.

Predictive Analytics: Predictive analytics involves analyzing data to predict what will happen in the
future. This type of analytics uses statistical and machine learning techniques to analyze historical
data and predict future outcomes. Predictive analytics can be used for a variety of applications, such
as fraud detection, customer segmentation, and supply chain optimization. This type of analytics is
used to identify patterns and trends in data, and then use that information to forecast future
outcomes. By providing insights into future outcomes, predictive analytics can help organizations
make more informed decisions, optimize their operations, and drive growth and innovation.

Predictive analytics involves several key steps, including:

Data Collection: The first step in predictive analytics is collecting the relevant data. This data can
come from a variety of sources, such as customer transactions, website analytics, or social media
activity.

Data Cleaning: Once the data is collected, it must be cleaned and prepared for analysis. This involves
removing any errors, inconsistencies, or missing data points.

Data Exploration: The next step in predictive analytics is exploring the data to gain a better
understanding of its characteristics. This can be done using visualizations, such as histograms, scatter
plots, or heat maps.

Data Analysis: After exploring the data, the next step is to analyze it to identify patterns and
relationships. This can be done using statistical analysis techniques, such as regression analysis or
time series analysis, or machine learning algorithms, such as decision trees, neural networks, or
random forests.

Predictive Modeling: Once the data has been analyzed, the next step is to build a predictive model
that can be used to forecast future outcomes. This model is trained using historical data, and then
used to make predictions about future events.

Model Evaluation: Finally, the predictive model is evaluated to determine its accuracy and
effectiveness. This involves comparing the predicted outcomes to actual outcomes, and making any
necessary adjustments to the model.
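
The Python sketch below walks through these last two steps in miniature with scikit-learn: it trains a decision tree on a small synthetic churn dataset, then evaluates its accuracy on held-out data. The feature names and figures are illustrative only.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic history: [monthly_usage_hours, support_tickets] -> churned (1) or stayed (0)
X = [[5, 4], [40, 0], [3, 6], [35, 1], [8, 3], [50, 0], [2, 5], [45, 1]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold back part of the history so the model can be evaluated on unseen cases
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)   # predictive modeling step
model.fit(X_train, y_train)

predictions = model.predict(X_test)              # forecast on the held-out cases
print("accuracy:", accuracy_score(y_test, predictions))   # model evaluation step
```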

Prescriptive Analytics: Prescriptive analytics involves analyzing data to determine the best course of action to take in the future. This type of analytics goes beyond predicting future outcomes and provides specific recommendations on what actions to take to achieve a desired goal. It uses advanced modeling and optimization techniques to recommend the best course of action in a given situation, and can be used to optimize business processes, improve customer experiences, and enhance decision-making. Typical applications include optimizing supply chain operations, maximizing revenue and profitability, and improving customer satisfaction. By providing specific recommendations for action, prescriptive analytics helps organizations make more informed decisions, increase efficiency, and drive innovation.

Prescriptive analytics involves several key steps, including:

Business objective and case understanding: This involves understanding the value chains, processes, and the specific application for which we aim to develop the prescriptive model.

Data Collection: The first step in prescriptive analytics is collecting the relevant data. This data can
come from a variety of sources, such as customer feedback, sales data, or operational data.

Data Cleaning: Once the data is collected, it must be cleaned and prepared for analysis. This involves
removing any errors, inconsistencies, or missing data points.

Data Exploration: The next step in prescriptive analytics is exploring the data to gain a better
understanding of its characteristics. This can be done using visualizations, such as scatter plots or
heat maps.

Data Analysis: After exploring the data, the next step is to analyze it to identify patterns and
relationships. This can be done using statistical analysis techniques, such as regression analysis or
hypothesis testing, or machine learning algorithms, such as decision trees or neural networks.

Prescriptive Modeling: Once the data has been analyzed, the next step is to build a prescriptive
model that can be used to recommend the best course of action to take. This model is trained using
historical data, and then used to simulate different scenarios and recommend the optimal solution.

Model Evaluation: Finally, the prescriptive model is evaluated to determine its accuracy and
effectiveness. This involves testing the model against real-world scenarios and making any necessary
adjustments to improve its performance.
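
As a minimal sketch of the optimization idea behind prescriptive modeling, the Python code below uses SciPy's linear programming solver to recommend how many units of two hypothetical products to produce, given limited machine hours and material. The numbers are invented for illustration.

```python
from scipy.optimize import linprog

# Maximize profit 40*x1 + 30*x2; linprog minimizes, so negate the objective.
c = [-40, -30]

# Resource constraints:
#   2*x1 + 1*x2 <= 100   (machine hours available)
#   1*x1 + 1*x2 <= 80    (units of raw material available)
A = [[2, 1], [1, 1]]
b = [100, 80]

result = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
x1, x2 = result.x
print(f"recommended plan: {x1:.0f} units of product 1, {x2:.0f} units of product 2")
print(f"expected profit: {-result.fun:.0f}")
```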

Algorithms for Data Analytics

What is an algorithm? It is a self-contained, step-by-step set of instructions, easily understandable by humans, allowing the implementation of a given method designed to solve a specific problem or perform a specific task. Algorithms are self-contained so that they can be easily translated into an arbitrary programming language. In the context of data analytics, algorithms are used to process and analyze data in order to extract insights, make predictions, or identify patterns and trends.

Some examples of various algorithms used in data analytics:

Linear regression: This is a statistical algorithm that is used to model the relationship between two
variables. It is commonly used to make predictions about future events based on historical data.

Decision trees: This is a machine learning algorithm that is used to classify data into different
categories based on a set of predetermined rules. It is often used in applications such as customer
segmentation or fraud detection.

Random forests: This is a machine learning algorithm that is used to make predictions by combining
the results of multiple decision trees. It is often used in applications such as credit scoring or
predicting customer churn.

K-means clustering: This is a machine learning algorithm that is used to group data into clusters
based on their similarity. It is often used in applications such as market segmentation or image
recognition.
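
A brief scikit-learn sketch of k-means clustering is shown below, grouping a handful of hypothetical customers (annual spend and visits per month) into two segments.

```python
from sklearn.cluster import KMeans

# Hypothetical customers: [annual_spend, visits_per_month]
customers = [[200, 1], [250, 2], [220, 1], [900, 8], [950, 9], [870, 7]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("segment labels:", kmeans.labels_)           # which segment each customer falls into
print("segment centers:", kmeans.cluster_centers_)
```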

Principal component analysis (PCA): This is a statistical algorithm that is used to reduce the
dimensionality of data by identifying the most important variables. It is often used in applications
such as image processing or speech recognition.

Apriori algorithm: This is a data mining algorithm that is used to identify patterns in large datasets. It
is often used in applications such as market basket analysis or recommendation systems.

Gradient descent: This is a mathematical optimization algorithm that is used to find the minimum of
a function. It is often used in applications such as neural networks or linear regression.
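
The short Python sketch below applies gradient descent to the simple function f(x) = (x - 3)^2, repeatedly stepping against the gradient until it approaches the minimum at x = 3.

```python
# Gradient descent on f(x) = (x - 3)**2, whose gradient is 2*(x - 3).
x = 0.0              # starting guess
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (x - 3)
    x = x - learning_rate * gradient   # step against the gradient

print(f"estimated minimum near x = {x:.4f}")   # approaches 3.0
```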

The choice of algorithm depends on the specific problem being solved, as well as the nature of the
data being analyzed.
Big Data Analytics

As defined earlier, big data refers to data sets that are large, complex, and varied, and that require specialized tools and techniques for processing, storage, and analysis. The standard definition is framed in terms of the "3 Vs": Volume (extremely large data sets, from terabytes to petabytes and beyond), Velocity (the speed at which data is produced, collected, and processed), and Variety (the mix of structured, semi-structured, and unstructured formats, including text, audio, video, and more).

However, as the field develops, more and more "Vs" are being added to the definition of Big Data. Here we study the six V's of Big Data, which are:

Volume: refers to the enormous amount of data generated and collected by businesses, social
media platforms, and IoT devices. With the increasing digitization of our lives, the volume of data is
growing exponentially.

Velocity: refers to the speed at which data is generated and processed. In today's fast-paced
business environment, the speed at which data is generated and analyzed can make a significant
difference in decision-making.

Variety: refers to the different types of data available, including structured, unstructured, and semi-
structured data. This includes data from sources such as social media, sensors, video, and audio.
Data variety is present in the form of:

Structural variety - differing formats and models

Semantic variety - differences in how to interpret and operate on data

Media variety - the medium in which data is delivered

Availability variety - real-time, batch/intermittent, and non-real-time data

Veracity: refers to the accuracy and reliability of the data. With so much data available, it can be
difficult to ensure that the data is accurate and trustworthy.

Value: refers to the insights and value that can be derived from the data. The ultimate goal of Big
Data is to turn the massive amount of data into actionable insights that can drive business growth
and success.

Valence: Valence refers to connectedness. It represents the fraction of data items that are connected out of the total number of possible connections, and describes how big data can bond with each other, forming connections between otherwise disparate datasets.
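
For example, if a collection of five data items could have at most 5 × 4 / 2 = 10 pairwise connections and only four of those connections are actually present, the valence is 4/10 = 0.4, as the small Python sketch below computes.

```python
# Valence = actual connections / possible connections
n_items = 5
actual_connections = 4                                 # e.g. 4 linked pairs observed
possible_connections = n_items * (n_items - 1) // 2    # 10 possible pairs

valence = actual_connections / possible_connections
print(f"valence = {valence:.2f}")                      # 0.40
```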

Valence poses the following challenges in big data systems:

 More complex data exploration algorithms
 Modeling and predicting valence changes
 Group event detection
 Emergent behavior analysis

The five P’s of Big Data Analytics

P 1: Purpose

A goal or purpose should always be formulated before even conceptualizing a big data solution. Because Big Data systems are still maturing and there are varied applications and tools available, it is important that the business purpose of a big data system is conceptualized before implementing it. Possible examples can be:

Better business insights

Fraud prevention/detection

Prediction

Maximization problems, etc.

It is essential for a project within the field of Big Data or Data Science to have a specific purpose or
goal. You should never work on a project aimlessly, just because everyone is doing it, since it will not be useful for you or your company and will be costly to implement.

P 2: People

Various types of people with different skillsets play an important role within a data science project.
In order to work successfully with data, developers, testers, data scientists and domain experts are
essential. Furthermore, stakeholders/project sponsors and project manager/product owner are
involved in data projects. In this relation, the former group of people have to be informed about the
progress of the project, whereas the latter have the task of mediating between stakeholders and the
development team.

P 3: Processes

There are two main types of processes within data science projects: organizational vs. technical
processes.

P 4: Platforms

It is concerned with the platforms you will use for your analytics and products, which are critical for successfully managing a data science project. This leads to further questions such as:

 How should the data integration be realized? (Manually via Java vs. tools like Talend, Dataflow, etc.)
 Which cloud should be used? (AWS vs. Google vs. Azure, and public vs. private vs. hybrid)
 What are my technical requirements?

P 5: Programmability

It concerns the tools and programming languages you want to use, and is determined and influenced by the IT governance and strategy of the organization. Examples of tools and programming languages are:

 Programming languages: SQL, Python, R
 Big Data tools: Hadoop, Google’s Cloud Storage & BigQuery, AWS Redshift and S3
 Streaming software: Kafka, Spark, Talend
 BI tools: Tableau, Qlik, Google Data Studio

Drivers for adopting big data

There are several business drivers for Big Data adoption, including:

Enhanced Customer Experience: Big Data can help businesses better understand their customers'
needs, preferences, and behavior, allowing them to deliver more personalized and targeted
products and services.

Operational Efficiency: Big Data can help businesses optimize their operations, reduce costs, and
improve productivity by identifying inefficiencies, automating processes, and predicting maintenance
needs.

Competitive Advantage: Big Data can provide businesses with a competitive advantage by enabling
them to make more informed decisions, gain insights into market trends, and develop innovative
products and services.

Improved Risk Management: Big Data can help businesses identify and mitigate risks, such as fraud
and security threats, by detecting patterns and anomalies in data.

Data-Driven Decision Making: Big Data can help businesses make data-driven decisions by providing
insights and analysis that support strategic planning, forecasting, and performance measurement.

Internet of everything and technological trends: The changing technological environment makes it difficult for organizations to keep operating with older tools designed for small and medium data.

Business Architecture: As new business architectures emerge, they are increasingly dependent upon information technology, the Internet of Things, and sensor technologies. For such businesses, the use of big data technologies is a significant boost.
2. Business Cases for Big Data

Google

Big data and big business go hand in hand – this is the first in a series where I will examine the different uses that the world’s leading corporations are making of the endless amount of digital information the world is producing every day. Google has not only significantly influenced the way
we can now analyse big data (think MapReduce, BigQuery, etc.) – but they are probably more
responsible than anyone else for making it part of our everyday lives. I believe that many of the
innovative things Google is doing today, most companies will do in years to come. Many people,
particularly those who didn’t get online until this century had started, will have had their first direct
experience of manipulating big data through Google. Although these days Google’s big data
innovation goes well beyond basic search, it’s still their core business. They process 3.5 billion
requests per day, and each request queries a database of 20 billion web pages. This is refreshed
daily, as Google’s bots crawl the web, copying down what they see and taking it back to be stored in
Google’s index database. What pushed Google in front of other search engines has been its ability to
analyse wider data sets for their search. Initially it was PageRank which included information about
sites that linked to a particular site in the index, to help take a measure of that site’s importance in
the grand scheme of things. Previously leading search engines worked almost entirely on the
principle of matching relevant keywords in the search query to sites containing those words.

PageRank revolutionized search by incorporating other elements alongside keyword analysis.

Their aim has always been to make as much of the world’s information available to as many people as possible (and get rich trying, of course…) and the way Google search works has been constantly revised and updated to keep up with this mission. Moving further away from keyword-based search
and towards semantic search is the current aim. This involves analysing not just the “objects”
(words) in the query, but the connection between them, to determine what it means as accurately
as possible. To this end, Google throws a whole heap of other information into the mix. Starting in
2007 it launched Universal Search, which pulls in data from hundreds of sources including language
databases, weather forecasts and historical data, financial data, travel information, currency
exchange rates, sports statistics and a database of mathematical functions.

It continued to evolve in 2012 into the Knowledge Graph, which displays information on the subject
of the search from a wide range of resources directly into the search results. It then mixes what it
knows about you from your previous search history (if you are signed in), which can include
information about your location, as well as data from your Google+ profile and Gmail messages, to
come up with its best guess at what you are looking for. The ultimate aim is undoubtedly to build the
kind of machine we have become used to seeing in science fiction for decades – a computer which
you can have a conversation with in your native tongue, and which will answer you with precisely
the information you want.

Search is by no means all of what Google does, though. After all, it’s free, right? And Google is one of the most profitable businesses on the planet. That profit comes from what it gets in return for its searches – information about you. Google builds up vast amounts of data about the people using it. Essentially it then matches up companies with potential customers, through its Adsense algorithm. The companies pay handsomely for these introductions, which appear as adverts in the customers’ browsers.
In 2010 it launched BigQuery, its commercial service for allowing companies to store and analyse big data sets on its cloud platforms. Companies pay for the storage space and computer time taken in running the queries. Another big data project Google is working on is the self-driving car. Using and generating massive amounts of data from sensors, cameras, and tracking devices, and coupling this with on-board and real-time data analysis from Google Maps, Streetview and other sources, allows the Google car to safely drive on the roads without any input from a human driver. Perhaps the most astounding use Google have found for their enormous data, though, is predicting the future.

In 2008 the company published a paper in the science journal Nature claiming that their technology had the capability to detect outbreaks of flu with more accuracy than current medical techniques for detecting the spread of epidemics. The results were controversial – debate continues over the accuracy of the predictions. But the incident unveiled the possibility of “crowd prediction”, which in my opinion is likely to be a reality in the future as analytics becomes more sophisticated.

Google may not quite yet be ready to predict the future – but its position as a main player and
innovator in the big data space seems like a safe bet.
Role of Big Data in Sports

Big data has played an increasingly important role in sports over the past few years. By collecting
and analyzing large amounts of data, teams and athletes can gain valuable insights that help them
make better decisions, optimize their training and performance, and ultimately improve their results.
The modern sports industry requires tracking the movement trajectories of players, referees, and the ball, then establishing dynamic evaluation indicators and converting these data into valuable information. Different data mining methods have been applied to uncover hidden relationships, patterns, and laws in sports big data. Sports news agencies also focus on gathering data about various players and teams and their past records, and on generating real-time analytics.

Here are some of the ways big data is being used in sports:

Performance Analysis: Sports teams use data analytics to track the performance of their athletes in
real-time. This can include metrics such as speed, distance covered, heart rate, and other biometric
data. This helps coaches and trainers optimize training routines, identify strengths and weaknesses,
and make adjustments to improve performance.

Talent Scouting: Teams also use data analytics to identify potential recruits. By analyzing
performance data from various leagues and competitions, teams can find talented players who may
have gone unnoticed otherwise.

Injury Prevention: Big data can also be used to prevent injuries. By analyzing data on player
movements and biometric data, teams can identify when an athlete is at risk of injury and take
preventive measures.

Fan Engagement: Sports teams use data analytics to understand their fan base and provide them
with personalized experiences. This can include targeted advertising, promotions, and offers based
on fan preferences.

Broadcasting: Sports broadcasters use data analytics to enhance the viewing experience. They can
analyze data from cameras and microphones to provide viewers with unique insights into the game,
such as player heart rates, shot angles, and more.

Overall, big data is revolutionizing the sports industry, providing teams and athletes with powerful
tools to improve performance and engage with fans in new and exciting ways.
Role of Big Data in Operations Management

Big data plays a significant role in operations management and supply chain management. By
collecting, analyzing, and utilizing large sets of data, companies can optimize their operations and
supply chains, reduce costs, and improve efficiency. Some of the ways big data is being used in
operations management and supply chain management are as follows:

Predictive Analytics: Big data is used to predict demand and forecast trends in operations and supply
chains. Companies use historical data to predict demand for their products and services, which helps
them plan their production schedules and optimize their inventory levels.

Quality Control: Big data is used to monitor quality control processes and identify defects early in the
production process. Companies can use data from sensors and other sources to detect potential
problems and take corrective action before they become more significant issues.

Supplier Management: Big data is used to monitor supplier performance and identify potential risks
in the supply chain. Companies can use data to track supplier performance and identify trends that
may indicate potential issues, such as delayed deliveries or quality problems.

Inventory Management: Big data is used to optimize inventory levels and reduce costs. By analyzing
data on demand, lead times, and other factors, companies can reduce excess inventory and
minimize the risk of stockouts.

Real-time Tracking: Big data is used to track shipments in real-time, allowing companies to respond
quickly to changes in the supply chain. Companies can use data from GPS and other sources to
monitor shipments, identify delays, and take corrective action.

Overall, big data is a valuable tool for operations management and supply chain management. By
using data to optimize processes and reduce costs, companies can improve their efficiency and stay
competitive in an increasingly complex global market.

Role of Big data in Marketing

Big data plays a significant role in marketing, as it provides valuable insights into consumer behavior, preferences, and trends, and enables marketers to make data-driven decisions, improve their marketing effectiveness, and deliver personalized experiences to their customers. Customer Analytics (48%), Operational Analytics (21%), Fraud and Compliance (12%), New Product & Service Innovation (10%), and Enterprise Data Warehouse Optimization (10%) are among the most popular big data use cases in sales and marketing.

Some of the ways big data can be used in marketing are as follows:

 Big data allows marketers to track and analyze consumer behavior across multiple channels,
including social media, websites, and mobile apps. This data can help them understand how
consumers interact with their brand, which products or services they prefer, and what influences
their purchasing decisions.
 Big data can help marketers create personalized experiences for their customers, by analyzing
their preferences, behaviors, and interests. This allows them to tailor their marketing messages
and offers to each customer's unique needs, which can increase engagement and drive sales.
 Big data can be used to help marketers forecast consumer trends and behavior. This can help
them anticipate customer needs, optimize their marketing campaigns, and make more informed
decisions.
 Big data can be used to optimize marketing campaigns in real-time, by analyzing customer data
and adjusting messaging, targeting, and channels accordingly. This allows marketers to maximize
the effectiveness of their campaigns and improve their ROI.
 Big data can provide valuable insights into customer preferences and opinions, which can be
used to inform product development, customer service, and other business decisions.

By combining big data with an integrated marketing management strategy, marketing organizations
can make a substantial impact in the key areas:

 Customer Engagement- insight into not just who your customers are, but where they are, what
they want, how they want to be contacted and when
 Listen to the voice of customers coming from diverse media
 Reduce Customer Acquisition Cost
 Customer retention- discover what influences customer loyalty and what keeps them coming
back again and again.
 Optimize marketing campaign cost and performance.

Role of Big Data Analytics in Gaming Industry

Big data plays a significant role in the computer gaming industry, as it allows game developers and
publishers to better understand their players and improve the gaming experience. It enables game
developers and publishers to create better games, improve the gaming experience, and increase
player engagement and revenue.

Some of the ways big data is used in the gaming industry are:

Player behavior analysis: Big data is used to analyze player behavior and preferences, such as which
games they play, how long they play, and which in-game items they purchase. This data can be used
to optimize game design, identify popular features, and create personalized experiences. Big data
can be used to build predictive models that help game developers and publishers forecast player
behavior and trends. This can help them anticipate customer needs, optimize their game
development and marketing strategies, and make more informed decisions.

Optimize customer engagement: Big data can be used to analyze marketing campaigns and
customer engagement, such as how players respond to different types of promotions or incentives.
This data can be used to create more effective marketing campaigns and improve customer
engagement.

Game performance optimization: Big data can be used to analyze the performance of games, such as
load times, frame rates, and connection speeds. This data can be used to optimize game
performance and improve the user experience.

Fraud detection: Big data is used to detect fraudulent behavior, such as cheating, hacking, and credit
card fraud. This helps game developers and publishers protect their game and revenue.

Role of Big Data in Accounts and Finance

Big data has become increasingly important in the field of accounting and finance as it allows for the
analysis of large volumes of structured and unstructured data to uncover insights that can help
inform decision-making and improve business outcomes. Some specific roles of big data in
accounting and finance include:

Risk management: Big data can be used to identify and mitigate potential risks in financial
transactions and investments. By analyzing large amounts of data from various sources, such as
financial statements, credit reports, and market trends, accountants and finance professionals can
identify potential risks and take steps to reduce their impact.

Fraud detection: Big data can also be used to detect and prevent financial fraud. By analyzing large
amounts of financial data and identifying patterns or anomalies, accountants and finance
professionals can identify potential cases of fraud and take appropriate action. This can include
analyzing transactions, identifying red flags, and conducting deeper investigations to identify
potential fraud.

Financial forecasting: Big data can be used to create more accurate financial forecasts, which can
help organizations make better decisions about investments, capital expenditures, and other
financial matters. This can include analyzing financial statements, cash flow statements, and balance
sheets to identify trends and patterns that can help accountants make more informed decisions.

Customer insights: Big data can also be used to gain insights into customer behavior, preferences,
and trends, which can help organizations make more informed decisions about product
development, marketing, and sales.

Cost analysis: Big data analytics can be used to analyze cost data to identify areas where cost savings
can be achieved. This can include analyzing expenses, identifying inefficiencies, and identifying
opportunities to streamline processes.

Big Data Applications for Government and Policy

Big data has the potential to revolutionize how governments and policymakers make decisions, solve
problems, and deliver services to citizens.

 Law enforcement agencies can use big data to analyze crime patterns and predict potential
criminal activity. This can help them allocate resources more effectively and prevent crimes
before they happen.
 Governments can use real-time traffic data to optimize traffic flow, reduce congestion, and improve public transportation systems. This can result in a more efficient and sustainable transportation system. Huge amounts of data from location-based social networks and high-speed data from telecoms have affected travel behaviour. When data comes in such large volumes, analysing it to produce useful insight can be tough. However, there are possibilities for governments in traffic control, route planning, intelligent transport systems, and congestion management (by predicting traffic conditions); for the private sector in revenue management, technological enhancements, logistics, and competitive advantage (by consolidating shipments and optimizing freight movement); and for individuals in route planning to save fuel and time, travel arrangements in tourism, and so on.
 Rising healthcare costs are often not curbed because data goes under-utilized and systems remain inefficient, mainly because electronic data is unavailable, inadequate, or unusable. Governments can use big data to improve healthcare services by analyzing patient data, identifying disease outbreaks, and predicting future health trends. This can help governments allocate resources more effectively and improve public health outcomes. Free public health data and Google Maps have been used by the University of Florida to create visual data that allows for faster identification and efficient analysis of healthcare information, used in tracking the spread of chronic disease.
 Education can be a major Big Data application area. Big Data can be used to measure teachers’ effectiveness to ensure a pleasant experience for both students and teachers. A teacher’s performance can be fine-tuned and measured against student numbers, subject matter,
student demographics, student aspirations, behavioral classification, and several other
variables. Big data can be used to analyze student data and provide personalized learning
experiences. For example, by analyzing student performance data, teachers can identify areas
where individual students need extra help and adjust their instruction accordingly.
 Big data can be used to develop and refine educational curricula. By analyzing data on student
learning outcomes and teacher performance, policymakers can identify areas where curricula
need to be revised or updated.
3. Fundamentals of Big Data
3.1. Evolution of Big Data Systems

Businesses have long struggled with finding a pragmatic approach to capturing information about
their customers, products, and services. When a company only had a handful of customers who all
bought the same product in the same way, things were pretty straightforward and simple. But over
time, companies and the markets they participate in have grown more complicated.

Data management has to include technology advances in hardware, storage, networking, and
computing models such as virtualization and cloud computing. The convergence of emerging
technologies and reduction in costs for everything from storage to compute cycles have transformed
the data landscape and made new opportunities possible. As all these technology factors converge,
it is transforming the way we manage and leverage data. Big data is the latest trend to emerge
because of these factors. The term "big data" refers to extremely large and complex datasets that
cannot be processed using traditional data processing techniques.

But before we delve into the details of big data, it is important to look at the evolution of data
management and how it has led to big data. Big data is not a stand-alone technology; rather, it is a
combination of the last 50 years of technology evolution. The history of big data can be traced back
to the early days of computing when mainframe computers were used to process large volumes of
data. As computing moved into the commercial market in the late 1960s, data was stored in flat files
that imposed no structure. When companies needed to get to a level of detailed understanding
about customers, they had to apply brute-force methods, including very detailed programming
models to create some value. Later in the 1970s, things changed with the invention of the relational
data model and the relational database management system (RDBMS) that imposed structure and a
method for improving performance. Most importantly, the relational model added a level of
abstraction (the structured query language [SQL], report generators, and data management tools) so
that it was easier for programmers to satisfy the growing business demands to extract value from
data.

The relational model offered an ecosystem of tools to help companies better organize their data and
be able to compare transactions from one geography to another. In addition, it helped business
managers who wanted to be able to examine information such as inventory and compare it to
customer order information for decision-making purposes.

But a problem emerged from this exploding demand for answers:

 Storing this growing volume of data was expensive.
 Accessing this growing data was slow.
 Making matters worse, lots of data duplication existed.
 The actual business value of that data was hard to measure.

The Entity-Relationship (ER) model emerged, which added additional abstraction to increase the
usability of the data. In this model, each item was defined independently of its use. Therefore,
developers could create new relationships between data sources without complex programming.

In the 1990s, data warehouses emerged as a way to store and manage large amounts of data for
analysis. The data warehouse enabled the IT organization to select a subset of the data being stored
so that it would be easier for the business to try to gain insights. The data warehouse was intended
to help companies deal with increasingly large amounts of structured data that they needed to be
able to analyze by reducing the volume of the data to something smaller and more focused on a
particular area of business. However, these systems were expensive and only accessible to large
organizations. Sometimes these data warehouses themselves were too complex and large and didn’t
offer the speed and agility that the business required. The answer was a further refinement of the
data being managed through data marts. Data warehouses and data marts solved many problems
for companies needing a consistent way to manage massive transactional data. But when it came to
managing huge volumes of unstructured or semi-structured data, the warehouse was not able to
evolve enough to meet changing demands. To complicate matters, data warehouses are typically fed
in batch intervals, usually weekly or daily. This is fine for planning, financial reporting, and traditional
marketing campaigns, but is too slow for increasingly real-time business.

Enterprise Content Management systems evolved in the 1980s to provide businesses with the
capability to better manage unstructured data, mostly documents. In the 1990s with the rise of the
web, organizations wanted to move beyond documents and store and manage web content, images,
audio, and video. New requirements were driven, in large part, by a convergence of factors including
the web, virtualization, and cloud computing.

The early 2000s saw the rise of the internet and the proliferation of digital data. Companies like
Google and Amazon began to collect vast amounts of data on their users, leading to the
development of new technologies like Hadoop, which allowed for the processing of massive
amounts of data using distributed computing. In 2008, the term "big data" began to gain popularity,
and companies began to invest heavily in data analysis and management tools. The emergence of
cloud computing and the growth of social media platforms further fueled the explosion of big data.

Today, big data is used in a wide range of applications, from scientific research to business analytics
and marketing. The field continues to evolve rapidly, with new technologies and approaches
constantly emerging to help organizations make sense of their data and derive insights that can drive
innovation and growth.

3.2. Evolution of Big Data from Traditional Database Management Systems

Big data evolved from traditional database management systems (DBMS) in response to the growing
need to process and analyze extremely large and complex datasets that could not be handled by
traditional DBMS.

Traditional DBMS were designed to handle structured data that could be easily organized and stored
in tables with well-defined relationships. However, with the advent of the internet and the
proliferation of digital data, the volume, variety, and velocity of data being generated began to
outstrip the capabilities of traditional DBMS.

Traditional databases, such as relational database management systems (RDBMS), were designed to
handle structured data with well-defined relationships and schemas, and they were optimized for
transactions and data consistency. However, as the volume and variety of data exploded with the
rise of the internet and social media, RDBMS struggled to keep up with the demands of modern data
processing. One of the key challenges with RDBMS was their scalability limitations. They were
designed to run on a single server, and as the amount of data increased, it became increasingly
difficult and expensive to scale up by adding more powerful hardware. Another challenge with
RDBMS was their limited ability to handle unstructured and semi-structured data, such as social
media feeds, log files, and sensor data. This type of data did not fit neatly into the structured tables
and columns of RDBMS and required new storage and processing techniques.
As a result, new technologies and approaches were developed to address the challenges posed by
big data. Many of the technologies at the heart of big data, such as virtualization, parallel processing,
distributed file systems, and in-memory databases, have been around for decades. Advanced
analytics have also been around for decades, although they have not always been practical. Other
technologies such as Hadoop and MapReduce have been on the scene for only a few years. This
combination of technology advances can now address significant business problems.

One of the key technologies that emerged was Hadoop, an open-source distributed computing
platform that could process and analyze large datasets using clusters of commodity hardware.
Hadoop, along with other big data technologies like NoSQL databases, data warehouses, and data
lakes, provided more flexible and scalable ways of storing and processing large amounts of data.
These technologies were designed to handle unstructured and semi-structured data, such as social
media feeds, log files, and sensor data, which could not be easily managed by traditional DBMS. Big
data also drove the development of NoSQL databases, such as MongoDB and Couchbase, which were
designed to handle unstructured and semi-structured data more efficiently.

In addition, big data technologies made long-established approaches to data analysis, such as
machine learning and predictive analytics, more practical than ever, helping organizations derive
insights and make better decisions based on their data.

Overall, big data represents a significant evolution from traditional systems, providing new
capabilities and approaches to help organizations manage, process, and analyze the massive
amounts of data being generated in today's digital world.

3.3. Terminologies Used in Big Data Environments


 As-a-service infrastructure

Data-as-a-service, software-as-a-service, platform-as-a-service – all refer to the idea that data,
licences to use data, or platforms for running Big Data technology can be provided "as a service"
rather than sold as a product. This reduces the upfront capital investment customers need to make
before putting their data, or platforms, to work, since the provider bears the costs of setting up and
hosting the infrastructure. For a customer, as-a-service infrastructure can greatly reduce the initial
cost and setup time of getting Big Data initiatives up and running.

 Data mining

Data mining is the process of discovering insights from data. In terms of Big Data, because it is so
large, this is generally done by computational methods in an automated way, using methods such
as decision trees, clustering analysis and, most recently, machine learning. This can be thought of
as using the brute mathematical power of computers to spot patterns in data which would not be
visible to the human eye due to the complexity of the dataset.
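
As an illustration of this kind of automated pattern-spotting, the sketch below clusters a small
synthetic dataset with k-means using scikit-learn. The dataset, the meaning of the two features, and
the choice of three clusters are assumptions made purely for this example.

# A minimal sketch of automated pattern discovery via clustering,
# assuming NumPy and scikit-learn are installed; the synthetic data
# and the choice of three clusters are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic "customer" records with two features each
# (say, purchase frequency and average basket size).
data = np.vstack([
    rng.normal(loc=[2, 20], scale=1.5, size=(100, 2)),
    rng.normal(loc=[10, 5], scale=1.5, size=(100, 2)),
    rng.normal(loc=[6, 40], scale=1.5, size=(100, 2)),
])

# Fit k-means to group similar records together.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

print("Cluster centres:\n", model.cluster_centers_)
print("First ten cluster labels:", model.labels_[:10])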

 Hadoop

Hadoop is a framework for Big Data computing which has been released as open-source software,
and so can be freely used by anyone. It consists of a number of modules, each tailored to a different
vital step of the Big Data process – from file storage (Hadoop Distributed File System – HDFS) to
databases (HBase) to carrying out data operations (Hadoop MapReduce – see below). It has become
so popular, thanks to its power and flexibility, that it has developed its own industry of vendors
(selling tailored distributions), support service providers and consultants.

 Hadoop Core Components:

Hadoop is a popular open-source framework for storing and processing large-scale data sets. The
core components of the Hadoop environment are:

Hadoop Distributed File System (HDFS): HDFS is a distributed file system that is used to store data
across multiple nodes in a cluster. It is designed to handle large files and is fault-tolerant, meaning
that it can continue to function even if some of the nodes in the cluster fail.

Yet Another Resource Negotiator (YARN): YARN is the resource manager in Hadoop. It manages the
resources in the cluster, including memory, CPU, and disk space. YARN also schedules jobs and
allocates resources to those jobs.

MapReduce: MapReduce is a programming model used to process large datasets in parallel across
multiple nodes in a Hadoop cluster. It is a distributed processing model that breaks down a large
data set into smaller chunks, processes those chunks in parallel, and then combines the results.

Hadoop Common: Hadoop Common includes libraries and utilities used by other Hadoop
components. It provides a consistent set of tools for interacting with HDFS and MapReduce.

Together, these core components of Hadoop create a scalable and reliable framework for storing
and processing large-scale data sets.

 Hadoop Distributed File System (HDFS)

HDFS is a distributed file system that provides highly scalable and reliable storage for big data
processing. It is a core component of the Apache Hadoop software framework. HDFS works by
splitting large files into smaller blocks, which are then stored across a cluster of commodity
hardware nodes. Each block is replicated across multiple nodes to ensure fault tolerance and high
availability. This means that if one node fails, the data can be recovered from one of the other
nodes.

HDFS consists of two main components: the NameNode and the DataNodes. The NameNode is
responsible for maintaining metadata about the file system, such as the location of each block of
data, and for coordinating access to the data. The DataNodes are responsible for storing and serving
the actual data blocks. When a client wants to access data stored in HDFS, it sends a request to the
NameNode. The NameNode then returns the locations of the data blocks to the client, which can
then retrieve the data directly from the DataNodes. This approach minimizes network traffic and
improves performance.

HDFS is designed to be highly scalable and fault-tolerant. It can handle petabytes of data and can be
deployed on commodity hardware, making it a cost-effective solution for big data storage.
Additionally, HDFS supports data replication across multiple data centers, enabling disaster recovery
and high availability scenarios.
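
To make the block-splitting arithmetic concrete, the short sketch below works out how many blocks
a file occupies and how much raw storage its replicas consume. The 128 MB block size and
replication factor of 3 are common defaults assumed here for illustration; actual clusters may be
configured differently.

# A minimal sketch of HDFS block accounting, assuming the common
# defaults of a 128 MB block size and a replication factor of 3.
import math

def hdfs_storage(file_size_mb: float,
                 block_size_mb: int = 128,
                 replication: int = 3) -> tuple[int, float]:
    """Return (number of blocks, total raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

# Example: a 1 GB (1024 MB) file.
blocks, raw = hdfs_storage(1024)
print(f"{blocks} blocks, about {raw:.0f} MB of raw storage across the cluster")
# -> 8 blocks, about 3072 MB of raw storage across the cluster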

A client follows these steps to read data from the Hadoop Distributed File System (HDFS):
 The client sends a request to the NameNode for the location of the file to be read.
 The NameNode responds with the locations of the blocks of the file, which are stored across
multiple DataNodes in the cluster.
 The client selects a DataNode to read the first block of the file.
 The client establishes a connection with the selected DataNode and sends a request to read
the block.
 The DataNode reads the block and sends it back to the client.
 If the client needs to read additional blocks, it requests the locations of the next block from
the NameNode and selects a different DataNode to read the block from.
 This process is repeated until the entire file is read.

It's important to note that HDFS is optimized for handling large files, so reading smaller files may
result in less optimal performance. Additionally, HDFS is designed for batch processing rather than
real-time processing, so it may not be the best solution for use cases that require low-latency access
to data.
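
In practice, client libraries hide this block-level protocol behind a simple file API. The sketch below
reads a file over WebHDFS using the third-party Python hdfs package (HdfsCLI); the NameNode
address, port, user, and file path are placeholders assumed for this example.

# A minimal sketch of reading a file from HDFS over WebHDFS, assuming
# the third-party "hdfs" Python package is installed and a NameNode
# web endpoint is reachable at the (assumed) address below.
from hdfs import InsecureClient

# Host, port, user, and path are placeholders for this example.
client = InsecureClient("http://namenode.example.com:9870", user="hadoop")

# The library asks the NameNode for block locations and then streams
# the data from the DataNodes, mirroring the steps described above.
with client.read("/data/logs/sample.log", encoding="utf-8") as reader:
    content = reader.read()

print(content[:500])  # show the first 500 characters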

In a Hadoop Distributed File System (HDFS), the NameNode is a crucial component that manages the
metadata of the files stored in the system. To ensure the high availability of the NameNode, several
redundancy mechanisms can be employed. Here are some of the most common ones:

Backup Node: The Backup Node maintains an up-to-date, in-memory copy of the file system
namespace by receiving a stream of namespace edits from the active NameNode, so its copy can be
used to restore the metadata if the active NameNode fails. However, the Backup Node does not
serve read or write requests from clients.

High Availability (HA) Cluster: In an HA cluster, there are two NameNodes - an active one and a
standby one - running on different machines. The active NameNode manages the namespace and
serves client requests, while the standby NameNode monitors the active NameNode and takes over
its duties in case of a failure. The standby NameNode regularly receives updates from the active
NameNode, ensuring that it always has an up-to-date namespace.

Quorum-based Journaling: In this approach, a group of machines act as a quorum and maintain a
shared journal that records all changes to the namespace. Both the active and standby NameNodes
write to this journal, and in case of a failure of the active NameNode, the standby NameNode reads
the journal to reconstruct the latest namespace state.

Federation: Federation involves running multiple independent HDFS clusters, each with its own
NameNode, and using a separate mount point to provide a unified view of the file system to the
clients. This approach provides higher scalability and fault tolerance but requires more complex
setup and administration.

Each of these mechanisms has its own advantages and disadvantages, and the choice of redundancy
mechanism depends on factors such as the size of the HDFS cluster, the workload characteristics,
and the desired level of availability.

 YARN (Yet Another Resource Negotiator):

YARN is a component of the Hadoop ecosystem that acts as a resource manager and job scheduler. It
was introduced in Hadoop 2.0 as a replacement for the previous MapReduce JobTracker.

YARN allows multiple data processing engines such as Apache Spark, Apache Flink, and Apache
HBase to run on the same Hadoop cluster, making it more flexible and efficient. It separates the job
scheduling and resource management functions, allowing them to scale independently of each
other. This means that resources can be allocated dynamically to different applications as needed,
which leads to higher resource utilization and faster job completion times.

YARN has two main components: ResourceManager and NodeManager. The ResourceManager is
responsible for allocating resources to applications and managing the application’s life cycle. The
NodeManager runs on each node in the cluster and is responsible for managing resources and
executing tasks on that node. It communicates with the ResourceManager to request resources and
report resource usage.

YARN supports a variety of resource types, including CPU, memory, and disk, allowing for fine-grained
resource allocation. It also supports features like containerization, which helps with resource
isolation and security. In summary, YARN is a key component of the Hadoop ecosystem that provides
efficient resource management and job scheduling capabilities, enabling multiple data processing
engines to run on the same cluster.
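
As a quick way to see the ResourceManager's view of cluster resources, the sketch below queries its
REST API for cluster-wide metrics. The host name is a placeholder, port 8088 is the usual default for
the ResourceManager web service, and the specific metric fields shown are assumptions based on
the standard cluster metrics response.

# A minimal sketch of querying the YARN ResourceManager REST API for
# cluster-level resource metrics, assuming the "requests" package and
# a ResourceManager reachable at the (assumed) address below.
import requests

RM_URL = "http://resourcemanager.example.com:8088"  # placeholder address

# The /ws/v1/cluster/metrics endpoint reports memory, vcores, and
# application counts as seen by the ResourceManager.
resp = requests.get(f"{RM_URL}/ws/v1/cluster/metrics", timeout=10)
resp.raise_for_status()
metrics = resp.json()["clusterMetrics"]

print("Active NodeManagers :", metrics["activeNodes"])
print("Available memory MB :", metrics["availableMB"])
print("Available vcores    :", metrics["availableVirtualCores"])
print("Running applications:", metrics["appsRunning"])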

 MapReduce

MapReduce is a programming model and processing framework used for processing large volumes
of data in a distributed computing environment. It is a key component of the Hadoop ecosystem but
can also be used in other distributed computing environments.

MapReduce divides a large data set into smaller subsets, called input splits, each of which is
processed in parallel by a map task on a node in the cluster. The mapper job is the first stage in the
data processing pipeline. The mapper is responsible for processing the input data and producing a
set of intermediate key-value pairs that are used as input for the reducer.

The mapper job receives a chunk of data as input, which is typically a block of data stored in HDFS
(Hadoop Distributed File System). The input data is divided into smaller portions and each portion is
processed independently by a mapper task running on a node in the Hadoop cluster. The input data
can be in any format, but it is typically in the form of text, with each line representing a record.

The mapper task reads the input data, extracts the relevant information, and transforms it into a set
of key-value pairs. The key is used to group related data together, and the value is the actual data
being processed. The output of the mapper is then written to disk as intermediate data, which is
sorted and shuffled by Hadoop before being passed to the reducer.

Mappers can be customized to perform any data processing operation, such as filtering, sorting,
aggregation, or transformation. MapReduce provides a flexible programming model that allows
developers to implement complex data processing logic using simple mapper and reducer functions.
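
As a concrete illustration, the sketch below is a word-count mapper written for Hadoop Streaming,
which lets mappers and reducers be written as ordinary scripts that read records from standard
input and emit tab-separated key-value pairs on standard output. The word-count task itself is an
assumed example, not a required part of MapReduce.

#!/usr/bin/env python3
# mapper.py - a minimal word-count mapper sketch for Hadoop Streaming.
# It reads lines of text from standard input and emits one
# tab-separated "word<TAB>1" pair per word on standard output.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # The key is the word; the value is a count of 1.
        print(f"{word.lower()}\t1")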

The reduce function takes the intermediate key-value pairs produced by the mapper function and
produces the final output. The reduce function is responsible for aggregating and summarizing the
intermediate data generated by the mapper function. The reduce function receives a set of key-value
pairs, where each key is a unique key emitted by the mapper function, and the value is the list of all
intermediate values generated for that key.

The reduce function processes the values associated with each key and produces a set of output
key-value pairs. The output key-value pairs are typically in a different format than the input key-value
pairs, as they represent a summary or aggregation of the input data.

The reduce function can be customized to perform a wide variety of data processing operations,
such as counting, averaging, summarizing, or filtering.
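
Continuing the word-count illustration, the sketch below is the matching Hadoop Streaming reducer.
It relies on the shuffle-and-sort phase having grouped the mapper output by key, so all counts for a
given word arrive on consecutive input lines.

#!/usr/bin/env python3
# reducer.py - a minimal word-count reducer sketch for Hadoop Streaming.
# Input lines arrive sorted by key, so counts for the same word are
# consecutive and can be summed in a single pass.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = int(count)

# Emit the final key after the input is exhausted.
if current_word is not None:
    print(f"{current_word}\t{current_count}")

Assuming Hadoop Streaming is available, scripts like these would typically be submitted with the
hadoop jar command pointing at the streaming jar, passing the scripts via its -mapper and -reducer
options; the exact jar location varies between distributions.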

The reduce function typically runs on a different set of nodes in the Hadoop cluster than the mapper
function, allowing for parallel processing of the intermediate data. MapReduce also includes a
shuffle and sort phase, which groups the intermediate data by key and redistributes it to the nodes
running the reduce function.

In summary, the mapper job in MapReduce is responsible for processing input data and producing a
set of intermediate key-value pairs that are used as input for the reducer. The map phase involves
processing the input data and producing intermediate key-value pairs. The mapper task runs
independently on each node in the Hadoop cluster and can be customized to perform a wide variety
of data processing operations. The reduce phase takes these intermediate results, sorts them by key,
and aggregates the values associated with each key. This produces the final output, which is a set of
key-value pairs. The reduce function in MapReduce is responsible for aggregating and summarizing
the intermediate data generated by the mapper function. The reduce function runs in parallel across
multiple nodes in the Hadoop cluster, enabling efficient processing of large data sets.

MapReduce is fault-tolerant, meaning that it can continue to function even if some of the nodes in
the cluster fail. It also enables parallel processing of data, which can lead to significant performance
improvements over traditional single-machine processing.

MapReduce has been used for a variety of data processing tasks, including text processing, data
mining, machine learning, and more. The popularity of MapReduce has led to the development of
many related tools and frameworks, such as Apache Pig and Apache Hive, which provide higher-level
abstractions for working with MapReduce.

3.4. The Big Data Architecture

While the functional requirements for a Big Data analytics system are fairly clear, the system you
establish will depend on the nature of the analysis you are supporting, and you will need the right
amount of computational power and speed.

While some analysis will need to happen in real time, you will inevitably be storing some amount of
data as well. Hence, your architecture also has to have the right amount of redundancy so that you
are protected from unanticipated latency and downtime. Your organization and its needs will
determine how much attention you have to pay to different performance issues. So, before
establishing a big data system, an organization must start out by asking itself the questions covered
in the reading below.

Read Chapter 4 of Big Data For Dummies by Judith Hurwitz, Alan Nugent, Fern Halper, and Marcia
Kaufman (Wiley, Hoboken, NJ).
