Introduction
Big data is exactly what the name suggests: a "big" amount of data. Big Data refers to data sets that are large in volume and more complex than traditional data sets, so much so that traditional data processing software cannot handle them. Put simply, Big Data means datasets containing large amounts of diverse data, both structured and unstructured. Big Data allows companies to address the issues they face in their business and solve them effectively using Big Data Analytics.
1. Volume: The name Big Data itself refers to enormous size. Big Data is the vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and more. For example, Facebook generates approximately a billion messages, records the "Like" button being clicked about 4.5 billion times, and receives more than 350 million new posts every day. Big data technologies are built to handle such large amounts of data.
2. Variety: Big Data can be structured, unstructured, or semi-structured, and it is collected from many different sources. In the past, data was collected mainly from databases and spreadsheets, but today it arrives in many forms: PDFs, emails, audio, social media posts, photos, videos, and more.
3. Veracity: Veracity refers to how reliable the data is, and to the ability to filter, interpret, and manage data of uncertain quality efficiently. Handling veracity well is essential for using Big Data in business development.
For example, Facebook posts with hashtags vary widely in trustworthiness and relevance.
4. Value
Value is an essential characteristic of big data. It is not the raw data we process or store that matters; it is the valuable, reliable data that we store, process, and analyze.
5. Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created, often in real time. It covers the rate at which incoming data sets arrive and are linked, their rate of change, and bursts of activity. A primary aspect of Big Data is to make in-demand data available rapidly.
Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, mobile devices, and so on.
The 6th V
6. Variability:
How often does the meaning or shape of your data change? Variability asks how fast the structure of your data is changing and how often its meaning shifts.
Example: you eat the same ice cream every day, but the taste keeps changing.
Big data is not only huge in volume; it also has a lot of variation in it. In one device the data may be small and simple, while in another device in the same place it is large and complex. For example, for some people collecting magazines or books is a passion; they do not want to sell them even after reading them many times, while others buy a book, read it, and then sell it. The same is true of big data: in some places it is simple, in others complex. It just depends on the source and context; this is raw data variability.
Where Big Data Is Used
Here are some examples of how big data is used by organizations:
In the energy industry, big data helps oil and gas companies identify potential drilling locations and
monitor pipeline operations; likewise, utilities use it to track electrical grids.
Financial services firms use big data systems for risk management and real-time analysis of market data.
Manufacturers and transportation companies rely on big data to manage their supply chains and optimize
delivery routes.
In government, uses include emergency response, crime prevention and smart city initiatives.
a) Travel and Tourism: The travel and tourism industry is a heavy user of Big Data. It enables companies to forecast the travel facilities required at multiple locations, improve business through dynamic pricing, and much more.
b) Financial and banking sector: The financial and banking sectors use big data technology extensively. Big data analytics helps banks understand customer behaviour on the basis of investment patterns, shopping trends, motivation to invest, and inputs obtained from personal or financial backgrounds.
c) Healthcare
Big data has started making a massive difference in the healthcare sector. With the help of predictive analytics, medical professionals and healthcare personnel can now provide personalized healthcare to individual patients.
f) E-commerce
Amazon: Amazon is a huge e-commerce website dealing with enormous traffic daily. When there is a pre-announced sale on Amazon, traffic increases so rapidly that it could crash the website. To handle this kind of traffic and data, Amazon uses Big Data, which helps in organizing and analyzing the data for further use.
3. Analytical Challenges:
Big data raises some major analytical challenges, such as:
How do you deal with a problem if the data volume gets too large?
How do you find the important data points?
How do you use the data to best advantage?
The large amounts of data on which this analysis is performed can be structured (organized data), semi-structured (semi-organized data), or unstructured (unorganized data). There are two techniques through which decision making can be done:
Either incorporate massive data volumes in the analysis,
Or determine upfront which big data is relevant.
4. Technical challenges:
Quality of data:
Collecting and storing large amounts of data comes at a cost, yet business and IT leaders always want more data storage.
For better results and conclusions, big data focuses on storing quality data rather than irrelevant data.
This raises further questions: how can we ensure that the data is relevant, how much data is enough for decision making, and is the stored data accurate or not?
Fault tolerance:
Fault tolerance is another technical challenge; fault-tolerant computing is extremely hard and involves intricate algorithms.
New technologies such as cloud computing and big data are designed so that whenever a failure occurs, the damage stays within an acceptable threshold, i.e., the whole task does not have to begin from scratch.
Scalability:
Big data projects can grow and evolve rapidly. The scalability requirements of Big Data have led organizations towards cloud computing.
This raises challenges such as how to run and schedule various jobs so that the goal of each workload is achieved cost-effectively.
It also requires dealing with system failures efficiently, which leads to another big question: what kinds of storage devices should be used?
Dimensions of Scalability
1. Volume Scalability: This refers to the ability of a system to handle large volumes of data. Big Data systems
need to be able to store and process massive volumes of data, often in the order of petabytes or exabytes.
2. Velocity Scalability: This refers to the speed at which data is generated, processed, and analyzed. Big data
systems need to be able to cope with the fast-paced nature of data processing, typically in real-time or
near-real-time.
3. Variety Scalability: This refers to the ability of a system to handle different types of data, including
structured and unstructured data, as well as semi-structured data. Big data systems need to be able to
handle a wide range of data formats and sources.
4. Veracity Scalability: This refers to the ability of a system to maintain the accuracy and reliability of data as it scales. Big data systems need to be able to manage data quality, including identifying and handling data errors or inconsistencies.
Getting Value out of Big Data
To get value out of big data, businesses need to follow certain steps:
1. Identify the objective: Businesses must identify the goal they want to achieve before sifting
through the data. This will help them focus on specific metrics to be analyzed.
2. Gather the right data: Big data is not just about large quantities of data, it's about the right data.
Businesses must ensure they have the relevant data to achieve their objective.
3. Analyze the data: Businesses must apply advanced analytics and data mining techniques to extract
insights from the data.
4. Visualize the data: Visualization is the best way to make data accessible to everyone in the
organization. Interactive visualizations can make it easier for everyone to understand the data.
5. Act on the insights: Businesses must use the insights gained from the data to make informed
decisions. This can be done by incorporating the insights into business operations, identifying areas
for improvement, and making changes accordingly.
In summary, getting value out of big data requires a deep understanding of the goals, appropriate data,
and advanced analytical tools to extract insights that drive business decisions.
Step One: Process and Clean Data
a) It is important to verify your data matches your business goals. If it does not, there are several
questions to address: What are the viable proxies? Are there outliers that need to be taken into
account? Does the data contain bias? Are there missing values? Look for functionalities that will
correctly address the various needs to clean and process the data.
b) There are a number of methods that can be used to impute, or fill in, missing values, such as mean imputation/interpolation, the Kalman filter, and ARMA models (a small sketch follows this list).
c) This step is one of the most important, but may take 70-90 percent of your data analysis project time.
The quality of your data will greatly affect your analysis results.
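As a minimal sketch of the cleaning and imputation described in step (b), assuming a small pandas DataFrame with hypothetical sensor columns (mean imputation only; Kalman-filter or ARMA-based imputation would need extra libraries such as statsmodels):

import pandas as pd
import numpy as np

# Hypothetical sensor readings with gaps and an obvious outlier
df = pd.DataFrame({
    "temperature": [21.4, np.nan, 22.1, 250.0, 21.9],
    "humidity": [40, 42, np.nan, 41, 43],
})

# Treat impossible values as missing before imputing
df.loc[df["temperature"] > 60, "temperature"] = np.nan

# Mean imputation: fill missing values with each column's mean
df = df.fillna(df.mean(numeric_only=True))
print(df)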
Step Two: Explore and Visualize Data
a) Explore the processed data and visually inspect the data for patterns, trends, and clusters.
b) This is the time to examine relationships and build hypotheses according to your findings. The
easiest way to complete this process is with the aid of visualization tools.
c) There are a number of simple yet powerful visual aids, such as scatter plots, line graphs, stacked
bar charts, box-plots, and heat-maps.
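A small sketch of this kind of visual exploration, assuming pandas and matplotlib are available and using made-up age and income values:

import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"age": [23, 35, 41, 52, 29, 60],
                   "income": [28000, 45000, 52000, 70000, 39000, 82000]})

# Scatter plot to look for a relationship between the two attributes
plt.scatter(df["age"], df["income"])
plt.xlabel("age")
plt.ylabel("income")
plt.title("Exploring a possible age/income relationship")
plt.show()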
Step Three: Data Mine
a) You can use various methods to facilitate pattern recognition, including K-Means clustering, hierarchical clustering, and multi-dimensional scaling (see the sketch after this list).
b) Organizations that leverage and mine their data predictively have a significant competitive advantage
over their rivals, as they can gain important insights and react quickly to expand their business in a way
that was not possible without predictive analytics.
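A minimal sketch of the K-Means clustering mentioned in (a), using scikit-learn on invented two-dimensional customer features:

from sklearn.cluster import KMeans
import numpy as np

# Hypothetical customer data: [annual spend, visits per month]
X = np.array([[200, 2], [220, 3], [800, 10], [790, 12], [400, 5], [410, 6]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each customer
print(kmeans.cluster_centers_)   # coordinates of the cluster centres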
Step Four: Build Model
a) Be sure to have a wide range of models that provide different perspectives of the data. Some possible models to consider are decision trees, the Naïve Bayes classifier, neural networks, SVMs, and discriminant analysis (a brief sketch follows this list).
b) Every algorithm has its suitability, and it is important to understand that all models have limitations.
c) There could be more than one model that would work for a problem.
d) Avoid overfitting.
e) Be sure to document and communicate the assumptions and results clearly.
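A brief sketch of one candidate model from (a), a depth-limited decision tree in scikit-learn, with an explicit train/test split to help guard against overfitting (point d); the dataset is scikit-learn's bundled iris data, used purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Limiting tree depth is one simple guard against overfitting
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))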
Step Five: Generate Results and Optimize
a) Predictive results are used to establish objective functions in order to generate actionable results. There are many applicable methods, such as linear and quadratic programming (see the sketch after this list).
b) One specific method may be more appropriate than another depending on the nature of the objective
function (linear, quadratic, or discontinuous) and constraints on the variables (linear or not).
c) The goal is to produce results that lead to valuable business decisions. If the hospital staff knows a
certain surgical procedure has high readmissions, they may change the process to help reduce
readmissions, such as allowing for an extra day of post-operative care.
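A toy sketch of the optimization step described in (a) and (b), using linear programming via SciPy; the coefficients and constraints are invented for illustration, and linprog minimizes, so the objective is negated to maximize:

from scipy.optimize import linprog

# Maximize benefit = 3*x1 + 2*x2 (e.g., two post-operative care options)
# subject to a budget constraint x1 + x2 <= 10 and a staffing constraint 2*x1 + x2 <= 15
c = [-3, -2]                      # negate because linprog minimizes
A_ub = [[1, 1], [2, 1]]
b_ub = [10, 15]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)            # chosen levels and the maximized benefit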
Step Six: Validate Results
After you implement your business decisions, allow time to produce results. It is important to carefully
validate the results against the initial business objective. Returning to our healthcare example, the
hospital’s business objective is reducing readmissions. Analysts should review data to see if current rates
have declined in an appreciable way.
Gathering the data can include scanning your internal databases or purchasing databases from external sources.
Many companies store the sales data they have in customer relationship management (CRM) systems.
The CRM data can be easily analyzed by exporting it to more advanced tools using data pipelines.
Step 3: Processing the Data to Analyze
After the first and second steps, when you have all the data you need, you will have to process it before going further
and analyzing it.
Data can be messy if it has not been appropriately maintained, leading to errors that easily corrupt the analysis.
These issues can be values set to null when they should be zero or the exact opposite, missing values, duplicate values,
and many more.
You will have to go through the data and check it for problems to get more accurate insights. The most common errors
that you can encounter and should look out for are:
Missing values
Corrupted values like invalid entries
Time zone differences
Date range errors like a recorded sale before the sales even started
You will also have to look at the aggregates of all the rows and columns in the file and see whether the values you obtain make sense. If they don't, you will have to remove or replace the data that doesn't make sense. Once you have completed the data cleaning process, your data will be ready for an exploratory
data analysis (EDA).
Step 4: Exploring the Data
In this step, you will have to develop ideas that can help identify hidden patterns and insights. You will have to find more
interesting patterns in the data, such as why sales of a particular product or service have gone up or down.
You must analyze or notice this kind of data more thoroughly. This is one of the most crucial steps in a data science
process.
d) What is table/Relation?
Everything in a relational database is stored in the form of relations. The RDBMS database uses tables to
store data. A table is a collection of related data entries and contains rows and columns to store data. Each
table represents some real-world objects such as person, place, or event about which information is collected.
The organized collection of data into a relational table is known as the logical view of the database.
e) Properties of a Relation:
o Each relation has a unique name by which it is identified in the database.
o Relation does not contain duplicate tuples.
o The tuples of a relation have no specific order.
o All attributes in a relation are atomic, i.e., each cell of a relation contains exactly one value.
Let's see the example of the student table.
ID Name AGE COURSE
1 Ajeet 24 B.Tech
2 aryan 20 C.A
5 Vimal 26 BSC
g) What is a column/attribute?
A column is a vertical entity in the table which contains all information associated with a specific field in a
table. For example, "name" is a column in the above table which contains all information about a student's
name.
Properties of an Attribute:
o Every attribute of a relation must have a name.
o Null values are permitted for the attributes.
o Default values can be specified for an attribute automatically inserted if no other value is specified
for an attribute.
o Attributes that uniquely identify each tuple of a relation are the primary key.
Name
Ajeet
Aryan
Vimal
i) Degree:
The total number of attributes that comprise a relation is known as the degree of the table.
For example, the student table has 4 attributes, and its degree is 4.
ID Name AGE COURSE
1 Ajeet 24 B.Tech
2 aryan 20 C.A
5 Vimal 26 BSC
j) Cardinality:
The total number of tuples at any one time in a relation is known as the table's cardinality. The relation whose
cardinality is 0 is called an empty table.
For example, the student table has 3 rows, and its cardinality is 3.
ID Name AGE COURSE
1 Ajeet 24 B.Tech
2 aryan 20 C.A
5 Vimal 26 BSC
k) Domain:
The domain refers to the possible values each attribute can contain. It can be specified using standard data
types such as integers, floating numbers, etc. For example, An attribute entitled Marital_Status may be
limited to married or unmarried values.
l) NULL Values
The NULL value of the table specifies that the field has been left blank during record creation. It is different
from the value filled with zero or a field that contains space.
m) Data Integrity
The following categories of data integrity exist in each RDBMS:
Entity integrity: It specifies that there should be no duplicate rows in a table.
Domain integrity: It enforces valid entries for a given column by restricting the type, the format, or the range
of values.
Referential integrity: It specifies that rows which are referenced by other records cannot be deleted.
User-defined integrity: It enforces some specific business rules defined by users. These rules are different
from the entity, domain, or referential integrity.
Traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a NoSQL database system encompasses a wide
range of database technologies that can store structured, semi-structured, unstructured and polymorphic data.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google, Facebook, Amazon, etc.
who deal with huge volumes of data. The system response time becomes slow when you use RDBMS for
massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our existing hardware. This process is
expensive.
The alternative to this issue is to distribute the database load over multiple hosts whenever the load increases. This method is known as "scaling out."
NoSQL databases are non-relational, so they scale out better than relational databases, as they are designed with web applications in mind.
Types of NoSQL Databases
NoSQL Databases are mainly categorized into four types: Key-value pair, Column-oriented, Graph-based
and Document-oriented. Every category has its own attributes and limitations. No single type of database is best for solving every problem, so users should select the database based on their product needs.
1. Key Value Pair Based
Data is stored in key/value pairs. It is designed in such a way to handle lots of
data and heavy load.
Key-value pair storage databases store data as a hash table where each key is
unique, and the value can be a JSON, BLOB(Binary Large Objects), string, etc.
For example, a key-value pair may contain a key like “Website” associated with
a value like “Guru99”.
It is one of the most basic NoSQL database examples. This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc. Key-value stores help developers store schema-less data. They work best for shopping cart contents.
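As a rough illustration of the key-value idea, an in-memory Python dictionary can stand in for a key-value store such as Redis (the keys and values below are made up):

import json

# Each key is unique; the value can be a string, a JSON document, a blob, etc.
store = {}
store["Website"] = "Guru99"
store["cart:user42"] = json.dumps({"items": ["book", "pen"], "total": 12.50})

# Lookup is by key only; there is no schema and no join
print(store["Website"])
print(json.loads(store["cart:user42"])["items"])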
2. Column-based
Column-oriented databases work on columns and are based on Google's BigTable paper. Every column is treated separately, and the values of a single column are stored contiguously.
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN
etc. as the data is readily available in a column.
Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM, and library card catalogs.
Cassandra, HBase, and Hypertable are examples of column-based NoSQL databases.
3. Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value part is stored as a
document. The document is stored in JSON or XML formats. The value is understood by the DB and can be
queried.
4. Graph-Based
A graph database stores entities as well as the relations among those entities. Each entity is stored as a node, with the relationships stored as edges. An edge gives a relationship between nodes, and every node and edge has a unique identifier.
Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature. Traversing relationships is fast because they are already captured in the DB, and there is no need to calculate them.
Graph databases are mostly used for social networks, logistics, and spatial data.
Neo4J, Infinite Graph, OrientDB, FlockDB are some popular graph-based databases.
Advantages of NoSQL
Can be used as Primary or Analytic Data Source
Big Data Capability
No Single Point of Failure
Easy Replication
It provides fast performance and horizontal scalability.
Support Key Developer Languages and Platforms
Simpler to implement than RDBMS
Disadvantages of NoSQL
No standardization rules
Limited query capabilities
RDBMS databases and tools are comparatively mature
It does not offer any traditional database capabilities, like consistency when multiple transactions are
performed simultaneously.
When the volume of data increases, it becomes difficult to maintain unique keys
Doesn’t work as well with relational data
A Data Mart is a subset of an organizational data store, generally oriented to a specific purpose or primary data subject, and it may be distributed to meet business needs.
Data Marts are analytical record stores designed to focus on particular business functions for a specific
community within an organization.
Data marts are derived from subsets of data in a data warehouse, though in the bottom-up data
warehouse design methodology, the data warehouse is created from the union of organizational data
marts.
The fundamental use of a data mart is Business Intelligence (BI) applications.
BI is used to gather, store, access, and analyze records. It can be
used by smaller businesses to utilize the data they have
accumulated since it is less expensive than implementing a data
warehouse.
Other than these two categories, one more type exists that is called "Hybrid Data Marts."
Hybrid Data Marts
It allows us to combine input from sources other than a data warehouse. This can be helpful in many situations, especially when ad hoc integrations are needed, such as after a new group or product is added to the organization.
Steps in Implementing a Data Mart
A. Designing
The design step is the first in the data mart process. This phase covers all of the functions from initiating the
request for a data mart through gathering data about the requirements and developing the logical and
physical design of the data mart.
It involves the following tasks:
1. Gathering the business and technical requirements
2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data mart.
B. Constructing
This step contains creating the physical database and logical structures associated with the data mart to
provide fast and efficient access to the data.
It involves the following tasks:
1. Creating the physical database and logical structures such as tablespaces associated with the data
mart.
2. Creating the schema objects such as tables and indexes described in the design step.
3. Determining how best to set up the tables and access structures.
C. Populating
This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it to the right format and level of detail, and moving it into the data mart.
It involves the following tasks:
1. Mapping data sources to target data sources
2. Extracting data
3. Cleansing and transforming the information.
4. Loading data into the data mart
5. Creating and storing metadata
D. Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports, charts and graphs
and publishing them.
It involves the following tasks:
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database operations and object names into business terms so that end clients can interact with the data mart using words that relate to the business functions.
2. Set up and manage database structures, such as summarized tables, which help queries submitted through the front-end tools execute rapidly and efficiently.
E. Managing
This step contains managing the data mart over its lifetime. In this step, management functions are performed
as:
1. Providing secure access to the data.
2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even with system failures.
Data lakes can be deployed on-premises or in the cloud, depending on the organization's needs and
preferences. Some of the popular data lake platforms include Hadoop, AWS S3, Azure Data Lake
Storage, and GCP Data Lake Storage.
In a nutshell, data lakes provide organizations with a cost-effective, scalable, and flexible solution for
storing and managing large volumes of data. They help organizations to better understand their data,
make informed decisions, and drive business growth.
3. Loading
The Load phase is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.
Loading can be carried out in two ways:
1. Refresh: Data Warehouse data is completely rewritten, which means the older data is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the Data Warehouse. An
update is typically carried out without deleting or modifying preexisting data. This method is used in
combination with incremental extraction to update data warehouses regularly.
DATA PIPELINES
Data Pipeline:
A data pipeline deals with information flowing from one end to another. In simple words, it collects data from various sources, processes it as per the requirement, and transfers it to the destination through a sequence of activities. It is a set of steps that first extracts data from various sources, then transforms it, and finally moves it from one system to another toward its destination.
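A minimal, self-contained sketch of such a pipeline in Python: extract rows from a CSV file, transform them, and load them into a SQLite table. The file name, column names, and transformation are all hypothetical.

import csv, sqlite3

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))          # e.g. rows with "name" and "amount"

def transform(rows):
    return [(r["name"].strip().title(), float(r["amount"])) for r in rows]

def load(records, db_path="sales.db"):
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", records)
    con.commit()
    con.close()

# load(transform(extract("daily_sales.csv")))   # run the three stages in sequence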
1. Apache Hadoop: Apache Hadoop is an open-source big data processing tool that provides a
distributed file system (Hadoop Distributed File System - HDFS) and a distributed processing
framework (MapReduce) for parallel processing of large data sets. Hadoop is designed to run on
commodity hardware and can scale up to handle petabytes of data.
2. Apache Spark: Apache Spark is another open-source big data processing tool that provides a
general-purpose computing engine for parallel data processing. It supports both batch and real-time
processing and can be used with a variety of data sources, including Hadoop Distributed File System
(HDFS), Cassandra, and HBase.
4. Apache Storm: Apache Storm is an open-source distributed streaming data processing system that
can handle real-time data streams with low latency and high throughput. It uses a master-slave
architecture to process continuous data streams across a cluster of nodes.
5. Apache Kafka: Apache Kafka is an open-source distributed messaging system that is designed for
high throughput and low latency. It can handle real-time data streams and is often used in conjunction
with other big data processing tools such as Apache Spark and Apache Storm. Kafka provides a
scalable, fault-tolerant messaging infrastructure for distributed data processing.
These are some of the most popular big data processing tools that are used by enterprises to manage and
analyze large and complex data sets. Each tool has its own strengths and weaknesses, and choosing the right
tool depends on the specific requirements of the organization.
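To make the Spark entry above more concrete, here is a small PySpark sketch that reads a hypothetical CSV file and runs an aggregation in parallel; it assumes pyspark is installed and that a file named events.csv with a user_id column exists:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a structured file into a distributed DataFrame
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Count events per user; the work is executed in parallel across the cluster
df.groupBy("user_id").agg(F.count("*").alias("events")).show()

spark.stop()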
1. Data Storage: The storage component of the Big Data platform needs to be capable of handling
massive amounts of data. It can be a distributed file system like Hadoop Distributed File System
(HDFS), Cloud-based storage like Amazon S3, Microsoft Azure or Google Cloud Storage, or a
traditional relational database like MySQL, Oracle or Microsoft SQL Server.
2. Data Processing: The processing component of a Big Data platform allows for data ingestion,
transformation, and cleansing. Data processing tools like Apache Spark, Apache Storm, Apache Flink
or Apache Beam can be used to process large data sets in real-time.
3. Data Management: The data management component of Big Data platform deals with the
management of data lifecycle, metadata, and data security. It includes tools for data governance
and data integration such as Apache Atlas, Apache NiFi, and Apache Flume.
4. Data Analysis: The analysis component of a Big Data platform allows for advanced analytics on large
datasets, including predictive modeling, machine learning, and artificial intelligence. Some of the
popular data analysis tools used in Big Data platforms are Apache Hadoop, Apache Spark MLlib,
and Apache Mahout.
Overall, a big data platform is designed to help organizations to overcome the challenges of handling
massive volumes of data in real-time and derive valuable insights to support data-driven decision-making
processes.
Types of data
There are four types of big data BI that really aid business:
1. Prescriptive – This type of analysis reveals what actions should be taken. This is the most valuable
kind of analysis and usually results in rules and recommendations for next steps.
Prescriptive data refers to the analysis and interpretation of data to provide insights and
recommendations on how to improve future outcomes.
It goes beyond descriptive and diagnostic analytics, which focus on understanding what happened
and why it happened.
Prescriptive data involves the use of advanced analytics techniques, such as machine learning and
optimization algorithms, to identify the best course of action based on the data available.
The goal of prescriptive data is to help decision-makers make informed decisions that lead to
optimal outcomes.
2. Predictive – An analysis of likely scenarios of what might happen. The deliverables are usually a
predictive forecast.
Predictive data refers to data that is analyzed, processed and transformed using advanced
analytical algorithms and techniques to generate insights about future outcomes or trends.
It involves extracting valuable insights or patterns from historical data to develop a prediction
model that can be used to anticipate future outcomes or trends.
Predictive data is commonly used in financial forecasting, marketing analysis, risk management,
and supply chain optimization. It is becoming increasingly important for businesses to make
informed decisions and stay ahead of the competition.
3. Diagnostic – A look at past performance to determine what happened and why. The result of the
analysis is often an analytic dashboard.
Diagnostic data refers to information collected from a system, device or application to analyze
and troubleshoot technical issues or to gain insights into its performance.
This data is used by technicians, analysts and developers to identify problems, diagnose faults,
and optimize performance.
Diagnostic data can consist of logs, system files, event messages, error codes, network activity,
device drivers, user behavior patterns, and other data points that provide insights into the
behavior and health of the system, device or application.
It can be collected manually or automatically, and in some cases, may need the user's permission
to be accessed or collected.
4. Descriptive – What is happening now based on incoming data. To mine the analytics, you typically
use a real-time dashboard and/or email reports.
Descriptive data is a type of data that describes the characteristics, properties, or attributes of a
group or population.
It is used to summarize and describe the data that is collected from a sample or population.
Descriptive data can be presented in a variety of ways, including graphs, charts, tables, and other
visual aids.
This type of data is commonly used in statistical analysis, research studies, and surveys to provide a
summary of the data that has been collected.
Some examples of descriptive data include age, gender, income, education level, and geographic
location.
1. CSV (Comma-Separated Values): CSV is a text-based file format used to store tabular data. In this
format, the values are separated by commas, and each row represents a record. CSV files are easy to read
and edit but are not suitable for complex data structures.
2. JSON (JavaScript Object Notation): JSON is a lightweight text-based file format that is used to send and
receive data between servers and applications. JSON stores data in a hierarchical format and is easy to
read and edit. It is commonly used in web applications that use JavaScript.
3. AVRO: Avro is a binary file format that is compact and efficient. It is designed to support complex data
structures and provides schema evolution, which means that users can evolve data schemas without breaking
the existing data. Avro files are commonly used in Hadoop and other big data processing systems.
4. Parquet: Parquet is a columnar file format that stores data in a compressed format. It is optimized for big
data processing and is used for analytics and data warehousing applications. Parquet files can be read and
written by several different data processing systems.
5. ORC: ORC (Optimized Row Columnar) is a columnar file format similar to Parquet. It stores data in a
compressed format and supports schema evolution. ORC is often preferred over Parquet in certain clustering
solutions, due to its compression rate.
6. XML: XML (eXtensible Markup Language) is a text-based file format that is used to store hierarchical
data. XML is versatile and can represent any type of data, including structured, semi-structured, and
unstructured data. However, XML files can be large and slow to process.
Overall, the choice of file format depends on the specific requirements of the application. Factors to consider
include the size and complexity of the data, compatibility with different systems, and the performance and
efficiency of the file format.
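As a quick illustration, the same small table can be written in three of these formats with pandas (the Parquet call assumes the pyarrow or fastparquet package is installed; the data is made up):

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Ajeet", "Aryan"], "age": [24, 20]})

df.to_csv("students.csv", index=False)            # plain text, comma-separated
df.to_json("students.json", orient="records")     # hierarchical, text-based
df.to_parquet("students.parquet")                 # compressed, columnar binary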
1. Databases: Service bindings can be used to connect to different types of databases, including SQL and
NoSQL databases. This allows for the extraction of structured and unstructured data, respectively. Examples
of databases that can be accessed through service bindings include MySQL, MongoDB, Oracle, and
Cassandra.
2. Cloud Storage: Service bindings can also be used to access data stored in cloud storage systems, such as
Amazon S3, Microsoft Azure Blob storage, and Google Cloud Storage. This allows for the extraction of large
volumes of data that are stored in a distributed manner across multiple servers.
3. Social Media: Social media platforms like Twitter and Facebook generate large volumes of data that can
be accessed through service bindings. This data can be used to gain insights into customer behavior, sentiment
analysis, and to develop targeted marketing strategies.
4. Sensors and IoT Devices: The Internet of Things (IoT) has resulted in a proliferation of sensors and
connected devices that generate vast amounts of data. Service bindings can be used to extract data from
these devices, including temperature sensors, GPS trackers, and smart home devices.
5. Web Services: Finally, service bindings can be used to access data from web services, such as weather
APIs, Google Maps, and Amazon Alexa. This data can be used to build applications that are weather-
dependent, location-based, or that interface with voice assistants.
In conclusion, service bindings provide a versatile means of accessing data from a wide variety of sources in
Big Data. This data can be used to gain insights into customer behavior, improve business processes, and
develop new products and services.
Example:
One source of data using service bindings in big data is cloud-based applications such as Amazon Web
Services (AWS) or Microsoft Azure, which provide service bindings to access data stored in their cloud
platforms. Another source is third-party big data platforms such as Hadoop, Cassandra, or MongoDB, which
also offer service bindings to connect and retrieve data from their databases.
1. Distributed architecture: Big data platforms are designed to run on clusters of commodity hardware,
allowing them to scale horizontally to handle large amounts of data and processing power.
2. Data storage: Big data platforms provide a distributed file system that can store and manage large
amounts of unstructured and structured data. Examples include Hadoop Distributed File System
(HDFS), Apache Cassandra, and Apache HBase.
3. Data processing: Big data platforms provide distributed processing frameworks that can process
large amounts of data in parallel across a cluster of nodes. Examples include Hadoop MapReduce,
Apache Spark, and Apache Flink.
4. Data ingestion: Big data platforms provide tools for ingesting data from various sources such as
databases, file systems, and streaming data sources. Examples include Apache Flume, Apache Kafka,
and Apache Nifi.
5. Data integration: Big data platforms provide tools for integrating data from various sources and
formats. Examples include Apache Sqoop, Apache NiFi, and Talend.
6. Data analysis: Big data platforms provide tools for analyzing and querying data using SQL-like
languages or programming languages such as Python or R. Examples include Apache Hive, Apache
Pig, and Apache Drill.
7. Data visualization: Big data platforms provide tools for visualizing data to make it easier to
understand and communicate insights. Examples include Tableau, QlikView, and Apache Superset.
Overall, big data platforms provide a comprehensive set of features and tools for storing, processing,
integrating, analyzing, and visualizing large amounts of data. They are designed to handle the challenges of
big data, including scalability, fault tolerance, and heterogeneity of data sources and formats.
UNIT- 3 Introduction to Big Data Modeling and Management
Data Storage
Big data involves dealing with large volumes of data that cannot be stored and processed using
traditional methods. There are several types of storage solutions commonly used in big data:
1. Hadoop Distributed File System (HDFS) – HDFS is an open-source distributed file system designed
specifically to provide reliable and scalable storage for big data applications. It is the primary storage
solution in Hadoop framework and can store data in petabytes.
2. NoSQL Databases – NoSQL databases like MongoDB, Cassandra, and Couchbase are designed to
handle large volumes of unstructured data. They provide high scalability and performance for real-time
data processing.
3. Object Storage – Object storage solutions like Amazon S3, Google Cloud Storage, and Microsoft
Azure Blob Storage can store and manage large amounts of unstructured data. They are highly scalable
and provide durable storage with low latency access.
4. In-Memory Storage – In-memory storage solutions like Apache Ignite and Redis can store and
manipulate data in RAM, providing high-speed data processing and low latency access. They are
commonly used in applications that require real-time data analysis and processing.
5. Data Warehouses – Data warehouses like Amazon Redshift and Snowflake are designed to store
large volumes of structured data for analytics purposes. They provide fast query response times and
integrations with popular BI tools.
Overall, the choice of data storage solution in big data depends on the type of data, volume, and the
specific requirements of the application.
Data quality is an essential attribute for big data. Big data refers to huge amounts of data, which can
come from various sources, such as sensor data, social media, and transactional data.
In big data, data quality refers to the accuracy, consistency, completeness, and relevance of data.
Maintaining data quality in big data is crucial to ensure the accuracy of insights and conclusions derived
from the data analysis.
There are several challenges to achieving high data quality in big data, such as the following:
1. Data Variability: Big data sources have different data formats, structures, and quality standards,
making it challenging to integrate and validate data seamlessly.
2. Data Volume: Big data gets generated in massive quantities, making it impossible to implement
traditional data validation mechanisms.
3. Data Complexity: The data in big data can be highly complex, making it essential to have domain
experts validate and filter out data that may be irrelevant.
4. Data Latency: Big data in real-time scenarios can have a high volume with significant velocity, making
it necessary to prioritize data validation mechanisms.
To address these challenges, companies need robust data quality management strategies that include
automated data validation, cleansing, and enrichment techniques. The use of modern technologies like
machine learning and AI can also help in improving data quality in big data.
1. Accuracy: Accuracy is often measured by how well the values agree with an information source that is known to be correct.
2. Completeness: The data makes all required records and values available.
3. Consistency: Data values drawn from multiple locations do not conflict with each other, either across a record or message, or along all values of a single attribute.
4. Timeliness: Data is updated as frequently as necessary, including in real time, so that it stays current.
5. Validity: The data conforms to defined business rules and falls within allowable parameters.
6. Uniqueness: No record exists more than once within the data set, even if it exists in multiple
locations. Every record can be uniquely identified and accessed within the data set and across
applications.
Benefits of DataOps
Transitioning to a DataOps strategy can bring an organization the following benefits:
Provides more trustworthy real-time data insights.
Reduces the cycle time of data science applications.
Enables better communication and collaboration among teams and team members.
Increases transparency by using data analysis to predict all possible scenarios.
Builds processes to be reproducible and to reuse code whenever possible.
Ensures better quality data.
Creates a unified, interoperable data hub.
1. Data ingestion: This process is about bringing in data from different
sources like databases, IoT devices, social media, and others into the
big data platform.
5. Data Storage: The data storage aspect of big data operations involves storing the massive amounts
of data that big data applications usually deal with.
6. Data Retrieval: This operation involves retrieving data from the storage as per the requirements of
the application.
7. Data Security: Data security operations involve protecting data by implementing various security
measures like encryption, access controls, etc.
8. Data Governance: Data governance refers to the process of managing the overall data lifecycle,
including data quality, data lineage, and data policies.
Data ingestion is the process of collecting and importing large volumes of data from various sources
into a data storage system. In big data, data ingestion plays a crucial role in enabling data-driven
decision making by ensuring that the data is cleaned, processed, and ready for analysis.
There are several challenges associated with data ingestion in big data, including:
1. Data Velocity: The speed at which data is generated is increasing rapidly, and it is becoming more
challenging to ingest data in real-time or near real-time.
2. Data Variety: Big data comprises multiple data types, such as structured, semi-structured, and
unstructured data, which require different ingestion techniques.
3. Data Volume: The size of data is growing exponentially, and it is challenging to store and manage
such vast amounts of data.
4. Data Quality: Big data often includes data from multiple sources, which can be inconsistent or
contain errors, making it challenging to ensure data quality.
To overcome these challenges, organizations use various tools and technologies such as Apache Kafka,
Apache NiFi, Apache Flume, and AWS Glue, which facilitate data ingestion from multiple sources, clean
and process data, and prepare it for analysis. The ingestion process involves filtering, transforming,
and validating data to ensure its accuracy and consistency, which is crucial for making data-driven
decisions.
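As a concrete illustration of streaming ingestion with one of the tools named above, here is a minimal sketch using the kafka-python client to publish a JSON event to a hypothetical topic; it assumes a Kafka broker is reachable at localhost:9092:

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a made-up sensor reading to the "sensor-events" topic
producer.send("sensor-events", {"device_id": "d42", "temperature": 21.7})
producer.flush()   # block until buffered messages are actually delivered
producer.close()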
2. Data Extraction: Once the data sources have been identified, the data needs to be extracted from
those sources. This may involve data extraction from a variety of sources such as databases, social
media platforms, websites, sensors, and devices.
3. Data Transformation: After data has been extracted, it often needs to be transformed into a
format suitable for processing. This may involve data cleaning, normalization, and transformation into
a standard format.
4. Data Validation: Before data can be processed, it needs to be validated to ensure that it is accurate
and complete. This may involve data validation against predefined rules or patterns.
5. Data Load: Once data has been validated, it can be loaded into a big data system. This may involve
loading data onto a Hadoop Distributed File System (HDFS) or other big data platforms.
6. Data Processing: Once the data has been loaded into the big data system, it can be processed. This
may involve data analysis, machine learning, or other advanced processing techniques.
7. Data Storage: Finally, processed data needs to be stored in a manner that is accessible and usable
by other systems. This may involve storing data in a NoSQL or SQL database, or making it available via
REST APIs.
1. Hardware limitations: Traditional DBMS are strongly tied to the hardware they run on. As data
volume and user requests increase, the hardware may not be able to cope with the load, resulting in
slow performance or system crashes.
2. Architecture limitations: Traditional DBMS usually follow a centralized architecture, where all the
data is stored on a single server. As the number of users and data volume increase, this architecture
may become a bottleneck, affecting the system's scalability.
3. Cost limitations: Scaling traditional DBMS often requires significant investment in additional
hardware, licenses, and maintenance. This can make it difficult for small and mid-sized businesses to
scale their systems.
To overcome these limitations, some traditional DBMS vendors have introduced technologies such
as sharding, clustering, and replication. However, these techniques are often complex to implement
and require significant technical expertise.
Traditional DBMS (Database Management System) emphasizes data security as one of the
primary features. In a traditional DBMS, security is enforced through a combination of
authorization, authentication, and access control mechanisms.
Authorization: This mechanism ensures that users are only allowed to access the data that they are
authorized to access. It requires a username and password to gain access to the database, and each
user is granted specific privileges to access the database objects.
Authentication: Authentication is the process of verifying the identity of a user before granting
access to the database. In traditional DBMS, users are authenticated through a username and
password.
Access Control: Access control determines what users are allowed to do with the data. This mechanism
restricts access to database objects by specifying permission levels according to user roles or groups.
For example, an administrator may have full access to the database, while a user may only have access
to a limited set of data.
In addition to these security mechanisms, traditional DBMS also includes other security features such
as audit trails, encryption, and backups to prevent data theft and loss.
Overall, traditional DBMS ensures data security by implementing multiple security measures to protect
the database from unauthorized access, data theft, and loss.
2. Scalability: The traditional DBMS architecture supports scalability, enabling the system to handle
an increasing volume of data and users.
3. Security: Traditional DBMS systems offer robust security mechanisms, including authentication,
authorization, and encryption.
4. Query Optimization: Traditional DBMS systems provide sophisticated query optimization techniques
that enable users to retrieve data quickly and efficiently.
5. ACID Compliance: Traditional DBMS systems adhere to the ACID principle (Atomicity, Consistency,
Isolation, and Durability), ensuring that transactions are reliable and consistent.
10. Support: Traditional DBMS systems have been around for several decades and have a vast support
network of vendors, developers, and users, making them reliable and well-tested.
2. Limited scalability: Traditional DBMS systems can be difficult to scale as organizations grow,
requiring additional investment to maintain performance.
3. Complex management: Traditional DBMS systems require skilled professionals to manage and
maintain them, adding to the cost and complexity of operations.
4. Limited flexibility: Traditional DBMS systems can be inflexible, making it difficult to adapt to
changing business needs and requirements.
5. Data redundancy: Traditional DBMS systems can lead to data redundancy, which can impact data
quality and efficiency.
6. Security risks: Traditional DBMS systems are vulnerable to security risks such as hacking, malware,
and data breaches.
Big data management systems are software systems that are designed to efficiently and effectively
process and manage very large amounts of data. These systems are typically used in businesses or large
organizations where there is a need to store, process, and analyze large amounts of data on a daily
basis.
1. Hadoop: Hadoop is an open-source software framework that is designed to store and process very
large datasets across clusters of computers. It is highly scalable and can handle petabytes of data.
2. Apache Spark: Apache Spark is a powerful big data processing engine that is designed to perform
distributed processing of large datasets. It is highly efficient and can process data in real-time.
3. MongoDB: MongoDB is a NoSQL database that is designed for managing unstructured and semi-
structured data. It is highly scalable and can handle large volumes of data.
Big data management systems are essential for businesses and organizations that need to manage and
analyze large datasets to gain insights and make data-driven decisions. With the increasing volumes of
data being generated every day, these systems are becoming more important than ever before.
2. Speed: Big Data Management Systems are optimized for high-speed data processing, making it
possible to extract insights and analyze data in real-time.
3. Cost-Effective: Big Data Management Systems are cost-effective as they utilize commodity
hardware, which is less expensive than specialized hardware.
4. Improved Decision Making: Big Data Management Systems provide real-time analysis of large data
sets, enabling informed decision-making based on current and relevant data.
6. Data Security: Big Data Management Systems provide advanced security features, including robust
encryption, access controls, and auditing, ensuring the safety and privacy of sensitive data.
2. Complexity: Big data management systems are typically more complex than traditional data
management systems. They require sophisticated software and hardware configurations, as well as
advanced analytical skills to effectively manage, query, and analyze the data.
3. Security: Big data management systems are often targeted by cybercriminals seeking to steal
sensitive data or disrupt operations. This makes security a critical concern, and organizations must
invest in robust security measures to prevent data breaches and other types of cyberattacks.
1. Traditional data: Traditional data is used by all types of businesses, from very small firms to big organizations. In a traditional database system, a centralized database architecture is used to store and maintain the data in a fixed format or fields in a file. Structured Query Language (SQL) is used for managing and accessing the data.
2. Big data: We can consider big data an upgraded version of traditional data. Big data deals with data sets that are too large or complex to manage with traditional data-processing application software. It deals with large volumes of structured, semi-structured, and unstructured data. Volume, Velocity, Variety, Veracity, and Value are the 5 V characteristics of big data. Big data does not only refer to a large amount of data; it refers to extracting meaningful insight by analyzing that huge amount of data.
Traditional data vs. big data:
Traditional data is generated at the enterprise level; big data is generated outside the enterprise level as well.
Traditional data volume ranges from gigabytes to terabytes; big data volume ranges from petabytes to zettabytes or exabytes.
A traditional database system deals with structured data; a big data system deals with structured, semi-structured, and unstructured data.
Traditional data is generated per hour or per day; big data is generated much more frequently.
The traditional data source is centralized; the big data source is distributed.
The size of traditional data is very small; big data is much larger than traditional data in size.
Traditional database tools are enough to perform operations on traditional data; special kinds of database tools are required for big data.
Normal functions can manipulate traditional data; special kinds of functions are needed to manipulate big data.
The traditional data model is strict-schema based and static; the big data model is flat-schema based and dynamic.
Traditional data is stable with known interrelationships; big data is not stable and its relationships are unknown.
Traditional data is of manageable volume; big data is of such huge volume that it becomes unmanageable.
Traditional data is easy to manage and manipulate; big data is difficult to manage and manipulate.
Traditional data sources include ERP transaction data, CRM transaction data, financial data, and organizational data; big data sources include social media, device data, sensor data, video, images, audio, etc.
Data Model
Definition
A data model is an abstract model that organizes elements of data and standardizes how they relate
to one another and to the properties of real-world entities. For instance, a data model may specify that
the data element representing a car be composed of a number of other elements which, in turn,
represent the color and size of the car and define its owner.
Structure
A data model structure is a logical representation of the relationships between data elements in a
system or application. It defines how data is organized, stored, and retrieved in a database, and
provides a blueprint for database design. There are several types of data model structures, including:
1. Hierarchical model: This type of model represents data as a tree-like structure. Each record has a
parent record and zero or more child records.
2. Network model: In this model, data is organized as a collection of records or objects, and
relationships between them are represented as links or pointers.
3. Relational model: This is a widely used model where data is represented as tables with rows and
columns. The relationship between tables is defined by common fields.
4. Object-oriented model: This type of model represents data as a collection of objects, with each
object having a set of attributes and methods.
5. Document model: In this model, data is stored as nested documents or collections, with each
document having a unique identifier and key-value pairs.
6. Graph model: This model is used to represent complex networks, with data being stored as nodes
and edges.
The choice of data model structure depends on the specific needs of the application or system, as well
as the type of data being stored and the expected usage patterns.
1. Create: This operation allows users to add new data to the model. It involves defining the structure
and properties of the new data and assigning it to the appropriate entity in the model.
2. Read: This operation allows users to retrieve data from the model. It involves specifying the exact
data required by the user, and the model returns the requested data from the relevant entity.
3. Update: This operation allows users to modify existing data in the model. It involves making changes
to the properties of the data currently stored in the entity.
4. Delete: This operation allows users to remove data from the model. It involves identifying the data
to be deleted, and the model removes the data from the relevant entity.
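These four operations map directly onto SQL statements. A compact sketch using Python's built-in sqlite3 module, with an invented student table for illustration:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

# Create: add new data to the model
con.execute("INSERT INTO student (id, name, age) VALUES (?, ?, ?)", (1, "Ajeet", 24))

# Read: retrieve exactly the data requested
print(con.execute("SELECT name, age FROM student WHERE id = 1").fetchone())

# Update: modify properties of existing data
con.execute("UPDATE student SET age = 25 WHERE id = 1")

# Delete: remove identified data from the entity
con.execute("DELETE FROM student WHERE id = 1")
con.commit()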
1. Entity Integrity Constraint: Each entity in a database must have a unique identifier.
2. Referential Integrity Constraint: The relationship between entities must be consistent, and each
foreign key must match a primary key.
3. Domain Constraint: The values of an attribute of an entity must be from an allowed set of values.
4. Cardinality Constraint: The minimum and maximum number of relationships between entities must
be explicitly defined.
5. Structural Constraint: The structure of the data model must be consistent and well-defined.
6. Business Rule Constraint: The model must adhere to the business rules and requirements that govern
the organization.
7. Time Constraint: The data model must be designed to accommodate future changes and adjustments.
8. Execution Constraint: The model must be implementable given the technology and resources available
to the organization.
9. Security Constraint: The model must meet the security requirements of the organization, including
access control and privacy.
10. Usability Constraint: The model must be user-friendly, easy to navigate, and intuitive for users to
work with.
1. Structured data
Structured data follows a fixed schema. Consider a user table with columns such as ID, name, State, and ZIP code: the State column is limited to two uppercase letters, and the ID and ZIP code columns are limited to integers.
If you attempt to insert a record that does not fit this structure, the database will not allow it.
Structured big data is typically relational. This means that a record such as the user table above can
be linked to a record or records in another table. Let's say the user table is for a shopping cart, and
that Sara has two orders, and Sam hasn't ordered yet.
This type of static structure makes the data consistent and easy to enter, query, and organize. The
language used to query database tables like these is SQL (Structured Query Language). Using SQL,
developers can write queries that join the records in database tables in endless combinations based on
their relationships.
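As a rough illustration of such a join (not tied to any particular database product), the shopping-cart relationship described above could be expressed with Spark SQL from Scala; the table and column names here are assumptions:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("JoinExample").master("local[*]").getOrCreate()
import spark.implicits._
// Illustrative tables: users and their orders, linked by a user_id foreign key
val users = Seq((1, "Sara"), (2, "Sam")).toDF("user_id", "name")
val orders = Seq((101, 1, 25.0), (102, 1, 40.0)).toDF("order_id", "user_id", "total")
users.createOrReplaceTempView("users")
orders.createOrReplaceTempView("orders")
// A LEFT JOIN keeps Sam even though he has no orders yet
spark.sql("SELECT u.name, o.order_id, o.total FROM users u LEFT JOIN orders o ON u.user_id = o.user_id").show()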
The disadvantage of structured data is that updating the structure of a table can be a complex process.
A lot of thought must be put into table structures before you even begin using the database. This type of rigidity makes structured data less flexible than the other types of big data.
2. Unstructured data
Everything that is stored digitally is data. Unstructured data includes text, email, video, audio, server
logs, webpages, and on and on. Unlike structured and semi-structured data that can be queried and
searched in a consistent manner, unstructured data doesn't follow a consistent data model.
This means that instead of simply using queries to turn this data into useful information, a more specialized approach is needed. This is where machine learning, artificial intelligence, natural language processing, and optical character recognition (OCR) come into play.
One example of unstructured data is scanned receipts that are stored for expense reports. In their native image format, the data is essentially useless. Here, OCR software can turn the images into searchable text, although this typically requires custom processing. Advantages include the sheer variety of unstructured data that exists, as the insights gathered from it often can't be found in any other data source.
3. Semi-structured data
Semi-structured big data fits somewhere between structured and unstructured data. A common source of semi-structured big data is a NoSQL database. The data in a NoSQL database is organized, but it isn't relational and doesn't follow a consistent schema.
For example, a user record in a NoSQL database may look like this:
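(The record below is purely illustrative; the specific fields are assumptions.)
{
  "id": 1,
  "name": "Sara",
  "state": "CA",
  "orders": [
    { "order_id": 101, "total": 25.0 },
    { "order_id": 102, "total": 40.0 }
  ]
}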
Here, users access the data they need by the keys in the record. And while it looks similar to the
records in the structured data example above, it isn't in a consistent table format.
Instead, it's in JSON format, which is used to store and transmit data objects. While this one record in the database may have this set of attributes, it doesn't mean the rest of the records will have the same ones.
An advantage of semi-structured data stored in a NoSQL database is that it is very flexible. If you need to add more data to a record, simply add it with a new key. This can also be a disadvantage if you need every record to follow the same structure.
But NoSQL data isn't the only type of semi-structured big data. XML and YAML are two other flexible, semi-structured formats.
Email can also be considered semi-structured data since parts of it can be parsed consistently, such as
email addresses, time sent, and IP addresses, while the body is unstructured data.
In big data, a cluster refers to a group of interconnected computers or servers that work together
to process and store large amounts of data.
A cluster is typically made up of commodity hardware that is connected via a network, allowing the
resources of each node in the cluster to be pooled and used to handle the processing and storage
needs of large datasets.
The primary benefit of clustering in big data is that it allows for parallel processing of data across
multiple nodes, which can greatly reduce the time it takes to analyze and process large datasets.
Clusters can also provide high availability and fault tolerance, since the data is replicated across
multiple nodes in the cluster, and can be accessed even if one or more nodes fail.
Hadoop clusters are composed of a network of master and worker nodes that orchestrate and
execute the various jobs across the Hadoop distributed file system. The master nodes typically
utilize higher quality hardware and include a NameNode, Secondary NameNode, and JobTracker, with
each running on a separate machine.
The workers consist of virtual machines, running both DataNode and TaskTracker services on
commodity hardware, and do the actual work of storing and processing the jobs as directed by the
master nodes. The final part of the system is the client nodes, which are responsible for loading
the data and fetching the results.
Master nodes are responsible for storing data in HDFS and overseeing key operations, such
as running parallel computations on the data using MapReduce.
The worker nodes comprise most of the virtual machines in a Hadoop cluster, and perform the
job of storing the data and running computations. Each worker node runs the DataNode and
TaskTracker services, which are used to receive the instructions from the master nodes.
Client nodes are in charge of loading the data into the cluster. Client nodes first submit
MapReduce jobs describing how data needs to be processed and then fetch the results once
the processing is finished.
What is cluster size in Hadoop?
A Hadoop cluster size is a set of metrics that defines storage and compute capabilities to run
Hadoop workloads, namely :
Number of nodes : number of Master nodes, number of Edge Nodes, number of Worker
Nodes.
Configuration of each type of node: number of cores per node, RAM, and disk volume.
Hadoop clusters can boost the processing speed of many big data analytics jobs, given their
ability to break down large computational tasks into smaller tasks that can be run in a parallel,
distributed fashion.
Hadoop clusters are easily scalable and can quickly add nodes to increase throughput, and
maintain processing speed, when faced with increasing data blocks.
The use of low cost, high availability commodity hardware makes Hadoop clusters relatively
easy and inexpensive to set up and maintain.
Hadoop clusters replicate a data set across the distributed file system, making them resilient
to data loss and cluster failure.
Hadoop clusters make it possible to integrate and leverage data from multiple different
source systems and data formats.
Choose the Hardware: First, you need to choose the hardware for your Hadoop cluster. Hadoop is
designed to run on commodity hardware, so you don't need to invest in expensive hardware. You
should choose servers with sufficient processing power, memory, and disk space to handle the
workload of your Hadoop applications.
Install Hadoop: Next, you need to install Hadoop on your cluster. Hadoop consists of several
components, including the Hadoop Distributed File System (HDFS) and MapReduce, which are
responsible for storing and processing data, respectively. You can download Hadoop from the Apache
website and follow the installation instructions.
Configure Hadoop: Once Hadoop is installed, you need to configure it for your specific cluster. This
involves setting various configuration parameters, such as the amount of memory and disk space
allocated to Hadoop, the number of nodes in the cluster, and the replication factor for data stored in
HDFS.
Set up Networking: You need to ensure that all nodes in the cluster are connected to each other and
can communicate over the network. This involves configuring IP addresses, hostnames, and network
settings for each node.
Test the Cluster: After configuring the Hadoop cluster, you should test it to ensure that it is
working correctly. You can run sample Hadoop applications or use tools such as Hadoop Streaming to
process data on the cluster.
Scale the Cluster: Finally, you can scale the cluster by adding more nodes to handle larger workloads.
You can add new nodes to the cluster and configure them to join the existing cluster.
Big Data Integration
Big Data Integration refers to the process of combining, cleaning, and transforming large and
complex data sets from disparate sources to make it usable and useful for analysis and decision-
making purposes. It involves extracting data from multiple sources, integrating and transforming it,
and loading it into a target system such as a data warehouse or a data lake.
Suppose a company wants to analyze its sales data from multiple sources, including online sales, retail
stores, and call centers. These data sources store information in different formats, such as
structured data, semi-structured data, and unstructured data. The company needs to integrate these sources before it can analyze them:
1. The first step is to extract data from each of the sources, including data from the online
sales system, point-of-sale (POS) systems in retail stores, and call center logs. This data may
include customer information, sales transactions, product information, and other relevant
details.
2. Once the data has been extracted, it needs to be transformed and cleaned. This involves
converting the data into a common format, resolving data quality issues, and identifying any
duplicates or errors in the data. This is a crucial step, as it ensures that the data is consistent and accurate across all sources.
3. After the data has been transformed and cleaned, it is loaded into a data warehouse or a data
lake. The data warehouse is used for structured data, while the data lake is used for semi-
structured and unstructured data. The integrated data can now be used for analysis, reporting, and decision-making, as sketched below.
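A minimal sketch of such an extract-transform-load flow, using Spark from Scala (the file paths, formats, and column names are assumptions for illustration only):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().appName("SalesIntegration").master("local[*]").getOrCreate()
// Extract: read sales records from the different (assumed) sources
val online = spark.read.option("header", "true").csv("/data/online_sales.csv")
val retail = spark.read.json("/data/pos_sales.json")
// Transform: align the sources on a common set of columns, drop duplicates and bad rows
val combined = online.select("customer_id", "product_id", "amount")
  .unionByName(retail.select("customer_id", "product_id", "amount"))
  .dropDuplicates()
  .filter(col("amount").isNotNull)
// Load: write the integrated data to the (assumed) data lake location
combined.write.mode("overwrite").parquet("/lake/sales_integrated")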
In conclusion, Big Data Integration is a complex process that requires a combination of technical skills,
tools, and methodologies to bring together data from multiple sources and make it usable and useful for analysis and decision-making.
Big Data Integration has both advantages and disadvantages. Here are some of them:
Advantages:
1. Improved data quality: Integration enables data cleansing, validation, and standardization across sources.
2. Cost savings: Integration helps businesses save money by reducing duplication of data and storage.
3. Enhanced productivity: Integration streamlines data processes, reducing the time and effort needed to prepare data for analysis.
Disadvantages:
1. Complexity: Big Data Integration can be a complex process, requiring technical skills and specialized tools.
2. Security risks: Integration can create security risks, as it involves transferring data between multiple systems, increasing the risk of data breaches and other security threats.
3. Data inconsistency: Integration may result in data inconsistencies if data is not transformed and cleaned consistently.
4. Scalability: Integration can become more difficult and complex as data volumes increase.
Big Data Processing is the process of collecting, storing, processing, and analyzing large and complex
data sets that cannot be processed by traditional data processing systems. Big Data Processing
enables organizations to gain insights into their data, make better decisions, and create new business
opportunities.
Suppose a financial institution wants to analyze its customers' financial transactions to identify
potential fraudulent activities. The institution collects transaction data from multiple sources,
including ATM withdrawals, credit card transactions, and online transactions. The data includes details such as transaction amounts, timestamps, locations, and account identifiers.
1. The first step in Big Data Processing is to store the data in a distributed file system, such as
Hadoop Distributed File System (HDFS). The data is divided into smaller chunks and replicated across multiple nodes in the cluster.
2. The next step is to process the data using a processing framework, such as Apache Spark or
Apache Flink. The processing framework enables the institution to analyze the data in real-
time, identifying patterns and anomalies that may indicate fraudulent activities.
For example, the institution may use machine learning algorithms to analyze the transaction
data and identify patterns that are indicative of fraudulent activities, such as unusually large
transactions, transactions made in different locations at the same time, or transactions made at unusual times.
Once the processing is complete, the institution can store the results in a data warehouse or data
lake for further analysis and reporting. The results may be used to generate alerts for potential
fraudulent activities, identify new trends and patterns, and improve the institution's overall fraud
prevention system.
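As a rough sketch of the kind of check described above, Spark can group transactions by account and time window and flag suspicious combinations; the column names, paths, and thresholds below are illustrative assumptions:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().appName("FraudChecks").master("local[*]").getOrCreate()
// Assumed schema: account_id, tx_time (timestamp), location, amount
val tx = spark.read.parquet("/data/transactions")
// Flag accounts that transact from more than one location within ten minutes,
// or that have an unusually large single transaction in that window
val suspicious = tx
  .groupBy(col("account_id"), window(col("tx_time"), "10 minutes"))
  .agg(countDistinct("location").as("locations"), max("amount").as("max_amount"))
  .filter(col("locations") > 1 || col("max_amount") > 10000)
suspicious.show()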
In conclusion, Big Data Processing is a complex process that involves storing, processing, and
analyzing large and complex data sets. It enables organizations to gain insights into their data, make
better decisions, and create new business opportunities. With the right tools and expertise,
businesses can harness the power of Big Data Processing to transform their operations and gain a competitive advantage.
Advantages:
1. Scalability: Big Data Processing enables businesses to process large volumes of data quickly by scaling out across many machines.
2. Real-time processing: Big Data Processing enables real-time processing of data, allowing
businesses to respond to events as they happen and make better decisions in real-time.
3. Improved insights: Big Data Processing enables businesses to gain deeper insights into their
data, identifying patterns and trends that were not previously visible.
4. Cost savings: Big Data Processing can be less expensive than traditional data processing, as it can run on low-cost commodity hardware and open-source software.
Disadvantages:
1. Complexity: Big Data Processing can be complex, requiring specialized tools, technologies, and expertise.
2. Security risks: Big Data Processing involves collecting, storing, and processing large volumes of
data, increasing the risk of data breaches and other security threats.
3. Data quality issues: Big Data Processing can be prone to data quality issues if data is not properly cleaned and validated.
4. Lack of standardization: Big Data Processing involves data from multiple sources, which may use inconsistent formats and definitions.
Retrieving data query in Big Data involves querying and retrieving large volumes of data from
distributed file systems and NoSQL databases. Big Data processing frameworks like Apache Hadoop
and Apache Spark provide tools for querying Big Data using SQL-like languages or specialized query languages.
Here are some common methods for retrieving data query in Big Data:
1. Hive Query Language (HQL): Hive is a data warehousing tool that enables SQL-like querying
of Big Data stored in Hadoop Distributed File System (HDFS). HQL is a SQL-like language for writing queries against data stored in Hive tables.
2. Pig Latin: Pig is a high-level data processing language used for large-scale data processing on
Hadoop. Pig Latin is a language used to write scripts to transform and query large datasets in
Hadoop.
For example, the following Pig Latin script loads a dataset called "users" (the field list shown is illustrative) and retrieves all of its records:
users = LOAD 'users' USING PigStorage(',') AS (name:chararray, age:int, gender:chararray);
DUMP users;
3. Apache Spark SQL: Spark SQL is a data processing module in Apache Spark that enables
querying and processing of structured and semi-structured data using SQL-like queries.
For example, the following Spark SQL query retrieves all the records from a table called "transactions" where the amount is greater than 1000:
transactions.createOrReplaceTempView("transactions")
val result = spark.sql("SELECT * FROM transactions WHERE amount > 1000")
result.show()
In conclusion, retrieving data query in Big Data involves using specialized tools and languages to query
and retrieve large volumes of data from distributed file systems and NoSQL databases. The choice
of tool and language depends on the type of data being queried and the specific requirements of the
query.
Data retrieval in Big Data refers to the process of querying and retrieving large volumes of
structured and unstructured data stored in distributed file systems and NoSQL databases.
Retrieving data from Big Data requires specialized tools and techniques to process and analyze large datasets efficiently.
Here are some common methods used for data retrieval in Big Data:
1. Distributed File Systems: Distributed file systems like Apache Hadoop Distributed File
System (HDFS) store and manage large volumes of structured and unstructured data across
multiple nodes in a cluster. Data retrieval from HDFS involves using tools like Apache Hive or Apache Pig.
2. NoSQL Databases: NoSQL databases like MongoDB, Cassandra, and HBase store
unstructured data and provide flexible data models that enable querying of data using
specialized query languages like MongoDB Query Language (MQL) or Cassandra Query
Language (CQL).
3. Apache Spark: Apache Spark is a data processing framework that enables fast and efficient
processing of large volumes of data using distributed computing techniques. Spark provides
modules like Spark SQL, Spark Streaming, and Spark Machine Learning (ML) that enable querying, streaming analysis, and machine learning at scale.
4. Data Warehousing: Data warehousing tools like Amazon Redshift or Google BigQuery enable
querying and analysis of large volumes of structured data stored in cloud-based data
warehouses. These tools enable querying of data using SQL-like languages and provide scalable, managed infrastructure for analysis.
In conclusion, data retrieval in Big Data involves using specialized tools and techniques to query and
retrieve large volumes of structured and unstructured data stored in distributed file systems and
NoSQL databases. The choice of tool and technique depends on the type of data being retrieved and the requirements of the analysis.
Information integration in Big Data refers to the process of integrating data from multiple sources
and systems to provide a unified view of data. Big Data systems typically involve large volumes of
data from multiple sources, such as social media, IoT devices, and transactional systems, which may use different formats and structures.
Here are some common methods used for information integration in Big Data:
1. ETL (Extract, Transform, Load): ETL is a process that involves extracting data from various
sources, transforming it into a common format, and loading it into a target database or data
warehouse. ETL tools like Apache NiFi, Talend, and Pentaho enable integration of data from many different sources.
2. Data Virtualization: Data virtualization provides a unified, virtual view of data from multiple sources without physically copying the data into a central repository. Data virtualization tools like Denodo and Composite enable integration of data from multiple sources in real-time and on demand.
3. Master Data Management (MDM): MDM is a process that involves creating a single,
authoritative source of master data that can be used by various systems and applications.
MDM tools like Informatica and Talend enable integration of master data from multiple systems.
4. Schema-on-Read: This approach involves storing data in its raw format and applying a schema as it is read. This enables integration of data from multiple sources without the need for a predefined schema. Tools like Apache Drill and Apache Impala support this style of querying (see the sketch after this list).
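As an illustration of the schema-on-read idea (shown here with Spark rather than Drill or Impala; the file path is an assumption), the schema is inferred when the data is read instead of being declared in advance:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SchemaOnRead").master("local[*]").getOrCreate()
// No schema is defined up front: Spark infers it from the raw JSON as the data is read
val events = spark.read.json("/data/raw_events")
events.printSchema() // shows the inferred schema
events.createOrReplaceTempView("events")
spark.sql("SELECT * FROM events LIMIT 10").show()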
In conclusion, information integration in Big Data involves using specialized tools and techniques to
integrate data from multiple sources and systems to provide a unified view of data. The choice of
tool and technique depends on the type of data being integrated, the data structures and formats involved, and the requirements of the organization.
Examples of technologies available to integrate information include deduplication and string metrics,
which allow the detection of similar text in different data sources by fuzzy matching. A host of
methods for these research areas are available such as those presented in the International Society
of Information Fusion. Other methods rely on causal estimates of the outcomes based on a model of
the sources.
Big Data Processing pipelines
Big Data processing pipelines refer to a set of processes and technologies that are used to collect,
store, process, and analyze large volumes of data. These pipelines are designed to handle the
enormous volume, velocity, and variety of data generated by various sources such as social media, IoT devices, and enterprise systems. A typical pipeline includes the following stages:
1. Data ingestion: This is the first stage of the pipeline, where data is collected from various
sources such as sensors, applications, or web logs. The data may be ingested in real-time or
batch mode, and various technologies such as Apache Kafka, Apache NiFi, or Flume can be used for this purpose.
2. Data storage: Once data is ingested, it needs to be stored for further processing and
analysis. Various Big Data storage systems such as Hadoop Distributed File System (HDFS),
Amazon S3, or Azure Blob Storage can be used to store large volumes of data.
3. Data processing: In this stage, the data is processed to derive meaningful insights. The data
can be processed in real-time or batch mode using technologies such as Apache Spark, Apache Flink, or Hadoop MapReduce.
4. Data analysis: In this stage, the processed data is analyzed to derive insights using statistical methods, machine learning, and business intelligence tools.
5. Data visualization: The final stage of the pipeline involves presenting the insights obtained
from data analysis in an understandable format using various data visualization tools such as Tableau or QlikView.
In addition to the above stages, Big Data processing pipelines can include various other components
such as data validation, data cleansing, data transformation, and machine learning models for
predictive analysis.
In conclusion, Big Data processing pipelines are an essential part of Big Data systems that help
organizations collect, store, process, and analyze large volumes of data. These pipelines can be
customized based on the organization's specific requirements and the type of data being processed.
Analytical operations
Analytical operations in Big Data refer to the techniques used to extract insights from large and complex datasets. These operations involve the use of advanced analytical methods, machine learning algorithms, and statistical models to derive insights that can be used to guide business decisions.
Let's take an example of how analytical operations can be used in Big Data to improve the customer experience for an e-commerce company:
1. Descriptive analytics: The e-commerce company can analyze their historical transaction data
to understand which products are popular among customers and what factors influence their
purchasing decisions. This analysis can help the company identify customer segments based on their preferences and buying patterns.
2. Diagnostic analytics: The e-commerce company can use diagnostic analytics to identify the
factors that influence customer churn. By analyzing customer feedback and transaction data,
the company can identify patterns that lead to customer dissatisfaction and address these
issues.
3. Predictive analytics: The e-commerce company can use predictive analytics to forecast
customer demand for specific products, which can help them optimize their inventory and
pricing strategies. By analyzing customer behavior and transaction data, the company can
identify which products are likely to be popular in the future and adjust their supply
accordingly.
4. Prescriptive analytics: The e-commerce company can use prescriptive analytics to optimize
their marketing campaigns. By analyzing customer behavior and transaction data, the company
can identify which marketing channels are most effective for different customer segments and tailor their campaigns accordingly.
Overall, by applying these analytical operations to their Big Data, the e-commerce company can
improve their customer segmentation, reduce customer churn, and optimize their inventory, pricing, and marketing strategies.
Analytical operations are an essential part of Big Data processing pipelines. These operations enable
organizations to analyze large volumes of data and derive insights that can inform business decisions.
Some of the most common analytical operations used in Big Data processing pipelines include:
1. Data Cleaning: Data cleaning involves removing any inconsistencies, errors, or missing values
from the data. This process ensures that the data is accurate and reliable, which is essential for meaningful analysis.
2. Data Transformation: Data transformation involves converting data from one format to
another. This can include merging datasets, filtering data, and changing the structure of the data.
3. Data Aggregation: Data aggregation involves grouping data by specific attributes and
summarizing the data to identify patterns and trends. Common aggregation operations include
counting, summing, averaging, and finding the minimum or maximum value of a particular
attribute.
4. Data Visualization: Data visualization involves presenting data in a visual form to make it easier to understand and interpret. This can include charts, graphs, and heat maps,
among others.
5. Machine Learning: Machine learning involves using algorithms to analyze data and identify
patterns and trends that are not immediately apparent. This can include classification, clustering, and regression.
6. Predictive Analytics: Predictive analytics involves using statistical models and machine learning
algorithms to analyze historical data and make predictions about future events. This can include forecasting demand, predicting customer churn, or estimating risk.
7. Data Mining: Data mining involves analyzing large volumes of data to identify hidden patterns
and relationships. This can include association rules, sequential patterns, and anomaly detection.
These analytical operations are typically performed in a specific order as part of a Big Data
processing pipeline. For example, data cleaning and transformation are typically performed first to
ensure that the data is accurate and structured properly. Data aggregation and visualization are then
used to identify patterns and trends in the data, while machine learning, predictive analytics, and
data mining are used to generate insights that can inform business decisions.
Overall, analytical operations are a crucial part of Big Data processing pipelines, enabling organizations to turn raw data into actionable insights.
Aggregation operation
Aggregation operations in Big Data refer to the process of grouping and summarizing data to derive
insights and make decisions. Aggregation operations are commonly used in data analytics and business
intelligence applications to summarize large volumes of data into manageable and meaningful insights.
Suppose a retail company wants to analyze their sales data to identify trends and patterns in
customer behavior. They have a massive dataset that contains sales data for all their stores across different regions.
To perform aggregation operations on this dataset, the retail company can use tools like Apache
Spark, Apache Flink, or Hadoop MapReduce. These tools enable the company to perform aggregation
operations such as count, sum, average, minimum, and maximum, grouped by different dimensions such as region, store, and product category (a sketch follows the questions below).
For example, the company can use aggregation operations to answer questions such as:
What is the total revenue for each region, store, and product category?
What is the average sales price for each region, store, and product category?
Which regions or stores have the highest and lowest sales revenue?
What is the total number of products sold for each region, store, and product category?
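A sketch of how such questions could be answered with Spark's aggregation operations from Scala (the dataset path and column names are assumptions):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().appName("SalesAggregation").master("local[*]").getOrCreate()
// Assumed columns: region, store, category, quantity, revenue
val sales = spark.read.parquet("/data/sales")
// Total revenue, average sale value, and units sold per region, store, and product category
val summary = sales
  .groupBy("region", "store", "category")
  .agg(sum("revenue").as("total_revenue"),
       avg("revenue").as("avg_sale_value"),
       sum("quantity").as("units_sold"))
summary.orderBy(desc("total_revenue")).show()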
By answering these questions, the retail company can identify trends and patterns in customer
behavior, optimize their inventory and pricing strategies, and make data-driven decisions to improve their business performance.
In conclusion, aggregation operations are a crucial aspect of Big Data processing and can provide concise, actionable summaries of very large datasets.
The high-level operations in big data processing can be broadly categorized into three phases: data
ingestion, data processing, and data analysis. Here is an overview of each phase:
1. Data Ingestion: The first phase of big data processing is data ingestion. In this phase, data is
acquired from various sources such as sensors, logs, databases, social media, and other data
feeds. The data is then ingested into the big data platform, which can be a Hadoop cluster or
a cloud-based storage system. The data ingestion process involves data cleansing, data
normalization, and data validation to ensure that the data is accurate and reliable.
2. Data Processing: The second phase of big data processing is data processing. In this phase,
the data is processed using various techniques such as MapReduce, Spark, and Hive to extract
useful information from the data. This phase also involves data transformation, data
enrichment, and data aggregation to prepare the data for analysis. The data processing phase can run in batch mode or in real time, depending on the use case.
3. Data Analysis: The final phase of big data processing is data analysis. In this phase, the
processed data is analyzed to derive insights and make informed decisions. Data analysis
techniques include data visualization, statistical analysis, machine learning, and predictive
modeling. The results of the data analysis are used to identify trends, patterns, and anomalies
in the data, which can be used to optimize business operations, improve customer satisfaction, and increase revenue.
Overall, big data processing involves a complex series of operations that require significant
computational resources and expertise. By following these high-level operations, organizations can
leverage the power of big data to gain a competitive advantage and improve their business outcomes.
Big data workflow refers to the series of steps and processes involved in managing, processing, and
analyzing large volumes of data. Here is an example of a typical big data workflow:
1. Data Collection: The first step in the big data workflow is to collect data from various
sources. This can include structured data from databases, semi-structured data from log
files, and unstructured data from social media platforms. For example, a retail company may
collect data on customer purchases, social media activity, and website traffic.
2. Data Ingestion: Once data is collected, it needs to be ingested into a data storage system
such as Hadoop Distributed File System (HDFS) or Amazon S3. This involves converting the
data into a format that can be processed and analyzed by big data tools. For example, a retail
company may use Apache Kafka to ingest customer purchase data in real-time.
3. Data Processing: The next step in the big data workflow is to process the data using big data
tools such as Apache Spark or Hadoop MapReduce. This involves filtering, cleaning, and
transforming the data to prepare it for analysis. For example, a retail company may use
Apache Spark to filter out invalid purchase transactions and clean the data.
4. Data Analysis: Once data is processed, it can be analyzed using various analytical tools such as
Apache Hive or Apache Pig. This involves querying the data to extract insights and patterns
that can be used to make business decisions. For example, a retail company may use Apache
Hive to query customer purchase data and identify popular product categories.
5. Data Visualization: The final step in the big data workflow is to visualize the results of data
analysis using tools such as Tableau or QlikView. This involves creating charts, graphs, and
dashboards that can be used to communicate insights to stakeholders. For example, a retail
company may use Tableau to create a dashboard that displays customer purchase trends and
insights.
Overall, the big data workflow is a complex process that involves multiple steps and tools. By
following a structured approach to data management and analysis, businesses can leverage the power of big data to make better decisions.
Big data management refers to the process of organizing, storing, and processing large volumes of
data efficiently and effectively. Here is an example of how big data management can be applied in a
real-world scenario:
Imagine a healthcare company that collects data from multiple sources, such as electronic medical
records, insurance claims, and medical devices. The company wants to use this data to improve patient outcomes and reduce healthcare costs.
To manage this data, the company would first need to identify the different data sources and the
types of data they collect. This could include patient demographics, medical histories, lab results, and
medication records. The company would then need to determine how to store this data in a way that is scalable, secure, and easy to query.
One approach to storing and managing this data could be to use a big data platform such as Hadoop.
Hadoop provides a distributed file system and processing frameworks that can store and process large volumes of data across multiple
servers. The company could use Hadoop to store the different types of data in separate data sets or
"tables", and use a tool like Apache Spark or Hive to query and analyze the data.
To ensure data quality, the company could implement data cleaning and validation processes to remove
or correct any errors or inconsistencies in the data. They could also use data governance policies and
procedures to ensure that the data is secure, compliant, and accessible to authorized users only.
Once the data is managed and organized, the company could use analytics tools and techniques to gain
insights into patient outcomes and healthcare costs. For example, they could use machine learning
algorithms to predict patient risk factors, identify patterns in disease prevalence, or optimize
treatment plans.
Overall, big data management is a critical aspect of any organization that wants to leverage the
power of big data. By implementing best practices in data storage, processing, and analysis,
businesses can gain valuable insights and make informed decisions that drive business success.
JULY:
Hadoop is an open-source framework for the distributed storage and processing of large data sets. It was originally developed by the Apache Software Foundation and is now maintained and developed as an open-source Apache project.
Hadoop is designed to handle big data, which refers to data that is too large or complex for
traditional data processing systems to handle. It achieves this by breaking down large data
sets into smaller chunks and distributing them across multiple servers or nodes in a cluster.
Components of Hadoop
Hadoop is a framework that uses distributed storage and parallel processing to store and
manage Big Data. It is the most commonly used software to handle Big Data. There are three
components of Hadoop.
1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
Data is stored in a distributed manner in HDFS. There are two components of HDFS - name node and
data node. While there is only one name node, there can be multiple data nodes.
HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise version of a server costs roughly $10,000 per terabyte; if you need to buy 100 of these enterprise servers, the cost quickly runs into the millions.
Hadoop enables you to use commodity machines as your data nodes. This way, you don't have to spend millions of dollars just on your data nodes. However, the name node is always an enterprise server.
HDFS is not a good fit for every workload:
o Low-latency data access: Applications that need very fast access to the first record should not use HDFS, because it is optimized for high throughput on the whole dataset rather than the time to fetch the first record.
o Lots of small files: The name node keeps the metadata of all files in memory, so a huge number of small files consumes an impractical amount of name node memory.
o Multiple writers: HDFS is not suitable when data must be written by multiple writers or modified arbitrarily; files are written once and read many times.
HDFS Concepts
1. Blocks: A block is the minimum amount of data that HDFS can read or write.
HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a traditional file system, a file in HDFS that is smaller than a block does not occupy a full block's worth of storage; for example, a 5 MB file stored with a 128 MB block size takes only 5 MB of space.
The HDFS block size is kept large to minimize the cost of seeks relative to the time spent transferring data.
2. Name Node: HDFS works in a master-worker pattern in which the name node acts as the master.
The name node is the controller and manager of HDFS: it knows the status and the metadata of all the files in HDFS, the metadata being file permissions, names, and the location of each block. The metadata is small, so it is kept in the memory of the name node, allowing fast access. Moreover, because the HDFS cluster is accessed by many clients concurrently, all of this information is handled by a single machine. File system operations such as opening, closing, and renaming are executed by the name node.
3. Data Node: Data nodes store and retrieve blocks when told to by clients or the name node. They report back to the name node periodically with the list of blocks they are storing. Being commodity hardware, a data node also performs block creation, deletion, and replication as instructed by the name node.
If the name node fails, the file system cannot be used, as there would be no way of knowing how to reconstruct the files from the blocks on the data nodes.
Secondary Name Node: This is a separate physical machine that acts as a helper to the name node. It performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata, which can be used to help rebuild the name node's state.
HDFS (Hadoop Distributed File System) is designed to store large files efficiently and reliably
on a distributed system. It provides several basic file operations for managing files in HDFS.
1. Creating a file: HDFS allows creating a file using a command-line interface or through an API.
When a file is created, it is divided into smaller blocks, typically 64MB or 128MB, and each block is
replicated across multiple nodes in the cluster for fault tolerance. The file blocks are stored in the
DataNodes, while the metadata related to the file is stored in the NameNode.
2. Copying a file:
To copy a file from a local file system to HDFS, a command such as the following can be used (the paths are illustrative):
hdfs dfs -copyFromLocal localfile.txt /user/hadoop/
This command copies the file from the local file system into the specified HDFS directory.
3. Reading a file: Once a file is created, it can be read using standard read operations in Java.
HDFS supports both sequential and random access to files. However, random access is slower due to
the need to perform several network hops to locate the required block.
For example (illustrative path):
hdfs dfs -cat /user/hadoop/localfile.txt
This command displays the contents of the specified file in the console.
4. Writing to a file: Once a file is created, it can be written using standard write operations in
Java. HDFS supports both sequential and random access to files. However, random access is slower
due to the need to perform several network hops to locate the required block.
For example (illustrative paths):
hdfs dfs -appendToFile moredata.txt /user/hadoop/localfile.txt
This command appends the contents of the local file to the end of the file in HDFS.
5. Deleting a file: HDFS allows deleting files and directories using a command-line interface or
through an API. Once a file is deleted, all its replicas across the cluster are also deleted.
6. Renaming a file: HDFS allows renaming files and directories using a command-line interface or through an API, for example (illustrative paths):
hdfs dfs -mv /user/hadoop/oldname.txt /user/hadoop/newname.txt
This command renames the specified file to the new file name.
7. Setting permissions: HDFS allows setting access permissions on files and directories, similar to Unix-style file permissions. The permissions determine the level of access allowed for users and groups.
Features of HDFS
o Highly Scalable - HDFS is highly scalable as it can scale hundreds of nodes in a single cluster.
o Replication - Due to some unfavorable conditions, the node containing the data may be lost.
So, to overcome such problems, HDFS always maintains the copy of data on a different
machine.
o Fault tolerance - In HDFS, the fault tolerance signifies the robustness of the system in the
event of failure. The HDFS is highly fault-tolerant that if any machine fails, the other
machine containing the copy of that data automatically becomes active.
o Distributed data storage - This is one of the most important features of HDFS that makes
Hadoop very powerful. Here, data is divided into multiple blocks and stored into nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.
Goals of HDFS
o Handling hardware failure - An HDFS cluster contains many server machines, and HDFS aims to detect failures and recover from them quickly.
o Streaming data access - Applications that run on HDFS need streaming access to their data sets; HDFS is designed for high-throughput batch reads rather than general-purpose interactive use.
o Coherence model - Applications that run on HDFS follow a write-once-read-many approach: once created, a file need not be changed, although it can be appended to or truncated.
Hadoop MapReduce
Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is
done at the worker (slave) nodes, and the final result is sent to the master node. Rather than moving the data to the computation, the code that processes the data is sent to the nodes where the data resides. This code is usually very small in comparison to the data itself: you only need to send a few kilobytes' worth of code to perform a heavy-duty process across many computers.
The input dataset is first split into chunks of data. In this example, the input has three lines of text
with three separate entities - “bus car train,” “ship ship train,” “bus ship car.” The dataset is then
split into three chunks, based on these entities, and processed parallelly.
In the map phase, each word is emitted as a key with a value of 1; the first chunk, for example, produces (bus, 1), (car, 1), and (train, 1).
These key-value pairs are then shuffled and sorted together based on their keys. At the reduce
phase, the aggregation takes place, and the final output is obtained.
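The same word-count idea, sketched with Spark's Scala API rather than the classic Java MapReduce interface (the three input lines are taken from the example above):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("WordCount").master("local[*]").getOrCreate()
val lines = spark.sparkContext.parallelize(Seq("bus car train", "ship ship train", "bus ship car"))
// Map: emit (word, 1) for every word; Reduce: sum the counts for each key
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println) // e.g. (ship,3), (bus,2), (car,2), (train,2)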
Key features of Hadoop MapReduce include:
1. Data locality: Hadoop MapReduce is designed to process data where it is stored. This means that the computation is moved to the node where the data resides, rather than moving the data to the code. This reduces network traffic and can improve performance significantly.
2. Simplified programming model: Hadoop MapReduce provides a simplified programming model that
allows developers to write code that is easy to read and maintain. It abstracts away many of the
complexities of distributed computing, allowing developers to focus on the business logic of their
application.
GOALS
1. Scalability: MapReduce jobs run across the many nodes of a Hadoop cluster. This allows for the processing of large data sets in a distributed and parallel manner.
2. Fault tolerance: MapReduce is designed to recover from node failures or other system failures. If a node fails, Hadoop can automatically reassign its tasks to other healthy nodes.
3. Flexibility: Hadoop MapReduce is flexible and can be used for a wide range of data processing tasks.
Hadoop YARN
Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of
Hadoop and is available as a component of Hadoop version 2.
Hadoop YARN acts like an operating system for Hadoop. It is not a file system; it is a resource-management layer that sits on top of HDFS.
It is responsible for managing cluster resources to make sure you don't overload one machine.
It performs job scheduling to make sure that jobs are scheduled on the right nodes at the right time.
Suppose a client machine wants to do a query or fetch some code for data analysis. This job request
goes to the resource manager (Hadoop Yarn), which is responsible for resource allocation and
management.
In the node section, each of the nodes has its node managers. These node managers manage the
nodes and monitor the resource usage in the node. The containers contain a collection of physical
resources, which could be RAM, CPU, or hard drives. Whenever a job request comes in, the app
master requests containers from the node manager. Once the node manager has the resources, it launches the container and the job's tasks run inside it.
Apache Hadoop YARN (Yet Another Resource Negotiator) is a resource management platform that is
responsible for managing the resources (CPU, memory, and disk) of a Hadoop cluster. Its key
features include:
1. Scalability: Hadoop YARN is designed to scale to large clusters with thousands of nodes while managing resources efficiently.
2. Flexibility: Hadoop YARN supports a variety of processing models, including batch processing, interactive queries, and stream processing.
3. Fault tolerance: Hadoop YARN is designed to be fault-tolerant, meaning that it can recover from failures of individual nodes or application masters.
4. Security: Hadoop YARN provides security features, such as authentication and authorization, to control access to cluster resources.
Real-time analytics refers to the process of analyzing data as it is generated or received, in order to
provide immediate insights and actionable intelligence. Real-time analytics is used in many industries,
including finance, healthcare, retail, and manufacturing, where acting on data in real time can significantly affect outcomes.
1. Fraud Detection: Real-time analytics can be used to detect fraud in financial transactions. For example, credit card companies use real-time analytics to detect fraudulent transactions by analyzing transaction patterns as they occur. When a potentially fraudulent transaction is detected, the credit card company can immediately flag the transaction or block the card.
2. Personalized Recommendations: Real-time analytics is also widely used in retail and e-commerce. For example, online retailers can use real-time analytics to analyze customer browsing behavior and provide personalized product recommendations in real-time.
There are several technologies that are used for real-time analytics, including:
Stream Processing: Stream processing is a technology that allows real-time processing of large
volumes of data as it is generated. Stream processing systems typically use distributed computing to
analyze data streams in real-time, and can be used to detect anomalies, identify patterns, and
generate alerts. Some popular stream processing frameworks include Apache Flink, Apache Kafka Streams, and Spark Streaming.
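A minimal Spark Structured Streaming sketch in Scala illustrating the idea of computing over a live stream (the built-in "rate" source is used here only to generate test events; in practice the source would be something like Kafka):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().appName("StreamDemo").master("local[*]").getOrCreate()
// The "rate" source continuously emits (timestamp, value) rows, standing in for a real event stream
val events = spark.readStream.format("rate").option("rowsPerSecond", "100").load()
// Count events per 10-second window as they arrive
val counts = events.groupBy(window(col("timestamp"), "10 seconds")).count()
val query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()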
In-Memory Databases: In-memory databases are databases that store data in memory rather than
on disk. This enables faster access to data and allows for real-time analysis of large volumes of data.
In-memory databases are often used in conjunction with stream processing systems to provide real-
time analytics capabilities. Some popular in-memory databases include Apache Ignite, MemSQL, and
SAP HANA.
Machine Learning: Machine learning algorithms can be used for real-time analytics by analyzing data
streams in real-time and identifying patterns and anomalies. Machine learning algorithms can be used
to build predictive models, detect fraud, and optimize business processes in real-time. Some popular choices include Spark's machine learning library, MLlib.
Complex Event Processing: Complex Event Processing (CEP) is a technology that enables the
detection of complex patterns and relationships in real-time data streams. CEP systems can be used
to analyze large volumes of data in real-time and identify meaningful events and patterns. CEP is
often used in industries such as finance and healthcare to detect fraud. Some popular CEP implementations include Apache Flink's CEP library.
MapReduce is a programming model that allows developers to process large volumes of data in a
distributed computing environment. It is commonly used for processing Big Data and is a core
component of many Big Data processing frameworks, such as Apache Hadoop, Apache Spark, and
Amazon EMR.
Developing a MapReduce application typically involves the following steps:
1. Defining the problem: The first step is to clearly define the problem that needs to be solved. This involves understanding the input data, the processing that must be applied to it, and the expected output.
2. Designing the MapReduce architecture: Once the problem has been defined, the next step is
to design the MapReduce architecture. This involves determining how the input data will be
divided into smaller chunks (known as "input splits"), how these input splits will be processed in
parallel by different nodes in the cluster, and how the results will be combined to generate the final output.
3. Implementing the MapReduce program: The next step is to implement the MapReduce program
using a programming language such as Java, Python, or Scala. The MapReduce program typically
consists of two main functions: the map function and the reduce function.
The map function takes an input record and generates a set of key-value pairs as output. The key-
value pairs are then shuffled and sorted based on the key, and passed to the reduce function.
The reduce function takes a set of key-value pairs as input and generates a single output value for
each key. The output values from the reduce function are then combined to generate the final output.
4. Testing the MapReduce application: After the MapReduce application has been implemented, it
should be tested to ensure that it works correctly. This involves running the application on a small, representative dataset and verifying that the output is correct.
5. Deploying the MapReduce application: Once the MapReduce application has been tested, it can be deployed to a production cluster, such as a Hadoop or Amazon EMR cluster. The application can then be run on a large dataset to process the data in parallel across the cluster.
In conclusion, developing a MapReduce application involves several steps, including defining the
problem, designing the MapReduce architecture, implementing the MapReduce program, testing the
application, and deploying the application to a distributed computing environment. By following these
steps, developers can build efficient and scalable applications for processing Big Data.