
BIG DATA

Introduction
Big data is exactly what the name suggests: a "big" amount of data. Big Data refers to data sets that are large in volume and more complex in structure. Because of the large volume and higher complexity of Big Data, traditional data processing software cannot handle it. Big Data simply means datasets containing large amounts of diverse data, both structured and unstructured. Big Data allows companies to address the issues they face in their business and to solve these problems effectively using Big Data Analytics.

What are the 5 Vs of Big Data / characteristics of Big Data?


Or
Nature of data
Big Data contains a large amount of data that cannot be processed by traditional data storage or processing units. It is used by many multinational companies to process data and run the business of many organizations.
The data flow would exceed 150 exabytes per day before replication. There are five V's of Big Data that explain its characteristics.

1. Volume: The name Big Data itself is related to enormous size. Big Data is the vast 'volume' of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more. For example, Facebook generates approximately a billion messages per day, records about 4.5 billion clicks of the "Like" button, and receives more than 350 million new posts each day. Big data technologies can handle such large amounts of data.
2. Variety: Big Data can be structured, unstructured, or semi-structured, collected from many different sources. In the past, data was collected only from databases and spreadsheets, but these days data arrives in an array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.

The data is categorized as below:


a. Structured data: Structured data follows a fixed schema, along with all the required columns, and is in tabular form. Structured data is stored in a relational database management system.
b. Semi-structured: In semi-structured data, the schema is not strictly defined, e.g., JSON, XML, CSV, TSV, and email. OLTP (Online Transaction Processing) systems are built to work with semi-structured data. It is stored in relations, i.e., tables.
c. Unstructured Data: All unstructured files, such as log files, audio files, and image files, are included in unstructured data. Some organizations have a lot of data available, but they do not know how to derive value from it because the data is raw.
d. Quasi-structured Data: This format contains textual data with inconsistent formats that can be structured with some effort, time, and tools.
Example: web server logs, i.e., a log file created and maintained by a server that contains a list of activities.
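To make the four categories above concrete, here is a minimal Python sketch (not part of the original notes) showing how each kind of data is typically read; the file names are hypothetical.

```python
# Hypothetical files used only to illustrate the data categories above.
import json
import pandas as pd

# Structured data: fixed schema, rows and columns (e.g., an RDBMS export)
students = pd.read_csv("students.csv")

# Semi-structured data: the schema is implicit in the document itself
with open("order.json") as f:
    order = json.load(f)          # nested keys, no fixed table layout

# Unstructured / quasi-structured data: raw text that must be parsed first
with open("server.log") as f:
    raw_log = f.read()

print(students.shape, list(order.keys()), len(raw_log))
```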

3. Veracity: Veracity refers to how reliable the data is and how far it can be trusted, filtered, or translated. It is about being able to handle and manage data quality efficiently, which is also essential to business development.
For example, Facebook posts with hashtags.

4. Value
Value is an essential characteristic of big data. It is not just any data that we process or store; it is valuable and reliable data that we store, process, and analyze.

5. Velocity
Velocity plays an important role compared to the other characteristics. Velocity refers to the speed at which data is created, often in real time. It covers the speed of incoming data sets, the rate of change, and activity bursts. A primary aspect of Big Data is to deliver the demanded data rapidly.
Big data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

The 6th V
6. Variability:
 How fast, and to what extent, is the structure of your data changing?
 How often does the meaning or shape of your data change?
 Example: it is like eating the same ice cream daily while the taste keeps changing.
Big data is not only huge in amount but also has a lot of variation in it. In one device or place the data is small and simple, whereas in another it is large and complex. For example, for some people collecting magazines or books is a passion: they do not want to sell them even after reading them many times, while others buy a book, read it, and then sell it. The same is the case with big data: in some places it is simple, in others it is complex. It just depends.

Why is big data important?


 Companies use big data in their systems to improve operations, provide better customer service, create
personalized marketing campaigns and take other actions that, ultimately, can increase revenue and
profits.
 Businesses that use it effectively hold a potential competitive advantage over those that don't because
they're able to make faster and more informed business decisions.
 For example, big data provides valuable insights into customers that companies can use to refine their
marketing, advertising and promotions in order to increase customer engagement and conversion rates. Both
historical and real-time data can be analyzed to assess the evolving preferences of consumers or corporate
buyers, enabling businesses to become more responsive to customer wants and needs.

 Here are some more examples of how and where big data is used by organizations:
 In the energy industry, big data helps oil and gas companies identify potential drilling locations and
monitor pipeline operations; likewise, utilities use it to track electrical grids.
 Financial services firms use big data systems for risk management and real-time analysis of market data.
 Manufacturers and transportation companies rely on big data to manage their supply chains and optimize
delivery routes.
 Governments use big data for emergency response, crime prevention and smart city initiatives.

Applications of Big Data


 The term Big Data refers to large amounts of complex and unprocessed data.
 Nowadays, companies use Big Data to make business more informed and to support business decisions by enabling data scientists, analytical modellers and other professionals to analyse large volumes of transactional data.

a) Travel and Tourism: The travel and tourism industry is a major user of Big Data. It enables companies to forecast travel facility requirements at multiple locations, improve business through dynamic pricing, and much more.
b) Financial and banking sector: The financial and banking sectors use big data technology extensively. Big data analytics helps banks understand customer behaviour on the basis of investment patterns, shopping trends, motivation to invest, and inputs obtained from personal or financial backgrounds.

c) Healthcare
Big data has started making a massive difference in the healthcare sector by supporting medical professionals and healthcare personnel with predictive analytics. It can also deliver personalized healthcare to individual patients.

d) Telecommunication and media


Telecommunications and the multimedia sector are among the main users of Big Data. Zettabytes of data are generated every day, and handling such large-scale data requires big data technologies.

e) Government and Military


We see the figures that the government keeps on record. In the military, a fighter plane needs to process petabytes of data.
Government agencies use Big Data to run many agencies, manage utilities, deal with traffic jams, and tackle the effects of crime such as hacking and online fraud.
 Aadhar Card: The government has records of 1.21 billion citizens. This vast data is analyzed and stored to find things like the number of youth in the country. Some schemes are built to target the maximum population. Big data cannot be stored in a traditional database, so it is stored and analyzed using Big Data Analytics tools.

f) E-commerce
 Amazon: Amazon is a tremendous e-commerce website dealing with a lot of traffic daily. But when there is a pre-announced sale on Amazon, traffic increases so rapidly that it may crash the website. So, to handle this type of traffic and data, Amazon uses Big Data. Big Data helps in organizing and analyzing the data for future use.

Big Challenges with Big Data


The challenges in Big Data are real implementation hurdles. They require immediate attention and need to be handled, because if they are not, the technology may fail, which can also lead to unpleasant results. Big data challenges include storing and analyzing extremely large and fast-growing data.
Some of the Big Data challenges are:
1. Sharing and Accessing Data:
 Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets from external
sources.
 Sharing data can cause substantial challenges.
 It includes the need for inter- and intra-institutional legal documents.
 Accessing data from public repositories leads to multiple difficulties.
 It is necessary for the data to be available in an accurate, complete and timely manner, because if the data in a company's information system is to be used to make accurate and timely decisions, it must be available in this form.

2. Privacy and Security:


 This is another very important challenge with Big Data. It has sensitive, conceptual, technical as well as legal dimensions.
 Most organizations are unable to maintain regular checks due to the large amounts of data generated. However, security checks and monitoring should be performed in real time, because that is when they are most beneficial.
 Some information about a person, when combined with large external data sets, may reveal facts that the person considers private and would not want the data owner to know.
 Some organizations collect information about people in order to add value to their business. This is done by generating insights into people's lives that they themselves are unaware of.

3. Analytical Challenges:
 There are some huge analytical challenges in big data, which raise questions such as: how do you deal with a problem if the data volume gets too large?
 Or how do you find the important data points?
 Or how do you use the data to the best advantage?
 The large amounts of data on which this analysis is done can be structured (organized data), semi-structured (semi-organized data) or unstructured (unorganized data). There are two techniques through which decision making can be done:
 Either incorporate massive data volumes into the analysis,
 Or determine upfront which Big Data is relevant.

4. Technical challenges:
 Quality of data:
 Collecting and storing a large amount of data comes at a cost. Big companies, business leaders and IT leaders always want large data storage.
 For better results and conclusions, Big Data focuses on quality data storage rather than on storing irrelevant data.
 This further raises the questions of how to ensure that the data is relevant, how much data is enough for decision making, and whether the stored data is accurate or not.

 Fault tolerance:
 Fault tolerance is another technical challenge; fault-tolerant computing is extremely hard, involving intricate algorithms.
 New technologies like cloud computing and big data are designed so that whenever a failure occurs, the damage done stays within an acceptable threshold, i.e., the whole task does not have to start again from scratch.
 Scalability:
 Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led towards cloud computing.
 It leads to various challenges, such as how to run and execute various jobs so that the goal of each workload can be achieved cost-effectively.
 It also requires dealing with system failures in an efficient manner. This again raises the big question of what kinds of storage devices are to be used.

Dimensions of Scalability
1. Volume Scalability: This refers to the ability of a system to handle large volumes of data. Big Data systems
need to be able to store and process massive volumes of data, often in the order of petabytes or exabytes.
2. Velocity Scalability: This refers to the speed at which data is generated, processed, and analyzed. Big data
systems need to be able to cope with the fast-paced nature of data processing, typically in real-time or
near-real-time.
3. Variety Scalability: This refers to the ability of a system to handle different types of data, including
structured and unstructured data, as well as semi-structured data. Big data systems need to be able to
handle a wide range of data formats and sources.
4. Veracity Scalability: This refers to the ability of a system to handle the accuracy and reliability of data. Big
data systems need to be able to manage the quality of data, including identifying and handling data errors
or inconsistencies.
Getting Value out of Big Data
To get value out of big data, businesses need to follow certain steps:
1. Identify the objective: Businesses must identify the goal they want to achieve before sifting
through the data. This will help them focus on specific metrics to be analyzed.
2. Gather the right data: Big data is not just about large quantities of data, it's about the right data.
Businesses must ensure they have the relevant data to achieve their objective.
3. Analyze the data: Businesses must apply advanced analytics and data mining techniques to extract
insights from the data.
4. Visualize the data: Visualization is the best way to make data accessible to everyone in the
organization. Interactive visualizations can make it easier for everyone to understand the data.
5. Act on the insights: Businesses must use the insights gained from the data to make informed
decisions. This can be done by incorporating the insights into business operations, identifying areas
for improvement, and making changes accordingly.
In summary, getting value out of big data requires a deep understanding of the goals, appropriate data,
and advanced analytical tools to extract insights that drive business decisions.
Step One: Process and Clean Data
a) It is important to verify your data matches your business goals. If it does not, there are several
questions to address: What are the viable proxies? Are there outliers that need to be taken into
account? Does the data contain bias? Are there missing values? Look for functionalities that will
correctly address the various needs to clean and process the data.
b) There are a number of methods that can be used to impute, or fill in missing values, such as mean
interpolation, Kalman filter, and ARMA.
c) This step is one of the most important, but may take 70-90 percent of your data analysis project time.
The quality of your data will greatly affect your analysis results.
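As a hedged illustration of the imputation methods mentioned above, here is a small pandas sketch (the column name and values are hypothetical) showing mean imputation and interpolation of missing values.

```python
# Hypothetical sensor column used only to illustrate imputation.
import numpy as np
import pandas as pd

df = pd.DataFrame({"temperature": [21.0, np.nan, 23.5, np.nan, 24.1]})

# Mean imputation: replace each missing value with the column mean
mean_filled = df["temperature"].fillna(df["temperature"].mean())

# Interpolation: estimate each gap from its neighbouring values
interpolated = df["temperature"].interpolate()

print(mean_filled.tolist())
print(interpolated.tolist())
```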
Step Two: Explore and Visualize Data
a) Explore the processed data and visually inspect the data for patterns, trends, and clusters.
b) This is the time to examine relationships and build hypotheses according to your findings. The
easiest way to complete this process is with the aid of visualization tools.
c) There are a number of simple yet powerful visual aids, such as scatter plots, line graphs, stacked
bar charts, box-plots, and heat-maps.
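A minimal sketch of this exploration step, assuming matplotlib is available; the data is synthetic and purely illustrative.

```python
# Synthetic data used only to demonstrate a scatter plot and a box-plot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].scatter(x, y, s=10)      # scatter plot: spot relationships and clusters
axes[0].set_title("Scatter plot")
axes[1].boxplot([x, y])          # box-plot: spot outliers and spread
axes[1].set_title("Box-plot")
plt.tight_layout()
plt.show()
```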
Step Three: Data Mine
a) You can use various methods to facilitate pattern recognition, including K-Means clustering,
hierarchical clustering, and multi-dimensional scaling.
b) Organizations that leverage and mine their data predictively have a significant competitive advantage
over their rivals, as they can gain important insights and react quickly to expand their business in a way
that was not possible without predictive analytics.
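For example, the K-Means clustering mentioned above can be sketched in a few lines with scikit-learn; the blob data below is synthetic and stands in for real transactional data.

```python
# Synthetic 2-D data used only to illustrate K-Means pattern discovery.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)      # cluster id assigned to every data point

print(model.cluster_centers_)      # one centre per discovered cluster
```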
Step Four: Build Model
a) Be sure to have a wide range of models that provide different perspectives of the data. Some possible
models to consider are decision trees, Naïve Bayes classifier, neural networks, SVM, and discriminant
analysis.
b) Every algorithm has its suitability, and it is important to understand that all models have limitations.
c) There could be more than one model that would work for a problem.
d) Avoid overfitting.
e) Be sure to document and communicate the assumptions and results clearly.
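As a hedged sketch of points (a)-(e), the snippet below fits two of the listed model families (a decision tree and a Naïve Bayes classifier) on the same held-out split so their behaviour can be compared; the bundled iris data set is used only for illustration.

```python
# Compare two model families on data they have not seen (guards against overfitting).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (DecisionTreeClassifier(max_depth=3), GaussianNB()):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```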
Step Five: Generate Results and Optimize
a) Predictive results are used to establish objective functions in order to generate actionable results.
There are many applicable methods, such as linear and quadratic programming.
b) One specific method may be more appropriate than another depending on the nature of the objective
function (linear, quadratic, or discontinuous) and constraints on the variables (linear or not).
c) The goal is to produce results that lead to valuable business decisions. If the hospital staff knows a
certain surgical procedure has high readmissions, they may change the process to help reduce
readmissions, such as allowing for an extra day of post-operative care.
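A minimal optimization sketch along these lines, assuming SciPy is available: a tiny linear program whose coefficients are invented purely for illustration.

```python
# Maximize 3x + 2y subject to simple resource constraints.
# linprog minimizes, so the objective coefficients are negated.
from scipy.optimize import linprog

c = [-3, -2]
A_ub = [[1, 1], [2, 1]]          # x + y <= 10, 2x + y <= 15
b_ub = [10, 15]

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(result.x, -result.fun)     # optimal decision variables and objective value
```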
Step Six: Validate Results
After you implement your business decisions, allow time to produce results. It is important to carefully
validate the results against the initial business objective. Returning to our healthcare example, the
hospital’s business objective is reducing readmissions. Analysts should review data to see if current rates
have declined in an appreciable way.

What is Data Science Process?


 Data Science is all about a systematic process used by Data Scientists to analyze, visualize and model large
amounts of data.
 A data science process helps data scientists use the tools to find unseen patterns, extract data, and convert
information to actionable insights that can be meaningful to the company. This aids companies and
businesses in making decisions that can help in customer retention and profits.
 Further, a data science process helps in discovering hidden patterns of structured and unstructured raw
data.
The process helps in turning a problem into a solution by treating the business problem as a project.
The six steps of the data science process are as follows:

Step 1: Framing the Problem


a) Before solving a problem, the pragmatic thing to do is to know what exactly the problem is.
Data questions must first be translated into actionable business questions. More often than not, people will give ambiguous inputs on their issues. In this first step, you will have to learn to turn those inputs into actionable outputs.
A great way to go through this step is to ask questions like:
 Who are the customers?
 How can they be identified?
 What is the sales process right now?
 Why are they interested in your products?
 What products are they interested in?

Step 2: Collecting the Raw Data for the Problem


 After defining the problem, you will need to collect the requisite data to derive insights and turn the
business problem into a probable solution.
 The process involves thinking through your data and finding ways to collect and gather the data you need.

It can include scanning your internal databases or purchasing databases from external sources.
 Many companies store the sales data they have in customer relationship management (CRM) systems.
The CRM data can be easily analyzed by exporting it to more advanced tools using data pipelines.
Step 3: Processing the Data to Analyze
After the first and second steps, when you have all the data you need, you will have to process it before going further
and analyzing it.
Data can be messy if it has not been appropriately maintained, leading to errors that easily corrupt the analysis.
These issues can be values set to null when they should be zero or the exact opposite, missing values, duplicate values,
and many more.
You will have to go through the data and check it for problems to get more accurate insights. The most common errors
that you can encounter and should look out for are:
 Missing values
 Corrupted values, like invalid entries
 Time zone differences
 Date range errors, like a sale recorded before the sales even started

You will also have to look at the aggregates of all the rows and columns in the file and see if the values you obtain make sense. If they don't, you will have to remove or replace the data that doesn't make sense. Once you have completed the data cleaning process, your data will be ready for exploratory data analysis (EDA).
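The checks listed above can be sketched with pandas as follows; the table and its column names are hypothetical.

```python
# Hypothetical sales table used only to demonstrate the common checks.
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "amount":   [250.0, None, None, 120.0],
    "sold_at":  pd.to_datetime(["2023-01-05", "2023-01-07", "2023-01-07", "2022-12-20"]),
})
launch_date = pd.Timestamp("2023-01-01")

print(sales.isna().sum())                      # missing values per column
print(sales.duplicated().sum())                # duplicate rows
print(sales[sales["sold_at"] < launch_date])   # sales recorded before the launch

clean = sales.drop_duplicates().dropna(subset=["amount"])
```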
Step 4: Exploring the Data
In this step, you will have to develop ideas that can help identify hidden patterns and insights. You will have to find more
interesting patterns in the data, such as why sales of a particular product or service have gone up or down.
You must analyze or notice this kind of data more thoroughly. This is one of the most crucial steps in a data science
process.

Step 5: Performing In-depth Analysis


This step will test your mathematical, statistical, and technological knowledge. You must use all the data science tools to
crunch the data successfully and discover every insight you can. You might have to prepare a predictive model that can
compare your average customer with those who are underperforming.
You might find several reasons in your analysis, like age or social media activity, as crucial factors in predicting the
consumers of a service or product.
You might find several aspects that affect the customer, like some people may prefer being reached over the phone
rather than social media.
 These findings can prove helpful as most of the marketing done nowadays is on social media and only
aimed at the youth. How the product is marketed hugely affects sales, and you will have to target
demographics that are not a lost cause after all.
 Once you are all done with this step, you can combine the quantitative and qualitative data that you have
and move them into action.

Step 6: Communicating Results of this Analysis


 After all these steps, it is vital to convey your insights and findings to the sales head and make them understand their importance. It will help if you communicate appropriately to solve the problem you have been given. Proper communication will lead to action; in contrast, improper communication may lead to inaction.
 You need to link the data you have collected and your insights with the sales head’s knowledge so that
they can understand it better.
You can start by explaining why a product was underperforming and why specific demographics were not interested in the sales pitch. After presenting the problem, you can move on to its solution. You will have to build a strong narrative with clarity and clear objectives.

Foundations of Big Data


Big data refers to large, diverse, and complex data sets that are beyond the capabilities of traditional data
processing systems to manage, process, and analyze effectively. These data sets are generated from a
variety of sources such as social media, mobile devices, sensors, and online transactions. The foundations
of big data encompass various aspects such as hardware, software, and data management techniques.
The foundations of big data systems can be summarized into three key components: storage, processing,
and analysis.
1. Storage:
The sheer volume of data generated by businesses and organizations requires a scalable and reliable
storage solution. Traditional relational databases are no longer sufficient in managing big data sets, where
data is often unstructured and requires distributed storage. Big data systems use distributed file systems
that allow data to be stored across multiple servers in a cluster, ensuring scalability, fault tolerance, and
easy access.
2. Processing:
Big data processing involves the manipulation of large volumes of data to extract useful insights. This
process requires specialized tools and frameworks that can handle complex algorithms, parallel processing,
and efficient data movement. Apache Hadoop is one of the most popular big data processing frameworks
that enables distributed processing of massive amounts of data.
3. Analysis:
Finally, analyzing big data involves extracting insights, patterns, and trends from massive data sets, which
can be integrated with other sources to provide a complete picture. Advanced analytics and machine
learning algorithms are key components of big data systems that enable automated decision-making,
predictive analytics, and real-time monitoring.
Together, these components form the foundations of big data systems, enabling organizations to harness
the power of large data sets to drive innovation, improve efficiency, and gain competitive advantage.

Foundations of big data programming


Big data programming is a set of techniques and technologies used to process and analyze large and
complex data sets. The foundations of big data programming include various programming languages,
frameworks, and tools that are used to perform data analysis, processing, and manipulation.
Some of the key foundations of big data programming include:
1. Programming Languages:
The most popular programming languages used for big data programming include Java, Python, R, and
Scala. Each of these languages has its own set of strengths and weaknesses, making them useful for
specific types of data analysis and processing tasks.
2. Distributed Computing Frameworks:
Distributed computing frameworks play a crucial role in big data programming as they enable developers
to process massive data sets by breaking them down into smaller chunks and processing them on a
distributed system. Some of the most popular distributed computing frameworks include Apache Hadoop,
Apache Spark, and Apache Flink.
3. Data Management Tools:
Data management tools enable developers to store, process, and manage large amounts of data
efficiently. Some of the commonly used data management tools include Apache Kafka, Apache Hive, and
Apache Cassandra.
4. Data Visualization Tools:
Data visualization tools are used to create interactive dashboards and charts that help present complex
data in a visually appealing format. Some of the popular data visualization tools include Tableau, D3.js, and
Matplotlib.
5. Cloud Computing Platforms:
Cloud computing platforms provide developers with the ability to store and process large data sets without
the need for expensive hardware. Some of the popular cloud computing platforms for big data
programming include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.
In summary, big data programming is a complex field that requires a combination of programming
languages, distributed computing frameworks, data management tools, and data visualization tools. These
foundations form the backbone of big data programming and are essential for developers to build scalable
and efficient big data applications.
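As a hedged example of how these pieces fit together, the sketch below uses one of the frameworks named above (Apache Spark through its Python API, PySpark) to count words in a file; it assumes PySpark is installed and that the input path exists.

```python
# Word count expressed as Spark DataFrame operations; the work is
# distributed across the cluster's executors. "input.txt" is hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

lines = spark.read.text("input.txt")
words = lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
counts = words.groupBy("word").count().orderBy(F.desc("count"))

counts.show(10)
spark.stop()
```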

Distributed File System in Big Data (in detail)


A Distributed File System (DFS) in Big Data is a decentralized network of interconnected computers that
enables data sharing and management across a wide area. This technology has revolutionized the way data
is managed in Big Data applications by allowing users to store, access, and manage large amounts of data
across multiple machines simultaneously.
 DFS systems offer several advantages over traditional file systems, including fault tolerance, scalability,
and accessibility.
 They are also highly reliable, as data is replicated across multiple nodes, ensuring that data is always
available when needed. In addition, they can reduce the costs associated with storing large amounts of
data, as they eliminate the need for expensive centralized storage systems.
Distributed file systems also have a number of key features that make them ideal for use in Big Data
applications. These include:
1. Parallel processing: DFS can handle large data sets in parallel and distribute computation tasks to multiple
nodes, enabling faster data processing.
2. Replication: Data is replicated across multiple nodes, ensuring that data is always available even if some of
the nodes fail or go offline.
3. High throughput: DFS can handle high volumes of data and can scale horizontally to handle even larger
data volumes.
4. Fault tolerance: DFS are designed to be fault-tolerant, which means they have built-in mechanisms to
handle hardware and software failures and provide continuous data availability.
5. Scalability: DFS can scale horizontally, allowing for the addition of more nodes to increase capacity and
performance.
 DFS systems have been used successfully in various Big Data applications, including Hadoop, which is an
open-source DFS system. Hadoop provides a distributed platform for large-scale data processing
applications, such as data mining and machine learning.
 Additionally, other DFS systems like Ceph, GlusterFS, and Lustre have become increasingly popular in
recent years.
 Overall, DFS systems represent a significant innovation in the management and analysis of Big Data.
Their scalability, parallel processing, high throughput, and fault-tolerance make them an ideal solution for
large-scale data management and analysis.
DFS has two components:
 Location Transparency –
Location Transparency is achieved through the namespace component.
 Redundancy –
Redundancy is done through a file replication component.
In the case of failure and heavy load, these components together improve data availability by allowing the
sharing of data in different locations to be logically grouped under one folder, which is known as the “DFS
root”.
It is not necessary to use both components of DFS together; it is possible to use the namespace component without the file replication component, and it is perfectly possible to use the file replication component without the namespace component between servers.
Applications :
 NFS –
NFS stands for Network File System. It is a client-server architecture that allows a computer user to
view, store, and update files remotely. The protocol of NFS is one of the several distributed file system
standards for Network-Attached Storage (NAS).
 CIFS –
CIFS stands for Common Internet File System. CIFS is a dialect of SMB; that is, CIFS is an implementation of the SMB protocol, designed by Microsoft.
 SMB –
SMB stands for Server Message Block. It is a file-sharing protocol that was invented by IBM. The SMB protocol was created to allow computers to perform read and write operations on files on a remote host over a Local Area Network (LAN).
 Hadoop –
Hadoop is a group of open-source software utilities. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The core of Hadoop contains a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is the MapReduce programming model.
Advantages:
 DFS allows multiple users to access or store the data.
 It allows the data to be shared remotely.
 It improves the availability of files, access time, and network efficiency.
 It improves the capacity to change the size of the data and also improves the ability to exchange the data.
 A Distributed File System provides transparency of data even if the server or disk fails.
Disadvantages:
 In a Distributed File System, nodes and connections need to be secured, so we can say that security is at stake.
 There is a possibility of loss of messages and data in the network while moving from one node to another.
 Database connections in a Distributed File System are complicated.
 Handling the database is also not easy in a Distributed File System compared to a single-user system.
 There is a chance of overloading if all nodes try to send data at once.
Different techniques of big data analytics
1. Descriptive Analytics: This technique involves analyzing historical data to identify patterns and trends,
and to produce reports that summarize data. It helps in providing insights into what has happened in
the past.
2. Predictive Analytics: It is a technique that uses statistical algorithms and machine learning models to
forecast future trends and events based on historical data. It helps in identifying what is likely to
happen in the future.
3. Prescriptive Analytics: It is a technique that involves the optimization of decision-making processes
using algorithms and simulation models. It helps in identifying what action should be taken to achieve
desired outcomes.
4. Social Media Analytics: This technique involves analyzing social media data to gain insights into
consumer behavior, sentiment analysis, and market trends.
5. Text Analytics: It is a technique that involves analyzing unstructured data such as text, voice, and video
to extract insights and useful information.
6. Real-Time Analytics: It involves analyzing data as it is generated to make immediate decisions. This
technique is useful in applications such as fraud detection, stock market analysis, and online
advertising.
7. Spatial Analytics: It is a technique that involves analyzing geographical and location-based data to
identify patterns and trends. It is useful in applications such as urban planning, disaster management,
and logistics.
8. Streaming Analytics: It is a technique that involves analyzing data as it is generated in real-time. It is
useful in applications such as smart home devices, wearables, and IoT devices.
What is RDBMS (Relational Database Management System)
a) All modern database management systems like MS SQL Server, IBM DB2, Oracle, MySQL, and Microsoft Access are based on RDBMS.
b) It is called Relational Database Management System (RDBMS) because it is based on the relational model
introduced by E.F. Codd.
c) How it works
 Data is represented in terms of tuples (rows) in RDBMS.
 A relational database is the most commonly used database. It contains several tables, and each table has
its primary key.
 Due to a collection of an organized set of tables, data can be accessed easily in RDBMS.
Brief History of RDBMS
From 1970 to 1972, E.F. Codd published a paper to propose using a relational database model.
RDBMS is originally based on E.F. Codd's relational model invention.

Following are the various terminologies of RDBMS:

d) What is table/Relation?
Everything in a relational database is stored in the form of relations. The RDBMS database uses tables to
store data. A table is a collection of related data entries and contains rows and columns to store data. Each
table represents some real-world objects such as person, place, or event about which information is collected.
The organized collection of data into a relational table is known as the logical view of the database.
e) Properties of a Relation:
o Each relation has a unique name by which it is identified in the database.
o Relation does not contain duplicate tuples.
o The tuples of a relation have no specific order.
o All attributes in a relation are atomic, i.e., each cell of a relation contains exactly one value.
Let's see the example of the student table.
ID Name AGE COURSE
1 Ajeet 24 B.Tech
2 aryan 20 C.A
5 Vimal 26 BSC
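The student relation above can also be expressed in SQL; the snippet below is a small illustrative sketch that recreates the same table through Python's built-in sqlite3 module.

```python
# Recreate the student relation shown above and query it.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE student (ID INTEGER PRIMARY KEY, Name TEXT, AGE INTEGER, COURSE TEXT)")
cur.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [(1, "Ajeet", 24, "B.Tech"), (2, "Aryan", 20, "C.A"), (5, "Vimal", 26, "BSC")],
)
conn.commit()

for row in cur.execute("SELECT * FROM student WHERE AGE > 21"):
    print(row)   # each row returned is one tuple of the relation
conn.close()
```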

f) What is a row or record?


A row of a table is also called a record or tuple. It contains the specific information for each entry in the table.
It is a horizontal entity in the table. For example, the student table above contains 3 records.
Properties of a row:
o No two tuples are identical to each other in all their entries.
o All tuples of the relation have the same format and the same number of entries.
o The order of the tuple is irrelevant. They are identified by their content, not by their position.

ID Name AGE COURSE


1 Ajeet 24 B.Tech

g) What is a column/attribute?
A column is a vertical entity in the table which contains all information associated with a specific field in a
table. For example, "name" is a column in the above table which contains all information about a student's
name.
Properties of an Attribute:
o Every attribute of a relation must have a name.
o Null values are permitted for the attributes.
o Default values can be specified for an attribute automatically inserted if no other value is specified
for an attribute.
o Attributes that uniquely identify each tuple of a relation are the primary key.
Name
Ajeet
Aryan
Vimal

h) What is data item/Cells?


The smallest unit of data in the table is the individual data item. It is stored at the intersection of tuples and
attributes.
Properties of data items:
o Data items are atomic.
o The data items for an attribute should be drawn from the same domain.
ID Name AGE COURSE
1 Ajeet 24 B.Tech

i) Degree:
The total number of attributes that comprise a relation is known as the degree of the table.
For example, the student table has 4 attributes, and its degree is 4.
ID Name AGE COURSE
1 Ajeet 24 B.Tech
2 aryan 20 C.A
5 Vimal 26 BSC

j) Cardinality:
The total number of tuples at any one time in a relation is known as the table's cardinality. The relation whose
cardinality is 0 is called an empty table.
For example, the student table has 3 rows, and its cardinality is 3.
ID Name AGE COURSE
1 Ajeet 24 B.Tech
2 aryan 20 C.A
5 Vimal 26 BSC

k) Domain:
The domain refers to the possible values each attribute can contain. It can be specified using standard data
types such as integers, floating numbers, etc. For example, An attribute entitled Marital_Status may be
limited to married or unmarried values.
l) NULL Values
The NULL value of the table specifies that the field has been left blank during record creation. It is different
from the value filled with zero or a field that contains space.
m) Data Integrity
There are the following categories of data integrity exist with each RDBMS:
Entity integrity: It specifies that there should be no duplicate rows in a table.
Domain integrity: It enforces valid entries for a given column by restricting the type, the format, or the range
of values.
Referential integrity specifies that rows cannot be deleted, which are used by other records.
User-defined integrity: It enforces some specific business rules defined by users. These rules are different
from the entity, domain, or referential integrity.

Difference between DBMS and RDBMS


Although DBMS and RDBMS both are used to store information in physical database but there are some
remarkable differences between them.
1) DBMS: DBMS applications store data as files.
   RDBMS: RDBMS applications store data in a tabular form.
2) DBMS: In DBMS, data is generally stored in either a hierarchical form or a navigational form.
   RDBMS: In RDBMS, the tables have an identifier called a primary key and the data values are stored in the form of tables.
3) DBMS: Normalization is not present in DBMS.
   RDBMS: Normalization is present in RDBMS.
4) DBMS: DBMS does not apply any security with regard to data manipulation.
   RDBMS: RDBMS defines integrity constraints for the purpose of the ACID (Atomicity, Consistency, Isolation and Durability) properties.
5) DBMS: DBMS uses a file system to store data, so there will be no relation between the tables.
   RDBMS: In RDBMS, data values are stored in the form of tables, so a relationship between these data values is stored in the form of a table as well.
6) DBMS: DBMS has to provide some uniform methods to access the stored information.
   RDBMS: An RDBMS supports a tabular structure of the data and relationships between them to access the stored information.
7) DBMS: DBMS does not support distributed databases.
   RDBMS: RDBMS supports distributed databases.
8) DBMS: DBMS is meant for small organizations and deals with small data; it supports a single user.
   RDBMS: RDBMS is designed to handle large amounts of data; it supports multiple users.
9) DBMS: Examples of DBMS are file systems, XML, etc.
   RDBMS: Examples of RDBMS are MySQL, PostgreSQL, SQL Server, Oracle, etc.
After observing the differences between DBMS and RDBMS, you can say that RDBMS is an extension of DBMS. There are many software products in the market today that are compatible with both DBMS and RDBMS, which means that today an RDBMS application is a DBMS application and vice versa.

What is NoSQL (dec-2022)


 NoSQL Database is a non-relational Data Management System, that does not require a fixed schema.
 It avoids joins, and is easy to scale.
 The major purpose of using a NoSQL database is for distributed data stores with humongous data
storage needs. NoSQL is used for Big data and real-time web apps.
 For example, companies like Twitter, Facebook and Google collect terabytes of user data every single day.
 NoSQL stands for "Not Only SQL" or "Not SQL." Though a better term would be "NoREL", the name NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.

Traditional RDBMS uses SQL syntax to store and retrieve data for further insights. Instead, a NoSQL database system encompasses a wide
range of database technologies that can store structured, semi-structured, unstructured and polymorphic data.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like Google, Facebook, Amazon, etc.
who deal with huge volumes of data. The system response time becomes slow when you use RDBMS for
massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading our existing hardware. This process is
expensive.
The alternative for this issue is to distribute database load on multiple hosts whenever the load increases. This
method is known as “scaling out.”

NoSQL database is non-relational, so it scales out better than relational databases as they are designed with
web applications in mind.
Types of NoSQL Databases
NoSQL databases are mainly categorized into four types: key-value pair, column-oriented, graph-based and document-oriented. Every category has its unique attributes and limitations. None of these databases is best for solving all problems; users should select a database based on their product needs.
1. Key Value Pair Based
Data is stored in key/value pairs. It is designed in such a way to handle lots of
data and heavy load.
Key-value pair storage databases store data as a hash table where each key is
unique, and the value can be a JSON, BLOB(Binary Large Objects), string, etc.
For example, a key-value pair may contain a key like “Website” associated with
a value like “Guru99”.
It is one of the most basic NoSQL database examples. This kind of NoSQL database is used for collections, dictionaries, associative arrays, etc. Key-value stores help the developer to store schema-less data. They work best for shopping cart contents.
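The key-value model can be illustrated with a toy in-memory store; real key-value databases (e.g. Redis or DynamoDB) expose the same put/get idea over a network, so the class below is only a conceptual sketch.

```python
# A toy key-value store: a hash table where each unique key maps to any value.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("Website", "Guru99")                                  # string value
store.put("cart:42", {"items": ["pen", "book"], "total": 150})  # JSON-like value
print(store.get("Website"), store.get("cart:42"))
```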
2. Column-based
Column-oriented databases work on columns and are based on BigTable paper by
Google. Every column is treated separately. Values of single column databases are
stored contiguously.
They deliver high performance on aggregation queries like SUM, COUNT, AVG, MIN
etc. as the data is readily available in a column.
Column-based NoSQL databases are widely used to manage data warehouses, business intelligence, CRM, and library card catalogs.
HBase, Cassandra, and Hypertable are examples of column-based NoSQL databases.
3. Document-Oriented:
Document-Oriented NoSQL DB stores and retrieves data as a key value pair but the value part is stored as a
document. The document is stored in JSON or XML formats. The value is understood by the DB and can be
queried.
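A small conceptual sketch of the document model: each value is a JSON document whose fields the database understands and can query. Real document stores such as MongoDB or CouchDB provide the same idea with richer query languages; the dictionaries below are only illustrative.

```python
# Documents keyed by id; queries can look inside the document, not just at the key.
import json

documents = {
    "user:1": {"name": "Ajeet", "age": 24, "course": "B.Tech"},
    "user:2": {"name": "Aryan", "age": 20, "course": "C.A"},
}

adults = {k: d for k, d in documents.items() if d["age"] > 21}
print(json.dumps(adults, indent=2))
```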

4. Graph-Based
A graph-type database stores entities as well as the relations amongst those entities. The entity is stored as a node and the relationship as an edge. An edge gives a relationship between nodes. Every node and edge has a unique identifier.

Compared to a relational database, where tables are loosely connected, a graph database is multi-relational in nature. Traversing relationships is fast because they are already captured in the database, and there is no need to calculate them.
Graph-based databases are mostly used for social networks, logistics, and spatial data.
Neo4J, Infinite Graph, OrientDB, and FlockDB are some popular graph-based databases.
Advantages of NoSQL
 Can be used as Primary or Analytic Data Source
 Big Data Capability
 No Single Point of Failure
 Easy Replication
 It provides fast performance and horizontal scalability.
 Support Key Developer Languages and Platforms
 Simpler to implement than an RDBMS
Disadvantages of NoSQL
 No standardization rules
 Limited query capabilities
 RDBMS databases and tools are comparatively mature
 It does not offer any traditional database capabilities, like consistency when multiple transactions are
performed simultaneously.
 When the volume of data increases, it becomes difficult to maintain unique keys
 Doesn’t work as well with relational data

What is Data Mart?

 A Data Mart is a subset of an organizational information store, generally oriented to a specific purpose or primary data subject, which may be distributed to support business needs.
 Data Marts are analytical record stores designed to focus on particular business functions for a specific
community within an organization.
 Data marts are derived from subsets of data in a data warehouse, though in the bottom-up data
warehouse design methodology, the data warehouse is created from the union of organizational data
marts.
 The fundamental use of a data mart is for Business Intelligence (BI) applications.
 BI is used to gather, store, access, and analyze data. Data marts can be used by smaller businesses to utilize the data they have accumulated, since a data mart is less expensive to implement than a data warehouse.

Reasons for creating a data mart


o Creates collective data by a group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential clients are more clearly defined than in a comprehensive data warehouse
o It contains only essential business data and is less cluttered.
Types of Data Marts
There are mainly two approaches to designing data marts.
These approaches are
1. Dependent Data Marts
 A dependent data mart is a logical or physical subset of a larger data warehouse.
 According to this technique, the data marts are treated as subsets of a data warehouse.
 In this technique, first a data warehouse is created, from which further data marts can be created.
 These data marts depend on the data warehouse and extract the essential records from it.
 In this technique, since the data warehouse creates the data marts, there is no need for separate data mart integration. It is also known as a top-down approach.

2. Independent Data Marts


 Here, independent data marts are created first, and then a data warehouse is designed using these multiple independent data marts.
 In this approach, as all the data marts are designed independently, the integration of the data marts is required.
 It is also termed a bottom-up approach, as the data marts are integrated to develop a data warehouse.

Other than these two categories, one more type exists that is called "Hybrid Data Marts."
Hybrid Data Marts
It allows us to combine input from sources other than a data warehouse. This can be helpful in many situations, especially when ad hoc integration is needed, such as after a new group or product is added to the organization.
Steps in Implementing a Data Mart
A. Designing
The design step is the first in the data mart process. This phase covers all of the functions from initiating the
request for a data mart through gathering data about the requirements and developing the logical and
physical design of the data mart.
It involves the following tasks:
1. Gathering the business and technical requirements
2. Identifying data sources
3. Selecting the appropriate subset of data
4. Designing the logical and physical architecture of the data mart.

B. Constructing
This step contains creating the physical database and logical structures associated with the data mart to
provide fast and efficient access to the data.
It involves the following tasks:
1. Creating the physical database and logical structures such as tablespaces associated with the data
mart.
2. Creating the schema objects, such as tables and indexes, described in the design step.
3. Determining how best to set up the tables and access structures.
C. Populating
This step includes all of the tasks related to the getting data from the source, cleaning it up, modifying it to
the right format and level of detail, and moving it into the data mart.
It involves the following tasks:
1. Mapping data sources to target data sources
2. Extracting data
3. Cleansing and transforming the information.
4. Loading data into the data mart
5. Creating and storing metadata
D. Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports, charts and graphs
and publishing them.
It involves the following tasks:
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database operations and object names into business terms so that end users can interact with the data mart using words that relate to the business functions.
2. Set up and manage database structures, like summarized tables, which help queries submitted through the front-end tools execute rapidly and efficiently.
E. Managing
This step contains managing the data mart over its lifetime. In this step, management functions are performed
as:
1. Providing secure access to the data.
2. Managing the growth of the data.
3. Optimizing the system for better performance.
4. Ensuring the availability of data even with system failures.

Difference between Data Warehouse and Data Mart

1. Data Warehouse: A Data Warehouse is a vast repository of information collected from various organizations or departments within a corporation.
   Data Mart: A data mart is only a subtype of a Data Warehouse. It is architected to meet the requirements of a specific user group.
2. Data Warehouse: It may hold multiple subject areas.
   Data Mart: It holds only one subject area, for example, Finance or Sales.
3. Data Warehouse: It holds very detailed information.
   Data Mart: It may hold more summarized data.
4. Data Warehouse: It works to integrate all data sources.
   Data Mart: It concentrates on integrating data from a given subject area or set of source systems.
5. Data Warehouse: In data warehousing, a fact constellation schema is used.
   Data Mart: In a data mart, star schema and snowflake schema are used.
6. Data Warehouse: It is a centralized system.
   Data Mart: It is a decentralized system.
7. Data Warehouse: Data warehousing is data-oriented.
   Data Mart: A data mart is project-oriented.
Data lakes
 A data lake is a large-scale data storage and management system that is used to store data in its raw,
unprocessed form.
 It provides a secure and scalable repository to store structured and unstructured data, without any
predefined schema, in their native formats.
 Data lakes allow organizations to store any type of data, regardless of its source or structure.
 Data is stored in a central repository or pool, which can be accessed by various departments and data
scientists within the organization for further analysis or processing.
 The data can be of any size or format, and can be processed in batch or real-time.
 Data lakes are often used in big data and machine learning projects, where the amount of data being
collected, analyzed, and stored is massive.
 They provide a central repository for historical data, enabling organizations to perform predictive
analytics, machine learning, and other data-intensive tasks.

Some of the key features of a data lake include:


1. Scalability: Data lakes are designed to handle massive amounts of data and can easily scale up or
down, as per the organization's needs.
2. Flexibility: Data lakes allow organizations to store any type of data, structured or unstructured,
without any schema.
3. Cost-effective: Data lakes are often more cost-effective than traditional data warehouses as they
allow organizations to store data in its raw form, reducing the need for expensive data
transformation processes.
4. Agility: Data lakes enable organizations to quickly and easily access large volumes of data for
analysis and processing.
5. Security: Data lakes provide advanced security features to ensure the safety of the data stored in
them.

 Data lakes can be deployed on-premises or in the cloud, depending on the organization's needs and
preferences. Some of the popular data lake platforms include Hadoop, AWS S3, Azure Data Lake
Storage, and GCP Data Lake Storage.
 In a nutshell, data lakes provide organizations with a cost-effective, scalable, and flexible solution for
storing and managing large volumes of data. They help organizations to better understand their data,
make informed decisions, and drive business growth.

ETL (Extract, Transform, and Load) Process


 The mechanism of extracting information from source systems and bringing it into the data warehouse is
commonly called ETL, which stands for Extraction, Transformation and Loading.
 The ETL process requires active inputs from various stakeholders, including developers, analysts, testers,
top executives and is technically challenging.
 To maintain its value as a tool for decision-makers, Data warehouse technique needs to change with
business changes.
 ETL is a recurring method (daily, weekly, monthly) of a Data warehouse system and needs to be agile,
automated, and well documented.

How ETL Works?


1. Extraction
o Extraction is the operation of extracting information from a source system for further use in a data
warehouse environment. This is the first stage of the ETL process.
o Extraction process is often one of the most time-consuming tasks in the ETL.
o The source systems might be complicated and poorly documented, and thus determining which data needs
to be extracted can be difficult.
o The data has to be extracted several times in a periodic manner to supply all changed data to the
warehouse and keep it up-to-date.
Cleansing
The cleansing stage is crucial in a data warehouse technique because it is supposed to improve data
quality. The primary data cleansing features found in ETL tools are rectification and homogenization.
example:
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date list of contact
addresses, email addresses and telephone numbers must be available.
2. Transformation
Transformation is the core of the reconciliation phase. It converts records from its operational source format
into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs our
reconciled data layer.
The following points must be rectified in this phase:
o Loose text may hide valuable information. For example, "XYZ PVT Ltd" does not explicitly show that it is a Limited Partnership company.
o Different formats can be used for the same individual data item. For example, the same value can be saved as a string or as three integers.
Cleansing and Transformation processes are often closely linked in ETL tools.

3. Loading
The load is the process of writing the data into the target database. During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.
Loading can be carried in two ways:
1. Refresh: Data Warehouse data is completely rewritten. This means that older file is replaced. Refresh
is usually used in combination with static extraction to populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the Data Warehouse. An
update is typically carried out without deleting or modifying preexisting data. This method is used in
combination with incremental extraction to update data warehouses regularly.
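To tie the three stages together, here is a compact, hedged ETL sketch in Python: extract from a CSV file, transform (cleanse and homogenize), and load into SQLite with a "refresh"-style rewrite. The file, table, and column names are hypothetical.

```python
# Minimal ETL sketch; "customers.csv" and the "email" column are assumptions.
import sqlite3
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    return pd.read_csv(path)                            # pull data from the source system

def transform(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(subset=["email"])                    # cleansing: drop unusable rows
    df["email"] = df["email"].str.lower().str.strip()   # homogenization
    return df

def load(df: pd.DataFrame, db_path: str) -> None:
    with sqlite3.connect(db_path) as conn:
        # refresh-style load: the target table is completely rewritten
        df.to_sql("customers", conn, if_exists="replace", index=False)

load(transform(extract("customers.csv")), "warehouse.db")
```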

DATA PIPELINES
 Data Pipeline :
A data pipeline deals with information that is flowing from one end to another. In simple words, it collects data from various resources, processes it as per requirements, and transfers it to the destination by following some sequential activities. It is a set of steps that first extracts data from various resources and then transforms and moves it to a destination, i.e., it processes the data as well as moving it from one system to another.

 How to build a Data Pipeline :


An organization can decide the methods of development to be followed to abstract data from sources and transfer it to the destination.
Batch processing and real-time (stream) processing are two common methods of development. Then there is a decision on which transformation process, ELT (Extract/Load/Transform) or ETL, to use before the data is moved to the required destination.

 Challenges to building Data Pipeline :


Netflix, for example, has built its own data pipeline. However, building your own data pipeline is very difficult and time-consuming. Here are some common challenges to creating a data pipeline in-house:
 Connection
 Flexibility
 Centralization
 Latency
A data pipeline is a process that moves data from one system or application to another. It involves a series of
steps or stages needed to collect, clean, transform, and load data into a final destination, such as a
database, data warehouse, or data lake.

 The main components of a data pipeline are:


1. Data Source: The origin of the data, which can be structured or unstructured, inside or outside the
organization. Examples are databases, APIs, logs, sensors, files, or social media.
2. Data Ingestion: The process of extracting data from the source and bringing it into the pipeline. This
step requires connectivity and security measures to ensure data integrity, privacy, and compliance.
3. Data Processing: The stage where data is cleaned, standardized, validated, aggregated, enriched, or
transformed according to business rules and analytics requirements. This step can involve different
tools or programming languages, such as SQL, Python, or Spark.
4. Data Storage: The destination where the processed data is stored for further analysis, reporting, or
visualization. This step can use different types of databases, such as relational, NoSQL, or graph,
depending on the data model and the scalability needs.
5. Data Quality: The monitoring and auditing of the data pipeline to ensure data accuracy,
completeness, consistency, timeliness, and relevance. This step can use data profiling, data lineage, or
data validation techniques to detect and prevent errors or anomalies.
6. Data Governance: The management and control of the data pipeline to align with the organizational
policies and regulations, such as GDPR, CCPA, or HIPAA. This step involves roles and responsibilities,
access controls, data cataloging, and data lineage tracking.
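To make the stages concrete, here is a minimal, illustrative Python sketch that chains ingestion, processing, and storage; the users.csv source file, the cleaning rules, and the SQLite destination are assumptions for the example, not a prescribed design.

import csv, sqlite3

def ingest(path):
    # Data ingestion: extract raw records from a CSV source.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def process(rows):
    # Data processing: clean and standardize according to simple business rules.
    cleaned = []
    for r in rows:
        if r.get("email"):  # drop records missing an email
            cleaned.append({"email": r["email"].strip().lower(),
                            "city": r.get("city", "").title()})
    return cleaned

def store(rows, conn):
    # Data storage: load processed records into the destination database.
    conn.execute("CREATE TABLE IF NOT EXISTS users (email TEXT, city TEXT)")
    conn.executemany("INSERT INTO users VALUES (:email, :city)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
store(process(ingest("users.csv")), conn)  # source -> ingestion -> processing -> storage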
A data pipeline can be visualized as a flowchart, where each step represents a block or a node connected
by arrows or lines. The flowchart can be customized to reflect the specific data pipeline architecture, tools,
and stakeholders.
Overall, a data pipeline is a critical component of any data-driven organization that wants to make
informed decisions, improve customer experience, optimize operations, or innovate products and
services.
Foundations of Big Data
Big data refers to the large volumes of structured, semi-structured, and unstructured data generated by
businesses, organizations, individuals, and other entities. This data has grown exponentially in recent years
due to the widespread adoption of digital technologies and the increasing connectivity of people and
devices.
Foundations of big data include:
Volume, Variety, Velocity, Veracity, Value, and Variability.
Overall, the foundations of big data are crucial to understand in order to harness its full potential. By
analyzing and making sense of large volumes of diverse data, businesses and organizations can gain
valuable insights and make informed decisions that drive success.

Big Data Processing Tools (Dec-2022)


Big data processing tools are software applications that facilitate the analysis and management of large and
complex data sets, which are often too large to handle using traditional data processing techniques. These
tools use distributed computing systems that can process data across multiple systems to achieve high
performance, scalability, and fault tolerance. Here are some of the most popular big data processing
tools in detail:

1. Apache Hadoop: Apache Hadoop is an open-source big data processing tool that provides a
distributed file system (Hadoop Distributed File System - HDFS) and a distributed processing
framework (MapReduce) for parallel processing of large data sets. Hadoop is designed to run on
commodity hardware and can scale up to handle petabytes of data.

2. Apache Spark: Apache Spark is another open-source big data processing tool that provides a
general-purpose computing engine for parallel data processing. It supports both batch and real-time
processing and can be used with a variety of data sources, including Hadoop Distributed File System
(HDFS), Cassandra, and HBase.

3. Apache Cassandra: Apache Cassandra is an open-source distributed NoSQL database that is
designed for high scalability and fault tolerance. It can handle big data workloads by distributing
data across multiple nodes in a cluster and using a peer-to-peer architecture to achieve high
availability and data replication.

4. Apache Storm: Apache Storm is an open-source distributed streaming data processing system that
can handle real-time data streams with low latency and high throughput. It uses a master-slave
architecture to process continuous data streams across a cluster of nodes.

5. Apache Kafka: Apache Kafka is an open-source distributed messaging system that is designed for
high throughput and low latency. It can handle real-time data streams and is often used in conjunction
with other big data processing tools such as Apache Spark and Apache Storm. Kafka provides a
scalable, fault-tolerant messaging infrastructure for distributed data processing.

These are some of the most popular big data processing tools that are used by enterprises to manage and
analyze large and complex data sets. Each tool has its own strengths and weaknesses, and choosing the right
tool depends on the specific requirements of the organization.
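As a small illustration of how such an engine is used in practice, the hedged PySpark sketch below reads a file and aggregates it in parallel; it assumes Spark is installed locally and that an events.csv file with an event_type column exists.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (on a real cluster this would point at YARN, Mesos, or Kubernetes).
spark = SparkSession.builder.appName("event-counts").getOrCreate()

# Read a CSV data set; Spark splits the work across the available cores/executors.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# A simple distributed aggregation: count events per type.
counts = events.groupBy("event_type").agg(F.count("*").alias("n"))
counts.show()

spark.stop()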

Modern Data Ecosystem


The modern data ecosystem is a complex and interconnected system that has evolved to meet the needs of
businesses and organizations in the digital age. It is made up of numerous components that work together to
manage, store, process, and analyze large amounts of data. These components include:
1. Data sources: These are the devices, systems, and other sources that generate data, such as sensors,
cameras, databases, and social media platforms.
2. Data capture: This is the process of collecting and storing data from various sources in standardized
formats. This can be done manually or automatically using tools like ETL (extract, transform, load)
software.
3. Data storage: This involves storing data in different storage mediums like databases, file systems,
cloud-based storage, etc.
4. Data processing: This includes transforming and manipulating data to conform to the desired format,
and can include tasks like filtering, sorting, aggregating, and merging data.
5. Data analysis: Once data has been processed, it can then be analyzed using various data analysis
tools, such as machine learning algorithms, statistical models, and data visualization tools.
6. Data visualization: This involves presenting data in visual form to aid in understanding and insight
generation.
7. Data governance: This refers to the policies, procedures, and standards that govern the use of data
within an organization, including data security, access controls, and compliance with regulatory
requirements.
8. Data management: This encompasses the process of managing data across its entire lifecycle,
including data quality assurance, archiving, and disposal.
9. Data integration: This involves combining multiple sources of data into a single dataset or system.
10. Data distribution: This involves sharing data between different systems, applications, and users.
Overall, the modern data ecosystem is a complex and dynamic system that requires careful management and
continual evolution to keep up with the ever-changing needs of businesses and organizations.
What is a Big Data Platform
Big data platform is a technology infrastructure that enables organizations to collect, store, manage and
analyze large volumes of structured, unstructured, and semi-structured data in real time. A big data platform
typically comprises several components, including data storage, processing, management, and analysis
tools.

Here are the key components of a Big Data platform:

1. Data Storage: The storage component of the Big Data platform needs to be capable of handling
massive amounts of data. It can be a distributed file system like Hadoop Distributed File System
(HDFS), Cloud-based storage like Amazon S3, Microsoft Azure or Google Cloud Storage, or a
traditional relational database like MySQL, Oracle or Microsoft SQL Server.

2. Data Processing: The processing component of a Big Data platform allows for data ingestion,
transformation, and cleansing. Data processing tools like Apache Spark, Apache Storm, Apache Flink
or Apache Beam can be used to process large data sets in real-time.

3. Data Management: The data management component of Big Data platform deals with the
management of data lifecycle, metadata, and data security. It includes tools for data governance
and data integration such as Apache Atlas, Apache NiFi, and Apache Flume.

4. Data Analysis: The analysis component of a Big Data platform allows for advanced analytics on large
datasets, including predictive modeling, machine learning, and artificial intelligence. Some of the
popular data analysis tools used in Big Data platforms are Apache Hadoop, Apache Spark MLlib,
and Apache Mahout.

Overall, a big data platform is designed to help organizations to overcome the challenges of handling
massive volumes of data in real-time and derive valuable insights to support data-driven decision-making
processes.

features of big data platform


1. Scalability: A big data platform should be scalable enough to process and store data smoothly, even if the
volume increases in size over time.
2. Flexibility: The platform should be flexible enough to incorporate new data sources and types of data
ingested.
3. Performance: A big data platform should have high performance and speed in processing, analyzing, and
accessing data.
4. Security: It should have robust security features in place to protect data from unauthorized access, misuse,
and data breaches.
5. Data integration: A big data platform should be capable of integrating multiple data sources, formats,
and types of data such as structured, unstructured, and semi-structured.
6. Data processing: The platform should support different data processing techniques such as batch
processing, real-time processing, and stream processing.
7. Data storage: A big data platform should provide optimized storage to handle large volumes of data
while minimizing storage costs.
8. Analytics: The platform should offer built-in analytics tools that allow users to perform various analyses,
including predictive and machine learning models.
9. Extensibility: The platform should allow customization and extension, enabling users to tailor it according to
their specific needs.
10. User interface: A big data platform should provide an easy-to-use user interface that enables users to
access and work with data conveniently.

Types of Data Analytics

There are four types of big data analytics that really aid business:
1. Prescriptive – This type of analysis reveals what actions should be taken. This is the most valuable
kind of analysis and usually results in rules and recommendations for next steps.
 Prescriptive data refers to the analysis and interpretation of data to provide insights and
recommendations on how to improve future outcomes.
 It goes beyond descriptive and diagnostic analytics, which focus on understanding what happened
and why it happened.
 Prescriptive data involves the use of advanced analytics techniques, such as machine learning and
optimization algorithms, to identify the best course of action based on the data available.
 The goal of prescriptive data is to help decision-makers make informed decisions that lead to
optimal outcomes.
2. Predictive – An analysis of likely scenarios of what might happen. The deliverables are usually a
predictive forecast.
 Predictive data refers to data that is analyzed, processed and transformed using advanced
analytical algorithms and techniques to generate insights about future outcomes or trends.
 It involves extracting valuable insights or patterns from historical data to develop a prediction
model that can be used to anticipate future outcomes or trends.
 Predictive data is commonly used in financial forecasting, marketing analysis, risk management,
and supply chain optimization. It is becoming increasingly important for businesses to make
informed decisions and stay ahead of the competition.
3. Diagnostic – A look at past performance to determine what happened and why. The result of the
analysis is often an analytic dashboard.
 Diagnostic data refers to information collected from a system, device or application to analyze
and troubleshoot technical issues or to gain insights into its performance.
 This data is used by technicians, analysts and developers to identify problems, diagnose faults,
and optimize performance.
 Diagnostic data can consist of logs, system files, event messages, error codes, network activity,
device drivers, user behavior patterns, and other data points that provide insights into the
behavior and health of the system, device or application.
 It can be collected manually or automatically, and in some cases, may need the user's permission
to be accessed or collected.
4. Descriptive – What is happening now based on incoming data. To mine the analytics, you typically
use a real-time dashboard and/or email reports.
 Descriptive data is a type of data that describes the characteristics, properties, or attributes of a
group or population.
 It is used to summarize and describe the data that is collected from a sample or population.
 Descriptive data can be presented in a variety of ways, including graphs, charts, tables, and other
visual aids.
 This type of data is commonly used in statistical analysis, research studies, and surveys to provide a
summary of the data that has been collected.
 Some examples of descriptive data include age, gender, income, education level, and geographic
location.
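To make the contrast concrete, the short Python sketch below computes a descriptive summary and a simple predictive forecast on made-up monthly sales figures; it assumes NumPy and scikit-learn are installed and is only an illustration of two of the categories above.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales figures.
months = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([100, 110, 125, 140, 150, 165])

# Descriptive: summarize what has happened so far.
print("average monthly sales:", sales.mean())

# Predictive: fit a model on the historical data and forecast the next two months.
model = LinearRegression().fit(months, sales)
print("forecast:", model.predict(np.array([[7], [8]])))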

Understanding Different Types of File Formats in big data


In big data, there are several types of file formats used to store and manipulate data. These formats differ
in their efficiency, compatibility, and flexibility. Some of the most common file formats used in big data are:

1. CSV (Comma-Separated Values): CSV is a text-based file format used to store tabular data. In this
format, the values are separated by commas, and each row represents a record. CSV files are easy to read
and edit but are not suitable for complex data structures.

2. JSON (JavaScript Object Notation): JSON is a lightweight text-based file format that is used to send and
receive data between servers and applications. JSON stores data in a hierarchical format and is easy to
read and edit. It is commonly used in web applications that use JavaScript.

3. AVRO: Avro is a binary file format that is compact and efficient. It is designed to support complex data
structures and provides schema evolution, which means that users can evolve data schemas without breaking
the existing data. Avro files are commonly used in Hadoop and other big data processing systems.

4. Parquet: Parquet is a columnar file format that stores data in a compressed format. It is optimized for big
data processing and is used for analytics and data warehousing applications. Parquet files can be read and
written by several different data processing systems.

5. ORC: ORC (Optimized Row Columnar) is a columnar file format similar to Parquet. It stores data in a
compressed format and supports schema evolution. ORC is often preferred over Parquet in Hive-based
environments, largely because of its higher compression rate.

6. XML: XML (eXtensible Markup Language) is a text-based file format that is used to store hierarchical
data. XML is versatile and can represent any type of data, including structured, semi-structured, and
unstructured data. However, XML files can be large and slow to process.

Overall, the choice of file format depends on the specific requirements of the application. Factors to consider
include the size and complexity of the data, compatibility with different systems, and the performance and
efficiency of the file format.
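A short, hedged sketch of converting between some of these formats with pandas; the file and column names are placeholders, and Parquet support assumes the optional pyarrow (or fastparquet) engine is installed.

import pandas as pd

# Read a CSV file (text-based, row-oriented).
df = pd.read_csv("sales.csv")

# Write the same data as JSON (hierarchical, text-based).
df.to_json("sales.json", orient="records")

# Write it as Parquet (compressed, columnar); requires pyarrow or fastparquet.
df.to_parquet("sales.parquet")

# Columnar formats let analytical engines read only the columns they need.
print(pd.read_parquet("sales.parquet", columns=["amount"]).head())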

Sources of data using service bindings


In Big Data, service bindings allow for the connection and integration of different services and data sources.
These bindings can be used to extract data from various sources, such as:

1. Databases: Service bindings can be used to connect to different types of databases, including SQL and
NoSQL databases. This allows for the extraction of structured and unstructured data, respectively. Examples
of databases that can be accessed through service bindings include MySQL, MongoDB, Oracle, and
Cassandra.

2. Cloud Storage: Service bindings can also be used to access data stored in cloud storage systems, such as
Amazon S3, Microsoft Azure Blob storage, and Google Cloud Storage. This allows for the extraction of large
volumes of data that are stored in a distributed manner across multiple servers.

3. Social Media: Social media platforms like Twitter and Facebook generate large volumes of data that can
be accessed through service bindings. This data can be used to gain insights into customer behavior, sentiment
analysis, and to develop targeted marketing strategies.

4. Sensors and IoT Devices: The Internet of Things (IoT) has resulted in a proliferation of sensors and
connected devices that generate vast amounts of data. Service bindings can be used to extract data from
these devices, including temperature sensors, GPS trackers, and smart home devices.

5. Web Services: Finally, service bindings can be used to access data from web services, such as weather
APIs, Google Maps, and Amazon Alexa. This data can be used to build applications that are weather-
dependent, location-based, or that interface with voice assistants.

In conclusion, service bindings provide a versatile means of accessing data from a wide variety of sources in
Big Data. This data can be used to gain insights into customer behavior, improve business processes, and
develop new products and services.

Example:

One source of data using service bindings in big data is cloud-based applications such as Amazon Web
Services (AWS) or Microsoft Azure, which provide service bindings to access data stored in their cloud
platforms. Another source is third-party big data platforms such as Hadoop, Cassandra, or MongoDB, which
also offer service bindings to connect and retrieve data from their databases.
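As a hedged illustration, the sketch below reads a connection string that a service binding might expose as an environment variable and uses it to query a MongoDB source with pymongo; the variable name, database, and collection are assumptions for the example.

import os
from pymongo import MongoClient

# A service binding typically injects credentials or URIs into the application
# environment; the variable name used here is hypothetical.
uri = os.environ.get("MONGODB_BINDING_URI", "mongodb://localhost:27017")

client = MongoClient(uri)
db = client["sales"]

# Pull a handful of recent orders from the bound data source.
for order in db["orders"].find().sort("created_at", -1).limit(5):
    print(order)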

Features of big data platform


Here are some common features of big data platforms:

1. Distributed architecture: Big data platforms are designed to run on clusters of commodity hardware,
allowing them to scale horizontally to handle large amounts of data and processing power.

2. Data storage: Big data platforms provide a distributed file system that can store and manage large
amounts of unstructured and structured data. Examples include Hadoop Distributed File System
(HDFS), Apache Cassandra, and Apache HBase.

3. Data processing: Big data platforms provide distributed processing frameworks that can process
large amounts of data in parallel across a cluster of nodes. Examples include Hadoop MapReduce,
Apache Spark, and Apache Flink.
4. Data ingestion: Big data platforms provide tools for ingesting data from various sources such as
databases, file systems, and streaming data sources. Examples include Apache Flume, Apache Kafka,
and Apache Nifi.

5. Data integration: Big data platforms provide tools for integrating data from various sources and
formats. Examples include Apache Sqoop, Apache NiFi, and Talend.

6. Data analysis: Big data platforms provide tools for analyzing and querying data using SQL-like
languages or programming languages such as Python or R. Examples include Apache Hive, Apache
Pig, and Apache Drill.

7. Data visualization: Big data platforms provide tools for visualizing data to make it easier to
understand and communicate insights. Examples include Tableau, QlikView, and Apache Superset.

Overall, big data platforms provide a comprehensive set of features and tools for storing, processing,
integrating, analyzing, and visualizing large amounts of data. They are designed to handle the challenges of
big data, including scalability, fault tolerance, and heterogeneity of data sources and formats.
UNIT- 3 Introduction to Big Data Modeling and Management

Data Storage

Big data involves dealing with large volumes of data that cannot be stored and processed using
traditional methods. There are several types of storage solutions commonly used in big data:

1. Hadoop Distributed File System (HDFS) – HDFS is an open-source distributed file system designed
specifically to provide reliable and scalable storage for big data applications. It is the primary storage
solution in Hadoop framework and can store data in petabytes.

2. NoSQL Databases – NoSQL databases like MongoDB, Cassandra, and Couchbase are designed to
handle large volumes of unstructured data. They provide high scalability and performance for real-time
data processing.

3. Object Storage – Object storage solutions like Amazon S3, Google Cloud Storage, and Microsoft
Azure Blob Storage can store and manage large amounts of unstructured data. They are highly scalable
and provide durable storage with low latency access.

4. In-Memory Storage – In-memory storage solutions like Apache Ignite and Redis can store and
manipulate data in RAM, providing high-speed data processing and low latency access. They are
commonly used in applications that require real-time data analysis and processing.

5. Data Warehouses – Data warehouses like Amazon Redshift and Snowflake are designed to store
large volumes of structured data for analytics purposes. They provide fast query response times and
integrations with popular BI tools.
Overall, the choice of data storage solution in big data depends on the type of data, volume, and the
specific requirements of the application.

Data Quality in big data

Data quality is an essential attribute for big data. Big data refers to huge amounts of data, which can
come from various sources, such as sensor data, social media, and transactional data.
In big data, data quality refers to the accuracy, consistency, completeness, and relevance of data.
Maintaining data quality in big data is crucial to ensure the accuracy of insights and conclusions derived
from the data analysis.
There are several challenges to achieving high data quality in big data, such as the following:

1. Data Variability: Big data sources have different data formats, structures, and quality standards,
making it challenging to integrate and validate data seamlessly.

2. Data Volume: Big data gets generated in massive quantities, making it impossible to implement
traditional data validation mechanisms.

3. Data Complexity: The data in big data can be highly complex, making it essential to have domain
experts validate and filter out data that may be irrelevant.
4. Data Latency: Big data in real-time scenarios can have a high volume with significant velocity, making
it necessary to prioritize data validation mechanisms.

To address these challenges, companies need robust data quality management strategies that include
automated data validation, cleansing, and enrichment techniques. The use of modern technologies like
machine learning and AI can also help in improving data quality in big data.

Dimensions of Data Quality


Data quality operates in six core dimensions:
1. Accuracy: The data reflects the real-world objects and/or events it is intended to model. Accuracy is
often measured by how well the values agree with an information source that is known to be correct.
2. Completeness: The data makes all required records and values available.
3. Consistency: Data values drawn from multiple locations do not conflict with each other, either across a
record or message, or along all values of a single attribute. Note that consistent data is not necessarily
accurate or complete.
4. Timeliness: Data is updated as frequently as necessary, including in real time, to ensure that it meets
user requirements for accuracy, accessibility, and availability.
5. Validity: The data conforms to defined business rules and falls within allowable parameters when those
rules are applied.
6. Uniqueness: No record exists more than once within the data set, even if it exists in multiple locations.
Every record can be uniquely identified and accessed within the data set and across applications.
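A small Python sketch of checking a few of these dimensions with pandas; the columns and rules are illustrative assumptions rather than a standard set of checks.

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "c@example.com"],
    "age": [34, 29, 29, 210],
})

# Completeness: share of non-missing values per column.
print(df.notna().mean())

# Uniqueness: no record (here, customer_id) should exist more than once.
print("duplicate ids:", df["customer_id"].duplicated().sum())

# Validity: values must fall within allowable parameters (e.g., a plausible age range).
print("invalid ages:", ((df["age"] < 0) | (df["age"] > 120)).sum())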

Data Operations in big data


 DataOps is an Agile approach to designing, implementing and maintaining a distributed data
architecture that will support a wide range of open source tools and frameworks in production.
 The goal of DataOps is to create business value from big data.

Benefits of DataOps
Transitioning to a DataOps strategy can bring an organization the following benefits:
 Provides more trustworthy real-time data insights.
 Reduces the cycle time of data science applications.
 Enables better communication and collaboration among teams and team members.
 Increases transparency by using data analysis to predict all possible scenarios.
 Builds processes to be reproducible and to reuse code whenever possible.
 Ensures better quality data.
 Creates a unified, interoperable data hub.
Key data operations in a big data environment include:
1. Data ingestion: This process is about bringing in data from different sources like databases, IoT
devices, social media, and others into the big data platform.
2. Data Preparation: This operation involves cleaning, formatting, transforming, and aggregating data in
various formats so that it makes sense to analyze in a big data application.
3. Data Analysis: Data analysis is the process of using mathematical and statistical tools to perform
queries and calculations on the data.
4. Data Visualization: Data visualization helps in representing data in a graphical and more accessible
format that allows for better decision-making.

5. Data Storage: The data storage aspect of big data operations involves storing the massive amounts
of data that big data applications usually deal with.

6. Data Retrieval: This operation involves retrieving data from the storage as per the requirements of
the application.

7. Data Security: Data security operations involve protecting data by implementing various security
measures like encryption, access controls, etc.

8. Data Governance: Data governance refers to the process of managing the overall data lifecycle,
including data quality, data lineage, and data policies.

Data Ingestion in big data

Data ingestion is the process of collecting and importing large volumes of data from various sources
into a data storage system. In big data, data ingestion plays a crucial role in enabling data-driven
decision making by ensuring that the data is cleaned, processed, and ready for analysis.
There are several challenges associated with data ingestion in big data, including:

1. Data Velocity: The speed at which data is generated is increasing rapidly, and it is becoming more
challenging to ingest data in real-time or near real-time.

2. Data Variety: Big data comprises multiple data types, such as structured, semi-structured, and
unstructured data, which require different ingestion techniques.

3. Data Volume: The size of data is growing exponentially, and it is challenging to store and manage
such vast amounts of data.

4. Data Quality: Big data often includes data from multiple sources, which can be inconsistent or
contain errors, making it challenging to ensure data quality.

To overcome these challenges, organizations use various tools and technologies such as Apache Kafka,
Apache NiFi, Apache Flume, and AWS Glue, which facilitate data ingestion from multiple sources, clean
and process data, and prepare it for analysis. The ingestion process involves filtering, transforming,
and validating data to ensure its accuracy and consistency, which is crucial for making data-driven
decisions.

Steps involved in Data Ingestion in big data


1. Data Identification: The first step in data ingestion is to identify the data. This includes identifying
the sources of data, what type of data needs to be collected, and how often it needs to be collected.

2. Data Extraction: Once the data sources have been identified, the data needs to be extracted from
those sources. This may involve data extraction from a variety of sources such as databases, social
media platforms, websites, sensors, and devices.

3. Data Transformation: After data has been extracted, it often needs to be transformed into a
format suitable for processing. This may involve data cleaning, normalization, and transformation into
a standard format.

4. Data Validation: Before data can be processed, it needs to be validated to ensure that it is accurate
and complete. This may involve data validation against predefined rules or patterns.

5. Data Load: Once data has been validated, it can be loaded into a big data system. This may involve
loading data onto a Hadoop Distributed File System (HDFS) or other big data platforms.

6. Data Processing: Once the data has been loaded into the big data system, it can be processed. This
may involve data analysis, machine learning, or other advanced processing techniques.

7. Data Storage: Finally, processed data needs to be stored in a manner that is accessible and usable
by other systems. This may involve storing data in a NoSQL or SQL database, or making it available via
REST APIs.
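As one hedged example of moving extracted records into the pipeline, the sketch below publishes them to a Kafka topic using the kafka-python package; it assumes that package is installed, that a broker is reachable at localhost:9092, and that the topic and field names are placeholders.

import json
from kafka import KafkaProducer

# Connect to the ingestion layer (a Kafka broker is assumed at localhost:9092).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Records extracted from a source system (hypothetical sensor readings).
readings = [
    {"sensor_id": "s1", "temperature": 21.4},
    {"sensor_id": "s2", "temperature": 19.8},
]

# Send each record to the topic that downstream processing jobs consume.
for r in readings:
    producer.send("sensor-readings", value=r)

producer.flush()  # make sure everything is delivered before exiting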

Scalability of Traditional DBMS


 Traditional DBMS (Database Management Systems) have long been used by organizations to
store, manage, and retrieve data efficiently. However, scalability is one of the key challenges
of these systems.
 Scalability refers to the ability of a system to handle an increase in workload or user requests
while maintaining its performance and availability. Traditional DBMS may face scalability issues
due to several factors, such as:

1. Hardware limitations: Traditional DBMS are strongly tied to the hardware they run on. As data
volume and user requests increase, the hardware may not be able to cope with the load, resulting in
slow performance or system crashes.

2. Architecture limitations: Traditional DBMS usually follow a centralized architecture, where all the
data is stored on a single server. As the number of users and data volume increase, this architecture
may become a bottleneck, affecting the system's scalability.
3. Cost limitations: Scaling traditional DBMS often requires significant investment in additional
hardware, licenses, and maintenance. This can make it difficult for small and mid-sized businesses to
scale their systems.

To overcome these limitations, some traditional DBMS vendors have introduced technologies such
as sharding, clustering, and replication. However, these techniques are often complex to implement
and require significant technical expertise.

Security in Traditional DBMS

 Traditional DBMS (Database Management System) emphasizes data security as one of the
primary features. In a traditional DBMS, security is enforced through a combination of
authorization, authentication, and access control mechanisms.

Authorization: This mechanism ensures that users are only allowed to access the data that they are
authorized to access. It requires a username and password to gain access to the database, and each
user is granted specific privileges to access the database objects.

Authentication: Authentication is the process of verifying the identity of a user before granting
access to the database. In traditional DBMS, users are authenticated through a username and
password.

Access Control: Access control determines what users are allowed to do with the data. This mechanism
restricts access to database objects by specifying permission levels according to user roles or groups.
For example, an administrator may have full access to the database, while a user may only have access
to a limited set of data.

In addition to these security mechanisms, traditional DBMS also includes other security features such
as audit trails, encryption, and backups to prevent data theft and loss.
Overall, traditional DBMS ensures data security by implementing multiple security measures to protect
the database from unauthorized access, data theft, and loss.

Traditional DBMS advantages


1. Data Integrity: Traditional DBMS systems ensure the accuracy and consistency of data using various
techniques such as locks, transactions, and rollbacks.

2. Scalability: The traditional DBMS architecture supports vertical scaling, enabling the system to handle
an increasing volume of data and users up to the limits of its hardware.

3. Security: Traditional DBMS systems offer robust security mechanisms, including authentication,
authorization, and encryption.

4. Query Optimization: Traditional DBMS systems provide sophisticated query optimization techniques
that enable users to retrieve data quickly and efficiently.
5. ACID Compliance: Traditional DBMS systems adhere to the ACID principle (Atomicity, Consistency,
Isolation, and Durability), ensuring that transactions are reliable and consistent.

6. Support: Traditional DBMS systems have been around for several decades and have a vast support
network of vendors, developers, and users, making them reliable and well-tested.

Traditional DBMS disadvantages


1. High cost: Traditional DBMS systems often require expensive hardware and software licenses,
leading to high costs for organizations.

2. Limited scalability: Traditional DBMS systems can be difficult to scale as organizations grow,
requiring additional investment to maintain performance.

3. Complex management: Traditional DBMS systems require skilled professionals to manage and
maintain them, adding to the cost and complexity of operations.

4. Limited flexibility: Traditional DBMS systems can be inflexible, making it difficult to adapt to
changing business needs and requirements.

5. Data redundancy: Traditional DBMS systems can lead to data redundancy, which can impact data
quality and efficiency.

6. Security risks: Traditional DBMS systems are vulnerable to security risks such as hacking, malware,
and data breaches.

Big Data Management Systems

Big data management systems are software systems that are designed to efficiently and effectively
process and manage very large amounts of data. These systems are typically used in businesses or large
organizations where there is a need to store, process, and analyze large amounts of data on a daily basis.

Some examples of big data management systems include:

1. Hadoop: Hadoop is an open-source software framework that is designed to store and process very
large datasets across clusters of computers. It is highly scalable and can handle petabytes of data.

2. Apache Spark: Apache Spark is a powerful big data processing engine that is designed to perform
distributed processing of large datasets. It is highly efficient and can process data in real-time.

3. MongoDB: MongoDB is a NoSQL database that is designed for managing unstructured and semi-
structured data. It is highly scalable and can handle large volumes of data.

4. Cassandra: Cassandra is a distributed database that is highly scalable and fault-tolerant. It is
designed to handle large volumes of data with high throughput.
5. Amazon Web Services: Amazon Web Services (AWS) offers a range of big data management tools
and services, including AWS Glue, AWS Redshift, and Amazon EMR.

Big data management systems are essential for businesses and organizations that need to manage and
analyze large datasets to gain insights and make data-driven decisions. With the increasing volumes of
data being generated every day, these systems are becoming more important than ever before.

Big Data Management Systems advantages


1. Scalability: Big Data Management Systems are designed to handle vast amounts of data and are
highly scalable. The capacity to store and process data can be expanded easily as the volume of data
grows.

2. Speed: Big Data Management Systems are optimized for high-speed data processing, making it
possible to extract insights and analyze data in real-time.

3. Cost-Effective: Big Data Management Systems are cost-effective as they utilize commodity
hardware, which is less expensive than specialized hardware.

4. Improved Decision Making: Big Data Management Systems provide real-time analysis of large data
sets, enabling informed decision-making based on current and relevant data.

5. Data Security: Big Data Management Systems provide advanced security features, including robust
encryption, access controls, and auditing, ensuring the safety and privacy of sensitive data.

Big Data Management Systems disadvantages


1. Cost: Big data management systems can be very expensive to set up, maintain, and scale. The
hardware and software required to handle large volumes of data can be costly, and organizations may
need to hire specialized staff to manage and analyze the data.

2. Complexity: Big data management systems are typically more complex than traditional data
management systems. They require sophisticated software and hardware configurations, as well as
advanced analytical skills to effectively manage, query, and analyze the data.

3. Security: Big data management systems are often targeted by cybercriminals seeking to steal
sensitive data or disrupt operations. This makes security a critical concern, and organizations must
invest in robust security measures to prevent data breaches and other types of cyberattacks.

Difference between Traditional data and Big data


1. Traditional data: Traditional data is the structured data that is majorly maintained by all types of
businesses, from very small firms to big organizations. In a traditional database system, a centralized
database architecture is used to store and maintain the data in a fixed format or fields in a file.
Structured Query Language (SQL) is used for managing and accessing the data.
2. Big data: We can consider big data an upgraded version of traditional data. Big data deals with data
sets that are too large or complex to manage in traditional data-processing application software. It deals
with large volumes of structured, semi-structured, and unstructured data. Volume, Velocity, Variety,
Veracity, and Value are the 5 V characteristics of big data. Big data not only refers to a large amount of
data; it also refers to extracting meaningful information by analyzing these huge, complex data sets.

Traditional Data | Big Data
Traditional data is generated at the enterprise level. | Big data is generated outside the enterprise level.
Its volume ranges from gigabytes to terabytes. | Its volume ranges from petabytes to zettabytes or exabytes.
The traditional database system deals with structured data. | The big data system deals with structured, semi-structured, and unstructured data.
Traditional data is generated per hour, per day, or at longer intervals. | Big data is generated far more frequently, often every second.
The traditional data source is centralized and managed in a centralized form. | The big data source is distributed and managed in a distributed form.
Data integration is very easy. | Data integration is very difficult.
A normal system configuration is capable of processing traditional data. | A high-end system configuration is required to process big data.
The size of the data is very small. | The size is much larger than traditional data.
Ordinary database tools are sufficient to perform any database operation. | Special kinds of database tools are required to perform database operations.
Normal functions can manipulate the data. | Special kinds of functions are needed to manipulate the data.
Its data model is strict, schema-based, and static. | Its data model is a flat schema and is dynamic.
Traditional data is stable, with known inter-relationships. | Big data is not stable, with unknown relationships.
Traditional data is in a manageable volume. | Big data is in a huge volume that becomes unmanageable.
It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.

Data Model
Def
A data model is an abstract model that organizes elements of data and standardizes how they relate
to one another and to the properties of real-world entities. For instance, a data model may specify that
the data element representing a car be composed of a number of other elements which, in turn,
represent the color and size of the car and define its owner.

Structure
A data model structure is a logical representation of the relationships between data elements in a
system or application. It defines how data is organized, stored, and retrieved in a database, and
provides a blueprint for database design. There are several types of data model structures, including:

1. Hierarchical model: This type of model represents data as a tree-like structure. Each record has a
parent record and zero or more child records.

2. Network model: In this model, data is organized as a collection of records or objects, and
relationships between them are represented as links or pointers.

3. Relational model: This is a widely used model where data is represented as tables with rows and
columns. The relationship between tables is defined by common fields.

4. Object-oriented model: This type of model represents data as a collection of objects, with each
object having a set of attributes and methods.

5. Document model: In this model, data is stored as nested documents or collections, with each
document having a unique identifier and key-value pairs.

6. Graph model: This model is used to represent complex networks, with data being stored as nodes
and edges.

The choice of data model structure depends on the specific needs of the application or system, as well
as the type of data being stored and the expected usage patterns.

Data Model operations

1. Create: This operation allows users to add new data to the model. It involves defining the structure
and properties of the new data and assigning it to the appropriate entity in the model.
2. Read: This operation allows users to retrieve data from the model. It involves specifying the exact
data required by the user, and the model returns the requested data from the relevant entity.

3. Update: This operation allows users to modify existing data in the model. It involves making changes
to the properties of the data currently stored in the entity.

4. Delete: This operation allows users to remove data from the model. It involves identifying the data
to be deleted, and the model removes the data from the relevant entity.

Data model constraints


Data model constraints are rules that are applied to the data in a database to ensure data integrity
and consistency.

1. Entity Integrity Constraint: Each entity in a database must have a unique identifier.

2. Referential Integrity Constraint: The relationship between entities must be consistent, and each
foreign key must match a primary key.

3. Domain Constraint: The values of an attribute of an entity must be from an allowed set of values.

4. Cardinality Constraint: The minimum and maximum number of relationships between entities must
be explicitly defined.

5. Structural Constraint: The structure of the data model must be consistent and well-defined.

6. Business Rule Constraint: The model must adhere to the business rules and requirements that govern
the organization.

7. Time Constraint: The data model must be designed to accommodate future changes and adjustments.

8. Execution Constraint: The model must be implementable given the technology and resources available
to the organization.

9. Security Constraint: The model must meet the security requirements of the organization, including
access control and privacy.

10. Usability Constraint: The model must be user-friendly, easy to navigate, and intuitive for users to
work with.
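As a brief illustration, a few of these constraints look as follows when a model is implemented in a relational database; SQLite via Python is used here purely as an example, and the tables are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity in SQLite

conn.execute("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,              -- entity integrity: unique identifier
    state       TEXT CHECK (length(state) = 2)    -- domain constraint: allowed values
)
""")

conn.execute("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    total       REAL CHECK (total >= 0),          -- business-rule constraint
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)  -- referential integrity
)
""")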

3 main types of big data


1. Structured data
Structured big data is data stored in a fixed schema. Most commonly, this means it is stored in
a relational database management system or RDBMS. This data is stored in tables where each record
has a fixed set of properties, and each property has a fixed data type.
One example is user records in a database:
ID Email Name City State ZIP code
1 bob@example.com Bob Kansas City MO 64030
2 sara@example.com Sara Chicago IL 60007
3 sam@example.com Sam New York NY 10001
Every record in this table has the same structure, and each property has a specific type. For example,

the State column is limited to two uppercase letters, and the ID and ZIP code columns are limited to

integers.

If you attempt to insert a record in the database that does not fit this structure, it will not allow it,

and an error will be shown.

Structured big data is typically relational. This means that a record such as the user table above can

be linked to a record or records in another table. Let's say the user table is for a shopping cart, and

each user has orders.

ID User_ID Item Total
1 1 Cup 2.00
2 2 Bowl 4.00
3 2 Plate 3.00
4 4 Spoon 1.00
The User_ID property of the order table above links orders to the IDs in the user table. We can see

that Sara has two orders, and Sam hasn't ordered yet.

This type of static structure makes the data consistent and easy to enter, query, and organize. The

language used to query database tables like these is SQL (Structured Query Language). Using SQL,

developers can write queries that join the records in database tables in endless combinations based on

their relationships.
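A minimal sketch of such a join, loading the two example tables above into SQLite from Python; the setup is illustrative, and any relational database would accept the same query.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, item TEXT, total REAL)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Bob"), (2, "Sara"), (3, "Sam")])
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)",
                 [(1, 1, "Cup", 2.00), (2, 2, "Bowl", 4.00), (3, 2, "Plate", 3.00), (4, 4, "Spoon", 1.00)])

# Join the two tables on the user id to list each user's orders.
rows = conn.execute("""
    SELECT users.name, orders.item, orders.total
    FROM users JOIN orders ON orders.user_id = users.id
    ORDER BY users.name
""").fetchall()
print(rows)  # Sara appears twice (two orders); Sam has none, and the Spoon order has no matching user, so neither shows up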

The disadvantage of structured data is that updating the structure of a table can be a complex process.

A lot of thought must be put into table structures before you even begin using the database. This type

of big data is not as flexible as semi-structured data.

2. Unstructured data
Everything that is stored digitally is data. Unstructured data includes text, email, video, audio, server

logs, webpages, and on and on. Unlike structured and semi-structured data that can be queried and

searched in a consistent manner, unstructured data doesn't follow a consistent data model.

This means that instead of simply using queries to turn this data into useful information, a more

complex process must be used, depending on the data source.

This is where machine learning, artificial intelligence, natural language processing, and optical

character recognition (OCR) can be useful.

One example of unstructured data is scanned receipts that are stored for expense reports. In their

native image format, the data is essentially useless. Here, OCR software can turn the images into

structured data that can then be inserted into a database.


The disadvantage of unstructured big data is that it is hard to process, and each data source needs a

custom processor. Advantages include the mere existence of many types of unstructured data, as the

insights gathered from it often can't be found in any other data source.

3. Semi-structured data
Semi-structured big data fits somewhere between structured and unstructured data. A common source

of semi-structured data is from NoSQL databases.

The data in a NoSQL database is organized, but it isn't relational and doesn't follow a consistent

schema.

For example, a user record in a NoSQL database may look like this:

{ _id: ObjectId("5effaa5662679b5af2c57829"), email: "sam@example.com", name: "Sam",

address: "101 Main Street", city: "Independence", state: "Iowa" }

Here, users access the data they need by the keys in the record. And while it looks similar to the

records in the structured data example above, it isn't in a consistent table format.

Instead, it's in JSON format, which is used to store and transmit data objects. While this one

record in the database may have this set of attributes, it doesn't mean the rest of the records will

have the same structure.

An advantage of semi-structured data stored in a NoSQL database is that it is very flexible. If you

need to add more data to a record, simply add it with a new key. This can also be a disadvantage if you

need data to be consistent.
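A short sketch of how such a flexible record is handled in code, here simply as a Python dictionary parsed from JSON; the ObjectId is simplified to a plain string, and the values mirror the example above.

import json

record = json.loads("""
{
  "_id": "5effaa5662679b5af2c57829",
  "email": "sam@example.com",
  "name": "Sam",
  "address": "101 Main Street",
  "city": "Independence",
  "state": "Iowa"
}
""")

# Access data by key, as in a NoSQL document store.
print(record["email"])

# Flexibility: a new attribute can be added to this one record without changing
# any schema; other records need not have the same keys.
record["loyalty_tier"] = "gold"
print(record)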

But NoSQL data isn't the only type of semi-structured big data. XML and YAML are two other flexible

data formats that applications use to transfer and store data.

Email can also be considered semi-structured data since parts of it can be parsed consistently, such as

email addresses, time sent, and IP addresses, while the body is unstructured data.

Comparing structured, semi-structured, and unstructured data

Format
o Structured: most commonly data from relational databases, where the data is arranged in structured tables and has specific types such as integer, float, and text.
o Semi-structured: most commonly data from NoSQL databases, transferred in a data serialization language such as JSON, XML, or YAML.
o Unstructured: doesn't follow any schema and can take the form of log files, raw text, images, videos, and more.

Querying
o Structured: can be queried quickly with SQL in a structured and consistent way.
o Semi-structured: this data can be queried, but due to its semi-structured nature, records may not be consistent.
o Unstructured: the raw data must be parsed and processed with custom code in many cases.

Transactions
o Structured: databases support transactions to ensure dependent data is updated.
o Semi-structured: transactions are partially supported in NoSQL databases.
o Unstructured: transactions are not possible with unstructured data.

Flexibility
o Structured: structured data sets have a complex update process and are not very flexible.
o Semi-structured: NoSQL databases are flexible because data schemas can be updated dynamically.
o Unstructured: unstructured data is the most flexible but also the hardest to process.

Cluster, Setting up a Hadoop Cluster (Dec-2022)


Cluster Def:

In big data, a cluster refers to a group of interconnected computers or servers that work together
to process and store large amounts of data.

A cluster is typically made up of commodity hardware that is connected via a network, allowing the
resources of each node in the cluster to be pooled and used to handle the processing and storage
needs of large datasets.

The primary benefit of clustering in big data is that it allows for parallel processing of data across
multiple nodes, which can greatly reduce the time it takes to analyze and process large datasets.

Clusters can also provide high availability and fault tolerance, since the data is replicated across
multiple nodes in the cluster, and can be accessed even if one or more nodes fail.

Hadoop Cluster Architecture

Hadoop clusters are composed of a network of master and worker nodes that orchestrate and
execute the various jobs across the Hadoop distributed file system. The master nodes typically
utilize higher quality hardware and include a NameNode, Secondary NameNode, and JobTracker, with
each running on a separate machine.

The workers consist of virtual machines, running both DataNode and TaskTracker services on
commodity hardware, and do the actual work of storing and processing the jobs as directed by the
master nodes. The final part of the system are the Client Nodes, which are responsible for loading
the data and fetching the results.

 Master nodes oversee the storage of data in HDFS and key operations on the cluster, such
as running parallel computations on the data using MapReduce.

 The worker nodes comprise most of the virtual machines in a Hadoop cluster, and perform the
job of storing the data and running computations. Each worker node runs the DataNode and
TaskTracker services, which are used to receive the instructions from the master nodes.

 Client nodes are in charge of loading the data into the cluster. Client nodes first submit
MapReduce jobs describing how data needs to be processed and then fetch the results once
the processing is finished.
What is cluster size in Hadoop?

A Hadoop cluster size is a set of metrics that defines storage and compute capabilities to run
Hadoop workloads, namely :

 Number of nodes : number of Master nodes, number of Edge Nodes, number of Worker
Nodes.

 Configuration of each type node: number of cores per node, RAM and Disk Volume.

What are the advantages of a Hadoop Cluster?

 Hadoop clusters can boost the processing speed of many big data analytics jobs, given their
ability to break down large computational tasks into smaller tasks that can be run in a parallel,
distributed fashion.

 Hadoop clusters are easily scalable and can quickly add nodes to increase throughput, and
maintain processing speed, when faced with increasing data blocks.

 The use of low cost, high availability commodity hardware makes Hadoop clusters relatively
easy and inexpensive to set up and maintain.

 Hadoop clusters replicate a data set across the distributed file system, making them resilient
to data loss and cluster failure.

 Hadoop clusters make it possible to integrate and leverage data from multiple different
source systems and data formats.

 It is possible to deploy Hadoop using a single-node installation, for evaluation purposes.

Setting up a Hadoop cluster involves several steps, which are as follows:

Choose the Hardware: First, you need to choose the hardware for your Hadoop cluster. Hadoop is
designed to run on commodity hardware, so you don't need to invest in expensive hardware. You
should choose servers with sufficient processing power, memory, and disk space to handle the
workload of your Hadoop applications.

Install Hadoop: Next, you need to install Hadoop on your cluster. Hadoop consists of several
components, including the Hadoop Distributed File System (HDFS) and MapReduce, which are
responsible for storing and processing data, respectively. You can download Hadoop from the Apache
website and follow the installation instructions.

Configure Hadoop: Once Hadoop is installed, you need to configure it for your specific cluster. This
involves setting various configuration parameters, such as the amount of memory and disk space
allocated to Hadoop, the number of nodes in the cluster, and the replication factor for data stored in
HDFS.

Set up Networking: You need to ensure that all nodes in the cluster are connected to each other and
can communicate over the network. This involves configuring IP addresses, hostnames, and network
settings for each node.
Test the Cluster: After configuring the Hadoop cluster, you should test it to ensure that it is
working correctly. You can run sample Hadoop applications or use tools such as Hadoop Streaming to
process data on the cluster.

Scale the Cluster: Finally, you can scale the cluster by adding more nodes to handle larger workloads.
You can add new nodes to the cluster and configure them to join the existing cluster.
Big Data Integration

Big Data Integration refers to the process of combining, cleaning, and transforming large and

complex data sets from disparate sources to make it usable and useful for analysis and decision-

making purposes. It involves extracting data from multiple sources, integrating and transforming it,

and loading it into a target system such as a data warehouse or a data lake.

Here is an example of Big Data Integration:

Suppose a company wants to analyze its sales data from multiple sources, including online sales, retail

stores, and call centers. These data sources store information in different formats, such as

structured data, semi-structured data, and unstructured data. The company needs to integrate these

different data sources into a single data repository for analysis.

1. The first step is to extract data from each of the sources, including data from the online

sales system, point-of-sale (POS) systems in retail stores, and call center logs. This data may

include customer information, sales transactions, product information, and other relevant

details.

2. Once the data has been extracted, it needs to be transformed and cleaned. This involves

converting the data into a common format, resolving data quality issues, and identifying any

duplicates or errors in the data. This is a crucial step as it ensures that the data is consistent

and accurate, which is important for accurate analysis and decision-making.

3. After the data has been transformed and cleaned, it is loaded into a data warehouse or a data

lake. The data warehouse is used for structured data, while the data lake is used for semi-
structured and unstructured data. The integrated data can now be used for analysis,

reporting, and decision-making.

In conclusion, Big Data Integration is a complex process that requires a combination of technical skills,

tools, and methodologies to bring together data from multiple sources and make it usable and useful for

analysis and decision-making purposes.
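A hedged miniature of this integration flow using pandas; the file names, column names, and cleaning rules are assumptions standing in for the real source systems, and writing Parquet assumes the pyarrow engine is available.

import pandas as pd

# Extract: sales from two source systems with different shapes.
online = pd.read_csv("online_sales.csv")   # e.g., columns: order_id, customer_email, amount
stores = pd.read_json("pos_sales.json")    # e.g., columns: order_id, email, total

# Transform: bring both sources to a common format and clean them.
stores = stores.rename(columns={"email": "customer_email", "total": "amount"})
combined = pd.concat([online, stores], ignore_index=True)
combined["customer_email"] = combined["customer_email"].str.lower()
combined = combined.drop_duplicates(subset="order_id")

# Load: write the reconciled data set to the analytics store (Parquet here).
combined.to_parquet("integrated_sales.parquet")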

Big Data Integration has both advantages and disadvantages. Here are some of them:

Advantages:

1. Improved data quality: Integration enables data cleansing, validation, and standardization

across multiple data sources, improving data quality and accuracy.

2. Better decision-making: Integration enables businesses to make better decisions based on a

more complete and accurate view of their data.

3. Cost savings: Integration helps businesses save money by reducing duplication of data and

eliminating the need to maintain multiple data systems.

4. Enhanced productivity: Integration streamlines data processes, reducing the time and

resources needed to manage data.

Disadvantages:

1. Complexity: Big Data Integration can be a complex process, requiring technical skills and

specialized tools, which can be costly and time-consuming.

2. Security risks: Integration can create security risks, as it involves transferring data between

multiple systems, increasing the risk of data breaches and other security threats.

3. Data inconsistency: Integration may result in data inconsistencies if data is not transformed

and cleaned correctly, leading to incorrect conclusions and decisions.

4. Scalability: Integration can become more difficult and complex as data volumes increase,

making it difficult to scale up the process.


Big Data Processing

Big Data Processing is the process of collecting, storing, processing, and analyzing large and complex

data sets that cannot be processed by traditional data processing systems. Big Data Processing

enables organizations to gain insights into their data, make better decisions, and create new business

opportunities.

Here is an example of Big Data Processing:

Suppose a financial institution wants to analyze its customers' financial transactions to identify

potential fraudulent activities. The institution collects transaction data from multiple sources,

including ATM withdrawals, credit card transactions, and online transactions. The data includes

customer information, transaction dates, amounts, and locations.

1. The first step in Big Data Processing is to store the data in a distributed file system, such as

Hadoop Distributed File System (HDFS). The data is divided into smaller chunks and

distributed across multiple servers to enable parallel processing.

2. The next step is to process the data using a processing framework, such as Apache Spark or

Apache Flink. The processing framework enables the institution to analyze the data in real-time,
identifying patterns and anomalies that may indicate fraudulent activities.

For example, the institution may use machine learning algorithms to analyze the transaction

data and identify patterns that are indicative of fraudulent activities, such as unusually large

transactions, transactions made in different locations at the same time, or transactions made

outside of the customer's usual spending patterns.

Once the processing is complete, the institution can store the results in a data warehouse or data

lake for further analysis and reporting. The results may be used to generate alerts for potential

fraudulent activities, identify new trends and patterns, and improve the institution's overall fraud

prevention system.
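As an illustrative sketch only (not the institution's actual rules or model), a simple threshold check in Spark could flag transactions that are far above a customer's usual spend; the input path and column names are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("FraudFlags").getOrCreate()

// Hypothetical transaction extract with customer_id, amount and location columns.
val tx = spark.read.option("header", "true").option("inferSchema", "true")
  .csv("hdfs:///data/transactions")

// Compute each customer's average spend, then flag transactions well above it
// (a crude stand-in for a real anomaly-detection or machine learning model).
val perCustomer = tx.groupBy("customer_id").agg(avg("amount").as("avg_amount"))
val flagged = tx.join(perCustomer, "customer_id")
  .where(col("amount") > col("avg_amount") * 5)

flagged.write.mode("overwrite").parquet("hdfs:///alerts/suspicious_transactions")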

In conclusion, Big Data Processing is a complex process that involves storing, processing, and

analyzing large and complex data sets. It enables organizations to gain insights into their data, make

better decisions, and create new business opportunities. With the right tools and expertise,

businesses can harness the power of Big Data Processing to transform their operations and gain a

competitive edge in the market.


Advantages:

1. Scalability: Big Data Processing enables businesses to process large volumes of data quickly

and efficiently, allowing them to scale up their operations as needed.

2. Real-time processing: Big Data Processing enables real-time processing of data, allowing

businesses to respond to events as they happen and make better decisions in real-time.

3. Improved insights: Big Data Processing enables businesses to gain deeper insights into their

data, identifying patterns and trends that were not previously visible.

4. Cost savings: Big Data Processing can be less expensive than traditional data processing, as it

uses open-source technologies and can run on commodity hardware.

Disadvantages:

1. Complexity: Big Data Processing can be complex, requiring specialized tools, technologies, and

expertise, which can be costly and time-consuming to acquire and implement.

2. Security risks: Big Data Processing involves collecting, storing, and processing large volumes of

data, increasing the risk of data breaches and other security threats.

3. Data quality issues: Big Data Processing can be prone to data quality issues if data is not

cleansed, validated, and standardized before processing.

4. Lack of standardization: Big Data Processing involves data from multiple sources, which may

not be standardized, leading to data inconsistency and incorrect conclusions.

Retrieving of Data query

Retrieving data query in Big Data involves querying and retrieving large volumes of data from

distributed file systems and NoSQL databases. Big Data processing frameworks like Apache Hadoop

and Apache Spark provide tools for querying Big Data using SQL-like languages or specialized query

languages designed for Big Data.

Here are some common methods for retrieving data query in Big Data:

1. Hive Query Language (HQL): Hive is a data warehousing tool that enables SQL-like querying

of Big Data stored in Hadoop Distributed File System (HDFS). HQL is a SQL-like language

used to retrieve data from structured data sources in Hadoop.


For example, the following HQL query retrieves all the records from a table called "sales"

where the product is "iPhone":

SELECT * FROM sales WHERE product = 'iPhone';

2. Pig Latin: Pig is a high-level data processing language used for large-scale data processing on

Hadoop. Pig Latin is a language used to write scripts to transform and query large datasets in

Hadoop.

For example, the following Pig Latin script retrieves all the records from a dataset called "users"

where the age is greater than 30:

users = LOAD 'hdfs://path/to/users' USING PigStorage(',') AS (name:chararray, age:int, gender:chararray);
result = FILTER users BY age > 30;
DUMP result;

3. Apache Spark SQL: Spark SQL is a data processing module in Apache Spark that enables

querying and processing of structured and semi-structured data using SQL-like queries.

For example, the following Spark SQL query retrieves all the records from a table called

"transactions" where the amount is greater than 1000:

val transactions = spark.read.format("csv")
  .option("header", "true")      // assumes the CSV file has a header row containing an "amount" column
  .option("inferSchema", "true") // needed so that "amount" is read as a number rather than a string
  .load("hdfs://path/to/transactions.csv")
transactions.createOrReplaceTempView("transactions")
val result = spark.sql("SELECT * FROM transactions WHERE amount > 1000")
result.show()

In conclusion, retrieving data query in Big Data involves using specialized tools and languages to query

and retrieve large volumes of data from distributed file systems and NoSQL databases. The choice

of tool and language depends on the type of data being queried and the specific requirements of the

query.

data retrieval in big data

Data retrieval in Big Data refers to the process of querying and retrieving large volumes of

structured and unstructured data stored in distributed file systems and NoSQL databases.

Retrieving data from Big Data requires specialized tools and techniques to process and analyze large

volumes of data efficiently.

Here are some common methods used for data retrieval in Big Data:
1. Distributed File Systems: Distributed file systems like Apache Hadoop Distributed File

System (HDFS) store and manage large volumes of structured and unstructured data across

multiple nodes in a cluster. Data retrieval from HDFS involves using tools like Apache Hive or

Apache Pig, which enable SQL-like querying of data stored in HDFS.

2. NoSQL Databases: NoSQL databases like MongoDB, Cassandra, and HBase store

unstructured data and provide flexible data models that enable querying of data using

specialized query languages like MongoDB Query Language (MQL) or Cassandra Query

Language (CQL).

3. Apache Spark: Apache Spark is a data processing framework that enables fast and efficient

processing of large volumes of data using distributed computing techniques. Spark provides

modules like Spark SQL, Spark Streaming, and Spark Machine Learning (ML) that enable

querying and processing of structured and unstructured data.

4. Data Warehousing: Data warehousing tools like Amazon Redshift or Google BigQuery enable

querying and analysis of large volumes of structured data stored in cloud-based data

warehouses. These tools enable querying of data using SQL-like languages and provide scalable

and cost-effective solutions for data retrieval.

In conclusion, data retrieval in Big Data involves using specialized tools and techniques to query and

retrieve large volumes of structured and unstructured data stored in distributed file systems and

NoSQL databases. The choice of tool and technique depends on the type of data being retrieved and

the specific requirements of the query.

Information Integration in big data

Information integration in Big Data refers to the process of integrating data from multiple sources

and systems to provide a unified view of data. Big Data systems typically involve large volumes of

data from multiple sources, such as social media, IoT devices, and transactional systems, which may

have different data structures and formats.

Here are some common methods used for information integration in Big Data:

1. ETL (Extract, Transform, Load): ETL is a process that involves extracting data from various

sources, transforming it into a common format, and loading it into a target database or data

warehouse. ETL tools like Apache NiFi, Talend, and Pentaho enable integration of data from

multiple sources into a common format.


2. Data Virtualization: Data virtualization enables integration of data from multiple sources

without physically copying the data into a central repository. Data virtualization tools like

Denodo and Composite enable integration of data from multiple sources in real-time and

provide a unified view of data.

3. Master Data Management (MDM): MDM is a process that involves creating a single,

authoritative source of master data that can be used by various systems and applications.

MDM tools like Informatica and Talend enable integration of master data from multiple

sources into a common repository.

4. Schema-on-Read: Schema-on-Read is an approach that involves reading data in its native

format and applying a schema as it is read. This enables integration of data from multiple

sources without the need for a predefined schema. Tools like Apache Drill and Apache Impala

enable querying of data from multiple sources using Schema-on-Read approach.
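To make the schema-on-read idea concrete, here is a small sketch using Spark instead of Drill or Impala: the schema is inferred from the raw JSON at read time rather than being declared up front (the path and the device_id field are assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SchemaOnRead").getOrCreate()

// No schema is declared in advance; Spark derives one from the JSON records
// when the data is read, which is the essence of schema-on-read.
val events = spark.read.json("hdfs:///raw/iot_events")
events.printSchema()

// The inferred columns can be queried immediately.
events.createOrReplaceTempView("events")
spark.sql("SELECT device_id, COUNT(*) AS readings FROM events GROUP BY device_id").show()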

In conclusion, information integration in Big Data involves using specialized tools and techniques to

integrate data from multiple sources and systems to provide a unified view of data. The choice of

tool and technique depends on the type of data being integrated, the data structures and formats,

and the specific requirements of the integration.

Examples of technologies available to integrate information include deduplication and string metrics, which allow the detection of similar text in different data sources by fuzzy matching. A host of methods for these research areas is available, such as those presented by the International Society of Information Fusion. Other methods rely on causal estimates of the outcomes based on a model of the sources.
Big Data Processing pipelines

Big Data processing pipelines refer to a set of processes and technologies that are used to collect,

store, process, and analyze large volumes of data. These pipelines are designed to handle the

enormous volume, velocity, and variety of data generated by various sources such as social media, IoT

devices, and transactional systems.

Here are the common stages of Big Data processing pipelines:

1. Data ingestion: This is the first stage of the pipeline, where data is collected from various

sources such as sensors, applications, or web logs. The data may be ingested in real-time or

batch mode, and various technologies such as Apache Kafka, Apache NiFi, or Flume can be used

to ingest data into the system.

2. Data storage: Once data is ingested, it needs to be stored for further processing and

analysis. Various Big Data storage systems such as Hadoop Distributed File System (HDFS),

Amazon S3, or Azure Blob Storage can be used to store large volumes of data.
3. Data processing: In this stage, the data is processed to derive meaningful insights. The data

can be processed in real-time or batch mode using technologies such as Apache Spark, Apache

Flink, or Apache Storm.

4. Data analysis: In this stage, the processed data is analyzed to derive insights using

technologies such as Apache Hive, Apache Pig, or Apache Drill.

5. Data visualization: The final stage of the pipeline involves presenting the insights obtained

from data analysis in an understandable format using various data visualization tools such as

Tableau, Power BI, or D3.js.

In addition to the above stages, Big Data processing pipelines can include various other components

such as data validation, data cleansing, data transformation, and machine learning models for

predictive analysis.
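A compact sketch of such a pipeline in Spark (Scala), covering ingestion of raw log files already landed in HDFS, basic cleansing, aggregation, and a hand-off to a visualization tool; all paths and column names are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("LogPipeline").getOrCreate()

// Ingestion: read raw web-server logs that were landed in HDFS as CSV.
val raw = spark.read.option("header", "true").csv("hdfs:///ingest/web_logs")

// Processing: cleanse and transform (drop malformed rows, type the timestamp).
val clean = raw
  .na.drop(Seq("user_id", "url"))
  .withColumn("ts", to_timestamp(col("timestamp")))

// Analysis: aggregate page views per day and per URL.
val daily = clean
  .groupBy(to_date(col("ts")).as("day"), col("url"))
  .agg(count(lit(1)).as("views"))

// Hand-off: persist results where a BI / visualization tool can pick them up.
daily.write.mode("overwrite").parquet("hdfs:///serving/daily_page_views")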

In conclusion, Big Data processing pipelines are an essential part of Big Data systems that help

organizations collect, store, process, and analyze large volumes of data. These pipelines can be

customized based on the organization's specific requirements and the type of data being processed.

Analytical operations in big data(dec-2022)


Analytical operations in Big Data refer to the processes and techniques used to extract meaningful

insights from large and complex datasets. These operations involve the use of advanced analytical

methods, machine learning algorithms, and statistical models to derive insights that can be used to

support business decisions.

Let's take an example of how analytical operations can be used in Big Data to improve customer

segmentation for an e-commerce company.

1. Descriptive analytics: The e-commerce company can analyze their historical transaction data

to understand which products are popular among customers and what factors influence their

purchasing decisions. This analysis can help the company identify customer segments based on

purchase behavior, demographics, and product preferences.

2. Diagnostic analytics: The e-commerce company can use diagnostic analytics to identify the

factors that influence customer churn. By analyzing customer feedback and transaction data,

the company can identify patterns that lead to customer dissatisfaction and address these

issues.

3. Predictive analytics: The e-commerce company can use predictive analytics to forecast

customer demand for specific products, which can help them optimize their inventory and

pricing strategies. By analyzing customer behavior and transaction data, the company can

identify which products are likely to be popular in the future and adjust their supply

accordingly.

4. Prescriptive analytics: The e-commerce company can use prescriptive analytics to optimize

their marketing campaigns. By analyzing customer behavior and transaction data, the company

can identify which marketing channels are most effective for different customer segments

and adjust their marketing spend accordingly.

Overall, by applying these analytical operations to their Big Data, the e-commerce company can

improve their customer segmentation, reduce customer churn, optimize their inventory and pricing

strategies, and improve the effectiveness of their marketing campaigns.
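As one hedged illustration of the predictive side, customer segmentation could be sketched with k-means clustering in Spark MLlib; the input path and feature columns are assumptions, and a real project would add proper feature engineering and evaluation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans

val spark = SparkSession.builder().appName("CustomerSegments").getOrCreate()

// Hypothetical per-customer summary with spend, order count and recency columns.
val customers = spark.read.parquet("hdfs:///marts/customer_summary")

// Assemble the numeric columns into a single feature vector for clustering.
val assembler = new VectorAssembler()
  .setInputCols(Array("total_spend", "order_count", "days_since_last_order"))
  .setOutputCol("features")
val features = assembler.transform(customers)

// Cluster customers into a handful of segments.
val model = new KMeans().setK(4).setSeed(42L).fit(features)
val segments = model.transform(features) // adds a "prediction" column = segment id

segments.groupBy("prediction").count().show()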


explain the analytical operations used in big data processing pipelines

Analytical operations are an essential part of Big Data processing pipelines. These operations enable

organizations to analyze large volumes of data and derive insights that can inform business decisions.

Some of the most common analytical operations used in Big Data processing pipelines include:

1. Data Cleaning: Data cleaning involves removing any inconsistencies, errors, or missing values

from the data. This process ensures that the data is accurate and reliable, which is essential

for any data analysis.

2. Data Transformation: Data transformation involves converting data from one format to

another. This can include merging datasets, filtering data, and changing the structure of the

data to make it easier to analyze.

3. Data Aggregation: Data aggregation involves grouping data by specific attributes and

summarizing the data to identify patterns and trends. Common aggregation operations include

counting, summing, averaging, and finding the minimum or maximum value of a particular

attribute.

4. Data Visualization: Data visualization involves creating graphical representations of data to

make it easier to understand and interpret. This can include charts, graphs, and heat maps,

among others.

5. Machine Learning: Machine learning involves using algorithms to analyze data and identify

patterns and trends that are not immediately apparent. This can include classification,

regression, and clustering algorithms, among others.

6. Predictive Analytics: Predictive analytics involves using statistical models and machine learning

algorithms to analyze historical data and make predictions about future events. This can

include forecasting, trend analysis, and predictive modeling, among others.

7. Data Mining: Data mining involves analyzing large volumes of data to identify hidden patterns

and relationships. This can include association rules, sequential patterns, and anomaly

detection, among others.

These analytical operations are typically performed in a specific order as part of a Big Data

processing pipeline. For example, data cleaning and transformation are typically performed first to

ensure that the data is accurate and structured properly. Data aggregation and visualization are then
used to identify patterns and trends in the data, while machine learning, predictive analytics, and

data mining are used to generate insights that can inform business decisions.

Overall, analytical operations are a crucial part of Big Data processing pipelines, enabling

organizations to extract insights and value from large volumes of data.
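As a small example of the data mining operation listed above, frequent-itemset mining over market baskets can be sketched with Spark MLlib's FP-Growth; the baskets here are toy data invented for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.ml.fpm.FPGrowth

val spark = SparkSession.builder().appName("BasketMining").getOrCreate()
import spark.implicits._

// Toy market-basket data: one row per transaction, items as an array.
val baskets = Seq("bread milk", "bread butter milk", "butter jam")
  .toDF("raw")
  .select(split(col("raw"), " ").as("items"))

// Frequent-itemset mining with FP-Growth, a common data-mining operation.
val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(baskets)

model.freqItemsets.show()     // itemsets appearing in at least 50% of baskets
model.associationRules.show() // rules such as bread -> milk with their confidence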

Aggregation operation

Aggregation operations in Big Data refer to the process of grouping and summarizing data to derive

insights and make decisions. Aggregation operations are commonly used in data analytics and business

intelligence applications to summarize large volumes of data into manageable and meaningful insights.

Here is an example of how aggregation operations can be used in Big Data:

Suppose a retail company wants to analyze their sales data to identify trends and patterns in

customer behavior. They have a massive dataset that contains sales data for all their stores across

multiple regions and product categories.

To perform aggregation operations on this dataset, the retail company can use tools like Apache

Spark, Apache Flink, or Hadoop MapReduce. These tools enable the company to perform aggregation

operations, such as count, sum, average, minimum, maximum, and grouping by different dimensions

such as region, store, and product category.

For example, the company can use aggregation operations to answer questions such as:

o What is the total revenue for each region, store, and product category?

o What is the average sales price for each region, store, and product category?

o Which products have the highest and lowest sales revenue?

o Which regions or stores have the highest and lowest sales revenue?

o What is the total number of products sold for each region, store, and product category?

By answering these questions, the retail company can identify trends and patterns in customer

behavior, optimize their inventory and pricing strategies, and make data-driven decisions to improve

their sales and profitability.
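A sketch of how the first few of these questions map onto aggregation operations in Spark (Scala); the input path and column names are assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("SalesAggregation").getOrCreate()

// Hypothetical sales extract with region, store, category, quantity, price and revenue columns.
val sales = spark.read.parquet("hdfs:///warehouse/sales")

// Group by the business dimensions and compute the usual aggregates.
val summary = sales
  .groupBy("region", "store", "category")
  .agg(
    sum("revenue").as("total_revenue"),
    avg("price").as("avg_price"),
    sum("quantity").as("units_sold"),
    max("revenue").as("largest_sale")
  )

summary.orderBy(desc("total_revenue")).show()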

In conclusion, aggregation operations are a crucial aspect of Big Data processing and can provide

valuable insights to organizations looking to make data-driven decisions.


High level Operation

The high-level operations in big data processing can be broadly categorized into three phases: data

ingestion, data processing, and data analysis. Here is an overview of each phase:

1. Data Ingestion: The first phase of big data processing is data ingestion. In this phase, data is

acquired from various sources such as sensors, logs, databases, social media, and other data

feeds. The data is then ingested into the big data platform, which can be a Hadoop cluster or

a cloud-based storage system. The data ingestion process involves data cleansing, data

normalization, and data validation to ensure that the data is accurate and reliable.

2. Data Processing: The second phase of big data processing is data processing. In this phase,

the data is processed using various techniques such as MapReduce, Spark, and Hive to extract

useful information from the data. This phase also involves data transformation, data

enrichment, and data aggregation to prepare the data for analysis. The data processing phase

requires significant computational resources, and organizations often use distributed

computing frameworks to process the data in parallel.

3. Data Analysis: The final phase of big data processing is data analysis. In this phase, the

processed data is analyzed to derive insights and make informed decisions. Data analysis

techniques include data visualization, statistical analysis, machine learning, and predictive

modeling. The results of the data analysis are used to identify trends, patterns, and anomalies

in the data, which can be used to optimize business operations, improve customer satisfaction,

and identify new business opportunities.

Overall, big data processing involves a complex series of operations that require significant

computational resources and expertise. By following these high-level operations, organizations can

leverage the power of big data to gain a competitive advantage and improve their business outcomes.

big data workflow

Big data workflow refers to the series of steps and processes involved in managing, processing, and

analyzing large volumes of data. Here is an example of a typical big data workflow:

1. Data Collection: The first step in the big data workflow is to collect data from various

sources. This can include structured data from databases, semi-structured data from log

files, and unstructured data from social media platforms. For example, a retail company may

collect data on customer purchases, social media activity, and website traffic.
2. Data Ingestion: Once data is collected, it needs to be ingested into a data storage system

such as Hadoop Distributed File System (HDFS) or Amazon S3. This involves converting the

data into a format that can be processed and analyzed by big data tools. For example, a retail

company may use Apache Kafka to ingest customer purchase data in real-time.

3. Data Processing: The next step in the big data workflow is to process the data using big data

tools such as Apache Spark or Hadoop MapReduce. This involves filtering, cleaning, and

transforming the data to prepare it for analysis. For example, a retail company may use

Apache Spark to filter out invalid purchase transactions and clean the data.

4. Data Analysis: Once data is processed, it can be analyzed using various analytical tools such as

Apache Hive or Apache Pig. This involves querying the data to extract insights and patterns

that can be used to make business decisions. For example, a retail company may use Apache

Hive to query customer purchase data and identify popular product categories.

5. Data Visualization: The final step in the big data workflow is to visualize the results of data

analysis using tools such as Tableau or QlikView. This involves creating charts, graphs, and

dashboards that can be used to communicate insights to stakeholders. For example, a retail

company may use Tableau to create a dashboard that displays customer purchase trends and

insights.
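To make the ingestion step (step 2) concrete, here is a minimal sketch using the Kafka producer API from Scala; the broker address, topic name, and message format are assumptions:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker1:9092") // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)

// Each purchase event is published to a hypothetical "purchases" topic, from
// where a stream processor or an HDFS sink connector can pick it up.
val event = """{"order_id":"o-1001","sku":"A42","amount":49.90}"""
producer.send(new ProducerRecord[String, String]("purchases", "o-1001", event))

producer.flush()
producer.close()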

Overall, the big data workflow is a complex process that involves multiple steps and tools. By

following a structured approach to data management and analysis, businesses can leverage the power

of big data to gain insights and make data-driven decisions.

big data management

Big data management refers to the process of organizing, storing, and processing large volumes of

data efficiently and effectively. Here is an example of how big data management can be applied in a

real-world scenario:

Imagine a healthcare company that collects data from multiple sources, such as electronic medical

records, insurance claims, and medical devices. The company wants to use this data to improve patient

outcomes and reduce healthcare costs.

To manage this data, the company would first need to identify the different data sources and the

types of data they collect. This could include patient demographics, medical histories, lab results, and
medication records. The company would then need to determine how to store this data in a way that

allows for easy access and processing.

One approach to storing and managing this data could be to use a big data platform such as Hadoop.

Hadoop provides a distributed file system (HDFS) that can store large volumes of data across multiple servers, along with engines for processing that data in parallel. The company could use Hadoop to store the different types of data in separate data sets or "tables", and use a tool like Apache Spark or Hive to query and analyze the data.

To ensure data quality, the company could implement data cleaning and validation processes to remove

or correct any errors or inconsistencies in the data. They could also use data governance policies and

procedures to ensure that the data is secure, compliant, and accessible to authorized users only.

Once the data is managed and organized, the company could use analytics tools and techniques to gain

insights into patient outcomes and healthcare costs. For example, they could use machine learning

algorithms to predict patient risk factors, identify patterns in disease prevalence, or optimize

treatment plans.

Overall, big data management is a critical aspect of any organization that wants to leverage the

power of big data. By implementing best practices in data storage, processing, and analysis,

businesses can gain valuable insights and make informed decisions that drive business success.

LAST YEAR QUES:

JULY:

1. What is Hadoop ? Explain its components


Ans: Hadoop is an open-source framework for distributed storage and processing of large data

sets. It was originally developed by Apache Software Foundation and is now maintained and

supported by the Apache community.

Hadoop is designed to handle big data, which refers to data that is too large or complex for

traditional data processing systems to handle. It achieves this by breaking down large data

sets into smaller chunks and distributing them across multiple servers or nodes in a cluster.

Components of Hadoop

Hadoop is a framework that uses distributed storage and parallel processing to store and

manage Big Data. It is the most commonly used software to handle Big Data. There are three

components of Hadoop.

1. Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.

2. Hadoop MapReduce - Hadoop MapReduce is the processing unit of Hadoop.


3. Hadoop YARN - Hadoop YARN is a resource management unit of Hadoop.

Hadoop HDFS (separate ques in July 2022, Section B)

Data is stored in a distributed manner in HDFS. There are two components of HDFS - name node and

data node. While there is only one name node, there can be multiple data nodes.

HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise-grade server costs roughly $10,000 per terabyte (for the full processor); if you need to buy 100 of these enterprise servers, the cost goes up to a million dollars.

Hadoop enables you to use commodity machines as your data nodes. This way, you don’t have to spend

millions of dollars just on your data nodes. However, the name node is always an enterprise server.

Where not to use HDFS

o Low Latency data access: Applications that need very fast access to the first record should not use HDFS, because HDFS is optimized for high throughput over the whole dataset rather than for the time taken to fetch the first record.

o Lots of Small Files: The name node holds the metadata of all files in memory, so a very large number of small files consumes a disproportionate amount of name node memory, which is not feasible.

o Multiple Writes: HDFS should not be used when files must be modified repeatedly or written by multiple writers; it is designed for write-once, append-only workloads.

HDFS Concepts

1. Blocks: A block is the minimum amount of data that HDFS can read or write.

HDFS blocks are 128 MB by default, and this is configurable. Files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike an ordinary file system, if a file in HDFS is smaller than the block size, it does not occupy the full block; for example, a 5 MB file stored in HDFS with a 128 MB block size takes only 5 MB of space.

The HDFS block size is kept large to minimize the cost of seeks.

2. Name Node: HDFS works in a master-worker pattern where the name node acts as the master.

The name node is the controller and manager of HDFS, as it knows the status and the metadata of all the files in HDFS; this metadata includes file permissions, file names, and the location of each block. The metadata is small, so it is stored in the name node's memory, allowing fast access. Moreover, since the HDFS cluster is accessed by many clients concurrently, keeping this information on a single machine keeps it consistent.

File system operations such as opening, closing, and renaming files are executed by the name node.
3. Data Node: Data nodes store and retrieve blocks when they are told to by a client or the name node. They report back to the name node periodically with the list of blocks they are storing. Data nodes run on commodity hardware and also perform block creation, deletion, and replication as instructed by the name node.

(Figures omitted: HDFS NameNode/DataNode architecture, HDFS read path, and HDFS write path.)


Since all the metadata is stored in the name node, it is very important. If the name node fails, the file system cannot be used, because there would be no way of knowing how to reconstruct the files from the blocks present on the data nodes. To overcome this, the concept of the secondary name node arises.

Secondary Name Node: It is a separate physical machine which acts as a helper to the name node. It performs periodic checkpoints: it communicates with the name node and takes snapshots of the metadata, which helps minimize downtime and data loss.

HDFS Basic File Operations

HDFS (Hadoop Distributed File System) is designed to store large files efficiently and reliably

on a distributed system. It provides several basic file operations for managing files in HDFS.

1. Creating a file: HDFS allows creating a file using a command-line interface or through an API.

When a file is created, it is divided into smaller blocks, typically 64MB or 128MB, and each block is

replicated across multiple nodes in the cluster for fault tolerance. The file blocks are stored in the

DataNodes, while the metadata related to the file is stored in the NameNode.

To create a file in HDFS, use the following command:

hadoop fs -touchz /path/to/file

2. Copying a file:
To copy a file from a local file system to HDFS, use the following command:

hadoop fs -put /path/to/local/file /path/to/hdfs/

This command copies the file from the local file system to HDFS.

3. Reading a file: Once a file is created, it can be read using standard read operations in Java.

HDFS supports both sequential and random access to files. However, random access is slower due to

the need to perform several network hops to locate the required block.

To read a file from HDFS, use the following command:

hadoop fs -cat /path/to/hdfs/file

This command displays the contents of the specified file in the console.

4. Writing to a file: Once a file is created, it can be written using standard write operations in Java. HDFS follows a write-once model: data can be appended to the end of an existing file, but arbitrary in-place (random) writes are not supported.

To write data to a file in HDFS, use the following command:

echo "data" | hadoop fs -appendToFile - /path/to/hdfs/file

This command appends the specified data to the end of the file.

5. Deleting a file: HDFS allows deleting files and directories using a command-line interface or

through an API. Once a file is deleted, all its replicas across the cluster are also deleted.

To delete a file from HDFS, use the following command:

hadoop fs -rm /path/to/hdfs/file

This command deletes the specified file from HDFS.

6. Renaming a file: HDFS allows renaming and moving files and directories using the command-line interface or through an API. The same move command is used both to rename a file in place and to move it to a different directory.

To rename a file in HDFS, use the following command:

hadoop fs -mv /path/to/hdfs/file /path/to/hdfs/new_file_name

This command renames the specified file to the new file name.
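The same file operations are also available programmatically through the Hadoop FileSystem API; a hedged sketch in Scala (the paths are placeholders chosen for illustration):

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val fs = FileSystem.get(new Configuration())

// Create a file and write some bytes to it.
val path = new Path("/tmp/example.txt")
val out = fs.create(path, true) // overwrite if the file already exists
out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8))
out.close()

// Read the file back.
val in = fs.open(path)
println(Source.fromInputStream(in).mkString)
in.close()

// Rename it, then delete it.
val renamed = new Path("/tmp/example_renamed.txt")
fs.rename(path, renamed)
fs.delete(renamed, false) // false = do not delete recursively
fs.close()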

HDFS Features and Goals


Define HDFS

Features of HDFS

o Highly Scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster.

o Replication - Due to unfavorable conditions, the node containing the data may be lost. To overcome such problems, HDFS always maintains a copy of the data on a different machine.

o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically becomes active.

o Distributed data storage - This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across the nodes of the cluster.

o Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.

Goals of HDFS

o Handling hardware failure - An HDFS cluster contains many server machines, so hardware failure is expected; the goal of HDFS is to detect failures and recover from them quickly.

o Streaming data access - HDFS is built for applications that need streaming (high-throughput, sequential) access to their data sets, rather than the low-latency, interactive access patterns served by general-purpose file systems.

o Coherence Model - Applications that run on HDFS follow the write-once-read-many approach: a file, once created, need not be changed, although it can be appended to and truncated.

Hadoop MapReduce

Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is

done at the slave nodes, and the final result is sent to the master node. Instead of moving the data to the program, the program (code) is sent to the nodes where the data resides; this code is usually very small in comparison to the data itself. You only need to send a few kilobytes worth of code to perform a heavy-duty process across the cluster.
The input dataset is first split into chunks of data. In this example, the input has three lines of text - “bus car train,” “ship ship train,” “bus ship car.” The dataset is split into three chunks, one per line, and the chunks are processed in parallel.

In the map phase, each word is emitted as a key with a value of 1, so every occurrence of “bus,” “car,” “ship,” and “train” produces a (word, 1) pair.

These key-value pairs are then shuffled and sorted based on their keys. In the reduce phase, the values for each key are aggregated, and the final word counts are obtained.
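The same counting logic can be sketched in Scala using Spark's RDD API, which mirrors the map, shuffle, and reduce phases described above; a classic Hadoop MapReduce job would instead implement Java Mapper and Reducer classes, so this is an illustration of the idea rather than the Hadoop API itself:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("WordCount").getOrCreate()
val sc = spark.sparkContext

// The three input chunks from the example above.
val lines = sc.parallelize(Seq("bus car train", "ship ship train", "bus ship car"))

val counts = lines
  .flatMap(_.split(" "))  // map phase: emit each word
  .map(word => (word, 1)) // assign every occurrence a value of 1
  .reduceByKey(_ + _)     // shuffle by key, then reduce: sum the 1s per word

counts.collect().foreach(println) // e.g. (bus,2), (car,2), (ship,3), (train,2)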

Its key features and goals include:

1. Data locality: Hadoop MapReduce is designed to process data where it is stored. This means that the computation is moved to the node holding the data, rather than moving the data to the compute node. This can improve processing times and reduce network traffic.

2. Simplified programming model: Hadoop MapReduce provides a simplified programming model that

allows developers to write code that is easy to read and maintain. It abstracts away many of the

complexities of distributed computing, allowing developers to focus on the business logic of their

application.

GOALS

1. Scalability: Hadoop MapReduce is designed to scale horizontally by adding more nodes to a

Hadoop cluster. This allows for the processing of large data sets in a distributed and parallel manner,

which can improve processing times.


2. Fault tolerance: Hadoop MapReduce is designed to be fault-tolerant, meaning that it can

recover from node failures or other system failures. If a node fails, Hadoop can automatically

reassign the work to another node in the cluster.

3. Flexibility: Hadoop MapReduce is flexible and can be used for a wide range of data processing

tasks, including batch processing, iterative processing, and real-time processing.

Hadoop YARN

Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of
Hadoop and is available as a component of Hadoop version 2.

Hadoop YARN acts like an operating system for Hadoop: it is a resource-management layer that runs on top of HDFS (it is not itself a file system).

It is responsible for managing cluster resources to make sure no single machine is overloaded.

It also performs job scheduling to make sure that jobs run in the right place.

Suppose a client machine wants to run a query or a data-analysis job. The job request goes to the Resource Manager (Hadoop YARN), which is responsible for resource allocation and management across the cluster.

Each node runs its own Node Manager, which manages that node and monitors its resource usage. Containers bundle a collection of physical resources, such as RAM, CPU, and disk. When a job is submitted, an Application Master is launched for it; the Application Master requests containers from the Resource Manager, and the Node Managers launch and monitor those containers on their nodes, reporting resource usage back to the Resource Manager.


FEATURES/ GOALS OF HADOOP YARN

Apache Hadoop YARN (Yet Another Resource Negotiator) is a resource management platform that is

responsible for managing the resources (CPU, memory, and disk) of a Hadoop cluster. Its key

features include:

1. Scalability: Hadoop YARN is designed to scale to large clusters with thousands of nodes,

allowing it to handle the processing of massive amounts of data.

2. Flexibility: Hadoop YARN supports a variety of processing models, including batch processing,

interactive processing, and real-time processing, as well as a wide range of programming

languages and frameworks.

3. Fault tolerance: Hadoop YARN is designed to be fault-tolerant, meaning that it can recover

from node failures or other system failures.

4. Security: Hadoop YARN provides security features, such as authentication and authorization,

to ensure that applications running on the cluster are secure.

2. How do you analyze data in Hadoop?

Ans: See the Big Data processing pipeline and the analytical operations described above.

3. What is real-time analytics? Discuss the related technologies in detail.

Real-time analytics refers to the process of analyzing data as it is generated or received, in order to

provide immediate insights and actionable intelligence. Real-time analytics is used in many industries,

including finance, healthcare, retail, and manufacturing, where real-time decisions can impact

business outcomes. Here are some examples of real-time analytics in action:

1. Fraud Detection: Real-time analytics can be used to detect fraud in financial transactions. For

example, credit card companies use real-time analytics to detect fraudulent transactions by

analyzing transaction data in real-time and identifying suspicious patterns or anomalies. If a

fraudulent transaction is detected, the credit card company can immediately flag the

transaction and notify the customer.

2. Customer Experience: Real-time analytics can be used to personalize customer experiences in

retail and e-commerce. For example, online retailers can use real-time analytics to analyze
customer browsing behavior and provide personalized product recommendations in real-time.

This can improve customer engagement and increase sales.

There are several technologies that are used for real-time analytics, including:

Stream Processing: Stream processing is a technology that allows real-time processing of large

volumes of data as it is generated. Stream processing systems typically use distributed computing to

analyze data streams in real-time, and can be used to detect anomalies, identify patterns, and

generate alerts. Some popular stream processing frameworks include Apache Flink, Apache Kafka,

and Apache Spark Streaming.
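A short sketch of stream processing with Spark Structured Streaming (assuming the spark-sql-kafka connector is available; the broker address, topic, JSON fields, and threshold are all assumptions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("TxStream").getOrCreate()

// Read a stream of transaction events from a hypothetical Kafka topic.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "transactions")
  .load()

// Kafka delivers the value as bytes; parse the JSON payload and flag large amounts.
val parsed = stream.selectExpr("CAST(value AS STRING) AS json")
  .select(get_json_object(col("json"), "$.amount").cast("double").as("amount"),
          get_json_object(col("json"), "$.card_id").as("card_id"))

val alerts = parsed.where(col("amount") > 10000)

// Write potential fraud alerts to the console as they arrive (for illustration only).
val query = alerts.writeStream.format("console").outputMode("append").start()
query.awaitTermination()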

In-Memory Databases: In-memory databases are databases that store data in memory rather than

on disk. This enables faster access to data and allows for real-time analysis of large volumes of data.

In-memory databases are often used in conjunction with stream processing systems to provide real-time analytics capabilities. Some popular in-memory databases include Apache Ignite, MemSQL, and

SAP HANA.

Machine Learning: Machine learning algorithms can be used for real-time analytics by analyzing data

streams in real-time and identifying patterns and anomalies. Machine learning algorithms can be used

to build predictive models, detect fraud, and optimize business processes in real-time. Some popular

machine learning frameworks include TensorFlow, scikit-learn, and PyTorch.

Complex Event Processing: Complex Event Processing (CEP) is a technology that enables the

detection of complex patterns and relationships in real-time data streams. CEP systems can be used

to analyze large volumes of data in real-time and identify meaningful events and patterns. CEP is

often used in industries such as finance and healthcare to detect fraud. Some popular CEP

frameworks include Apache Storm, Apache Apex, and Esper.

write in detail the concept of developing the MapReduce application

MapReduce is a programming model that allows developers to process large volumes of data in a

distributed computing environment. It is commonly used for processing Big Data and is a core

component of many Big Data processing frameworks, such as Apache Hadoop, Apache Spark, and

Amazon EMR.

Developing a MapReduce application involves several steps, including:


1. Defining the problem: The first step in developing a MapReduce application is to define the

problem that needs to be solved. This involves understanding the input data, the processing

requirements, and the output format.

2. Designing the MapReduce architecture: Once the problem has been defined, the next step is

to design the MapReduce architecture. This involves determining how the input data will be

divided into smaller chunks (known as "input splits"), how these input splits will be processed in

parallel by different nodes in the cluster, and how the results will be combined to generate

the final output.

3. Implementing the MapReduce program: The next step is to implement the MapReduce program

using a programming language such as Java, Python, or Scala. The MapReduce program typically

consists of two main functions: the map function and the reduce function.

The map function takes an input record and generates a set of key-value pairs as output. The key-

value pairs are then shuffled and sorted based on the key, and passed to the reduce function.

The reduce function takes a set of key-value pairs as input and generates a single output value for

each key. The output values from the reduce function are then combined to generate the final output.

4. Testing the MapReduce application: After the MapReduce application has been implemented, it

should be tested to ensure that it works correctly. This involves running the application on a

small dataset and verifying that the output is correct.

5. Deploying the MapReduce application: Once the MapReduce application has been tested, it can

be deployed to a distributed computing environment, such as a Hadoop cluster or an Amazon

EMR cluster. The application can then be run on a large dataset to process the data in parallel

and generate the final output.

In conclusion, developing a MapReduce application involves several steps, including defining the

problem, designing the MapReduce architecture, implementing the MapReduce program, testing the

application, and deploying the application to a distributed computing environment. By following these

steps, developers can build efficient and scalable applications for processing Big Data.
