
UNIT 1:

1. WHAT IS BIG DATA


Data that is very large in size is called Big Data. Normally we work with data of size in
MB (Word documents, Excel sheets) or at most GB (movies, code), but data on the scale of petabytes, i.e. 10^15 bytes, is
called Big Data.

Sources of Big Data

These data come from many sources, such as:

o Social networking sites: Facebook, Google and LinkedIn generate huge amounts of data on a
day-to-day basis as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from
which users' buying trends can be traced.
o Weather stations: Weather stations and satellites produce very large volumes of data, which are
stored and manipulated to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their
plans accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their
daily transactions.

Advantages of Big Data:

 Able to handle and process large and complex data sets that cannot be easily managed with
traditional database systems
 Provides a platform for advanced analytics and machine learning applications

Disadvantages of Big Data:

 Requires specialized skills and expertise in data engineering, data management, and big data
tools and technologies
 Can be expensive to implement and maintain due to the need for specialized infrastructure
and software
2. WHAT IS THE PROBLEM OF 3V
The 3 V's (volume, velocity and variety) are three defining properties or dimensions of big data.
Volume refers to the amount of data, velocity refers to the speed of data processing, and variety refers
to the number of types of data.

In data science, the term "3V" is often used to refer to the three primary challenges
associated with big data. These are:

1. Volume: Volume refers to the sheer amount of data that is generated and collected
in today's digital world. With the proliferation of digital devices, sensors, social
media, and more, data is being generated at an unprecedented rate. Managing and
analyzing large volumes of data can be a significant challenge for data scientists.
2. Velocity: Velocity refers to the speed at which data is generated and how quickly it
must be processed and analyzed. Some data streams in real-time, such as social
media posts and sensor data. Data scientists need to develop systems that can
handle this high velocity of data and provide insights in a timely manner.
3. Variety: Variety relates to the diverse types of data that are generated. Data can
come in structured formats (like databases), semi-structured formats (like XML or
JSON), or unstructured formats (like text, images, and videos). Data scientists must be
able to work with and integrate data from various sources and in different formats.

3. WHAT IS EMERGING BIG DATA STACK


A number of emerging technologies are being used to collect, store, and analyse big data,
including Hadoop, NoSQL databases, and cloud computing. While each of these technologies has
its own unique benefits, they all share the ability to handle large amounts of data quickly and
efficiently.

4. WRITE A SHORT NOTE ON GARTNER HYPE CYCLE
Gartner Hype Cycles provide a graphic representation of the maturity and adoption of technologies
and applications, and how they are potentially relevant to solving real business problems and
exploiting new opportunities. Gartner Hype Cycle methodology gives you a view of how a technology
or application will evolve over time, providing a sound source of insight to manage its deployment
within the context of your specific business goals.
The Hype Cycle

The Gartner Hype Cycle is a graphic representation of the maturity lifecycle of new technologies and
innovations divided into five phases: Innovation Trigger, Peak of Inflated Expectations, Trough of
Disillusionment, Slope of Enlightenment, and Plateau of Productivity.
1. Innovation Trigger. A breakthrough, public demonstration, product launch or other event sparks
media and industry interest in a technology or other type of innovation.
2. Peak of Inflated Expectations. The excitement and expectations for the innovation exceed the reality
of its current capabilities. In some cases, a financial bubble may form around the innovation.
3. Trough of Disillusionment. The original overexcitement about the innovation dissipates, and
disillusionment sets in due to performance issues, slower-than-expected adoption or a failure to
deliver timely financial returns.
4. Slope of Enlightenment. Some early adopters overcome the initial hurdles and begin to see the
benefits of the innovation. By learning from the experiences of early adopters, organizations gain a
better understanding of where and how the innovation will deliver significant value (and where it will
not).
5. Plateau of Productivity. The innovation has demonstrated real-world productivity and benefits, and
more organizations feel comfortable with the greatly reduced level of risk. A sharp uptick in adoption
begins until the innovation becomes mainstream.

Here are the typical stages in the Gartner Hype Cycle:

1. Innovation Trigger: This is the initial stage when a new technology or idea is
introduced. It may have the potential for significant impact, but it's often
experimental and unproven at this point.
2. Peak of Inflated Expectations: In this stage, there is a rapid increase in hype and
expectations surrounding the technology. There's often a lot of enthusiasm, media
attention, and sometimes unrealistic expectations about what the technology can
achieve.
3. Trough of Disillusionment: After reaching its peak, the technology often falls into a
"trough" of disappointment. This stage is characterized by disillusionment as early
adopters and others begin to realize the technology's limitations or challenges. Some
technologies may fail and fade away during this stage.
4. Slope of Enlightenment: As the initial hype wears off, a more realistic and practical
understanding of the technology emerges. This stage may involve ongoing research
and development to address the challenges identified during the Trough of
Disillusionment.
5. Plateau of Productivity: Finally, the technology reaches a point of maturity and
widespread adoption. It becomes a standard tool or practice in its respective domain,
and its benefits are widely recognized and realized.

5. EXPLAIN BIG DATA LIFE CYCLE IN DETAIL


The Big Data Analytics Life cycle is divided into nine phases, named as :
1. Business Case/Problem Definition
2. Data Identification
3. Data Acquisition and filtration
4. Data Extraction
5. Data Munging(Validation and Cleaning)
6. Data Aggregation & Representation(Storage)
7. Exploratory Data Analysis
8. Data Visualization(Preparation for Modeling and Assessment)
9. Utilization of analysis results.
Let us discuss each phase :
Phase I Business Problem Definition –
In this stage, the team learns about the business domain, which presents the motivation and goals for
carrying out the analysis. The problem is identified, and assumptions are made about how much
potential gain the company will realize after carrying out the analysis. Important activities in this
step include framing the business problem as an analytics challenge that can be addressed in
subsequent phases. It helps the decision-makers understand the business resources that will need to be
utilized, thereby determining the underlying budget required to carry out the project.
Moreover, it can be determined whether the problem identified is a Big Data problem or not, based
on the business requirements in the business case. To qualify as a big data problem, the business case
should be directly related to one (or more) of the characteristics of volume, velocity, or variety.

Phase II Data Identification –


Once the business case is identified, now it’s time to find the appropriate datasets to work with. In this
stage, analysis is done to see what other companies have done for a similar case.
Depending on the business case and the scope of analysis of the project being addressed, the sources
of datasets can be either external or internal to the company. In the case of internal datasets, the
datasets can include data collected from internal sources, such as feedback forms or existing
software. On the other hand, for external datasets, the list includes datasets from third-party providers.

Phase III Data Acquisition and filtration –


Once the source of data is identified, it is time to gather the data from those sources. This kind of
data is mostly unstructured. It is then subjected to filtration, such as the removal of corrupt data or
irrelevant data that is outside the scope of the analysis objective. Here, corrupt data means data that may
have missing records or records with incompatible data types.
After filtration, a copy of the filtered data is stored and compressed, as it can be of use in the future
for some other analysis.

Phase IV Data Extraction –


The data is now filtered, but some of its entries may still be incompatible with the analysis. To
rectify this, a separate phase, known as the data extraction phase, is carried out. In this phase, the
data that does not match the underlying scope of the analysis is extracted and transformed into a
form compatible with it.

Phase V Data Munging –


As mentioned in phase III, the data is collected from various sources, which results in the data being
unstructured. The data may also contain values that violate constraints and are therefore unsuitable,
which can lead to false results. Hence there is a need to clean and validate the data.
This includes removing any invalid data and establishing complex validation rules. There are many ways
to validate and clean the data. For example, a dataset might contain a few rows with null entries. If a
similar dataset is present, those entries are copied from that dataset; otherwise, those rows are dropped.

Phase VI Data Aggregation & Representation –


The data is cleansed and validated against certain rules set by the enterprise. But the data might be
spread across multiple datasets, and it is not advisable to work with multiple datasets. Hence, the
datasets are joined together. For example, if there are two datasets, namely a Student
Academic section and a Student Personal Details section, both can be joined together via a common
field, i.e. the roll number.
This phase calls for intensive operations, since the amount of data can be very large. Automation can
be brought in so that these steps are executed without any human intervention.

Phase VII Exploratory Data Analysis –


Here comes the actual step: the analysis task. Depending on the nature of the big data problem,
analysis is carried out. Data analysis can be classified as confirmatory analysis and exploratory
analysis. In confirmatory analysis, the cause of a phenomenon is assumed in advance; this assumption
is called the hypothesis, and the data is analyzed to prove or disprove it.
This kind of analysis provides definitive answers to specific questions and confirms whether an
assumption was true or not. In exploratory analysis, the data is explored to obtain information about why
a phenomenon occurred. This type of analysis answers "why" a phenomenon occurred; it does not
provide definitive answers, but it enables the discovery of patterns.

Phase VIII Data Visualization –


Now we have answers to some questions, using the information from the data in the datasets. But
these answers are still in a form that can't be presented to business users. A form of representation is
required to obtain value, or some conclusion, from the analysis. Hence, various tools are used to
visualize the data in graphic form, which can easily be interpreted by business users.
Visualization is said to influence the interpretation of the results. Moreover, it allows the users to
discover answers to questions that are yet to be formulated.

Phase IX Utilization of analysis results –


The analysis is done and the results are visualized; now it's time for the business users to make decisions
that utilize the results. The results can be used for optimization, to refine the business process. They can
also be used as an input to other systems to enhance performance.

Link: https://www.geeksforgeeks.org/big-data-analytics-life-cycle/

6. WRITE A DETAIL NOTE ON DATA AND ITS TYPES
There are three main types of data:

1. Structured data
2. Semi-structured data
3. Unstructured data

What is structured Data?

Structured data is generally stored in tables in the form of rows and columns. Structured data in these
tables can form relations with other tables. Humans and machines can easily retrieve information
from structured data. This data is meaningful and is used to develop data models.

Structured data refers to highly organized and formatted information that fits neatly into predefined
tables or relational databases. This type of data has a clear schema with fixed fields and data types.

Following are the advantages of maintaining structured data:

 It is easy to search for data


 Less storage space is required
 More data analytics tools can be used
 Data is highly secured

And, listed below are the disadvantages of keeping the data in a structured manner:

 Data is not flexible


 Its storage options are limited

Example

Customer ID   Name    Age   City
100           John    25    Chicago
101           Alexa   32    New York
102           Sam     40    Los Angeles
What is Unstructured Data?

Unprocessed and unorganized data is known as unstructured data. This type of data has no predefined
format or model and is not directly used to develop data models. Unstructured data may be text, images,
audio, videos, reviews, satellite images, etc. Almost 80% of the data in this world is in the form of
unstructured data.

Following are the advantages of using unstructured data:

 Data is flexible.
 It is very scalable
 This data can be used for a wide range of purposes as it is in its original form.

The disadvantages of using unstructured data are as follows:

 It requires more storage space.


 There is no security for data.
 Searching for data is a difficult process.
 There are limited tools available to analyse this data.

What is Semi-Structured Data?

Semi-structured data is organized only to some extent; the rest is unstructured. Hence, the level of
organization is less than that of structured data but higher than that of unstructured data.

Link: https://www.tutorialspoint.com/difference-between-structured-semi-structured-and-
unstructured-data

7. DIFFERENTIATE STRUCTURED, UNSTRUCTURED AND SEMI-STRUCTURED DATA
Structured Data vs Unstructured Data vs Semi-structured Data:

o Organization: Structured data is well organised; unstructured data is not organised at all;
semi-structured data is partially organised.
o Flexibility and scalability: Structured data is less flexible and difficult to scale (schema dependent);
unstructured data is flexible and scalable (schema independent); semi-structured data is more flexible
and simpler to scale than structured data, but less so than unstructured data.
o Basis: Structured data is based on the relational database model; unstructured data is based on
character and binary data; semi-structured data is based on XML/RDF.
o Versioning: Structured data supports versioning over tuples, rows and tables; unstructured data is
versioned as a whole; semi-structured data supports versioning over tuples.
o Analysis: Structured data is easy to analyse; unstructured data is difficult to analyse; semi-structured
data is more difficult to analyse than structured data but easier than unstructured data.
o Examples: Financial data and bar codes are examples of structured data; media logs, videos and
audio are examples of unstructured data; tweets organised by hashtags and folders organised by topic
are examples of semi-structured data.

8. NOTE ON NOSQL DATABASE

NoSQL is a type of database management system (DBMS) that is designed to handle and store large
volumes of unstructured and semi-structured data (for example, Google or Facebook, which collect
terabytes of data every day about their users).

NoSQL databases (aka "not only SQL") are non-tabular databases and store data differently than
relational tables. NoSQL databases come in a variety of types based on their data model. The main
types are document, key-value, wide-column, and graph. They provide flexible schemas and scale
easily with large amounts of data and high user loads.

NoSQL databases are generally classified into four main categories:

1. Document databases: These databases store data as semi-structured documents, such as JSON
or XML, and can be queried using document-oriented query languages.

• A collection of documents
• Data in this model is stored inside documents.
• A document is a key value collection where the key allows access to its value.
• Documents are not typically forced to have a schema and therefore are flexible and easy to
change.
• Documents are stored into collections in order to group different kinds of data.
• Documents can contain many different key-value pairs, or key-array pairs, or even nested
documents.

2. Key-value stores: These databases store data as key-value pairs, and are optimized for simple
and fast read/write operations (see the short Redis sketch after this list).

• Key-value stores are the most basic type of NoSQL database.
• Designed to handle huge amounts of data.
• Key-value stores allow developers to store schema-less data.
• In key-value storage, the database stores data as a hash table where each key is unique and
the value can be a string, JSON, etc.
• Keys may be strings, hashes, lists, sets or sorted sets, and values are stored against these keys.
• Key-value stores can be used as collections, dictionaries, associative arrays, etc.
• Key-value stores follow the 'Availability' and 'Partition' aspects of the CAP theorem.
• Example of a key-value store database: Redis.

3. Column-family stores: These databases store data as column families, which are sets of
columns that are treated as a single entity . They are optimized for fast and efficient querying of
large amounts of data.

• Column-oriented databases primarily work on columns, and every column is treated individually.
• Values of a single column are stored contiguously.
• Column stores keep data in column-specific files.
• In column stores, query processors work on columns too.
• All data within each column data file has the same type, which makes it ideal for compression.
• Column stores can improve the performance of queries, as they can access specific column data.
• High performance on aggregation queries (e.g. COUNT, SUM, AVG, MIN, MAX).
• Used in data warehouses and business intelligence, customer relationship management (CRM),
library card catalogs, etc.
• Examples of column-oriented databases: BigTable, Cassandra, SimpleDB, etc.

4. Graph databases: These databases store data as nodes and edges, and are designed to handle
complex relationships between data.

• A graph database stores data in a graph.


• It is capable of elegantly representing any kind of data in a highly accessible way.
• A graph database is a collection of nodes and edges
• Each node represents an entity (such as a student or business) and each edge represents a
connection or relationship between two nodes.
• Every node and edge is defined by a unique identifier.
• Each node knows its adjacent nodes.
• As the number of nodes increases, the cost of a local step (or hop) remains the same.
• Indexes are used for lookups.
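As referenced in the key-value category above, here is a minimal sketch of key-value access using Redis through the Jedis Java client; the host, port and key names are assumptions for illustration, not part of any particular deployment.

import java.util.Map;
import redis.clients.jedis.Jedis;

public class KeyValueExample {
    public static void main(String[] args) {
        // Assumes a Redis server listening on localhost:6379
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Plain string value stored against a unique key
            jedis.set("customer:100:name", "John");
            System.out.println(jedis.get("customer:100:name"));

            // A hash value: one key holding several schema-less field/value pairs
            jedis.hset("customer:100", "city", "Chicago");
            jedis.hset("customer:100", "age", "25");
            Map<String, String> customer = jedis.hgetAll("customer:100");
            System.out.println(customer);
        }
    }
}

The same idea applies to other key-value stores: the application decides how to structure the keys, and the database simply stores and retrieves opaque values by key.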

NoSQL Advantages

High scalability

Distributed Computing

Lower cost

Schema flexibility, semi-structure data

No complicated Relationships

NoSQL Disadvantages

No standardization

Limited query capabilities (so far)


Link: https://redswitches.com/blog/non-relational-databases/#What-Are-Non-Relational-Databases

9. NOSQL VS SQL DATABASE

SQL vs NoSQL:

o SQL databases are relational database management systems (RDBMS); NoSQL databases are
non-relational or distributed database systems.
o SQL databases have a fixed, static or predefined schema; NoSQL databases have a dynamic schema.
o SQL databases are not suited for hierarchical data storage; NoSQL databases are best suited for
hierarchical data storage.
o SQL databases are best suited for complex queries; NoSQL databases are not so good for complex
queries.
o SQL databases are vertically scalable; NoSQL databases are horizontally scalable.
o SQL follows the ACID properties; NoSQL follows the CAP theorem (consistency, availability,
partition tolerance).
o Examples of SQL databases: MySQL, PostgreSQL, Oracle, MS SQL Server, etc. Examples of
NoSQL databases: MongoDB, HBase, Neo4j, Cassandra, etc.

10. RDBMS VS NOSQL


Relational Database vs NoSQL:

o A relational database is used to handle data coming in at low velocity; NoSQL is used to handle
data coming in at high velocity.
o A relational database gives only read scalability; NoSQL gives both read and write scalability.
o A relational database manages structured data; NoSQL manages all types of data.
o In a relational database, data arrives from one or a few locations; in NoSQL, data arrives from
many locations.
o A relational database supports complex transactions; NoSQL supports simple transactions.
o A relational database has a single point of failure; NoSQL has no single point of failure.
o A relational database handles data in lower volume; NoSQL handles data in high volume.
o In a relational database, transactions are written in one location; in NoSQL, transactions are
written in many locations.
o A relational database supports ACID properties compliance; NoSQL doesn't support ACID properties.
o In a relational database it is difficult to make changes once the database is defined; NoSQL enables
easy and frequent changes to the database.
o In a relational database a schema is mandatory to store the data; in NoSQL, schema design is not required.
o A relational database is deployed in a vertical fashion; NoSQL is deployed in a horizontal fashion.

Basis of comparison: NoSQL vs RDBMS

o Definition: NoSQL databases are non-relational databases, often known as distributed databases.
RDBMS, which stands for Relational Database Management Systems, is the most common name for
SQL databases.
o Query language: NoSQL has no declarative query language; in RDBMS, SQL stands for Structured
Query Language.
o Scalability: NoSQL databases are horizontally scalable; RDBMS databases are vertically scalable.
o Design: NoSQL combines multiple database technologies; these databases were created in response
to the application's requirements. Traditional RDBMS systems use SQL syntax and queries to get
insights from data; different OLAP systems use them.
o Speed: NoSQL databases use denormalization to optimise themselves; one record stores all the query
data, which simplifies finding matched records and speeds up queries. Relational database models
contain data in different tables; when running a query, you must integrate the information and set
table-spanning restrictions, and because of so many tables the database's query time is slow.

11. DOCUMENT ORIENTED VS GRAPH ORIENTED
FROM NOTEBOOK

12. ADVANTAGES AND DISADVANTAGES OF NOSQL
Advantages: Flexible Data Structures, Scalability, High Performance, Availability, Cost-Effective.

Disadvantages: Limited Query Capability, Data Consistency, Lack of Standardization, Limited Tooling,
Limited ACID Compliance.

Advantages of NoSQL

1. Flexible Data Structures – NoSQL databases allow for more flexible data structures than
traditional relational databases. This means that data can be stored in a variety of formats,
which is particularly useful when dealing with unstructured or semi-structured data, such as
social media posts or log files.
2. Scalability – NoSQL databases are highly scalable, which means they can easily handle large
amounts of data and traffic. This is achieved through a distributed architecture that allows
data to be spread across multiple servers, making it easy to add more servers as needed.
3. High Performance – NoSQL databases are designed for high performance, meaning that
they can process large amounts of data quickly and efficiently. This is especially important
for applications that require real-time data processing, such as financial trading platforms or
social media analytics tools.
4. Availability – NoSQL databases are highly available, which means that they are designed to
be up and running at all times. This is achieved through a distributed architecture that allows
data to be replicated across multiple servers, ensuring that the system is always available,
even if one or more servers fail.
5. Cost-Effective – NoSQL databases can be cost-effective, especially for large-scale
applications. This is because they are designed to be run on commodity hardware, rather than
expensive proprietary hardware, which can save companies a significant amount of money.

Disadvantages of NoSQL

1. Limited Query Capability – NoSQL databases offer limited query capability when
compared to traditional relational databases. This is because NoSQL databases are designed
to handle unstructured data, which can make it difficult to perform complex queries or
retrieve data in a structured manner.
2. Data Consistency – NoSQL databases often sacrifice data consistency for scalability and
performance. This means that there may be some lag time between when data is written to the
database and when it is available for retrieval. Additionally, because NoSQL databases often
use distributed architectures, there may be instances where different nodes of the database
contain different versions of the same data.
3. Lack of Standardization – NoSQL databases lack standardization, meaning that different
NoSQL databases can have vastly different structures and query languages. This can make it
difficult for developers to learn and work with different NoSQL databases.
4. Limited Tooling – Because NoSQL databases are a relatively new technology, there is
limited tooling available for them when compared to traditional relational databases. This can
make it more difficult for developers to work with NoSQL databases and to debug issues
when they arise.
5. Limited ACID Compliance – NoSQL databases often sacrifice ACID compliance for
scalability and performance. ACID compliance refers to a set of properties that guarantee that
database transactions are processed reliably. Because NoSQL databases often use distributed
architectures and eventual consistency models, they may not always be fully ACID
compliant.

13. WRITE DETAIL NOTE ON CAP THEOREM (IMP)

The CAP theorem states that a distributed system can deliver only two of three desired
characteristics: consistency, availability, and partition tolerance (the 'C', 'A' and 'P' in CAP).
Consistency

Consistency means that all clients see the same data at the same time, no matter which node they
connect to. For this to happen, whenever data is written to one node, it must be instantly forwarded or
replicated to all the other nodes in the system before the write is deemed ‘successful.’

Availability

Availability means that any client making a request for data gets a response, even if one or more
nodes are down. Another way to state this—all working nodes in the distributed system return a valid
response for any request, without exception.

Partition tolerance
A partition is a communication break within a distributed system—a lost or temporarily delayed
connection between two nodes. Partition tolerance means that the cluster must continue to work
despite any number of communication breakdowns between nodes in the system.

Link: https://towardsdatascience.com/cap-theorem-and-distributed-database-management-systems-5c2be977950e#:~:text=CAP%20Theorem%20is%20a%20concept,on%20our%20unique%20use%20case

UNIT 2 and 4:
1. WHAT IS HADOOP
Hadoop is an open source framework based on Java that manages the storage and processing of large
amounts of data for applications.
Hadoop is an open-source framework from Apache and is used to store, process and analyze data
that is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical
processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google,
Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was
developed on the basis of it. It states that files will be broken into blocks and stored across
nodes in the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the
cluster.
3. MapReduce: This is a framework which helps Java programs do parallel computation on
data using key-value pairs. The Map task takes input data and converts it into a data set
which can be computed as key-value pairs. The output of the Map task is consumed by the
Reduce task, and the output of the reducer gives the desired result (a minimal WordCount
sketch follows this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
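As referenced in the MapReduce module above, here is a minimal WordCount sketch using the Hadoop MapReduce Java API; the class name and the input/output paths passed on the command line are assumptions for illustration.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, this would typically be submitted with something like "hadoop jar wordcount.jar WordCount /input /output"; YARN then schedules the map and reduce tasks across the cluster.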

Advantages of Hadoop

o Fast: In HDFS the data is distributed over the cluster and is mapped, which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing the
processing time. It is able to process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really
cost-effective compared to a traditional relational database management system.
o Resilient to failure: HDFS has the property of replicating data over the network, so if one node
is down or some other network failure happens, Hadoop takes the other copy of the data and
uses it. Normally, data is replicated three times, but the replication factor is configurable.
o Ability to store a large amount of data.
o High flexibility.
o Cost effective.
o High computational power.
o Tasks are independent.
o Linear scaling.

Disadvantages:

 Not very effective for small data.


 Hard cluster management.
 Has stability issues.
 Security concerns.
 Complexity: Hadoop can be complex to set up and maintain, especially for organizations
without a dedicated team of experts.
 Latency: Hadoop is not well-suited for low-latency workloads and may not be the best choice
for real-time data processing.
 Limited Support for Real-time Processing: Hadoop’s batch-oriented nature makes it less suited
for real-time streaming or interactive data processing use cases.
 Limited Support for Structured Data: Hadoop is designed to work with unstructured and semi-
structured data, it is not well-suited for structured data processing
 Data Security: Hadoop does not provide built-in security features such as data encryption or
user authentication, which can make it difficult to secure sensitive data.
 Limited Support for Ad-hoc Queries: Hadoop’s MapReduce programming model is not well-
suited for ad-hoc queries, making it difficult to perform exploratory data analysis.
 Limited Support for Graph and Machine Learning: Hadoop's core components, HDFS and
MapReduce, are not well-suited for graph and machine learning workloads; specialized
components like Apache Giraph and Mahout are available but have some limitations.
 Cost: Hadoop can be expensive to set up and maintain, especially for organizations with large
amounts of data.
 Data Loss: In the event of a hardware failure, the data stored in a single node may be lost
permanently.
 Data Governance: Data governance is a critical aspect of data management; Hadoop does not
provide built-in features to manage data lineage, data quality, data cataloging, and data audit.

2. NOTE ON HADOOP ECOSYSTEM


Hadoop Ecosystem is neither a programming language nor a service, it is a platform or framework
which solves big data problems.

Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System


 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

Link: https://geeksforgeeks.org/hadoop-ecosystem/

3. HDFS VS HDFS2 (Hadoop Distributed File System)


Hadoop 1 vs Hadoop 2:

1. New Components and API: Hadoop 1, having been introduced prior to Hadoop 2, has fewer
components and APIs. Hadoop 2 introduced more components and APIs, such as the YARN API,
the YARN framework, and an enhanced Resource Manager.
2. Support: Hadoop 1 only supports the MapReduce processing model in its architecture and does
not support non-MapReduce tools. Hadoop 2 allows working in the MapReduce model as well as
other distributed computing models like Spark, Hama, Giraph, MPI (Message Passing Interface)
and HBase coprocessors.
3. Resource Management: In Hadoop 1, the MapReduce layer is responsible for both processing and
cluster resource management. In Hadoop 2, cluster resource management is done by YARN, while
processing management is done by the different processing models.
4. Scalability: Hadoop 1 is comparatively less scalable than Hadoop 2; in terms of scaling it is
limited to about 4,000 nodes per cluster. Hadoop 2 has better scalability and is scalable up to
about 10,000 nodes per cluster.
5. Implementation: Hadoop 1 follows the concept of slots, which can be used to run a Map task or
a Reduce task only. Hadoop 2 follows the concept of containers, which can be used to run
generic tasks.
6. Windows Support: Initially, in Hadoop 1, there was no support for Microsoft Windows from
Apache. With the advancement of the version, Apache provided support for Microsoft Windows
in Hadoop 2.

4. EXPLAIN COMPONENT LEVEL ARCHITECTURE

Component-level architecture, often referred to as component-based architecture, is an architectural
approach used in software development and system design. It involves breaking down a system into
discrete, reusable, and interchangeable components. These components encapsulate specific
functionalities or features, making the system more modular, maintainable, and scalable. Here's an
explanation of component-level architecture:

Key Concepts of Component-Level Architecture:

1. Components: Components are self-contained, independent units of software that perform specific
functions or provide certain features. Each component should have a well-defined interface, clearly
specifying how it can be used by other components or parts of the system.
2. Reusability: One of the primary goals of component-based architecture is reusability. Components
should be designed to be reusable in different parts of the system or in other projects. This reduces
redundancy and accelerates development.
3. Interchangeability: Components should be designed to be interchangeable. This means that you can
replace one component with another that provides the same interface and functionality without
affecting the overall system. Interchangeability promotes flexibility and scalability.
4. Encapsulation: Components encapsulate their internal details, making their inner workings hidden
from the rest of the system. This concept is based on the principle of information hiding, which
enhances security and simplifies maintenance.
5. Independence: Components should be as independent as possible, meaning that they shouldn't rely
heavily on other components. This reduces coupling between components, making the system more
robust and easier to maintain.
6. Communication: Components communicate with each other through well-defined interfaces.
Interactions between components are typically achieved using protocols, such as API calls, message
passing, or remote procedure calls.
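As a small, hypothetical illustration of these concepts (the names Storage, LocalStorage and ReportService are invented for this sketch), the Java snippet below shows a component defined by an interface and swapped behind it without changing its consumer:

import java.util.HashMap;
import java.util.Map;

interface Storage {                        // the component's well-defined interface
    void save(String key, byte[] data);
    byte[] load(String key);
}

class LocalStorage implements Storage {    // one interchangeable implementation
    private final Map<String, byte[]> files = new HashMap<>();
    public void save(String key, byte[] data) { files.put(key, data); }
    public byte[] load(String key) { return files.get(key); }
}

class ReportService {                      // depends only on the interface, not on any implementation
    private final Storage storage;
    ReportService(Storage storage) { this.storage = storage; }
    void archive(String name, String content) { storage.save(name, content.getBytes()); }
}

public class ComponentDemo {
    public static void main(String[] args) {
        // Any other Storage implementation could be passed in here without touching ReportService
        ReportService service = new ReportService(new LocalStorage());
        service.archive("q1-report", "revenue up");
        System.out.println("archived q1-report");
    }
}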

Advantages of Component-Level Architecture:

1. Modularity: The system is divided into smaller, manageable pieces, making it easier to develop, test,
and maintain.
2. Reusability: Components can be reused in different projects, saving time and effort in software
development.
3. Scalability: Components can be added or replaced as the system evolves, allowing for easy expansion
and adaptation to changing requirements.
4. Parallel Development: Different teams or developers can work on individual components
simultaneously, speeding up the development process.
5. Easier Testing: Components can be tested in isolation, simplifying the testing process and ensuring
the quality of each component.
6. Flexibility: The interchangeability of components allows for system flexibility and the incorporation of
third-party components.
7. Improved Maintenance: With well-defined interfaces and encapsulation, it's easier to troubleshoot
and update individual components without affecting the entire system.

Examples of Component-Level Architectures:

 Object-Oriented Programming (OOP): In OOP, classes and objects are used as components,
encapsulating both data and behavior. Classes can be reused and extended in various contexts.
 Web Development: In web development, components are commonly used for user interface
elements. For example, React and Angular are JavaScript libraries/frameworks that encourage
component-based development for building web applications.
 Service-Oriented Architecture (SOA): SOA is an architectural style that promotes the creation of
services as components. These services can be distributed and used in different applications.

Component-level architecture is particularly useful in large, complex software systems where
modularity, reusability, and maintainability are essential. It enables efficient development and
management of software by breaking it down into manageable and interchangeable building blocks.

5. NOTE ON YARN
YARN stands for “Yet Another Resource Negotiator“.
Consider YARN as the brain of your Hadoop Ecosystem. It performs all your processing activities by
allocating resources and scheduling tasks.
It has two major components, i.e. ResourceManager and NodeManager.
YARN Features: YARN gained popularity because of the following features-

 Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to
extend and manage thousands of nodes and clusters.
 Compatibility: YARN supports existing MapReduce applications without disruption, thus making
it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables
optimized cluster utilization.
 Multi-tenancy: It allows multiple engine access thus giving organizations a benefit of multi-
tenancy.

Components Of YARN

o Client: For submitting MapReduce jobs.


o Resource Manager: To manage the use of resources across the cluster
o Node Manager: For launching and monitoring the compute containers on machines in the
cluster.
o Map Reduce Application Master: Checks tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by the
resource manager, and managed by the node managers.

6. NOTE ON DIFFERENT SERVICES OF HADOOP

 HDFS: Hadoop Distributed File System


 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

Hadoop is an open-source framework that provides a distributed storage and processing environment
for handling big data. It comprises several core services and components that work together to enable
the storage and processing of large datasets. Here are the key services in Hadoop:

1. Hadoop Distributed File System (HDFS):


 HDFS is the primary storage system in Hadoop. It's designed to store and
manage large volumes of data across multiple machines.
 Data is distributed across a cluster of commodity hardware, with each file
divided into blocks and replicated for fault tolerance.
 HDFS is highly scalable and is the foundation for storing big data in Hadoop (a short HDFS
read/write sketch in Java follows this list).
2. MapReduce:
 MapReduce is a programming model and processing engine for distributed
data processing.
 It consists of two main phases: the Map phase, which processes and filters
data, and the Reduce phase, which aggregates and summarizes results.
 MapReduce is suitable for batch processing and is the original processing
framework for Hadoop.
3. YARN (Yet Another Resource Negotiator):
 YARN is a cluster resource management and job scheduling component in
Hadoop.
 It separates the resource management and job scheduling aspects of Hadoop,
allowing for more flexible and efficient resource allocation.
 YARN enables different processing frameworks like MapReduce, Spark, and
Tez to run on the same Hadoop cluster.
4. Hadoop Common:
 Hadoop Common includes the libraries and utilities used by other Hadoop
modules.
 It provides a consistent set of tools and libraries that are used by different
Hadoop components.
5. Hadoop MapReduce:
 Hadoop MapReduce is the actual implementation of the MapReduce
programming model, and it's responsible for executing MapReduce jobs on a
Hadoop cluster.
6. Hive:
 Hive is a data warehousing and SQL-like query language system for Hadoop.
 It provides a high-level abstraction over Hadoop, allowing users to write SQL-
like queries to analyze and process data stored in HDFS.
7. Pig:
 Pig is a platform for analyzing large datasets using a high-level scripting
language called Pig Latin.
 It provides a way to express data transformations and analysis in a more
concise and readable manner compared to low-level MapReduce.
8. HBase:
 HBase is a NoSQL, distributed, and scalable database that runs on top of
HDFS.
 It is designed for fast and random read/write access to large datasets and is
suitable for real-time data processing and low-latency applications.
9. Spark:
 While not a part of the Hadoop ecosystem, Spark is often used alongside
Hadoop. It's a fast, in-memory data processing framework that supports a
variety of data processing workloads.
 Spark can be used with HDFS and YARN, and it provides APIs in multiple
programming languages for data analysis, machine learning, and more.
10. Zookeeper:
 Zookeeper is a distributed coordination service for managing and maintaining
configuration information, naming, providing distributed synchronization, and group
services.
11. Oozie:
 Oozie is a workflow scheduler system used to manage and schedule Hadoop jobs.
 It allows you to define and coordinate workflows with multiple Hadoop jobs, making
it easier to manage complex data processing pipelines.
12. Ambari:
 Ambari is a web-based tool for managing, monitoring, and provisioning Hadoop
clusters. It simplifies the administration and monitoring of Hadoop clusters.
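Relating back to the HDFS service in item 1 above, here is a minimal read/write sketch using the Hadoop FileSystem Java API; the NameNode address and file path are assumptions for illustration.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");   // NameNode address (assumed)
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/notes.txt");

        // Write: the client streams bytes; HDFS splits them into blocks and replicates them
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back and copy it to stdout
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}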

7. EXPLAIN HIVE IN DETAIL


Hive is a data warehouse system which is used to analyze structured data. It is built on top of
Hadoop. It was developed by Facebook.

Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which are
internally converted into MapReduce jobs.
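As a small illustration of how HQL reaches Hive from a client, here is a minimal JDBC sketch in Java; the HiveServer2 address, the user name and the sales table are assumptions for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");            // Hive JDBC driver
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");  // assumed HiveServer2 endpoint
        Statement stmt = con.createStatement();

        // HQL statements; Hive translates them into MapReduce (or Tez/Spark) jobs internally
        stmt.execute("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

        ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM sales");
        while (rs.next()) {
            System.out.println("row count: " + rs.getLong(1));
        }
        con.close();
    }
}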

Architecture of Hive

The component diagram of the Hive architecture is not reproduced in these notes.


8. EXPLAIN HBASE IN DETAIL
HBase is an open-source, sorted map data store built on top of Hadoop. It is column-oriented and
horizontally scalable.

It is based on Google's Bigtable. It has a set of tables which keep data in key-value format. HBase is
well suited for sparse data sets, which are very common in big data use cases. HBase provides APIs
enabling development in practically any programming language. It is a part of the Hadoop ecosystem
that provides random real-time read/write access to data in the Hadoop File System.

Why HBase

o An RDBMS gets exponentially slower as the data becomes large.
o It expects data to be highly structured, i.e. able to fit into a well-defined schema.
o Any change in schema might require downtime.

Features of Hbase

o Horizontally scalable: You can add any number of columns anytime.


o Automatic Failover: Automatic failover is a resource that allows a system administrator to
automatically switch data handling to a standby system in the event of system compromise
o Integration with the MapReduce framework: All the commands and Java code internally use
MapReduce to do the task, and it is built over the Hadoop Distributed File System.
o It is a sparse, distributed, persistent, multidimensional sorted map, which is indexed by row key,
column key and timestamp.
o It is often referred to as a key-value store or column-family-oriented database, or as storing
versioned maps of maps.
o Fundamentally, it is a platform for storing and retrieving data with random access.
o It doesn't care about data types (you can store an integer in one row and a string in another for
the same column).
o It doesn't enforce relationships within your data.
o It is designed to run on a cluster of computers, built using commodity hardware.

Features of HBase

 HBase is linearly scalable.


 It has automatic failure support.
 It provides consistent read and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has easy java API for client.
 It provides data replication across clusters.

Where to Use HBase

 Apache HBase is used to have random, real-time read/write access to Big Data.
 It hosts very large tables on top of clusters of commodity hardware.
 Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts upon
the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.
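A minimal sketch of this random read/write access using the HBase Java client API is shown below; the table name, column family and row key are assumptions for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("student"))) {

            // Write: row key -> column family:qualifier -> value (with an implicit timestamp)
            Put put = new Put(Bytes.toBytes("roll-101"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alexa"));
            table.put(put);

            // Random read of the same row by its key
            Result result = table.get(new Get(Bytes.toBytes("roll-101")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}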

Year Event

Nov 2006 Google released the paper on BigTable.

Feb 2007 Initial HBase prototype was created as a Hadoop contribution.

Oct 2007 The first usable HBase along with Hadoop 0.15.0 was released.

Jan 2008 HBase became the sub project of Hadoop.

Oct 2008 HBase 0.18.1 was released.

Jan 2009 HBase 0.19.0 was released.

Sept 2009 HBase 0.20.0 was released.

May 2010 HBase became Apache top-level project.

Link: https://www.tutorialspoint.com/hbase/hbase_overview.htm

9. NOTE ON KAFKA
What is Kafka?

Apache Kafka is a distributed publish-subscribe messaging system and a robust queue that can handle
a high volume of data and enables you to pass messages from one end-point to another. Kafka is
suitable for both offline and online message consumption. Kafka messages are persisted on the disk
and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper
synchronization service. It integrates very well with Apache Storm and Spark for real-time streaming
data analysis.

Apache Kafka is an open-source platform for real-time data handling

Kafka is often used to build real-time data streams and applications. Combining
communications, storage, and stream processing enables the collection and analysis of
real-time and historical data

Benefits

Following are a few benefits of Kafka −

 Reliability − Kafka is distributed, partitioned, replicated and fault tolerant.
 Scalability − The Kafka messaging system scales easily without downtime.
 Durability − Kafka uses a distributed commit log, which means messages persist on disk as
fast as possible; hence it is durable.
 Performance − Kafka has high throughput for both publishing and subscribing messages. It
maintains stable performance even when many TB of messages are stored.

Kafka is very fast and guarantees zero downtime and zero data loss.

Use Cases

Kafka can be used in many Use Cases. Some of them are listed below −

 Metrics − Kafka is often used for operational monitoring data. This involves aggregating
statistics from distributed applications to produce centralized feeds of operational data.
 Log Aggregation Solution − Kafka can be used across an organization to collect logs from
multiple services and make them available in a standard format to multiple consumers.
 Stream Processing − Popular frameworks such as Storm and Spark Streaming read data from
a topic, processes it, and write processed data to a new topic where it becomes available for
users and applications. Kafka’s strong durability is also very useful in the context of stream
processing.

Kafka Architecture:

 Producers: Producers are responsible for publishing data to Kafka topics. They can send data in the
form of records, and each record consists of a key, a value, and a topic. Producers distribute data to
Kafka brokers, typically in a round-robin fashion or based on custom partitioning logic.
 Brokers: Kafka brokers are the servers that store and manage data. They are responsible for receiving
data from producers, storing it, and serving it to consumers. Brokers are distributed and can
communicate with each other for data replication.
 Topics: Topics are logical channels for organizing and categorizing data in Kafka. Producers write data
to topics, and consumers subscribe to topics to read data. Each topic can have multiple partitions for
parallelism.
 Partitions: Topics can be split into multiple partitions, allowing data to be distributed and processed
in parallel. Partitions are the unit of parallelism in Kafka and provide fault tolerance.
 Consumers: Consumers subscribe to topics and read data from Kafka. They can process data in real-
time or batch mode. Kafka ensures that each message is consumed only once, and the offset is used
to keep track of the last consumed message.
 Zookeeper: Kafka uses Apache ZooKeeper for distributed coordination and management of Kafka
brokers. ZooKeeper helps in leader election, broker discovery, and synchronization.
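Tying these pieces together, here is a minimal producer sketch using the Kafka Java client; the broker address, topic name, key and value are assumptions for illustration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // broker(s) to bootstrap from (assumed)
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key land in the same partition and stay ordered within it
            producer.send(new ProducerRecord<>("page-views", "user-42", "clicked /home"));
        }
    }
}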

Use Cases for Kafka:

1. Log Aggregation: Kafka is used to collect and aggregate logs from various sources, making it easier
to monitor and analyze application performance.
2. Real-Time Analytics: Kafka allows organizations to process and analyze data in real-time, making it
suitable for applications like fraud detection, recommendation engines, and monitoring systems.
3. Data Integration: Kafka Connectors enable data integration between various systems, such as
databases, messaging systems, and data lakes.
4. IoT and Sensor Data: Kafka can handle the high volume and velocity of data generated by IoT
devices and sensors, making it an ideal choice for IoT applications.
5. Event Sourcing: Kafka is used in event-driven architectures and event sourcing patterns, where events
are stored and processed for maintaining system state.
6. Clickstream Analysis: Kafka is employed for real-time processing of user activity data, making it
valuable for applications like e-commerce analytics.
7. Replication and Data Backup: Kafka's durability and fault tolerance features make it suitable for
replicating data across data centers for backup and disaster recovery.
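Complementing the producer sketch shown after the architecture section, a minimal consumer sketch follows; the group id and topic are assumptions for illustration.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "analytics-group");            // consumers in one group share partitions
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // poll() fetches new records and tracks the consumer's offsets
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}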

10. NOTE ON APACHE BASE

11. EXPLAIN MESSAGING SYSTEM IN DETAIL

Message passing in distributed systems refers to the communication mechanism used by nodes
(computers or processes) to communicate information and coordinate their actions. It involves
transferring messages between nodes to achieve various goals such as coordination,
synchronization, and data sharing.

Message passing is a flexible and scalable method for inter-node communication in distributed
systems. It enables nodes to exchange information, coordinate activities, and share data without
relying on shared memory or direct method invocations. Models like synchronous and asynchronous
message passing offer different synchronization and communication semantics to suit system
requirements. Synchronous message passing ensures sender and receiver synchronization, while
asynchronous message passing allows concurrent execution and non-blocking communication.
Types of Message Passing

1. Synchronous message passing


2. Asynchronous message passing
3. Hybrids

1. Synchronous Message Passing

Synchronous message passing allows processes or threads to exchange messages in real time. To
ensure coordination and predictable execution, the sender waits until the receiver has received and
processed the message before continuing. The most common way to implement this strategy is
through blocking method calls or procedure invocations, where a process or thread blocks until the
called operation returns a result or completes. This blocking behaviour forces the caller to wait until
the message has been processed. However, there are drawbacks to synchronous message passing,
such as system halts or delays if the receiver takes too long to process the message or gets stuck. It
is critical to implement synchronous message passing precisely in concurrent systems to guarantee
its proper operation.

2. Asynchronous Message Passing

Asynchronous message passing is a type of communication used in concurrent and distributed
systems that lets processes or other components exchange messages without requiring strict time
synchronisation. It entails sending a message to a receiving process or component and carrying on
with execution right away. Asynchronous message passing's key features include the ability for the
sender and receiver to work independently without waiting for a response, and its asynchronous
nature. Communication takes place through the exchange of messages, which may be one-way or
include a reply address for the recipient to respond to. The sender and receiver of an asynchronous
message can run in different processes, threads, or even on different machines, allowing for loose
coupling between them.

3. Hybrids
Hybrid message passing combines elements of both synchronous and asynchronous message passing.
It gives the sender the flexibility to choose whether to block and wait for a response or continue
execution asynchronously. The choice between synchronous or asynchronous behaviour can be made
based on the specific requirements of the system or the nature of the communication. Hybrid message
passing allows for optimization and customization in different scenarios, enabling a balance between
the synchronous and asynchronous paradigms.

12. NOTE ON ZOOKEEPER


ZooKeeper is a distributed coordination service to manage large sets of hosts. Coordinating and
managing a service in a distributed environment is a complicated process. ZooKeeper solves this issue
with its simple architecture and API, allowing developers to focus on core application logic.
The ZooKeeper framework was originally built at Yahoo! for accessing their applications in an easy
and robust manner.
ZooKeeper helps you to maintain configuration information, naming, and group services for distributed
applications. It implements different protocols on the cluster so that applications do not need to
implement them on their own. It provides a single coherent view of multiple machines.

Apache ZooKeeper is a distributed, open-source coordination service for distributed systems. It
provides a central place for distributed applications to store data, communicate with one another,
and coordinate activities. ZooKeeper is used in distributed systems to coordinate distributed
processes and services.

Benefits of Zookeeper

Here are the benefits of using Zookeeper −

 Simple distributed coordination process


 Synchronization − Mutual exclusion and co-operation between server processes. This
process helps in Apache HBase for configuration management.
 Ordered Messages
 Serialization − Encode the data according to specific rules and ensure your application runs
consistently. This approach can be used in MapReduce to coordinate queues and execute
running threads.
 Reliability
 Atomicity − Data transfer either succeed or fail completely, but no transaction is partial.

Architecture of Zookeeper
Server: The server sends an acknowledgement when a client connects. If there is no response from
the connected server, the client automatically redirects its messages to another server.

Client: A client is one of the nodes in the distributed application cluster. It accesses information
from the server. Every client sends a heartbeat message to the server at regular intervals so that the
server knows the client is alive.

Leader: One of the servers is designated the Leader. It gives all the information to the clients as well as
an acknowledgment that the server is alive. It performs automatic recovery if any of the
connected nodes fail.
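
To show what "configuration information" in ZooKeeper looks like from an application's point of view,
here is a minimal sketch in Scala using the standard ZooKeeper Java client. The server address
localhost:2181, the znode path /app-config, and the stored value are all assumptions for the example.

CODE: import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

object ZkConfigSketch extends App {
  // Connect to a ZooKeeper ensemble (here a single local server) with a
  // 3-second session timeout; the watcher just logs connection events.
  val zk = new ZooKeeper("localhost:2181", 3000, new Watcher {
    override def process(event: WatchedEvent): Unit = println(s"zk event: $event")
  })

  // Store a piece of configuration as a persistent znode (one-shot sketch:
  // it would fail if the node already exists).
  zk.create("/app-config", "feature.x=on".getBytes("UTF-8"),
    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)

  // Read it back; every process in the cluster sees the same coherent value.
  val value = new String(zk.getData("/app-config", false, null), "UTF-8")
  println(s"configuration read from ZooKeeper: $value")

  zk.close()
}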

13. NOTE ON ARCHITECTURE OF APACHE KAFKA
Kafka's architecture is adapted to the applications that use it, but the following essential parts are
required to design an Apache Kafka architecture.

o Data Ecosystem: The several applications that use Apache Kafka form an ecosystem. This
ecosystem is built for data processing. It takes inputs in the form of applications that create
data, and its outputs are defined in the form of metrics, reports, etc.
o Kafka Cluster: A Kafka cluster is a system that comprises different brokers, topics, and
their respective partitions. Data is written to a topic within the cluster and read by the
cluster itself.
o Producers: A producer sends or writes data/messages to a topic within the cluster. In order
to store a huge amount of data, different producers within an application send data to the
Kafka cluster (a minimal producer sketch in Scala follows this list).
o Consumers: A consumer is the one that reads or consumes messages from the Kafka cluster.
There can be several consumers consuming different types of data from the cluster. The
beauty of Kafka is that each consumer knows from where it needs to consume the data.
o Brokers: A Kafka server is known as a broker. A broker is a bridge between producers and
consumers. If a producer wishes to write data to the cluster, it is sent to a Kafka server. All
brokers lie within a Kafka cluster itself, and there can be multiple brokers.
o Topics: A topic is a common name or heading given to represent a similar type of data. In
Apache Kafka, there can be multiple topics in a cluster. Each topic holds a different type of
message.
o Partitions: The data or messages are divided into small subparts known as partitions. Each
partition carries data within it, with an offset value for each record. The data is always written
in a sequential manner, and the number of partitions and offset values can keep growing.
However, it is not guaranteed to which partition a message will be written.
o ZooKeeper: ZooKeeper is used to store information about the Kafka cluster and details of
the consumer clients. It manages brokers by maintaining a list of them, and it is also
responsible for choosing a leader for the partitions. If any change occurs, such as a broker
dying or a new topic being created, ZooKeeper sends a notification to Apache Kafka.
ZooKeeper is designed to operate with an odd number of servers; it has a leader server that
handles all the writes, while the rest of the servers are followers that handle the reads.
However, a user does not interact with ZooKeeper directly, but via brokers. No Kafka
server can run without a ZooKeeper server, so running one is mandatory.
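As referenced in the Producers item above, here is a minimal producer sketch in Scala using the
standard kafka-clients API. The broker address localhost:9092, the topic name "page-views", and the
key/value contents are assumptions for the example.

CODE: import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaProducerSketch extends App {
  // Minimal producer configuration: where the brokers are and how to
  // serialize the key and value of each record.
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)

  // Write one message to the (hypothetical) topic "page-views"; the key decides
  // which partition the record lands in, so records with the same key stay ordered.
  producer.send(new ProducerRecord[String, String]("page-views", "user-42", "clicked /home"))

  producer.flush()
  producer.close()
}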

14. ADVANTAGES AND DISADVANTAGES OF

-HIVE,

Advantages of Hive

 Keeps queries running fast


 Takes very little time to write Hive query in comparison to MapReduce code
 HiveQL is a declarative language like SQL
 Provides the structure on an array of data formats
 Multiple users can query the data with the help of HiveQL
 Very easy to write query including joins in Hive
 Simple to learn and use

Disadvantages of Hive

 Useful only when the data is structured


 Some analytical operations still have to be written as MapReduce programs
 Debugging code is very difficult
 You can’t do complicated operations

-KAFKA,
Advantages of Apache Kafka

Following advantages of Apache Kafka makes it worthy:

1. User Friendly

More than one consumer may be waiting to handle messages. When there is a need to integrate with multiple
consumers, creating one integration is sufficient. The integration is made simple even for consumers with a
variety of languages and behaviours.

2. Reliability

As compared to other messaging services, Kafka is considered to be more reliable. In the event of a machine
failure, Kafka provides resilience by replicating data, and the consumers are automatically
rebalanced.

3. Durability

Kafka ensures a durable messaging service by storing data as quickly as possible. The messages are persisted on
disk, which is one of the reasons data is not lost.

4. Latency

The latency offered by Kafka is very low: not more than about 10 milliseconds, so messages are consumed
by consumers almost as soon as they arrive. Kafka can handle this volume of messages because the
messages are automatically decoupled from the systems that produce them.

5. Scalability

Apache Kafka is a scalable solution. It allows you to add additional nodes without facing any downtime.
Kafka also possesses transparent message-handling capabilities and is able to process even terabytes of data
seamlessly.

6. Real-time Data Control


Handling real-time data pipelines is crucial for many applications. Kafka makes it easy to build real-time data
pipelines spanning storage, processing, and analytics.

7. Buffering Action

Apache Kafka comes with its own set of servers known as a cluster. The cluster makes sure the system does
not crash when data transfer is happening in real time. Kafka acts as a buffer by taking data from source
systems and redirecting it to the target systems.

Disadvantages Of Apache Kafka

With the above advantages, there are following limitations/disadvantages of Apache Kafka:

1. Do not have complete set of monitoring tools: Apache Kafka does not contain a complete
set of monitoring as well as managing tools. Thus, new startups or enterprises fear to work
with Kafka.
2. Message tweaking issues: The Kafka broker uses system calls to deliver messages to the
consumer. In case, the message needs some tweaking, the performance of Kafka gets
significantly reduced. So, it works well if the message does not need to change.
3. Do not support wildcard topic selection: Apache Kafka does not support wildcard topic
selection. Instead, it matches only the exact topic name, which makes it unable to address
certain use cases.
4. Reduces Performance: Brokers and consumers reduce the performance of Kafka by
compressing and decompressing the data flow. This not only affects its performance but also
affects its throughput.
5. Clumsy Behaviour: Apache Kafka often behaves a bit clumsily when the number of
queues in the Kafka cluster increases.
6. Lack some message paradigms: Certain message paradigms such as point-to-point queues,
request/reply, etc. are missing in Kafka for some use cases.
-HBASE,
Advantages of HBase –

1. Can store large data sets

2. Database can be shared

3. Cost-effective from gigabytes to petabytes

4. High availability through failover and replication

Disadvantages of HBase –

1. No support for SQL structure

2. No transaction support

3. Sorted only on key

4. Memory issues on the cluster

-APACHE HADOOP,
Pros

1. Cost
Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient
model, unlike traditional relational databases that require expensive hardware and high-end
processors to deal with Big Data. The problem with traditional relational databases is that storing
massive volumes of data is not cost-effective, so companies started to discard the raw data,
which may not give them a correct picture of their business. Hadoop thus provides two main cost
benefits: it is open-source, meaning free to use, and it runs on commodity hardware, which is also
inexpensive.
2. Scalability
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive
machines in a cluster and processed in parallel. The number of these machines or nodes can be
increased or decreased as per the enterprise's requirements. In a traditional RDBMS (Relational
DataBase Management System) the systems cannot be scaled to handle large amounts of data.
3. Flexibility
Hadoop is designed in such a way that it can deal with any kind of dataset, such as structured (MySQL
data), semi-structured (XML, JSON) and unstructured (images and videos), very efficiently. This
means it can easily process any kind of data independent of its structure, which makes it highly
flexible. This is very useful for enterprises, as they can process large datasets easily, so businesses
can use Hadoop to analyse valuable insights from sources like social media, email, etc. With this
flexibility Hadoop can be used for log processing, data warehousing, fraud detection, and so on.
4. Speed
Hadoop uses a distributed file system to manage its storage, i.e. HDFS (Hadoop Distributed File
System). In a DFS (Distributed File System) a large file is broken into small file blocks that are then
distributed among the nodes available in a Hadoop cluster. This massive number of file blocks is
processed in parallel, which makes Hadoop faster and gives it high performance compared to
traditional database management systems. When you are dealing with a large amount of
unstructured data, speed is an important factor; with Hadoop you can easily access TBs of data in
just a few minutes.
5. Fault Tolerance
Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In
Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of
data even if one of your systems crashes. If a machine faces a technical issue, the same data can be
read from other nodes in the cluster, because the data is copied or replicated by default. Hadoop
makes 3 copies of each file block and stores them on different nodes.
6. High Throughput
Hadoop works on a distributed file system where various jobs are assigned to various DataNodes in a
cluster, and this data is processed in parallel in the Hadoop cluster, which produces high
throughput. Throughput is simply the amount of work done per unit time.
7. Minimum Network Traffic
In Hadoop, each task is divided into various small sub-tasks, which are then assigned to the data nodes
available in the Hadoop cluster. Each data node processes a small amount of data, which leads to low
traffic in the Hadoop cluster.
Cons

1. Problem with Small files


Hadoop performs efficiently over a small number of files of large size. Hadoop stores files in
the form of file blocks which range from 128MB (by default) to 256MB in size. Hadoop fails when it
needs to access a large number of small files, because so many small files overload the NameNode
and make it difficult to work.
2. Vulnerability
Hadoop is a framework written in Java, and Java is one of the most commonly used programming
languages, which makes it more insecure since it can be easily exploited by cyber-criminals.
3. Low Performance In Small Data Surrounding
Hadoop is mainly designed for dealing with large datasets, so it can be efficiently utilized by
organizations that are generating a massive volume of data. Its efficiency decreases when it is used
in small-data surroundings.
4. Lack of Security
Data is everything for an organization, yet by default the security features in Hadoop are disabled.
So whoever manages the cluster needs to be careful about security and take appropriate action.
Hadoop uses Kerberos for its security features, which is not easy to manage. Storage and network
encryption are missing in Kerberos, which makes this an even bigger concern.
5. High Up Processing
Read/write operations in Hadoop are expensive, since we are dealing with large data sizes in the
range of TB or PB. In Hadoop, data is read from and written to disk, which makes it difficult to
perform in-memory calculations and leads to processing overhead.
6. Supports Only Batch Processing
A batch process is simply a process that runs in the background and does not have any kind of
interaction with the user. The engines used for these processes inside the Hadoop core are not that
efficient, and producing output with low latency is not possible with them.

-TEZ,
Apache Tez is a data processing framework that is often used as an alternative to the
classic MapReduce processing model in the Hadoop ecosystem. Tez is designed to
improve the performance and efficiency of data processing, making it a valuable tool
in data science and big data analytics. Here are some of the advantages of Apache
Tez in the context of data science:

1. Performance Improvement: Tez is optimized for performance and is designed to


execute data processing jobs more efficiently than the traditional MapReduce model.
It achieves this by reducing the overhead associated with launching separate Map
and Reduce tasks for each job and by optimizing task scheduling.
2. Complex Data Flows: Tez supports complex, directed acyclic graphs (DAGs) of data
processing tasks. This flexibility allows data scientists to create more intricate data
processing pipelines, enabling the efficient execution of complex workflows.
3. Reduced Latency: Tez minimizes job startup and intermediate data transfer
overhead, leading to reduced processing latency. This makes it suitable for real-time
or near-real-time data processing scenarios, which are crucial in data science
applications.
4. Resource Management: Tez is integrated with YARN (Yet Another Resource
Negotiator), which helps with dynamic resource allocation and management. It
optimizes resource utilization, ensuring that data science tasks are executed
efficiently without wasting resources.
5. Custom Data Processing: Tez supports custom data processing logic and allows
users to define their data processing tasks, which is valuable for data science tasks
that require specialized computations.
6. Data Locality Optimization: Tez takes data locality into account when scheduling
tasks, which means it tries to execute tasks on nodes where the data they need is
already located. This reduces data transfer overhead and improves performance.
7. Reusability: Tez provides a framework for building custom data processing
applications, making it a reusable solution for various data science tasks. This
reusability can lead to faster development and more efficient processing.
8. Integration with Ecosystem Tools: Tez is well integrated with other tools and
frameworks in the Hadoop ecosystem, such as Hive and Pig. Data scientists can
leverage Tez for enhanced performance in these data processing applications.
9. Support for In-Memory Processing: Tez supports in-memory data processing,
which can significantly accelerate certain data science operations, particularly when
dealing with iterative algorithms commonly used in machine learning and graph
processing.
10. Resource Isolation: Tez provides resource isolation mechanisms that help prevent
one job from consuming all available cluster resources, ensuring fair sharing of
resources among multiple jobs.
11. Distributed Caching: Tez supports distributed caching, allowing data to be cached
across nodes for efficient data processing. This is useful for data science tasks that
require reference data or models
-PIG
Pig is a scripting platform that runs on Hadoop clusters, designed to process and analyze large
datasets. Pig uses a language called Pig Latin, which is similar to SQL. This language does not require
as much code in order to analyze data.

Advantages of Pig

 Creates a sequence of MapReduce jobs that run on the Hadoop cluster


 Decrease in deployment time
 Uses its own language, called Pig Latin
 Perfect for programmers and software developers
 Easy to write and read
 Provides data operations such as ordering, filters, and joins

Disadvantages of Pig

 The errors that Pig produces are not helpful


 Not mature
 The data schema is not enforced explicitly, only implicitly
 Commands are not executed until you dump an intermediate result
 There is no real IDE for writing Pig scripts; editors such as Vim offer little more than syntax completion

15. PIG VS HIVE


Hive Pig

Hive is commonly used by Data Analysts. Pig is commonly used by programmers.

It follows SQL-like queries. It follows the data-flow language.

It can handle structured data. It can handle semi-structured data.

It works on server-side of HDFS cluster. It works on client-side of HDFS cluster.

Hive is slower than Pig. Pig is comparatively faster than Hive.


Pig Hive

Operates on the client-side of a cluster. Operates on the server-side of a cluster.

Procedural Data Flow Language. Declarative SQLish Language.

Pig is used for programming. Hive is used for creating reports.

Majorly used by Researchers and Programmers. Used by Data Analysts.

Used for handling structured and semi-structured data. It is used in handling structured data.

Scripts end with .pig extension. Hive supports all extensions.

Supports Avro file format. Does not support Avro file format.

Does not have a dedicated metadata database. Uses an exact variation of the dedicated SQL-DDL language by
defining tables beforehand.

16. PIG VS HBASE


Pig: It is used for semi-structured data. HBase: It is a data store, particularly for unstructured data.

Pig: The Pig Hadoop component is generally used by researchers and programmers. HBase: It is extensively
used for transactional processing, where the query response time is not highly interactive, i.e. OLTP.

Pig: The Pig Hadoop component operates on the client side of any cluster. HBase: Operations in HBase are
run in real time on the database.

Pig: Avro is supported for Pig. HBase: The client reading/writing the data has to deal with the Avro schemas
after HBase delivers the raw data to it.

Pig: Pig Hadoop is a great ETL tool for big data because of its powerful transformation and processing
capabilities. HBase: The HBase component is helpful for ETL.

Pig: Pig is a high-level language that compiles to MapReduce. HBase: HBase allows Hadoop to support
transactions on key-value pairs.

Pig: Pig is also SQL-like but varies to a great extent, so it takes some time and effort to master. HBase:
HBase allows you to do quick random access versus scanning all of the data sequentially, and to
insert/update/delete from the middle, not just add/append.

17. HIVE VS HBASE


Hive: Apache Hive is a query engine. HBase: HBase is a data store, particularly for unstructured data.

Hive: Apache Hive is not ideally a database but a MapReduce-based SQL engine that runs atop Hadoop.
HBase: HBase is a NoSQL database that is commonly used for real-time data streaming.

Hive: Apache Hive is used for batch processing (that means, OLAP based). HBase: HBase is extensively used
for transactional processing, and in the process the query response time is not highly interactive (that
means OLTP).

Hive: Operations in Hive do not run in real time. HBase: Operations in HBase are said to run in real time on
the database instead of being transformed into MapReduce jobs.

Hive: Apache Hive is to be used for analytical queries. HBase: HBase is to be used for real-time queries.

Hive: Apache Hive has the limitation of higher latency. HBase: HBase does not have any analytical
capabilities.

Difference between Hive and HBase:

1. Basics: Hive is a query engine that uses queries mostly similar to SQL queries. HBase is data storage,
particularly for unstructured data.

2. Used for: Hive is mainly used for batch processing (that means OLAP-based). HBase is extensively used
for transactional processing (that means OLTP).

3. Processing: Hive cannot be used for real-time processing, since immediate analysis results cannot be
obtained; operations in Hive require batch processing and normally take a long time to complete. HBase
can be used to process data in real time, and transactional operations are faster than non-transactional
operations (since HBase stores data in the form of key-value pairs).

4. Queries: Hive is used only for analytical queries and is mostly used to analyze Big Data. HBase is used for
real-time querying and is mostly used to query Big Data.

5. Runs on: Hive runs on top of Hadoop. HBase runs on top of HDFS (Hadoop Distributed File System).

6. Database: Apache Hive is not a database. HBase supports the NoSQL database model.

7. Schema: Hive has a schema model. HBase is free from the schema model.

8. Latency: Hive is made for high-latency operations, as batch processing takes time. HBase is made for
low-latency operations.

9. Cost: Hive is expensive as compared to HBase. HBase is cost-effective as compared to Hive.

10. Query Language: Hive uses HQL (Hive Query Language). HBase does not have a specialized query
language to conduct CRUD (Create, Read, Update, and Delete) activities; it includes a Ruby-based shell
where you can use Get, Put, and Scan functions to edit your data.

11. Level of Consistency: Hive offers eventual consistency. HBase offers immediate consistency.

12. Secondary Indexes: Hive supports secondary indexes. HBase does not support secondary indexes.

13. Example: HubSpot uses Hive. Facebook uses HBase.

18. HIVE VS MONGO DB

1. Hive was developed by the Apache Software Foundation in 2012. MongoDB was developed by MongoDB
Inc. in 2009.

2. Hive is open-source software. MongoDB is also open-source software.

3. Server operating systems for Hive: all OSes with a Java VM. Server operating systems for MongoDB:
Solaris, Linux, OS X, Windows.

4. The replication method that Hive supports is Selectable Replication Factor. The replication method that
MongoDB supports is Master-Slave Replication.

5. Hive supports the C++, Java, PHP and Python programming languages. MongoDB supports many
programming languages such as C, C#, Java, JavaScript, PHP, Lua, Python, R, Ruby, etc.

6. Hive supports the Sharding partitioning method. MongoDB also supports the Sharding partitioning
method.

7. Hive's primary database model is a relational DBMS. MongoDB's primary database model is a document
store.

8. Hive uses JDBC, ODBC and Thrift as APIs and other access methods. MongoDB uses a proprietary
protocol based on JSON as its API and access method.

9. Hive does not support in-memory capabilities. MongoDB supports in-memory capabilities.

10. Hive has no transaction concept. MongoDB uses the ACID properties of transactions.

19. MONGO DB VS HBASE


1. Developed by: HBase was developed by the Apache Software Foundation. MongoDB was developed by
MongoDB Inc.

2. Website: hbase.apache.org (HBase); www.mongodb.com (MongoDB).

3. Technical Documentation: hbase.apache.org (HBase); docs.mongodb.com/manual (MongoDB).

4. Primary Database Model: HBase is a column-oriented store. MongoDB is a document store.

5. Implementation Language: HBase is written in Java. MongoDB is written in C++.

6. Server OS: HBase runs on Linux, Unix, Windows. MongoDB runs on Linux, OS X, Solaris, Windows.

7. Supported Programming Languages: HBase supports C, C#, C++, Groovy, Java, PHP, Python, Scala.
MongoDB supports C, C#, C++, Erlang, Haskell, Java, JavaScript, Perl, PHP, Python, Ruby, Scala.

8. Edition: HBase has a Community edition. MongoDB has Community (free) and Enterprise editions.

9. Secondary Index: HBase has no secondary indexes. MongoDB has secondary indexes.

10. Storing data: In HBase, data is stored in the form of key/value pairs. In MongoDB, data is not stored in
the form of key/value pairs.

11. Data Type: HBase is used to store structured data. MongoDB is used to store any kind of data.

20. HBASE VS HDFS

HDFS: HDFS is a distributed file system suitable for storing large files. HBase: HBase is a database built on
top of HDFS.

HDFS: HDFS does not support fast individual record lookups. HBase: HBase provides fast lookups for
larger tables.

HDFS: HDFS provides high-latency batch processing. HBase: HBase provides low-latency access to single
rows from billions of records (random access).

HDFS: HDFS provides only sequential access to data. HBase: HBase internally uses hash tables and
provides random access, and it stores the data in indexed HDFS files for faster lookups.

HDFS (Hadoop Distributed File System): HDFS is a Java-based distributed file system. HBase: HBase is a
Hadoop database that runs on top of HDFS.

HDFS: HDFS is highly fault-tolerant and cost-effective. HBase: HBase is partially tolerant and highly
consistent.

HDFS: HDFS provides only sequential read/write operations. HBase: Random access is possible due to its
use of hash tables.

HDFS: HDFS is based on a write-once, read-many-times model. HBase: HBase supports random read and
write operations into the filesystem.

HDFS: HDFS has a rigid architecture. HBase: HBase supports dynamic changes.

HDFS: HDFS is preferable for offline batch processing. HBase: HBase is preferable for real-time processing.

HDFS: HDFS provides high latency for access operations. HBase: HBase provides low-latency access to
small amounts of data.

21. MAPREDUCE VS HDFS

22. HADOOP , HIVE EXPLAIN (COMMAND)

HIVE:
Hive supports a Data Definition Language (DDL), a Data Manipulation Language (DML) and user-defined
functions. A minimal sketch showing a few of these commands follows the lists below.
Hive DDL Commands

create database

drop database

create table

drop table

alter table

create index

create views

Hive DML Commands

Select

Where

Group By

Order By

Load Data

Join:

o Inner Join
o Left Outer Join
o Right Outer Join
o Full Outer Join
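
As referenced above, here is a minimal sketch of some of these DDL/DML commands, issued from Scala
through a SparkSession built with Hive support. The database name sales_db, the orders table, its columns,
and the local file path /tmp/orders.csv are hypothetical.

CODE: import org.apache.spark.sql.SparkSession

object HiveCommandsSketch extends App {
  // A SparkSession with Hive support lets us issue HiveQL statements directly.
  val spark = SparkSession.builder()
    .appName("hive-commands-sketch")
    .enableHiveSupport()
    .getOrCreate()

  // DDL: create a database and a comma-delimited table inside it.
  spark.sql("CREATE DATABASE IF NOT EXISTS sales_db")
  spark.sql("""CREATE TABLE IF NOT EXISTS sales_db.orders
               (id INT, customer STRING, amount DOUBLE)
               ROW FORMAT DELIMITED FIELDS TERMINATED BY ','""")

  // DML: load data, then query with WHERE / GROUP BY / ORDER BY.
  spark.sql("LOAD DATA LOCAL INPATH '/tmp/orders.csv' INTO TABLE sales_db.orders")
  spark.sql("""SELECT customer, SUM(amount) AS total
               FROM sales_db.orders
               WHERE amount > 0
               GROUP BY customer
               ORDER BY total DESC""").show()

  spark.stop()
}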

23. NOTE ON ARCHITECTURE OF TEZ


(MIMP)
Apache Tez is a data processing framework designed for optimizing and accelerating
data processing in the Hadoop ecosystem. Its architecture is designed to provide
flexibility, performance, and scalability. Here's a note on the architecture of Tez:
Tez Architecture Overview:

Tez's architecture is centered around Directed Acyclic Graphs (DAGs) and provides a
flexible and efficient execution framework for complex data processing workflows.
The key components and concepts in Tez's architecture include:

1. DAG (Directed Acyclic Graph): DAGs are used to represent complex data
processing workflows. They consist of vertices and edges, where vertices represent
tasks and edges define the data flow between tasks. DAGs allow users to model and
execute multi-stage data processing jobs.
2. Vertices: In Tez, vertices are the units of computation within a DAG. They can be of
various types, including map vertices, reduce vertices, and custom vertices. Custom
vertices can encapsulate user-specific processing logic, allowing for more specialized
data processing.
3. Edges: Edges in a DAG connect vertices and define the data flow between them.
Edges specify the type of data movement, such as one-to-one, one-to-many, or
many-to-one, and also define data ordering and routing.
4. DAG ApplicationMaster (AM): The DAG ApplicationMaster is responsible for
managing and coordinating the execution of a DAG. It interacts with the
ResourceManager in the Hadoop YARN cluster to allocate resources, such as
containers for tasks. The AM monitors task progress, handles task failures, and
ensures the successful execution of the DAG.
5. Tez Sessions: A Tez session represents a runtime execution context for a DAG. It
includes a set of containers where tasks run and where data is managed. These
sessions facilitate efficient resource allocation and data movement.
6. Task Containers: Tez tasks run within containers allocated by YARN (Yet Another
Resource Negotiator). Containers provide a controlled execution environment with
access to CPU, memory, and data locality considerations. Containers manage task
execution, which includes reading input, processing data, and writing output.
7. Data Flow: Tez uses a data flow model to manage the movement of data between
tasks. The data flow is directed along edges, allowing for efficient data transfer and
data locality optimization.
8. Input/Output (I/O) Handlers: Tez provides Input and Output Handlers, which
enable reading data from and writing data to external storage systems. These
handlers support various data formats and data sources, making it easier to integrate
Tez with different data systems.

Advantages of Tez's Architecture:

 Performance Optimization: Tez's architecture is designed for performance. It


reduces overhead by efficiently managing task execution, data flow, and resource
allocation, resulting in faster data processing.
 Complex Workflows: The support for custom vertices and complex DAGs allows for
the modeling of intricate data processing workflows, which is valuable for data
science and big data applications.
 Efficient Data Movement: The data flow model and data locality optimization in Tez
minimize data transfer overhead, leading to reduced processing latency and
improved performance.
 Resource Management: Tez integrates with YARN for dynamic resource allocation
and efficient resource management. It optimizes resource utilization to ensure that
data processing jobs are executed efficiently.
 Scalability: Tez is designed to handle large-scale data processing, making it suitable
for big data applications where the volume of data can be substantial.

24. NOTE ON ARCHITECTURE OF HIVE


(MIMP)

Architecture of Hive

The Hive architecture contains the following units:

User Interface: Hive is a data warehouse infrastructure software that can create interaction between the
user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD
Insight (on Windows Server).

Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases,
columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is
one of the replacements of the traditional approach of writing a MapReduce program. Instead of writing a
MapReduce program in Java, we can write a query for a MapReduce job and process it.

Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive
Execution Engine. The execution engine processes the query and generates the same results as MapReduce.
It uses the flavour of MapReduce.

HDFS or HBASE: The Hadoop Distributed File System or HBase is the data storage technique used to
store data in the file system.

25. NOTE ON ARCHITECTURE OF HBASE


(MIMP)

In HBase, tables are split into regions and are served by the region servers. Regions are vertically
divided by column families into “Stores”. Stores are saved as files in HDFS. Shown below is the
architecture of HBase.

Note: The term ‘store’ is used for regions to explain the storage structure.

HBase has three major components: the client library, a master server, and region servers. Region
servers can be added or removed as per requirement.

MasterServer

The master server -

 Assigns regions to the region servers and takes the help of Apache ZooKeeper for this task.
 Handles load balancing of the regions across region servers. It unloads the busy servers and
shifts the regions to less occupied servers.
 Maintains the state of the cluster by negotiating the load balancing.
 Is responsible for schema changes and other metadata operations such as creation of tables
and column families.

Regions

Regions are nothing but tables that are split up and spread across the region servers.

Region server

The region servers have regions that -

 Communicate with the client and handle data-related operations.


 Handle read and write requests for all the regions under it.
 Decide the size of the region by following the region size thresholds.

When we take a deeper look into the region server, we see that it contains regions and stores.

The store contains the MemStore and HFiles. The MemStore is just like a cache memory: anything that is
entered into HBase is stored here initially. Later, the data is transferred and saved in HFiles as blocks and
the MemStore is flushed.

Zookeeper

 Zookeeper is an open-source project that provides services like maintaining configuration


information, naming, providing distributed synchronization, etc.
 Zookeeper has ephemeral nodes representing different region servers. Master servers use
these nodes to discover available servers.
 In addition to availability, the nodes are also used to track server failures or network
partitions.
 Clients communicate with region servers via zookeeper.
 In pseudo and standalone modes, HBase itself will take care of zookeeper.
26. NOTE ON ARCHITECTURE OF SPARK
(MIMP)

Spark Architecture

The Spark follows the master-slave architecture. Its cluster consists of a single master and multiple
slaves.

The Spark architecture depends upon two abstractions:

o Resilient Distributed Dataset (RDD)


o Directed Acyclic Graph (DAG)

Resilient Distributed Datasets (RDD)

The Resilient Distributed Datasets are the group of data items that can be stored in-memory on worker
nodes. Here,

o Resilient: Restore the data on failure.


o Distributed: Data is distributed among different nodes.
o Dataset: Group of data.

Directed Acyclic Graph (DAG)

Directed Acyclic Graph is a finite directed graph that performs a sequence of computations on data.
Each node is an RDD partition, and each edge is a transformation on top of the data. Here, the graph refers
to the navigation of the data, whereas directed and acyclic refer to how the computation is carried out.

Let's understand the Spark architecture.


Driver Program

The Driver Program is a process that runs the main() function of the application and creates
the SparkContext object. The purpose of SparkContext is to coordinate the spark applications,
running as independent sets of processes on a cluster.

To run on a cluster, the SparkContext connects to one of several types of cluster managers and then
performs the following tasks: -

o It acquires executors on nodes in the cluster.


o Then, it sends your application code to the executors. Here, the application code can be
defined by JAR or Python files passed to the SparkContext.
o At last, the SparkContext sends tasks to the executors to run.


Cluster Manager

o The role of the cluster manager is to allocate resources across applications. The Spark is
capable enough of running on a large number of clusters.
o It consists of various types of cluster managers such as Hadoop YARN, Apache Mesos and
Standalone Scheduler.
o Here, the Standalone Scheduler is a standalone spark cluster manager that facilitates to install
Spark on an empty set of machines.

Worker Node
o The worker node is a slave node
o Its role is to run the application code in the cluster.

Executor
o An executor is a process launched for an application on a worker node.
o It runs tasks and keeps data in memory or disk storage across them.
o It reads and writes data to external sources.
o Every application has its own executors.

Task
o A unit of work that will be sent to one executor.

Link: https://www.javatpoint.com/apache-spark-architecture
27. WRITE DOWN STEPS TO PERFORM
SINGLE NODE INSTALLATION OF HADOOP
Link: https://www.javatpoint.com/hadoop-installation
Link: https://www.tutorialspoint.com/how-to-install-and-configure-apache-hadoop-on-a-single-node-
in-centos-8

Installation Steps:

1. Download and Extract Hadoop:


 Place the downloaded Hadoop tarball in a directory of your choice (e.g., /opt or ~/).
 Open a terminal and navigate to the directory where you placed the tarball.
 Extract the tarball using the following command:

CODE: tar -xzvf hadoop-x.y.z.tar.gz

2. Configuration:
 Navigate to the Hadoop configuration directory:

CODE: cd hadoop-x.y.z/etc/hadoop

 Edit the hadoop-env.sh file to set the Java home. Set the value of JAVA_HOME to
the path of your Java installation:

CODE: export JAVA_HOME=/path/to/your/java

 Configure the Hadoop core-site.xml and hdfs-site.xml files. You can use the provided
template files ( core-site.xml and hdfs-site.xml) and customize them to specify
the Hadoop configuration settings. For example:
core-site.xml:

CODE: <configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost:9000</value>
</property>
</configuration>

hdfs-site.xml:

CODE: <configuration>

<property>

<name>dfs.replication</name>

<value>1</value>

</property>
</configuration>

3. Format the HDFS Filesystem:


Before you can start HDFS, you need to format the filesystem. Run the following
command:

CODE: hdfs namenode -format

4. Start Hadoop Services:


Start the Hadoop services, including the HDFS and YARN resource manager:

CODE: start-dfs.sh

start-yarn.sh

You can also start Hadoop services individually using the start-dfs.sh and start-
yarn.sh scripts.

5. Verify Installation:
Open a web browser and access the Hadoop NameNode web interface at
http://localhost:50070 (http://localhost:9870 on Hadoop 3.x). You should see the HDFS web interface,
which indicates that Hadoop is up and running.

6. Run a Sample Job:


To test your Hadoop installation, you can run a sample job using the hadoop jar
command. For example, you can run a simple word count job on sample text data:

CODE :hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-x.y.z.jar


wordcount input output

This command will count the words in the input file and store the result in the
output directory.

28. HADOOP V1 VS HADOOP V2

1. New Components and API — Hadoop 1: introduced prior to Hadoop 2, so it has fewer components and
APIs. Hadoop 2: introduced after Hadoop 1, so it has more components and APIs, such as the YARN API,
the YARN framework, and an enhanced Resource Manager.

2. Support — Hadoop 1: only supports the MapReduce processing model in its architecture and does not
support non-MapReduce tools. Hadoop 2: allows working with the MapReduce model as well as other
distributed computing models like Spark, Hama, Giraph, Message Passing Interface (MPI) and HBase
coprocessors.

3. Resource Management — Hadoop 1: MapReduce is responsible for both processing and cluster-resource
management. Hadoop 2: YARN is used for cluster resource management, while processing management is
done by different processing models.

4. Scalability — Hadoop 1: comparatively less scalable than Hadoop 2; limited to 4000 nodes per cluster.
Hadoop 2: better scalability; scalable up to 10000 nodes per cluster.

5. Implementation — Hadoop 1: follows the concept of slots, which can be used to run only a Map task or a
Reduce task. Hadoop 2: follows the concept of containers, which can be used to run generic tasks.

6. Windows Support — Hadoop 1: initially there was no support for Microsoft Windows provided by
Apache. Hadoop 2: Apache added support for Microsoft Windows.

UNIT 4:

1. Introduction to Map-Reduce
MapReduce is a programming model for writing applications that can process Big Data in parallel on
multiple nodes. MapReduce provides analytical capabilities for analysing huge volumes of complex
data.

What is MapReduce?
MapReduce is a data processing tool which is used to process data in parallel in a distributed
form. It was developed in 2004, on the basis of a paper titled "MapReduce: Simplified Data
Processing on Large Clusters" published by Google.

MapReduce is a paradigm which has two phases: the mapper phase and the reducer phase. In the
mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the
reducer as input, and the reducer runs only after the mapper is over. The reducer also takes input in key-
value format, and the output of the reducer is the final output.
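
To make the two phases concrete, here is a tiny sketch in plain Scala (not the Hadoop API) that mimics the
mapper, shuffle, and reducer steps of a word count on an in-memory collection; the input lines are made up.

CODE: object WordCountParadigm extends App {
  val input = Seq("big data tools", "big data systems")

  // Map phase: each input record is turned into (key, value) pairs,
  // here (word, 1) for every word in the line.
  val mapped: Seq[(String, Int)] =
    input.flatMap(line => line.split("\\s+").map(word => (word, 1)))

  // Shuffle: pairs are grouped by key so each reducer sees one key
  // together with all of its values.
  val shuffled: Map[String, Seq[Int]] =
    mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

  // Reduce phase: each key's values are combined into the final output.
  val reduced: Map[String, Int] =
    shuffled.map { case (word, counts) => (word, counts.sum) }

  reduced.foreach { case (word, count) => println(s"$word\t$count") }
}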

Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and process data. The
following illustration depicts a schematic view of a traditional enterprise system. Traditional model is
certainly not suitable to process huge volumes of scalable data and cannot be accommodated by
standard database servers. Moreover, the centralized system creates too much of a bottleneck while
processing multiple files simultaneously.
Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task
into small parts and assigns them to many computers. Later, the results are collected at one place and
integrated to form the result dataset.

Usage of MapReduce

o It can be used in various application like document clustering, distributed sorting, and web
link-graph reversal.
o It can be used for distributed pattern-based searching.
o We can also use MapReduce in machine learning.
o It was used by Google to regenerate Google's index of the World Wide Web.
o It can be used in multiple computing environments such as multi-cluster, multi-core, and
mobile environment.

MapReduce Architecture:
Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the Job to the MapReduce for processing.
There can be multiple clients available that continuously send jobs for processing to the Hadoop
MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do which is comprised of
so many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-parts.
4. Job-Parts: The task or sub-jobs that are obtained after dividing the main job. The result of all
the job-parts combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result is obtained after the processing.

2. Map-Reduce Hands-on with Hadoop streaming.

UNIT 3:

1. HDFS Architecture
Hadoop File System was developed using distributed file system design. It is run on commodity
hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost
hardware.
HDFS holds very large amount of data and provides easier access. To store such huge data, the files
are stored across multiple machines. These files are stored in redundant fashion to rescue the system
from possible data losses in case of failure. HDFS also makes applications available to parallel
processing.

Features of HDFS
 It is suitable for the distributed storage and processing.
 Hadoop provides a command interface to interact with HDFS.
 The built-in servers of name node and data node help users to easily check the status of
cluster.
 Streaming access to file system data.
 HDFS provides file permissions and authentication.

HDFS Architecture
Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and it has the following elements.

Namenode

The namenode is the commodity hardware that contains the GNU/Linux operating system and the
namenode software. It is a software that can be run on commodity hardware. The system having the
namenode acts as the master server and it does the following tasks −

 Manages the file system namespace.


 Regulates client’s access to files.
 It also executes file system operations such as renaming, closing, and opening files and
directories.
Datanode

The datanode is a commodity hardware having the GNU/Linux operating system and datanode
software. For every node (Commodity hardware/System) in a cluster, there will be a datanode. These
nodes manage the data storage of their system.

 Datanodes perform read-write operations on the file systems, as per client request.
 They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.
Block

Generally, the user data is stored in the files of HDFS. The file in a file system will be divided into
one or more segments and/or stored in individual data nodes. These file segments are called as blocks.
In other words, the minimum amount of data that HDFS can read or write is called a block. The default
block size is 64 MB in Hadoop 1.x (128 MB in Hadoop 2.x and later), but it can be increased as needed by
changing the HDFS configuration.

Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware,
failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic
fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having
huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place near
the data. Especially where huge datasets are involved, it reduces the network traffic and increases the
throughput.

2. HDFS Read / Write processes

3. HDFS Commands
Sr.No. Command & Description

-ls <path>
1 Lists the contents of the directory specified by path, showing the names, permissions, owner, size
and modification date for each entry.

-lsr <path>
2
Behaves like -ls, but recursively displays entries in all subdirectories of path.
-du <path>
3 Shows disk usage, in bytes, for all the files which match path; filenames are reported with the full
HDFS protocol prefix.

-dus <path>
4
Like -du, but prints a summary of disk usage of all files/directories in the path.

-mv <src><dest>
5
Moves the file or directory indicated by src to dest, within HDFS.

-cp <src> <dest>


6
Copies the file or directory identified by src to dest, within HDFS.

-rm <path>
7
Removes the file or empty directory identified by path.

-rmr <path>
8 Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or
subdirectories of path).

-put <localSrc> <dest>


9 Copies the file or directory from the local file system identified by localSrc to dest within the
DFS.

-copyFromLocal <localSrc> <dest>


10
Identical to -put

-moveFromLocal <localSrc> <dest>


11 Copies the file or directory from the local file system identified by localSrc to dest within HDFS,
and then deletes the local copy on success.

-get [-crc] <src> <localDest>


12 Copies the file or directory in HDFS identified by src to the local file system path identified by
localDest.

-getmerge <src> <localDest>


13 Retrieves all files that match the path src in HDFS, and copies them to a single, merged file in the
local file system identified by localDest.

-cat <filen-ame>
14
Displays the contents of filename on stdout.

-copyToLocal <src> <localDest>


15
Identical to -get

-moveToLocal <src> <localDest>


16
Works like -get, but deletes the HDFS copy on success.

-mkdir <path>
17 Creates a directory named path in HDFS.
Creates any parent directories in path that are missing (e.g., mkdir -p in Linux).

18 -setrep [-R] [-w] rep <path>


Sets the target replication factor for files identified by path to rep. (The actual replication factor
will move toward the target over time)

-touchz <path>
19 Creates a file at path containing the current time as a timestamp. Fails if a file already exists at
path, unless the file is already size 0.

-test -[ezd] <path>


20
Returns 1 if path exists, has zero length, or is a directory; returns 0 otherwise.

-stat [format] <path>


21 Prints information about path. Format is a string which accepts file size in blocks (%b), filename
(%n), block size (%o), replication (%r), and modification date (%y, %Y).

-tail [-f] <file2name>


22
Shows the last 1KB of file on stdout.

-chmod [-R] mode,mode,... <path>...


23 Changes the file permissions associated with one or more objects identified by path.... Performs
changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no
scope is specified and does not apply an umask.

-chown [-R] [owner][:[group]] <path>...


24 Sets the owning user and/or group for files or directories identified by path.... Sets owner
recursively if -R is specified.

-chgrp [-R] group <path>...


25 Sets the owning group for files or directories identified by path.... Sets group recursively if -R is
specified.

-help <cmd-name>
26 Returns usage information for one of the commands listed above. You must omit the leading '-'
character in cmd.

4. Native Java APIs, REST APIs
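
The same file operations shown in the command table above can be performed programmatically through
the native Hadoop FileSystem API. Here is a minimal sketch in Scala (the Java API used from Scala); the
NameNode address hdfs://localhost:9000 and the paths /user/demo and /tmp/sample.txt are hypothetical.

CODE: import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsApiSketch extends App {
  // Point the client at the NameNode configured during installation.
  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://localhost:9000")
  val fs = FileSystem.get(conf)

  // Equivalent of `-mkdir` and `-put` from the command table above.
  val dir = new Path("/user/demo")
  if (!fs.exists(dir)) fs.mkdirs(dir)
  fs.copyFromLocalFile(new Path("/tmp/sample.txt"), new Path("/user/demo/sample.txt"))

  // Equivalent of `-ls`: list the directory contents with their sizes.
  fs.listStatus(dir).foreach(status => println(s"${status.getPath}\t${status.getLen} bytes"))

  fs.close()
}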

UNIT 5:

1. Resilient Distributed Datasets (RDDs)


What Is a Resilient Distributed Dataset?
A Resilient Distributed Dataset (RDD) is a low-level API and Spark's underlying data abstraction. An
RDD is a static set of items distributed across clusters to allow parallel processing. The data structure
stores any Python, Java, Scala, or user-created object.

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable


distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be
computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala
objects, including user-defined classes.

Formally, an RDD is a read-only, partitioned collection of records. There are two ways to create
RDDs − parallelizing an existing collection in your driver program, or referencing a dataset in an
external storage system, such as a shared file system, HDFS, HBase, or any data source offering a
Hadoop Input Format.
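
A minimal Scala sketch of these two creation paths is shown below; the application name, the local master
setting, and the HDFS file path are assumptions for the example.

CODE: import org.apache.spark.{SparkConf, SparkContext}

object RddCreationSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("rdd-creation").setMaster("local[*]"))

  // Way 1: parallelize an existing collection in the driver program.
  val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

  // Way 2: reference a dataset in external storage (a hypothetical HDFS file here).
  val lines = sc.textFile("hdfs://localhost:9000/user/demo/sample.txt")

  println(numbers.sum())   // action on the parallelized RDD
  println(lines.count())   // action on the file-backed RDD

  sc.stop()
}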

Why Do We Need RDDs in Spark?

RDDs address MapReduce's shortcomings in data sharing. When reusing data for computations,
MapReduce requires writing to external storage (HDFS, Cassandra, HBase, etc.). The read and write
processes between jobs consume a significant amount of memory.

Furthermore, data sharing between tasks is slow due to replication, serialization, and increased disk
usage.

RDDs aim to reduce the usage of external storage systems by leveraging in-memory compute
operation storage. This approach improves data exchange speeds between tasks by 10 to 100 times.

How Does RDD Store Data?

An RDD stores data in read-only mode, making it immutable. Performing operations on existing
RDDs creates new objects without manipulating existing data.

RDDs reside in RAM through a caching process. Data that does not fit is either recalculated to reduce
the size or stored on a permanent storage. Caching allows retrieving data without reading from disk,
reducing disk overhead.

RDDs further distribute the data storage across multiple partitions. Partitioning allows data recovery
in case a node fails and ensures the data is available at all times.
Spark's RDD uses a persistence optimization technique to save computation results. Two methods
help achieve RDD persistence:

 cache()
 persist()

These methods provide an interactive storage mechanism by choosing different storage levels. The
cached memory is fault-tolerant, allowing the recreation of lost RDD partitions through the initial
creation operations.
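
Here is a short sketch of both persistence methods, written spark-shell style (sc is the SparkContext
provided by the shell; the log file path and its contents are made up).

CODE: import org.apache.spark.storage.StorageLevel

// A hypothetical log file on HDFS.
val logs = sc.textFile("hdfs://localhost:9000/logs/app.log")

// cache() keeps an RDD in memory (MEMORY_ONLY) after it is first computed.
val errors = logs.filter(_.contains("ERROR")).cache()

// persist() does the same but lets us pick the storage level explicitly,
// e.g. spill partitions to disk when they do not fit in RAM.
val warnings = logs.filter(_.contains("WARN")).persist(StorageLevel.MEMORY_AND_DISK)

// After the first action, later actions on each RDD reuse the cached
// partitions instead of re-reading and re-filtering the file.
println(errors.count())
println(errors.take(5).mkString("\n"))
println(warnings.count())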

Spark RDD Features

The main features of a Spark RDD are:

 In-memory computation. Data calculation resides in memory for faster access and fewer I/O
operations.
 Fault tolerance. The tracking of data creation helps recover or recreate lost data after a node
failure.
 Immutability. RDDs are read-only. The existing data cannot change, and transformations on
existing data generate new RDDs.
 Lazy evaluation. Data does not load immediately after definition - the data loads when
applying an action to the data.

Advantages and Disadvantages of RDDs

The advantages of using RDDs are:

 Data resilience. The self-recovery mechanism ensures data is never lost, regardless of whether
a machine fails.
 Data consistency. Since RDDs do not change over time and are only available for reading,
data consistency maintains throughout various operations.
 Performance speeds. Data is stored in RAM whenever possible instead of on disk, while RDDs still
allow on-disk storage when needed, which together provide a massive performance and flexibility
boost.
The disadvantages when working with Resilient Distributed
Datasets include:
 No schematic view of data. RDDs have a hard time dealing with structured data. A better
option for handling structured data is through the DataFrames and Datasets APIs, which fully
integrate with RDDs in Spark.
 Garbage collection. Since RDDs are in-memory objects, they rely heavily on Java's memory
management and serialization. This causes performance limitations as data grows.
 Overflow issues. When RDDs run out of RAM, the information resides on a disk, requiring
additional RAM and disk space to overcome overflow issues.

 No automated optimization. An RDD does not have functions for automatic input
optimization. While other Spark objects, such as DataFrames and Datasets, use the Catalyst
optimizer, for RDDs, optimization happens manually.

2. RDD Operations
Spark RDD Operations
RDDs offer two operation types:

1. Transformations are operations on RDDs that result in RDD creation.

2. Actions are operations that do not result in RDD creation and provide some other value.

Transformations perform various operations and create new RDDs as a result.

Actions come as a final step after completed modifications and return a non-RDD result (such
as the total count) from the data stored in the Spark Driver.
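
A small spark-shell style sketch of the two operation types (sc is the SparkContext provided by the shell;
the numbers are made up):

CODE: val nums = sc.parallelize(1 to 10)

// Transformations: lazily describe a new RDD, nothing is computed yet.
val evens   = nums.filter(_ % 2 == 0)
val squared = evens.map(n => n * n)

// Actions: trigger the actual computation and return a non-RDD value to the driver.
println(squared.count())                  // 5
println(squared.collect().mkString(", ")) // 4, 16, 36, 64, 100
println(squared.reduce(_ + _))            // 220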
3. Printing elements of an RDD
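
A minimal sketch (spark-shell style, sc provided by the shell; the sample words are made up). The main
point is that on a cluster the elements have to be brought back to the driver before printing:

CODE: val words = sc.parallelize(Seq("alpha", "beta", "gamma"))

// On a cluster, words.foreach(println) would print on the executors, not on
// the driver, so the output is not visible in the driver console.
// Bring the data back to the driver first:
words.collect().foreach(println)   // prints every element on the driver

// For large RDDs, collect() can exhaust driver memory; print a sample instead.
words.take(2).foreach(println)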
4. Introduction to Graphx

What is Spark GraphX?

GraphX is the newest component in Spark. It represents data as a directed multigraph, which can contain
multiple parallel edges between the same vertices and can be used to represent a wide range of data
structures. It also has associated properties attached to each vertex and edge.

GraphX supports several fundamental operators and an optimized variant of the Pregel API.
In addition to these tools, it includes a growing collection of algorithms that help you analyze
your data.

Spark GraphX Features

Spark GraphX is the most powerful and flexible graph processing system available today. It
has a growing library of algorithms that can be applied to your data, including PageRank,
connected components, SVD++, and triangle count.

In addition, Spark GraphX can also view and manipulate graphs and computations. You can
use RDDs to transform and join graphs. A custom iterative graph algorithm can also be
written using the Pregel API.

While Spark GraphX retains its flexibility, fault tolerance, and ease-of-use, it delivers
comparable performance to the fastest specialized graph processors.

Spark GraphX is a component of Apache Spark, a powerful open-source data
processing engine for big data processing and analytics. GraphX is designed to
handle graph-based computations efficiently within the Spark framework. It allows
users to create and process large-scale graphs and perform various graph-based
algorithms.

Key features and capabilities of Spark GraphX include:


1. Graph Creation: Users can create directed and undirected graphs by defining
vertices and edges using RDDs (Resilient Distributed Datasets), the fundamental data
structure in Spark (see the sketch after this list).
2. Graph Transformation: GraphX provides a range of graph transformations and
operations for manipulating and analyzing graphs. These operations include
subgraph extraction, filtering, mapping, and more.
3. Graph Algorithms: The library offers a collection of built-in graph algorithms, such
as PageRank, connected components, and graph colouring, making it easier to
perform common graph computations.
4. Graph Operators: Users can apply mathematical operations to the vertices and
edges of the graph, allowing for custom computations and analysis.
5. Integration with Spark Ecosystem: Spark GraphX is designed to work seamlessly
with other components of the Apache Spark ecosystem, such as Spark SQL, MLlib,
and Spark Streaming.
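
As referenced above, here is a minimal spark-shell style sketch of graph creation and a built-in algorithm
(sc is the SparkContext provided by the shell; the user names, the follow edges, and the 0.001 tolerance are
made up).

CODE: import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Vertex RDD: (id, property) pairs; edge RDD: Edge(srcId, dstId, property).
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))

// Build a property graph from the vertex and edge RDDs.
val graph = Graph(users, follows)

// A built-in algorithm: PageRank until the ranks change by less than 0.001.
val ranks = graph.pageRank(0.001).vertices

// Join the ranks back to the user names and print them on the driver.
val ranksByUser = users.join(ranks).map { case (_, (name, rank)) => (name, rank) }
ranksByUser.collect().foreach { case (name, rank) => println(f"$name%-6s $rank%.3f") }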

5. Features of Graphx

Apache Spark GraphX Features


Apache Spark GraphX provides the following features.

1. Flexibility
Apache Spark GraphX is capable of working with graphs and performing
computations on them. Spark GraphX can be used for ETL processing,
iterative graph computation, exploratory analysis, and so on. The data can
be viewed as a collection as well as a graph, and the transformation and
joining of that data can be performed efficiently with Spark RDDs.

2. Speed
Apache Spark GraphX provides better performance compared to the fastest
graph systems, and since it works with Spark it adopts by default the
features of Apache Spark such as fault tolerance, flexibility, ease of use, and
so on.

3. Algorithms
Apache Spark GraphX provides the following graph algorithms.

o PageRank
o Connected Components
o Label Propagation
o SVD++
o Strongly Connected Components
o Triangle Count

6. Basic path analytics algorithm with Graphx


7. Implement Dijkstra Algorithm with GraphX
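A minimal sketch covering both of the sections above: single-source shortest paths (a basic path-analytics
algorithm, Dijkstra-like for non-negative edge weights) expressed with GraphX's Pregel operator. This is a
spark-shell style sketch (sc provided by the shell); the edges, weights, and source vertex 1 are made up.

CODE: import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// A small, made-up weighted graph.
val edges: RDD[Edge[Double]] = sc.parallelize(Seq(
  Edge(1L, 2L, 4.0), Edge(1L, 3L, 1.0), Edge(3L, 2L, 2.0), Edge(2L, 4L, 5.0)))
val graph: Graph[Int, Double] = Graph.fromEdges(edges, defaultValue = 0)

val sourceId: VertexId = 1L

// Initialise every vertex with distance 0 for the source and infinity elsewhere.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

// Pregel iterations: each vertex keeps the minimum distance it has seen,
// and propagates an improved distance along its outgoing edges.
val shortestPaths = initialGraph.pregel(Double.PositiveInfinity)(
  (_, dist, newDist) => math.min(dist, newDist),            // vertex program
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,                                     // send improved distances
  (a, b) => math.min(a, b)                                   // merge incoming messages
)

shortestPaths.vertices.collect().foreach { case (id, dist) =>
  println(s"vertex $id -> distance $dist")
}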
8. An Introduction to Data Visualization
Data visualization is a graphical representation of quantitative information and data
by using visual elements like graphs, charts, and maps.

Data visualization converts large and small data sets into visuals, which are easy for
humans to understand and process.

Data visualization tools provide accessible ways to understand outliers, patterns, and
trends in the data.

In the world of Big Data, the data visualization tools and technologies are required to
analyze vast amounts of information.
Data visualizations are common in your everyday life, and they usually appear in the
form of graphs and charts. A combination of multiple visualizations and bits of
information is referred to as an infographic.

Data visualizations are used to discover unknown facts and trends. You can see
visualizations in the form of line charts to display change over time. Bar and column
charts are useful for observing relationships and making comparisons. A pie chart is a
great way to show parts-of-a-whole. And maps are the best way to share
geographical data visually.

What makes Data Visualization Effective?


Effective data visualization is created where communication, data science, and design
collide. Done right, data visualizations turn key insights from complicated data sets into
something meaningful and natural.

American statistician and Yale professor Edward Tufte believes useful data
visualizations consist of "complex ideas communicated with clarity, precision, and
efficiency".

To craft an effective data visualization, you need to start with clean data that is well-
sourced and complete. After the data is ready to visualize, you need to pick the right
chart.

Why Use Data Visualization?


1. To make data easier to understand and remember.
2. To discover unknown facts, outliers, and trends.
3. To visualize relationships and patterns quickly.
4. To ask better questions and make better decisions.
5. To analyze competitors.
6. To improve insights.

9. Various BI tools

The 8 best business intelligence tools

 datapine for all-in-one functionality


 Zoho Analytics for solopreneurs
 Domo for easy and flexible data management
 Microsoft Power BI for Microsoft users
 SAS Viya for automated data visualization
 IBM Cognos Analytics for AI functionality
 Tableau for team collaboration
 Metabase for ease of use

10. Data Visualization with Tableau


https://www.geeksforgeeks.org/what-is-tableau-and-its-importance-in-data-visualization/

11. Passing functions to Spark
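
A minimal spark-shell style sketch of the usual ways to pass functions to Spark operations (sc is provided
by the shell; the sample lines and the TextFunctions object are made up for the example).

CODE: val lines = sc.parallelize(Seq("spark makes this easy", "functions travel to executors"))

// 1. Anonymous function literals, handy for short pieces of code.
val lengths = lines.map(line => line.length)

// 2. Methods in a global singleton object: only a reference to the method is
//    shipped to the executors, not a whole enclosing instance.
object TextFunctions {
  def firstWord(line: String): String = line.split(" ").head
}
val firstWords = lines.map(TextFunctions.firstWord)

// Caution: passing a method of a class instance (or referring to its fields)
// forces Spark to serialize the whole enclosing object; copy the needed field
// into a local variable first to keep the shipped closure small.
println(lengths.collect().mkString(", "))
println(firstWords.collect().mkString(", "))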
