Q.1] Explain the CAP theorem?

Ans:- The CAP theorem, originally introduced as the CAP principle, can be used to explain some of the competing requirements in a distributed system with replication. It is a tool used to make system designers aware of the trade-offs while designing networked shared-data systems.
The three letters in CAP refer to three desirable properties of distributed systems with replicated data: consistency (among replicated copies), availability (of the system for read and write operations), and partition tolerance (in the face of the nodes in the system being partitioned by a network fault).
Consistency –
Consistency means that the nodes will have the same copies of a replicated data item visible for various transactions: a guarantee that every node in a distributed cluster returns the same, most recent, successful write. Consistency refers to every client having the same view of the data. There are various consistency models.
Availability –
Availability means that each read or write request for a data item will either be processed successfully or will receive a message that the operation cannot be completed. Every non-failing node returns a response for all read and write requests in a reasonable amount of time. The key word here is "every".
Partition Tolerance –
Partition tolerance means that the system can continue operating even if the network connecting the nodes has a fault that results in two or more partitions, where the nodes in each partition can only communicate among each other. That means the system continues to function and upholds its consistency guarantees in spite of network partitions. Network partitions are a fact of life.

Q.2] What is YARN? Explain it.

Ans:- YARN (Yet Another Resource Negotiator) is a core component of Apache Hadoop, designed for resource management and job scheduling in a distributed computing environment. For a big data analyst, understanding YARN's architecture is vital, as it orchestrates resource allocation and enables efficient data processing.
Components of YARN:
1.) ResourceManager (RM):
i.) The central authority overseeing resource allocation in the cluster.
ii.) Consists of two main components:
a.) Scheduler: allocates resources to the various applications based on predefined policies.
b.) ApplicationsManager: manages the lifecycle of application masters.
2.) NodeManager (NM):
i.) Runs on each node in the cluster and manages the resources available on that node.
ii.) Monitors resource utilization (CPU, memory, etc.) and reports it to the ResourceManager.
3.) ApplicationMaster (AM):
i.) Manages the execution of a specific application within the cluster.
ii.) Negotiates resources with the ResourceManager via the ApplicationsManager.
Workflow:
a.) Submit application: a user submits a job/application to the ResourceManager.
b.) Application execution: the ApplicationMaster coordinates with the ResourceManager to acquire the resources needed for execution.
c.) Task execution: NodeManagers create containers and execute the application's tasks within these containers.
d.) Resource deallocation: once tasks are completed, resources are released and made available for other applications.
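The CAP trade-off during a network partition can be sketched with a toy example. All classes here are hypothetical stand-ins, not a real database: a CP-style store sacrifices availability by rejecting writes it cannot replicate everywhere, while an AP-style store stays available but lets replicas diverge.

```python
# Hypothetical illustration of the CAP trade-off: two replicas of a key-value
# store behave differently when a network partition cuts them apart.

class Replica:
    def __init__(self):
        self.data = {}

class CPStore:
    """Consistent under partition: refuse writes that cannot reach all replicas."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.partitioned = False

    def write(self, key, value):
        if self.partitioned:
            # Sacrifices availability: the client gets an error, not stale data.
            raise RuntimeError("operation cannot be completed")
        for r in self.replicas:
            r.data[key] = value

class APStore:
    """Available under partition: accept writes locally; replicas may diverge."""
    def __init__(self, replicas):
        self.replicas = replicas
        self.partitioned = False

    def write(self, key, value):
        # During a partition only the reachable replica is updated,
        # sacrificing consistency until the partition heals.
        targets = self.replicas[:1] if self.partitioned else self.replicas
        for r in targets:
            r.data[key] = value

r1, r2 = Replica(), Replica()
cp = CPStore([r1, r2])
cp.partitioned = True
try:
    cp.write("x", 1)
except RuntimeError as e:
    print("CP store:", e)

r3, r4 = Replica(), Replica()
ap = APStore([r3, r4])
ap.partitioned = True
ap.write("x", 1)
print("AP store replicas diverged:", r3.data != r4.data)
```

Real systems make this choice per operation (e.g. quorum settings), but the toy captures why, under a partition, a system cannot offer both full consistency and full availability.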
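The YARN workflow steps above can be sketched as a toy simulation. These classes are illustrative stand-ins, not the real Hadoop APIs: the ResourceManager's scheduler finds a NodeManager with spare capacity, the ApplicationMaster negotiates a container per task, and resources are released when each task finishes.

```python
# Minimal sketch of the YARN submit -> execute -> deallocate workflow
# (hypothetical classes, not Hadoop's actual interfaces).

class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb          # capacity reported to the RM

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        """Scheduler role: reserve a container on a node with enough memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                return node
        return None

    def release(self, node, memory_mb):
        """Resource deallocation once a task completes."""
        node.free_mb += memory_mb

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run(self, tasks, memory_mb=512):
        results = []
        for task in tasks:
            node = self.rm.allocate(memory_mb)   # negotiate a container
            if node is None:
                raise RuntimeError("cluster out of resources")
            results.append((task, node.name))    # "execute" the task in it
            self.rm.release(node, memory_mb)     # free the container
        return results

rm = ResourceManager([NodeManager("node1", 1024), NodeManager("node2", 512)])
am = ApplicationMaster(rm)                        # application submitted
print(am.run(["map-0", "map-1", "reduce-0"]))
```

The real RM additionally enforces scheduling policies (capacity, fairness) and the AM retries failed containers, but the allocate/execute/release loop is the core of the workflow described above.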
Q.3] Real-time Streaming Platforms for Big Data?

Ans:- Real-time analytics can keep you up to date on what's happening right now, such as how many people are currently reading your new blog post and whether someone just liked your latest Facebook status. For most use cases, real-time is a nice-to-have feature that won't provide any crucial insights; however, sometimes real-time is a must.
a.) Apache Flink:- Apache Flink is an open-source streaming platform that is extremely fast at complex stream processing. It can process live streams within milliseconds because it can be programmed to process only new, changed data as it flows through large volumes of data in real time. In this way, Flink enables both batch and stream processing at large scale to offer real-time insights, which is why the platform is known for low latency and high performance.
b.) Apache Spark:- Another open-source data processing framework known for its speed and ease of use is Spark. It runs in-memory (in RAM) on clusters and isn't tied to Hadoop's two-stage MapReduce paradigm, which contributes to its fast performance for big data processing.
c.) Apache Samza:- Samza is an open-source distributed stream-processing framework that lets users build applications that process big data in real time from several sources. It is based on Apache Kafka and YARN, but it can also run as a standalone library. LinkedIn originally developed Samza, but since then other big brands have adopted it, such as eBay, Slack, Redfin, Optimizely, and TripAdvisor.

Q.4] What is Streaming Data Architecture?

Ans:- a.) A streaming data architecture is a framework of software components built to ingest, process, and analyze data streams, typically in real time or near-real time. Rather than writing and reading data in batches, a streaming data architecture consumes data immediately as it is generated, persists it to storage, and may include various additional components per use case, such as tools for real-time processing, data manipulation, and analytics.
b.) Streaming architectures must account for the unique characteristics of data streams, which tend to generate massive amounts of data (terabytes to petabytes) that is at best semi-structured and requires significant pre-processing and ETL to become useful.
c.) Stream processing is a complex challenge rarely solved with a single database or ETL tool, hence the need to "architect" a solution consisting of multiple building blocks. Part of the thinking behind Upsolver SQLake is to replace these point products with an integrated platform that delivers self-orchestrating, declarative data pipelines.
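The core idea these platforms share can be shown without any framework: process each event the moment it arrives and update a running aggregate, instead of collecting a batch first. This is a plain-Python sketch of that idea; real Flink, Spark, or Samza jobs layer distribution, fault tolerance, and state backends on top of it.

```python
# Framework-free sketch of per-event stream aggregation: each incoming event
# updates a running count per key, and the result is available immediately.

from collections import defaultdict

def stream_counts(events):
    """Consume (user, action) events one at a time, yielding updated counts."""
    counts = defaultdict(int)
    for user, action in events:
        counts[action] += 1            # per-event (low-latency) update
        yield action, counts[action]   # emitted as soon as the event arrives

events = [("alice", "like"), ("bob", "read"), ("carol", "like")]
for action, count in stream_counts(events):
    print(f"{action}: {count}")
```

Because the generator yields after every event, a dashboard consuming it would see each count the moment the underlying event occurs, which is exactly the "real-time insights" property the platforms above provide at scale.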
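The ingest → process → persist chain that defines a streaming architecture can be sketched as composed Python generators. Each stage here is a hypothetical stand-in for a real building block (e.g. Kafka for ingestion, a stream processor for ETL, object storage for persistence):

```python
# Toy streaming pipeline: ingest semi-structured records, clean them (ETL),
# and persist each one immediately rather than in batches.

import json

def ingest(raw_lines):
    """Ingestion stage: parse semi-structured input as it arrives."""
    for line in raw_lines:
        yield json.loads(line)

def transform(records):
    """Processing/ETL stage: normalize each record before storage."""
    for rec in records:
        rec["value"] = float(rec["value"])   # coerce string measurements
        yield rec

def persist(records, store):
    """Persistence stage: append each record to storage as it flows through."""
    for rec in records:
        store.append(rec)

store = []
raw = ['{"sensor": "s1", "value": "3.5"}', '{"sensor": "s2", "value": "4"}']
persist(transform(ingest(raw)), store)
print(store)
```

Because the stages are generators, a record moves through the whole chain as soon as it is ingested; swapping any stage for a distributed component (a message broker, a stream processor, a data lake) preserves the same architecture.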
Q.3] Real-time Streaming Platforms for Big Q.4] What is Streaming Data Architecture?
Data ? Ans:-a.)A streaming data architecture is a
Ans:- Real-time analytics can keep you up-to- framework of software components built to
date on what’s happening right now, such as ingest, process, and analyze data streams –
how many people are currently reading your typically in real time or near-real time. Rather
new blog post and whether someone just than writing and reading data in batches, a
liked your latest Facebook status. For most streaming data architecture consumes data
use cases, real-time is a nice-to-have feature immediately as it is generated, persists it to
that won’t provide any crucial insights. storage, and may include various additional
However, sometimes real-time is a must. components per use case – such as tools for
a.)Apache Flink:-Apache Flink is an open- real-time processing, data manipulation, and
source streaming platform that’s extremely analytics.
fast at complex stream processing. In fact, it’s b.)Streaming architectures must account for
able to process live streams within the unique characteristics of data streams,
milliseconds because it can be programmed which tend to generate massive amounts of
to only process new, changed data as it goes data (terabytes to petabytes) that it is at best
through rows of big data in real-time. In this semi-structured and requires significant pre-
way, Flink easily enables the execution of processing and ETL to become useful.
batch and stream processing at a large scale c.)Stream processing is a complex challenge
to offer real-time insights, so it’s no wonder rarely solved with a single database or ETL
this platform is known for offering low latency tool – hence the need to “architect” a
and high performance. solution consisting of multiple building blocks.
b.)Apache Spark:-Another open-source data Part of the thinking behind Upsolver SQLake
processing framework that’s known for its is to replace these point products with an
speed and ease of use is Spark. This platform integrated platform that delivers with self-
runs in-memory on RAM on clusters and isn’t orchestrating declarative data pipelines. We’ll
tied to Hadoop’s MapReduce two-stage demonstrate how this approach manifests
paradigm, which adds to its lightning-fast within each part of the streaming data supply
performance when it comes to big data chain.
processing. c.)Apache Samza:-Samza is an
open-source distributed stream-processing
framework that lets users build applications
that can process big data in real-time from
several sources. It’s based on Apache Kafka
and YARN, but it can also run as a standalone
library. LinkedIn originally developed Samza,
but since then, other big brands have started
using it—such as eBay, Slack, Redfin,
Optimizely, and TripAdvisor.
Q.3] Real-time Streaming Platforms for Big Q.4] What is Streaming Data Architecture?
Data ? Ans:-a.)A streaming data architecture is a
Ans:- Real-time analytics can keep you up-to- framework of software components built to
date on what’s happening right now, such as ingest, process, and analyze data streams –
how many people are currently reading your typically in real time or near-real time. Rather
new blog post and whether someone just than writing and reading data in batches, a
liked your latest Facebook status. For most streaming data architecture consumes data
use cases, real-time is a nice-to-have feature immediately as it is generated, persists it to
that won’t provide any crucial insights. storage, and may include various additional
However, sometimes real-time is a must. components per use case – such as tools for
a.)Apache Flink:-Apache Flink is an open- real-time processing, data manipulation, and
source streaming platform that’s extremely analytics.
fast at complex stream processing. In fact, it’s b.)Streaming architectures must account for
able to process live streams within the unique characteristics of data streams,
milliseconds because it can be programmed which tend to generate massive amounts of
to only process new, changed data as it goes data (terabytes to petabytes) that it is at best
through rows of big data in real-time. In this semi-structured and requires significant pre-
way, Flink easily enables the execution of processing and ETL to become useful.
batch and stream processing at a large scale c.)Stream processing is a complex challenge
to offer real-time insights, so it’s no wonder rarely solved with a single database or ETL
this platform is known for offering low latency tool – hence the need to “architect” a
and high performance. solution consisting of multiple building blocks.
b.)Apache Spark:-Another open-source data Part of the thinking behind Upsolver SQLake
processing framework that’s known for its is to replace these point products with an
speed and ease of use is Spark. This platform integrated platform that delivers with self-
runs in-memory on RAM on clusters and isn’t orchestrating declarative data pipelines. We’ll
tied to Hadoop’s MapReduce two-stage demonstrate how this approach manifests
paradigm, which adds to its lightning-fast within each part of the streaming data supply
performance when it comes to big data chain.
processing. c.)Apache Samza:-Samza is an
open-source distributed stream-processing
framework that lets users build applications
that can process big data in real-time from
several sources. It’s based on Apache Kafka
and YARN, but it can also run as a standalone
library. LinkedIn originally developed Samza,
but since then, other big brands have started
using it—such as eBay, Slack, Redfin,
Optimizely, and TripAdvisor.
Q.3] Real-time Streaming Platforms for Big Q.4] What is Streaming Data Architecture?
Data ? Ans:-a.)A streaming data architecture is a
Ans:- Real-time analytics can keep you up-to- framework of software components built to
date on what’s happening right now, such as ingest, process, and analyze data streams –
how many people are currently reading your typically in real time or near-real time. Rather
new blog post and whether someone just than writing and reading data in batches, a
liked your latest Facebook status. For most streaming data architecture consumes data
use cases, real-time is a nice-to-have feature immediately as it is generated, persists it to
that won’t provide any crucial insights. storage, and may include various additional
However, sometimes real-time is a must. components per use case – such as tools for
a.)Apache Flink:-Apache Flink is an open- real-time processing, data manipulation, and
source streaming platform that’s extremely analytics.
fast at complex stream processing. In fact, it’s b.)Streaming architectures must account for
able to process live streams within the unique characteristics of data streams,
milliseconds because it can be programmed which tend to generate massive amounts of
to only process new, changed data as it goes data (terabytes to petabytes) that it is at best
through rows of big data in real-time. In this semi-structured and requires significant pre-
way, Flink easily enables the execution of processing and ETL to become useful.
batch and stream processing at a large scale c.)Stream processing is a complex challenge
to offer real-time insights, so it’s no wonder rarely solved with a single database or ETL
this platform is known for offering low latency tool – hence the need to “architect” a
and high performance. solution consisting of multiple building blocks.
b.)Apache Spark:-Another open-source data Part of the thinking behind Upsolver SQLake
processing framework that’s known for its is to replace these point products with an
speed and ease of use is Spark. This platform integrated platform that delivers with self-
runs in-memory on RAM on clusters and isn’t orchestrating declarative data pipelines. We’ll
tied to Hadoop’s MapReduce two-stage demonstrate how this approach manifests
paradigm, which adds to its lightning-fast within each part of the streaming data supply
performance when it comes to big data chain.
processing. c.)Apache Samza:-Samza is an
open-source distributed stream-processing
framework that lets users build applications
that can process big data in real-time from
several sources. It’s based on Apache Kafka
and YARN, but it can also run as a standalone
library. LinkedIn originally developed Samza,
but since then, other big brands have started
using it—such as eBay, Slack, Redfin,
Optimizely, and TripAdvisor.
Q.3] Real-time Streaming Platforms for Big Data ?
Ans:- Real-time analytics can keep you up-to-date on what's happening right now, such as how many people are currently reading your new blog post and whether someone just liked your latest Facebook status. For most use cases, real-time is a nice-to-have feature that won't provide any crucial insights. However, sometimes real-time is a must.
a.)Apache Flink:- Apache Flink is an open-source streaming platform that is extremely fast at complex stream processing. It can process live streams within milliseconds because it can be programmed to process only new, changed data as it moves through rows of big data in real time. In this way, Flink supports both batch and stream processing at large scale and delivers real-time insights, which is why the platform is known for low latency and high performance.
b.)Apache Spark:- Another open-source data processing framework known for its speed and ease of use is Spark. It runs in-memory on RAM across clusters and is not tied to Hadoop's two-stage MapReduce paradigm, which contributes to its fast performance on big data workloads.
c.)Apache Samza:- Samza is an open-source distributed stream-processing framework that lets users build applications that process big data in real time from several sources. It is based on Apache Kafka and YARN, but it can also run as a standalone library. LinkedIn originally developed Samza; since then, other big brands such as eBay, Slack, Redfin, Optimizely, and TripAdvisor have adopted it.

Q.4] What is Streaming Data Architecture?
Ans:- a.)A streaming data architecture is a framework of software components built to ingest, process, and analyze data streams, typically in real time or near real time. Rather than writing and reading data in batches, a streaming data architecture consumes data immediately as it is generated, persists it to storage, and may include additional components per use case, such as tools for real-time processing, data manipulation, and analytics.
b.)Streaming architectures must account for the unique characteristics of data streams, which tend to generate massive amounts of data (terabytes to petabytes) that is at best semi-structured and requires significant pre-processing and ETL to become useful.
c.)Stream processing is a complex challenge rarely solved with a single database or ETL tool, hence the need to "architect" a solution consisting of multiple building blocks. Part of the thinking behind Upsolver SQLake is to replace these point products with an integrated platform that delivers self-orchestrating declarative data pipelines. We'll demonstrate how this approach manifests within each part of the streaming data supply chain.
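The low-latency, window-based processing model that platforms like Flink offer (Q.3) can be illustrated with a minimal pure-Python sketch of a tumbling-window count. The function and event fields below are invented for illustration and do not reflect Flink's actual API.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Assign each (timestamp_ms, key) event to a fixed-size, non-overlapping
    window and count events per (window_start, key)."""
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)

# Three page-view events: two fall in the 0-999 ms window, one in 1000-1999 ms.
events = [(100, "blog"), (950, "blog"), (1200, "status")]
print(tumbling_window_counts(events, window_ms=1000))
# {(0, 'blog'): 2, (1000, 'status'): 1}
```

A real stream processor would additionally handle out-of-order events, watermarks, and incremental emission of window results, but the windowing arithmetic is the same.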
