
239. What do you mean by Big data analytics?

Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that
include structured, semi-structured and unstructured data, from different sources, and in different sizes
from terabytes to zettabytes.

240. List various applications of big data. *

Applications of Big Data

In today's world, huge amounts of data are generated, and big companies utilize those data for their business growth. By analyzing these data, useful decisions can be made in the various cases discussed below:
1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (like Amazon, Walmart, Big Bazaar, etc.), the management team has to keep data on customers' spending habits (which products they spend on, which brands they prefer, how frequently they spend), their shopping behavior, and their most-liked products (so that those products can be kept in the store). Based on which products are searched for and sold the most, the production/procurement rate of those products is fixed.
The banking sector uses its customers' spending-behavior data to offer a particular customer a discount or cashback for buying a product he likes using the bank's credit or debit card. In this way, the right offer can be sent to the right person at the right time.
2. Recommendation: By tracking customers' spending habits and shopping behavior, big retail stores provide recommendations to customers. E-commerce sites like Amazon, Walmart, and Flipkart do product recommendation: they track what products a customer searches for and, based on that data, recommend that type of product to the customer.
As an example, suppose a customer searches for bed covers on Amazon. Amazon then has data indicating that this customer may be interested in buying a bed cover, so the next time the customer visits a Google page, advertisements for various bed covers will be shown. Thus, an advertisement for the right product can be sent to the right customer.
YouTube also recommends videos based on the type of videos the user has previously liked or watched, and shows relevant advertisements during a video based on its content. As an example, if someone is watching a tutorial video on big data, an advertisement for another big data course will be shown during that video.
3. Smart Traffic System: Data about traffic conditions on different roads is collected through cameras placed beside the road and at the entry and exit points of the city, and through GPS devices placed in vehicles (Ola and Uber cabs, etc.). All such data are analyzed, and jam-free or less-congested, less time-consuming routes are recommended. In this way, a smart traffic system can be built in a city using big data analysis. An additional benefit is that fuel consumption can be reduced.
4. Secure Air Traffic System: Sensors are present at various parts of an aircraft (such as the propellers). These sensors capture data like flight speed, moisture, temperature, and other environmental conditions. Based on the analysis of such data, environmental parameters within the aircraft are set and adjusted.
By analyzing the aircraft's machine-generated data, it can be estimated how long the machine can operate flawlessly and when it should be replaced or repaired.
5. Auto-Driving Cars: Big data analysis helps a car drive without human intervention. Cameras and sensors placed at various spots on the car gather data such as the size of surrounding vehicles and obstacles and the distance to them. These data are analyzed, and various calculations, such as how many degrees to turn, what the speed should be, and when to stop, are carried out. These calculations help the car take action automatically.
6. Virtual Personal Assistant Tools: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, and Google Assistant on Android) answer the various questions asked by users. Such a tool tracks the user's location, local time, season, and other data related to the question asked, and analyzes all these data to provide an answer.
As an example, suppose a user asks "Do I need to take an umbrella?". The tool collects data such as the user's location and the season and weather conditions at that location, analyzes these data to determine whether there is a chance of rain, and then provides the answer.
7. IoT:
 Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will work without any problem and when it will require repair, so that the company can take action before the machine faces a lot of issues or breaks down entirely. Thus, the cost of replacing the whole machine can be saved.
 In the healthcare field, big data is making a significant contribution. Using big data tools, data regarding patient experience is collected and used by doctors to give better treatment. IoT devices can sense the symptoms of a probable coming disease in the human body and prevent it by prompting treatment in advance. IoT sensors placed near patients and newborn babies constantly keep track of various health conditions, like heart rate and blood pressure. Whenever any parameter crosses the safe limit, an alarm is sent to a doctor, so that they can take steps remotely very quickly.
8. Education Sector: Organizations conducting online educational courses utilize big data to find candidates interested in a course. If someone searches for a YouTube tutorial video on a subject, then online or offline course providers for that subject send online ads about their course to that person.
9. Energy Sector: Smart electric meters read the power consumed every 15 minutes and send the readings to a server, where the data is analyzed to estimate at what times of day the power load across the city is lowest. Based on this, manufacturing units and householders are advised to run their heavy machines at night, when the power load is lower, in order to enjoy a lower electricity bill.
10. Media and Entertainment Sector: Media and entertainment service providers like Netflix, Amazon Prime, and Spotify analyze the data collected from their users. Data such as what types of videos or music users watch or listen to most, and how long users spend on the site, are collected and analyzed to set the next business strategy.

241. What is Hadoop Ecosystem? *

Hadoop Ecosystem

Overview: Apache Hadoop is an open-source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets that can't be processed efficiently with traditional methodology such as an RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables the processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform or a suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common. Most of the tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as absorption, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: programming-based data processing
 Spark: in-memory data processing
 PIG, HIVE: query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: machine learning algorithm libraries
 Solr, Lucene: searching and indexing
 Zookeeper: managing the cluster
 Oozie: job scheduling
Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one thing: data. That is the beauty of Hadoop: it revolves around data, which makes processing it easier.
HDFS:
 HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.
 HDFS consists of two core components, i.e.,
1. Name Node
2. Data Node
 The Name Node is the prime node and contains metadata (data about data), requiring comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
 HDFS maintains all the coordination between the clusters and hardware, thus working at the heart of the system.
YARN:
 Yet Another Resource Negotiator, as the name implies, helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
 It consists of three major components, i.e.,
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine, later acknowledging the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
 By making use of distributed and parallel algorithms, MapReduce makes it possible to carry over the processing logic and helps to write applications which transform big data sets into manageable ones.
 MapReduce makes use of two functions, i.e., Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data, thereby organizing it into groups. Map generates key-value pair based results which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
PIG:
Pig was developed by Yahoo and works on Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring the data flow and for processing and analyzing huge data sets.
 Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.
HIVE:
 With the help of SQL methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
 It is highly scalable, as it allows both real-time and batch processing. Also, all SQL datatypes are supported by Hive, making query processing easier.
 Similar to other query-processing frameworks, HIVE comes with two components: JDBC Drivers and the HIVE Command Line.
 JDBC, along with ODBC drivers, works on establishing data-storage permissions and connections, whereas the HIVE Command Line helps in the processing of queries.
Mahout:
 Mahout brings machine learnability to a system or application. Machine learning, as the name suggests, helps a system develop itself based on patterns, user/environmental interaction, or algorithms.
 It provides various libraries and functionalities such as collaborative filtering, clustering, and classification, which are all concepts of machine learning. It allows invoking algorithms as per our need with the help of its own libraries.
Apache Spark:
 It's a platform that handles all the process-intensive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
 It consumes in-memory resources, making it faster than the prior model in terms of optimization.
 Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used interchangeably in most companies.
Apache HBase:
 It's a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable and is thus able to work on big data sets effectively.
 At times when we need to search for or retrieve a few occurrences of something small in a huge database, the request must be processed within a short, quick span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.
Other Components: Apart from all of these, there are some other components that carry out huge tasks in order to make Hadoop capable of processing large datasets. They are as follows:
 Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene is a Java library that also provides a spell-check mechanism; Solr is a search platform built on top of Lucene.
 Zookeeper: There was a huge issue with the management of coordination and synchronization among the resources and components of Hadoop, which often resulted in inconsistency. Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.
 Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs need to be executed in a sequential order, whereas Oozie coordinator jobs are triggered when some data or an external stimulus is given to them.
242. What do you mean by MapReduce? *
MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster. As the processing component, MapReduce is the heart of Apache Hadoop. The term "MapReduce" refers to the two separate and distinct tasks that Hadoop programs perform: the Map task, which converts input data into intermediate key-value pairs, and the Reduce task, which aggregates those pairs into a smaller set of results.
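
A minimal word-count sketch of the idea in Python (purely illustrative; map_phase and reduce_phase are hypothetical names, not part of the Hadoop API):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for every word in the input.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: aggregate the values for each key into a single count.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["Big data needs big tools", "Hadoop processes big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'big': 3, 'data': 2, 'needs': 1, ...}
```

In Hadoop itself, many map tasks and reduce tasks of this shape run in parallel across the cluster's nodes.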
243. What do you mean by Hadoop? *
Hadoop is an open-source software framework for storing data and running applications on
clusters of commodity hardware. It provides massive storage for any kind of data, enormous
processing power and the ability to handle virtually limitless concurrent tasks or jobs.
244. Write down the names of some components of Hadoop. *

Components of the Hadoop Ecosystem:
1. HDFS (Hadoop Distributed File System)
2. MapReduce
3. YARN
4. HBase
5. Pig
6. Hive
7. Sqoop
8. Flume
9. Kafka
10. Zookeeper
11. Spark

245. What is Hadoop ecosystem? *

The Hadoop ecosystem is a platform or suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e., HDFS, MapReduce, YARN, and Hadoop Common; most of the other tools in the ecosystem (Spark, Pig, Hive, HBase, Mahout, Solr/Lucene, Zookeeper, Oozie, etc.) are used to supplement or support these major elements. See the answer to Question 241 for the full component-by-component description.

246. Write down the features of Hadoop. *

Hadoop – Features of Hadoop Which Make It Popular
Today, tons of companies are adopting Hadoop big data tools to solve their big data queries and serve their customer market segments. There are also lots of other tools available in the market, like HPCC (developed by LexisNexis Risk Solutions), Storm, Qubole, Cassandra, Statwing, CouchDB, Pentaho, OpenRefine, Flink, etc. Why, then, is Hadoop so popular among all of them? Here we will discuss some top essential, industry-ready features that make Hadoop so popular and an industry favorite.
Hadoop is a framework written in Java, with some code in C and shell script, that works over a collection of simple commodity hardware to deal with large datasets using a very basic programming model. It was developed by Doug Cutting and Mike Cafarella and now comes under the Apache License 2.0. Hadoop is now considered a must-learn skill for data scientists and big data technology. Companies are investing big in it, and it will become an in-demand skill in the future. Hadoop 3.x is the latest version of Hadoop. Hadoop mainly consists of 3 components:
1. HDFS (Hadoop Distributed File System): HDFS works as the storage layer of Hadoop. Data is always stored in the form of data blocks on HDFS, where the default size of each data block is 128 MB, which is configurable. Hadoop works on the MapReduce algorithm, which is a master-slave architecture; HDFS has a NameNode and DataNodes that work in a similar pattern.
2. MapReduce: MapReduce works as the processing layer of Hadoop. MapReduce is a programming model that is mainly divided into two phases, the Map phase and the Reduce phase. It is designed for processing data in parallel, divided across various machines (nodes).
3. YARN (Yet Another Resource Negotiator): YARN is the job-scheduling and resource-management layer of Hadoop. The data stored on HDFS is processed and run with the help of data-processing engines such as graph processing, interactive processing, and batch processing. The overall performance of Hadoop is improved with the help of the YARN framework.
Features of Hadoop Which Make It Popular
Let's discuss the key features which make Hadoop more reliable to use, an industry favorite, and the most powerful big data tool.
1. Open Source:
Hadoop is open-source, which means it is free to use. Since it is an open-source project, the source code is available online for anyone to understand or modify as per their industry requirements.
2. Highly Scalable Cluster:
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. In a traditional RDBMS (Relational Database Management System), the system cannot be scaled to approach large amounts of data.
3. Fault Tolerance is Available:
Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data even if one of the systems crashes. If the machine you are reading data from faces a technical issue, the data can be read from other nodes in the cluster, because the data is copied or replicated by default. By default, Hadoop makes 3 copies of each file block and stores them on different nodes. This replication factor is configurable and can be changed via the replication property in the hdfs-site.xml file.
4. High Availability is Provided:
Fault tolerance provides high availability in the Hadoop cluster. High availability means the availability of data on the Hadoop cluster. Due to fault tolerance, if any DataNode goes down, the same data can be retrieved from any other node where the data is replicated. A highly available Hadoop cluster also has 2 or more NameNodes, i.e., an active NameNode and a passive NameNode (also known as a standby NameNode). If the active NameNode fails, the passive node takes over the responsibility of the active node and provides the same data, which can easily be utilized by the user.
5. Cost-Effective:
Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with big data. The problem with traditional relational databases is that storing massive volumes of data is not cost-effective, so companies started to remove raw data, which may not reflect the correct scenario of their business. Hadoop thus provides two main cost benefits: it is open-source and free to use, and it runs on commodity hardware, which is also inexpensive.
6. Hadoop Provides Flexibility:
Hadoop is designed in such a way that it can deal with any kind of dataset, like structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos), very efficiently. This means it can easily process any kind of data independently of its structure, which makes it highly flexible. It is very useful for enterprises, as they can easily process large datasets, so businesses can use Hadoop to analyze valuable data insights from sources like social media, email, etc. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, etc.
7. Easy to Use:
Hadoop is easy to use, since developers need not worry about any of the processing work: it is managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with lots of tools like Hive, Pig, Spark, HBase, Mahout, etc.
8. Hadoop Uses Data Locality:
The concept of data locality is used to make Hadoop processing fast. In the data-locality concept, the computation logic is moved near the data rather than moving the data to the computation logic. Moving data around HDFS is the costliest operation, and with the help of the data-locality concept, the bandwidth utilization of the system is minimized.
9. Provides Faster Data Processing:
Hadoop uses a distributed file system to manage its storage, i.e., HDFS (Hadoop Distributed File System). In a DFS (Distributed File System), a large file is broken into small file blocks which are then distributed among the nodes available in a Hadoop cluster. This massive number of file blocks is processed in parallel, which makes Hadoop faster and gives it high-level performance compared to traditional database management systems.
247. What is Sqoop? *

Apache Sqoop is a big data tool for transferring data between Hadoop and relational database
servers. Sqoop is used to transfer data from RDBMS (relational database management system) like
MySQL and Oracle to HDFS (Hadoop Distributed File System).
248. What is Flume? *
Apache Flume is an open-source, powerful, reliable and flexible system used to collect, aggregate and move large amounts of unstructured data from multiple data sources into HDFS/HBase (for example) in a distributed fashion via its strong coupling with the Hadoop cluster.
249. State the difference between Sqoop and Flume. *

Both Flume and Sqoop are meant for data movement.

Sqoop and Flume both fulfill data-ingestion needs, but they serve different purposes. Apache Flume works well for streaming data sources that are generated continuously in a Hadoop environment, such as log files from multiple servers, whereas Apache Sqoop works well with any RDBMS that has JDBC connectivity.

Sqoop is meant for bulk data transfers between Hadoop and any other structured data stores. Flume collects log data from many sources, aggregates it, and writes it to HDFS.

Flume:
Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers and mobile devices, for example – to collect data and integrate it into Hadoop.

Flume helps to collect data from a variety of sources, like logs, JMS, directories, etc. Multiple Flume agents can be configured to collect high volumes of data. It scales horizontally.

Flume is the better choice when moving bulk streaming data from sources like JMS or spooling directories, whereas Sqoop is an ideal fit if the data sits in databases like Teradata, Oracle, MySQL Server, PostgreSQL or any other JDBC-compatible database.
Sqoop:
Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and instructs Sqoop to move data from Oracle, Teradata or other relational databases to the target.

Sqoop helps to move data between Hadoop and other databases, and it can transfer data in parallel for performance.

Apache Sqoop provides direct input, i.e., it can map relational databases and import directly into HBase and Hive.

Sqoop helps in mitigating excessive loads on external systems.

250. What is Oozie? *


Apache Oozie is used by Hadoop system administrators to run complex log analysis on HDFS.
Hadoop Developers use Oozie for performing ETL operations on data in a sequential order and saving
the output in a specified format (Avro, ORC, etc.) in HDFS. In an enterprise, Oozie jobs are scheduled
as coordinators or bundles.
251. What is HBase? *
HBase is a column-oriented non-relational database management system that runs on top of
Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data
sets, which are common in many big data use cases.
252. State the difference between Oozie and HBase. *
Oozie is a workflow scheduler for Hadoop: it schedules jobs and binds them together as a single logical unit of work, as Oozie workflows or coordinator jobs. HBase, by contrast, is a column-oriented NoSQL database that runs on top of HDFS and provides a fault-tolerant way of storing sparse data sets. In short, Oozie orchestrates when jobs run, while HBase stores and serves data.
253. What do you mean by Big Table? *
Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of
columns, enabling you to store terabytes or even petabytes of data. A single value in each row is
indexed; this value is known as the row key.
254. What do you mean by pattern recognition? *

Pattern Recognition | Introduction


Patterns are everywhere in this digital world. A pattern can either be seen physically or observed mathematically by applying algorithms.
Example: the colors on clothes, speech patterns, etc. In computer science, a pattern is represented using vector feature values.
What is Pattern Recognition?
Pattern recognition is the process of recognizing patterns by using a machine learning
algorithm. Pattern recognition can be defined as the classification of data based on
knowledge already gained or on statistical information extracted from patterns and/or their
representation. One of the important aspects of pattern recognition is its application
potential.
Examples: Speech recognition, speaker identification, multimedia document recognition
(MDR), automatic medical diagnosis.
In a typical pattern recognition application, the raw data is processed and converted into a form that is amenable for a machine to use. Pattern recognition involves the classification and clustering of patterns.
 In classification, an appropriate class label is assigned to a pattern based on an abstraction that is generated using a set of training patterns or domain knowledge. Classification is used in supervised learning.
 Clustering generates a partition of the data which helps in decision making, i.e., the specific decision-making activity of interest to us. Clustering is used in unsupervised learning.
Features may be represented as continuous, discrete, or discrete binary variables. A
feature is a function of one or more measurements, computed so that it quantifies some
significant characteristics of the object.
Example: if we consider a face, then the eyes, ears, nose, etc. are features of the face.
A set of features taken together forms a feature vector.
Example: In the above example of a face, if all the features (eyes, ears, nose, etc.) are taken together, then the sequence is a feature vector ([eyes, ears, nose]). The feature vector is the sequence of features represented as a d-dimensional column vector. In the case of speech, the MFCC (Mel-frequency Cepstral Coefficient) is the spectral feature of the speech, and the sequence of the first 13 features forms a feature vector.
Pattern recognition possesses the following features:
 A pattern recognition system should recognize familiar patterns quickly and accurately.
 It should recognize and classify unfamiliar objects.
 It should accurately recognize shapes and objects from different angles.
 It should identify patterns and objects even when partly hidden.
 It should recognize patterns quickly, with ease, and with automaticity.
Training and Learning in Pattern Recognition
Learning is a phenomenon through which a system gets trained and becomes adaptable enough to give results in an accurate manner. Learning is the most important phase, since how well the system performs depends on which algorithms are applied to the data. The entire dataset is divided into two categories: one used to train the model, i.e., the training set, and one used to test the model after training, i.e., the testing set (a code sketch follows the list below).
 Training set:
The training set is used to build the model. It consists of the set of images used to train the system. Training rules and algorithms give relevant information on how to associate input data with output decisions. The system is trained by applying these algorithms to the dataset, all the relevant information is extracted from the data, and results are obtained. Generally, 80% of the dataset is taken as training data.
 Testing set:
The testing set is used to test the system, i.e., to verify whether the system produces the correct output after being trained. Generally, 20% of the dataset is used for testing. Testing data is used to measure the accuracy of the system. For example, if a system that identifies which category a particular flower belongs to identifies seven out of ten categories correctly and the rest wrong, then the accuracy is 70%.
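
As a sketch of the 80/20 split and accuracy measurement described above (assuming scikit-learn is available; the iris dataset stands in for any pattern dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 80% of the data trains the model; the held-out 20% tests it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Accuracy: fraction of test patterns assigned the correct class label.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```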
Real-time Examples and Explanations:
A pattern is a physical object or an abstract notion. While talking about classes of animals, a description of an animal would be a pattern. While talking about various types of balls, a description of a ball is a pattern. In the case where balls are considered patterns, the classes could be football, cricket ball, table-tennis ball, etc. Given a new pattern, the class of the pattern is to be determined. The choice of attributes and the representation of patterns is a very important step in pattern classification. A good representation is one that makes use of discriminating attributes and also reduces the computational burden of pattern classification.
An obvious representation of a pattern is a vector. Each element of the vector can represent one attribute of the pattern. The first element of the vector contains the value of the first attribute for the pattern being considered.
Example: While representing spherical objects, (25, 1) may represent a spherical object with 25 units of weight and 1 unit of diameter. The class label can form a part of the vector. If spherical objects belong to class 1, the vector would be (25, 1, 1), where the first element represents the weight of the object, the second element the diameter of the object, and the third element the class of the object.
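
A minimal sketch of this vector representation, assuming NumPy, with a simple nearest-neighbour rule standing in for a real classifier (the numbers follow the spherical-object example above):

```python
import numpy as np

# Each row: (weight, diameter, class label), as in the example above.
patterns = np.array([
    [25.0, 1.0, 1],   # class 1: heavy, small spherical objects
    [24.0, 1.1, 1],
    [5.0,  4.0, 2],   # class 2: light, large spherical objects
    [6.0,  3.8, 2],
])

new_object = np.array([23.0, 1.2])  # feature vector of an unseen object

# Nearest-neighbour rule: assign the class of the closest stored pattern.
distances = np.linalg.norm(patterns[:, :2] - new_object, axis=1)
print("predicted class:", int(patterns[np.argmin(distances), 2]))  # 1
```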
Advantages:
 Pattern recognition solves classification problems.
 Pattern recognition solves the problem of fake biometric detection.
 It is useful for cloth pattern recognition for visually impaired people.
 It helps in speaker diarization.
 We can recognize particular objects from different angles.
Disadvantages:
 The syntactic pattern recognition approach is complex to implement, and it is a very slow process.
 Sometimes a larger dataset is required to get better accuracy.
 It cannot explain why a particular object is recognized.
Example: my face vs my friend’s face.
Applications:
 Image processing, segmentation, and analysis
Pattern recognition is used to give machines the human-like recognition intelligence required in image processing.
 Computer vision
Pattern recognition is used to extract meaningful features from given image/video samples and is used in computer vision for various applications like biological and biomedical imaging.
 Seismic analysis
The pattern recognition approach is used for the discovery, imaging, and interpretation of temporal patterns in seismic array recordings. Statistical pattern recognition is implemented and used in different types of seismic analysis models.
 Radar signal classification/analysis
Pattern recognition and signal processing methods are used in various applications of radar signal classification, like AP mine detection and identification.
 Speech recognition
The greatest success in speech recognition has been obtained using pattern recognition paradigms. Pattern recognition is used in various speech recognition algorithms which try to avoid the problems of using a phoneme-level description and treat larger units, such as words, as patterns.
 Fingerprint identification
Fingerprint recognition technology is the dominant technology in the biometric market. A number of recognition methods have been used to perform fingerprint matching, of which pattern recognition approaches are widely used.

255. What is classification? *


Classification is a data mining technique that classifies unstructured data into structured classes and groups; it helps users with knowledge discovery and future planning [3]. Classification provides intelligent decision making.
256. What do you mean by regression? *
Regression is a statistical method used in finance, investing, and other disciplines that attempts to
determine the strength and character of the relationship between one dependent variable (usually
denoted by Y) and a series of other variables (known as independent variables).
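
For instance, the strength and character of a linear relationship can be estimated with a least-squares fit; a small NumPy sketch with illustrative data:

```python
import numpy as np

# Illustrative data: X is the independent variable, Y the dependent one.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates of slope a and intercept b for Y = a*X + b.
a, b = np.polyfit(X, Y, deg=1)
print(f"Y = {a:.2f}*X + {b:.2f}")

# The fitted line can then predict Y for new values of X.
print("prediction at X=6:", a * 6 + b)
```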
257. What is the drawback of using Bayesian classifier? *

Disadvantages of Naive Bayes


 If your test data set has a categorical variable of a category that wasn’t present
in the training data set, the Naive Bayes model will assign it zero probability
and won’t be able to make any predictions in this regard. This phenomenon is
called ‘Zero Frequency,’ and you’ll have to use a smoothing technique to
solve this problem.
 This algorithm is also notorious as a lousy estimator. So, you shouldn’t take
the probability outputs of ‘predict_proba’ too seriously.
 It assumes that all the features are independent. While it might sound great in
theory, in real life, you’ll hardly find a set of independent features.
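
A small sketch of the zero-frequency fix with scikit-learn (assumed available); alpha=1.0 applies Laplace (add-one) smoothing:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count features; the third feature never occurs in class 1's
# training rows, which is exactly the "zero frequency" situation.
X_train = np.array([[3, 0, 1], [2, 1, 1], [0, 2, 0], [1, 3, 0]])
y_train = np.array([0, 0, 1, 1])

# alpha=1.0 gives Laplace smoothing, so unseen feature/class combinations
# get a small non-zero probability instead of exactly zero.
model = MultinomialNB(alpha=1.0).fit(X_train, y_train)
print(model.predict_proba([[0, 1, 2]]))
```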

258. What do you mean by pixel classification? *


Semantic segmentation, also known as pixel-based classification, is an important task in which each pixel of an image is classified as belonging to a particular class. In GIS, you can use segmentation for land-cover classification or for extracting roads or buildings from satellite imagery.
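
A minimal NumPy sketch of the per-pixel idea: given per-pixel class scores (which would come from a segmentation model in practice), each pixel is assigned the class with the highest score:

```python
import numpy as np

# Illustrative per-pixel class scores for a 4x4 image and 3 classes
# (in practice these come from a trained segmentation model).
rng = np.random.default_rng(0)
scores = rng.random((3, 4, 4))  # shape: (classes, height, width)

# Pixel classification: each pixel gets the class with the highest score.
label_map = np.argmax(scores, axis=0)  # shape: (height, width)
print(label_map)  # e.g. 0 = background, 1 = road, 2 = building
```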
259. What is false positive? *
A false positive is an outcome where the model incorrectly predicts the positive class, i.e., the model predicts "positive" when the actual class is negative.
260. What is false negative? *
A false negative is an outcome where the model incorrectly predicts the negative class, i.e., the model predicts "negative" when the actual class is positive.
261. What is true positive? *
A true positive is an outcome where the model correctly predicts the positive class, i.e., the model predicts "positive" and the actual class is positive.
262. What is true negative? *
A true negative is an outcome where the model correctly predicts the negative class, i.e., the model predicts "negative" and the actual class is negative.
263. State the difference between true positive and false positive. *

A true positive raises the alarm when the positive event really happened; a false positive raises the alarm when it did not. The classic shepherd-and-wolf example covers all four outcomes:

True Positive (TP):
 Reality: A wolf threatened.
 Shepherd said: "Wolf."
 Outcome: Shepherd is a hero.
False Positive (FP):
 Reality: No wolf threatened.
 Shepherd said: "Wolf."
 Outcome: Villagers are angry at the shepherd for waking them up.
False Negative (FN):
 Reality: A wolf threatened.
 Shepherd said: "No wolf."
 Outcome: The wolf ate all the sheep.
True Negative (TN):
 Reality: No wolf threatened.
 Shepherd said: "No wolf."
 Outcome: Everyone is fine.
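
These four outcomes are exactly the cells of a confusion matrix; a small sketch of counting them with scikit-learn (assumed available), encoding "wolf" as 1 and "no wolf" as 0:

```python
from sklearn.metrics import confusion_matrix

# 1 = "wolf threatened", 0 = "no wolf"; predictions are the shepherd's calls.
y_true = [1, 0, 1, 0, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 1, 1]

# For binary labels 0/1, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
```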

264. State the difference between true positive and true negative. *

A true positive is a correct prediction of the positive class (the shepherd says "wolf" and a wolf is present), while a true negative is a correct prediction of the negative class (the shepherd says "no wolf" and no wolf is present). See the shepherd-and-wolf example under Question 263.

265. State the difference between false negative and false positive. *

A false negative misses a real positive event (the shepherd says "no wolf" while a wolf is present, and the wolf eats the sheep), whereas a false positive raises a false alarm (the shepherd says "wolf" when there is none, and the villagers are angry). See the shepherd-and-wolf example under Question 263.

266. State the difference between true negative and false positive. *

A true negative correctly predicts the negative class (the shepherd says "no wolf" and no wolf is present), whereas a false positive incorrectly predicts the positive class (the shepherd says "wolf" but no wolf is present). See the shepherd-and-wolf example under Question 263.

267. What is the difference between linear regression and logistic regression? *

Difference between Linear Regression and Logistic Regression:

1. Linear regression is used to predict a continuous dependent variable using a given set of independent variables, whereas logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
2. Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
3. In linear regression, we predict the values of continuous variables; in logistic regression, we predict the values of categorical variables.
4. In linear regression, we find the best-fit line, by which we can easily predict the output; in logistic regression, we find the S-curve by which we can classify the samples.
5. The least-squares estimation method is used in linear regression, whereas the maximum-likelihood estimation method is used in logistic regression.
6. The output of linear regression must be a continuous value, such as price or age; the output of logistic regression must be a categorical value such as 0 or 1, Yes or No, etc.
7. Linear regression requires that the relationship between the dependent and independent variables be linear; logistic regression does not require a linear relationship between the dependent and independent variables.
8. In linear regression, there may be collinearity between the independent variables; in logistic regression, there should not be collinearity between the independent variables.
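
A side-by-side sketch with scikit-learn (assumed available) showing the continuous output of linear regression versus the categorical output of logistic regression on illustrative data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6]])       # independent variable
y_cont = np.array([1.2, 1.9, 3.1, 3.8, 5.2, 6.1])  # continuous target
y_cat = np.array([0, 0, 0, 1, 1, 1])               # categorical target

# Linear regression fits a best-fit line and predicts a continuous value.
print(LinearRegression().fit(X, y_cont).predict([[3.5]]))   # e.g. ~3.5

# Logistic regression fits an S-curve and predicts a class label (0 or 1).
print(LogisticRegression().fit(X, y_cat).predict([[3.5]]))  # 0 or 1
```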

268. State the difference between logistic regression and SVM. *

Differentiate between Support Vector Machine and Logistic Regression
Logistic Regression:
It is a classification model which is used to predict the odds in favour of a particular event. The odds ratio represents the positive event which we want to predict, for example, how likely it is that a sample has breast cancer, or how likely it is that an individual will become diabetic in the future. It uses the sigmoid function to map an input value to a value between 0 and 1.
The basic idea of logistic regression is to adapt linear regression so that it estimates the probability that a new entry falls in a class. The linear decision boundary is simply a consequence of the structure of the regression function and of the use of a threshold in the function to classify. Because logistic regression tries to maximize the conditional likelihood of the training data, it is highly prone to outliers. Standardization (as well as collinearity checks) is also fundamental to make sure a feature's weights do not dominate over the others.
Support Vector Machine (SVM):
It is a very powerful classification algorithm that maximizes the margin between class variables. This margin, defined by the support vectors, represents the distance between the separating hyperplanes (the decision boundary). The reason to have decision boundaries with a large margin is to separate the positive and negative hyperplanes with an adjustable bias-variance proportion. The goal is to separate the classes so that negative samples fall under the negative hyperplane and positive samples fall under the positive hyperplane. SVM is not as prone to outliers, as it only cares about the points closest to the decision boundary. It changes its decision boundary depending on the placement of new positive or negative events.
The decision boundary is much more important for linear SVMs: the whole goal is to place a linear boundary in a smart way. There isn't a probabilistic interpretation of individual classifications, at least not in the original formulation.
Hence, the key points are:
 SVM tries to maximize the margin between the closest support vectors, whereas logistic regression maximizes the posterior class probability.
 SVM is deterministic (though we can use Platt's model for probability scores), while logistic regression is probabilistic.
 For the kernel space, SVM is faster.
In summary:
1. Logistic regression is an algorithm used for solving classification problems, whereas SVM is a model used for both classification and regression.
2. Logistic regression is not used to find the best margin; instead, it can have different decision boundaries with different weights that are near the optimal point. SVM tries to find the "best" margin (the distance between the line and the support vectors) that separates the classes and thus reduces the risk of error on the data.
3. Logistic regression works with already identified independent variables, whereas SVM works well with unstructured and semi-structured data like text and images.
4. Logistic regression is based on a statistical approach, whereas SVM is based on the geometrical properties of the data.
5. Logistic regression is vulnerable to overfitting; the risk of overfitting is less in SVM.
6. Problems to which logistic regression is applied: cancer detection (does a patient have cancer (1) or not (0)), test scores (did the student pass (1) or not (0)), and marketing (will a customer purchase a product (1) or not (0)). Problems that can be solved using SVM: image classification, recognizing handwriting, and cancer detection.
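
A brief sketch of the probabilistic-versus-deterministic point with scikit-learn (assumed available): logistic regression exposes class probabilities, while a linear SVM exposes signed distances to its margin:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]])
y = np.array([0, 0, 0, 1, 1, 1])

# Logistic regression is probabilistic: it maximizes conditional likelihood.
lr = LogisticRegression().fit(X, y)
print("P(class):", lr.predict_proba([[2.5, 2.5]]))

# A linear SVM is deterministic: it maximizes the margin, and
# decision_function gives the signed distance to the separating hyperplane.
svm = SVC(kernel="linear").fit(X, y)
print("margin distance:", svm.decision_function([[2.5, 2.5]]))
```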

269. State the difference between Linear regression and SVM. *

Linear regression fits a straight line by least squares to predict a continuous output, and every training point, including outliers, influences the fit. SVM instead positions a maximum-margin decision boundary (or, in its SVR variant, a margin-based regression tube) using only the support vectors, so it is far less sensitive to points away from the boundary. Linear regression is a purely statistical regression method, whereas SVM is a geometrically motivated model used primarily for classification.
270. What is discriminant function? *
A discriminant function is a function of a set of variables that is evaluated for samples of events or objects and used as an aid in discriminating between or classifying them.
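
For example, a linear discriminant function g(x) = w·x + b evaluates a sample and the sign of the result discriminates between two classes; a hand-rolled NumPy sketch with illustrative (not learned) weights:

```python
import numpy as np

# Linear discriminant g(x) = w.x + b; illustrative weights, not learned here.
w = np.array([0.8, -0.5])
b = -0.2

def discriminant(x):
    return np.dot(w, x) + b

sample = np.array([1.0, 0.4])
# Classify by the sign of the discriminant value.
print("class:", 1 if discriminant(sample) > 0 else 2)
```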
271. What do you mean by IaaS? *
Infrastructure as a service (IaaS) is a type of cloud computing service that offers essential compute,
storage and networking resources on demand, on a pay-as-you-go basis. IaaS is one of the four types of
cloud services, along with software as a service (SaaS), platform as a service (PaaS) and serverless.

272. What do you mean by PaaS? *


Platform as a service (PaaS) is a complete development and deployment environment in the cloud, with resources that enable you to deliver everything from simple cloud-based apps to sophisticated, cloud-enabled enterprise applications.

273. What do you mean by SaaS? *


Software as a service (or SaaS) is a way of delivering applications over the Internet—as a service.
Instead of installing and maintaining software, you simply access it via the Internet, freeing yourself from
complex software and hardware management.

274. What is Graph database? Explain *

A graph database is defined as a specialized, single-purpose platform for creating and manipulating
graphs. Graphs contain nodes, edges, and properties, all of which are used to represent and store data in
a way that relational databases are not equipped to do.
Graph analytics is another commonly used term, and it refers specifically to the process of analyzing data
in a graph format using data points as nodes and relationships as edges. Graph analytics requires a
database that can support graph formats; this could be a dedicated graph database, or a converged
database that supports multiple data models, including graph.

Graph database types


There are two popular models of graph databases: property graphs and RDF graphs. The property graph
focuses on analytics and querying, while the RDF graph emphasizes data integration. Both types of
graphs consist of a collection of points (vertices) and the connections between those points (edges). But
there are differences as well.

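
A toy property-graph sketch in plain Python (an illustrative data structure, not a real graph-database API), showing how nodes and edges carry properties and how relationship queries follow edges directly rather than joining tables:

```python
# Toy property graph: nodes and edges both carry properties.
nodes = {
    "alice": {"label": "Person", "age": 34},
    "bob":   {"label": "Person", "age": 29},
    "acme":  {"label": "Company"},
}
edges = [
    ("alice", "KNOWS", "bob", {"since": 2015}),
    ("alice", "WORKS_AT", "acme", {"role": "engineer"}),
]

def neighbors(node, rel):
    # Traverse edges directly -- no table joins as in a relational database.
    return [dst for src, r, dst, _ in edges if src == node and r == rel]

print(neighbors("alice", "KNOWS"))  # ['bob']
```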

275. What do you mean by Spatial databases? *

A spatial database is a database that is enhanced to store and access spatial data or data
that defines a geometric space. These data are often associated with geographic locations
and features, or constructed features like cities. Data on spatial databases are stored as
coordinates, points, lines, polygons and topology. Some spatial databases handle more
complex data like three-dimensional objects, topological coverage and linear networks.
276. What is the difference between IaaS and PaaS? *

Difference between IAAS, PAAS and SAAS
1. IAAS:
Infrastructure as a Service (IAAS) is a means of delivering computing infrastructure as on-demand services. It is one of the three fundamental cloud service models, providing servers, storage, network, and operating systems. Instead of purchasing servers, software, data-center space, or network equipment, users rent those resources as a fully outsourced, on-demand service. It allows dynamic scaling, resources are distributed as a service, and it generally includes multiple users on a single piece of hardware.
2. PAAS:
Platform as a Service (PAAS) is a cloud delivery model for applications composed of services managed by a third party. It provides elastic scaling of your application, allows developers to build applications and services over the internet, and its deployment models include public, private, and hybrid clouds.
3. SAAS:
Software as a Service (SAAS) allows users to run existing online applications. It is a model in which software is deployed as a hosted service and accessed over the internet, i.e., a software-delivery model in which the software and its associated data are hosted centrally and accessed via a client, usually a web browser. SAAS services are used for the development and deployment of modern applications.

Difference between IAAS, PAAS and SAAS:

Stands for:
 IAAS: Infrastructure as a Service.
 PAAS: Platform as a Service.
 SAAS: Software as a Service.
Uses:
 IAAS is used by network architects.
 PAAS is used by developers.
 SAAS is used by end users.
Access:
 IAAS gives access to resources like virtual machines and virtual storage.
 PAAS gives access to the run-time environment and to deployment and development tools for applications.
 SAAS gives access to the end user.
Model:
 IAAS is a service model that provides virtualized computing resources over the internet.
 PAAS is a cloud computing model that delivers the tools used for the development of applications.
 SAAS is a service model in cloud computing that hosts software and makes it available to clients.
Technical understanding:
 IAAS requires technical knowledge.
 PAAS requires basic knowledge of the setup.
 SAAS has no technical requirements; the company handles everything.
Popularity:
 IAAS is popular among developers and researchers.
 PAAS is popular among developers who focus on the development of apps and scripts.
 SAAS is popular among consumers and companies, for uses such as file sharing, email, and networking.
Cloud services:
 IAAS: Amazon Web Services, Sun, vCloud Express.
 PAAS: Facebook, Google search engine.
 SAAS: MS Office Web, Facebook, Google Apps.
Enterprise services:
 IAAS: AWS Virtual Private Cloud.
 PAAS: Microsoft Azure.
 SAAS: IBM Cloud Analysis.
Outsourced cloud services:
 IAAS: Salesforce.
 PAAS: Force.com, Gigaspaces.
 SAAS: AWS, Terremark.
277. What is the difference between IaaS and SaaS? *

The key differences between IAAS and SAAS, drawn from the comparison above:

Stands for: IAAS stands for Infrastructure as a Service; SAAS stands for Software as a Service.

Uses: IAAS is used by network architects; SAAS is used by end users.

Access: IAAS gives access to resources like virtual machines and virtual storage; SAAS gives access to the end user.

Model: IAAS is a service model that provides virtualized computing resources over the internet; SAAS is a service model in cloud computing that hosts software and makes it available to clients.

Technical understanding: IAAS requires technical knowledge; SAAS requires no knowledge of technicalities, since the company handles everything.

Popularity: IAAS is popular among developers and researchers; SAAS is popular among consumers and companies, for uses such as file sharing, email and networking.

Examples: IAAS: Amazon Web Services, vCloud Express, AWS Virtual Private Cloud; SAAS: MS Office Web, Google Apps, IBM Cloud Analysis.

278. What is the difference between SaaS and PaaS? *

The key differences between SAAS and PAAS, drawn from the comparison above:

Stands for: SAAS stands for Software as a Service; PAAS stands for Platform as a Service.

Uses: SAAS is used by end users; PAAS is used by developers.

Access: SAAS gives access to the end user; PAAS gives access to the run-time environment and to deployment and development tools for applications.

Model: SAAS is a service model in cloud computing that hosts software and makes it available to clients; PAAS is a cloud computing model that delivers tools used for the development of applications.

Technical understanding: SAAS requires no knowledge of technicalities, since the company handles everything; PAAS requires knowledge of the subject to understand the basic setup.

Popularity: SAAS is popular among consumers and companies, for uses such as file sharing, email and networking; PAAS is popular among developers who focus on the development of apps and scripts.

Examples: SAAS: MS Office Web, Google Apps, IBM Cloud Analysis; PAAS: Microsoft Azure, Force.com, Gigaspaces.

279. When we need to use SaaS? *

SaaS is a good fit whenever you want ready-to-use software without installing, maintaining or upgrading it yourself. The following are five of the top advantages of using SaaS:


1. Reduced time to benefit

Software as a service (SaaS) differs from the traditional model because the software (application) is already installed and configured. You can simply provision a server for an instance in the cloud and, in a couple of hours, have the application ready for use. This reduces the time spent on installation and configuration, and can reduce the issues that get in the way of software deployment.

2. Lower costs
SaaS can provide beneficial cost savings since it usually resides in a shared or multi-tenant environment,
where the hardware and software license costs are low compared with the traditional model.

Another advantage is that you can rapidly scale your customer base, since SaaS allows small and medium businesses to use software that they otherwise would not use due to the high cost of licensing.

Maintenance costs are reduced as well, since the SaaS provider owns the environment and it is split
among all customers that use that solution.

3. Scalability and integration

Usually, SaaS solutions reside in cloud environments that are scalable and have integrations with other
SaaS offerings. Compared with the traditional model, you don't have to buy another server or software.
You only need to enable a new SaaS offering and, in terms of server capacity planning, the SaaS provider
will own that. Additionally, you'll have the flexibility to be able to scale your SaaS use up and down
based on specific needs.

4. New releases (upgrades)

With SaaS, the provider upgrades the solution and it becomes available for their customers. The costs and
effort associated with upgrades and new releases are lower than the traditional model that usually forces
you to buy an upgrade package and install it (or pay for specialized services to get the environment
upgraded).

5. Easy to use and perform proof-of-concepts

SaaS offerings are easy to use since they already come with baked-in best practices and samples. Users
can do proof-of-concepts and test the software functionality or a new release feature in advance. Also,
you can have more than one instance with different versions and do a smooth migration. Even for large
environments, you can use SaaS offerings to test the software before buying.

280. What do you mean by SaaS delivery? *


SaaS works through the cloud delivery model. A software provider will either host the application and related data using its own servers, databases, networking and computing resources, or it may be an independent software vendor (ISV) that contracts a cloud provider to host the application in the provider's data center.
281. Write down the advantages of SaaS. *

The advantages of SaaS are the five covered under question 279: reduced time to benefit, lower costs, scalability and integration, painless new releases (upgrades), and ease of use with quick proof-of-concepts.

282. Write down the characteristics of SaaS. *

Some of the must-have or nice-to-have features and key characteristics of SaaS applications are the following:
 Multi-tenancy model
 Automated provisioning
 Single Sign On
 Subscription-based billing
 High availability
 Elastic infrastructure
 Data security
 Application security
 Rate limiting/QoS
 Audit
Multi-tenancy Model

Multi-tenancy is a kind of software architecture in which a single deployment of a software application serves multiple customers. Each customer is called a tenant. Tenants may be given the ability to customize some parts of the application. Nowadays, applications are designed in such a way that, per tenant, the storage area is segregated: by having a different database altogether, by having different schemas inside a single database, or by sharing the same database with tenant discriminators.
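For instance, the "same database with discriminators" style can be sketched in a few lines of Python using the standard-library sqlite3 module; the table and column names are hypothetical:

```python
# A minimal sketch of the "shared database with a discriminator column"
# multi-tenancy style described above, using sqlite3 from the standard
# library. Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (tenant_id TEXT, amount REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?)",
                 [("tenant_a", 120.0), ("tenant_b", 75.5), ("tenant_a", 40.0)])

# Every query is scoped by the tenant discriminator, so one tenant can
# never see another tenant's rows.
def invoices_for(tenant_id):
    cur = conn.execute("SELECT amount FROM invoices WHERE tenant_id = ?",
                       (tenant_id,))
    return [row[0] for row in cur]

print(invoices_for("tenant_a"))   # [120.0, 40.0]
```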
Automated Provisioning

The users should be able to access the SaaS application on the fly, which means the process of provisioning users with the services needs to be automated. SaaS applications are typically used by B2B/B2C customers, and this requirement demands creating companies/users just by invoking web services and providing the access credentials. Most SaaS applications provide this critical feature; a great example would be the CREST API from Microsoft. Cloud Services Broker (CSB) platforms can automate this procedure to provide access to SaaS applications on an on-demand basis. Another important characteristic is the de-provisioning ability: removing access from users/organizations whenever the customer decides to stop using the Software as a Service application. A good example of this is Salesforce, used by sales teams to manage sales-related operations. Typically, a Salesforce tenant is created for an organization with a unique identification by invoking Salesforce APIs. Another set of APIs is called to create users under the tenant, and the access credentials are shared with those users. A delete API is called when an organization decides to discontinue the application.
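A hedged sketch of what such provisioning automation looks like from the caller's side, using Python's requests library against an invented REST API (the URL, endpoints and payload fields are all hypothetical, not any real product's API):

```python
# Sketch of automated tenant provisioning: create a tenant, then a user
# under it, via a hypothetical REST API. Real products (Salesforce,
# Microsoft CSP, etc.) expose their own APIs for this.
import requests

BASE = "https://saas.example.com/api/v1"             # hypothetical endpoint
session = requests.Session()
session.headers["Authorization"] = "Bearer <token>"  # placeholder credential

tenant = session.post(f"{BASE}/tenants", json={"name": "Acme Corp"}).json()
user = session.post(f"{BASE}/tenants/{tenant['id']}/users",
                    json={"email": "alice@acme.example"}).json()

# De-provisioning: the reverse call, made when the customer discontinues.
session.delete(f"{BASE}/tenants/{tenant['id']}")
```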
Single Sign On

An enterprise organization wants a single identity system in place to authenticate the various systems its users consume. It is also important for enterprises to have a single page where users provide login credentials and access all Software as a Service applications provisioned to them. So Software as a Service applications should be easy to integrate with various identity management systems without much change. It is also a big maintenance overhead for enterprises to store and maintain multiple credentials per system for each enterprise user. It therefore becomes important to enable Single Sign On for SaaS applications, so that they authenticate against the existing identity system and provide the experience of logging in once and using the various systems. Typically, Software as a Service applications use federation protocols such as SAML or OpenID Connect to enable this critical piece. Another important factor is that SaaS applications are multi-tenant: each tenant will want to authenticate against its own identity and access management system.
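As an illustrative sketch of the final step of such a flow, here is how a SaaS backend might validate an OpenID Connect ID token issued by the enterprise identity provider, using the third-party PyJWT library; the issuer, audience and key are placeholders:

```python
# Validating an OpenID Connect ID token on the SaaS side.
# Uses the third-party PyJWT library (pip install PyJWT).
import jwt

def validate_id_token(token: str, public_key: str) -> dict:
    # Signature, expiry, audience and issuer checks all happen inside
    # jwt.decode; a failure raises an exception instead of returning claims.
    return jwt.decode(
        token,
        public_key,
        algorithms=["RS256"],
        audience="my-saas-app",            # placeholder client id
        issuer="https://idp.example.com",  # placeholder tenant IdP
    )
```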
Subscription-based Billing

SaaS application pricing does not involve the complexity of license costs, upgrade costs and so on. Generally, Software as a Service applications are subscription based, which enables customers to buy SaaS applications whenever they require them and discontinue them whenever the enterprise decides they are no longer needed. SaaS applications generally follow a seat-based charging model: the number of seats purchased decides the amount to be paid. They can have various pricing models and billing cycles, such as fixed monthly, quarterly, half-yearly or annual plans. Some modern SaaS applications also provide usage-based billing. Another important characteristic is that the SaaS application should be able to be invoiced; typically, CSB platforms look for this critical feature so that they can dispatch a single invoice to their customers.
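A toy example of the seat-based charging model described above; the per-seat rate and the discount tier are invented for illustration:

```python
# Seat-based billing calculation matching the pricing style described
# above; the rate and volume-discount threshold are example values.
def monthly_charge(seats: int, rate_per_seat: float = 12.0) -> float:
    # Simple volume discount: 10% off past 50 seats.
    discount = 0.10 if seats > 50 else 0.0
    return round(seats * rate_per_seat * (1 - discount), 2)

print(monthly_charge(10))   # 120.0
print(monthly_charge(80))   # 864.0  (960 minus the 10% discount)
```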
High Availability

SaaS applications are shared by multiple tenants, and the availability of such applications is expected to be very high throughout. So Software as a Service applications should provide a high degree of SLA to their customers; applications should be accessible 24x7 across the globe. SaaS applications should also expose management and monitoring APIs to continuously check the health/availability factor.
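A minimal sketch of an external availability probe against such a monitoring endpoint (the URL is hypothetical; real deployments would run this from a dedicated monitoring system rather than a script):

```python
# Polling a hypothetical SaaS health endpoint with the requests library.
import requests

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        # Network errors and timeouts count as unavailability.
        return False

print(is_healthy("https://saas.example.com/healthz"))
```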
Elastic Infrastructure

SaaS application usage is generally not predictable; consumption can vary dramatically from month to month. The infrastructure the applications are deployed on should have the ability to expand and shrink the resources used behind the scenes. These days, SaaS applications are designed so that the behavior of the infrastructure is observed: monitoring agents residing within the deployed resources inform the respective management servers about the utilization of the resources. Typically, policies and procedures are built into the core architecture to expand or shrink the infrastructure resources. Microservice-based SaaS applications are the classic examples, and tools like Docker and Kubernetes are used to manage the elasticity of SaaS applications. Another way is to build a policy engine that receives and reacts to events, where an event could be a request to expand or shrink the infrastructure resources.
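A toy policy-engine rule of the expand/shrink kind described above, assuming a single CPU-utilization metric and invented thresholds:

```python
# Compare observed load against thresholds and emit a scaling decision.
# The metric and thresholds are example values, not a real autoscaler.
def scaling_decision(cpu_utilization: float, instances: int) -> int:
    """Return the desired instance count for the next interval."""
    if cpu_utilization > 0.80:
        return instances + 1          # expand under pressure
    if cpu_utilization < 0.25 and instances > 1:
        return instances - 1          # shrink when idle
    return instances

print(scaling_decision(0.91, 3))   # 4
print(scaling_decision(0.10, 3))   # 2
```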
Data Security

Ensuring that the data/business information is protected from corruption and unauthorized access is very
important in today’s world. Since the Software as a Service applications are designed to be shared by different
tenants, it becomes extremely important to know how well the data is secured. Certain types of data must be
enabled with encrypted storage for a particular tenant and the same should not be accessible to another tenant.
So, having a good Key Management Framework or ability to integrate/interface with external Key
Management Frameworks becomes essential part of SaaS applications. Also integration with CASB (Cloud
Access Security Brokers) systems will increase confidence with respect to data security. Very strong role-based access controls need to be in place in order to protect the data.
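A minimal sketch of per-tenant encrypted storage using the third-party cryptography package: each tenant gets its own key, so data encrypted for one tenant cannot be decrypted with another tenant's key. Key handling here is simplified; a real system would use a key management framework as noted above.

```python
# Per-tenant symmetric encryption with Fernet (pip install cryptography).
from cryptography.fernet import Fernet

tenant_keys = {"tenant_a": Fernet.generate_key(),
               "tenant_b": Fernet.generate_key()}

def encrypt_for(tenant: str, plaintext: bytes) -> bytes:
    return Fernet(tenant_keys[tenant]).encrypt(plaintext)

def decrypt_for(tenant: str, ciphertext: bytes) -> bytes:
    return Fernet(tenant_keys[tenant]).decrypt(ciphertext)

blob = encrypt_for("tenant_a", b"confidential record")
print(decrypt_for("tenant_a", blob))   # b'confidential record'
# decrypt_for("tenant_b", blob) would raise InvalidToken.
```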
Application Security

SaaS applications should be equipped with protection against vulnerabilities; typically, they should be protected against OWASP/SANS-identified vulnerabilities. Strong identity and access management controls should also be enabled for SaaS applications. The other aspects that make a Software as a Service application secure are the following:
 Strong session management and protection against session hijacking
 Identifying unauthorized sessions, protection against multiple concurrent sessions, etc.
 Cookies that do not store sensitive data and follow secure cookie practices
 Step-up authentication, password lockout, etc.
 Multi-factor authentication
 Strong implementation of separation of duties
 Protection against DoS/DDoS attacks
 Protection against buffer overflow attacks
 Open integration points with CASB, which also help in gaining customers' confidence

Rate Limiting/QoS

Every business has preferred/important users apart from the regular list of users using the application. These days, in order to provide better service to all classes of customers, rate limiting is a good feature to have: the number of hits or transactions can be technically limited to ensure smooth business operations. SaaS applications can also be enabled with configurable rate limiting/QoS, which helps organizations manage their user base.
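A classic way to implement such limits is a token bucket; the following self-contained Python sketch uses arbitrary example values for the rate and burst capacity:

```python
# Token-bucket rate limiter: tokens refill at a steady rate, each request
# consumes one, and requests are rejected when the bucket is empty.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)   # ~5 requests/second, bursts of 10
print(bucket.allow())   # True while tokens remain
```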
Audit

Generally, SaaS applications provide audit logs of business transactions, which enables customers to work out a business strategy by applying business-intelligence plans. These services should also be able to comply with government regulations and internal policies.
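A small sketch of emitting such audit entries as JSON lines so that downstream analytics can consume them; the field names and file path are illustrative:

```python
# Append-only structured audit log, one JSON object per line.
import json
import time

def audit(actor: str, action: str, resource: str) -> None:
    entry = {"ts": time.time(), "actor": actor,
             "action": action, "resource": resource}
    with open("audit.log", "a") as fh:
        fh.write(json.dumps(entry) + "\n")

audit("alice@acme.example", "EXPORT", "invoices/2023-Q4")
```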

283. Write down the limitations of SaaS. *

1. Access management

Access management is critical for every SaaS application due to the presence of sensitive data. SaaS
customers need to know whether the single point of access into the public cloud can expose confidential
information. It is also worthwhile to ask questions about the design of access control systems and identify
whether there are any chances for network security issues, like deficient patching and lack of monitoring.

2. Misconfigurations

Most SaaS products add more layers of complexity into their system, thus increasing the chances for
misconfigurations to arise. Even small configuration mistakes can affect the availability of the cloud
infrastructure.
One of the most well-known misconfiguration mistakes occurred in February 2008 when Pakistan
Telecom tried to block YouTube within Pakistan due to some supposedly blasphemous videos. Their
attempt to create a dummy route for YouTube made the platform globally unavailable for two hours.

3. Regulatory compliance

When you are ensuring that your suppliers have strong endpoint security measures in place, ask these
questions:

 What is the relevant jurisdiction that governs customer data, and how is it determined?
 Do your cloud applications comply with regulatory, privacy, and data protection requirements like GDPR,
HIPAA, SOX, and more?
 Are your cloud providers ready to undergo external security audits?
 Does your cloud service provider hold any security certifications like ISO, ITIL, and more?

4. Storage

Before you purchase new software, it is important to check where all the data is stored. SaaS users can
ask the following questions to cross-check data storage policies:

 Does your SaaS provider allow you to have any control over the location of data stored?
 Is data stored with the help of a secure cloud service provider like AWS or Microsoft, or is it stored in a
private data center?
 Are security solutions like data encryption available in all stages of data storage?
 Can end users share files and objects with other users within and outside their domain?

5. Retention

You need to check how long the SaaS environment retains the sensitive information you enter into the
system. It is recommended to check who owns the data available in the cloud: the SaaS provider or the
user? What is the cloud data retention policy, who enforces it, and are there any exceptions to this?

6. Disaster recovery

Disasters can happen out of the blue and have the capacity to shake the foundations of your business. You
need to ask these questions to get yourself ready to face any impending disasters.

What happens to the cloud application and all your data stored in it during a natural disaster? Does the
force majeure clause in your master service agreement come into play? Does your service provider
promise a complete restoration? If yes, check how long that will take and its procedures.
7. Privacy and data breaches

Privacy and data breaches are common security threats that organizations face every day. Ask these questions to learn how well your supplier can mitigate and recover from privacy and data breaches.

284. Write down the name of some SaaS oriented products. *

Types of SaaS
The SaaS model is singled out as a separate market direction for a reason: the diversity of the market for cloud solutions is astonishing.

Let's consider some of the available options. Here are the main types of SaaS:

1. CRM (customer relationship management) software: In 2019, 91% of small businesses used a CRM. The goal of such systems is to automate sales and marketing processes; core features include a lead pipeline and an analytics dashboard. Salesforce and HubSpot are among the most significant trendsetters in this area.

2. ERP (enterprise resource planning) software: A management system for running every business process in real time. Customer satisfaction with ERP software increased to 68% in 2019, and against the background of market growth to $79 billion by 2026, a SaaS product here would be beloved from day one. Oracle and Acumatica are famous players in this market.

3. Project management software: Helps project managers collaborate with their teams. This niche is expected to grow to $4.33 billion by 2023, so there is an excellent spot to start your own project. The strongest competitors in this field are Jira and ProWorkflow.

4. Billing software: In simple words, this niche contains SaaS products that cover all payment procedures, making payment and after-payment reporting single-click processes. This market is predicted to reach $20 billion by 2026, which is great news for SaaS providers such as Xero, Tipalti or Refrens.

5. Collaboration software: Usually, this type of cloud software includes features for communication and information sharing. This market is expected to be worth $16 billion by 2025, large enough to carve your own piece of the pie. Good examples of collaboration software are Miro and I Done This.

6. eCommerce software: Includes everything you need to do business on the internet; core features include goods management and payment integrations. The e-commerce market is forecast to reach $21.4 billion by 2020, and it creates synergy with internet-based shopping, since by 2040 an estimated 95% of purchases will be made via the internet. To understand the trend better, you can analyze trendsetters such as Shopify or BigCommerce.

7. Vertical SaaS: A generic name for niche products built especially to optimize the business processes of a particular niche; it can be anything: a comic store, a martial arts dojo, a grooming salon, or any other business.

285. When we need to use PaaS? *

PaaS can streamline workflows when multiple developers are working on the
same development project. If other vendors must be included, PaaS can provide great
speed and flexibility to the entire process. PaaS is particularly beneficial if you need to
create customized applications.
286. What do you mean by PaaS delivery? *
Platform-as-a-service (PaaS) is a type of cloud computing model in which a service provider delivers a
platform to customers. The platform enables the organization to develop, run, and manage business
applications without the need to build and maintain the infrastructure such software development
processes require.
287. Write down the advantages of PaaS. *

PaaS works well for small businesses and startup companies for two very basic reasons. First, it’s cost
effective, allowing smaller organizations access to state-of-the-art resources without the big price tag.
Most small firms have never been able to build robust development environments on premises, so PaaS
provides a path for accelerating software development. Second, it allows companies to focus on what
they specialize in without worrying about maintaining basic infrastructure.
Other advantages include the following:

 Cost Effective: No need to purchase hardware or pay expenses during downtime


 Time Savings: No need to spend time setting up/maintaining the core stack
 Speed to Market: Speed up the creation of apps
 Future-Proof: Access to state-of-the-art data center, hardware and operating systems
 Increase Security: PaaS providers invest heavily in security technology and expertise
 Dynamically Scale: Rapidly add capacity in peak times and scale down as needed
 Custom Solutions: Operational tools in place so developers can create custom software
 Flexibility: Allows employees to log in and work on applications from anywhere

288. Write down the characteristics of PaaS. *

Here are the characteristics of PaaS service model:


 PaaS offers a browser-based development environment. It allows the developer to create databases and edit application code either via an Application Programming Interface (API) or point-and-click tools.
 PaaS provides built-in security, scalability, and web service interfaces.
 PaaS provides built-in tools for defining workflow, approval processes, and business
rules.
 It is easy to integrate PaaS with other applications on the same platform.
 PaaS also provides web services interfaces that allow us to connect the applications
outside the platform.
289. Write down the limitations of PaaS. *

There are always two sides to every story. While it’s easy to make the case for PaaS, there’s bound to be
some challenges as well. Some of these hurdles are simply the flip side of the positives and the nature of
the beast. Others can be overcome with advanced planning and preparation.

Challenges may include the following:

 Vendor Dependency: Very dependent upon the vendor’s capabilities


 Risk of Lock-In: Customers may get locked into a language, interface or program they no
longer need
 Compatibility: Difficulties may arise if PaaS is used in conjunction with existing
development platforms
 Security Risks: While PaaS providers secure the infrastructure and platform, businesses
are responsible for security of the applications they build

290. Write down the name of some PaaS oriented products. *

Examples of PaaS
 AWS Elastic Beanstalk.
 Windows Azure.
 Heroku.
 Force.com.
 Google App Engine.
 OpenShift.

291. When we need to use IaaS? *


IaaS provides all the infrastructure to support web apps, including storage, web and application
servers and networking resources. Your organisation can quickly deploy web apps on IaaS and easily
scale infrastructure up and down when demand for the apps is unpredictable.

292. What do you mean by IaaS delivery? *


IAAS stands for Infrastructure-as-a-Service. It refers to a cloud based infrastructure for your
business. Cloud service providers offer virtual computing resources over the Internet. The powerful
cloud servers offered by leading IAAS providers tend to keep your web applications active all year
round.
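As a hedged example of consuming IaaS programmatically, the following sketch launches a virtual machine with the AWS SDK for Python (boto3); the AMI id is a placeholder, and configured AWS credentials are assumed:

```python
# Launching a virtual machine on IaaS via the AWS SDK for Python
# (pip install boto3). The image id below is a placeholder.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder image id
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
print(resp["Instances"][0]["InstanceId"])
```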
293. Write down the advantages of IaaS. *

IaaS is advantageous to companies in scenarios where scalability and quick provisioning are key. In
other words, organizations experiencing rapid growth but lacking the capital to invest in hardware are
great candidates for IaaS models. IaaS can also be beneficial to companies with steady application
workloads that simply want to offload some of the routine operations and maintenance involved in
managing infrastructure.

Other advantages may include the following:

 Pay for What You Use: Fees are computed via usage-based metrics
 Reduce Capital Expenditures: IaaS is typically a monthly operational expense
 Dynamically Scale: Rapidly add capacity in peak times and scale down as needed
 Increase Security: IaaS providers invest heavily in security technology and expertise
 Future-Proof: Access to state-of-the-art data center, hardware and operating systems
 Self-Service Provisioning: Access via simple internet connection
 Reallocate IT Resources: Free up IT staff for higher value projects
 Reduce Downtime: IaaS enables instant recovery from outages
 Boost Speed: Developers can begin projects once IaaS machines are provisioned
 Enable Innovation: Add new capabilities and leverage APIs
 Level the Playing Field: SMBs can compete with much larger firms

294. Write down the characteristics of IaaS. *

Characteristics of IaaS systems include:


 Automated administrative tasks.
 Dynamic scaling.
 Platform virtualization technology.
 GUI and API-based access.
 Internet connectivity.
295. Write down the limitations of IaaS. *

There are many benefits to using IaaS in an organization, but there are also challenges. Some of these hurdles can be overcome with advance preparation, but others present risks that a customer should weigh before deployment.
Challenges may include the following:

 Unexpected Costs: Monthly fees can add up, or peak usage may be more than expected
 Process Changes: IaaS may require changes to processes and workflows
 Runaway Inventory: Instances may be deployed, but not taken down
 Security Risks: While IaaS providers secure the infrastructure, businesses are responsible
for anything they host
 Lack of Support: Live help is sometimes hard to come by
 Complex Integration: Challenges with interaction with existing systems
 Loss of Direct Control: New vulnerabilities may emerge around the loss of direct control
 Limited Customization: Public cloud users may have limited control and ability to
customize
 Vendor Lock-In: Moving from one IaaS provider to another can be challenging
 Broadband Dependency: Only as good as the reliability of the internet connection
 Providers Not Created Equally: Vendor vetting and selection can be challenging
 Managing Availability: Even the largest service providers experience downtime
 Confusing SLAs: Service level agreements (SLAs) can be difficult to understand
 Regulatory Uncertainty: Evolving federal and state laws can impact some industries’ use
of IaaS, especially across country borders
 Vendor Consolidation: Providers may be acquired or go out of business
 Third-Party Expertise: Lack of mature service providers, guidance or ecosystem support

296. Write down the name of some IaaS oriented products. *

Popular examples of IaaS include:


 DigitalOcean.
 Linode.
 Rackspace.
 Amazon Web Services (AWS)
 Cisco Metacloud.
 Microsoft Azure.
 Google Compute Engine (GCE)
