
UNIT-1

1. What are the characteristics of Big Data?


Ans:-
1. Volume:
• The name ‘Big Data’ itself refers to a size that is enormous.
• Volume denotes the huge amount of data.
• The size of data plays a crucial role in determining its value. If the
volume of data is very large, then it is actually considered ‘Big Data’.
Hence, while dealing with Big Data it is necessary to consider the
characteristic ‘Volume’.

2. Velocity:
• Velocity refers to the high speed at which data accumulates.
• In Big Data, data flows in at high velocity from sources like machines,
networks, social media, mobile phones, etc.
• There is a massive and continuous flow of data. Velocity determines the
potential of data: how fast the data is generated and processed to meet
demands. Sampling data can help in dealing with velocity issues.

3. Variety:
• Variety refers to the nature of data: structured, semi-structured and
unstructured.
• It also refers to data arriving from heterogeneous sources.
• Structured data: This is basically organized data. It generally refers
to data that has a defined length and format.
• Semi-structured data: This is basically semi-organized data. It is
generally a form of data that does not conform to the formal structure of
a data model.
• Unstructured data: This basically refers to unorganized data. It
generally refers to data that doesn’t fit neatly into the traditional
row-and-column structure of a relational database. Texts, pictures,
videos, etc. are examples of unstructured data.
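To make the three categories concrete, here is a minimal sketch in Python (the sample records and field names are illustrative assumptions, not from any particular system):

```python
import csv
import io
import json

# Structured: fixed length and format, like a row in a relational table.
structured = io.StringIO("id,name,age\n1,Asha,21\n2,Ravi,23\n")
rows = list(csv.DictReader(structured))

# Semi-structured: tagged fields, but not bound to a rigid schema (e.g. JSON).
semi_structured = json.loads('{"id": 3, "name": "Meena", "hobbies": ["chess"]}')

# Unstructured: free text with no row-and-column structure at all.
unstructured = "Meena posted a photo and wrote: great trip to Goa!"

print(rows[0]["name"])             # field access via a fixed schema
print(semi_structured["hobbies"])  # nested, optional fields
print(len(unstructured.split()))   # only crude processing is possible
```

Notice that only the structured record guarantees every field is present; the JSON record may add or drop fields, and the free text can only be searched or tokenized.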

4. Veracity:
• It refers to inconsistencies and uncertainty in data.

5. Value:
• After taking the 4 V’s into account, there comes one more V,
which stands for Value. Bulk data that has no value is of no good
to the company unless it is turned into something useful.

2. Describe the Google File System with a neat diagram?


Ans:-

3. Differentiate the Google File System from HDFS?


Ans:-
4. Discuss the Hadoop Cluster Architecture in detail?
Ans:-
Basically, for the purpose of storing and analyzing huge amounts of
unstructured data in a distributed computing environment, a special type of
computational cluster is designed, which we call a Hadoop cluster.
Whenever we talk about Hadoop clusters, two main terms come up: cluster
and node. Defining them:
• A collection of nodes is what we call a cluster.
• A node is a point of intersection/connection within a network, i.e. a
server.
Nothing is shared between the nodes in a Hadoop cluster except for the
network which connects them (Hadoop follows a shared-nothing architecture).
This feature decreases processing latency, so the cluster-wide latency is
minimized when queries need to be processed on huge amounts of data.
In addition, Hadoop clusters have two types of machines, Master and
Slave, where:
• Master: HDFS NameNode, YARN ResourceManager.
• Slaves: HDFS DataNodes, YARN NodeManagers.
It is recommended to separate the master and slave nodes, because:
• Task/application workloads on the slave nodes should be isolated
from the masters.
• Slave nodes are frequently decommissioned for maintenance.
Moreover, it is possible to scale out a Hadoop cluster; here, scaling means
adding more nodes. That is why we also call it linearly scalable: for every
node we add, we get a corresponding boost in throughput.
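The linear-scalability claim can be sketched with simple arithmetic; the per-node throughput below is an assumed illustrative figure, not a benchmark:

```python
# Assume each slave node can scan 50 GB per hour (illustrative figure only).
PER_NODE_THROUGHPUT_GB_PER_HOUR = 50

def processing_hours(data_gb: float, nodes: int) -> float:
    """Ideal shared-nothing scaling: total throughput grows with node count."""
    return data_gb / (PER_NODE_THROUGHPUT_GB_PER_HOUR * nodes)

# Doubling the nodes halves the (ideal) processing time for 1000 GB.
print(processing_hours(1000, 5))   # 4.0 hours
print(processing_hours(1000, 10))  # 2.0 hours
```

In practice, coordination overhead means real clusters fall somewhat short of this ideal line, but the shared-nothing design keeps the scaling close to linear.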

5. List any 10 Hadoop File System commands with syntax and options, if
any?
Ans:-
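The answer is not filled in here; as a sketch, ten commonly used HDFS shell commands with the standard `hdfs dfs` syntax (the paths shown are placeholders) are:

```shell
hdfs dfs -ls /user/data               # list files in an HDFS directory
hdfs dfs -mkdir -p /user/data/in      # create directories (-p: make parents)
hdfs dfs -put local.txt /user/data    # copy from the local FS into HDFS
hdfs dfs -get /user/data/f.txt .      # copy from HDFS to the local FS
hdfs dfs -cat /user/data/f.txt        # print a file's contents
hdfs dfs -cp /user/data/a /tmp/a      # copy within HDFS
hdfs dfs -mv /tmp/a /tmp/b            # move/rename within HDFS
hdfs dfs -rm -r /tmp/b                # delete (-r: recursive)
hdfs dfs -du -h /user/data            # disk usage (-h: human-readable sizes)
hdfs dfs -chmod 644 /user/data/f.txt  # change file permissions
```

These commands require a running Hadoop installation; `hadoop fs` is an older, equivalent prefix for the same commands.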
6. Discuss the challenges and applications of Big Data?
Ans:-
Big Data is a combination of structured, semi-structured and unstructured
data collected by organizations that can be mined for information and used
in machine learning projects, predictive modeling and other advanced
analytics applications.
Big Data Challenges
1. Lack of skilled professionals
To run these modern technologies and Big Data tools, companies need skilled
data professionals. These professionals include data scientists, data
analysts, and data engineers who can work with the tools and make sense of
giant data sets. One of the Big Data challenges that any company faces is a
lack of Big Data professionals.
2. Lack of proper understanding of Big Data
Employees might not know what data is, how it is stored and processed, its
importance, and its sources. Data professionals may know what is happening,
but others might not have a clear picture.
As a result, when this important data is required, it cannot be retrieved
easily.
3. Data Growth Issues
One of the most pressing challenges of Big Data is storing these huge data
sets properly. The quantity of data being stored in the data centers and
databases of companies is increasing rapidly. As these data sets grow
exponentially with time, they become challenging to handle. Most of the data
is unstructured and comes from documents, videos, audio, text files, and
other sources.
4. Confusion while selecting Big Data tools
Companies often get confused while selecting the best tool for Big Data
analysis and storage. Is HBase or Cassandra the best technology for data
storage? Is Hadoop MapReduce good enough?
These questions bother companies, and sometimes they are unable to find the
answers.
5. Securing Data
Securing these huge data sets is one of the daunting challenges of Big Data.
Companies are often so busy understanding, storing, and analyzing their data
sets that they push data security to later stages.

Applications

Travel and Tourism

The travel and tourism industry is a major user of Big Data. It enables
forecasting of travel facility requirements at multiple locations,
improving business through dynamic pricing, and much more.

Financial and banking sector

The financial and banking sectors use Big Data technology extensively. Big
Data analytics helps banks understand customer behaviour on the basis of
investment patterns, shopping trends, motivation to invest, and inputs
obtained from personal or financial backgrounds.
Healthcare

Big Data has started making a massive difference in the healthcare sector
by supporting medical professionals and healthcare personnel with
predictive analytics. It can deliver personalized healthcare to individual
patients as well.

Telecommunication and media

Telecommunications and the multimedia sector are major users of Big Data.
Zettabytes of data are generated every day, and handling such large-scale
data requires Big Data technologies.

Government and Military

The government and military also use this technology at high rates. In the
military, a fighter plane may need to process petabytes of data. Government
agencies use Big Data to run many operations: managing utilities, dealing
with traffic jams, and tackling crimes like hacking and online fraud.

Social Media

Social media is the largest data generator. Statistics have shown that
around 500+ terabytes of fresh data are generated from social media daily,
particularly on Facebook. The data mainly contains videos, photos, and
message exchanges.

7. Draw and explain the components or building blocks of a Hadoop Cluster?

Ans:- pdf name:- Unit-II Working with BigData

Page no:-14

About (NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker)

8. Discuss the different modes in which Hadoop runs?

Ans:-
9. Explain the Hadoop Ecosystem in detail?

Ans:-

The Hadoop ecosystem is a platform or framework which helps in solving big
data problems. It comprises different components and services (for
ingesting, storing, analyzing, and maintaining data). Most of the services
available in the Hadoop ecosystem supplement the four main core components
of Hadoop: HDFS, YARN, MapReduce and Common.

The Hadoop ecosystem includes both Apache open source projects and a wide
variety of other commercial tools and solutions. Some well-known open
source examples include Spark, Hive, Pig, Sqoop and Oozie.

Now that we have some idea about what the Hadoop ecosystem is, what it does,
and what its components are, let’s discuss each concept in detail.

10.Hadoop Configuration files?


Ans:-
11. Differences between DW (Data Warehouse) and Big Data?
Ans:-

UNIT-2
1. Explain the MapReduce (MR) architecture in detail?
Ans:-
MapReduce is a programming model used for efficient parallel processing of
large data sets in a distributed manner. The data is first split and then
combined to produce the final result. MapReduce libraries have been written
in many programming languages, with various optimizations. The purpose of
MapReduce in Hadoop is to map each job and then reduce it to equivalent
tasks, providing less overhead over the cluster network and reducing the
processing power required. The MapReduce task is mainly divided into two
phases: the Map phase and the Reduce phase.
MapReduce Architecture:

The MapReduce task is mainly divided into 2 phases, i.e. the Map phase and
the Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into
key-value pairs. The input to the map may be a key-value pair where
the key can be the id of some kind of address and the value is the
actual value that it keeps. The Map() function is executed in its
memory repository on each of these input key-value pairs and generates
intermediate key-value pairs, which work as input for the Reducer
or Reduce() function.

2. Reduce: The intermediate key-value pairs that work as input for the
Reducer are shuffled, sorted and sent to the Reduce() function. The
Reducer aggregates or groups the data based on its key-value pairs as
per the reducer algorithm written by the developer.
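The two phases above can be sketched in plain Python as a word-count example (a simplified simulation of the model, not Hadoop's actual Java API):

```python
from collections import defaultdict
from typing import Iterator, Tuple

def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit an intermediate (key, value) pair for every word."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle and sort: group intermediate values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values for each key."""
    return (key, sum(values))

lines = ["big data big cluster", "data node"]
intermediate = [p for line in lines for p in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In real Hadoop the map and reduce calls run on different nodes and the shuffle moves data across the network, but the key-value flow is exactly this.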
2. Differences between the old and new APIs of Hadoop?
Ans:-
