Professional Documents
Culture Documents
2. Velocity:
• Velocity refers to the high speed at which data is generated and accumulated.
• In Big Data, data flows in at high velocity from sources like machines, networks,
social media, mobile phones, etc.
• There is a massive and continuous flow of data. This determines the
potential of the data: how fast it is generated and processed to meet
demand. Sampling the data can help in dealing with the velocity issue.
3. Variety:
• It refers to the nature of data, which can be structured, semi-structured or
unstructured.
• It also refers to the heterogeneous sources the data comes from.
• Structured data: This is basically organized data. It generally refers
to data whose length and format are defined.
• Semi-structured data: This is basically semi-organized data. It is
generally a form of data that does not conform to the formal structure of
structured data.
• Unstructured data: This basically refers to unorganized data. It
generally refers to data that doesn't fit neatly into the traditional row-and-
column structure of a relational database. Texts, pictures, videos, etc. are
examples of unstructured data.
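The three kinds of data can be sketched in a short Python snippet (the sample records and field names below are made up purely for illustration):

```python
import csv
import io
import json

# Structured data: a fixed schema of rows and columns, as in a
# relational table or CSV file.
structured = list(csv.DictReader(io.StringIO("id,name\n1,Asha\n2,Ravi\n")))

# Semi-structured data: self-describing keys/tags, but no rigid,
# predefined schema (e.g. JSON documents).
semi = json.loads('{"id": 1, "name": "Asha", "tags": ["vip"]}')

# Unstructured data: no predefined model at all; free text, images,
# audio, and video fall in this category.
unstructured = "Asha posted a photo with the caption: great trip!"

print(structured[0]["name"])  # fields are addressable by column name
print(semi["tags"])           # nested values addressable by key
```

Structured data can be queried by column, semi-structured data by key, while unstructured data needs techniques such as text mining before it can be analyzed.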
4. Veracity:
• It refers to inconsistencies and uncertainty in data.
5. Value:
• After taking the four V's into account, there comes one more V,
which stands for Value. A bulk of data having no value is
of no good to a company unless it is turned into something
useful.
5. List any 10 Hadoop File System commands with syntax and options, if
any?
Ans:-
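A possible answer, listing ten commonly used HDFS shell commands (the paths such as /user/data are placeholders):

```shell
hadoop fs -ls /user/data               # 1. List files; -R lists recursively
hadoop fs -mkdir -p /user/data         # 2. Create a directory; -p creates parent dirs
hadoop fs -put local.txt /user/data    # 3. Copy a local file into HDFS; -f overwrites
hadoop fs -get /user/data/f.txt .      # 4. Copy a file from HDFS to the local FS
hadoop fs -cat /user/data/f.txt        # 5. Print a file's contents to stdout
hadoop fs -cp /user/data/a.txt /tmp/   # 6. Copy a file within HDFS
hadoop fs -mv /user/data/a.txt /tmp/   # 7. Move or rename a file within HDFS
hadoop fs -rm -r /user/data/old        # 8. Delete; -r removes directories recursively
hadoop fs -du -h /user/data            # 9. Show disk usage; -h is human-readable
hadoop fs -chmod 755 /user/data        # 10. Change file permissions
```

These commands require a running Hadoop installation; `hdfs dfs` can be used interchangeably with `hadoop fs` when working against HDFS.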
6. Discuss the challenges and applications of Big Data?
Ans:-
Big data is a combination of structured, semi-structured and unstructured data
collected by organizations that can be mined for information and used in
machine learning projects, predictive modeling and other advanced analytics
applications.
Big Data Challenges
1. Lack of Knowledgeable Professionals
To run these modern technologies and Big Data tools, companies need skilled
data professionals. These professionals include data scientists, data
analysts, and data engineers who can work with the tools and make sense of giant
data sets. One of the Big Data challenges any company faces is a lack of
Big Data professionals.
2. Lack of Proper Understanding of Big Data
Employees might not know what data is, how it is stored and processed, why it
is important, or where it comes from. Data professionals may know what's
happening, but others might not have a clear picture.
As a result, when this important data is required, it can't be retrieved easily.
3. Data Growth Issues
One of the most pressing challenges of Big Data is storing these huge
data sets properly. The amount of data being stored in the data
centers and databases of companies is increasing rapidly. As these data sets
grow exponentially with time, they get challenging to handle. Most of the data is
unstructured and comes from documents, videos, audio, text files, and other
sources.
4. Confusion while Big Data Tool selection
Companies often get confused while selecting the best tool for Big Data
analysis and storage. Is HBase or Cassandra the best technology for data
storage? Is Hadoop MapReduce good enough for the processing needs?
These questions bother companies, and sometimes they are unable to find the
answers.
5. Securing Data
Securing these huge data sets is one of the daunting challenges of
Big Data. Companies are often so busy understanding, storing, and
analyzing their data sets that they push data security to later stages.
Applications
Travel and Tourism
Travel and tourism are major users of Big Data. It enables forecasting travel
facility requirements at multiple locations, improving business through dynamic
pricing, and much more.
Banking and Finance
The financial and banking sectors use big data technology extensively. Big
data analytics helps banks understand customer behaviour on the basis of
investment patterns, shopping trends, motivation to invest, and inputs obtained
from personal or financial backgrounds.
Healthcare
Big data has started making a massive difference in the healthcare sector. With
the help of predictive analytics, medical professionals and healthcare
personnel can provide personalized healthcare to individual patients.
Telecommunications and Multimedia
The telecommunications and multimedia sectors are among the main users of Big
Data. Zettabytes of data are generated every day, and handling such large-scale
data requires big data technologies.
Government and Military
The government and military also use this technology at high rates. In the
military, a fighter plane needs to process petabytes of data. Government
agencies use Big Data to run many departments, manage utilities, deal with
traffic jams, and tackle crimes like hacking and online fraud.
Social Media
Social Media is the largest data generator. Statistics show that around
500+ terabytes of fresh data are generated from social media daily, particularly on
Facebook. The data mainly contains videos, photos, and message exchanges.
9. Hadoop Ecosystem in detail?
Ans:-
The Hadoop ecosystem is a platform or framework that helps in solving big data
problems. It comprises different components and services (for ingesting, storing,
analyzing, and maintaining data). Most of the services available in the
Hadoop ecosystem supplement the four core components of Hadoop:
HDFS, YARN, MapReduce and Hadoop Common.
The Hadoop ecosystem includes both Apache open-source projects and a wide
variety of commercial tools and solutions. Some of the well-known open-source
examples include Spark, Hive, Pig, Sqoop and Oozie.
Now that we have some idea of what the Hadoop ecosystem is, what it does, and
what its components are, let's discuss each concept in detail.
UNIT-2
1. MR (MapReduce) Architecture in detail?
Ans:-
MapReduce is a programming model used for efficient parallel processing
over large data sets in a distributed manner. The data is first split and then
combined to produce the final result. Libraries for MapReduce have been written in
many programming languages, with various different optimizations.
The purpose of MapReduce in Hadoop is to map each job
and then reduce it to equivalent tasks, providing less overhead over
the cluster network and reducing the processing power required. The MapReduce task
is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
The MapReduce task is mainly divided into two phases, i.e. the Map phase and
the Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into
key-value pairs. The input to the map may itself be a key-value pair, where
the key can be the ID of some kind of address and the value is the actual
data it holds. The Map() function is executed in its memory
repository on each of these input key-value pairs and generates
intermediate key-value pairs, which work as the input for the Reducer,
i.e. the Reduce() function.
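The two phases can be sketched with a minimal in-memory word count in Python (the sample documents and the simple split-based tokenizer are illustrative; a real Hadoop job would distribute these steps across the cluster):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word in the split."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: after shuffle-and-sort groups pairs by key, sum each key's counts."""
    pairs = sorted(pairs, key=itemgetter(0))  # stands in for shuffle & sort
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

docs = ["big data big ideas", "data flows fast"]          # two input splits
intermediate = [kv for doc in docs for kv in map_phase(doc)]
result = reduce_phase(intermediate)
print(result)  # word -> total count across all splits
```

In Hadoop, each input split runs its own mapper, and the framework performs the shuffle-and-sort before handing the grouped pairs to the reducers.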