
CH-1

Q. 1) What is Big data Analytics? List applications of it. 03 (3)


Ans.
The process of analysis of large volumes of diverse data sets, using advanced
analytic techniques is referred to as Big Data Analytics.
→These diverse data sets include structured, semi-structured, and unstructured
data, from different sources, and in different sizes from terabytes to zettabytes.
Such data sets are collectively referred to as big data.
→Big Data is a term used for data sets whose size or type is beyond the
capturing, managing, and processing ability of traditional relational databases.
Databases required to process big data need low latency, which traditional
databases cannot provide.
→ Applications of Big Data Analytics:
Healthcare
• Big Data has already started to create a huge difference in the healthcare
sector.
• With the help of predictive analytics, medical professionals and HCPs are now
able to provide personalized healthcare services to individual patients.
• Apart from that, fitness wearables, telemedicine, and remote monitoring - all
powered by Big Data and AI - are helping change lives for the better.
Academia
• Big Data is also helping enhance education today.
• Education is no more limited to the physical bounds of the classroom - there are
numerous online educational courses to learn from.
• Academic institutions are investing in digital courses powered by Big Data
technologies to aid the all-round development of budding learners.
Banking
• The banking sector relies on Big Data for fraud detection.
• Big Data tools can efficiently detect fraudulent acts in real time, such as misuse
of credit/debit cards, archival of inspection tracks, faulty alteration of customer
details, etc.
Manufacturing
• According to TCS Global Trend Study, the most significant benefit of Big Data in
manufacturing is improving the supply strategies and product quality.
• In the manufacturing sector, Big Data helps create a transparent infrastructure,
thereby predicting uncertainties and inefficiencies that can affect the business
adversely.

IT
• One of the largest users of Big Data, IT companies around the world are using
Big Data to optimize their functioning, enhance employee productivity, and
minimize risks in business operations.
• By combining Big Data technologies with ML and AI, the IT sector is continually
powering innovation to find solutions even for the most complex of problems.
Retail
• Big Data has changed the way of working in traditional brick and mortar retail
stores.
• Over the years, retailers have collected vast amounts of data from local
demographic surveys, POS scanners, RFID, customer loyalty cards, store
inventory, and so on.
• Now, they've started to leverage this data to create personalized customer
experiences, boost sales, increase revenue, and deliver outstanding customer
service.
• Retailers are even using smart sensors and Wi-Fi to track the movement of
customers, the most frequented aisles, for how long customers linger in the
aisles, among other things.
• They also gather social media data to understand what customers are saying
about their brand, their services, and tweak their product design and marketing
strategies accordingly.
Transportation
Big Data Analytics holds immense value for the transportation industry.
• In countries across the world, both private and government-run transportation
companies use Big Data technologies to optimize route planning, control traffic,
manage road congestion, and improve services.
• Additionally, transportation services even use Big Data for revenue management,
driving technological innovation, enhancing logistics, and of course, gaining the
upper hand in the market.

Q. 2) What is Big Data? Explain Challenges of Conventional System. 07(2)


Ans.
-> Big Data is a collection of data that is huge in volume, yet growing
exponentially with time.

->1. Lack of proper understanding of big data.


• Companies fail in their Big Data initiatives due to insufficient understanding.
• Employees may not know what data is, or understand its storage, processing,
importance, and sources.
• Data professionals may know what is going on, but others may not have a clear
picture.

->2. Data growth issues.


• One of the most pressing challenges of Big Data is storing all these huge sets of
data properly.
• The amount of data being stored in data centers and databases of companies is
increasing rapidly.
• As these data sets grow exponentially with time, they get extremely difficult to
handle.

->3. Confusion while Big Data tools selection.


• Companies often get confused while selecting the best tool for Big Data analysis
and storage.
• Is HBase or Cassandra the best technology for data storage?
• Is Hadoop MapReduce good enough, or will Spark be a better option for data
analytics and storage?

->4. Lack of data professionals


• To run these modern technologies and Big Data tools, companies need skilled
data professionals.
• These professionals include data scientists, data analysts, and data engineers
who are experienced in working with the tools and making sense out of huge data
sets.

->5. Securing data.


• Securing these huge sets of data is one of the daunting challenges of Big Data.
• Often companies are so busy understanding, storing, and analyzing their data
sets that they push data security to later stages.

->6. Integrating data from a variety of Sources


• Data in an organization comes from a variety of sources, such as social media
pages, ERP applications, customer logs, financial reports, emails, presentations,
and reports created by employees.
• Combining all this data to prepare reports is a challenging task.

Q. 3) Explain characteristics (4 V’s) of Big data. 04 (2)


Ans.
1. Volume:
• The name ‘Big Data’ itself is related to a size which is enormous.
• Volume is a huge amount of data.
• To determine the value of data, size of data plays a very crucial role. If
the volume of data is very large then it is actually considered as a ‘Big
Data’. This means whether a particular data can actually be considered
as a Big Data or not, is dependent upon the volume of data.
• Hence while dealing with Big Data it is necessary to consider a
characteristic ‘Volume’.
• Example: In the year 2016, the estimated global mobile traffic was 6.2
exabytes (6.2 billion GB) per month. It was also estimated that by the year 2020
there would be almost 40,000 exabytes of data.
2. Velocity:
• Velocity refers to the high speed of accumulation of data.
• In Big Data velocity data flows in from sources like machines, networks,
social media, mobile phones etc.
• There is a massive and continuous flow of data. This determines the
potential of data that how fast the data is generated and processed to
meet the demands.
• Sampling data can help in dealing with issues like velocity.
• Example: More than 3.5 billion searches are made on Google every day. Also,
Facebook users are increasing by approximately 22% year over year.
3. Variety:
• It refers to the nature of data: structured, semi-structured, and
unstructured data.
• It also refers to heterogeneous sources.
• Variety is basically the arrival of data from new sources that are both
inside and outside of an enterprise. It can be structured, semi-
structured and unstructured.
• Structured data: This data is basically organized data. It
generally refers to data that has a defined length and
format.
• Semi-structured data: This data is basically semi-organized
data. It is generally a form of data that does not conform to
the formal structure of data. Log files are examples of this
type of data.
• Unstructured data: This data basically refers to unorganized
data. It generally refers to data that doesn’t fit neatly into the
traditional row and column structure of the relational
database. Texts, pictures, videos etc. are the examples of
unstructured data which can’t be stored in the form of rows
and columns.
4. Veracity:
• It refers to inconsistencies and uncertainty in data; the available data can
sometimes get messy, and its quality and accuracy are difficult to control.
• Big Data is also variable because of the multitude of data dimensions
resulting from multiple disparate data types and sources.
• Example: Data in bulk could create confusion whereas less amount of
data could convey half or Incomplete Information.
5. Value:
• After taking the 4 V’s into account, there comes one more V, which
stands for Value. Bulk data having no value is of no good to the
company unless it is turned into something useful.
• Data in itself is of no use or importance; it needs to be converted
into something valuable to extract information. Hence, you can state
that Value is the most important of all the V’s.
6. Variability:
• How fast, and to what extent, is the structure of your data changing?
• How often does the meaning or shape of your data change?
• Example: it is as if the ice-cream you eat daily kept changing in taste.

Q. 4) Explain advantages and disadvantages of big data analytics 03


Ans. Advantages of Big Data
1. Making wiser decisions

Businesses use big data to enhance B2B operations, advertising, and communication.
Big data is primarily being used by many industries, such as travel, real estate,
finance, and insurance, to enhance decision-making. Businesses can use big data to
accurately predict what customers want and don't want, as well as their behavioral
tendencies, because it reveals more information in a usable format.

Big data provides business intelligence and cutting-edge analytical insights that
help with decision-making. A company can get a more in-depth picture of its target
market by collecting more customer data.

2. Cut back on the expense of business operations

According to surveys done by New Vantage and Syncsort (now Precisely), big data
analytics has helped businesses significantly cut their costs. Big data is being used
to cut costs, according to 66.7% of survey participants from New Vantage.
Moreover, 59.4% of Syncsort survey participants stated that using big data tools
improved operational efficiency and reduced costs. Hadoop and Cloud-Based
Analytics, two popular big data analytics tools, can help lower the cost of storing
big data.

3. Detection of Fraud
Financial companies especially use big data to identify fraud. To find anomalies and
transaction patterns, data analysts use artificial intelligence and machine learning
algorithms. These irregularities in transaction patterns show that something is out
of place or that there is a mismatch, providing us with hints about potential fraud.

For credit unions, banks, and credit card companies, fraud detection is crucial for
identifying account information, materials, or product access. By spotting frauds
before they cause problems, any industry, including finance, can provide better
customer service.

4. A rise in productivity

A survey by Syncsort found that 59.9% of respondents said they were using big
data analytics tools like Spark and Hadoop to boost productivity. They have been
able to increase sales and improve customer retention as a result of this rise in
productivity. Modern big data tools make it possible for data scientists and analysts
to analyse a lot of data quickly and effectively, giving them an overview of more
data.

They become more productive as a result of this. Additionally, big data analytics
aids data scientists and analysts in learning more about themselves to figure out
how to be more effective in their tasks and job responsibilities. As a result, investing
in big data analytics gives businesses across all sectors a chance to stand out
through improved productivity.

5. Enhanced customer support

As part of their marketing strategies, businesses must improve customer interactions.
Since big data analytics give businesses access to more information, they can use
that information to make more specialised, highly personalised offers to each
individual customer as well as more targeted marketing campaigns.

Social media, email exchanges, customer CRM (customer relationship management)
systems, and other major data sources are the main sources of big data. As a
result, it provides businesses with access to a wealth of data about the needs,
interests, and trends of their target market.

Big data also enables businesses to better comprehend the thoughts and feelings
of their clients to provide them with more individualised goods and services.
Providing a personalised experience can increase client satisfaction, strengthen
bonds with clients, and, most importantly, foster loyalty.

6. Enhanced speed and agility

Increasing business agility is a big data benefit for competition. Big data analytics
can assist businesses in becoming more innovative and adaptable in the
marketplace. Large customer data sets can be analysed to help businesses gain
insights ahead of the competition and more effectively address customer pain
points.

Additionally, having a wealth of data at their disposal enables businesses to assess
risks, enhance products and services, and improve communications. Additionally,
big data assists businesses in strengthening their business tactics and strategies,
which are crucial in coordinating their operations to support frequent and quick
changes in the industry.

7. Greater innovation

Innovation is another common benefit of big data, and the NewVantage survey
found that 11.6 per cent of executives are investing in analytics primarily as a means
to innovate and disrupt their markets. They reason that if they can glean insights
that their competitors don't have, they may be able to get out ahead of the rest of
the market with new products and services.

Disadvantages of Big Data

1. A talent gap

A study by AtScale found that for the past three years, the biggest challenge in this
industry has been a lack of big data specialists and data scientists. Given that it
requires a different skill set, big data analytics is currently beyond the scope of
many IT professionals. Finding data scientists who are also knowledgeable about
big data can be difficult.

Data scientists and big data specialists are two well-paid professions in the data
science industry. As a result, hiring big data analysts can be very costly for
businesses, particularly for start-ups. Some businesses must wait a long time to hire
the necessary personnel to carry out their big data analytics tasks.
2. Security hazard

For big data analytics, businesses frequently collect sensitive data. This data needs
to be protected, and the consequences can be detrimental if it is not properly
secured.

Additionally, having access to enormous data sets can attract the unwanted
attention of hackers, and your company could become the target of a potential
cyber-attack. You are aware that for many businesses today, data breaches are the
biggest threat. Unless you take all necessary precautions, important information
could be leaked to rivals, which is another risk associated with big data.

3. Adherence

Another disadvantage of big data is the requirement for legal compliance with
governmental regulations. To store, handle, maintain, and process big data that
contains sensitive or private information, a company must make sure that they
adhere to all applicable laws and industry standards. As a result, managing data
governance tasks, transmission, and storage will become more challenging as big
data volumes grow.

4. High Cost

Given that it is a science that is constantly evolving and has as its goal the processing
of ever-increasing amounts of data, only large companies can sustain the
investment in the development of their Big Data techniques.

5. Data quality

Dealing with data quality issues is a major drawback of working with big data.
Data scientists and analysts must ensure the data they are using is accurate,
pertinent, and in the right format for analysis before they can use big data for
analytics efforts.

This significantly slows down the reporting process, but if businesses don't address
data quality problems, they may discover that the insights their analytics produce
are useless or even harmful if used.

6. Rapid Change
The fact that technology is evolving quickly is another potential disadvantage of big
data analytics. Businesses must deal with the possibility of spending money on one
technology only to see something better emerge a few months later. This big data
drawback was ranked fourth among all the potential difficulties by Syncsort
respondents.

Q. 5) Explain types of Big Data 03


Ans.

Structured Data

• Structured data can be crudely defined as the data that resides in a fixed field
within a record.
• It is the type of data most familiar in our everyday lives, for example: birthday,
address.
• A certain schema binds it, so all the data has the same set of
properties. Structured data is also called relational data. It is split
into multiple tables to enhance the integrity of the data by creating
a single record to depict an entity. Relationships are enforced by
the application of table constraints.
• The business value of structured data lies within how well an
organization can utilize its existing systems and processes for
analysis purposes.
Sources of structured data

A Structured Query Language (SQL) is needed to bring the data together. Structured
data is easy to enter, query, and analyze, since all of the data follows the same
format. However, forcing a consistent structure also means that any alteration of
data is difficult, as each record has to be updated to adhere to the new structure.
Examples of structured data include numbers, dates, strings, etc. The business data
of an e-commerce website can be considered to be structured data.
Name   Class  Section  Roll No  Grade
Geek1  11     A        1        A
Geek2  11     A        2        B
Geek3  11     A        3        A

Cons of Structured Data


1. Structured data can only be leveraged in cases of predefined
functionalities. This means that structured data has limited
flexibility and is suitable for certain specific use cases only.
2. Structured data is stored in a data warehouse with rigid constraints
and a definite schema. Any change in requirements would mean
updating all of that structured data to meet the new needs. This is
a massive drawback in terms of resource and time management.

Semi-Structured Data

• Semi-structured data is not bound by any rigid schema for data storage and
handling. The data is not in the relational format and is not neatly organized into
rows and columns like that in a spreadsheet. However, there are some features like
key-value pairs that help in discerning the different entities from each other.
• Since semi-structured data doesn’t need a structured query
language, it is commonly called NoSQL data.
• A data serialization language is used to exchange semi-structured
data across systems that may even have varied underlying
infrastructure.
• Semi-structured content is often used to store metadata about a
business process but it can also include files containing machine
instructions for computer programs.
• This type of information typically comes from external sources
such as social media platforms or other web-based data feeds.

Unstructured Data

• Unstructured data is the kind of data that doesn’t adhere to any definite schema
or set of rules. Its arrangement is unplanned and haphazard.
• Photos, videos, text documents, and log files can be generally
considered unstructured data. Even though the metadata
accompanying an image or a video may be semi-structured, the
actual data being dealt with is unstructured.
• Additionally, Unstructured data is also known as “dark data”
because it cannot be analyzed without the proper software tools.


Q. 6) Describe Traditional vs. Big Data business approach. 04


Ans.
Traditional Data vs. Big Data:
1. Traditional data is generated at the enterprise level. Big data is generated
outside the enterprise level as well.
2. Traditional data volume ranges from gigabytes to terabytes. Big data volume
ranges from petabytes to zettabytes or exabytes.
3. Traditional database systems deal with structured data. Big data systems deal
with structured, semi-structured, and unstructured data.
4. Traditional data is generated per hour or per day or more. Big data is generated
more frequently, mainly per second.
5. The traditional data source is centralized and is managed in centralized form.
The big data source is distributed and is managed in distributed form.
6. Data integration is very easy for traditional data. Data integration is very
difficult for big data.
7. A normal system configuration is capable of processing traditional data. A high
system configuration is required to process big data.
8. The size of traditional data is very small. The size of big data is far larger.
9. Traditional database tools are required to perform any database operation.
Special kinds of database tools are required to perform database schema-based
operations on big data.
10. Normal functions can manipulate traditional data. Special kinds of functions
are needed to manipulate big data.
11. The traditional data model is strict schema based and static. The big data
model is flat schema based and dynamic.
12. Traditional data is stable, with known inter-relationships. Big data is not
stable, with unknown relationships.
13. Traditional data is of manageable volume. Big data is of huge volume, which
becomes unmanageable.
14. Traditional data is easy to manage and manipulate. Big data is difficult to
manage and manipulate.
15. Traditional data sources include ERP transaction data, CRM transaction data,
financial data, organizational data, web transaction data, etc. Big data sources
include social media, device data, sensor data, video, images, audio, etc.

Q. 7) Mention few applications where big data analytics are useful. 04


Ans.
1. Tracking Customer Spending Habit, Shopping Behavior
2. Recommendation:
3. Smart Traffic System
4. Secure Air Traffic System
5. Auto Driving Car:
6. Virtual Personal Assistant Tool
7. IoT
8. Education Sector:
9. Energy Sector:
10.Media and Entertainment Sector

Q. 8) Discuss big data in healthcare, transportation. 03


Ans. Healthcare
• Big Data has already started to create a huge difference in the healthcare
sector.
• With the help of predictive analytics, medical professionals and HCPs are now
able to provide personalized healthcare services to individual patients.
• Apart from that, fitness wearables, telemedicine, and remote monitoring - all
powered by Big Data and AI - are helping change lives for the better.
Transportation
Big Data Analytics holds immense value for the transportation industry.
• In countries across the world, both private and government-run transportation
companies use Big Data technologies to optimize route planning, control traffic,
manage road congestion, and improve services.
• Additionally, transportation services even use Big Data for revenue management,
driving technological innovation, enhancing logistics, and of course, gaining the
upper hand in the market.
Q. 9)Describe any application that you know related to enhancing a particular
business using big data, and explain why it is important from a business perspective.
07
Ans.
Retail: shopping malls, market basket analysis.

Q. 10) What are the benefits of Big Data? Discuss challenges under Big Data. How
Big Data Analytics can be useful in the development of smart cities. 07
Ans.
• Better decision making
• Reduced costs of business processes
• Fraud detection
• Increased productivity
• Improved customer service
• Increased agility
• Sharing and Accessing Data:
o Perhaps the most frequent challenge in big data efforts is the
inaccessibility of data sets from external sources.
o Sharing data can cause substantial challenges.
o It includes the need for inter- and intra-institutional legal
documents.
o Accessing data from public repositories leads to multiple
difficulties.
o Data must be available in an accurate, complete, and timely
manner, because if data in a company's information system is
to be used to make accurate decisions in time, it has to be
available in that form.

• Privacy and Security:


o It is another most important challenge with Big Data. This
challenge includes sensitive, conceptual, technical as well as
legal significance.
o Most organizations are unable to maintain regular checks due
to the large amounts of data being generated. However, it is
necessary to perform security checks and observation in real
time, because that is most beneficial.
o Some personal information, when combined with external large
data sets, can reveal facts about a person that are private and
that the person might not want the data owner to know.
o Some organizations collect people's information in order to add
value to their business. This is done by deriving insights into
their lives that they are unaware of.

1. Analytical Challenges:
• There are some huge analytical challenges in big data, which
raise questions such as: how to deal with the problem if the
data volume gets too large?
• Or how to find out the important data points?
• Or how to use the data to its best advantage?
• The large amounts of data on which such analysis is to be
done can be structured (organized data), semi-structured
(semi-organized data), or unstructured (unorganized data).
There are two techniques through which decision making can
be done:
• Either incorporate massive data volumes in the
analysis.
• Or determine upfront which Big data is relevant.
2. Technical challenges:
• Quality of data:
• When there is a collection of a large amount of
data and storage of this data, it comes at a cost.
Big companies, business leaders and IT leaders
always want large data storage.
• For better results and conclusions, Big Data focuses on
storing quality data rather than irrelevant data.
• This further raises the questions of how to ensure that
the data is relevant, how much data would be enough for
decision making, and whether the stored data is accurate
or not.
• Fault tolerance:
• Fault tolerance is another technical challenge
and fault tolerance computing is extremely hard,
involving intricate algorithms.
• New technologies like cloud computing and big data intend
that whenever a failure occurs, the damage done should stay
within an acceptable threshold, so that the whole task does
not have to begin from scratch.
• Scalability:
• Big data projects can grow and evolve rapidly.
The scalability issue of Big Data has led
towards cloud computing.
• It leads to various challenges like how to run
and execute various jobs so that goal of each
workload can be achieved cost-effectively.
• It also requires dealing with the system failures
in an efficient manner. This leads to a big
question again that what kinds of storage
devices are to be used.

Big data in smart cities:


Traffic Management

Increasing population and number of vehicles will give rise to traffic issues. The
smart city solutions can harness the real-time data for delivering improved
mobility and resolving traffic-related problems. In the coming years, traffic
management will become more significant. As per an estimate, the revenue of
traffic-focused technology for the smart city will be valued at $4.4 billion by
2023.

The smarter traffic system is based on big data analytics solutions. These
solutions assist the administration to fetch and share the information across
various departments to take prompt action. Big data solutions are designed to
gather all sorts of information using sensors for ensuring real-time traffic
control. Also, these solutions can predict traffic trends based on mathematical
models and current scenario.

Public Safety

In any big or small city, public safety is highly important. Data insights can be
used to identify the most possible reasons for crime and find vulnerable areas
in the city. Data-driven smart cities can readily unfold any criminal offenses and
the administration can implement safety-related measures more effectively.
Once the high-crime-prone zones are defined, it is easy for law enforcement
agencies to act in a ‘smarter’ way to prevent crimes before they occur.

Emergency services will also get a boost from real-time data. For example, it is
possible for police, fire, and ambulance to rush at the spot immediately with the
help of real-time data about any unwanted incident. Different data sets can
handle the crisis situations during the natural or manmade calamities.

Public Spending
A lot of investment from the government is necessary for transforming any city.
In the smart city, the city planner can spend money either for renovation or
redesigning the city. Before spending money and allocating the budget for a
smart city, it is better to go through the collected data. The data can suggest the
major areas that need upgrade or revamping. Also, big data analytics solutions
can predict the situation of particular areas in the future. It can give the city
planners a clear idea of how they can invest in such areas to get the maximum
value for money.

City Planning

Smart city solutions can easily provide infrastructure requirements and necessary
measures when accurate data is available. Big data can highlight
developing and developed areas with ease and contribute to city planning. A
robust data strategy holds a key to a smart city. Rich data collected from the
cities that are on the way to become smart cities can be used to streamline the
operations and simplify the complexities.

Growth and Development

Big data solutions aim at providing continuous updates about the necessary
change. In a way, these solutions ascertain the growth of a smart city. It is
necessary for the smart city to implement desired features and infrastructure-
based facilities to attract and retain more people. In other words, sustainability
can be achieved by proper monitoring and control over the city’s infrastructure.
Big data analytics can play a big role in achieving this objective by collecting
real-time information.

Ch – 2
Q. 1)Draw and Explain HDFS architecture. How can you restart Name Node & all
the daemons in Hadoop? 07(4)
Ans.
Q. 2)What is MapReduce? Explain working of various phases of MapReduce with
appropriate example and diagram.07(4)
Ans.
Q. 3)What is Hadoop? Briefly explain the core components and advantages of it.
07 (3)
Ans.
Advantages of Hadoop
• Varied data sources: Hadoop accepts a variety of data.
• Cost-effective: Hadoop is an economical solution as it uses a cluster
of commodity hardware to store data.
• Performance
• Fault-tolerant
• Highly available
• Low network traffic
• High throughput
• Open source

Q. 4)What is Data Serialization? Explain advantages and drawbacks of Hadoop
serialization 03 (2)
Ans. ->Data serialization is the process of converting structured data into a format
that can be stored or transmitted across a network. This format is often a binary
representation of the data, which can be easily read and written by computer
programs.

->In Hadoop, data serialization is used to convert data objects into a format that
can be stored in Hadoop Distributed File System (HDFS) and processed by
MapReduce.

->The Hadoop serialization framework allows developers to choose from a number
of different serialization formats, such as Avro, SequenceFile, and Writable.
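
As a rough, language-neutral illustration of the idea (not Hadoop's Writable or Avro
APIs), the Python sketch below serializes the same record as readable JSON text and
as a compact binary encoding; the field layout used for the binary form is an
assumption chosen just for this example.

import json
import struct

# A small record we want to store or send over the network.
record = {"user_id": 42, "score": 97.5, "name": "geek"}

# Text serialization: human-readable, but relatively large.
as_json = json.dumps(record).encode("utf-8")

# Binary serialization: pack user_id (4-byte int) and score (8-byte float),
# then the UTF-8 name prefixed by its length (layout chosen for this example).
name_bytes = record["name"].encode("utf-8")
as_binary = struct.pack(">id", record["user_id"], record["score"])
as_binary += struct.pack(">I", len(name_bytes)) + name_bytes

print(len(as_json), len(as_binary))   # the binary form is noticeably smaller

# Deserialization reverses the process.
user_id, score = struct.unpack(">id", as_binary[:12])
(name_len,) = struct.unpack(">I", as_binary[12:16])
name = as_binary[16:16 + name_len].decode("utf-8")
assert (user_id, score, name) == (42, 97.5, "geek")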

**Advantages of Hadoop serialization include:

->Improved performance: Serialized data is smaller and can be read and written
faster than non-serialized data.

->Language independence: Serialized data can be read and written by programs
written in different programming languages.

->Interoperability: Data can be easily exchanged between different systems and
platforms.

**Drawbacks of Hadoop serialization include:

->Complexity: The process of serializing and deserializing data can be complex and
time-consuming.
->Limited flexibility: The chosen serialization format may not be well-suited for all
types of data or use cases.

->Increased memory usage: Serialized data takes up more memory than non-
serialized data.

Hadoop serialization is also known for its efficient data compression and data
transfer, which reduces the network I/O and storage costs. However, it can also
increase the CPU load when deserializing and serializing large data sets, which
can be a drawback.
Q. 5)How is Big data and Hadoop related? 03
Ans.

Big data and Hadoop are interrelated. In simple words, big data is massive
amount of data that cannot be stored, processed, or analyzed using the
traditional methods. Big data consists of vast volumes of various types of
data which is generated at a high speed. To overcome the issue of storing,
processing, and analyzing big data, Hadoop is used.

Hadoop is a framework that is used to store and process big data in a distributed
and parallel way. In Hadoop, storing vast volumes of data becomes easy as the data
is distributed across various machines, and the data is also processed in parallel,
which saves time.

Q. 6)Brief Anatomy of a Map Reduce Job run and Failures. 07


Ans.
Q. 7)How HDFS is different from traditional NFS? 04
Ans.
Q. 8)What do you mean by job scheduling in Hadoop? List different schedulers in
Hadoop. 03
Ans.
Q. 9)Describe Map Reduce Types and Formats 07
Ans.
Q. 10)What are the basic challenges with Big Data? How does Hadoop help to
overcome these challenges? 03
Ans. The 4 V’s (Volume, Velocity, Variety, Veracity); Hadoop helps by providing
distributed storage, processing, and analysis of the data.

Q. 11)What is distributed file system? Explain key features of Hadoop Distributed
File System. 03
Ans.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• The Map task takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key-value pairs).
• The Reduce task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their
significance.
Q. 12)Explain working of MapReduce with reference to ‘Wordcount’ program for
a file having input as: Red apple Red wine Green apple Green peas Pink rose Blue
whale Blue sky Green city Clean city 07
Ans.
Step 1 :
Map creates a key value for each word on the line containing the word
as the key, and 1 as the value
So key value pairs are ;
(Red,1)( Red,1)
(apple,1) (apple,1)
(wine,1)
(Green,1) (Green,1) (Green,1)
(peas,1)
(Pink,1)
(rose,1)
(Blue, 1)( Blue, 1)
(whale,1)
(sky,1)
(city,1)(city,1)
(Clean,1)
Step 2:
Next, all the key-values that were output from map are sorted based on
their key. And the key values, with the same word, are moved, or
shuffled, to the same node.
Step 3:
Reduce function
(Red,1)( Red,1)->(Red,2)
(apple,1) (apple,1)->(apple,2)
(wine,1)
(Green,1) (Green,1) (Green,1)->(Green,3)
(peas,1)
(Pink,1)
(rose,1)
(Blue, 1)( Blue, 1)->(Blue,2)
(whale,1)
(sky,1)
(city,1)(city,1)->(city,2)
(Clean,1)
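
A minimal Python sketch that simulates the same Map, Shuffle/Sort, and Reduce steps
in one process (in a real Hadoop Streaming job the map and reduce parts would run
as separate scripts on the cluster):

from itertools import groupby

text = ("Red apple Red wine Green apple Green peas Pink rose "
        "Blue whale Blue sky Green city Clean city")

# Map phase: emit (word, 1) for every word.
mapped = [(word, 1) for word in text.split()]

# Shuffle and sort phase: bring pairs with the same key together.
mapped.sort(key=lambda kv: kv[0])

# Reduce phase: sum the counts for each word.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}

print(counts)   # e.g. {'Blue': 2, 'Clean': 1, 'Green': 3, 'Red': 2, ...}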

Q. 13)Explain Hadoop commands to move the data in and out of HDFS. 03


Ans.

moveFromLocal: This command will move file from local to hdfs.


Syntax:
bin/hdfs dfs -moveFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -moveFromLocal ../Desktop/cutAndPaste.txt /geeks
copyToLocal (or) get: To copy files/folders from hdfs store to local file system.
Syntax:
bin/hdfs dfs -copyToLocal <<srcfile(on hdfs)> <local file dest>
Example:
bin/hdfs dfs -copyToLocal /geeks ../Desktop/hero

Q. 14)What is the role of a “combiner” in the Map Reduce ? Explain with the help of an example 07
Ans.

Combiner always works in between the Mapper and the Reducer. The output produced
by the Mapper is the intermediate output in terms of key-value pairs, which is
massive in size. If we directly feed this huge output to the Reducer, then that
will result in increasing network congestion. So to minimize this network
congestion we have to put a combiner in between the Mapper and the Reducer.
These combiners are also known as semi-reducer. It is not necessary to add a
combiner to your Map-Reduce program, it is optional. Combiner is also a class
in our java program like Map and Reduce class that is used in between
this Map and Reduce classes. Combiner helps us to produce abstract details or
a summary of very large datasets. When we process or deal with very large
datasets using Hadoop Combiner is very much necessary, resulting in the
enhancement of overall performance.

How does combiner work?

Consider an example where two Mappers contain different data: the main text file
is divided between two different Mappers. Each mapper is assigned to process a
different line of our data. In this example, we have two lines of data, so we have
two Mappers to handle each line. Mappers produce the
intermediate key-value pairs, where the name of the particular word is key and
its count is its value. For example for the data Geeks For Geeks For the key-
value pairs are shown below.
// Key Value pairs generated for data Geeks For Geeks For

(Geeks,1)
(For,1)
(Geeks,1)
(For,1)
The key-value pairs generated by the Mapper are known as the intermediate key-
value pairs or intermediate output of the Mapper. Now we can minimize the
number of these key-value pairs by introducing a combiner for each Mapper in
our program. In our case, we have 4 key-value pairs generated by each of the
Mapper. Since these intermediate key-value pairs are not ready to be fed directly
to the Reducer (that would increase network congestion), the Combiner will combine
these intermediate key-value pairs before sending them to the Reducer.
The combiner combines these intermediate key-value pairs as per their key. For
the above example for data Geeks For Geeks For the combiner will partially
reduce them by merging the same pairs according to their key value and
generate new key-value pairs as shown below.
// Partially reduced key-value pairs with combiner
(Geeks,2)
(For,2)
With the help of Combiner, the Mapper output got partially reduced in terms of
size(key-value pairs) which now can be made available to the Reducer for better
performance. Now the Reducer will again Reduce the output obtained from
combiners and produces the final output that is stored on HDFS(Hadoop
Distributed File System).
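
A small Python simulation of this behaviour, where each Mapper's output is partially
reduced by a combiner before the shuffle; the two-line input and the helper names
are assumptions made just for this illustration:

from collections import Counter

# Two mappers, each handling one line of the input, as in the example above.
mapper_inputs = ["Geeks For Geeks For", "Geeks For Geeks For"]

def run_mapper(line):
    # Map phase: emit (word, 1) for every word on the line.
    return [(word, 1) for word in line.split()]

def run_combiner(pairs):
    # Combiner: partially reduce a single mapper's output by key.
    return list(Counter(word for word, _ in pairs).items())

# Without a combiner, 4 pairs per mapper travel to the reducer;
# with a combiner, only 2 partially reduced pairs per mapper are shuffled.
combined_outputs = [run_combiner(run_mapper(line)) for line in mapper_inputs]
print(combined_outputs)   # [[('Geeks', 2), ('For', 2)], [('Geeks', 2), ('For', 2)]]

# Reduce phase: merge the partially reduced pairs from all mappers.
final = Counter()
for pairs in combined_outputs:
    for word, count in pairs:
        final[word] += count
print(dict(final))        # {'Geeks': 4, 'For': 4}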

Q. 15)Explain basic Components of Analyzing the Data with Hadoop. 03


Ans.
Q. 16)Explain any two commands of HDFS with example: (i)copyFromLocal (ii) mv (iii) cat 03

Ans. copyFromLocal (or) put: To copy files/folders from the local file system to
the hdfs store.
Syntax:
bin/hdfs dfs -copyFromLocal <local src> <dest(on hdfs)>
Example:
bin/hdfs dfs -copyFromLocal ../Desktop/cutAndPaste.txt /geeks

mv: This command is used to move files within hdfs. Lets cut-paste a
file myfile.txt from geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied

cat: To print file contents.


Syntax:
bin/hdfs dfs -cat <path>
Example:
// print the content of AI.txt present
// inside geeks folder.
bin/hdfs dfs -cat /geeks/AI.txt

Q. 17)How does HDFS ensure data Integrity in a Hadoop Cluster? 07

Ans.

Data Integrity ensures the correctness of the data. However, it is possible that the
data will get corrupted during I/O operation on the disk. Corruption can occur due
to various reasons like faults in a storage device, network faults, or buggy
software. The Hadoop HDFS framework implements checksum checking on the
contents of HDFS files. In Hadoop, when a client creates an HDFS file, it computes a
checksum of each block of file and stores these checksums in a separate hidden file
in the same HDFS namespace.

When the HDFS client retrieves file contents, it first verifies that the data it received
from each Datanode matches the checksum stored in the associated checksum file.
And if not, then the client can opt to retrieve that block from another DataNode
that has a replica of that block.

1) Data Integrity means to make sure that no data is lost or corrupted during
storage or processing of the Data.

2) Since in Hadoop the amount of data being written or read is large in volume,
the chances of data corruption are higher.
3) So in Hadoop a checksum is computed when data is written to the disk for the
first time and is checked again while reading data from the disk. If the checksum
matches the original checksum, the data is said to be uncorrupted; otherwise it is
said to be corrupted.

4) It is just error detection for the data.

5) It is possible that it’s the checksum that is corrupt, not the data, but this is very
unlikely, because the checksum is much smaller than the data

6) HDFS uses a more efficient variant called CRC-32C to calculate checksum.

7) DataNodes are responsible for verifying the data they receive before storing the
data and its checksum. Checksum is computed for the data that they receive from
clients and from other DataNodes during replication

8) Hadoop can heal corrupted data by copying one of the good replicas to produce
a new, uncorrupted replica.

9) if a client detects an error when reading a block, it reports the bad block and the
DataNodes it was trying to read from to the NameNode before throwing a
Checksum Exception.

10) The NameNode marks the block replica as corrupt so that it doesn’t direct any
more clients to it or try to copy this replica to other DataNodes.

11) It schedules a copy of the block to be replicated on another DataNode, so that
its replication factor is back at the expected level.

12) Once this has happened, the corrupt replica is deleted
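
A simplified Python sketch of the block-plus-checksum idea described above (HDFS
itself computes CRC-32C per fixed-size chunk of each block; plain CRC-32 from zlib
and a tiny block size are used here only as stand-ins):

import zlib

BLOCK_SIZE = 4  # tiny block size just for the demo; HDFS blocks are typically 128 MB

def write_blocks(data):
    # Split data into blocks and store a CRC-32 checksum alongside each block.
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    return [(block, zlib.crc32(block)) for block in blocks]

def read_blocks(stored):
    # Re-verify each block's checksum on read; flag corrupt blocks.
    good = []
    for i, (block, checksum) in enumerate(stored):
        if zlib.crc32(block) != checksum:
            print(f"block {i} is corrupt - fetch a replica from another DataNode")
        else:
            good.append(block)
    return b"".join(good)

stored = write_blocks(b"hello hdfs data")
stored[1] = (b"XXXX", stored[1][1])   # simulate corruption of one block on disk
print(read_blocks(stored))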


CH-3

Q. 1)What are the features of MongoDB and How MongoDB is better than SQL database? 04 (4)

Ans.
MongoDB: MongoDB was designed with high availability and scalability in mind, and
includes out-of-the-box replication and sharding.
MySQL: The MySQL concept does not allow efficient replication and sharding, but in
MySQL one can access associated data using joins, which minimizes duplication.

Differences in Terminology
There are differences based on terminology between MongoDB and MySQL.

[Figure: terminology differences between MongoDB and MySQL]

Data Representation
The way data is represented and stored in the two databases is quite different.
MongoDB stores data in form of JSON-like documents and MySQL stores data
in form of rows of table as mentioned earlier.
Example: To show how data is stored and represented in MongoDB and
MySQL.
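
A small sketch of that contrast, using Python's built-in sqlite3 as a stand-in for a
relational (MySQL-style) table and a plain dict for the JSON-like MongoDB document;
the student fields are assumed for illustration, and with a live MongoDB server the
dict could be inserted via pymongo's insert_one:

import sqlite3

# Relational (MySQL-style) representation: a fixed schema of rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, class INTEGER, roll_no INTEGER)")
conn.execute("INSERT INTO students VALUES ('Geek1', 11, 1)")
print(conn.execute("SELECT * FROM students").fetchall())

# Document (MongoDB-style) representation: a flexible JSON-like document that
# can nest related data instead of joining separate tables.
student_doc = {
    "name": "Geek1",
    "class": 11,
    "roll_no": 1,
    "subjects": [{"title": "Maths", "grade": "A"}, {"title": "Physics", "grade": "B"}],
}
print(student_doc)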

Q. 2)Explain CRUD operations in MongoDB 04 (3)

Ans.
Q. 3)Explain Replication and scaling feature of MongoDB. 04 (2)

Ans.

Q. 4)Differentiate SQL Vs NoSQL.07 (2)

Ans.

SQL vs. NoSQL:
• SQL databases are relational database management systems (RDBMS). NoSQL
databases are non-relational or distributed database systems.
• SQL databases have a fixed, static, or predefined schema. NoSQL databases have
a dynamic schema.
• SQL databases are not suited for hierarchical data storage. NoSQL databases are
best suited for hierarchical data storage.
• SQL databases are best suited for complex queries. NoSQL databases are not so
good for complex queries.
• SQL databases are vertically scalable. NoSQL databases are horizontally scalable.
• SQL follows the ACID properties. NoSQL follows the CAP theorem (consistency,
availability, partition tolerance).
• Examples of SQL databases: MySQL, PostgreSQL, Oracle, MS-SQL Server, etc.
Examples of NoSQL databases: MongoDB, GraphQL, HBase, Neo4j, Cassandra, etc.
1. Structure –
SQL databases are table-based on the other hand NoSQL databases
are either key-value pairs, document-based, graph databases or wide-
column stores. This makes relational SQL databases a better option
for applications that require multi-row transactions such as an
accounting system or for legacy systems that were built for a relational
structure.
2. Property followed –
SQL databases follow ACID properties (Atomicity, Consistency,
Isolation and Durability) whereas the NoSQL database follows the
Brewers CAP theorem (Consistency, Availability and Partition
tolerance).
3. Support –
Great support is available for all SQL databases from their vendors.
There are also a lot of independent consultants who can help you with
SQL databases for very large scale deployments, but for some NoSQL
databases you still have to rely on community support, and only
limited outside experts are available for setting up and deploying
your large scale NoSQL deployments.

Q. 5)What is NoSQL database? List the differences between NoSQL and relational databases. Explain
various types of NoSQL databases . 07(2)

Ans.

NoSQL, originally referring to "non-SQL" or "non-relational", is a database that
provides a mechanism for storage and retrieval of data. This data is modeled in
ways other than the tabular relations used in relational databases.

Types of NoSQL Database:

• Document-based databases
• Key-value stores
• Column-oriented databases
• Graph-based databases
Document-Based Database:

The document-based database is a nonrelational database. Instead of storing the
data in rows and columns (tables), it uses documents to store the data in the
database. A document database stores data in JSON, BSON, or XML documents.
Documents can be stored and retrieved in a form that is much closer to the data
objects used in applications which means less translation is required to use these
data in the applications. In the Document database, the particular elements can
be accessed by using the index value that is assigned for faster querying.
Collections are groups of documents that have similar contents. The documents in
a collection are not required to have the same schema, because document databases
have a flexible schema.
Key features of documents database:
• Flexible schema: Documents in the database has a flexible schema. It
means the documents in the database need not be the same schema.
• Faster creation and maintenance: the creation of documents is easy
and minimal maintenance is required once we create the document.
• No foreign keys: There is no dynamic relationship between two
documents so documents can be independent of one another. So, there
is no requirement for a foreign key in a document database.
• Open formats: To build a document we use XML, JSON, and others.
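
A minimal pymongo sketch of working with such a document store, assuming a MongoDB
server running on localhost and illustrative database/collection names:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB server
db = client["shop"]                                 # illustrative database name

# Two documents in the same collection need not share the same schema.
db.products.insert_one({"name": "laptop", "price": 55000, "tags": ["electronics"]})
db.products.insert_one({"name": "notebook", "pages": 200})

# Query by a field; only documents having that field can match.
for doc in db.products.find({"price": {"$gt": 10000}}):
    print(doc["name"], doc["price"])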
Key-Value Stores:

A key-value store is a nonrelational database. The simplest form of a NoSQL
database is a key-value store. Every data element in the database is stored in
key-value pairs. The data can be retrieved by using a unique key allotted to each
element in the database. The values can be simple data types like strings and
numbers or complex objects.

A key-value store is like a relational database with only two columns which is the
key and the value.
Key features of the key-value store:
• Simplicity.
• Scalability.
• Speed.

Column Oriented Databases:

A column-oriented database is a non-relational database that stores the data in
columns instead of rows. That means when we want to run analytics on a small
number of columns, you can read those columns directly without consuming
memory with the unwanted data.
Columnar databases are designed to read data more efficiently and retrieve the
data with greater speed. A columnar database is used to store a large amount of
data.
Key features of column-oriented databases:
• Scalability.
• Compression.
• Very responsive.

Graph-Based databases:

Graph-based databases focus on the relationship between the elements. They store
the data in the form of nodes in the database. The connections between the nodes
are called links or relationships.
Key features of graph database:
• In a graph-based database, it is easy to identify the relationship
between the data by using the links.
• The Query’s output is real-time results.
• The speed depends upon the number of relationships among the
database elements.

Q. 6)Describe analyzing big data with a shared-nothing architecture. 07

Ans.
Q. 7)Difference between master-slave versus peer-to-peer distribution models. 03

Ans. **In a master-slave distribution model,

->there is a central authority, or "master," that controls and manages all the other nodes, or "slaves," in
the network.

-> The master node is responsible for distributing tasks and coordinating the work of the slave nodes.

->This model is often used in distributed computing systems, where a central server is responsible for
managing a group of worker nodes.

-> Decision making is straightforward because it is handled by the master node only.

->it can have a single point of failure and a bottleneck in the master node.

->This architecture is not good for fault tolerance.

->This model is not suitable for high availability.

**In Peer-to-Peer distribution Model,

->a peer-to-peer (P2P) distribution model does not have a central authority.

->Instead, all nodes in the network are equal and can communicate with each other directly. Each node
acts as both a client and a server, making requests and providing resources to other nodes.

-> This model is often used in decentralized systems, such as file sharing networks.

->It can have issues when it comes to decision making, coordination and consistency.

-> there is no central point of control or failure


->This architecture is good for fault tolerance and high reliability.

->This model is best suited for high availability.

Q. 8)What are four ways that NoSQL systems handle big data problems? 07

Ans.

Q. 9)Explain NoSQL data architecture. 03

Ans.

Architecture Pattern is a logical way of categorizing data that will be stored in
the database. NoSQL is a type of database which helps to perform operations on
big data and store it in a valid format. It is widely used because of its
flexibility and its wide variety of services.
Architecture Patterns of NoSQL:
The data is stored in NoSQL in any of the following four data architecture
patterns.
1. Key-Value Store Database
2. Column Store Database
3. Document Database
4. Graph Database
Q. 10)Write difference between MongoDB and Hadoop. 03

Ans.

Hadoop vs. MongoDB:
1. Format of data: Hadoop can be used with both structured and unstructured data.
MongoDB uses only CSV or JSON format.
2. Design purpose: Hadoop is primarily designed to analyze and process large
volumes of data. MongoDB is primarily designed as a database.
3. Built with: Hadoop is a Java based application. MongoDB is a C++ based
application.
4. Strength: Handling of batch processes and lengthy-running ETL jobs is
excellently done in Hadoop, and it comes in very handy for managing Big Data.
MongoDB is more robust and flexible as compared to Hadoop.
5. Cost of hardware: As Hadoop is a group of various software components, it can
cost more. MongoDB, being a single product, is more cost effective.
6. Framework: Hadoop comprises various software components responsible for
creating a data processing framework. MongoDB can be used to query, aggregate,
index, or replicate stored data; the stored data is in the form of Binary
JSON (BSON) and data is stored in collections.
7. RDBMS: Hadoop is not designed to replace an RDBMS but provides additional
support to an RDBMS to archive data, and it also offers a wide variety of use
cases. MongoDB is designed to replace or enhance the RDBMS and give it a wide
variety of use cases.
8. Drawbacks: Hadoop highly depends upon the ‘NameNode’, which can be a point
of failure. MongoDB has low fault tolerance, which leads to loss of data
occasionally.

Q. 11)What is NewSQL? Mention its basic characteristics.? Differentiate NoSQL vs NewSQL 04

Ans.

The term NewSQL categorizes databases that combine the relational model with
advances in scalability and flexibility in the types of data handled. These
databases focus on features which are not present in NoSQL, such as offering a
strong consistency guarantee. This covers two layers of data: a relational one
and a key-value store.
NoSQL vs. NewSQL:
1. NoSQL is a schema-free database. NewSQL can be schema-fixed as well as
schema-free.
2. Both are horizontally scalable.
3. NoSQL possesses automatic high availability. NewSQL possesses built-in high
availability.
4. NoSQL supports cloud, on-disk, and cache storage. NewSQL fully supports cloud,
on-disk, and cache storage.
5. NoSQL promotes CAP properties. NewSQL promotes ACID properties.
6. Online Transactional Processing is not supported in NoSQL. It is fully supported
in NewSQL.
7. NoSQL has low security concerns. NewSQL has moderate security concerns.
8. Use cases for NoSQL: Big Data, social network applications, and IoT. Use cases
for NewSQL: e-commerce, the telecom industry, and gaming.
9. Examples of NoSQL: DynamoDB, MongoDB, RavenDB, etc. Examples of NewSQL:
VoltDB, CockroachDB, NuoDB, etc.

Q. 12)What are the data types of MongoDB? Explain in brief.03

Ans.

1. String: This is the most commonly used data type in MongoDB to store data.
BSON strings are UTF-8, so the drivers for each programming language convert
from the string format of the language to UTF-8 while serializing and
de-serializing BSON. The string must be valid UTF-8.
2. Integer: The integer data type is used to store an integer value. It can be
stored in two forms: a 32-bit signed integer and a 64-bit signed integer.
3. Double: The double data type is used to store floating-point values.
4. Boolean: The boolean data type is used to store either true or false.
5. Null: The null data type is used to store the null value.
6. Array: An array is a set of values. It can store values of the same or
different data types. In MongoDB, an array is created using square brackets ([]).
7. Object: The object data type stores embedded documents. Embedded documents
(also known as nested documents) are documents that contain a document inside
another document.
8. Undefined: This data type stores undefined values.
9. Binary Data: This data type is used to store binary data.
10. Date: The date data type stores dates.
11. Decimal: This data type stores a 128-bit decimal-based floating-point value.
It was introduced in MongoDB version 3.4.
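
A short pymongo sketch showing several of these BSON types stored in a single
document (it assumes a MongoDB server on localhost and illustrative database,
collection, and field names):

import datetime
from bson import Binary
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")    # assumes a local MongoDB server
doc = {
    "name": "Geek1",                               # String
    "roll_no": 1,                                  # Integer
    "percentage": 88.5,                            # Double
    "is_active": True,                             # Boolean
    "remarks": None,                               # Null
    "subjects": ["Maths", "Physics"],              # Array
    "address": {"city": "Pune", "pin": 411001},    # Object (embedded document)
    "photo": Binary(b"\x89PNG..."),                # Binary data (illustrative bytes)
    "admitted_on": datetime.datetime(2023, 6, 1),  # Date
}
client["school"].students.insert_one(doc)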
Q. 13)Explain MongoDB sharding process. 03
Ans.

Sharding means to distribute data on multiple servers, here a large amount of data is partitioned into
data chunks using the shard key, and these data chunks are evenly distributed across shards that reside
across many physical servers.

Sharding, which is also known as data partitioning, works on the same concept as
sharing pizza slices. It is basically a database architecture pattern in
which we split a large dataset into smaller chunks (logical shards) and we
store/distribute these chunks in different machines/database nodes (physical
shards). Each chunk/partition is known as a “shard” and each shard has the
same database schema as the original database. We distribute the data in such
a way that each row appears in exactly one shard. It’s a good mechanism to
improve the scalability of an application.
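
A conceptual Python sketch of hashed shard-key routing; in MongoDB this routing is
actually done by the mongos router using ranged or hashed shard keys, so the
hash-modulo scheme below is only an illustration of the idea:

import hashlib

NUM_SHARDS = 3
shards = {i: [] for i in range(NUM_SHARDS)}   # stand-ins for physical shard servers

def shard_for(shard_key: str) -> int:
    # Map a shard-key value to one of the shards using a stable hash.
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each document is routed to exactly one shard based on its shard key (user_id here).
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    doc = {"user_id": user_id, "cart": []}
    shards[shard_for(user_id)].append(doc)

for shard_id, docs in shards.items():
    print(shard_id, [d["user_id"] for d in docs])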

Q. 14)Explain following for MongoDB (i)Indexing (ii) Aggregation 07


Ans.

In MongoDB, aggregation operations process the data records/documents and return
computed results. They collect values from various documents, group them together,
and then perform different types of operations on that grouped data, like sum,
average, minimum, maximum, etc., to return a computed result. It is similar to the
aggregate functions of SQL.
MongoDB provides three ways to perform aggregation
•Aggregation pipeline
•Map-reduce function
•Single-purpose aggregation
In MongoDB, the aggregation pipeline consists of stages, and each stage transforms
the documents. In other words, the aggregation pipeline is a multi-stage pipeline:
in each stage, documents are taken as input and a resultant set of documents is
produced; in the next stage (if available), the resultant documents are taken as
input and produce output, and this process goes on until the last stage. The basic
pipeline stages provide filters that operate like queries, and the document
transformations modify the resultant documents; other pipeline stages provide tools
for grouping and sorting documents. You can also use the aggregation pipeline on a
sharded collection.

Map Reduce
Map reduce is used for aggregating results over large volumes of data. Map reduce
has two main functions: a map, which groups all the documents, and a reduce, which
performs operations on the grouped data.
Syntax:
db.collectionName.mapReduce(mappingFunction, reduceFunction,
{out:'Result'});

Single Purpose Aggregation


It is used when we need simple access to documents, like counting the number of
documents or finding all distinct values in a document. It simply provides access
to the common aggregation processes using the count(), distinct(), and
estimatedDocumentCount() methods, due to which it lacks the flexibility and
capabilities of the pipeline.
Example:
In the following example, we are working with:
Database: GeeksForGeeks
Collection: studentsMark
Documents: Seven documents that contain the details of the students in the
form of field-value pairs.
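
A pymongo sketch of an aggregation pipeline over such a collection; the field names
(class, section, mark) are assumptions made for illustration, and a MongoDB server
on localhost is assumed:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["GeeksForGeeks"]

# Pipeline: first filter documents, then group and compute aggregates, then sort.
pipeline = [
    {"$match": {"class": 11}},                      # stage 1: filter like a query
    {"$group": {"_id": "$section",                  # stage 2: group by section
                "avgMark": {"$avg": "$mark"},
                "students": {"$sum": 1}}},
    {"$sort": {"avgMark": -1}},                     # stage 3: sort the grouped result
]
for row in db.studentsMark.aggregate(pipeline):
    print(row)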

Indexing in MongoDB :
MongoDB uses indexing in order to make query processing more efficient. If there is no indexing, then MongoDB must scan every document in the collection to retrieve only those documents that match the query. Indexes are special data structures that store some information related to the documents so that it becomes easy for MongoDB to find the right data. The indexes are ordered by the value of the field specified in the index.
Creating an Index :
MongoDB provides a method called createIndex() that allows the user to create an index.
Syntax –

db.COLLECTION_NAME.createIndex({KEY:1})
The key determines the field on the basis of which you want to create an index
and 1 (or -1) determines the order in which these indexes will be
arranged(ascending or descending).
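
As a small illustrative sketch (the collection and field names are assumed), indexes could be created and inspected like this:

db.studentsMark.createIndex({ name: 1 })             // single-field index, ascending order
db.studentsMark.createIndex({ class: 1, marks: -1 }) // compound index: class ascending, marks descending
db.studentsMark.getIndexes()                         // list all indexes on the collection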

Q. 15)Which terms are used for table, row, column, and table-join in MongoDB 03
Ans.

CH-4

Q. 1)What is Stream Computing and Sampling Data in a Stream. 03


Ans.
Q. 2)Write the application of RTAP. 04

Ans.

1. Apache Samza

• Apache Samza is an open source, near-real time, asynchronous computational framework for stream
processing developed by the Apache Software Foundation in Scala and Java.

• Samza allows users to build stateful applications that process data in real-time from multiple sources
including Apache Kafka.

• Samza provides fault tolerance, isolation and stateful processing. Samza is used by multiple
companies. The biggest installation is in LinkedIn.
2. Apache Flink

• Apache Flink is an open-source, unified stream-processing and batch-processing framework developed


by the Apache Software Foundation.

• The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala.

• Flink provides a high-throughput, low-latency streaming engine as well as support for event-time
processing and state management.

• Flink does not provide its own data-storage system, but provides data-source and sink connectors to
systems such as Amazon Kinesis, Apache Kafka, Alluxio, HDFS, Apache Cassandra, and Elastic Search.

Q. 3)Explain Stream Data Model and Architecture. 07

Ans.
Q. 4)How Graph Analytics used in Big Data. 03

Ans.
Q. 5)Write a short note on Decaying Window. 04

Ans.
Q. 6)Explain with example: How to perform Real Time Sentiment Analysis of any product. 07

Ans.

CH-5

Q. 1)What is Zookeeper? List the benefits of it. 03 (4)

Ans.

Here are the benefits of using ZooKeeper −


• Simple distributed coordination process
• Synchronization − Mutual exclusion and co-operation between server
processes. This process helps in Apache HBase for configuration
management.
• Ordered Messages
• Serialization − Encodes the data according to specific rules and ensures your
application runs consistently. This approach can be used in MapReduce to
coordinate the queue for executing running threads.
• Reliability
• Atomicity − A data transfer either succeeds or fails completely; no
transaction is partial.

Q. 2)Explain working of Hive with proper steps and diagram. 07 (4)

Ans.

The major components of Hive and its interaction with the Hadoop is
demonstrated in the figure below and all the components are described further:
• User Interface (UI) –
As the name describes User interface provide an interface between
user and hive. It enables user to submit queries and other operations
to the system. Hive web UI, Hive command line, and Hive HD Insight
(In windows server) are supported by the user interface.

• Hive Server – It is referred to as Apache Thrift Server. It accepts the


request from different clients and provides it to Hive Driver.

• Driver –
Queries submitted by the user through the interface are received by the
driver within Hive. The driver implements the concept of session handles
and provides execute and fetch APIs modelled on JDBC/ODBC interfaces.

• Compiler –
The compiler parses the query, performs semantic analysis on the different
query blocks and query expressions, and eventually generates an execution
plan with the help of the table and partition metadata obtained from the
metastore.

• Metastore –
The metastore stores all the structural information of the different tables and
partitions in the warehouse, including attributes and attribute-level
information, along with the serializers and deserializers necessary to read
and write data and the corresponding HDFS files where the data is stored.
Hive selects corresponding database servers to store the schema or metadata
of databases, tables, attributes in a table, data types of databases, and the
HDFS mapping.
• Execution Engine –
The execution engine carries out the execution plan made by the compiler.
The plan is a DAG of stages. The execution engine manages the
dependencies between the various stages of the plan and executes these
stages on the suitable system components.

Diagram – Architecture of Hive built on top of Hadoop

Along with the architecture, the job execution flow in Hive with Hadoop is
demonstrated step by step below.
• Step-1: Execute Query –
An interface of Hive, such as the command line or the web UI, delivers the
query to the driver to execute. Here the UI calls the execute interface of the
driver, for example over JDBC or ODBC.

• Step-2: Get Plan –
The driver creates a session handle for the query and transfers the query to
the compiler to make an execution plan. In other words, the driver interacts
with the compiler.

• Step-3: Get Metadata –
The compiler sends a metadata request to the metastore and gets the
necessary metadata back from it.
• Step-4: Send Metadata –
Metastore transfers metadata as an acknowledgment to the compiler.

• Step-5: Send Plan –
The compiler sends the execution plan it has generated back to the driver so
the query can be executed.

• Step-6: Execute Plan –
The driver sends the execution plan to the execution engine, which runs it as
a job:
• Execute Job
• Job Done
• DFS operation (metadata operation)
• Step-7: Fetch Results –
The execution engine retrieves the results from the data nodes, and the driver
fetches the results from the execution engine.

• Step-8: Send Results –
The driver sends the results to the user interface (UI).

Q. 3)What is Pig? 03 (3)

Ans.

Q. 4)Explain the architecture of HBase. 07 (3)

Ans.
Q. 5)What do you mean by HiveQL DDL? Explain any 3 HiveQL DDL command with its Syntax & example.
07 (3)

Ans. Hive Data Definition Language (DDL) is a subset of Hive SQL statements that describe
the data structure in Hive by creating, deleting, or altering schema objects such as
databases, tables, views, partitions, and buckets. Most Hive DDL statements start with the
keywords CREATE, DROP, or ALTER.

1. CREATE DATABASE in Hive


The CREATE DATABASE statement is used to create a database in the Hive.
The DATABASE and SCHEMA are interchangeable. We can use either
DATABASE or SCHEMA.
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name

[COMMENT database_comment]

[LOCATION hdfs_path]

[WITH DBPROPERTIES (property_name=property_value, ...)];

2. SHOW DATABASE in Hive


The SHOW DATABASES statement lists all the databases present in the Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);

3. DESCRIBE DATABASE in Hive


The DESCRIBE DATABASE statement in Hive shows the name of Database in
Hive, its comment (if set), and its location on the file system.
The EXTENDED can be used to get the database properties.
Syntax:
DESCRIBE DATABASE/SCHEMA [EXTENDED] db_name;
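
A combined illustration of the three statements, using a hypothetical database name, comment, and HDFS path, might look like the following:

CREATE DATABASE IF NOT EXISTS sales_db
COMMENT 'Example database for the DDL demo'
LOCATION '/user/hive/warehouse/sales_db.db'
WITH DBPROPERTIES ('owner' = 'student');

SHOW DATABASES;

DESCRIBE DATABASE EXTENDED sales_db;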

Q. 6)Explain Storage mechanism in HBase. 04 (2)

Ans.

HBase is a column-oriented database and the tables in it are sorted by row. The table
schema defines only column families, which are the key-value pairs. A table can have
multiple column families, and each column family can have any number of columns.
Subsequent column values are stored contiguously on disk. Each cell value of the
table has a timestamp. In short, in HBase (a small shell sketch follows the list below):

• Table is a collection of rows.


• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
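
A minimal HBase shell sketch of this model, using example table and column-family names (not taken from the question), is shown below:

create 'student', 'personal', 'marks'            # table with two column families
put 'student', 'row1', 'personal:name', 'Raju'   # column 'name' inside the 'personal' family
put 'student', 'row1', 'marks:maths', '85'       # column 'maths' inside the 'marks' family
get 'student', 'row1'                            # read all cells of one row
scan 'student'                                   # scan the whole table

Each put also records a timestamp for the cell, matching the description above.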

Q. 7)Differentiate between HDFS Vs. HBase. 04 (2)

Ans.
HDFS vs. HBase:

• HDFS is a Java-based distributed file system; HBase is a Hadoop database that runs on top of HDFS.
• HDFS is highly fault-tolerant and cost-effective; HBase is partially fault-tolerant and highly consistent.
• HDFS provides only sequential read/write operations; HBase allows random access thanks to its hash-table-based design.
• HDFS is based on a write-once, read-many-times model; HBase supports random read and write operations into the file system.
• HDFS has a rigid architecture; HBase supports dynamic changes.
• HDFS is preferable for offline batch processing; HBase is preferable for real-time processing.
• HDFS provides high latency for access operations; HBase provides low-latency access to small amounts of data.

Q. 8)Differentiate between HBase Vs. RDBMS. 04 (2)

Ans.

RDBMS vs. HBase:

1. SQL – RDBMS requires SQL (Structured Query Language); SQL is not required in HBase.
2. Schema – RDBMS has a fixed schema; HBase has no fixed schema and allows columns to be added on the fly.
3. Database type – RDBMS is a row-oriented database; HBase is a column-oriented database.
4. Scalability – RDBMS allows scaling up: rather than adding new servers, the current server is upgraded to a more capable one whenever more memory, processing power, or disk space is required. HBase allows scaling out: when extra memory and disk space are required, new servers are added to the cluster rather than upgrading the existing ones.
5. Nature – RDBMS is static in nature; HBase is dynamic in nature.
6. Data retrieval – Retrieval of data is slower in RDBMS; retrieval is faster in HBase.
7. Rule – RDBMS follows the ACID properties (Atomicity, Consistency, Isolation, Durability); HBase follows the CAP theorem (Consistency, Availability, Partition-tolerance).
8. Type of data – RDBMS handles structured data; HBase handles structured, semi-structured, and unstructured data.
9. Sparse data – RDBMS cannot handle sparse data; HBase can handle sparse data.
10. Volume of data – In RDBMS, the amount of data is determined by the server's configuration; in HBase, it depends on the number of machines deployed rather than on a single machine.
11. Transaction integrity – RDBMS mostly guarantees transaction integrity; HBase provides no such guarantee.
12. Referential integrity – Referential integrity is supported by RDBMS; HBase has no built-in support for it.
13. Normalization – Data can be normalized in RDBMS; data in HBase is not normalized, i.e., there is no logical relationship or connection between distinct tables of data.
14. Table size – RDBMS is designed to accommodate small tables and is difficult to scale; HBase is designed to accommodate large tables and can scale horizontally.
Q. 9)Differentiate between HIVE and HBASE. 04

Ans.

Q. 10)What is the use of Pig and Hive in Big Data? 03

Ans.
Q. 11)What are WAL, MemStore, Hfile and Hlog in HBase? 04

Ans.

Write Ahead Log (WAL) is a file used to store new data that is yet to be put on permanent
storage. It is used for recovery in the case of failure. When a client issues a put request, it will
write the data to the write-ahead log (WAL).
MemStore is just like a cache memory. Anything that is entered into HBase is stored here
initially. Later, the data is transferred and saved in HFiles as blocks and the MemStore is flushed.
When the MemStore accumulates enough data, the entire sorted KeyValue set is written to a
new HFile in HDFS.

HFile is the internal file format for HBase to store its data. These are the first two
lines of the description of HFile from its source code: File format for hbase. A file of
sorted key/value pairs. Both keys and values are byte arrays.

HLog contains entries for edits of all regions performed by a particular Region
Server. HLog is the physical file that implements the Write Ahead Log (WAL), and all
edits are written to it immediately. WAL edits remain in memory until the flush period
in the case of deferred log flush.
Q. 12)Explain the concept of regions in HBase and storing Big data with HBase. 07

Ans.

Region Server –
HBase tables are divided horizontally by row-key range into Regions. Regions are
the basic building elements of an HBase cluster; they hold the distributed portions
of the tables and are comprised of column families. A Region Server runs on an
HDFS DataNode present in the Hadoop cluster, and it is responsible for handling,
managing, and executing read and write HBase operations on its set of regions.
The default size of a region is 256 MB.

There are 2 types of data storage approaches:


1. Row-Oriented
2. Column-Oriented
Row-Oriented :
In the row-oriented data storage approach, the data is stored and retrieved one row
at a time. This can cause problems: if we want only part of the data from a row, we
still have to retrieve the complete row even if we do not need the rest. On the other
hand, this approach works well for OLTP-style operations and makes it easy to read
and write whole records. However, it is less efficient when we perform operations
across the entire database.
Column-Oriented :

In the column-oriented data storage approach, the data is stored and retrieved by
columns. This solves the problem faced in the row-oriented approach, because we
can filter out only the data we need using the corresponding columns instead of
reading whole rows. In the column-oriented approach, individual read and write
operations are slower, but it can be efficient when performing operations on the
entire database, and it permits very high compression rates.
Q. 13)Explain Pig data Model in detail and Discuss how it will help for effective data flow. 07
Ans.
Q. 14)Describe data processing operators in Pig. 04

Ans.

Data Processing Operators :


The Apache Pig operators belong to Pig Latin, a high-level procedural language for
querying large data sets using Hadoop and the MapReduce platform.
A Pig Latin statement is an operator that takes a relation as input and produces
another relation as output.
These operators are the main tools Pig Latin provides to operate on the data.
They allow you to transform it by sorting, grouping, joining, projecting, and
filtering.
The Apache Pig operators can be classified as :
Relational Operators :
Relational operators are the main tools Pig Latin provides to operate on the data.
Some of the Relational Operators are :
LOAD: The LOAD operator is used to load data from the file system or HDFS
storage into a Pig relation.
FOREACH: This operator generates data transformations based on columns of
data. It is used to add or remove fields from a relation.
FILTER: This operator selects tuples from a relation based on a condition.
JOIN: The JOIN operator is used to perform an inner equijoin of two or more
relations based on common field values.
ORDER BY: Order By is used to sort a relation based on one or more fields in
either ascending or descending order using ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples with the same group
key (key field).
COGROUP: COGROUP is the same as the GROUP operator. For readability,
programmers usually use GROUP when only one relation is involved and
COGROUP when multiple relations are involved.
Diagnostic Operator :
The load statement will simply load the data into the specified relation in Apache
Pig.
To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
Some Diagnostic Operators are :
DUMP: The DUMP operator is used to run Pig Latin statements and display the
results on the screen.
DESCRIBE: Use the DESCRIBE operator to review the schema of a particular
relation. The DESCRIBE operator is best used for debugging a script.
ILLUSTRATE: ILLUSTRATE operator is used to review how data is transformed
through a sequence of Pig Latin statements. ILLUSTRATE command is your best
friend when it comes to debugging a script.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and
MapReduce execution plans of a relation.
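
A short Pig Latin sketch combining several of these operators is given below; the input path and schema are assumptions made only for illustration:

-- load student records from an example file
students = LOAD '/data/students.csv' USING PigStorage(',')
           AS (name:chararray, dept:chararray, marks:int);
passed   = FILTER students BY marks >= 60;              -- keep students with 60 or more marks
by_dept  = GROUP passed BY dept;                        -- group the filtered tuples by department
avg_mark = FOREACH by_dept GENERATE group AS dept,
           AVG(passed.marks) AS avg_marks;              -- compute the average per group
sorted   = ORDER avg_mark BY avg_marks DESC;            -- sort by the computed average
DUMP sorted;                                            -- diagnostic operator: print the result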

Q. 15)Compare: Map Reduce v/s Apache PIG

MapReduce vs. Apache Pig:

1. MapReduce is a data processing language; Pig Latin is a data flow language.
2. MapReduce converts the job into map-reduce functions; Pig converts the query into map-reduce functions.
3. MapReduce is a low-level language; Pig is a high-level language.
4. It is difficult for the user to perform join operations in MapReduce; Pig makes join operations easy for the user.
5. The user has to write about 10 times more lines of code in MapReduce to perform a similar task; Pig needs fewer lines of code because it supports the multi-query approach.
6. MapReduce runs as several jobs, so execution time is longer; Pig takes less compilation time as the Pig operators are converted into MapReduce jobs.
7. MapReduce is supported by recent versions of Hadoop; Pig is supported with all versions of Hadoop.

Q. Explain in brief four major libraries of Apache Spark 07

Ans.

Spark Core
o The Spark Core is the heart of Spark and performs the core functionality.
o It holds the components for task scheduling, fault recovery, interacting with
storage systems and memory management.

Spark SQL
o The Spark SQL is built on the top of Spark Core. It provides support for structured
data.
o It allows querying the data via SQL (Structured Query Language) as well as the
Apache Hive variant of SQL called HQL (Hive Query Language).
o It supports JDBC and ODBC connections that establish a relation between Java
objects and existing databases, data warehouses and business intelligence tools.
o It also supports various sources of data like Hive tables, Parquet, and JSON.

Spark Streaming
o Spark Streaming is a Spark component that supports scalable and fault-tolerant
processing of streaming data.
o It uses Spark Core's fast scheduling capability to perform streaming analytics.
o It accepts data in mini-batches and performs RDD transformations on that data.
o Its design ensures that the applications written for streaming data can be reused
to analyze batches of historical data with little modification.
o The log files generated by web servers can be considered as a real-time example
of a data stream.

MLlib
o The MLlib is a Machine Learning library that contains various machine learning
algorithms.
o These include correlations and hypothesis testing, classification and regression,
clustering, and principal component analysis.
o It is nine times faster than the disk-based implementation used by Apache Mahout.

GraphX
o The GraphX is a library that is used to manipulate graphs and perform graph-
parallel computations.
o It facilitates creating a directed graph with arbitrary properties attached to each
vertex and edge.
o To manipulate graphs, it supports various fundamental operators like subgraph,
joinVertices, and aggregateMessages.

Q. 16)Explain HIVE data types and Services. 04

Ans.

Column types are used as the column data types of Hive. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range
exceeds the range of INT, you need to use BIGINT and if the data range is smaller than
the INT, you use SMALLINT. TINYINT is smaller than SMALLINT.

String Types
String type data types can be specified using single quotes (' ') or double quotes (" "). It
contains two data types: VARCHAR and CHAR. Hive follows C-types escape characters.

Timestamp
It supports traditional UNIX timestamp with optional nanosecond precision. It supports
java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff” and format “yyyy-mm-dd
hh:mm:ss.ffffffffff”.

Dates
DATE values are described in year/month/day format in the form {{YYYY-MM-DD}}.

Decimals
The DECIMAL type in Hive is as same as Big Decimal format of Java. It is used for
representing immutable arbitrary precision. The syntax and example is as follows:

DECIMAL(precision, scale)

Null Value
Missing values are represented by the special value NULL.
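
As a hedged illustration (the table and column names are invented for the example), these types can be combined in a table definition as follows:

CREATE TABLE IF NOT EXISTS employee (
  emp_id     INT,            -- integral type
  emp_name   VARCHAR(50),    -- string type
  salary     DECIMAL(10,2),  -- decimal with precision 10, scale 2
  joined_on  DATE,           -- date in YYYY-MM-DD form
  updated_at TIMESTAMP       -- UNIX-style timestamp
);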
CH-6
Q. 1)Write a short note on Spark stack and Explain Components of Spark. 07 (3)
Ans.

Covered above, in the Apache Spark libraries question (Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX).

Q. 2)Write a short note on Spark. What are the advantages of using Apache Spark over Hadoop? 04 (3)

Ans. Spark itself is covered above; its advantages over Hadoop are summarized in the comparison below.

Hadoop vs. Spark:

• Processing speed & performance: Hadoop's MapReduce model reads and writes from disk, which slows down processing; Spark reduces the number of read/write cycles to disk and stores intermediate data in memory, hence faster processing.
• Usage: Hadoop is designed to handle batch processing efficiently; Spark is designed to handle real-time data efficiently.
• Latency: Hadoop is a high-latency computing framework with no interactive mode; Spark is low-latency and can process data interactively.
• Data: With Hadoop MapReduce, a developer can only process data in batch mode; Spark can process real-time data from real-time events like Twitter and Facebook.
• Cost: Hadoop is the cheaper option in terms of cost; Spark requires a lot of RAM to run in memory, which increases the cluster size and hence the cost.
• Algorithm used: The PageRank algorithm is used in Hadoop; Spark uses the graph computation library GraphX.
• Fault tolerance: Hadoop is a highly fault-tolerant system and achieves fault tolerance by replicating blocks of data; if a node goes down, the data can be found on another node. Spark achieves fault tolerance by storing the chain of transformations; if data is lost, the chain of transformations can be recomputed on the original data.
• Security: Hadoop supports LDAP, ACLs, SLAs, etc., and hence is extremely secure; Spark is not secure on its own and relies on integration with Hadoop to achieve the necessary security level.
• Machine learning: Data fragments in Hadoop can be too large and can create bottlenecks, so it is slower than Spark; Spark is much faster as it uses MLlib for computations and has in-memory processing.
• Scalability: Hadoop is easily scalable by adding nodes and disks for storage and supports tens of thousands of nodes; Spark is harder to scale because it relies on RAM for computations, and it supports thousands of nodes in a cluster.
• Language support: Hadoop uses Java or Python for MapReduce apps; Spark uses Java, R, Scala, Python, or Spark SQL for its APIs.
• User-friendliness: Hadoop is more difficult to use; Spark is more user-friendly.
• Resource management: YARN is the most common option for Hadoop resource management; Spark has built-in tools for resource management.
Q 3)What is transformation and actions in Spark? Explain with example. 04 (2)

Ans.

Transformation: A transformation is an operation applied on an RDD to create a new RDD. filter, groupBy, and map are examples of transformations.

Actions: Actions are operations that are also applied on an RDD, but they instruct Spark to perform the computation and send the result back to the driver. collect and count are examples of actions.

The Transformations and Actions in Apache Spark are divided into 4 major
categories:

• General
• Mathematical and Statistical
• Set Theory and Relational
• Data-structure and IO
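
A small PySpark sketch (Spark exposes the same API in Scala and Java as well) shows lazy transformations followed by actions; the data and names here are illustrative only:

from pyspark import SparkContext

sc = SparkContext(appName="TransformationsAndActions")

nums = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations: build new RDDs lazily, nothing is computed yet
evens = nums.filter(lambda x: x % 2 == 0)    # keep even numbers
squares = evens.map(lambda x: x * x)         # square each remaining element

# Actions: trigger the computation and return results to the driver
print(squares.collect())   # [4, 16, 36]
print(squares.count())     # 3

sc.stop()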

Q. 4)List the features of spark. 03 (2)

Ans.

Apache Spark has following features.


• Speed − Spark helps to run an application in Hadoop cluster, up to 100
times faster in memory, and 10 times faster when running on disk. This is
possible by reducing number of read/write operations to disk. It stores the
intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala,
or Python. Therefore, you can write applications in different languages.
Spark comes up with 80 high-level operators for interactive querying.
• Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also
supports SQL queries, Streaming data, Machine learning (ML), and Graph
algorithms.
• Fault tolerance.
• Dynamic In Nature.
• Lazy Evaluation.
• Real-Time Stream Processing.
• Speed.
• Reusability.
• Advanced Analytics.
• In Memory Computing.

Q. 5)What is RDD in Apache Spark? Why RDD is better than Map Reduce data storage?(2)

Ans.

Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an


immutable distributed collection of objects. Each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster. RDDs can contain
any type of Python, Java, or Scala objects, including user-defined classes.
Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs. RDD
is a fault-tolerant collection of elements that can be operated on in parallel.
->Why RDD is better than Map Reduce data storage?(2)

Apache Spark is very much popular for its speed. It runs 100 times faster in
memory and ten times faster on disk than Hadoop MapReduce since it
processes data in memory (RAM). At the same time, Hadoop MapReduce has
to persist data back to the disk after every Map or Reduce action.
Apache Spark can schedule all its tasks by itself, while Hadoop MapReduce
requires external schedulers like Oozie.
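
A hedged PySpark sketch of this difference: by caching an RDD in memory, repeated passes avoid re-reading from disk, which is what MapReduce would do between jobs. The input path is hypothetical:

from pyspark import SparkContext

sc = SparkContext(appName="RDDCachingSketch")

# hypothetical input file; each line is assumed to contain one integer
nums = sc.textFile("hdfs:///data/numbers.txt").map(lambda s: int(s.strip()))

# keep the parsed RDD in memory so the iterative passes below
# do not re-read and re-parse the file from HDFS each time
nums.cache()

for i in range(3):                           # several passes over the same data
    print("pass", i, "sum =", nums.sum())

sc.stop()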

Q. 8)Write applications of Spark. 04 (Scala, Python, Java)


Ans.

• Streaming Data:
• Machine Learning:
• Fog Computing:
• Apache Spark at Alibaba:
• Apache Spark at MyFitnessPal:
• Apache Spark at TripAdvisor:
• Apache Spark at Yahoo:
• Finance:
Q. 10)Discuss Spark Streaming with suitable example such as analyzing tweets from Twitter. 07

Ans.

Q. 11)What are the problems related to Map Reduce data storage? How Apache Spark solves it using
RDD? 07

Ans.

Apache MapReduce is a distributed data processing framework which uses the MapReduce
algorithm for processing large volumes of data distributed across multiple nodes in a
Hadoop cluster.

It is a data processing technology and not a data storage technology. HDFS (Hadoop
Distributed File System) is a data storage technology which stores large volumes of data in a
distributed manner across multiple nodes in a Hadoop cluster.

Apache MapReduce reads large volumes of data from the disk, executes the mapper tasks
and then the reducer tasks and stores the output back on the disk.

The intermediate output of the mapper task is also stored on the local disk. This results in
latency as writing the data to the disk (as part of the map task) and reading the data from
the disk (for it to be used in the reduce task) is a time consuming operation as disk reads
are I/O expensive.

MapReduce provides high throughput along with high latency. Hence it is not suitable for:

• Iterative tasks
• Interactive queries

RDD avoids much of this intermediate reading/writing to HDFS. By significantly reducing I/O operations,
RDD offers a much faster way to retrieve and process data in a Hadoop cluster. In fact, it is
estimated that Hadoop MapReduce apps spend more than 90% of their time performing
reads/writes to HDFS.
