
Name- Harshita Mandloi

BE IT-1
BATCH- B-2
Roll No- 47

Big Data Analytics (BDA) Assignment No.1

Q.1) List and Explain Big Data- i) Characteristics; ii) Types; iii) Challenges.
Ans- Big Data is a collection of data that is huge in volume and keeps growing exponentially with time. Its size and complexity are so large that no traditional data management tool can store or process it efficiently. For example, the “New York Stock Exchange” generates about one terabyte of new trade data per day.
Characteristics-
Big data can be described by the following characteristics:
 Volume
 Variety
 Velocity
 Variability
(i) Volume- The name Big Data itself is related to its enormous size. The size of data plays a crucial role in determining its value, and whether a particular dataset can actually be considered Big Data depends on its volume. Hence, ‘Volume’ is one characteristic which needs to be considered while dealing with Big Data solutions.
(ii) Variety- The next aspect of Big Data is its variety. Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analysing data.
(iii) Velocity- The term ‘velocity’ refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential in the data. Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability- This refers to the inconsistency which can be
shown by the data at times, thus hampering the process of being
able to handle and manage the data effectively.
Types-
Following are the types of Big Data:
 Structured
 Unstructured
 Semi-structured

(i) Structured- Any data that can be stored, accessed and processed in a fixed format is termed ‘structured’ data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
Looking at these figures one can easily understand why the
name Big Data is given and imagine the challenges involved in
its storage and processing.
Examples Of Structured Data- An ‘Employee’ table in a
database is an example of Structured Data.
Employee_ID Employee_Name Gender Department Salary_In_lacs
2365  Rajesh Kulkarni  Male  Finance 650000
3398  Pratibha Joshi  Female  Admin  650000
7465  Shushil Roy  Male  Admin  500000
7500  Shubhojit Das  Male  Finance  500000
7699  Priya Sane  Female  Finance  550000

(ii) Unstructured- Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
Examples Of Unstructured Data- The output returned by ‘Google Search’.

(iii) Semi-structured- Semi-structured data can contain both forms of data. Semi-structured data may appear structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples Of Semi Structured Data- Personal data stored in an
XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
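
A minimal Python sketch (not part of the assignment source; the wrapping <people> root element is added only so the snippet forms well-formed XML) showing how such semi-structured records can still be parsed even though no relational schema is defined:

import xml.etree.ElementTree as ET

xml_data = """
<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</people>
"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Each record describes itself with tags, so no fixed table definition is needed.
    print(rec.find("name").text, rec.find("sex").text, rec.find("age").text)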

Challenges- The top 6 Big Data challenges are as follows-
 Lack of knowledgeable professionals
 Lack of proper understanding of massive data
 Data growth issues
 Confusion during Big Data tool selection
 Integrating data from a spread of sources
 Securing data
1) Lack of Knowledgeable Professionals- To run these modern technologies and Big Data tools, companies need skilled data professionals. These professionals include data scientists, data analysts, and data engineers who work with the tools and make sense of giant data sets. One of the Big Data challenges that any company faces is a lack of Big Data professionals. This is often because data handling tools have evolved rapidly, but in most cases the professionals haven't. Actionable steps need to be taken to bridge this gap.
2) Lack of Proper Understanding of Massive Data- Companies fail in their Big Data initiatives due to insufficient
understanding. Employees may not know what data is, how it is stored and processed, its importance, and its sources. Data professionals may know what is happening, but others may not have a clear picture. For example, if employees do not understand the importance of data storage, they might not keep backups of sensitive data or use databases properly for storage. As a result, when this important data is required, it cannot be retrieved easily.
3) Data Growth Issues- One of the most pressing challenges of Big Data is storing these huge sets of data properly. The amount of data being stored in data centers and company databases is increasing rapidly. As these data sets grow exponentially with time, it gets challenging to handle them. Most of the data is unstructured and comes from documents, videos, audio, text files, and other sources, which means it cannot be kept in a traditional database.
4) Confusion during Big Data Tool Selection- Companies often get confused while selecting the best tool for Big Data analysis and storage. Is HBase or Cassandra the best technology for data storage? Is Hadoop MapReduce good enough, or will Spark be a better option for data analytics and storage? These questions bother companies, and sometimes they are unable to find the answers. They end up making poor decisions and selecting inappropriate technology. As a result, money, time, effort, and work hours are wasted.
5) Integrating Data from a Spread of Sources- Data in an organization comes from various sources, such as social media pages, ERP applications, customer logs, financial reports, e-mails, presentations, and reports created by employees. Combining all this data to prepare reports is a challenging task. This is an area often neglected by firms. Data integration is crucial for analysis, reporting, and business intelligence, so it needs to be done well.
6) Securing Data- Securing these huge sets of data is one of the daunting challenges of Big Data. Companies are often so busy understanding, storing, and analyzing their data sets that they push data security to later stages. This is not a wise move, as unprotected data repositories can become breeding grounds for malicious hackers. Companies can lose up to $3.7 million for a stolen record or a data breach.

Q.2) i) Distinguish between DBMS and DSMS.
ii) Explain Core Hadoop Components.
Ans- i) Distinguish between DBMS and DSMS.
Sr. No.  DBMS                                                               DSMS
1        DBMS refers to Data Base Management System.                        DSMS refers to Data Stream Management System.
2        Data Base Management System deals with persistent data.            Data Stream Management System deals with stream data.
3        In DBMS, random data access takes place.                           In DSMS, sequential data access takes place.
4        It is based on the query-driven processing model (pull-based).     It is based on the data-driven processing model (push-based).
5        In DBMS, the query plan is optimized at the beginning and fixed.   DSMS is based on adaptive query plans.
ii) Explain Core Hadoop Components- Hadoop is a framework
that uses distributed storage and parallel processing to store and
manage Big Data. It is the most commonly used software to
handle Big Data. There are three components of Hadoop.
1. Hadoop HDFS - Hadoop Distributed File System (HDFS)
is the storage unit of Hadoop.
2. Hadoop MapReduce - Hadoop MapReduce is the
processing unit of Hadoop.
3. Hadoop YARN - Hadoop YARN is a resource management
unit of Hadoop.
a) Hadoop HDFS- Data is stored in a distributed manner in
HDFS. There are two components of HDFS - name node and
data node. While there is only one name node, there can be
multiple data nodes. HDFS is specially designed for storing huge datasets on commodity hardware. An enterprise version of a server costs roughly $10,000 per terabyte for the full processor. If you need to buy 100 of these enterprise-version servers, the cost goes up to a million dollars. Hadoop enables you to use commodity machines as your data nodes.
This way, you don’t have to spend millions of dollars just on
your data nodes. However, the name node is always an enterprise server.
Features of HDFS-
 Provides distributed storage
 Can be implemented on commodity hardware
 Provides data security
 Highly fault-tolerant - If one machine goes down, the data
from that machine goes to the next machine
Master and Slave Nodes- Master and slave nodes form the
HDFS cluster. The name node is called the master, and the data
nodes are called the slaves.

The name node is responsible for the workings of the data nodes. It also stores the metadata. The data nodes read, write, process, and replicate the data. They also send signals, known as heartbeats, to the name node. These heartbeats show the status of the data node.

Consider that 30TB of data is loaded into the name node. The name node distributes it across the data nodes, and this data is replicated among the data nodes. You can see in the image above that the blue, grey, and red data blocks are replicated among the three data nodes. Replication of the data is performed three times by default, so that if a commodity machine fails, you can replace it with a new machine that has the same data.
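
As a small, hedged illustration (not from the assignment source): the replication factor described above is controlled in Hadoop by the dfs.replication property, typically set in hdfs-site.xml. A sketch of that configuration snippet:

<property>
  <name>dfs.replication</name>
  <value>3</value>  <!-- default replication factor; each block is kept on three data nodes -->
</property>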

b) Hadoop MapReduce- Hadoop MapReduce is the processing unit of Hadoop. In the MapReduce approach, the processing is done at the slave nodes, and the final result is sent to the master node. Code is sent to the data in order to process it, and this code is usually very small in comparison to the data itself: you only need to send a few kilobytes worth of code to perform a heavy-duty process on the computers holding the data.

The input dataset is first split into chunks of data. In this example, the input has three lines of text - “bus car train,” “ship ship train,” “bus ship car.” The dataset is then split into three chunks, based on these lines, and processed in parallel. In the map phase, each word is assigned a key and a value of 1; within a chunk we get pairs such as one bus, one car, and one train. These key-value pairs are then shuffled and sorted together based on their keys. In the reduce phase, the aggregation takes place, and the final word counts are obtained.
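
A minimal Python sketch (illustrative only, not the Hadoop API) of this word-count flow, with hand-rolled map, shuffle/sort, and reduce steps over the three input lines above:

from collections import defaultdict

lines = ["bus car train", "ship ship train", "bus ship car"]

# Map phase: emit a (word, 1) pair for every word in every chunk.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group all values that share the same key.
grouped = defaultdict(list)
for key, value in sorted(mapped):
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'bus': 2, 'car': 2, 'ship': 3, 'train': 2}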
c) Hadoop YARN- Hadoop YARN stands for Yet Another Resource Negotiator. It is the resource management unit of Hadoop and is available as a component of Hadoop version 2.
 Hadoop YARN acts like an OS for Hadoop: it is a resource management layer that sits on top of HDFS.
 It is responsible for managing cluster resources to make sure you don't overload one machine.
 It performs job scheduling to make sure that jobs are scheduled in the right place.

Suppose a client machine wants to run a query or fetch some code for data analysis. This job request goes to the resource manager (Hadoop YARN), which is responsible for resource allocation and management. In the node section, each node has its own node manager. These node managers manage the nodes and monitor resource usage on the node. The containers hold a collection of physical resources, which could be RAM, CPU, or hard drives.

Q.3) List the different NoSQL data stores. Explain any two with diagrams.
Ans- NoSQL is an umbrella term to describe any alternative system to
traditional SQL databases. NoSQL databases are all quite different from
SQL databases. They all use a data model that has a different structure
than the traditional row-and-column table model used with relational
database management systems (RDBMSs). But NoSQL databases are all
quite different from each other as well. Here are the four main types
of NoSQL databases:
 Document databases
 Key-value stores
 Column-oriented databases
 Graph databases
1) Document Databases- A document database stores data in
JSON, BSON, or XML documents (not Word documents or
Google docs, of course). In a document database, documents can
be nested. Particular elements can be indexed for faster
querying. Documents can be stored and retrieved in a form that
is much closer to the data objects used in applications, which
means less translation is required to use the data in an
application. SQL data must often be assembled and
disassembled when moving back and forth between applications
and storage. Document databases are popular with developers
because they have the flexibility to rework their document
structures as needed to suit their application, shaping their data
structures as their application requirements change over time.
This flexibility speeds development because in effect data
becomes like code and is under the control of developers. In
SQL databases, intervention by database administrators may be
required to change the structure of a database. The most widely
adopted document databases are usually implemented with a
scale-out architecture, providing a clear path to scalability of
both data volumes and traffic.
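
A small illustrative sketch (not from the source; the field names and values are invented) of the kind of nested document a document database stores, written here as a Python dictionary and serialized to JSON with the standard library:

import json

# A hypothetical customer document; nested fields such as "address" and
# "orders" live inside the same document instead of separate relational tables.
customer = {
    "_id": "cust-1001",
    "name": "Priya Sane",
    "address": {"city": "Pune", "state": "Maharashtra"},
    "orders": [
        {"order_id": "o-1", "amount": 1250.0},
        {"order_id": "o-2", "amount": 499.0},
    ],
}

print(json.dumps(customer, indent=2))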

2) Key-Value Stores- The simplest type of NoSQL database is a key-value store. Every data element in the database is stored as a key-value pair consisting of an attribute name (or "key") and a value. In a sense, a key-value store is like a relational database with only two columns: the key or attribute name (such as state) and the value (such as Alaska). Use cases include shopping carts, user preferences, and user profiles.
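
A rough sketch of the key-value access pattern (assuming a Redis server running on localhost and the redis-py client installed; the keys shown here are made up for illustration):

import redis

# Connect to a locally running Redis instance (default host and port assumed).
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Every data element is just a key and an opaque value.
r.set("user:1001:state", "Alaska")
r.set("user:1001:cart", "bus-ticket,train-ticket")

print(r.get("user:1001:state"))  # Alaska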

3) Column-Oriented Databases- While a relational database stores data in rows and reads data row by row, a column store is organized as a set of columns. This means that when you want to run analytics on a small number of columns, you can read those columns directly without consuming memory with the unwanted data. Columns are often of the same type and benefit from more efficient compression, making reads even faster. Columnar databases can quickly aggregate the value of a given column (adding up the total sales for the year, for example). Use cases include analytics. Unfortunately, there is no free lunch, which means that while columnar databases are great for analytics, the way in which they write data makes it very
difficult for them to be strongly consistent as writes of all the
columns require multiple write events on disk. Relational
databases don't suffer from this problem as row data is written
contiguously to disk.
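
A tiny Python sketch (illustrative only; the column values are invented) of the difference between a row-oriented and a column-oriented layout, showing why aggregating a single column is cheap in a column store:

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "East", "sales": 650000},
    {"id": 2, "region": "West", "sales": 500000},
    {"id": 3, "region": "East", "sales": 550000},
]
total_from_rows = sum(row["sales"] for row in rows)  # touches every full record

# Column-oriented layout: each column is stored contiguously on its own.
columns = {
    "id": [1, 2, 3],
    "region": ["East", "West", "East"],
    "sales": [650000, 500000, 550000],
}
total_from_columns = sum(columns["sales"])  # reads only the "sales" column

print(total_from_rows, total_from_columns)  # 1700000 1700000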
4) Graph Databases- A graph database focuses on the
relationship between data elements. Each element is stored as a
node (such as a person in a social media graph). The connections
between elements are called links or relationships. In a graph
database, connections are first-class elements of the database,
stored directly. In relational databases, links are implied, using
data to express the relationships. A graph database is optimized
to capture and search the connections between data elements,
overcoming the overhead associated with JOINing multiple
tables in SQL. Very few real-world business systems can
survive solely on graph queries. As a result, graph databases are
usually run alongside other more traditional databases.

Q.4) Illustrate the use of MapReduce with real-life databases and applications.
Ans- MapReduce is a programming model used to perform distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop work so fast. When you are dealing with Big Data, serial processing is no longer of any use. MapReduce has mainly two tasks, which are divided phase-wise:
 Map Task
 Reduce Task
Let us understand it with a real-life example that explains the MapReduce programming model in a story-like manner:
 Suppose the Indian government has assigned you the task of counting the population of India. You can demand all the resources you want, but you have to complete this task in 4 months. Calculating the population of such a large country is not an easy task for a single person (you). So what will be your approach?
 One of the ways to solve this problem is to divide the country by states and assign an individual in-charge to each state to count the population of that state.
 Task of each individual: Each individual has to visit every home present in the state and keep a record of each house's members as:
State_Name Member_House1
State_Name Member_House2
State_Name Member_House3
...
State_Name Member_House n
For simplicity, we have taken only three states.

This is a simple Divide and Conquer approach and will be followed by each individual to count people in his/her state.
 Once they have counted each house member in their respective state, they need to sum up their results and send them to the headquarters at New Delhi.
 We have a trained officer at the headquarters to receive all the results from each state and aggregate them state by state to get the population of each entire state. With this approach, you can easily count the population of India by summing up the results obtained at the headquarters.
 The Indian Govt. is happy with your work, and the next year they ask you to do the same job in 2 months instead of 4 months. Again, you will be provided with all the resources you want.
 Since the Govt. has provided you with all the resources, you will simply double the number of individuals in charge of each state from one to two. For that, divide each state into 2 divisions and assign a different in-charge to each division, as:
State_Name_Incharge_division1
State_Name_Incharge_division2
 Similarly, each individual in charge of a division will gather the information about members from each house and keep a record of it.
 We can also do the same thing at the headquarters, so let's divide the headquarters into two divisions as:
Head-quarter_Division1
Head-quarter_Division2
 Now, with this approach, you can find the population of India in two months. But there is a small problem: we never want divisions of the same state to send their results to different headquarters divisions. In that case, we would have partial populations of that state in Head-quarter_Division1 and Head-quarter_Division2, which is inconsistent, because we want the consolidated population by state, not partial counts.
 One easy way to solve this is to instruct all individuals of a given state to send their results to either Head-quarter_Division1 or Head-quarter_Division2, and similarly for all the other states.
 Our problem has been solved, and you successfully did it in
two months.
 Now, if they ask you to do this process in a month, you
know how to approach the solution.
 Great, now we have a good, scalable model that works well. The model we have seen in this example is like the MapReduce programming model. So now you must be aware that MapReduce is a programming model, not a programming language.

Now let’s discuss the phases and important things involved in our model.
1. Map Phase: The phase where the individuals in charge collect the population of each house in their division is the Map Phase.
 Mapper: The individual in charge of calculating the population
 Input Splits: The state or the division of the state
 Key-Value Pair: The output from each individual Mapper, e.g. the key is Rajasthan and the value is 2
2. Reduce Phase: The phase where you aggregate your results.
 Reducers: The individuals who aggregate the actual result; in our example, the trained officers. Each Reducer produces its output as a key-value pair.
3. Shuffle Phase: The phase where the data is copied from the Mappers to the Reducers is the Shuffle Phase. It comes in between the Map and Reduce phases. The Map Phase, Reduce Phase, and Shuffle Phase make up the three main phases of our MapReduce model.
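
A small Python sketch (not from the source; the state names and household counts are invented for illustration) that maps this population-count analogy onto explicit map, shuffle, and reduce steps:

from collections import defaultdict

# Mapper output: one (state, members_in_one_house) record per visited house.
mapped = [
    ("Rajasthan", 2), ("Rajasthan", 5),   # division 1 of Rajasthan
    ("Rajasthan", 4),                     # division 2 of Rajasthan
    ("Punjab", 3), ("Punjab", 6),
]

# Shuffle: all records of the same state go to the same headquarters division.
grouped = defaultdict(list)
for state, members in mapped:
    grouped[state].append(members)

# Reduce: the trained officer sums the counts for each state.
population = {state: sum(counts) for state, counts in grouped.items()}
print(population)  # {'Rajasthan': 11, 'Punjab': 9}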

Q.5) i) MapReduce Programming Model
ii) Matrix Vector Multiplication by MapReduce
Ans- i) MapReduce Programming Model- Big Data is a
collection of large datasets that cannot be processed using
traditional computing techniques. Traditional Enterprise
Systems normally have a centralized server to store and process
data. The following illustration depicts a schematic view of a
traditional enterprise system. The traditional model is certainly not suitable for processing huge volumes of scalable data, and such data cannot be accommodated by standard database servers. Moreover, the
centralized system creates too much of a bottleneck while
processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.

The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 The Map task takes a set of data and converts it into
another set of data, where individual elements are broken
down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input
and combines those data tuples (key-value pairs) into a
smaller set of tuples.

The reduce task is always performed after the map job.

a) Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
b) Map − Map is a user-defined function, which takes a series
of key-value pairs and processes each one of them to generate
zero or more key-value pairs.
c) Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
d) Combiner − A combiner is a type of local Reducer that
groups similar data from the map phase into identifiable sets. It
takes the intermediate keys from the mapper as input and applies
a user-defined code to aggregate the values in a small scope of
one mapper. It is not a part of the main MapReduce algorithm; it
is optional.
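
A rough Python sketch (illustrative only, not the Hadoop combiner API) of what a combiner does: it pre-aggregates the (word, 1) pairs produced by a single mapper before they are shuffled to the reducers, shrinking the data sent over the network:

from collections import Counter

# Output of one mapper before the combiner: one (word, 1) pair per occurrence.
mapper_output = [("ship", 1), ("ship", 1), ("train", 1)]

# Combiner: local aggregation within the scope of this single mapper.
combined = Counter()
for word, count in mapper_output:
    combined[word] += count

print(list(combined.items()))  # [('ship', 2), ('train', 1)] -- fewer pairs to shuffle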
e) Shuffle and Sort − The Reducer task starts with the Shuffle
and Sort step. It downloads the grouped key-value pairs onto the
local machine, where the Reducer is running. The individual
key-value pairs are sorted by key into a larger data list. The data
list groups the equivalent keys together so that their values can
be iterated easily in the Reducer task.
f) Reducer − The Reducer takes the grouped key-value paired
data as input and runs a Reducer function on each one of them.
Here, the data can be aggregated, filtered, and combined in a
number of ways, and it requires a wide range of processing.
Once the execution is over, it gives zero or more key-value pairs
to the final step.
g) Output Phase − In the output phase, we have an output
formatter that translates the final key-value pairs from the
Reducer function and writes them onto a file using a record
writer.
ii) Matrix Vector Multiplication by MapReduce-
MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel to make computation faster and save time; it is mostly used in distributed systems. It has 2 important parts:
Mapper: It takes the raw data input and organizes it into key-value pairs. For example, in a dictionary you search for the word “Data” and its associated meaning is “facts and statistics collected together for reference or analysis”. Here
the Key is “Data” and the Value associated with it is “facts and statistics collected together for reference or analysis”.
Reducer: It is responsible for processing the data in parallel and producing the final output.
Let us consider a matrix multiplication example to visualize MapReduce. Consider the following matrices (the values used in the worked example below):
A = [[1, 2], [3, 4]]    B = [[5, 6], [7, 8]]

Here matrix A is a 2×2 matrix, which means the number of rows (i) = 2 and the number of columns (j) = 2. Matrix B is also a 2×2 matrix, where the number of rows (j) = 2 and the number of columns (k) = 2. Each cell of the matrices is labelled as Aij or Bjk; for example, element 3 in matrix A is called A21, i.e. 2nd row, 1st column. One-step matrix multiplication has 1 mapper and 1 reducer. The formulas are:
Mapper for Matrix A (k, v)=((i, k), (A, j, Aij)) for all k
Mapper for Matrix B (k, v)=((i, k), (B, j, Bjk)) for all i
Therefore, computing the mapper for Matrix A (k, i, and j each take the values 1 and 2; when k=1, i can have the 2 values 1 and 2, and each case can further have j=1 or j=2; substituting all values in the formula):

k=1  i=1  j=1  ((1, 1), (A, 1, 1))
          j=2  ((1, 1), (A, 2, 2))
     i=2  j=1  ((2, 1), (A, 1, 3))
          j=2  ((2, 1), (A, 2, 4))
k=2  i=1  j=1  ((1, 2), (A, 1, 1))
          j=2  ((1, 2), (A, 2, 2))
     i=2  j=1  ((2, 2), (A, 1, 3))
          j=2  ((2, 2), (A, 2, 4))
Computing the mapper for Matrix B:

i=1  j=1  k=1  ((1, 1), (B, 1, 5))
          k=2  ((1, 2), (B, 1, 6))
     j=2  k=1  ((1, 1), (B, 2, 7))
          k=2  ((1, 2), (B, 2, 8))
i=2  j=1  k=1  ((2, 1), (B, 1, 5))
          k=2  ((2, 2), (B, 1, 6))
     j=2  k=1  ((2, 1), (B, 2, 7))
          k=2  ((2, 2), (B, 2, 8))
The formula for the Reducer is:

Reducer(k, v) = (i, k) => make sorted Alist and Blist
(i, k) => Summation(Aij * Bjk) for all j
Output => ((i, k), sum)

Therefore, computing the reducer:

From the Mapper computation we can observe that 4 key pairs are common: (1, 1), (1, 2), (2, 1) and (2, 2). Make a separate list for Matrix A and Matrix B with the adjoining values taken from the Mapper step above:

(1, 1) => Alist = {(A, 1, 1), (A, 2, 2)}
          Blist = {(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(1*5) + (2*7)] = 19 -------(i)

(1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}


Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(1*6) + (2*8)] =22 -------(ii)

(2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}


Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(3*5) + (4*7)] =43 -------(iii)

(2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)}


Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(3*6) + (4*8)] =50 -------(iv)

From (i), (ii), (iii) and (iv) we conclude that the reducer output is:
((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)
Therefore, the final matrix A x B is:
19  22
43  50
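
A compact Python sketch (illustrative only, not a Hadoop job) that follows the same mapper and reducer formulas for the 2×2 example above:

from collections import defaultdict

A = {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}   # A[i, j]
B = {(1, 1): 5, (1, 2): 6, (2, 1): 7, (2, 2): 8}   # B[j, k]
num_i = num_k = 2  # rows of A and columns of B

# Map phase: emit ((i, k), (matrix, j, value)) pairs exactly as in the formulas.
mapped = []
for (i, j), a in A.items():
    for k in range(1, num_k + 1):
        mapped.append(((i, k), ("A", j, a)))
for (j, k), b in B.items():
    for i in range(1, num_i + 1):
        mapped.append(((i, k), ("B", j, b)))

# Shuffle: group the emitted values by their (i, k) key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: for each (i, k), sum Aij * Bjk over j.
result = {}
for (i, k), values in grouped.items():
    a_by_j = {j: v for name, j, v in values if name == "A"}
    b_by_j = {j: v for name, j, v in values if name == "B"}
    result[(i, k)] = sum(a_by_j[j] * b_by_j[j] for j in a_by_j)

print(result)  # {(1, 1): 19, (1, 2): 22, (2, 1): 43, (2, 2): 50}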
