
1. Explain the life cycle of Data Analytics with an example

Data Analytics Lifecycle

The lifecycle has six phases, and project work can occur in several phases at once.
• Phase 1—Discovery: In Phase 1, the team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.
• Phase 2—Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox; the two approaches are sometimes combined and abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data.
• Phase 3—Model planning: Phase 3 is model planning, where the team determines the methods,
techniques, and workflow it intends to follow for the subsequent model building phase. The team
explores the data to learn about the relationships between variables and subsequently selects key
variables and the most suitable models.
• Phase 4—Model building: In Phase 4, the team develops datasets for testing, training, and production
purposes. In addition, in this phase the team builds and executes models based on the work done in the
model planning phase. The team also considers whether its existing tools will suffice for running the
models, or if it will need a more robust environment for executing models and workflows (for example,
fast hardware and parallel processing, if applicable).
• Phase 5—Communicate results: In Phase 5, the team, in collaboration with major stakeholders,
determines if the results of the project are a success or a failure based on the criteria developed in
Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to
summarize and convey findings to stakeholders.
• Phase 6—Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical
documents. In addition, the team may run a pilot project to implement the models in a production
environment.
2. Explain Generic classes and the concept of Serialization with examples (programs)

Generic Classes in R:
• In R, you can create generic functions and methods to work with different types of data. While R doesn't have traditional generic
classes like some other languages, it has a flexible data structure system that allows you to work with different types of objects.

• Generic classes are classes that can be used with different data types. They provide flexibility and reusability. For example, let's say
we have a generic class called "List" that can store any type of data. We can create an instance of the "List" class and add integers,
strings, or any other data type to it.
Here's a simple example in R, using an S4 class whose data slot can hold values of any type:

# Define an S4 class called List with a single slot that holds a list
setClass("List", representation(data = "list"))
# Create an instance of the List class
myList <- new("List", data = list())
# Add different types of data to the list (S4 slots are accessed with @, not $)
myList@data <- c(myList@data, 10, "Hello", TRUE)
# Print the list
print(myList@data)
Serialization in R:

• Serialization in R is typically achieved using the saveRDS and readRDS functions. These functions allow you to save R
objects to a file in a serialized format and then read them back. RDS stands for "R Data Serialization."

• Serialization is the process of converting an object into a format that can be stored or transmitted and then reconstructing it back into an object later. It allows us to save the state of an object and restore it when needed.

• Here's an example of serialization in R:

# Serialize an object to a raw vector
serializedObject <- serialize(myList, NULL)
# Save the serialized object to a file
saveRDS(serializedObject, "serialized_object.rds")
# Load the serialized object from the file
loadedObject <- readRDS("serialized_object.rds")
# Unserialize the raw vector back into the original object
unserializedList <- unserialize(loadedObject)
# Print the recovered data (S4 slots are accessed with @)
print(unserializedList@data)
# Note: saveRDS(myList, "file.rds") and readRDS("file.rds") already serialize and
# deserialize the object; serialize()/unserialize() are shown here to illustrate
# the lower-level functions.
3. What is HDFS? List all the components (building blocks) of HDFS and explain with a neat diagram
• Hadoop is a framework that uses distributed storage and parallel processing to store and manage big data. It is the software
most used by data analysts to handle big data, and its market size continues to grow. There are three components of Hadoop:
Hadoop HDFS - Hadoop Distributed File System (HDFS) is the storage unit.
Hadoop MapReduce - Hadoop MapReduce is the processing unit.
Hadoop YARN - Yet Another Resource Negotiator (YARN) is a resource management unit.
Building blocks of Hadoop (Namenode, Datanode, Secondary Namenode, JobTracker, TaskTracker)

1. In a fully configured cluster, “running Hadoop” means running a set of daemons, or resident programs, on the different servers in your network.

2. These daemons have specific roles; some exist only on one server, some exist across multiple servers.

3. The daemons include:

■ NameNode ■ DataNode ■ Secondary NameNode ■ JobTracker ■ TaskTracker
• These building blocks are:

1. NameNode:
It is the master node that manages the file system namespace and keeps track of the metadata of all the files and directories in
HDFS.

2. DataNodes:
These are the worker nodes that store the actual data blocks of files. They receive instructions from the NameNode and perform
read and write operations on the data.
3. Secondary NameNode:
It assists the NameNode by periodically checkpointing the file system metadata and creating a new image of the file system
namespace.
4. JobTracker:
The JobTracker is the liaison between the client application and Hadoop. Once the client submits its code to the cluster, the JobTracker determines the execution plan by deciding which files to process, assigns nodes to the different tasks, and monitors the tasks as they run.
5. TaskTracker:
While the JobTracker is the master overseeing the overall execution of a MapReduce job, TaskTrackers manage the execution of individual tasks on each worker node. A TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel.
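To make the NameNode and DataNode roles above concrete, here is a minimal sketch of a client writing a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The NameNode address hdfs://localhost:9000 and the file path are illustrative, and the sketch assumes the Hadoop client libraries are on the classpath; the client asks the NameNode for metadata and block placement, while the file bytes themselves are streamed to DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // fs.defaultFS points the client at the NameNode (illustrative address)
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");  // illustrative path

        // The NameNode records the file's metadata; the data blocks are stored on DataNodes
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello HDFS");
        }

        System.out.println("File exists in HDFS: " + fs.exists(file));
        fs.close();
    }
}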
4. Explain in detail how to configure/install Hadoop in pseudo-distributed mode
To configure Hadoop in pseudo-distributed mode, you can follow these steps:

1. Download Hadoop: Go to the Apache Hadoop website and download the latest stable release of Hadoop.

2. Extract the files: Once the download is complete, extract the Hadoop files to a directory of your choice.

3. Configure Hadoop: Open the hadoop-env.sh file in the etc/hadoop directory and set the JAVA_HOME variable to the path of your Java installation.

4. Configure core-site.xml: Open the core-site.xml file in the etc/hadoop directory and set the following properties (sample snippets for this step and step 5 are shown after the list):
- Set the fs.defaultFS property to hdfs://localhost:9000.
- Set the hadoop.tmp.dir property to a directory where Hadoop can store temporary files.

5. Configure hdfs-site.xml: Open the hdfs-site.xml file in the etc/hadoop directory and set the following properties:
- Set the dfs.replication property to 1 to have a single replica of each data block.
- Set the dfs.namenode.name.dir property to a directory where the NameNode can store its metadata.
6. Format the NameNode: Open the terminal and navigate to the Hadoop directory. Run the command bin/hdfs namenode -format to format the NameNode.

7. Start Hadoop: Run the command sbin/start-dfs.sh to start the HDFS daemons (and sbin/start-yarn.sh if you also want to start YARN).
8. Verify the installation: Open your web browser and go to http://localhost:9870 to access the Hadoop NameNode web interface. You should see the
Hadoop cluster information.
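For reference, minimal configuration snippets matching steps 4 and 5 might look like the following; the port 9000 follows the value given above, while the local directories are illustrative and should be changed to suit your machine.

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- illustrative local directory for temporary files -->
    <value>/home/hadoop/tmp</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <!-- illustrative local directory for NameNode metadata -->
    <value>/home/hadoop/hdfs/namenode</value>
  </property>
</configuration>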

Now you have Hadoop configured in pseudo-distributed mode! You can run MapReduce jobs and store data in HDFS.

5. Explain the Hadoop API for the MapReduce framework and develop the Java code for the MapReduce word count problem
Hadoop MapReduce API:
The Hadoop MapReduce API consists of classes and interfaces that you can use to develop MapReduce applications. Key components of the API include the Mapper and Reducer base classes that hold the map and reduce logic, the Job class used to configure and submit a job, Writable key/value types such as Text, IntWritable, and LongWritable, and the InputFormat and OutputFormat classes that control how input is split and how output is written.
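As a sketch of how these pieces fit together, the classic word count job can be written against the org.apache.hadoop.mapreduce API roughly as follows; the class and jar names are illustrative, and the code assumes the Hadoop client libraries are on the classpath.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in each input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configures and submits the job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Assuming the class is packaged into a jar named wordcount.jar, it could be run on the pseudo-distributed setup from question 4 with a command along the lines of: hadoop jar wordcount.jar WordCount /input /output.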
