24afi08 Big Data File
EXPERIMENT 1
AIM: Install and configure Apache Hadoop in a single-node setup on a local machine.
THEORY :
Apache Hadoop is an open-source framework designed for distributed storage and processing of large datasets using the MapReduce programming model. It runs on a cluster of computers and provides scalability, fault tolerance, and high availability. Hadoop consists of four main modules: HDFS (the Hadoop Distributed File System, for distributed storage), YARN (for cluster resource management and job scheduling), MapReduce (for distributed data processing), and Hadoop Common (the shared utilities used by the other modules).
Procedure :
1. Download Hadoop:
Visit the official Apache Hadoop website and download the desired version of
Hadoop.
2. Install Java:
Hadoop requires a Java runtime (JDK 8 or later). Install the JDK and set the JAVA_HOME environment variable.
5. Configure Hadoop Files:
Edit the configuration files located in the etc/hadoop directory (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and hadoop-env.sh).
6. Format NameNode:
Run the command hdfs namenode -format to initialize the HDFS metadata.
7. Start Hadoop Services:
Run the start-dfs.sh and start-yarn.sh scripts to start the HDFS and YARN services.
8. Verify Installation:
Run the jps command to confirm that the NameNode, DataNode, ResourceManager, and NodeManager processes are running, and open the NameNode web UI in a browser.
Conclusion:
You have successfully installed Hadoop in a single-node setup on your local machine. You can
now run MapReduce jobs and explore HDFS functionalities.
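If you also want to confirm the setup from code, a small Java client along the following lines can list the HDFS root directory. This is only a sketch: it assumes the Hadoop client libraries are on the classpath and that fs.defaultFS is hdfs://localhost:9000, which should be changed to match your core-site.xml.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS URI for a typical single-node installation (assumed; match the value in core-site.xml)
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        // List the contents of the HDFS root directory
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}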
EXPERIMENT 2
AIM: Develop a MapReduce program to calculate the frequency of a given word in a given
file
THEORY :
Program Structure
1. Mapper Class : The Mapper class splits each line of input text into words and emits the target word with a count of 1 every time it appears.
2. Reducer Class : The Reducer class sums up the counts emitted for the word and outputs its total frequency.
3. Driver Class : The Driver class sets up and configures the MapReduce job (mapper, reducer, output types, and input/output paths) and submits it to the cluster.
CODE :
WordFrequencyMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Class and field names assumed; the original listing did not include the class declaration.
public class WordFrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // The word whose frequency is counted; in practice this could be read from the job configuration.
    private String specificWord = "fox";
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split("\\s+");
        for (String str : words) {
            // Compare case-insensitively after trimming
            if (str.toLowerCase().trim().equals(specificWord)) {
                word.set(specificWord);
                context.write(word, one);
            }
        }
    }
}
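The Reducer class described in the theory is not shown in the listing above. A minimal sketch, assuming the same Text/IntWritable types used by the mapper and the class name WordFrequencyReducer, is:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// WordFrequencyReducer.java - sums the counts emitted by the mapper for each word
public class WordFrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();      // add up each 1 emitted for this word
        }
        result.set(sum);
        context.write(key, result);  // emit (word, total frequency)
    }
}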
// Driver tail (class names assumed to match the mapper and reducer above)
job.setMapperClass(WordFrequencyMapper.class);
job.setReducerClass(WordFrequencyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
OUTPUT :
Created a sample input file named ‘sample_input.txt’ with the following content:
Sample Input :- The quick brown fox jumps over the lazy dog. The dog barks at the fox. Quick,
the fox runs away. The sly fox outsmarts the hound. No foxes were seen today.
Now we would run the MapReduce job with a command similar to:
Sample Output :- The output file (typically named part-r-00000) would contain:-
EXPERIMENT 3
AIM: Develop a MapReduce program to find the maximum temperature in each year
CODE :
MaxTemperatureMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Class and field names assumed; the original listing did not include the class declaration.
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private IntWritable year = new IntWritable();
    private IntWritable temperature = new IntWritable();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(",");
        // Record format (assumed): year,<station id>,temperature
        if (parts.length == 3) {
            try {
                int yearValue = Integer.parseInt(parts[0]);
                int tempValue = Integer.parseInt(parts[2]);
                year.set(yearValue);
                temperature.set(tempValue);
                context.write(year, temperature);
            } catch (NumberFormatException e) {
                // Skip lines with invalid data
            }
        }
    }
}
MaxTemperatureReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
// Class name assumed; the original listing did not include the class declaration.
public class MaxTemperatureReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    private IntWritable maxTemp = new IntWritable();
    @Override
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        maxTemp.set(max);
        context.write(key, maxTemp);
    }
}
MaxTemperatureDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
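The body of the driver class is missing from the listing. A minimal sketch that completes it directly below the imports above, assuming the class names MaxTemperatureDriver, MaxTemperatureMapper, and MaxTemperatureReducer used in this experiment, is:
public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}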
After compiling the Java files and creating a JAR, we would run the MapReduce job with a
command like:
Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 4
AIM: Develop a MapReduce program to compute each student's average score and assign a grade from the given scores file
CODE :
StudentScoresMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Class and field names assumed; the original listing did not include the class declaration.
public class StudentScoresMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text studentId = new Text();
    private Text scoreInfo = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(",");
        // Record format (assumed): studentId,subject,score
        if (parts.length == 3) {
            studentId.set(parts[0]);
            scoreInfo.set(parts[1] + "," + parts[2]);
            context.write(studentId, scoreInfo);
        }
    }
}
StudentScoresReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Class name assumed; the grade boundaries in calculateGrade() are also assumed, as they were not given.
public class StudentScoresReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int totalScore = 0;
        int count = 0;
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            totalScore += Integer.parseInt(parts[1]); // score is the second part of "subject,score"
            count++;
        }
        double average = (double) totalScore / count;
        String grade = calculateGrade(average);
        result.set(String.format("Average: %.2f, Grade: %s", average, grade));
        context.write(key, result);
    }
    // Maps an average score to a letter grade (boundaries assumed for illustration)
    private String calculateGrade(double average) {
        if (average >= 90) return "A";
        else if (average >= 80) return "B";
        else if (average >= 70) return "C";
        else if (average >= 60) return "D";
        else return "F";
    }
}
StudentScoresDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
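The body of the driver class is missing from the listing. A minimal sketch that completes it directly below the imports above, assuming the class names StudentScoresDriver, StudentScoresMapper, and StudentScoresReducer used in this experiment, is:
public class StudentScoresDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "student average score and grade");
        job.setJarByClass(StudentScoresDriver.class);
        job.setMapperClass(StudentScoresMapper.class);
        job.setReducerClass(StudentScoresReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. path to student_scores.txt in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}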
Created a sample input file named ‘student_scores.txt’ with the following content:
After compiling the Java files and creating a JAR, you would run the MapReduce job with a
command like:
Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 5
AIM: Develop a MapReduce program to implement Matrix Multiplication
THEORY:
Matrix multiplication is a fundamental operation in linear algebra with applications in fields such as computer graphics, scientific computing, and machine learning. For an m x p matrix A and a p x n matrix B, each output cell is C[i][j] = sum over k of A[i][k] * B[k][j]. When the matrices are large, this computation becomes expensive, and MapReduce offers a natural way to distribute it: the mapper sends every element of A and B to each output cell (i, j) it contributes to, and the reducer for a cell multiplies the matching pairs and sums them. This allows very large matrices to be processed across multiple nodes.
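For reference, the computation that the MapReduce job below distributes is the ordinary triple-loop product, shown here as plain single-machine Java; the matrices are chosen only for illustration:
public class MatrixMultiplyLocal {
    public static void main(String[] args) {
        int[][] a = {{1, 2, 3}, {4, 5, 6}};       // A is m x p (2 x 3)
        int[][] b = {{7, 8}, {9, 10}, {11, 12}};  // B is p x n (3 x 2)
        int m = a.length, p = b.length, n = b[0].length;
        int[][] c = new int[m][n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < p; k++) {
                    c[i][j] += a[i][k] * b[k][j]; // C[i][j] = sum over k of A[i][k] * B[k][j]
                }
            }
        }
        for (int[] row : c) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}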
CODE :
MatrixMultiplicationMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MatrixMultiplicationMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(",");
        // Input format: matrixName,row,col,value
        if (parts.length == 4) {
            String matrixName = parts[0];
            int row = Integer.parseInt(parts[1]);
            int col = Integer.parseInt(parts[2]);
            int val = Integer.parseInt(parts[3]);
            if (matrixName.equals("A")) {
                // Element A[row][col] contributes to every output cell (row, i) for i = 0..n-1
                for (int i = 0; i < context.getConfiguration().getInt("n", 10); i++) {
                    outputKey.set(row + "," + i);
                    outputValue.set("A," + col + "," + val);
                    context.write(outputKey, outputValue);
                }
            } else if (matrixName.equals("B")) {
                // Element B[row][col] contributes to every output cell (i, col) for i = 0..m-1
                for (int i = 0; i < context.getConfiguration().getInt("m", 10); i++) {
                    outputKey.set(i + "," + col);
                    outputValue.set("B," + row + "," + val);
                    context.write(outputKey, outputValue);
                }
            }
        }
    }
}
MatrixMultiplicationReducer.java Code :-
import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MatrixMultiplicationReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Collect the A and B contributions for this output cell, indexed by the shared dimension k
        HashMap<Integer, Integer> mapA = new HashMap<>();
        HashMap<Integer, Integer> mapB = new HashMap<>();
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            if (parts[0].equals("A")) {
                mapA.put(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
            } else if (parts[0].equals("B")) {
                mapB.put(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
            }
        }
        // Multiply matching pairs A[i][k] * B[k][j] and sum over k
        int sum = 0;
        for (int i = 0; i < context.getConfiguration().getInt("p", 10); i++) {
            if (mapA.containsKey(i) && mapB.containsKey(i)) {
                sum += mapA.get(i) * mapB.get(i);
            }
        }
        if (sum != 0) {
            result.set(Integer.toString(sum));
            context.write(key, result);
        }
    }
}
MatrixMultiplicationDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MatrixMultiplicationDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set matrix dimensions: A(m x p), B(p x n)
        conf.setInt("m", 2);
        conf.setInt("p", 3);
        conf.setInt("n", 2);
        Job job = Job.getInstance(conf, "matrix multiplication");
        job.setJarByClass(MatrixMultiplicationDriver.class);
        job.setMapperClass(MatrixMultiplicationMapper.class);
        job.setReducerClass(MatrixMultiplicationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘matrix_input.txt’ with the following content:
EnergyReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
public class EnergyReducer extends Reducer<Text, DoubleWritable, Text, Text> {
    private MultipleOutputs<Text, Text> mos;
    private Text result = new Text();
    @Override
    public void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        String keyStr = key.toString();
        if (keyStr.contains(",")) { // Monthly average computation (composite key containing a comma)
            double sum = 0;
            int count = 0;
            for (DoubleWritable val : values) {
                sum += val.get();
                count++;
            }
            double avg = sum / count;
            result.set(String.format("%.2f", avg));
            mos.write("monthlyAvg", key, result);
        } else { // Maximum consumption per city
            double max = Double.MIN_VALUE;
            for (DoubleWritable val : values) {
                max = Math.max(max, val.get());
            }
            result.set(String.format("%.2f", max));
            mos.write("maxConsumption", key, result);
        }
    }
    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
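The mapper class referenced by the driver below is not included in the listing. A minimal sketch, assuming input lines of the form city,month,consumption (matching energy_input.csv) and the key convention used by the reducer above, might be:
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// EnergyMapper.java - emits two records per reading: one keyed by city (for the per-city maximum)
// and one keyed by "city,month" (for the monthly average). The CSV column layout is an assumption.
public class EnergyMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private Text outputKey = new Text();
    private DoubleWritable consumption = new DoubleWritable();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length >= 3) {
            try {
                String city = fields[0].trim();
                String month = fields[1].trim();
                double usage = Double.parseDouble(fields[2].trim());
                consumption.set(usage);
                outputKey.set(city);               // key for the per-city maximum
                context.write(outputKey, consumption);
                outputKey.set(city + "," + month); // composite key for the monthly average
                context.write(outputKey, consumption);
            } catch (NumberFormatException e) {
                // Skip malformed records
            }
        }
    }
}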
EnergyDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class EnergyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Energy Consumption Analysis");
        job.setJarByClass(EnergyDriver.class);
        // Configure Mapper and Reducer
        job.setMapperClass(EnergyMapper.class);
        job.setReducerClass(EnergyReducer.class);
        // Set output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Define MultipleOutputs
        MultipleOutputs.addNamedOutput(job, "maxConsumption",
                TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "monthlyAvg",
                TextOutputFormat.class, Text.class, Text.class);
        // Set input/output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘energy_input.csv’ with the following content:
Sample Output :-
# View maximum consumption results
hadoop fs -cat /user/output/energy_results/maxConsumption-r-00000
AIM: Develop a MapReduce program to analyse sales data and find the number of products sold in each country.
CODE :
SalesMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class SalesMapper extends Mapper<Object, Text, Text, IntWritable> {
    private Text country = new Text();
    private final static IntWritable one = new IntWritable(1);
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split(","); // Assuming fields are comma-separated
        if (fields.length >= 8) { // Ensure there are enough fields
            String countryName = fields[7].trim(); // Country is in the 8th column (index 7)
            country.set(countryName);
            context.write(country, one); // Emit (country, 1) for each product sold
        }
    }
}
SalesReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // Emit (country, total products sold)
    }
}
SalesAnalysisDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SalesAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sales analysis");
        job.setJarByClass(SalesAnalysisDriver.class);
        job.setMapperClass(SalesMapper.class);
        job.setReducerClass(SalesReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘sales_data.txt’ with the following content:
Sample Input :-
Each line represents one sales transaction, with the country name in the 8th comma-separated field.
CODE :
MovieTagDriver.java Code :- Sets up the job that aggregates tags for each movie and pairs them with titles (class name assumed; only the end of the driver listing is shown).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
OUTPUT :-
Created a sample input file for movies (e.g. movies.csv) with the following content:
Created a sample input file for tags (e.g. tags.csv) with the following content:
Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 10
AIM: XYZ.com is an online music website where users listen to various tracks, and the data collected from these listens is stored in log files. Each log record contains the pipe-separated fields UserId|TrackId|Shared|Radio|Skip. Develop a MapReduce program to compute, for each track, the number of unique listeners, the number of times the track was shared, the number of radio plays, the total number of plays, and the number of times the track was skipped on the radio.
CODE :
First, let's define constants to make the code more readable:
MusicConstants.java Code :-
package com.xyz.music; // package name assumed
public class MusicConstants {
// Indices for fields in each log record
public static final int USER_ID = 0;
public static final int TRACK_ID = 1;
public static final int IS_SHARED = 2;
public static final int IS_RADIO = 3;
public static final int IS_SKIPPED = 4;
// Types of metrics to calculate
public static final String UNIQUE_LISTENERS = "unique_listeners";
public static final String SHARED_COUNT = "shared_count";
public static final String RADIO_PLAYS = "radio_plays";
public static final String TOTAL_PLAYS = "total_plays";
public static final String RADIO_SKIPS = "radio_skips";
}
MusicStatsMapper.java Code :-
package com.xyz.music; // package name assumed
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MusicStatsMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split("\\|");
        // Skip malformed records
        if (fields.length != 5) return;
        try {
            String userId = fields[MusicConstants.USER_ID].trim();
            String trackId = fields[MusicConstants.TRACK_ID].trim();
            int shared = Integer.parseInt(fields[MusicConstants.IS_SHARED].trim());
            int radio = Integer.parseInt(fields[MusicConstants.IS_RADIO].trim());
            int skipped = Integer.parseInt(fields[MusicConstants.IS_SKIPPED].trim());
            // For unique listeners - emit (metric:trackId, userId)
            outputKey.set(MusicConstants.UNIQUE_LISTENERS + ":" + trackId);
            outputValue.set(userId);
            context.write(outputKey, outputValue);
            // For shared count - emit only if shared=1
            if (shared == 1) {
                outputKey.set(MusicConstants.SHARED_COUNT + ":" + trackId);
                outputValue.set("1");
                context.write(outputKey, outputValue);
            }
            // For radio plays - emit only if radio=1
            if (radio == 1) {
                outputKey.set(MusicConstants.RADIO_PLAYS + ":" + trackId);
                outputValue.set("1");
                context.write(outputKey, outputValue);
            }
            // For total plays - emit for every record
            outputKey.set(MusicConstants.TOTAL_PLAYS + ":" + trackId);
            outputValue.set("1");
            context.write(outputKey, outputValue);
            // For radio skips - emit only if radio=1 and skipped=1
            if (radio == 1 && skipped == 1) {
                outputKey.set(MusicConstants.RADIO_SKIPS + ":" + trackId);
                outputValue.set("1");
                context.write(outputKey, outputValue);
            }
        } catch (NumberFormatException e) {
            // Skip records with invalid number format
        }
    }
}
MusicStatsReducer.java Code :-
package com.xyz.music; // package name assumed
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MusicStatsReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String keyStr = key.toString();
        String[] keyParts = keyStr.split(":");
        if (keyParts.length != 2) return;
        String metricType = keyParts[0];
        String trackId = keyParts[1];
        if (metricType.equals(MusicConstants.UNIQUE_LISTENERS)) {
            // Count unique listeners
            Set<String> uniqueUsers = new HashSet<>();
            for (Text value : values) {
                uniqueUsers.add(value.toString());
            }
            result.set("Unique Listeners: " + uniqueUsers.size());
        } else {
            // Count other metrics (shared, radio plays, total plays, skips)
            int count = 0;
            for (Text value : values) {
                count++;
            }
            if (metricType.equals(MusicConstants.SHARED_COUNT)) {
                result.set("Shared Count: " + count);
            } else if (metricType.equals(MusicConstants.RADIO_PLAYS)) {
                result.set("Radio Plays: " + count);
            } else if (metricType.equals(MusicConstants.TOTAL_PLAYS)) {
                result.set("Total Plays: " + count);
            } else if (metricType.equals(MusicConstants.RADIO_SKIPS)) {
                result.set("Radio Skips: " + count);
            }
        }
        context.write(new Text("Track " + trackId), result);
    }
}
MusicStatsDriver.java Code :-
package com.xyz.music; // package name assumed
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MusicStatsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "XYZ Music Stats");
        job.setJarByClass(MusicStatsDriver.class);
        // Set mapper and reducer classes
        job.setMapperClass(MusicStatsMapper.class);
        job.setReducerClass(MusicStatsReducer.class);
        // Set output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Set input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
1. First, compiled the Java files and created a JAR.
2. Created a sample input file named ‘music_logs.txt’.
3. Placed the input file in HDFS and ran the MapReduce job.
Sample Output :- The output file (typically named part-r-00000) would contain:
hadoop fs -cat /user/output/music_stats/part-r-00000
EXPERIMENT 11
AIM: Develop a MapReduce program to find the frequency of books published each year
and find in which year maximum number of books were published using the following data.
CODE :
BookPublicationMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class BookPublicationMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text year = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split(",");
        // Skip header or malformed rows
        if (fields.length >= 3 && !fields[0].equals("Title")) {
            try {
                // Extract the published year (assuming it's in the 3rd column)
                String publishedYear = fields[2].trim();
                year.set(publishedYear);
                // Emit (year, 1) for each book
                context.write(year, one);
            } catch (Exception e) {
                // Skip rows with parsing errors
            }
        }
    }
}
BookPublicationReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class BookPublicationReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Sum up all occurrences (always 1) of books in this year
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
MaxPublicationMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxPublicationMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final static Text maxKey = new Text("MAX");
    private Text yearCount = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split("\\s+");
        if (parts.length == 2) {
            // Format: "year count"
            yearCount.set(line);
            // Emit with a single key so all records go to one reducer
            context.write(maxKey, yearCount);
        }
    }
}
MaxPublicationReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxPublicationReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int maxCount = -1;
        String maxYear = "";
        // Find the year with maximum publications
        for (Text val : values) {
            String[] parts = val.toString().split("\\s+");
            String year = parts[0];
            int count = Integer.parseInt(parts[1]);
            if (count > maxCount) {
                maxCount = count;
                maxYear = year;
            }
        }
        result.set("Maximum number of books (" + maxCount + ") were published in year " + maxYear);
        context.write(new Text("Result:"), result);
    }
}
BookPublicationDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class BookPublicationDriver {
    public static void main(String[] args) throws Exception {
        // Job 1: Count books per year
        Configuration conf1 = new Configuration();
        Job job1 = Job.getInstance(conf1, "Book Publication Count");
        job1.setJarByClass(BookPublicationDriver.class);
        job1.setMapperClass(BookPublicationMapper.class);
        job1.setReducerClass(BookPublicationReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        Path tempOutput = new Path(args[1] + "_temp");
        FileOutputFormat.setOutputPath(job1, tempOutput);
        job1.waitForCompletion(true);
        // Job 2: Find year with maximum publications
        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2, "Max Publication Year");
        job2.setJarByClass(BookPublicationDriver.class);
        job2.setMapperClass(MaxPublicationMapper.class);
        job2.setReducerClass(MaxPublicationReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job2, tempOutput);
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
1. First, compile the Java files and create a JAR:
2. Created a sample input file named ‘book_data.csv’
Sample Output :- The output file (typically named part-r-00000) would contain:
hadoop fs -cat /user/output/book_analysis/part-r-00000
The first job produces year-count pairs:
AIM: Develop a MapReduce program to analyse the Titanic dataset and find (i) the average age of the passengers, by gender, who died in the tragedy and (ii) the number of passengers who survived in each travel class.
CODE :
TitanicAgeMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TitanicAgeMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private Text gender = new Text();
    private DoubleWritable age = new DoubleWritable();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split on commas that are not inside quotes, since passenger names contain commas
        String[] fields = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        // Skip malformed records
        if (fields.length >= 6) {
            String survived = fields[1].trim();
            String sex = fields[4].trim();
            String ageStr = fields[5].trim();
            // In the dataset, Survived = 1 means the passenger survived and 0 means the passenger died.
            // Process only passengers who died and whose age is valid.
            if ("0".equals(survived) && ageStr.matches("\\d+(\\.\\d+)?")) {
                double ageVal = Double.parseDouble(ageStr);
                gender.set(sex);
                age.set(ageVal);
                context.write(gender, age);
            }
        }
    }
}
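The corresponding reducer, which averages the ages per gender, is not included in the listing. A minimal sketch is shown below; the class name TitanicAgeReducer is assumed, and the driver below is wired to it.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// TitanicAgeReducer.java - computes the average age of deceased passengers for each gender
public class TitanicAgeReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private DoubleWritable result = new DoubleWritable();
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        for (DoubleWritable val : values) {
            sum += val.get();
            count++;
        }
        result.set(sum / count);      // average age for this gender
        context.write(key, result);
    }
}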
TagMapper.java Code :- (part of the movie-tag experiment above)
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
public class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed tags.csv format: userId,movieId,tag,timestamp
        String[] fields = value.toString().split(",");
        if (fields.length >= 3) {
            String movieId = fields[1].trim();
            String tag = fields[2].trim();
            context.write(new Text(movieId), new Text("TAG:" + tag));
        }
    }
}
TitanicSurvivorsMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TitanicSurvivorsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text passengerClass = new Text();
    private static final IntWritable one = new IntWritable(1);
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split on commas that are not inside quotes, since passenger names contain commas
        String[] fields = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        // Skip malformed records
        if (fields.length >= 3) {
            String survived = fields[1].trim();
            // Process only passengers who survived (Survived = 1)
            if ("1".equals(survived)) {
                String pclass = "Class " + fields[2].trim();
                passengerClass.set(pclass);
                context.write(passengerClass, one);
            }
        }
    }
}
TitanicSurvivorsReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class TitanicSurvivorsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
TitanicAnalysisDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TitanicAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job 1: Average age by gender for deceased passengers
        Job job1 = Job.getInstance(conf, "Average Age of Deceased");
        job1.setJarByClass(TitanicAnalysisDriver.class);
        job1.setMapperClass(TitanicAgeMapper.class);
        job1.setReducerClass(TitanicAgeReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path(args[1] + "_avg_age"));
        job1.waitForCompletion(true);
        // Job 2: Number of survivors in each passenger class (output path suffix assumed)
        Job job2 = Job.getInstance(conf, "Survivors per Class");
        job2.setJarByClass(TitanicAnalysisDriver.class);
        job2.setMapperClass(TitanicSurvivorsMapper.class);
        job2.setReducerClass(TitanicSurvivorsReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job2, new Path(args[0]));
        FileOutputFormat.setOutputPath(job2, new Path(args[1] + "_survivors"));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘titanic_data.csv’ with the following content:
PassengerId,Survived,Pclass,Name,Sex,Age
1,1,1,"Allen, Miss. Elisabeth Walton",female,29
2,0,1,"Allison, Master. Hudson Trevor",male,0.92
3,1,3,"Andersson, Mr. Anders Johan",male,39
4,0,3,"Andersson, Miss. Ingeborg Constanzia",female,9
5,1,3,"Andersson, Miss. Sigrid Elisabeth",female,11
6,0,1,"Andrews, Miss. Kornelia Theodosia",female,63
7,1,1,"Appleton, Mrs. Edward Dale",female,53
8,0,2,"Arnold-Franchi, Mrs. Josef",female,18
9,1,3,"Baclini, Miss. Eugenie",female,0.75
10,0,3,"Baclini, Miss. Helene Barbara",female,0.33
11,1,2,"Backstrom, Mrs. Karl Alfred",female,33
12,0,2,"Backstrom, Miss. Kristina",female,20
13,1,3,"Baclini, Miss. Marie Catherine",female,5
14,0,1,"Baxter, Mrs. James",female,50
15,1,2,"Becker, Miss. Marion Louise",female,4
16,0,3,"Bourke, Miss. Mary",female,28
17,1,1,"Brown, Mrs. James Joseph",female,44
18,0,2,"Brown, Miss. Amelia",female,24
19,1,3,"Cacic, Miss. Marija",female,30
20,0,3,"Cacic, Mr. Luka",male,38
Sample Output :- The output file (typically named part-r-00000) would contain:
AIM: Develop a MapReduce program to analyse the Uber trip dataset and find the total number of trips for each dispatching base on each day of the week.
CODE :
UberTripsMapper.java Code :-
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Class and field names assumed; the input layout is assumed to be: dispatching_base_number,date,active_vehicles,trips
public class UberTripsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text outputKey = new Text();
    private IntWritable tripCount = new IntWritable();
    private final SimpleDateFormat dateFormat = new SimpleDateFormat("MM/dd/yyyy"); // date format assumed
    private final String[] days = {"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"};
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Skip header row
        if (key.get() == 0 && value.toString().contains("dispatching_base_number")) {
            return;
        }
        String[] fields = value.toString().split(",");
        if (fields.length >= 4) {
            String baseNumber = fields[0].trim();
            String dateStr = fields[1].trim();
            try {
                int trips = Integer.parseInt(fields[3].trim());
                // Parse date and extract day of week (getDay(): 0 = Sunday ... 6 = Saturday)
                Date date = dateFormat.parse(dateStr);
                String dayOfWeek = days[date.getDay()];
                // Create composite key: "baseNumber dayOfWeek"
                outputKey.set(baseNumber + " " + dayOfWeek);
                tripCount.set(trips);
                context.write(outputKey, tripCount);
            } catch (ParseException e) {
                // Skip records with invalid date format
                System.err.println("Error parsing date: " + dateStr);
            } catch (NumberFormatException e) {
                // Skip records with an invalid trip count
            }
        }
    }
}
UberTripsReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Class name assumed
public class UberTripsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Add up the trips for this base and day of the week
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
UberTripsDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Class name assumed
public class UberTripsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Uber Trips Analysis");
        job.setJarByClass(UberTripsDriver.class);
        job.setMapperClass(UberTripsMapper.class);
        job.setReducerClass(UberTripsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘uber_data.csv’ with the following content:
Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 14
AIM: Develop a program to calculate the maximum recorded temperature by year wise for
the weather dataset in Pig Latin
THEORY :
Pig Latin is a high-level scripting language designed for analyzing large datasets in Hadoop
using the Apache Pig platform. Pig Latin scripts are composed of a series of data
transformation steps, much like a data flow, where each step performs a specific operation
such as loading, filtering, grouping, or aggregating data.
Pig Latin programs are written as a sequence of statements, each ending with a semicolon.
These statements operate on relations and include:
LOAD: To load data from the file system (HDFS or local) into a relation.
STORE: To save a relation back to the file system.
FILTER: To remove unwanted rows from a relation.
FOREACH ... GENERATE: To transform data or select specific fields.
GROUP: To group data by one or more fields.
JOIN: To join two or more relations.
ORDER: To sort the data.
DUMP: To display the contents of a relation on the console
When a Pig Latin script is run, the Pig engine parses the script and automatically translates it
into a series of MapReduce jobs, which are then executed on the Hadoop cluster. This means
users do not need to manage the details of parallel and distributed computation.
CODE :
-- Load the dataset (adjust delimiter and path if needed); assumes each record is (year, temperature)
raw_data = LOAD 'C:/Users/input/weather_data.csv' USING PigStorage(',') AS (year:chararray, temperature:int);
grouped = GROUP raw_data BY year;
max_temp = FOREACH grouped GENERATE group AS year, MAX(raw_data.temperature) AS max_temperature;
DUMP max_temp;  -- display the year-wise maximum temperature
Created a sample input file named ‘weather_data.csv’ with the following content:
Hive is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-
like query language called HiveQL. These queries are compiled into MapReduce jobs that
are executed on Hadoop. While Hive uses Hadoop for execution of queries, it reduces the
effort that goes into writing and maintaining MapReduce jobs.
Hive supports database concepts such as tables, columns, rows, and partitions. Both primitive (integer, float, string) and complex data types (map, list, struct) are supported, and these types can be composed into structures of arbitrary complexity. Tables are serialized and deserialized using default serializers/deserializers, and any new data format or type can be supported by implementing the SerDe and ObjectInspector Java interfaces.
By using HiveQL ORDER BY and SORT BY clause, we can apply sort on the column. It
returns the result set either in ascending or descending order.
In HiveQL, ORDER BY clause performs a complete ordering of the query result set. Hence,
the complete data is passed through a single reducer. This may take much time in the
execution of large datasets. However, we can use LIMIT to minimize the sorting time.
CODE :
1. Select the database in which we want to create a table.
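The HiveQL statements for this experiment are not included above. As an illustration only, the following Java sketch shows how such statements (selecting a database, creating a table, and sorting with ORDER BY) could be submitted through Hive's JDBC interface; the table name, columns, and connection URL are assumptions, and the Hive JDBC driver must be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class HiveSortExample {
    public static void main(String[] args) throws Exception {
        // Connect to HiveServer2 (URL, database, and credentials are assumed)
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        stmt.execute("USE default");                                  // 1. select the database
        stmt.execute("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING, salary DOUBLE) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");   // 2. create a table
        // 3. ORDER BY performs a total ordering of the result set through a single reducer
        ResultSet rs = stmt.executeQuery("SELECT id, name, salary FROM emp ORDER BY salary DESC");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2) + "\t" + rs.getDouble(3));
        }
        con.close();
    }
}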
AIM: Develop a Spark program to find the maximum recorded temperature for each year in the weather dataset.
THEORY :
Execution Model :-
Spark transforms user code into a Directed Acyclic Graph (DAG) of stages and tasks. The
DAG scheduler breaks jobs into stages based on shuffle boundaries, and the task scheduler
distributes these tasks across executors. Spark optimizes execution by:
Minimizing data movement (data locality)
Speculative execution to handle slow tasks
In-memory caching to reduce disk I/O
Execution Flow :-
1. Input Data: Loaded as an RDD (Resilient Distributed Dataset) from HDFS/local
storage.
2. Transformations:
map(), filter(), reduceByKey() create new RDDs (lazily evaluated).
3. Actions:
saveAsTextFile() triggers job execution.
4. DAG Scheduler: Breaks the job into stages (e.g., map, reduce).
5. Task Scheduler: Distributes tasks to executors.
CODE :
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class MaxTemperatureSpark {
    public static void main(String[] args) {
        // Initialize Spark configuration and context
        SparkConf conf = new SparkConf()
                .setAppName("MaxTemperature")
                .setMaster("local[*]"); // Use "yarn" for cluster mode
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load input data (format: Year,Temperature)
        JavaRDD<String> lines = sc.textFile("input/weather_data.csv");
        // Parse lines into (Year, Temperature) pairs
        JavaPairRDD<String, Integer> yearTemps = lines
                .mapToPair(line -> {
                    String[] parts = line.split(",");
                    String year = parts[0];
                    int temp = Integer.parseInt(parts[1]);
                    return new Tuple2<>(year, temp);
                });
        // Filter out invalid temperatures (e.g., -9999)
        JavaPairRDD<String, Integer> filtered = yearTemps
                .filter(tuple -> tuple._2() != -9999);
        // Reduce by key to find the max temperature per year
        JavaPairRDD<String, Integer> maxTemps = filtered
                .reduceByKey((a, b) -> Math.max(a, b));
        // Save or print results
        maxTemps.saveAsTextFile("output/max_temps");
        sc.close();
    }
}
OUTPUT :-
Created a sample input file named ‘weather_data.csv’ with the following content: