
DELHI TECHNOLOGICAL UNIVERSITY

(Formerly Delhi College of Engineering) Shahbad Daulatpur, Bawana


Road, Delhi-110042

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

BIG DATA ANALYTICS

Subject code: AI5342

Submitted To: Prof. Rahul Katarya, Department of Computer Science and Engineering
Submitted By: Arya Patel, [Link] AI, 1st Year (Sem-II), Roll No. 24/AFI/08
INDEX

S. No. Objective Date Signature

1 Install Apache Hadoop
2 MapReduce program to calculate the frequency of a given word in a given file.
3 MapReduce program to find the maximum temperature in each year.
4 MapReduce program to find the grades of students.
5 MapReduce program to implement Matrix Multiplication.
6 MapReduce program to find the maximum electrical consumption in each city and the average monthly consumption in each year.
7 MapReduce program to print whether the day is sunny or cool.
8 MapReduce program to find the number of products sold in each country.
9 MapReduce program to find the tags associated with each movie.
10 MapReduce program to compute listener and play statistics for the [Link] online music website.
11 MapReduce program to find the frequency of books published each year.
12 MapReduce program to find the average age of the people (both male and female) who died in the tragedy.
13 MapReduce program to find the days on which each base has more trips using the following dataset.
14 Program to calculate the maximum recorded temperature year wise for the weather dataset in Pig Latin.
15 Write queries to sort and aggregate the data in a table using HiveQL.
16 Develop a Java application to find the maximum temperature using Spark.
EXPERIMENT 1
AIM: Install Apache Hadoop
THEORY :

Prerequisites:

• Java (JDK 8 or later)


• SSH (Secure Shell)
• Linux-based OS (Ubuntu recommended) or Windows with WSL
• At least 4GB of RAM (Recommended 8GB)

Apache Hadoop is an open-source framework designed for distributed storage and processing
of large datasets using the MapReduce programming model. It operates on a cluster of
computers and provides scalability, fault tolerance, and high availability. Hadoop consists of
four main modules:

 Hadoop Common: Provides utilities and libraries required by other modules.


 Hadoop Distributed File System (HDFS): A distributed storage system.
 Hadoop YARN: Manages cluster resources and job scheduling.
 Hadoop MapReduce: A programming model for data processing.

Procedure :

1. Download Hadoop:

 Visit the official Apache Hadoop website and download the desired version of
Hadoop.

2. Install Java:

 Ensure Java is installed and set up correctly.


 Configure the JAVA_HOME environment variable to point to the Java
installation directory.

Verify the installation by running java -version.

3. Extract Hadoop Files:

 Extract the downloaded Hadoop tar file into a designated directory.

4. Set Environment Variables:

 Add HADOOP_HOME and JAVA_HOME to the system's environment


variables.
 Update the PATH variable to include HADOOP_HOME/bin and JAVA_HOME/bin.

5. Configure Hadoop Files: Edit configuration files located in the etc/hadoop directory:

 [Link]: Set the default filesystem URI.

 [Link]: Configure replication factor and directories for NameNode and


DataNode.

 [Link]: Configure MapReduce settings.


 [Link]: Configure resource management settings.

6. Format NameNode:

 Run the command hdfs namenode -format to initialize the HDFS metadata.

7. Start Hadoop Services:

 Use commands like [Link] and [Link] to start HDFS and YARN
services.

8. Verify Installation:

 Run the command jps to check if all daemons (NameNode, DataNode,


ResourceManager, NodeManager) are running.
 Perform a simple word count program using Hadoop's example jar file to test
functionality.
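As a hedged example, the smoke test in step 8 might look as follows on a typical single-node setup (the examples-jar path and the sample file name vary by installation and are illustrative):

jps
hdfs dfs -mkdir -p /user/input
hdfs dfs -put sample.txt /user/input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /user/input /user/output
hdfs dfs -cat /user/output/part-r-00000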

Conclusion:

You have successfully installed Hadoop in a single-node setup on your local machine. You can
now run MapReduce jobs and explore HDFS functionalities.
EXPERIMENT 2
AIM: Develop a MapReduce program to calculate the frequency of a given word in a given
file
THEORY :

Program Structure

The MapReduce program consists of three main components:

1. Mapper Class : The Mapper class splits the input text into words and emits each word
with a count of 1.

This Mapper class does the following:

 Splits each line into words using whitespace as a delimiter.


 Converts each word to lowercase and trims it.
 Emits each non-empty word with a count of 1

2. Reducer Class : The Reducer class sums up the counts for each word.
This Reducer class:

 Iterates through all values for a given word.


 Sums up the counts.
 Emits the word with its total count

3. Driver Class : The Driver class sets up and configures the MapReduce job.
The Driver class:

 Creates a new Job instance.


 Sets the Mapper and Reducer classes.
 Specifies the output key and value classes.
 Sets input and output paths based on command-line arguments

CODE :

Mapper Class Code :-

import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class SpecificWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {


private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
private String specificWord;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
specificWord = [Link]().get("fox").toLowerCase();
[Link](specificWord);
}

public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = [Link]();
String[] words = [Link]("\\s+");
for (String str : words) {
if ([Link]().trim().equals(specificWord)) {
[Link](word, one);
}
}
}
}

Reducer Class Code :-


import [Link];
import [Link];
import [Link];
import [Link];

public class SpecificWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {


private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context)


throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += [Link]();
}
[Link](sum);
[Link](key, result);
}
}

Driver Class Code :-


import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class SpecificWordCounter {


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = [Link](conf, "fox");
[Link]([Link]);

// Set the specific word to count


String specificWord = "your_specific_word"; // Replace with the actual word or add as a
command-line argument
[Link]().set("[Link]", specificWord);

[Link]([Link]);
[Link]([Link]);

[Link]([Link]);
[Link]([Link]);

[Link](job, new Path(args[0]));


[Link](job, new Path(args[1]));

[Link]([Link](true) ? 0 : 1);
}
}
OUTPUT :

Created a sample input file named ‘sample_input.txt’ with the following content:

Sample Input :- The quick brown fox jumps over the lazy dog. The dog barks at the fox. Quick,
the fox runs away. The sly fox outsmarts the hound. No foxes were seen today.

Placing the input file in HDFS using code as below :

hadoop fs -put sample_input.txt /user/input/sample_input.txt

Now we would run the MapReduce job with a command similar to:

hadoop jar [Link] SpecificWordCounter /user/input/sample_input.txt


/user/output/fox_frequency

Sample Output :- The output file (typically named part-r-00000) would contain:-
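A hedged sketch of the expected result, assuming the configured word is "fox" (the configuration key read in the mapper's setup() must match the key set in the driver for any count to be produced). Because the mapper compares whole whitespace-delimited tokens, "fox." with trailing punctuation is not matched, so the sample input above would yield:

fox 3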
EXPERIMENT 3
AIM: Develop a MapReduce program to find the maximum temperature in each year
CODE :
[Link] Code :-

import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class TemperatureMapper extends Mapper<LongWritable, Text, IntWritable,


IntWritable> {
private IntWritable year = new IntWritable();
private IntWritable temperature = new IntWritable();

@Override
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = [Link]();
String[] parts = [Link](",");
if ([Link] == 3) {
try {
int yearValue = [Link](parts[0]);
int tempValue = [Link](parts[2]);
[Link](yearValue);
[Link](tempValue);
[Link](year, temperature);
} catch (NumberFormatException e) {
// Skip lines with invalid data
}
}
}
}

[Link] Code :-

import [Link];
import [Link];
import [Link];

public class MaxTemperatureReducer extends Reducer<IntWritable, IntWritable,


IntWritable, IntWritable> {
private IntWritable maxTemp = new IntWritable();

@Override
public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int max = Integer.MIN_VALUE;
for (IntWritable value : values) {
max = [Link](max, [Link]());
}
[Link](max);
[Link](key, maxTemp);
}
}

[Link] Code :-

import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class MaxTemperatureDriver {


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = [Link](conf, "max temperature");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job, new Path(args[0]));
[Link](job, new Path(args[1]));
[Link]([Link](true) ? 0 : 1);
}
}
OUTPUT :-

Created a sample input file ‘temperature_data.txt’ with the following content:

Sample Input :- Each line represents: Year,Month,Temperature


2020,01,25
2020,02,28
2020,03,32
2021,01,22
2021,02,30
2021,03,35
2022,01,24
2022,02,29
2022,03,33

After compiling the Java files and creating a JAR, we would run the MapReduce job with a
command like:

hadoop jar [Link] MaxTemperatureDriver /user/input/temperature_data.txt


/user/output/max_temperature

Sample Output :- The output file (typically named part-r-00000) would contain:
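A hedged sketch of the expected result for the sample input above (the maximum of the temperature column per year):

2020 32
2021 35
2022 33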
EXPERIMENT 4
AIM: Develop a MapReduce program to find the grades of students
CODE :
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];

public class StudentGradeMapper extends Mapper<LongWritable, Text, Text, Text> {


private Text studentId = new Text();
private Text scoreInfo = new Text();

@Override
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = [Link]();
String[] parts = [Link](",");
if ([Link] == 3) {
[Link](parts[0]);
[Link](parts[1] + "," + parts[2]);
[Link](studentId, scoreInfo);
}
}
}

[Link] Code :-
import [Link];
import [Link];
import [Link];

public class StudentGradeReducer extends Reducer<Text, Text, Text, Text> {


private Text result = new Text();

@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int totalScore = 0;
int count = 0;
for (Text value : values) {
String[] parts = [Link]().split(",");
totalScore += [Link](parts[1]);
count++;
}
double average = (double) totalScore / count;
String grade = calculateGrade(average);
[Link]([Link]("Average: %.2f, Grade: %s", average, grade));
[Link](key, result);
}

private String calculateGrade(double average) {


if (average >= 90) return "A";
if (average >= 80) return "B";
if (average >= 70) return "C";
if (average >= 60) return "D";
return "F";
}
}

[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class StudentGradeDriver {


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = [Link](conf, "student grades");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job, new Path(args[0]));
[Link](job, new Path(args[1]));
[Link]([Link](true) ? 0 : 1);
}
}
OUTPUT :-

Created a sample input file named ‘student_scores.txt’ with the following content:

Sample Input :- Each line represents: StudentID,Subject,Score


S001,Math,85
S001,Science,92
S001,English,78
S002,Math,76
S002,Science,88
S002,English,81
S003,Math,92
S003,Science,95
S003,English,89

After compiling the Java files and creating a JAR, you would run the MapReduce job with a
command like:

hadoop jar [Link] StudentGradeDriver /user/input/student_scores.txt


/user/output/student_grades

Sample Output :- The output file (typically named part-r-00000) would contain:
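A hedged sketch of the expected result for the sample input above (the reducer averages the three subject scores per student and maps the average to a grade):

S001 Average: 85.00, Grade: B
S002 Average: 81.67, Grade: B
S003 Average: 92.00, Grade: A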
EXPERIMENT 5
AIM: Develop a MapReduce program to implement Matrix Multiplication
THEORY:
Matrix multiplication is a fundamental operation in linear algebra with applications in various
fields such as computer graphics, scientific computing, and machine learning. When dealing
with large matrices, the computational requirements can be significant. MapReduce provides
an efficient framework for distributing this computation across multiple nodes, allowing for
the processing of very large matrices.

Matrix Multiplication Algorithm


For matrices A (m x n) and B (n x p), resulting in C (m x p):
C[i,j] = Σ(k=1 to n) A[i,k] * B[k,j]

CODE :
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
public class MatrixMultiplicationMapper extends Mapper<LongWritable, Text, Text, Text>
{
private Text outputKey = new Text();
private Text outputValue = new Text();
@Override
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = [Link]();
String[] parts = [Link](",");
if ([Link] == 4) {
String matrixName = parts[0];
int row = [Link](parts[1]);
int col = [Link](parts[2]);
int val = [Link](parts[3]);
if ([Link]("A")) {
for (int i = 0; i < [Link]().getInt("n", 10); i++) {
[Link](row + "," + i);
[Link]("A," + col + "," + val);
[Link](outputKey, outputValue);
}
} else if ([Link]("B")) {
for (int i = 0; i < [Link]().getInt("m", 10); i++) {
[Link](i + "," + col);
[Link]("B," + row + "," + val);
[Link](outputKey, outputValue);
}
}}}}

[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
public class MatrixMultiplicationReducer extends Reducer<Text, Text, Text, Text> {
private Text result = new Text();
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
InterruptedException {
HashMap<Integer, Integer> mapA = new HashMap<>();
HashMap<Integer, Integer> mapB = new HashMap<>();
for (Text value : values) {
String[] parts = [Link]().split(",");
if (parts[0].equals("A")) {
[Link]([Link](parts[1]), [Link](parts[2]));
} else if (parts[0].equals("B")) {
[Link]([Link](parts[1]), [Link](parts[2]));
}
}
int sum = 0;
for (int i = 0; i < [Link]().getInt("p", 10); i++) {
if ([Link](i) && [Link](i)) {
sum += [Link](i) * [Link](i);
}
}
if (sum != 0) {
[Link]([Link](sum));
[Link](key, result);
}
}
}
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class MatrixMultiplicationDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// Set matrix dimensions: A(m x p), B(p x n)
[Link]("m", 2);
[Link]("p", 3);
[Link]("n", 2);
Job job = [Link](conf, "matrix multiplication");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job, new Path(args[0]));
[Link](job, new Path(args[1]));
[Link]([Link](true) ? 0 : 1);
}
}

OUTPUT :-
Created a sample input file named ‘matrix_input.txt’ with the following content:

Sample Input :- Each line represents: Matrix_Name , Row , Column , Value


A,0,0,1
A,0,1,2
A,0,2,3
A,1,0,4
A,1,1,5
A,1,2,6
B,0,0,7
B,0,1,8
B,1,0,9
B,1,1,10
B,2,0,11
B,2,1,12
After compiling the Java files and creating a JAR, you would run the MapReduce job with a
command like:
hadoop jar [Link] MatrixMultiplicationDriver /user/input/matrix_input.txt
/user/output/matrix_output
Sample Output :- The output file (typically named part-r-00000) would contain:
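As a worked check for the sample input above, A is the 2x3 matrix [[1,2,3],[4,5,6]] and B is the 3x2 matrix [[7,8],[9,10],[11,12]], so with the dimensions set in the driver (m=2, p=3, n=2) the product C = A x B is:

C[0,0] = 1*7 + 2*9 + 3*11 = 58
C[0,1] = 1*8 + 2*10 + 3*12 = 64
C[1,0] = 4*7 + 5*9 + 6*11 = 139
C[1,1] = 4*8 + 5*10 + 6*12 = 154

giving output lines of the form:

0,0 58
0,1 64
1,0 139
1,1 154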
EXPERIMENT 6
AIM: Develop a MapReduce program to find the maximum electrical consumption in each city and the average electrical consumption for each month in each year.
CODE :
[Link] Code :-
import [Link];
import [Link].*;
import [Link];
public class EnergyMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
private Text outputKey = new Text();
private DoubleWritable consumption = new DoubleWritable();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] fields = [Link]().split(",");
if ([Link] >= 3) {
String city = fields[0].trim();
String date = fields[1].trim();
double energy = [Link](fields[2].trim());
// Emit for max consumption per city
[Link](city);
[Link](energy);
[Link](outputKey, consumption);
// Emit for monthly average (year-month,city as key)
String yearMonth = [Link](0, 7); // Assumes YYYY-MM-DD format
[Link](yearMonth + "," + city);
[Link](outputKey, consumption);
}
}
}

[Link] Code :-
import [Link];
import [Link].*;
import [Link];
import [Link];
public class EnergyReducer extends Reducer<Text, DoubleWritable, Text, Text> {
private MultipleOutputs<Text, Text> mos;
private Text result = new Text();
@Override
public void setup(Context context) {
mos = new MultipleOutputs<>(context);
}
@Override
public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
throws IOException, InterruptedException {
String keyStr = [Link]();
if ([Link](",")) { // Monthly average computation
double sum = 0;
int count = 0;
for (DoubleWritable val : values) {
sum += [Link]();
count++;
}
double avg = sum / count;
[Link]([Link]("%.2f", avg));
[Link]("monthlyAvg", key, result);
} else { // Max consumption per city
double max = Double.MIN_VALUE;
for (DoubleWritable val : values) {
max = [Link](max, [Link]());
}
[Link]([Link]("%.2f", max));
[Link]("maxConsumption", key, result);
}
}
@Override
public void cleanup(Context context) throws IOException, InterruptedException {
[Link]();
}
}

[Link] Code :-
import [Link];
import [Link];
import [Link].*;
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class EnergyDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = [Link](conf, "Energy Consumption Analysis");
[Link]([Link]);
// Configure Mapper and Reducer
[Link]([Link]);
[Link]([Link]);
// Set output types
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
// Define MultipleOutputs
[Link](job, "maxConsumption",
[Link], [Link], [Link]);
[Link](job, "monthlyAvg",
[Link], [Link], [Link]);
// Set input/output paths
[Link](job, new Path(args[0]));
[Link](job, new Path(args[1]));
[Link]([Link](true) ? 0 : 1);
}
}

OUTPUT :-
Created a sample input file named ‘energy_input.csv’ with the following content:

Sample Input :- Each line represents: City, Date, Consumption


New York,2023-05-01,450.5
Los Angeles,2023-05-01,600.2
New York,2023-05-02,480.0
Los Angeles,2023-05-02,620.8
New York,2023-05-03,465.3
Los Angeles,2023-05-03,615.7
New York,2023-06-01,420.1
Los Angeles,2023-06-01,590.4
After compiling the Java files and creating a JAR, you would run the MapReduce job with a
command like:

hadoop jar [Link] EnergyDriver /user/input/energy_input.csv /user/output/energy_results

Sample Output :-
# View maximum consumption results
hadoop fs -cat /user/output/energy_results/maxConsumption-r-00000

The output file (typically named maxConsumption-r-00000) would contain:
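A hedged sketch of this file for the sample input above (maximum consumption per city):

Los Angeles 620.80
New York 480.00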

# View monthly averages


hadoop fs -cat /user/output/energy_results/monthlyAvg-r-00000

The output file (typically named monthlyAvg-r-00000) would contain:
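A hedged sketch of this file for the sample input above (average consumption per year-month and city):

2023-05,Los Angeles 612.23
2023-05,New York 465.27
2023-06,Los Angeles 590.40
2023-06,New York 420.10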


EXPERIMENT 7
AIM: Develop a MapReduce program to analyze a weather data set and print whether each day is sunny or cool.
CODE :
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
public class WeatherMapper extends Mapper<LongWritable, Text, Text, Text> {
private Text outputKey = new Text();
private Text outputValue = new Text();
private static final int TEMPERATURE_THRESHOLD = 25; // Celsius
@Override
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = [Link]();
String[] parts = [Link](",");
if ([Link] == 3) { // Assuming format: date,location,temperature
String date = parts[0];
String location = parts[1];
try {
int temperature = [Link](parts[2]);
String weatherType = (temperature > TEMPERATURE_THRESHOLD) ? "Sunny"
: "Cool";
[Link](date);
[Link](location + "," + temperature + "," + weatherType);
[Link](outputKey, outputValue);
} catch (NumberFormatException e) {
// Skip lines with invalid temperature data
}}}}
[Link] Code :-
import [Link];
import [Link];
import [Link];
public class WeatherReducer extends Reducer<Text, Text, Text, Text> {
private Text result = new Text();
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
StringBuilder weatherInfo = new StringBuilder();
for (Text value : values) {
[Link]([Link]()).append("; ");
}
[Link]([Link]().trim());
[Link](key, result);
}}
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class WeatherAnalysisDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = [Link](conf, "weather analysis");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job, new Path(args[0]));
[Link](job, new Path(args[1]));
[Link]([Link](true) ? 0 : 1);
}}
OUTPUT :-
Created a sample input file named ‘weather_data.txt’ with the following content:
Sample Input :- Each line represents: Date,Location,Temperature
2023-05-01,New York,22
2023-05-01,Los Angeles,28
2023-05-02,New York,27
2023-05-02,Los Angeles,30
2023-05-03,New York,20
2023-05-03,Los Angeles,26
After compiling the Java files and creating a JAR, you would run the MapReduce job with a
command like:
hadoop jar [Link] WeatherAnalysisDriver /user/input/weather_data.txt
/user/output/weather_analysis
Sample Output :- The output file (typically named part-r-00000) would contain:
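A hedged sketch of the expected result for the sample input above (temperatures above 25 are labelled Sunny; the ordering of values within a day is not guaranteed):

2023-05-01 New York,22,Cool; Los Angeles,28,Sunny;
2023-05-02 New York,27,Sunny; Los Angeles,30,Sunny;
2023-05-03 New York,20,Cool; Los Angeles,26,Sunny;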
EXPERIMENT 8
AIM: Develop a MapReduce program to find the number of products sold in each country
by considering sales data containing fields like:
Product, Price, Payment_Type, Name, City, State, Country, Account_Created, Last_Login, Latitude, Longitude

CODE :
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
public class SalesMapper extends Mapper<Object, Text, Text, IntWritable> {
private Text country = new Text();
private final static IntWritable one = new IntWritable(1);
@Override
public void map(Object key, Text value, Context context) throws IOException,
InterruptedException {
String line = [Link]();
String[] fields = [Link](","); // Assuming fields are comma-separated
if ([Link] >= 8) { // Ensure there are enough fields
String countryName = fields[7].trim(); // Country is in the 8th column (index 7)
[Link](countryName);
[Link](country, one); // Emit (country, 1) for each product sold
}}}
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
public class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
private IntWritable result = new IntWritable();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += [Link]();
}
[Link](sum);
[Link](key, result); // Emit (country, total products sold)
}}
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class SalesAnalysisDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = [Link](conf, "sales analysis");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job, new Path(args[0]));
[Link](job, new Path(args[1]));
[Link]([Link](true) ? 0 : 1);
}}
OUTPUT :-
Created a sample input file named ‘sales_data.txt’ with the following content:
Sample Input :-
Each line represents: Product,Price,Payment_Type,Name,City,State,Country,Account_Created,Last_Login,Latitude,Longitude

After compiling the Java files and creating a JAR


 Place the input file in HDFS:

 Run the MapReduce program:


hadoop jar [Link] SalesAnalysisDriver /user/input/sales_data.txt
/user/output/sales_analysis
Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 9
AIM: Develop a MapReduce program to find the tags associated with each movie by
analyzing MovieLens data.

CODE :

[Link] Code :- Extracts movie IDs and titles.


import [Link];
import [Link].*;
import [Link];

public class MovieMapper extends Mapper<LongWritable, Text, Text, Text> {


@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] fields = [Link]().split(",");
if ([Link] >= 3) {
String movieId = fields[0].trim();
String title = fields[1].trim();
[Link](new Text(movieId), new Text("TITLE:" + title));
}
}
}

[Link] Code :- Extracts movie IDs and associated tags.


import [Link];
import [Link].*;
import [Link];

public class TagMapper extends Mapper<LongWritable, Text, Text, Text> {


@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] fields = [Link]().split(",");
if ([Link] >= 3) {
String movieId = fields[1].trim();
String tag = fields[2].trim();
[Link](new Text(movieId), new Text("TAG:" + tag));
}
}
}

[Link] Code :- Aggregates tags for each movie and pairs them with titles.
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class MovieTagReducer extends Reducer<Text, Text, Text, Text> {


@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String title = null;
Set<String> tags = new HashSet<>();

for (Text val : values) {


String str = [Link]();
if ([Link]("TITLE:")) {
title = [Link](6);
} else if ([Link]("TAG:")) {
[Link]([Link](4));
}
}

if (title != null && ![Link]()) {


[Link](new Text(title), new Text([Link](", ", tags)));
}
}
}

[Link] Code :- Configures the job to handle multiple inputs.


import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class MovieTagDriver {


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = [Link](conf, "Movie Tags Analysis");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);

// Multiple inputs for movies and tags


[Link](job, new Path(args[0]), [Link],
[Link]);
[Link](job, new Path(args[1]), [Link],
[Link]);
[Link](job, new Path(args[2]));

[Link]([Link](true) ? 0 : 1);
}
}

OUTPUT :-

Created a sample input file for movies named ‘[Link]’ with the following content:

Created a sample input file for tags named ‘[Link]’ with the following content:

After compiling the Java files and creating a JAR


 Place the input file in HDFS:

 Run the MapReduce program:


hadoop jar [Link] MovieTagDriver /user/input/[Link] /user/input/[Link] /user/output/movie_tags

Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 10
AIM: [Link] is an online music website where users listen to various tracks, and the listening
data is collected in log files of the form shown below (each record carries UserId | TrackId |
Shared | Radio | Skipped fields).

Write a MapReduce program to get the following


 Number of unique listeners
 Number of times the track was shared with others
 Number of times the track was listened to on the radio
 Number of times the track was listened to in total
 Number of times the track was skipped on the radio

CODE :
First, define constants to make the code more readable:
[Link] Code :-
package [Link];
public class MusicConstants {
// Indices for fields in each log record
public static final int USER_ID = 0;
public static final int TRACK_ID = 1;
public static final int IS_SHARED = 2;
public static final int IS_RADIO = 3;
public static final int IS_SKIPPED = 4;
// Types of metrics to calculate
public static final String UNIQUE_LISTENERS = "unique_listeners";
public static final String SHARED_COUNT = "shared_count";
public static final String RADIO_PLAYS = "radio_plays";
public static final String TOTAL_PLAYS = "total_plays";
public static final String RADIO_SKIPS = "radio_skips";
}

[Link] Code :-
package [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class MusicStatsMapper extends Mapper<LongWritable, Text, Text, Text> {
private Text outputKey = new Text();
private Text outputValue = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = [Link]();
String[] fields = [Link]("\\|");
// Skip malformed records
if ([Link] != 5) return;
try {
String userId = fields[MusicConstants.USER_ID].trim();
String trackId = fields[MusicConstants.TRACK_ID].trim();
int shared = [Link](fields[MusicConstants.IS_SHARED].trim());
int radio = [Link](fields[MusicConstants.IS_RADIO].trim());
int skipped = [Link](fields[MusicConstants.IS_SKIPPED].trim());
// For unique listeners - emit trackId,userId
[Link](MusicConstants.UNIQUE_LISTENERS + ":" + trackId);
[Link](userId);
[Link](outputKey, outputValue);
// For shared count - emit only if shared=1
if (shared == 1) {
[Link](MusicConstants.SHARED_COUNT + ":" + trackId);
[Link]("1");
[Link](outputKey, outputValue);
}
// For radio plays - emit only if radio=1
if (radio == 1) {
[Link](MusicConstants.RADIO_PLAYS + ":" + trackId);
[Link]("1");
[Link](outputKey, outputValue);
}
// For total plays - emit for every record
[Link](MusicConstants.TOTAL_PLAYS + ":" + trackId);
[Link]("1");
[Link](outputKey, outputValue);
// For radio skips - emit only if radio=1 and skipped=1
if (radio == 1 && skipped == 1) {
[Link](MusicConstants.RADIO_SKIPS + ":" + trackId);
[Link]("1");
[Link](outputKey, outputValue);
}
} catch (NumberFormatException e) {
// Skip records with invalid number format
}
}}
[Link] Code :-
package [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class MusicStatsReducer extends Reducer<Text, Text, Text, Text> {
private Text result = new Text();
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
String keyStr = [Link]();
String[] keyParts = [Link](":");
if ([Link] != 2) return;
String metricType = keyParts[0];
String trackId = keyParts[1];
if ([Link](MusicConstants.UNIQUE_LISTENERS)) {
// Count unique listeners
Set<String> uniqueUsers = new HashSet<>();
for (Text value : values) {
[Link]([Link]());
}
[Link]("Unique Listeners: " + [Link]());
} else {
// Count other metrics (shared, radio plays, total plays, skips)
int count = 0;
for (Text value : values) {
count++;
}
if ([Link](MusicConstants.SHARED_COUNT)) {
[Link]("Shared Count: " + count);
} else if ([Link](MusicConstants.RADIO_PLAYS)) {
[Link]("Radio Plays: " + count);
} else if ([Link](MusicConstants.TOTAL_PLAYS)) {
[Link]("Total Plays: " + count);
} else if ([Link](MusicConstants.RADIO_SKIPS)) {
[Link]("Radio Skips: " + count);
}
}
[Link](new Text("Track " + trackId), result);
}
}
[Link] Code :-
package [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class MusicStatsDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = [Link](conf, "XYZ Music Stats");
[Link]([Link]);
// Set mapper and reducer classes
[Link]([Link]);
[Link]([Link]);
// Set output key and value types
[Link]([Link]);
[Link]([Link]);
// Set input and output paths
[Link](job, new Path(args[0]));
[Link](job, new Path(args[1]));
[Link]([Link](true) ? 0 : 1);
}
}

OUTPUT :-
1. First, compile the Java files and create a JAR:
2. Created a sample input file named ‘music_logs.txt’
3. After compiling the Java files and creating a JAR
 Place the input file in HDFS:

 Run the MapReduce program:

hadoop jar [Link] [Link]


/user/input/music_logs.txt /user/output/music_stats

Sample Output :- The output file (typically named part-r-00000) would contain:
hadoop fs -cat /user/output/music_stats/part-r-00000
EXPERIMENT 11
AIM: Develop a MapReduce program to find the frequency of books published each year and
the year in which the maximum number of books was published, using the following data.

CODE :
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class BookPublicationMapper extends Mapper<LongWritable, Text, Text,
IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text year = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = [Link]();
String[] fields = [Link](",");
// Skip header or malformed rows
if ([Link] >= 3 && !fields[0].equals("Title")) {
try {
// Extract the published year (assuming it's in 3rd column)
String publishedYear = fields[2].trim();
[Link](publishedYear);
// Emit (year, 1) for each book
[Link](year, one);
} catch (Exception e) {
// Skip rows with parsing errors
}
}
}
}

[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
public class BookPublicationReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
// Sum up all occurrences (always 1) of books in this year
for (IntWritable val : values) {
sum += [Link]();
}
[Link](sum);
[Link](key, result);
}
}
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class MaxPublicationMapper extends Mapper<LongWritable, Text, Text, Text> {
private final static Text maxKey = new Text("MAX");
private Text yearCount = new Text();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = [Link]();
String[] parts = [Link]("\\s+");
if ([Link] == 2) {
// Format: "year count"
[Link](line);
// Emit with a single key so all records go to one reducer
[Link](maxKey, yearCount);
}
}
}
[Link] Code :-
import [Link];
import [Link];
import [Link];
public class MaxPublicationReducer extends Reducer<Text, Text, Text, Text> {
private Text result = new Text();
@Override
public void reduce(Text key, Iterable<Text> values, Context context)
throws IOException, InterruptedException {
int maxCount = -1;
String maxYear = "";
// Find the year with maximum publications
for (Text val : values) {
String[] parts = [Link]().split("\\s+");
String year = parts[0];
int count = [Link](parts[1]);
if (count > maxCount) {
maxCount = count;
maxYear = year;
}
}
[Link]("Maximum number of books (" + maxCount + ") were published in year " +
maxYear);
[Link](new Text("Result:"), result);
}
}

[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class BookPublicationDriver {
public static void main(String[] args) throws Exception {
// Job 1: Count books per year
Configuration conf1 = new Configuration();
Job job1 = [Link](conf1, "Book Publication Count");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job1, new Path(args[0]));
Path tempOutput = new Path(args[1] + "_temp");
[Link](job1, tempOutput);
[Link](true);
// Job 2: Find year with maximum publications
Configuration conf2 = new Configuration();
Job job2 = [Link](conf2, "Max Publication Year");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job2, tempOutput);
[Link](job2, new Path(args[1]));
[Link]([Link](true) ? 0 : 1);
}
}

OUTPUT :-
1. First, compile the Java files and create a JAR:
2. Created a sample input file named ‘book_data.csv’

3. After compiling the Java files and creating a JAR

 Place the input file in HDFS:

 Run the MapReduce program:


hadoop jar [Link] BookPublicationDriver /user/input/book_data.csv
/user/output/book_analysis

Sample Output :- The output file (typically named part-r-00000) would contain:
hadoop fs -cat /user/output/book_analysis/part-r-00000
The first job produces year-count pairs:

The second job identifies the maximum:


EXPERIMENT 12
AIM: Develop a MapReduce program to analyze Titanic ship data to find the average age of
the people (both male and female) who died in the tragedy, and the number of persons who
survived in each class.

CODE :
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class TitanicAgeMapper extends Mapper<LongWritable, Text, Text, DoubleWritable>
{
private Text gender = new Text();
private DoubleWritable age = new DoubleWritable();
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = [Link]();
String[] fields = [Link](",");
// Skip malformed records
if ([Link] >= 6) {
String survived = fields[1].trim();
String sex = fields[4].trim();
String ageStr = fields[5].trim();
// Process only if the person died (Survived = 0) and the age is valid
if ("0".equals(survived) && [Link]("\\d+(\\.\\d+)?")) {
double ageVal = [Link](ageStr);
[Link](sex);
[Link](ageVal);
[Link](gender, age);
}}}}
[Link] Code :-
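A minimal sketch of the averaging reducer assumed by the first job in the driver below; it receives the (gender, age) pairs emitted by TitanicAgeMapper, and the class name TitanicAgeReducer is an assumption:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class TitanicAgeReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private DoubleWritable result = new DoubleWritable();
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        // Sum the ages of the deceased passengers of this gender
        for (DoubleWritable val : values) {
            sum += val.get();
            count++;
        }
        // Emit the average age for this gender
        result.set(sum / count);
        context.write(key, result);
    }
}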
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class TitanicSurvivorsMapper extends Mapper<LongWritable, Text, Text,
IntWritable> {
private Text passengerClass = new Text();
private static final IntWritable one = new IntWritable(1);
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = [Link]();
String[] fields = [Link](",");
// Skip malformed records
if ([Link] >= 3) {
String survived = fields[1].trim();
// Process only if the person survived (Survived = 1)
if ("1".equals(survived)) {
String pclass = "Class " + fields[2].trim();
[Link](pclass);
[Link](passengerClass, one);
}}}}
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
public class TitanicSurvivorsReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
private IntWritable result = new IntWritable();
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += [Link]();
}
[Link](sum);
[Link](key, result);
}}
[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
public class TitanicAnalysisDriver {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// Job 1: Average Age by gender for deceased passengers
Job job1 = [Link](conf, "Average Age of Deceased");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job1, new Path(args[0]));
[Link](job1, new Path(args[1] + "_avg_age"));
[Link](true);

// Job 2: Survivors by class


Job job2 = [Link](conf, "Survivors By Class");
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link]([Link]);
[Link](job2, new Path(args[0]));
[Link](job2, new Path(args[1] + "_survivors"));
[Link]([Link](true) ? 0 : 1);
}
}
OUTPUT :-

Created a sample input file named ‘titanic_data.csv’ with the following content:

PassengerId,Survived,Pclass,Name,Sex,Age
1,1,1,"Allen, Miss. Elisabeth Walton",female,29
2,0,1,"Allison, Master. Hudson Trevor",male,0.92
3,1,3,"Andersson, Mr. Anders Johan",male,39
4,0,3,"Andersson, Miss. Ingeborg Constanzia",female,9
5,1,3,"Andersson, Miss. Sigrid Elisabeth",female,11
6,0,1,"Andrews, Miss. Kornelia Theodosia",female,63
7,1,1,"Appleton, Mrs. Edward Dale",female,53
8,0,2,"Arnold-Franchi, Mrs. Josef",female,18
9,1,3,"Baclini, Miss. Eugenie",female,0.75
10,0,3,"Baclini, Miss. Helene Barbara",female,0.33
11,1,2,"Backstrom, Mrs. Karl Alfred",female,33
12,0,2,"Backstrom, Miss. Kristina",female,20
13,1,3,"Baclini, Miss. Marie Catherine",female,5
14,0,1,"Baxter, Mrs. James",female,50
15,1,2,"Becker, Miss. Marion Louise",female,4
16,0,3,"Bourke, Miss. Mary",female,28
17,1,1,"Brown, Mrs. James Joseph",female,44
18,0,2,"Brown, Miss. Amelia",female,24
19,1,3,"Cacic, Miss. Marija",female,30
20,0,3,"Cacic, Mr. Luka",male,38

1. Compile the Java files:


2. Create a JAR file:

3. After compiling the Java files and creating a JAR


 Place the input file in HDFS:

 Run the MapReduce program:


hadoop jar [Link] TitanicAnalysisDriver /user/input/titanic_data.csv
/user/output/titanic_analysis

Sample Output :- The output file (typically named part-r-00000) would contain:

hadoop fs -cat /user/output/titanic_analysis_avg_age/part-r-00000

hadoop fs -cat /user/output/titanic_analysis_survivors/part-r-00000
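A hedged sketch of the expected results for the sample input above. For the survivors-by-class job (Survived = 1 counted as a survivor):

Class 1 3
Class 2 2
Class 3 5

For the average-age job, note that the quoted passenger names contain a comma, so a plain split(",") shifts the Sex and Age columns; if the quoted name field is handled so that Sex and Age parse correctly, the deceased passengers in this sample would average roughly 26.54 years for females and 19.46 years for males.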


EXPERIMENT 13
AIM: Develop a MapReduce program to analyze an Uber data set to find the days on which
each base has more trips using the following dataset.
The Uber dataset consists of four columns: dispatching_base_number, date, active_vehicles and trips.

CODE :

[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class UberMapper extends Mapper<LongWritable, Text, Text, IntWritable> {


private Text outputKey = new Text();
private IntWritable tripCount = new IntWritable();
private String[] days = {"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"};
private SimpleDateFormat dateFormat = new SimpleDateFormat("MM/dd/yyyy");

@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
// Skip header row
if ([Link]() == 0 && [Link]().contains("dispatching_base_number")) {
return;
}

String[] fields = [Link]().split(",");


if ([Link] == 4) { // Ensure valid record
String baseNumber = fields[0].trim();
String dateStr = fields[1].trim();
int trips = [Link](fields[3].trim());

try {
// Parse date and extract day of week
Date date = [Link](dateStr);
String dayOfWeek = days[[Link]()];
// Create composite key: "baseNumber dayOfWeek"
[Link](baseNumber + " " + dayOfWeek);
[Link](trips);

[Link](outputKey, tripCount);
} catch (ParseException e) {
// Skip records with invalid date format
[Link]("Error parsing date: " + dateStr);
}
}
}
}

[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];

public class UberReducer extends Reducer<Text, IntWritable, Text, IntWritable> {


private IntWritable result = new IntWritable();

@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;

// Sum up all trips for this base-day combination


for (IntWritable val : values) {
sum += [Link]();
}

[Link](sum);
[Link](key, result);
}
}

[Link] Code :-
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];
import [Link];

public class UberDriver {


public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = [Link](conf, "Uber Trip Analysis");
[Link]([Link]);

// Set Mapper and Reducer classes


[Link]([Link]);
[Link]([Link]); // Combiner for efficiency
[Link]([Link]);

// Define output types


[Link]([Link]);
[Link]([Link]);

// Set input and output paths


[Link](job, new Path(args[0]));
[Link](job, new Path(args[1]));

[Link]([Link](true) ? 0 : 1);
}
}


OUTPUT :-

Created a sample input file named ‘uber_data.csv’ with the following content:

1. Compile the Java files:


javac -cp
/usr/local/hadoop/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/mapreduce/*
[Link] [Link] [Link]

2. Create a JAR file:

3. After compiling the Java files and creating a JAR


 Place the input file in HDFS:

 Run the MapReduce program:


hadoop jar [Link] UberDriver /user/input/uber_data.csv /user/output/uber_analysis

Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 14
AIM: Develop a program to calculate the maximum recorded temperature year wise for
the weather dataset in Pig Latin
THEORY :

Pig Latin is a high-level scripting language designed for analyzing large datasets in Hadoop
using the Apache Pig platform. Pig Latin scripts are composed of a series of data
transformation steps, much like a data flow, where each step performs a specific operation
such as loading, filtering, grouping, or aggregating data.

Pig Latin programs are written as a sequence of statements, each ending with a semicolon.
These statements operate on relations and include:
 LOAD: To load data from the file system (HDFS or local) into a relation.
 STORE: To save a relation back to the file system.
 FILTER: To remove unwanted rows from a relation.
 FOREACH ... GENERATE: To transform data or select specific fields.
 GROUP: To group data by one or more fields.
 JOIN: To join two or more relations.
 ORDER: To sort the data.
 DUMP: To display the contents of a relation on the console
When a Pig Latin script is run, the Pig engine parses the script and automatically translates it
into a series of MapReduce jobs, which are then executed on the Hadoop cluster. This means
users do not need to manage the details of parallel and distributed computation.

CODE :
-- Load the dataset (adjust delimiter if needed)
raw_data = LOAD 'C:\Users\input\weather_data.csv' USING PigStorage(',')
AS (date:chararray, temperature:int, other_fields...);

-- Extract year from the date and project temperature


year_data = FOREACH raw_data GENERATE
SUBSTRING(date, 0, 4) AS year,
temperature;

-- Group data by year


grouped_data = GROUP year_data BY year;

-- Calculate maximum temperature for each year


max_temp_by_year = FOREACH grouped_data GENERATE
group AS year,
MAX(year_data.temperature) AS max_temperature;

-- Store or display results


STORE max_temp_by_year INTO 'hdfs_output_path';
-- DUMP max_temp_by_year; -- Use to print results to console
OUTPUT :-

Created a sample input file named ‘weather_data.csv’ with the following content:

1. Save the script to a file (e.g., max_temp.pig).

2. Run it using the Pig command-line: pig -x mapreduce max_temp.pig


EXPERIMENT 15
AIM: Write queries to sort and aggregate the data in a table using HiveQL.
THEORY :

Hive is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-
like query language called HiveQL. These queries are compiled into MapReduce jobs that
are executed on Hadoop. While Hive uses Hadoop for execution of queries, it reduces the
effort that goes into writing and maintaining MapReduce jobs.

Hive supports database concepts like tables, columns, rows and partitions. Both primitive
(integer, float, string) and complex data types (map, list, struct) are supported, and these types
can be composed into structures of arbitrary complexity. Tables are serialized and deserialized
using default serializers/deserializers; any new data format or type can be supported by
implementing the SerDe and ObjectInspector Java interfaces.

By using HiveQL ORDER BY and SORT BY clause, we can apply sort on the column. It
returns the result set either in ascending or descending order.

In HiveQL, ORDER BY clause performs a complete ordering of the query result set. Hence,
the complete data is passed through a single reducer. This may take much time in the
execution of large datasets. However, we can use LIMIT to minimize the sorting time.

CODE :
1. Select the database in which we want to create a table.

2. Create a table using a CREATE TABLE statement.

3. Load the data into the table using LOAD DATA (example statements for steps 1-3 are sketched after this list).


4. Now, fetch the data in the descending order by using the following command
hive> select * from emp order by salary desc;

5. The HiveQL SORT BY clause is an alternative of ORDER BY clause.


hive> select * from emp sort by salary desc;
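As a hedged sketch of steps 1-3 and of the aggregation part of the aim (the emp table and its salary column follow from the queries above; the database name, the id and name columns, and the file path are illustrative):

hive> use practice_db;
hive> create table emp (id int, name string, salary int)
      row format delimited fields terminated by ',';
hive> load data local inpath '/home/user/emp.csv' into table emp;
hive> select count(*), min(salary), max(salary), avg(salary) from emp;
hive> select name, sum(salary) from emp group by name;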
EXPERIMENT 16
AIM: Develop a Java application to find the maximum temperature using Spark.
THEORY :
Apache Spark is an open-source, distributed data processing engine designed for large-scale
data analytics. It provides a fast, flexible, and easy-to-use platform for processing big data
workloads, including batch processing, real-time streaming, machine learning, and graph
computations. Spark is widely adopted across industries for its speed and versatility.

Architecture and Core Concepts :-


Apache Spark operates on a master-worker architecture consisting of:
 Driver Program: The master node that manages the execution of the application. It
creates a SparkContext which coordinates the job execution, schedules tasks, and
communicates with the cluster manager.
 Cluster Manager: Allocates resources and manages worker nodes. Spark can run
standalone or on cluster managers like Hadoop YARN or Apache Mesos.
 Executors: Worker nodes that run tasks assigned by the driver. They perform
computations and store data partitions in memory or disk.

Execution Model :-
Spark transforms user code into a Directed Acyclic Graph (DAG) of stages and tasks. The
DAG scheduler breaks jobs into stages based on shuffle boundaries, and the task scheduler
distributes these tasks across executors. Spark optimizes execution by:
 Minimizing data movement (data locality)
 Speculative execution to handle slow tasks
 In-memory caching to reduce disk I/O

Execution Flow :-
1. Input Data: Loaded as an RDD (Resilient Distributed Dataset) from HDFS/local
storage.
2. Transformations:
 map(), filter(), reduceByKey() create new RDDs (lazily evaluated).
3. Actions:
 saveAsTextFile() triggers job execution.
4. DAG Scheduler: Breaks the job into stages (e.g., map, reduce).
5. Task Scheduler: Distributes tasks to executors.

CODE :
import [Link];
import [Link];
import [Link];
import [Link];
import scala.Tuple2;
import [Link];
public class MaxTemperatureSpark {
public static void main(String[] args) {
// Initialize Spark configuration and context
SparkConf conf = new SparkConf()
.setAppName("MaxTemperature")
.setMaster("local[*]"); // Use "yarn" for cluster mode
JavaSparkContext sc = new JavaSparkContext(conf);
// Load input data (format: Year,Temperature) from the path passed as the first argument
JavaRDD<String> lines = [Link](args[0]);
// Parse lines into (Year, Temperature) pairs
JavaPairRDD<String, Integer> yearTemps = lines
.mapToPair(line -> {
String[] parts = [Link](",");
String year = parts[0];
int temp = [Link](parts[1]);
return new Tuple2<>(year, temp);
});
// Filter out invalid temperatures (e.g., -9999)
JavaPairRDD<String, Integer> filtered = yearTemps
.filter(tuple -> tuple._2() != -9999);
// Reduce by key to find max temperature per year
JavaPairRDD<String, Integer> maxTemps = filtered
.reduceByKey((a, b) -> [Link](a, b));
// Save results to the output path passed as the second argument
[Link](args[1]);
[Link]();
}
}
OUTPUT :-
Created a sample input file named ‘weather_data.csv’ with the following content:

1. Compile and Package :- mvn package # Creates [Link]


2. Submit Spark Job :-

spark-submit --class MaxTemperatureSpark --master local[*] target/[Link] input/weather_data.csv output/max_temps
