24afi08 Big Data File
EXPERIMENT 1
AIM: Install and configure Apache Hadoop in a single-node setup on a local machine.
THEORY :
Apache Hadoop is an open-source framework designed for distributed storage and processing of large datasets using the MapReduce programming model. It runs on a cluster of computers and provides scalability, fault tolerance, and high availability. Hadoop consists of four main modules: HDFS (the Hadoop Distributed File System, for distributed storage), YARN (for cluster resource management and job scheduling), MapReduce (for distributed data processing), and Hadoop Common (the shared utilities used by the other modules).
Procedure :
1. Download Hadoop:
Visit the official Apache Hadoop website and download the desired version of
Hadoop.
2. Install Java:
Hadoop requires a Java runtime (JDK 8 or later). Install the JDK and set the JAVA_HOME environment variable.
5. Configure Hadoop Files:
Edit the configuration files located in the etc/hadoop directory (core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and hadoop-env.sh).
6. Format NameNode:
Run the command hdfs namenode -format to initialize the HDFS metadata.
7. Start Hadoop Services:
Run the start-dfs.sh and start-yarn.sh scripts to start the HDFS and YARN services.
8. Verify Installation:
Run the jps command to confirm that the NameNode, DataNode, ResourceManager, and NodeManager processes are running, and open the NameNode web UI in a browser.
Conclusion:
You have successfully installed Hadoop in a single-node setup on your local machine. You can
now run MapReduce jobs and explore HDFS functionalities.
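If you also want to confirm the setup from code, a small Java client along the following lines can list the HDFS root directory. This is only a sketch: it assumes the Hadoop client libraries are on the classpath and that fs.defaultFS is hdfs://localhost:9000, which should be changed to match your core-site.xml.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS URI for a typical single-node installation (assumed; match the value in core-site.xml)
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        // List the contents of the HDFS root directory
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}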
EXPERIMENT 2
AIM: Develop a MapReduce program to calculate the frequency of a given word in a given
file
THEORY :
Program Structure
1. Mapper Class : The Mapper class splits each line of input text into words and emits the target word with a count of 1 every time it appears.
2. Reducer Class : The Reducer class sums up the counts emitted for the word and outputs its total frequency.
3. Driver Class : The Driver class sets up and configures the MapReduce job (mapper, reducer, output types, and input/output paths) and submits it to the cluster.
CODE :
WordFrequencyMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Class and field names assumed; the original listing did not include the class declaration.
public class WordFrequencyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // The word whose frequency is counted; in practice this could be read from the job configuration.
    private String specificWord = "fox";
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] words = line.split("\\s+");
        for (String str : words) {
            // Compare case-insensitively after trimming
            if (str.toLowerCase().trim().equals(specificWord)) {
                word.set(specificWord);
                context.write(word, one);
            }
        }
    }
}
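The Reducer class described in the theory is not shown in the listing above. A minimal sketch, assuming the same Text/IntWritable types used by the mapper and the class name WordFrequencyReducer, is:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// WordFrequencyReducer.java - sums the counts emitted by the mapper for each word
public class WordFrequencyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();      // add up each 1 emitted for this word
        }
        result.set(sum);
        context.write(key, result);  // emit (word, total frequency)
    }
}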
// Driver tail (class names assumed to match the mapper and reducer above)
job.setMapperClass(WordFrequencyMapper.class);
job.setReducerClass(WordFrequencyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
OUTPUT :
Created a sample input file named ‘sample_input.txt’ with the following content:
Sample Input :- The quick brown fox jumps over the lazy dog. The dog barks at the fox. Quick,
the fox runs away. The sly fox outsmarts the hound. No foxes were seen today.
Now we would run the MapReduce job with a command similar to:
Sample Output :- The output file (typically named part-r-00000) would contain:-
EXPERIMENT 3
AIM: Develop a MapReduce program to find the maximum temperature in each year
CODE :
MaxTemperatureMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Class and field names assumed; the original listing did not include the class declaration.
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    private IntWritable year = new IntWritable();
    private IntWritable temperature = new IntWritable();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(",");
        // Record format (assumed): year,<station id>,temperature
        if (parts.length == 3) {
            try {
                int yearValue = Integer.parseInt(parts[0]);
                int tempValue = Integer.parseInt(parts[2]);
                year.set(yearValue);
                temperature.set(tempValue);
                context.write(year, temperature);
            } catch (NumberFormatException e) {
                // Skip lines with invalid data
            }
        }
    }
}
MaxTemperatureReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
// Class name assumed; the original listing did not include the class declaration.
public class MaxTemperatureReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    private IntWritable maxTemp = new IntWritable();
    @Override
    public void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        maxTemp.set(max);
        context.write(key, maxTemp);
    }
}
MaxTemperatureDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
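The body of the driver class is missing from the listing. A minimal sketch that completes it directly below the imports above, assuming the class names MaxTemperatureDriver, MaxTemperatureMapper, and MaxTemperatureReducer used in this experiment, is:
public class MaxTemperatureDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max temperature");
        job.setJarByClass(MaxTemperatureDriver.class);
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}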
After compiling the Java files and creating a JAR, we would run the MapReduce job with a
command like:
Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 4
AIM: Develop a MapReduce program to compute each student's average score and assign a grade from the given scores file
CODE :
StudentScoresMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Class and field names assumed; the original listing did not include the class declaration.
public class StudentScoresMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text studentId = new Text();
    private Text scoreInfo = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(",");
        // Record format (assumed): studentId,subject,score
        if (parts.length == 3) {
            studentId.set(parts[0]);
            scoreInfo.set(parts[1] + "," + parts[2]);
            context.write(studentId, scoreInfo);
        }
    }
}
StudentScoresReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Class name assumed; the grade boundaries in calculateGrade() are also assumed, as they were not given.
public class StudentScoresReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int totalScore = 0;
        int count = 0;
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            totalScore += Integer.parseInt(parts[1]); // score is the second part of "subject,score"
            count++;
        }
        double average = (double) totalScore / count;
        String grade = calculateGrade(average);
        result.set(String.format("Average: %.2f, Grade: %s", average, grade));
        context.write(key, result);
    }
    // Maps an average score to a letter grade (boundaries assumed for illustration)
    private String calculateGrade(double average) {
        if (average >= 90) return "A";
        else if (average >= 80) return "B";
        else if (average >= 70) return "C";
        else if (average >= 60) return "D";
        else return "F";
    }
}
StudentScoresDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
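The body of the driver class is missing from the listing. A minimal sketch that completes it directly below the imports above, assuming the class names StudentScoresDriver, StudentScoresMapper, and StudentScoresReducer used in this experiment, is:
public class StudentScoresDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "student average score and grade");
        job.setJarByClass(StudentScoresDriver.class);
        job.setMapperClass(StudentScoresMapper.class);
        job.setReducerClass(StudentScoresReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. path to student_scores.txt in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}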
Created a sample input file named ‘student_scores.txt’ with the following content:
After compiling the Java files and creating a JAR, you would run the MapReduce job with a
command like:
Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 5
AIM: Develop a MapReduce program to implement Matrix Multiplication
THEORY:
Matrix multiplication is a fundamental operation in linear algebra with applications in fields such as computer graphics, scientific computing, and machine learning. For an m x p matrix A and a p x n matrix B, each output cell is C[i][j] = sum over k of A[i][k] * B[k][j]. When the matrices are large, this computation becomes expensive, and MapReduce offers a natural way to distribute it: the mapper sends every element of A and B to each output cell (i, j) it contributes to, and the reducer for a cell multiplies the matching pairs and sums them. This allows very large matrices to be processed across multiple nodes.
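For reference, the computation that the MapReduce job below distributes is the ordinary triple-loop product, shown here as plain single-machine Java; the matrices are chosen only for illustration:
public class MatrixMultiplyLocal {
    public static void main(String[] args) {
        int[][] a = {{1, 2, 3}, {4, 5, 6}};       // A is m x p (2 x 3)
        int[][] b = {{7, 8}, {9, 10}, {11, 12}};  // B is p x n (3 x 2)
        int m = a.length, p = b.length, n = b[0].length;
        int[][] c = new int[m][n];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) {
                for (int k = 0; k < p; k++) {
                    c[i][j] += a[i][k] * b[k][j]; // C[i][j] = sum over k of A[i][k] * B[k][j]
                }
            }
        }
        for (int[] row : c) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}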
CODE :
MatrixMultiplicationMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MatrixMultiplicationMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(",");
        // Input format: matrixName,row,col,value
        if (parts.length == 4) {
            String matrixName = parts[0];
            int row = Integer.parseInt(parts[1]);
            int col = Integer.parseInt(parts[2]);
            int val = Integer.parseInt(parts[3]);
            if (matrixName.equals("A")) {
                // Element A[row][col] contributes to every output cell (row, i) for i = 0..n-1
                for (int i = 0; i < context.getConfiguration().getInt("n", 10); i++) {
                    outputKey.set(row + "," + i);
                    outputValue.set("A," + col + "," + val);
                    context.write(outputKey, outputValue);
                }
            } else if (matrixName.equals("B")) {
                // Element B[row][col] contributes to every output cell (i, col) for i = 0..m-1
                for (int i = 0; i < context.getConfiguration().getInt("m", 10); i++) {
                    outputKey.set(i + "," + col);
                    outputValue.set("B," + row + "," + val);
                    context.write(outputKey, outputValue);
                }
            }
        }
    }
}
MatrixMultiplicationReducer.java Code :-
import java.io.IOException;
import java.util.HashMap;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MatrixMultiplicationReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Collect the A and B contributions for this output cell, indexed by the shared dimension k
        HashMap<Integer, Integer> mapA = new HashMap<>();
        HashMap<Integer, Integer> mapB = new HashMap<>();
        for (Text value : values) {
            String[] parts = value.toString().split(",");
            if (parts[0].equals("A")) {
                mapA.put(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
            } else if (parts[0].equals("B")) {
                mapB.put(Integer.parseInt(parts[1]), Integer.parseInt(parts[2]));
            }
        }
        // Multiply matching pairs A[i][k] * B[k][j] and sum over k
        int sum = 0;
        for (int i = 0; i < context.getConfiguration().getInt("p", 10); i++) {
            if (mapA.containsKey(i) && mapB.containsKey(i)) {
                sum += mapA.get(i) * mapB.get(i);
            }
        }
        if (sum != 0) {
            result.set(Integer.toString(sum));
            context.write(key, result);
        }
    }
}
MatrixMultiplicationDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MatrixMultiplicationDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Set matrix dimensions: A(m x p), B(p x n)
        conf.setInt("m", 2);
        conf.setInt("p", 3);
        conf.setInt("n", 2);
        Job job = Job.getInstance(conf, "matrix multiplication");
        job.setJarByClass(MatrixMultiplicationDriver.class);
        job.setMapperClass(MatrixMultiplicationMapper.class);
        job.setReducerClass(MatrixMultiplicationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘matrix_input.txt’ with the following content:
EnergyReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
public class EnergyReducer extends Reducer<Text, DoubleWritable, Text, Text> {
    private MultipleOutputs<Text, Text> mos;
    private Text result = new Text();
    @Override
    public void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        String keyStr = key.toString();
        if (keyStr.contains(",")) { // Monthly average computation (composite key containing a comma)
            double sum = 0;
            int count = 0;
            for (DoubleWritable val : values) {
                sum += val.get();
                count++;
            }
            double avg = sum / count;
            result.set(String.format("%.2f", avg));
            mos.write("monthlyAvg", key, result);
        } else { // Maximum consumption per city
            double max = Double.MIN_VALUE;
            for (DoubleWritable val : values) {
                max = Math.max(max, val.get());
            }
            result.set(String.format("%.2f", max));
            mos.write("maxConsumption", key, result);
        }
    }
    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
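The mapper class referenced by the driver below is not included in the listing. A minimal sketch, assuming input lines of the form city,month,consumption (matching energy_input.csv) and the key convention used by the reducer above, might be:
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// EnergyMapper.java - emits two records per reading: one keyed by city (for the per-city maximum)
// and one keyed by "city,month" (for the monthly average). The CSV column layout is an assumption.
public class EnergyMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private Text outputKey = new Text();
    private DoubleWritable consumption = new DoubleWritable();
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length >= 3) {
            try {
                String city = fields[0].trim();
                String month = fields[1].trim();
                double usage = Double.parseDouble(fields[2].trim());
                consumption.set(usage);
                outputKey.set(city);               // key for the per-city maximum
                context.write(outputKey, consumption);
                outputKey.set(city + "," + month); // composite key for the monthly average
                context.write(outputKey, consumption);
            } catch (NumberFormatException e) {
                // Skip malformed records
            }
        }
    }
}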
EnergyDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class EnergyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Energy Consumption Analysis");
        job.setJarByClass(EnergyDriver.class);
        // Configure Mapper and Reducer
        job.setMapperClass(EnergyMapper.class);
        job.setReducerClass(EnergyReducer.class);
        // Set output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(DoubleWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Define MultipleOutputs
        MultipleOutputs.addNamedOutput(job, "maxConsumption",
                TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "monthlyAvg",
                TextOutputFormat.class, Text.class, Text.class);
        // Set input/output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘energy_input.csv’ with the following content:
Sample Output :-
# View maximum consumption results
hadoop fs -cat /user/output/energy_results/maxConsumption-r-00000
AIM: Develop a MapReduce program to analyse sales data and find the number of products sold in each country.
CODE :
SalesMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class SalesMapper extends Mapper<Object, Text, Text, IntWritable> {
    private Text country = new Text();
    private final static IntWritable one = new IntWritable(1);
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split(","); // Assuming fields are comma-separated
        if (fields.length >= 8) { // Ensure there are enough fields
            String countryName = fields[7].trim(); // Country is in the 8th column (index 7)
            country.set(countryName);
            context.write(country, one); // Emit (country, 1) for each product sold
        }
    }
}
SalesReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result); // Emit (country, total products sold)
    }
}
SalesAnalysisDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SalesAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sales analysis");
        job.setJarByClass(SalesAnalysisDriver.class);
        job.setMapperClass(SalesMapper.class);
        job.setReducerClass(SalesReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘sales_data.txt’ with the following content:
Sample Input :-
Each line represents one sales transaction, with the country name in the 8th comma-separated field.
CODE :
MovieTagDriver.java Code :- Sets up the job that aggregates tags for each movie and pairs them with titles (class name assumed; only the end of the driver listing is shown).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
OUTPUT :-
Created a sample input file for movies (e.g. movies.csv) with the following content:
Created a sample input file for tags (e.g. tags.csv) with the following content:
Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 10
AIM: XYZ.com is an online music website where users listen to various tracks, and the data collected from these listens is stored in log files. Each log record contains the pipe-separated fields UserId|TrackId|Shared|Radio|Skip. Develop a MapReduce program to compute, for each track, the number of unique listeners, the number of times the track was shared, the number of radio plays, the total number of plays, and the number of times the track was skipped on the radio.
CODE :
First, let's define constants to make the code more readable:
MusicConstants.java Code :-
package com.xyz.music; // package name assumed
public class MusicConstants {
// Indices for fields in each log record
public static final int USER_ID = 0;
public static final int TRACK_ID = 1;
public static final int IS_SHARED = 2;
public static final int IS_RADIO = 3;
public static final int IS_SKIPPED = 4;
// Types of metrics to calculate
public static final String UNIQUE_LISTENERS = "unique_listeners";
public static final String SHARED_COUNT = "shared_count";
public static final String RADIO_PLAYS = "radio_plays";
public static final String TOTAL_PLAYS = "total_plays";
public static final String RADIO_SKIPS = "radio_skips";
}
MusicStatsMapper.java Code :-
package com.xyz.music; // package name assumed
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MusicStatsMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text outputKey = new Text();
    private Text outputValue = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split("\\|");
        // Skip malformed records
        if (fields.length != 5) return;
        try {
            String userId = fields[MusicConstants.USER_ID].trim();
            String trackId = fields[MusicConstants.TRACK_ID].trim();
            int shared = Integer.parseInt(fields[MusicConstants.IS_SHARED].trim());
            int radio = Integer.parseInt(fields[MusicConstants.IS_RADIO].trim());
            int skipped = Integer.parseInt(fields[MusicConstants.IS_SKIPPED].trim());
            // For unique listeners - emit (metric:trackId, userId)
            outputKey.set(MusicConstants.UNIQUE_LISTENERS + ":" + trackId);
            outputValue.set(userId);
            context.write(outputKey, outputValue);
            // For shared count - emit only if shared=1
            if (shared == 1) {
                outputKey.set(MusicConstants.SHARED_COUNT + ":" + trackId);
                outputValue.set("1");
                context.write(outputKey, outputValue);
            }
            // For radio plays - emit only if radio=1
            if (radio == 1) {
                outputKey.set(MusicConstants.RADIO_PLAYS + ":" + trackId);
                outputValue.set("1");
                context.write(outputKey, outputValue);
            }
            // For total plays - emit for every record
            outputKey.set(MusicConstants.TOTAL_PLAYS + ":" + trackId);
            outputValue.set("1");
            context.write(outputKey, outputValue);
            // For radio skips - emit only if radio=1 and skipped=1
            if (radio == 1 && skipped == 1) {
                outputKey.set(MusicConstants.RADIO_SKIPS + ":" + trackId);
                outputValue.set("1");
                context.write(outputKey, outputValue);
            }
        } catch (NumberFormatException e) {
            // Skip records with invalid number format
        }
    }
}
MusicStatsReducer.java Code :-
package com.xyz.music; // package name assumed
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MusicStatsReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String keyStr = key.toString();
        String[] keyParts = keyStr.split(":");
        if (keyParts.length != 2) return;
        String metricType = keyParts[0];
        String trackId = keyParts[1];
        if (metricType.equals(MusicConstants.UNIQUE_LISTENERS)) {
            // Count unique listeners
            Set<String> uniqueUsers = new HashSet<>();
            for (Text value : values) {
                uniqueUsers.add(value.toString());
            }
            result.set("Unique Listeners: " + uniqueUsers.size());
        } else {
            // Count other metrics (shared, radio plays, total plays, skips)
            int count = 0;
            for (Text value : values) {
                count++;
            }
            if (metricType.equals(MusicConstants.SHARED_COUNT)) {
                result.set("Shared Count: " + count);
            } else if (metricType.equals(MusicConstants.RADIO_PLAYS)) {
                result.set("Radio Plays: " + count);
            } else if (metricType.equals(MusicConstants.TOTAL_PLAYS)) {
                result.set("Total Plays: " + count);
            } else if (metricType.equals(MusicConstants.RADIO_SKIPS)) {
                result.set("Radio Skips: " + count);
            }
        }
        context.write(new Text("Track " + trackId), result);
    }
}
MusicStatsDriver.java Code :-
package com.xyz.music; // package name assumed
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MusicStatsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "XYZ Music Stats");
        job.setJarByClass(MusicStatsDriver.class);
        // Set mapper and reducer classes
        job.setMapperClass(MusicStatsMapper.class);
        job.setReducerClass(MusicStatsReducer.class);
        // Set output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Set input and output paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
1. First, compiled the Java files and created a JAR.
2. Created a sample input file named ‘music_logs.txt’.
3. Placed the input file in HDFS and ran the MapReduce job.
Sample Output :- The output file (typically named part-r-00000) would contain:
hadoop fs -cat /user/output/music_stats/part-r-00000
EXPERIMENT 11
AIM: Develop a MapReduce program to find the frequency of books published each year
and find in which year maximum number of books were published using the following data.
CODE :
BookPublicationMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class BookPublicationMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text year = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] fields = line.split(",");
        // Skip header or malformed rows
        if (fields.length >= 3 && !fields[0].equals("Title")) {
            try {
                // Extract the published year (assuming it's in the 3rd column)
                String publishedYear = fields[2].trim();
                year.set(publishedYear);
                // Emit (year, 1) for each book
                context.write(year, one);
            } catch (Exception e) {
                // Skip rows with parsing errors
            }
        }
    }
}
BookPublicationReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class BookPublicationReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Sum up all occurrences (always 1) of books in this year
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
MaxPublicationMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxPublicationMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final static Text maxKey = new Text("MAX");
    private Text yearCount = new Text();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split("\\s+");
        if (parts.length == 2) {
            // Format: "year count"
            yearCount.set(line);
            // Emit with a single key so all records go to one reducer
            context.write(maxKey, yearCount);
        }
    }
}
MaxPublicationReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxPublicationReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int maxCount = -1;
        String maxYear = "";
        // Find the year with maximum publications
        for (Text val : values) {
            String[] parts = val.toString().split("\\s+");
            String year = parts[0];
            int count = Integer.parseInt(parts[1]);
            if (count > maxCount) {
                maxCount = count;
                maxYear = year;
            }
        }
        result.set("Maximum number of books (" + maxCount + ") were published in year " + maxYear);
        context.write(new Text("Result:"), result);
    }
}
BookPublicationDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class BookPublicationDriver {
    public static void main(String[] args) throws Exception {
        // Job 1: Count books per year
        Configuration conf1 = new Configuration();
        Job job1 = Job.getInstance(conf1, "Book Publication Count");
        job1.setJarByClass(BookPublicationDriver.class);
        job1.setMapperClass(BookPublicationMapper.class);
        job1.setReducerClass(BookPublicationReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        Path tempOutput = new Path(args[1] + "_temp");
        FileOutputFormat.setOutputPath(job1, tempOutput);
        job1.waitForCompletion(true);
        // Job 2: Find year with maximum publications
        Configuration conf2 = new Configuration();
        Job job2 = Job.getInstance(conf2, "Max Publication Year");
        job2.setJarByClass(BookPublicationDriver.class);
        job2.setMapperClass(MaxPublicationMapper.class);
        job2.setReducerClass(MaxPublicationReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job2, tempOutput);
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
1. First, compile the Java files and create a JAR:
2. Created a sample input file named ‘book_data.csv’
Sample Output :- The output file (typically named part-r-00000) would contain:
hadoop fs -cat /user/output/book_analysis/part-r-00000
The first job produces year-count pairs:
AIM: Develop a MapReduce program to analyse the Titanic dataset and find (i) the average age of the passengers, by gender, who died in the tragedy and (ii) the number of passengers who survived in each travel class.
CODE :
TitanicAgeMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TitanicAgeMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private Text gender = new Text();
    private DoubleWritable age = new DoubleWritable();
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split on commas that are not inside quotes, since passenger names contain commas
        String[] fields = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        // Skip malformed records
        if (fields.length >= 6) {
            String survived = fields[1].trim();
            String sex = fields[4].trim();
            String ageStr = fields[5].trim();
            // In the dataset, Survived = 1 means the passenger survived and 0 means the passenger died.
            // Process only passengers who died and whose age is valid.
            if ("0".equals(survived) && ageStr.matches("\\d+(\\.\\d+)?")) {
                double ageVal = Double.parseDouble(ageStr);
                gender.set(sex);
                age.set(ageVal);
                context.write(gender, age);
            }
        }
    }
}
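The corresponding reducer, which averages the ages per gender, is not included in the listing. A minimal sketch is shown below; the class name TitanicAgeReducer is assumed, and the driver below is wired to it.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// TitanicAgeReducer.java - computes the average age of deceased passengers for each gender
public class TitanicAgeReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private DoubleWritable result = new DoubleWritable();
    @Override
    public void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        int count = 0;
        for (DoubleWritable val : values) {
            sum += val.get();
            count++;
        }
        result.set(sum / count);      // average age for this gender
        context.write(key, result);
    }
}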
TagMapper.java Code :- (part of the movie-tag experiment above)
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;
public class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed tags.csv format: userId,movieId,tag,timestamp
        String[] fields = value.toString().split(",");
        if (fields.length >= 3) {
            String movieId = fields[1].trim();
            String tag = fields[2].trim();
            context.write(new Text(movieId), new Text("TAG:" + tag));
        }
    }
}
TitanicSurvivorsMapper.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TitanicSurvivorsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text passengerClass = new Text();
    private static final IntWritable one = new IntWritable(1);
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split on commas that are not inside quotes, since passenger names contain commas
        String[] fields = line.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
        // Skip malformed records
        if (fields.length >= 3) {
            String survived = fields[1].trim();
            // Process only passengers who survived (Survived = 1)
            if ("1".equals(survived)) {
                String pclass = "Class " + fields[2].trim();
                passengerClass.set(pclass);
                context.write(passengerClass, one);
            }
        }
    }
}
TitanicSurvivorsReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class TitanicSurvivorsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
TitanicAnalysisDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class TitanicAnalysisDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job 1: Average age by gender for deceased passengers
        Job job1 = Job.getInstance(conf, "Average Age of Deceased");
        job1.setJarByClass(TitanicAnalysisDriver.class);
        job1.setMapperClass(TitanicAgeMapper.class);
        job1.setReducerClass(TitanicAgeReducer.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path(args[1] + "_avg_age"));
        job1.waitForCompletion(true);
        // Job 2: Number of survivors in each passenger class (output path suffix assumed)
        Job job2 = Job.getInstance(conf, "Survivors per Class");
        job2.setJarByClass(TitanicAnalysisDriver.class);
        job2.setMapperClass(TitanicSurvivorsMapper.class);
        job2.setReducerClass(TitanicSurvivorsReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job2, new Path(args[0]));
        FileOutputFormat.setOutputPath(job2, new Path(args[1] + "_survivors"));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘titanic_data.csv’ with the following content:
PassengerId,Survived,Pclass,Name,Sex,Age
1,1,1,"Allen, Miss. Elisabeth Walton",female,29
2,0,1,"Allison, Master. Hudson Trevor",male,0.92
3,1,3,"Andersson, Mr. Anders Johan",male,39
4,0,3,"Andersson, Miss. Ingeborg Constanzia",female,9
5,1,3,"Andersson, Miss. Sigrid Elisabeth",female,11
6,0,1,"Andrews, Miss. Kornelia Theodosia",female,63
7,1,1,"Appleton, Mrs. Edward Dale",female,53
8,0,2,"Arnold-Franchi, Mrs. Josef",female,18
9,1,3,"Baclini, Miss. Eugenie",female,0.75
10,0,3,"Baclini, Miss. Helene Barbara",female,0.33
11,1,2,"Backstrom, Mrs. Karl Alfred",female,33
12,0,2,"Backstrom, Miss. Kristina",female,20
13,1,3,"Baclini, Miss. Marie Catherine",female,5
14,0,1,"Baxter, Mrs. James",female,50
15,1,2,"Becker, Miss. Marion Louise",female,4
16,0,3,"Bourke, Miss. Mary",female,28
17,1,1,"Brown, Mrs. James Joseph",female,44
18,0,2,"Brown, Miss. Amelia",female,24
19,1,3,"Cacic, Miss. Marija",female,30
20,0,3,"Cacic, Mr. Luka",male,38
Sample Output :- The output file (typically named part-r-00000) would contain:
AIM: Develop a MapReduce program to analyse the Uber trip dataset and find the total number of trips for each dispatching base on each day of the week.
CODE :
UberTripsMapper.java Code :-
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
// Class and field names assumed; the input layout is assumed to be: dispatching_base_number,date,active_vehicles,trips
public class UberTripsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Text outputKey = new Text();
    private IntWritable tripCount = new IntWritable();
    private final SimpleDateFormat dateFormat = new SimpleDateFormat("MM/dd/yyyy"); // date format assumed
    private final String[] days = {"Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"};
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Skip header row
        if (key.get() == 0 && value.toString().contains("dispatching_base_number")) {
            return;
        }
        String[] fields = value.toString().split(",");
        if (fields.length >= 4) {
            String baseNumber = fields[0].trim();
            String dateStr = fields[1].trim();
            try {
                int trips = Integer.parseInt(fields[3].trim());
                // Parse date and extract day of week (getDay(): 0 = Sunday ... 6 = Saturday)
                Date date = dateFormat.parse(dateStr);
                String dayOfWeek = days[date.getDay()];
                // Create composite key: "baseNumber dayOfWeek"
                outputKey.set(baseNumber + " " + dayOfWeek);
                tripCount.set(trips);
                context.write(outputKey, tripCount);
            } catch (ParseException e) {
                // Skip records with invalid date format
                System.err.println("Error parsing date: " + dateStr);
            } catch (NumberFormatException e) {
                // Skip records with an invalid trip count
            }
        }
    }
}
UberTripsReducer.java Code :-
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
// Class name assumed
public class UberTripsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Add up the trips for this base and day of the week
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
UberTripsDriver.java Code :-
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Class name assumed
public class UberTripsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Uber Trips Analysis");
        job.setJarByClass(UberTripsDriver.class);
        job.setMapperClass(UberTripsMapper.class);
        job.setReducerClass(UberTripsReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
OUTPUT :-
Created a sample input file named ‘uber_data.csv’ with the following content:
Sample Output :- The output file (typically named part-r-00000) would contain:
EXPERIMENT 14
AIM: Develop a program to calculate the maximum recorded temperature by year wise for
the weather dataset in Pig Latin
THEORY :
Pig Latin is a high-level scripting language designed for analyzing large datasets in Hadoop
using the Apache Pig platform. Pig Latin scripts are composed of a series of data
transformation steps, much like a data flow, where each step performs a specific operation
such as loading, filtering, grouping, or aggregating data.
Pig Latin programs are written as a sequence of statements, each ending with a semicolon.
These statements operate on relations and include:
LOAD: To load data from the file system (HDFS or local) into a relation.
STORE: To save a relation back to the file system.
FILTER: To remove unwanted rows from a relation.
FOREACH ... GENERATE: To transform data or select specific fields.
GROUP: To group data by one or more fields.
JOIN: To join two or more relations.
ORDER: To sort the data.
DUMP: To display the contents of a relation on the console
When a Pig Latin script is run, the Pig engine parses the script and automatically translates it
into a series of MapReduce jobs, which are then executed on the Hadoop cluster. This means
users do not need to manage the details of parallel and distributed computation.
CODE :
-- Load the dataset (adjust delimiter and path if needed); assumes each record is (year, temperature)
raw_data = LOAD 'C:/Users/input/weather_data.csv' USING PigStorage(',') AS (year:chararray, temperature:int);
grouped = GROUP raw_data BY year;
max_temp = FOREACH grouped GENERATE group AS year, MAX(raw_data.temperature) AS max_temperature;
DUMP max_temp;  -- display the year-wise maximum temperature
Created a sample input file named ‘weather_data.csv’ with the following content:
Hive is an open-source data warehousing solution built on top of Hadoop. It supports an SQL-
like query language called HiveQL. These queries are compiled into MapReduce jobs that
are executed on Hadoop. While Hive uses Hadoop for execution of queries, it reduces the
effort that goes into writing and maintaining MapReduce jobs.
Hive supports database concepts such as tables, columns, rows, and partitions. Both primitive (integer, float, string) and complex data types (map, list, struct) are supported, and these types can be composed into structures of arbitrary complexity. Tables are serialized and deserialized using default serializers/deserializers, and any new data format or type can be supported by implementing the SerDe and ObjectInspector Java interfaces.
By using HiveQL ORDER BY and SORT BY clause, we can apply sort on the column. It
returns the result set either in ascending or descending order.
In HiveQL, ORDER BY clause performs a complete ordering of the query result set. Hence,
the complete data is passed through a single reducer. This may take much time in the
execution of large datasets. However, we can use LIMIT to minimize the sorting time.
CODE :
1. Select the database in which we want to create a table.
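The HiveQL statements for this experiment are not included above. As an illustration only, the following Java sketch shows how such statements (selecting a database, creating a table, and sorting with ORDER BY) could be submitted through Hive's JDBC interface; the table name, columns, and connection URL are assumptions, and the Hive JDBC driver must be on the classpath.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
public class HiveSortExample {
    public static void main(String[] args) throws Exception {
        // Connect to HiveServer2 (URL, database, and credentials are assumed)
        Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        stmt.execute("USE default");                                  // 1. select the database
        stmt.execute("CREATE TABLE IF NOT EXISTS emp (id INT, name STRING, salary DOUBLE) "
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");   // 2. create a table
        // 3. ORDER BY performs a total ordering of the result set through a single reducer
        ResultSet rs = stmt.executeQuery("SELECT id, name, salary FROM emp ORDER BY salary DESC");
        while (rs.next()) {
            System.out.println(rs.getInt(1) + "\t" + rs.getString(2) + "\t" + rs.getDouble(3));
        }
        con.close();
    }
}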
AIM: Develop a Spark program to find the maximum recorded temperature for each year in the weather dataset.
THEORY :
Execution Model :-
Spark transforms user code into a Directed Acyclic Graph (DAG) of stages and tasks. The
DAG scheduler breaks jobs into stages based on shuffle boundaries, and the task scheduler
distributes these tasks across executors. Spark optimizes execution by:
Minimizing data movement (data locality)
Speculative execution to handle slow tasks
In-memory caching to reduce disk I/O
Execution Flow :-
1. Input Data: Loaded as an RDD (Resilient Distributed Dataset) from HDFS/local
storage.
2. Transformations:
map(), filter(), reduceByKey() create new RDDs (lazily evaluated).
3. Actions:
saveAsTextFile() triggers job execution.
4. DAG Scheduler: Breaks the job into stages (e.g., map, reduce).
5. Task Scheduler: Distributes tasks to executors.
CODE :
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class MaxTemperatureSpark {
    public static void main(String[] args) {
        // Initialize Spark configuration and context
        SparkConf conf = new SparkConf()
                .setAppName("MaxTemperature")
                .setMaster("local[*]"); // Use "yarn" for cluster mode
        JavaSparkContext sc = new JavaSparkContext(conf);
        // Load input data (format: Year,Temperature)
        JavaRDD<String> lines = sc.textFile("input/weather_data.csv");
        // Parse lines into (Year, Temperature) pairs
        JavaPairRDD<String, Integer> yearTemps = lines
                .mapToPair(line -> {
                    String[] parts = line.split(",");
                    String year = parts[0];
                    int temp = Integer.parseInt(parts[1]);
                    return new Tuple2<>(year, temp);
                });
        // Filter out invalid temperatures (e.g., -9999)
        JavaPairRDD<String, Integer> filtered = yearTemps
                .filter(tuple -> tuple._2() != -9999);
        // Reduce by key to find the max temperature per year
        JavaPairRDD<String, Integer> maxTemps = filtered
                .reduceByKey((a, b) -> Math.max(a, b));
        // Save or print results
        maxTemps.saveAsTextFile("output/max_temps");
        sc.close();
    }
}
OUTPUT :-
Created a sample input file named ‘weather_data.csv’ with the following content: