
3/29/24, 4:36 PM GitHub - prince6635/movie-ratings-by-mapreduce-and-hadoop: Big data (movie ratings) based on Hadoop and MapReduce

prince6635 / movie-ratings-by-mapreduce-and-hadoop Public

Big data (movie ratings) based on Hadoop and MapReduce

6 stars, 9 forks


prince6635 Run MR job on AWS EMR 8 years ago

assets Run MR job on AWS EMR 8 years ago

.gitignore Initial commit 8 years ago

README.md Run MR job on AWS EMR 8 years ago

friends_by_age.py MapReduce example - average n… 8 years ago

min_temperatures_by_loc… MapReduce example - min temp… 8 years ago

most_popular_movie.py MapReduce example - movie rati… 8 years ago

most_popular_movie_with… MapReduce example - movie rati… 8 years ago

most_popular_superhero.py MapReduce example - Most pop… 8 years ago

movie_recommendation_… Run MR job on AWS EMR 8 years ago

process_marvel_data.py MapReduce example - find super… 8 years ago

superhero_relatons_by_BF… MapReduce example - find super… 8 years ago

total_amount_spent_by_c… MapReduce example - total amo… 8 years ago

total_amount_spent_by_c… MapReduce example - total amo… 8 years ago

word_frequency.py MapReduce example - word freq… 8 years ago

word_frequency_better.py MapReduce example - word freq… 8 years ago

word_frequency_sorted_b… MapReduce example - word freq… 8 years ago

word_frequency_with_co… MapReduce example - movie rati… 8 years ago

https://github.com/prince6635/movie-ratings-by-mapreduce-and-hadoop 1/11

Big data (movie ratings) based on Hadoop and MapReduce

MapReduce

Example:

How many movies has each user watched? => key: user_id, value: movie_id. Duplicate keys are
fine at this stage, since the reducer will handle them later.
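
The map, shuffle, and reduce steps for this question can be sketched in plain Python. This is only an illustration and not the code from this repo: the sample rows and the function names (mapper, shuffle, reducer) are assumptions, and a real job would let the framework do the grouping.

```python
from collections import defaultdict

# Hypothetical sample rows: (user_id, movie_id).
RATINGS = [(196, 242), (186, 302), (196, 377), (22, 377), (196, 51)]


def mapper(user_id, movie_id):
    # Emit one (key, value) pair per rating; duplicate keys are expected here.
    yield user_id, movie_id


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(user_id, movie_ids):
    # Collapse the grouped values into a single count per user.
    yield user_id, len(movie_ids)


mapped = [pair for row in RATINGS for pair in mapper(*row)]
movies_per_user = dict(
    result
    for key, values in shuffle(mapped).items()
    for result in reducer(key, values)
)
print(movies_per_user)
```

The mapper never checks whether a user_id was already emitted; collapsing duplicates is left entirely to the reduce step.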


Map:


Reduce:

All:

Code snippet: # of movies for each rating?


Fields: user_id movie_id rating timestamp
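
Since the original snippet did not survive on this page, here is a rough stand-in for counting how many ratings fall on each rating value. It uses only the standard library rather than mrjob; the tab-separated layout, the sample lines, and the function names are assumptions made for illustration.

```python
# Hypothetical sample lines in the field order above: user_id, movie_id, rating, timestamp.
# Tab separation is an assumption about the data file.
SAMPLE_LINES = [
    "196\t242\t3\t881250949",
    "186\t302\t3\t891717742",
    "22\t377\t1\t878887116",
    "244\t51\t2\t880606923",
    "166\t346\t1\t886397596",
]


def mapper_get_ratings(line):
    # Only the rating field matters for this question; emit (rating, 1).
    user_id, movie_id, rating, timestamp = line.split("\t")
    yield rating, 1


def reducer_count_ratings(rating, occurrences):
    # Sum the 1s emitted for this rating value.
    yield rating, sum(occurrences)


def run(lines):
    # Stand-in for the framework: group mapper output by key, then reduce.
    grouped = {}
    for line in lines:
        for rating, one in mapper_get_ratings(line):
            grouped.setdefault(rating, []).append(one)
    return dict(
        result
        for rating, ones in sorted(grouped.items())
        for result in reducer_count_ratings(rating, ones)
    )


rating_counts = run(SAMPLE_LINES)
print(rating_counts)
```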

Combiner: once a mapper has produced its key-value pairs, do part of the reduction work on the
mapper's node, such as aggregating the data before sending it to the reducer, to save network bandwidth.

Example: ./word_frequency_with_combiner.py
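
To make the bandwidth saving concrete, the sketch below counts words with and without a combiner and reports how many pairs would leave the mapper nodes. It is not the content of word_frequency_with_combiner.py; the input splits, the function names, and the pairs_sent counter are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical input splits, one list of lines per mapper node.
SPLITS = [
    ["big data big ideas", "data moves slowly"],
    ["big clusters move data"],
]


def mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def combiner(pairs):
    # Runs on the mapper's node: pre-sum counts so fewer pairs cross the network.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def reducer(word, counts):
    # Final sum over whatever partial counts arrived for this word.
    return word, sum(counts)


def run(splits, use_combiner):
    shuffled = defaultdict(list)
    pairs_sent = 0  # number of pairs that would be shipped to reducers
    for lines in splits:
        pairs = [pair for line in lines for pair in mapper(line)]
        if use_combiner:
            pairs = combiner(pairs)
        pairs_sent += len(pairs)
        for word, count in pairs:
            shuffled[word].append(count)
    totals = dict(reducer(word, counts) for word, counts in shuffled.items())
    return totals, pairs_sent


totals_plain, sent_plain = run(SPLITS, use_combiner=False)
totals_combined, sent_combined = run(SPLITS, use_combiner=True)
print(sent_plain, sent_combined)
```

Both runs give the same totals; only the number of pairs sent differs. This works because summing is associative, so partial sums can be summed again by the reducer.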

Attaching a config/data file to a MapReduce job so that it is available on every distributed node:
./most_popular_movie_with_name_lookup.py
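
The idea of a side file can be sketched as follows: each task loads a small lookup table into memory and uses it to turn movie IDs into names. This is a guess at the general shape, not the repo's script; the pipe-delimited "id|title|..." layout, the sample rows, the counts, and the function names are assumptions.

```python
import io

# Hypothetical lookup file contents; a real job would read the attached file from disk.
LOOKUP_FILE = io.StringIO(
    "50|Star Wars (1977)|01-Jan-1977\n"
    "258|Contact (1997)|11-Jul-1997\n"
)

# Hypothetical (movie_id, rating_count) pairs, as an earlier reduce step might produce.
RATING_COUNTS = [("50", 583), ("258", 509)]


def load_movie_names(handle):
    # Build a movie_id -> title dictionary once, before any records are processed.
    names = {}
    for line in handle:
        fields = line.rstrip("\n").split("|")
        names[fields[0]] = fields[1]
    return names


def most_popular(rating_counts, names):
    # Pick the movie with the most ratings and translate its ID to a title.
    movie_id, count = max(rating_counts, key=lambda pair: pair[1])
    return names.get(movie_id, movie_id), count


movie_names = load_movie_names(LOOKUP_FILE)
top_movie = most_popular(RATING_COUNTS, movie_names)
print(top_movie)
```

Falling back to the raw ID when a title is missing keeps the job from failing on an incomplete lookup file.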



How MapReduce scales / distributed computing:
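
One ingredient of scaling is that mapper output from many nodes is routed so that every pair with the same key reaches the same reducer. The sketch below shows hash partitioning under assumed names; NUM_REDUCERS, partition, and route are hypothetical, and the CRC32 hash is chosen here only because it is deterministic across runs.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # hypothetical number of reducer tasks


def partition(key, num_reducers=NUM_REDUCERS):
    # Deterministic hash so the same key always maps to the same reducer.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers


def route(pairs, num_reducers=NUM_REDUCERS):
    # Send each (key, value) pair to the bucket of its reducer.
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


# Hypothetical mapper output collected from several nodes.
PAIRS = [("196", 1), ("186", 1), ("196", 1), ("22", 1), ("196", 1)]
buckets = route(PAIRS)
for reducer_id, assigned in sorted(buckets.items()):
    print(reducer_id, assigned)
```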

Hadoop (Run MapReduce job in a distributed way)


HDFS (Hadoop Distributed File System): used by Hadoop to distribute the data that Hadoop
accesses across the cluster. YARN manages how Hadoop jobs are distributed across the cluster.


Apache YARN: Hadoop uses it to decide which mappers and reducers run where, how to connect
them all together, and to keep track of what is running.


AWS Elastic MapReduce

Tools
Python tool for big data: Enthought Canopy
mrjob package for MapReduce: in the Canopy editor, run !pip install mrjob

Sample data: http://grouplens.org/ -> datasets -> MovieLens 100K Dataset (ml-100k.zip)


Languages

Python 94.0% Perl 3.2% Shell 2.8%
