Hadoop Python MapReduce Tutorial for Beginners

By Matthew Rathbone on November 17 2013

This article originally accompanied my tutorial session at the Big
Data Madison Meetup, November 2013
(http://www.meetup.com/BigDataMadison/events/149122882/).

The goal of this article is to:

introduce you to the Hadoop Streaming library (the mechanism
which allows us to run non-JVM code on Hadoop)
teach you how to write a simple map reduce pipeline in Python
(single input, single output).
teach you how to write a more complex pipeline in Python
(multiple inputs, single output).

There are other good resources online about Hadoop Streaming, so
I’m going over old ground a little. Here are some good links:

1. Hadoop Streaming official documentation
(http://hadoop.apache.org/docs/r1.1.2/streaming.html)
2. Michael Noll’s Python Streaming Tutorial
(http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/)
3. An Amazon EMR Python streaming tutorial
(http://aws.amazon.com/articles/2294)
If you are new to Hadoop, you might want to check out my beginners
guide to Hadoop (/2013/04/17/what-is-hadoop.html) before digging
into any code (it’s a quick read, I promise!).
Setup

I’m going to use the Cloudera Quickstart VM
(http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo)
to run these examples.

Once you’re booted into the quickstart VM we’re going to get our
dataset. I’m going to use the play-by-play NFL data by Brian Burke
(http://www.advancednflstats.com/2010/04/play-by-play-data.html).
To start with we’re only going to use the data in his Git repository
(https://github.com/eljefe6a/nfldata).

Once you’re in the Cloudera VM, clone the repo:

cd ~/workspace
git clone https://github.com/eljefe6a/nfldata.git

To start we’re going to use stadiums.csv. However, this data was
encoded in Windows (grr) so it has ^M (carriage return) line
separators instead of newlines \n. We need to change the encoding
before we can play with it:

cd workspace/nfldata
cat stadiums.csv # BAH! Everything is a single line
dos2unix -l -n stadiums.csv unixstadiums.csv
cat unixstadiums.csv # Hooray! One stadium per line
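
If dos2unix isn’t installed in your VM, a few lines of Python can do the same normalization. This is a quick sketch, not part of the tutorial repo; it assumes the same filenames as above:

#!/usr/bin/env python
# Normalize Windows (\r\n) or old-Mac (\r) line endings to Unix (\n)
with open("stadiums.csv", "rb") as infile:
    data = infile.read()
with open("unixstadiums.csv", "wb") as outfile:
    outfile.write(data.replace(b"\r\n", b"\n").replace(b"\r", b"\n"))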

Hadoop Streaming Intro

The way you ordinarily run a map-reduce job is to write a Java program
with at least three parts:

1. A Main method which configures the job, and launches it
set # reducers
set mapper and reducer classes
set partitioner
set other hadoop configurations

2. A Mapper Class
takes K,V inputs, writes K,V outputs

3. A Reducer Class
takes K, Iterator[V] inputs, and writes K,V outputs

Hadoop Streaming is actually just a Java library that implements
these things, but instead of doing the work itself, it pipes data to
scripts. By doing so, it provides an API for other languages:

read from STDIN
write to STDOUT
Streaming has some (configurable) conventions that allow it to
understand the data returned. Most importantly, it assumes that
keys and values are separated by a \t. This is important for the rest
of the map reduce pipeline to work properly (partitioning and
sorting). To understand why, check out my intro to Hadoop
(/2013/04/17/what-is-hadoop.html), where I discuss the pipeline in
detail.
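
To make that contract concrete, here is a minimal sketch of the shape every streaming script takes. This one just emits each input line as a key with a count of 1 (illustrative only, not code from the repository):

#!/usr/bin/env python
# A streaming script is just a filter: records in on STDIN,
# tab-separated key/value pairs out on STDOUT.
import sys

for line in sys.stdin:
    key = line.strip()
    value = "1"
    print("%s\t%s" % (key, value))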

RUNNING A BASIC STREAMING JOB

It’s just like running a normal mapreduce job, except that you need to
provide some information about what scripts you want to use.

Hadoop comes with the streaming jar in its lib directory, so just find
that to use it. The job below counts the number of lines in our
stadiums file. (This is really overkill, because there are only 32
records.)

hadoop fs -mkdir nfldata/stadiums
hadoop fs -put ~/workspace/nfldata/unixstadiums.csv nfldata/stadiums

# the exact streaming jar filename varies by CDH version, so glob for it
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar \
  -Dmapred.reduce.tasks=1 \
  -input nfldata/stadiums \
  -output nfldata/output1 \
  -mapper cat \
  -reducer "wc -l"

# now we check our results:
hadoop fs -ls nfldata/output1

# looks like the files are there, let's get the result:
hadoop fs -text nfldata/output1/part*
# => 32

A good way to make sure your job has run properly is to look at the
jobtracker dashboard. In the quickstart VM there is a link in the
bookmarks bar.

You should see your job in the running/completed sections; clicking
on it brings up a bunch of information. The most useful data on this
page is under the Map-Reduce Framework section, in particular look for
stuff like:

Map Input Records
Map Output Records
Reduce Output Records

In our example, input records are 32 and output records is 1.
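
As an aside, your streaming scripts can bump counters of their own. The Hadoop Streaming documentation specifies the format: a reporter:counter:<group>,<counter>,<amount> line written to standard error. A tiny example (the group and counter names here are just illustrative):

import sys

# Increment a custom counter visible on the jobtracker dashboard
sys.stderr.write("reporter:counter:Stadiums,MalformedLines,1\n")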
A Simple Example in Python

Looking in columns.txt we can see that the stadium file has the
following fields:

Stadium (String) - The name of the stadium
Capacity (Int) - The capacity of the stadium
ExpandedCapacity (Int) - The expanded capacity of the stadium
Location (String) - The location of the stadium
PlayingSurface (String) - The type of grass, etc. that the stadium has
IsArtificial (Boolean) - Is the playing surface artificial
Team (String) - The name of the team that plays at the stadium
Opened (Int) - The year the stadium opened
WeatherStation (String) - The name of the weather station closest to the stadium
RoofType (Possible Values: None, Retractable, Dome) - The type of roof in the stadium
Elevation - The elevation of the stadium

Let’s use map reduce to find the number of stadiums with artificial
and natural playing surfaces.

The pseudo-code looks like this:

def map(line):
    fields = line.split(",")
    print(fields.isArtificial, 1)

def reduce(isArtificial, totals):
    print(isArtificial, sum(totals))

You can find the finished code in my Hadoop framework examples
repository (https://github.com/rathboma/hadoop-framework-examples).

IMPORTANT GOTCHA!

The reducer interface for streaming is actually different from the one
in Java. Instead of receiving reduce(k, Iterator[V]), your script is
actually sent one line per value, including the key.

So for example, instead of receiving:

reduce('TRUE', Iterator(1, 1, 1, 1))
reduce('FALSE', Iterator(1, 1, 1))

It will receive:

TRUE 1
TRUE 1
TRUE 1
TRUE 1
FALSE 1
FALSE 1
FALSE 1

This means you have to do a little state tracking in your reducer. This
will be demonstrated in the code below.
To follow along, check out my git repository
(https://github.com/rathboma/hadoop-framework-examples) (on the
virtual machine):

cd ~/workspace
git clone https://github.com/rathboma/hadoop-framework-examples.git
cd hadoop-framework-examples

MAPPER

#!/usr/bin/env python
# Emit "isArtificial<TAB>1" for every CSV record on STDIN
import sys

for line in sys.stdin:
    line = line.strip()
    unpacked = line.split(",")
    stadium, capacity, expanded, location, surface, turf, team, opened, weather, roof, elevation = unpacked
    results = [turf, "1"]
    print("\t".join(results))

REDUCER

#!/usr/bin/env python
import sys

# Example input (ordered by key)
# FALSE 1
# FALSE 1
# TRUE 1
# TRUE 1
# UNKNOWN 1
# UNKNOWN 1

# keys come grouped together, so we need to keep track of state a
# little bit: when the key (turf) changes, we need to reset our
# counter, and write out the count we've accumulated

last_turf = None
turf_count = 0

for line in sys.stdin:
    line = line.strip()
    turf, count = line.split("\t")
    count = int(count)

    # if this is the first iteration
    if not last_turf:
        last_turf = turf

    # if they're the same, log it
    if turf == last_turf:
        turf_count += count
    else:
        # state change (previous line was k=x, this line is k=y)
        result = [last_turf, turf_count]
        print("\t".join(str(v) for v in result))
        last_turf = turf
        turf_count = count  # start the new key's tally from this line's count

# this is to catch the final counts after all records have been received
print("\t".join(str(v) for v in [last_turf, turf_count]))

You might notice that the reducer is significantly more complex than
the pseudocode. That is because the streaming interface is limited
and cannot really provide a way to implement the standard API.
As noted, each line read contains both the KEY and the VALUE, so it’s up
to our reducer to keep track of key changes and act accordingly.
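
If you’d rather not hand-roll that bookkeeping, Python’s itertools.groupby can track the key changes for you. Here is an alternative sketch of the same reducer (my own variation, not the repository’s code); it assumes the same sorted, tab-separated input that Hadoop provides:

#!/usr/bin/env python
import sys
from itertools import groupby

def parse(stdin):
    # yield (key, count) tuples from tab-separated lines
    for line in stdin:
        turf, count = line.strip().split("\t")
        yield turf, int(count)

# groupby batches consecutive records that share a key, which is
# exactly what sorted streaming input gives us
for turf, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    total = sum(count for _, count in group)
    print("\t".join([turf, str(total)]))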

Don’t forget to make your scripts executable:

chmod +x simple/mapper.py
chmod +x simple/reducer.py

TESTING

Because our example is so simple, we can actually test it without
using hadoop at all:

cd streaming-python
cat ~/workspace/nfldata/unixstadiums.csv | simple/mapper.py | sort | simple/reducer.py
# FALSE 15
# TRUE 17

Looking good so far!

Running with Hadoop should produce the same output.

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming*.jar \
  -mapper mapper.py \
  -reducer reducer.py \
  -input nfldata/stadiums \
  -output nfldata/pythonoutput \
  -file simple/mapper.py \
  -file simple/reducer.py
# ...twiddle thumbs for a while
# ...twiddle thumbs for a while

hadoop fs -text nfldata/pythonoutput/part-*
FALSE 15
TRUE 17
A Complex Example in Python

Check out my advanced Python MapReduce guide
(/hadoop/2016/02/09/python-tutorial.html) to see how to join two
datasets together using Python.

Python MapReduce Book

While there are no books specific to Python MapReduce development,
the following book has some pretty good examples:

MASTERING PYTHON FOR DATA SCIENCE (http://amzn.to/2hVekf0)

While not specific to MapReduce, this book gives some examples of
using the Python 'HadoopPy' framework to write some MapReduce
code. It's also an excellent book in its own right.


Matthew Rathbone
Consultant Big Data Infrastructure Engineer at Rathbone Labs
(https://www.rathbonelabs.com). British. Data Nerd. Lucky husband and
father.
