ds2 3 Mapreduce

MapReduce
Design Patterns, Bloom filter

Agenda
● Anti Patterns
● Recap
● MapReduce cont’d
○ Join Pattern: Reduce Side Join (w/ Bloom Filter)
○ MetaPatterns (patterns about patterns)
○ Data Organization Patterns, Input/Output patterns
○ Summary
● Demo
○ Hadoop deployment
○ Linux commands
○ Bash
○ Awk
○ Hadoop streaming
Anti Patterns
Software Development AntiPatterns
● A key goal of development AntiPatterns is to describe useful forms of software
refactoring. Software refactoring is a form of code modification, used to improve the
software structure in support of subsequent extension and long-term maintenance. In
most cases, the goal is to transform code without impacting correctness.
Software Architecture AntiPatterns

● Architecture AntiPatterns focus on the system-level and enterprise-level structure of
applications and components. Although the engineering discipline of software
architecture is relatively immature, what has been determined repeatedly by software
research and experience is the overarching importance of architecture in software
development.
Software Project Management AntiPatterns

● In the modern engineering profession, more than half of the job involves human
communication and resolving people issues. The management AntiPatterns identify
some of the key scenarios in which these issues are destructive to software processes.
Cut-And-Paste Programming
AntiPattern Name:
Cut-and-Paste Programming
● Also Known As: Clipboard Coding, Software Cloning, Software Propagation
● Anecdotal Evidence:
○ "Hey, I thought you fixed that bug already, so why is it doing this again?"
○ "Man, you guys work fast. Over 400,000 lines of code in three weeks is outstanding progress!"
Background
● Cut-and-paste programming is a common form of reuse that creates maintenance nightmares;
● It comes from the notion that it's easier to modify existing software than to program
from scratch;
General Form
● This AntiPattern is identified by the presence of several similar segments of code
interspersed throughout the software project. Usually, the project contains many
programmers who are learning how to develop software by following the
examples of more experienced developers;
● However, they are learning by modifying code that has been proven to work in
similar situations, and potentially customizing it to support new data types or
slightly customized behavior. This creates code duplication, which may have
positive short-term consequences such as boosting line count metrics, which
may be used in performance evaluations;
● Furthermore, it's easy to extend the code as the developer has full control over
the code used in his or her application and can quickly meet short-term
modifications to satisfy new requirements.
Symptoms And Consequences
● The same software bug reoccurs throughout software despite many local fixes.
● Lines of code increase without adding to overall productivity.
● Code reviews and inspections are needlessly extended.
● It becomes difficult to locate and fix all instances of a particular mistake.
● Code is considered self-documenting.
● Code can be reused with a minimum of effort.
● This AntiPattern leads to excessive software maintenance costs.
● Software defects are replicated through the system.
● Reusable assets are not converted into an easily reusable and documented form.
● The Cut-and-Paste Programming form of reuse deceptively inflates the number of lines of code developed
without the expected reduction in maintenance costs associated with other forms of reuse.
Spaghetti Code
● AntiPattern Name: Spaghetti Code
● Anecdotal Evidence:
● "Ugh! What a mess!"
● "You do realize that the language supports more than one function, right?"
● "It's easier to rewrite this code than to attempt to modify it."
● "Software engineers don't write spaghetti code."
● "The quality of your software structure is an investment for future modification and
extension."
Background
● The Spaghetti Code AntiPattern is the classic and most famous AntiPattern;
● It has existed in one form or another since the invention of programming languages.
● Nonobject-oriented languages appear to be more susceptible to this AntiPattern, but it is
fairly common among developers who have yet to fully master the advanced concepts
underlying object orientation.
General Form
● Very little software structure;
● Coding and progressive extensions compromise the software structure
to such an extent that the structure lacks clarity, even to the original
developer, if he or she is away from the software for any length of time.
● If developed using an object-oriented language, the software may
include a small number of objects that contain methods with very
large implementations that invoke a single, multistage process flow.
● Furthermore, the object methods are invoked in a very predictable
manner, and there is a negligible degree of dynamic interaction
between the objects in the system. The system is very difficult to
maintain and extend, and there is no opportunity to reuse the objects
and modules in other similar systems.
● After code mining, only parts of objects and methods seem suitable
for reuse. Mining Spaghetti Code can often be a poor return on
investment; this should be taken into account before a decision to
mine is made.
● Methods are very process-oriented; sequential.
● Minimal relationships exist between objects.
● Many object methods have no parameters.
● Code is difficult to reuse, and when it is, it is often through cloning. In
many cases, however, code is never considered for reuse.
● Benefits of object orientation are lost; inheritance is not used to
extend the system; polymorphism is not used.
● Software quickly reaches a point of diminishing returns; the effort
involved in maintaining an existing code base is greater than the cost
of developing a new solution from the ground up.
Lava Flow
● AntiPattern Name: Lava Flow
● Also Known As: Dead Code
● Anecdotal Evidence: "Oh that! Well Ray and Emil (they're no longer with the company) wrote
that routine back when Jim (who left last month) was trying a workaround for Irene's input
processing code (she's in another department now, too). I don't think it's used anywhere now,
but I'm not really sure. Irene didn't really document it very clearly, so we figured we would
just leave well enough alone for now. After all, the bloomin' thing works doesn't it?!"
● Undocumented complex, important-looking functions, classes, or segments that don't clearly relate
to the system architecture.
● Whole blocks of commented-out code with no explanation or documentation.
● Unused (dead) code, just left in.
● Unused, inexplicable, or obsolete interfaces located in header files.
● If existing Lava Flow code is not removed, it can continue to proliferate as code is reused in other areas.
● If the process that leads to Lava Flow is not checked, there can be exponential growth as succeeding
developers, too rushed or intimidated to analyze the original flows, continue to produce new,
secondary flows as they try to work around the original ones; this compounds the problem.
● As the flows compound and harden, it rapidly becomes impossible to document the code or
understand its architecture enough to make improvements.
Summary Previous Lecture I
● Basic principles in parallel data processing;
○ Quarter counting, limits, Amdahl’s law
● MapReduce + Key Value Pair (KVP) paradigm;
○ Mapper
○ Reducer
○ Combiner?
● Apache Spark in-memory parallel processing framework (RDD engine);
○ Chaining jobs (in memory)
○ Lazy execution
● WordCount example in MapReduce and Spark.

Summary Previous Lecture II
● Hadoop Streaming
● Python MapReduce Design Patterns
○ Filter Pattern
■ Bloom Filter
○ Summarization Pattern
■ Std. Dev?
■ Median?
● Join Patterns
Bloom Filter Pattern
● Intent
■ Filter such that we keep records that are
members of some predefined set of values.
■ The predetermined list of values will be called the set of hot values.
■ The decision is approximate: the answer is either "maybe" or "no".

Bloom Filter
● Conceived by Burton Howard Bloom in 1970
● Test whether an element is a member of a set.
○ False positive matches are possible, but false negatives are not
○ "possibly in set" or "definitely not in set".

Bloom Filter: Algorithm
● An empty Bloom filter is a bit array of m bits, all set to 0.
● There must also be k different hash functions defined, each of
which hashes some set element to one of the m array positions,
generating a uniform random distribution.
● k is a constant, much smaller than m (k << m);
● m is proportional to the number of elements to be added;
● the precise choice of k and the constant of proportionality of m are
determined by the intended false positive rate of the filter.
Bloom Filter: Operations
● To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at
all these positions to 1.
● To query for an element (test whether it is in the set), feed it to each of the k hash functions
to get k array positions.
○ If any of the bits at these positions is 0, the element is definitely not in the set.
○ If all are 1, then either the element is in the set, or the bits have by chance been set to 1
during the insertion of other elements, resulting in a false positive.
● In a simple Bloom filter, there is no way to distinguish between the two
cases, but more advanced techniques can address this problem.
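The add and query operations above can be sketched in Python as follows. This is a minimal illustration, not a production implementation: the class name `BloomFilter`, the method name `might_contain`, and the choice of deriving the k positions from salted SHA-256 digests are assumptions made for the example and are not prescribed by the slides.

```python
import hashlib


class BloomFilter:
    """Simple Bloom filter: a bit array of m bits and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # empty filter: all bits are 0

    def _positions(self, item):
        # Assumption: simulate k hash functions by salting SHA-256
        # with the index of the hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        # Set the bits at all k positions to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False: "definitely not in set"; True: "possibly in set".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1000, k=3)
for element in ("x", "y", "z"):
    bf.add(element)
print(bf.might_contain("x"))  # True: an added element is never reported absent
print(bf.might_contain("w"))  # usually False, but a false positive is possible
```

Both operations touch exactly k positions, which is where the O(k) cost of adding and checking comes from.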
Bloom Filter - Example
● Example: a Bloom filter representing the set {x, y, z}.

● The colored arrows show the positions in the
bit array that each set element is mapped to.
● The element w is not in the set {x, y, z},
because it hashes to one bit-array position
containing 0. For this figure, m = 18 and k = 3.
Bloom Filter: Properties
● Removing an element is impossible;
● Fast membership decisions;
● Memory need grows with the number of elements and with a stricter false positive rate;
● Adding or checking an item takes O(k) time.
Bloom Filter: Probability of false positives
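● Assume that a hash function selects each array position with equal
probability (uniform).
● The probability that a given bit, one out of m, is not set to 1 by a single hash function
during an insertion is $1 - \frac{1}{m}$.
● If k is the number of hash functions, the probability that the bit is not set
to 1 by any of the hash functions is $\left(1 - \frac{1}{m}\right)^{k}$.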
Bloom Filter: Probability of false positives
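● After inserting n elements, the probability that a certain bit is still 0 is
$\left(1 - \frac{1}{m}\right)^{kn}$, so the probability that it is 1 is
$1 - \left(1 - \frac{1}{m}\right)^{kn}$.
● Now test membership of an element that is not in the set. A false positive occurs when all
k bits are 1, which happens with probability
$p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$.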

Bloom Filter: Optimal parameters
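● For a given m and n, the value of k that minimizes the false positive
probability is $k = \frac{m}{n} \ln 2$.
● Substituting this k into the formula $p \approx \left(1 - e^{-kn/m}\right)^{k}$ gives
$p = 2^{-k}$, that is, $\ln p = -\frac{m}{n} (\ln 2)^{2}$.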
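As a numeric illustration, the sketch below evaluates the standard approximation p ≈ (1 − e^(−kn/m))^k and the minimizing value k = (m/n) ln 2. The function names and the sizes m = 10,000 and n = 1,000 are hypothetical and chosen only for the example.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate false positive probability p = (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Real-valued k that minimizes p for a given m and n: (m/n) * ln 2."""
    return (m / n) * math.log(2)


m, n = 10_000, 1_000  # hypothetical sizes: 10 bits per stored element
k_best = optimal_k(m, n)  # about 6.93, so 7 hash functions in practice
print(f"optimal k = {k_best:.2f}")
for k in (1, 3, 7, 12):
    print(k, false_positive_rate(m, n, k))
```

With these sizes the rate falls as k grows toward 7 and rises again beyond it, because too many hash functions fill the bit array too quickly.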

False Positive Prob
Bloom Filter: Applications
● Akamai’s web servers use Bloom filters to prevent "one-hit-wonders" (web objects requested
only once) from being stored in its disk caches.
○ One-hit-wonders applied to nearly three-quarters of their caching infrastructure.
○ A Bloom filter is used to detect the second request for a web object.
○ Caching that object only on its second request prevents one-hit wonders from entering the disk
cache, significantly reducing disk workload and increasing disk cache hit rates.
Bloom Filter: Applications II
● Google Bigtable, Apache HBase and PostgreSQL use Bloom filters to
reduce the disk lookups for non-existent rows or columns. Avoiding
costly disk lookups considerably increases the performance of a
database query operation.
● The Google Chrome web browser used to use a Bloom filter to identify
malicious URLs. Any URL was first checked against a local Bloom filter,
and only if the Bloom filter returned a positive result was a full check of
the URL performed (and the user warned, if that too returned a positive
result).
Bloom Filter 🡪 Counting Filter
● Goal: implement a delete operation on a Bloom filter without recreating the
filter afresh.
● Each single bit is replaced by an n-bit counter, called a bucket.
● The insert operation is extended to increment the value of the buckets,
and the lookup operation checks that each of the required buckets is
non-zero. The delete operation then consists of decrementing the value
of each of the respective buckets.
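A counting filter along these lines could look like the following sketch. The class and method names and the salted SHA-256 hashing are illustrative assumptions. Python integers stand in for the n-bit counters, so counter overflow is not modeled here.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter variant whose buckets are counters, so items can be deleted."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.buckets = [0] * m  # counters instead of single bits

    def _positions(self, item):
        # Assumption: simulate k hash functions by salting SHA-256.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.buckets[pos] += 1

    def remove(self, item):
        # Only safe for items that were really added before;
        # otherwise counters of other items would be corrupted.
        for pos in self._positions(item):
            self.buckets[pos] -= 1

    def might_contain(self, item):
        return all(self.buckets[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=1000, k=3)
cbf.add("alice")
cbf.add("bob")
cbf.remove("alice")
print(cbf.might_contain("bob"))  # True: deleting "alice" leaves "bob" intact
```

The price of deletion is memory: every position now needs several bits instead of one.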
Join Patterns
● Inner join
● Left/right outer join
● Outer join
● Antijoin
Reduce Side Join Pattern
● Easy to implement
● Works with large datasets
● Slow
● Supports any type of join
● The foreign key becomes the output key; the entire input record becomes the output value
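The key/value layout of a reduce side join can be sketched in plain Python without a Hadoop cluster. The toy datasets, the "U"/"C" source tags and the function names are hypothetical; the in-memory `shuffle` only imitates what the framework does between the map and reduce phases.

```python
from collections import defaultdict

# Hypothetical toy inputs: users (user_id, reputation), comments (user_id, text).
users = [("u1", 2000), ("u2", 300)]
comments = [("u1", "Nice answer"), ("u2", "Thanks"), ("u3", "Orphan comment")]


def map_users(record):
    # Foreign key as output key, the whole record (tagged with its source) as value.
    return (record[0], ("U", record))


def map_comments(record):
    return (record[0], ("C", record))


def shuffle(pairs):
    # Imitates the framework: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_inner_join(key, values):
    left = [rec for tag, rec in values if tag == "U"]
    right = [rec for tag, rec in values if tag == "C"]
    # Inner join: emit only keys present on both sides.
    return [(u, c) for u in left for c in right]


pairs = [map_users(r) for r in users] + [map_comments(r) for r in comments]
joined = []
for key, values in sorted(shuffle(pairs).items()):
    joined.extend(reduce_inner_join(key, values))
print(joined)
```

Changing only the reducer turns this into an outer join or an antijoin, which is why the pattern supports any join type. Every record still crosses the network, which is why it is slow.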
Reduce Side Join Pattern: Slow
● Example: Stack Overflow data.
● We are interested in enriching comments with reputable users, i.e., users with a reputation
greater than 1,500.
● A standard reduce side join would use a condition to verify that a user’s reputation is
greater than 1,500 prior to writing to the context object. This requires all the data to be
parsed and forwarded to the reduce phase for joining.
● If we stop outputting data from the mappers that we know is not going to be
needed in the join, then we can drastically reduce network I/O.
● Bloom filter:
○ useful for an inner join operation;
○ may not be useful for a full outer join operation or an antijoin, since both need all
records to reach the reducers.
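The map-side filtering idea can be sketched as below. The data, the filter size `M`, the number of hash functions `K` and the helper names are hypothetical. The Bloom filter is assumed to be built once, before the join job, from the IDs of users whose reputation exceeds 1,500.

```python
import hashlib

M, K = 256, 3  # hypothetical filter size and number of hash functions


def positions(key):
    # Assumption: simulate K hash functions by salting SHA-256.
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
        yield int(digest, 16) % M


def build_filter(keys):
    bits = [0] * M
    for key in keys:
        for pos in positions(key):
            bits[pos] = 1
    return bits


def might_contain(bits, key):
    return all(bits[pos] for pos in positions(key))


users = {"u1": 2000, "u2": 300, "u3": 5000}  # user_id -> reputation
comments = [("u1", "a"), ("u2", "b"), ("u3", "c"), ("u2", "d")]

# Hot values: IDs of reputable users (reputation > 1,500).
hot = build_filter(uid for uid, rep in users.items() if rep > 1500)

# Map side: drop comments whose user is definitely not reputable.
shuffled = [(uid, text) for uid, text in comments if might_contain(hot, uid)]

# Reduce side: keep the exact check, because the filter can yield false positives.
joined = [(uid, users[uid], text) for uid, text in shuffled if users[uid] > 1500]
print(joined)  # [('u1', 2000, 'a'), ('u3', 5000, 'c')]
```

The final result is exact because a Bloom filter has no false negatives. A false positive only means that a few unneeded records are shuffled and then discarded by the reducer.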

Reduce Side Join Pattern w/ Bloom Filter
MetaPatterns
● Patterns about Patterns
● Pattern/Job refactoring
● Job Chaining
○ Multistage jobs
○ Oozie workflow mgmt
● Chain Folding
● Job Merging
Original chain and optimizing mappers
Optimized
Original chain and optimizing a reducer with a mapper
Optimized
Map/Reduce Optimization
Optimized
Job Merge
Optimized
Appendix
Design Patterns, Bloom filter

MapReduce Summary
● Each map task in Hadoop is broken into the following phases:
○ record reader: data 🡪 record
○ mapper: KV 🡪 (KV)*
○ combiner: group in map phase KVs 🡪 grouped KVs
○ partitioner: KV 🡪 block (rarely necessary to modify).
● The output of the map tasks, called the intermediate keys and values, is sent
to the reducers. The reduce tasks are broken into the following phases:
○ shuffle & sort: not customizable and the framework handles everything automatically
○ reducer: grouped KV 🡪 KV stats
○ output format: save to file
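The phases above can be imitated in a few lines of plain Python for a word count. This is a single-process sketch, not Hadoop code: the input splits, the function names and the toy partitioner are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def mapper(line):
    # KV -> (KV)*: one (word, 1) pair per word.
    return [(word, 1) for word in line.split()]


def combiner(pairs):
    # Pre-aggregate inside one map task to reduce shuffle traffic.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())


def partitioner(key, num_reducers):
    # Toy deterministic partitioner (assumption): byte sum modulo reducer count.
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    return (key, sum(values))


splits = [["a b a"], ["b c"]]  # two hypothetical input splits
num_reducers = 2
partitions = [defaultdict(list) for _ in range(num_reducers)]

for split in splits:  # one map task per split
    pairs = [kv for line in split for kv in mapper(line)]
    for key, value in combiner(pairs):
        partitions[partitioner(key, num_reducers)][key].append(value)

result = []
for part in partitions:  # one reduce task per partition
    for key in sorted(part):  # shuffle & sort: keys reach the reducer sorted
        result.append(reducer(key, part[key]))
print(sorted(result))  # [('a', 2), ('b', 2), ('c', 1)]
```

The combiner is safe here because addition is associative and commutative; the reducer sees partial sums instead of single ones and still produces the same totals.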

Hadoop Modes
● Local (Standalone) Mode
○ By default, Hadoop is configured to run in a non-distributed mode, as a single Java process.
This is useful for debugging.
● Pseudo-Distributed Mode
○ Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop
daemon runs in a separate Java process.
○ This is a simulated multi node environment based on a single node server.
● Fully-Distributed Mode
○ Before we start the distributed mode installation, we must ensure that we have the pseudo
distributed setup done and we have at least two machines, one acting as master and the other
acting as a slave.
bash
● ls, cd, mkdir, rm, mv, cp, rmdir, man, head, tail
● |, wc, grep, cut, sort, cat, less, more
● Vim, regexp, …
awk
● Cornerstone of Unix shell programming;
● Extremely useful in record processing (e.g. text file);
● Main awk file structure (details in the demo).
BEGIN{}
{}
END{}
