● Anecdotal Evidence:
● The same software bug recurs throughout the software despite many local fixes.
● Reusable assets are not converted into an easily reusable and documented form.
● The Cut-and-Paste Programming form of reuse deceptively inflates the number of lines of code developed, without the expected reduction in maintenance costs associated with other forms of reuse.
Spaghetti Code
● AntiPattern Name: Spaghetti Code
● Anecdotal Evidence: "Oh that! Well, Ray and Emil (they're no longer with the company) wrote that routine back when Jim (who left last month) was trying a workaround for Irene's input processing code (she's in another department now, too). I don't think it's used anywhere now, but I'm not really sure. Irene didn't really document it very clearly, so we figured we would just leave well enough alone for now. After all, the bloomin' thing works, doesn't it?!"
Symptoms And Consequences
● Undocumented, complex, important-looking functions, classes, or segments that don't clearly relate to the system architecture.
● If existing Lava Flow code is not removed, it can continue to proliferate as code is reused in other areas.
● If the process that leads to Lava Flow is not checked, there can be exponential growth as succeeding developers, too rushed or intimidated to analyze the original flows, continue to produce new, secondary flows as they try to work around the original ones; this compounds the problem.
● As the flows compound and harden, it rapidly becomes impossible to document the code or
understand its architecture enough to make improvements.
Summary of Previous Lecture I
● Basic principles in parallel data processing;
○ Mapper
○ Reducer
○ Combiner?
○ Lazy execution
○ Filter Pattern
■ Bloom Filter
○ Summarization Pattern
■ Std. Dev?
■ Median?
● Join Patterns
Bloom Filter Pattern
● Intent
■ Filter records, keeping only those that are members of some predefined set of values.
● k ≪ m hash functions, k constant (m = number of bits in the filter)
○ If any of the bits at these positions is 0, the element is definitely not in the set.
○ If all are 1, then either the element is in the set, or the bits have by chance been set to 1
during the insertion of other elements, resulting in a false positive.
● If k is the number of hash functions, the probability that a given bit is not set to 1 by any of the hash functions while inserting one element is (1 − 1/m)^k.
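The filter described above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the salted-SHA-256 hash family and the parameter values used below are assumptions made for the example.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions (k << m)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests (one simple
        # choice of hash family; real implementations use faster hashes).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # Any bit 0 -> definitely not in the set;
        # all bits 1 -> "maybe in the set" (possible false positive).
        return all(self.bits[p] for p in self._positions(item))
```

Note that `might_contain` can never return a false negative: every inserted element has all of its k bits set, so the only possible error is a false positive.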
Bloom Filter: Probability of false positives
● After inserting n elements, the probability that a certain bit is still 0 is (1 − 1/m)^(kn) ≈ e^(−kn/m); the probability of a false positive is therefore p ≈ (1 − e^(−kn/m))^k.
Bloom Filter: Applications I
● Using a Bloom filter to detect the second request for a web object, and caching that object only on its second request, prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.
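The "cache only on the second request" policy can be sketched as below. The class and method names are illustrative assumptions; a plain set stands in for the Bloom filter here, whereas a real system would use a Bloom filter precisely so that the "seen once" memory stays bounded.

```python
class SecondHitCache:
    """Sketch of the one-hit-wonder policy: admit an object to the
    cache only on its second request."""

    def __init__(self):
        self.seen = set()   # stand-in for the Bloom filter of seen URLs
        self.cache = {}     # the disk cache

    def request(self, url, fetch):
        if url in self.cache:
            return self.cache[url]      # cache hit
        obj = fetch(url)
        if url in self.seen:
            self.cache[url] = obj       # second request: admit to cache
        else:
            self.seen.add(url)          # first request: remember only
        return obj
```

One-hit wonders (objects requested exactly once) never enter `cache`, which is the point of the policy.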
Bloom Filter: Applications II
● Google Bigtable, Apache HBase and PostgreSQL use Bloom filters to
reduce the disk lookups for non-existent rows or columns. Avoiding
costly disk lookups considerably increases the performance of a
database query operation.
● Outer join
● Antijoin
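The two join types listed above can be illustrated on small in-memory tables. The helper functions and the sample data are assumptions made for the example; each table is a dict keyed by the join key.

```python
def left_outer_join(left, right):
    """Keep every left record, paired with the matching right record
    or None when there is no match."""
    return {k: (v, right.get(k)) for k, v in left.items()}

def antijoin(left, right):
    """Keep only the left records whose key has NO match on the right."""
    return {k: v for k, v in left.items() if k not in right}

users    = {1: "alice", 2: "bob", 3: "carol"}
comments = {1: "hi", 3: "hello"}
# left_outer_join(users, comments)[2] -> ("bob", None)
# antijoin(users, comments)           -> {2: "bob"}
```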
Reduce Side Join Pattern
● Easy to implement
● Slow: all input records are shuffled to the reducers
● Supports any join type
● Example: we want to enrich comments with data about reputable users, i.e., users with reputation greater than 1,500.
● A standard reduce side join would add a condition to verify that a user's reputation is greater than 1,500 prior to writing to the context object. This requires all the data to be parsed and forwarded to the reduce phase for joining.
● If we stop outputting data from the mappers that we know is not going to be needed in the join, we can drastically reduce network I/O.
● Bloom filter: train a filter on the IDs of reputable users, and have the comment mapper drop any record whose user ID is not (possibly) in the filter.
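The reputation example can be sketched as an in-memory simulation of a reduce side join; the record layout, the `run_join` helper, and the use of a plain dict for the shuffle are assumptions, not the Hadoop API. The key idea from the slides is that filtering happens in the mappers, before the shuffle.

```python
from collections import defaultdict

REPUTABLE = 1500  # threshold from the example: reputation > 1,500

def run_join(users, comments):
    """Simulated reduce side join: mappers tag records and filter early,
    so less data crosses the (simulated) shuffle; reducers join per key."""
    shuffle = defaultdict(list)
    # User mapper: drop non-reputable users instead of checking in the reducer.
    for u in users:
        if u["reputation"] > REPUTABLE:
            shuffle[u["id"]].append(("U", u))
    # Comment mapper: in the full pattern a Bloom filter over reputable
    # user ids would also prune comments here, before the shuffle.
    for c in comments:
        shuffle[c["user_id"]].append(("C", c))
    # Reducer: inner-join users with their comments, one key at a time.
    out = []
    for key, records in shuffle.items():
        us = [r for tag, r in records if tag == "U"]
        cs = [r for tag, r in records if tag == "C"]
        for u in us:
            for c in cs:
                out.append({"user": u["name"], "text": c["text"]})
    return out
```

Because the non-reputable user is filtered in the mapper, the reducer for that key sees only comment records and emits nothing.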
● Pattern/Job refactoring
● Job Chaining
○ Multistage jobs
● Chain Folding
● Job Merging
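Job chaining (multistage jobs) amounts to feeding one job's output into the next job's input. Below is a hedged pure-Python simulation of two chained map/reduce passes; `map_reduce` and the word-count/bucketing jobs are assumptions made for the example, not Hadoop code.

```python
from itertools import groupby

def map_reduce(records, mapper, reducer):
    """One simulated MapReduce pass: map, shuffle/sort by key, reduce."""
    mapped = [kv for r in records for kv in mapper(r)]
    mapped.sort(key=lambda kv: kv[0])          # the shuffle & sort phase
    out = []
    for key, group in groupby(mapped, key=lambda kv: kv[0]):
        out.extend(reducer(key, [v for _, v in group]))
    return out

# Job 1: word count.  Job 2: group words by their count.
words = ["a", "b", "a", "c", "a", "b"]
counts = map_reduce(words,
                    mapper=lambda w: [(w, 1)],
                    reducer=lambda w, ones: [(w, sum(ones))])
by_count = map_reduce(counts,
                      mapper=lambda kv: [(kv[1], kv[0])],
                      reducer=lambda n, ws: [(n, sorted(ws))])
# counts   -> [("a", 3), ("b", 2), ("c", 1)]
# by_count -> [(1, ["c"]), (2, ["b"]), (3, ["a"])]
```

Chain folding and job merging then ask whether such a chain can be collapsed, e.g. by composing consecutive mappers into one, to avoid materializing intermediate output.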
[Figure: original chain vs. optimized chain, folding consecutive mappers]
[Figure: original chain vs. optimized chain, folding a reducer with a mapper]
[Figure: map/reduce optimization, original vs. optimized]
[Figure: job merge, original vs. optimized]
Appendix
○ mapper: (K, V) → (K, V)*
● The output of the map tasks, called the intermediate keys and values, is sent to the reducers. The reduce tasks are broken into the following phases:
○ shuffle & sort: not customizable and the framework handles everything automatically
● Pseudo-Distributed Mode
○ Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
● Fully-Distributed Mode
○ Before we start the distributed-mode installation, we must ensure that the pseudo-distributed setup is done and that we have at least two machines, one acting as master and the other acting as a slave.
bash
● ls, cd, mkdir, rm, mv, cp, rmdir, man, head, tail
● Vim, regexp,…
awk
● Cornerstone of Unix shell programming;
● An awk program is a series of pattern-action rules; three special patterns structure every script:
    BEGIN { … }   # runs once, before any input is read
    { … }         # runs for every input line
    END { … }     # runs once, after all input is processed