
WHITEPAPER

The Lambda Architecture Simplified

DATE: April 2016


A Brief History of the Lambda Architecture
The surest sign you have invented something worthwhile is when several other people invent it
too. That means the creative pressure that gave birth to the idea is more general than your
particular situation.

Even when faced with the same pressures, people will approach an idea in different ways.
When Jay Kreps was developing Kafka at LinkedIn, he called it The Log. Facebook (being
Facebook) created several independent implementations of “stream-oriented processing”,
including Puma and TailerSwift. Twitter has the adorably named Summingbird. The jargon we
seem to be converging on for these kinds of systems is the Lambda Architecture.

Lambda Origin
In his book, Big Data: Principles and best practices of scalable real-time data systems, Nathan
Marz coined the term ‘Lambda Architecture’ to describe a generic, scalable and fault-tolerant
data processing architecture based on his experience working on distributed systems at
BackType and Twitter.

Lambda in a Nutshell
The gist of the Lambda Architecture is to model everything that goes on in a complex
computing system as an ordered, immutable log of events. Processing the data (say, totaling
up the number of website visitors) is completed as a series of transformations that output to
new tables or streams.

It is important to keep the input unchanged. By breaking data processing into independent
pieces, each with a defined input and output, you get closer to the ideal of purely functional
programming. Writing and testing each piece is made simpler and parallelization can be
automated. Parts of the dataflow can be replayed (say, when code changes or machines fail)
and tied together with other flows.
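
As a rough illustration of the idea above, here is a minimal Python sketch of an immutable event log with two independent transformations that each produce a new table. The `Event` schema, the function names, and the sample data are hypothetical and chosen only for this example. They are not taken from any particular Lambda implementation.

```python
from __future__ import annotations

from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    """One immutable record in the log (hypothetical schema)."""
    seq: int      # position in the ordered log
    page: str     # page that was visited
    visitor: str  # who visited it


def visits_per_page(log: tuple[Event, ...]) -> dict[str, int]:
    """Pure transformation: reads the log and returns a new table."""
    return dict(Counter(event.page for event in log))


def unique_visitors(log: tuple[Event, ...]) -> dict[str, int]:
    """A second, independent transformation over the same input."""
    seen: dict[str, set[str]] = {}
    for event in log:
        seen.setdefault(event.page, set()).add(event.visitor)
    return {page: len(visitors) for page, visitors in seen.items()}


# The log is a tuple of named tuples, so it cannot be edited in place.
LOG = (
    Event(0, "/home", "alice"),
    Event(1, "/pricing", "bob"),
    Event(2, "/home", "alice"),
    Event(3, "/home", "carol"),
)

visits = visits_per_page(LOG)
uniques = unique_visitors(LOG)

# Replaying the same log (after a crash or a code change) yields the same table.
assert visits == visits_per_page(LOG)
```

Because neither function touches its input, either one can be rewritten and rerun over the full log without affecting the other.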

This sequenced approach is valuable because it preserves data integrity and simplifies
troubleshooting. A long time ago, people who did 3D modeling would “carve” digital blocks
into the shapes they wanted. If they wanted to undo something 10 steps back, they were
largely out of luck. Then 3DStudio introduced a brilliant feature it called the “transform stack”.
The stack records every change to an object separately, and applies them in real time. This
allows the modeler to modify, add, remove, and even reorder their changes on the fly. A
sequenced approach to data pipelines is similar, providing a nifty solution for data reprocessing
when changes to code occur.

Autodesk 3DS Max Taper Modifier

So far, this is simply good data engineering hygiene. Any well-run batch processing or
map/reduce system will follow the same principles. There’s nothing special about stream
processing that makes immutable data flows work better.

Writing Data in Two Places


The special trick that makes Lambda “Lambda” is the technique of writing data to two places.
That’s one reason why the logo is the symbol “λ”. In effect, one half of a Lambda system
optimizes for space and the other optimizes for time. Lambda systems incorporate a slower,
high-capacity batch-processing system, and a faster stream-processing track. This allows
existing map/reduce systems to be upgraded with a new fast track. It also leaves the system of
record untouched, which is the main selling point for data teams looking to improve the
responsiveness of their data flows.
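
The double write can be sketched in a few lines of Python. This is a toy model under stated assumptions: `BatchStore`, `SpeedStore`, and `write` are hypothetical names, and both stores are plain in-memory objects standing in for a real batch system and a real stream processor.

```python
from collections import defaultdict


class BatchStore:
    """Stand-in for the slow, high-capacity system of record."""

    def __init__(self):
        self.records = []  # append-only; never updated in place

    def append(self, event):
        self.records.append(event)

    def recompute_counts(self):
        """Full batch job: rescan everything and rebuild the table."""
        counts = defaultdict(int)
        for event in self.records:
            counts[event["page"]] += 1
        return dict(counts)


class SpeedStore:
    """Stand-in for the fast track: a running aggregate updated per event."""

    def __init__(self):
        self.counts = defaultdict(int)

    def ingest(self, event):
        self.counts[event["page"]] += 1


def write(event, batch, speed):
    """The double write: every event goes to both places."""
    batch.append(event)
    speed.ingest(event)


batch, speed = BatchStore(), SpeedStore()
for page in ["/home", "/pricing", "/home"]:
    write({"page": page}, batch, speed)

fast_answer = dict(speed.counts)        # available immediately
slow_answer = batch.recompute_counts()  # available once the batch job runs
```

Note that the counting logic appears twice, once per store. The sketch makes that duplication visible on purpose.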

Lambda Architecture Diagram - http://lambda-architecture.net/

Lambda is an old and venerable technique. Document search engines of a certain age (e.g.,
Yahoo’s Vespa) often have a “slow” index that is compact but difficult to update. To
compensate they will also have a “fast” index, perhaps in memory, where changes are cached
until the next index rebuild. Under the hood a search will consult both indexes and merge the
results.
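
The two-index pattern described above might look roughly like the following Python sketch. It is not how Vespa or any specific engine is implemented. `TwoTierIndex` and its methods are hypothetical, and the sketch handles only additions, not updates or deletions.

```python
class TwoTierIndex:
    """Toy inverted index with a compact 'slow' tier and an in-memory 'fast' tier."""

    def __init__(self):
        self.slow = {}  # term -> frozenset of doc ids, rebuilt only in bulk
        self.fast = {}  # term -> set of doc ids added since the last rebuild

    def add(self, doc_id, text):
        """New documents go only to the fast tier."""
        for term in text.lower().split():
            self.fast.setdefault(term, set()).add(doc_id)

    def search(self, term):
        """Consult both tiers and merge the results."""
        term = term.lower()
        slow_hits = self.slow.get(term, frozenset())
        fast_hits = self.fast.get(term, set())
        return sorted(slow_hits | fast_hits)

    def rebuild(self):
        """Periodic rebuild: fold the fast tier into the slow tier, then clear it."""
        merged = {term: set(ids) for term, ids in self.slow.items()}
        for term, ids in self.fast.items():
            merged.setdefault(term, set()).update(ids)
        self.slow = {term: frozenset(ids) for term, ids in merged.items()}
        self.fast = {}


index = TwoTierIndex()
index.add(1, "lambda architecture")
index.rebuild()                      # doc 1 now lives in the slow tier
index.add(2, "stream architecture")  # doc 2 is only in the fast tier
hits = index.search("architecture")  # merged from both tiers
```

Every query pays the cost of checking two structures, which is the trade made to keep the slow tier compact.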

The problem is, the Lambda Architecture was an evolution on top of the slower batched index.
It is not certain that you would do it that way if you were building from scratch. Lucene, for
example, uses an incremental index for everything. Jay Kreps, in a thoughtful critique of
Lambda, points out that you need two implementations of the same queries and data flow.
And of course, you need two copies of the data. If you had a better streaming system, one that
could “read a table” simply by replaying a stream, why would you need both kinds of system?

The Lambda Architecture Isn’t


The Lambda Architecture isn’t. What it is, is a sensible set of data engineering practices, which
you should be applying anyway, plus a clever (but transitional) double-write approach to add a
low-latency fast track to existing big data systems. Throughout the rest of this guide, we will
detail the technologies and data processing requirements that will help you implement a
simplified Lambda Architecture.

Rethinking the Lambda Architecture
Most companies have responded to the influx of data by adapting their data management
strategy. However, managing streaming data still poses challenges for many enterprises.
Complicating the matter further, most enterprises need instant access to both historical and
real-time data, which require specific considerations and solutions. Of the many approaches to
managing real-time and historical data concurrently, the Lambda Architecture is by far the
most widely discussed and accepted today.

A Fork in the Road


Like the shape of the Greek letter, the Lambda Architecture forks into two paths: one
is a streaming (real-time) path, the other a batch path. Thus, it accommodates a real-time high-
speed data service along with an immutable data lake. Oftentimes a serving layer sits on top of
the streaming path to power applications or dashboards.

Many Internet-scale companies, like Pinterest, Zynga, Akamai, and Comcast, are using a
memory-optimized database to achieve the high-speed data component of the Lambda
Architecture. These companies are splitting the input stream to push data into both an in-
memory database and a data lake, like HDFS, in parallel.

In this era of ubiquitous big data, it is not enough for companies to merely process data.
Analyzing data to detect patterns, which can be immediately applied to maximizing operational
efficiency, is the real driver of business value.

MemSQL: A Complete Solution for Lambda


MemSQL delivers real-time analytics on a rapidly changing data set, making it an ideal match
for the characteristics of the Lambda Architecture speed service. Other data stores have
limitations that inhibit high-speed data ingestion, lack analytical capabilities, or cannot scale
affordably.

MemSQL offers a complete solution: the ability to handle millions of transactions per second
while performing complex multi-table join queries. Let’s dig into some of the key innovations
that make MemSQL an ideal solution for simplifying the Lambda Architecture.

Scalability

MemSQL uses a distributed, shared-nothing architecture that scales on commodity hardware
and local storage, supporting petabytes of data. MemSQL is a memory-first, relational database
that also offers a disk-based columnstore. In-memory optimization provides high-speed data
ingestion while simultaneously delivering analytics on the changing data set. The disk-based
columnstore manages historical data, so that historical trends can be combined with the
“hot” data to deliver real-time analytics.

Multi-model, Multi-mode

MemSQL supports the ingestion of structured, semi-structured, and unstructured data.

The flexibility to fit a structure to the data in support of analytics helps meet the business
requirements of the operation. Real-time analytics requires a real-time data structure, which
MemSQL supports through a fully relational model. MemSQL also ingests unstructured and
semi-structured (JSON) data as key-value pairs.

Full ANSI SQL support makes MemSQL readily accessible to data analysts, business analysts,
and data scientists, reducing application code requirements. Plugging data visualization and
query tools into the analytics architecture delivers immediate value from data to the business.

MemSQL also extends SQL with JSON support. Traversing a JSON document uses familiar SQL
syntax, with extensions for navigating the key-value pairs.

Open Source Connectors


MemSQL offers several connectors for smooth integration with additional data sources. One
example is MemSQL Streamliner: an integrated Apache Spark solution. Streamliner provides
easy deployment of Apache Spark — a critical component for building real-time data pipelines
that delivers advanced data enrichment and transformation. Another important connector is
the MemSQL Loader, which can import data from HDFS, as well as import and synchronize
data from Amazon S3.

Lambda In Production
In this section, we will take a look at examples from innovative companies using a Lambda
Architecture built for real-time data processing and exploration.

Real-Time Analytics at Comcast


Our first example comes from the Comcast Xfinity data team, who built a data processing
infrastructure that focuses on real-time operational analytics. Using a combination of MemSQL
and Hadoop, Comcast can proactively diagnose potential issues in an instant and deliver the
best possible video experience. The Comcast architecture writes one copy of data to a
MemSQL instance and a separate copy to Hadoop.

Comcast Log Collection and Real-Time Analytics Diagram. ~1 second: analysts query live data; alerts on complex objects. ~30 minutes: optimize CDN efficiency.

This enables Comcast to run real-time analytics on massive, ever-changing datasets, while also
making their analytics infrastructure more performant. Instead of just logging all Xfinity data
and analyzing it hours or days later, Comcast has the power to get both viewership and
infrastructure monitoring metrics the moment they occur.

HDFS provides a quasi-infinite data store where they can run machine learning jobs and other
“offline” analytics.

Watch the Comcast team’s recorded session from Strata+Hadoop World to learn how
Comcast architected their Xfinity platform to work with millions of users, process enormous
volumes of data and, at the same time, perform advanced real-time analytics.

Tapjoy Powers its Mobile Ad Platform
Tapjoy, the mobile app industry’s leading mobile marketing automation and monetization
platform, is processing and analyzing real-time and historical data concurrently to power its ad
platform.

Tapjoy optimizes ad performance by taking advantage of the speed and scalability of in-
memory computing. With the processing power to run 60,000 queries at a response time of
less than ten milliseconds, Tapjoy is able to cross-reference user data and serve higher-
performing ads to more than 500 million global users.

Above is a diagram of Tapjoy’s database architecture. For a more detailed look and
explanation, watch the session by David Abercrombie, Principal Data Analytics Engineer at
Tapjoy, at the In-Memory Computing Summit.

Conclusion
The pace of data is not slowing. Applications of today are built with infinite data sets in mind.
As these real-time applications become the norm, and batch processing becomes a relic of the
past, digital enterprises will implement memory-optimized, distributed data systems to simplify
Lambda Architectures for real-time data processing and exploration.

What should I do?


Start by asking questions. What data systems do you currently have in place? Are you
complicating matters with database infrastructure that can be consolidated? What applications
do you plan to build in the next week/month/year? How much data will be streaming into those
applications? How quickly will you need answers from your data set?

By answering questions like these, you will have a clear starting point for where to improve
your existing data management system, and how to prepare for the applications you plan to
build. From there, you can narrow which technologies to try for a proof of concept. If you
need help along the way, we would love to hear from you. Send us an email at
info@memsql.com or give us a call at (855) 463-6775.
