Bringing Learnings From Googley Microservices With GRPC - Varun Talwar, Google

Bringing learnings from Googley microservices with gRPC
Microservices Summit
Varun Talwar
Google confidential Do not distribute
Contents
1. Context: Why are we here?
2. Learnings from Stubby experience
a. HTTP/JSON doesn't cut it
b. Establish a Lingua Franca
c. Design for fault tolerance and control: Sync/Async, Deadlines, Cancellations, Flow control
d. Flying blind without stats
e. Diagnosing with tracing
f. Load Balancing is critical
3. gRPC
a. Cross platform matters !
b. Performance and Standards matter: HTTP/2
c. Pluggability matters: Interceptors, Name Resolvers, Auth plugins
d. Usability matters !
CONTEXT
WHY ARE WE HERE?
Business Agility
Developer Productivity
Performance
INTRODUCING
STUBBY
Microservices at Google
~O(10^10) RPCs per second.
Images by Connie Zhou
Stubby Magic @ Google
Making Google magic available to all
Borg
Kubernetes
Stubby
LEARNINGS FROM
STUBBY
Key learnings
1. HTTP/JSON doesn't cut it!
2. Establish a lingua franca
3. Design for fault tolerance and provide control knobs
4. Don't fly blind: Service Analytics
5. Diagnosing problems: Tracing
6. Load Balancing is critical
1 HTTP/JSON doesn't cut it!
1. WWW, browser growth - bled into services

2. Stateless
3. Text on the wire
4. Loose contracts
5. TCP connection per request
6. Nouns based
7. Harder API evolution
8. Think compute, network on cloud platforms
2 Establish a lingua franca
1. Protocol Buffers - Since 2003.

2. Start with IDL
3. Have a language agnostic way of agreeing on data semantics
4. Code Gen in various languages
5. Forward and Backward compatibility
6. API Evolution
How we roll at Google
Service Definition (weather.proto)
    syntax = "proto3";

    service Weather {
      rpc GetCurrent(WeatherRequest) returns (WeatherResponse);
    }

    message WeatherRequest {
      Coordinates coordinates = 1;

      message Coordinates {
        fixed64 latitude = 1;
        fixed64 longitude = 2;
      }
    }

    message WeatherResponse {
      Temperature temperature = 1;
      float humidity = 2;
    }

    message Temperature {
      float degrees = 1;
      Units units = 2;

      enum Units {
        FAHRENHEIT = 0;
        CELSIUS = 1;
        KELVIN = 2;
      }
    }
Google Cloud Platform

3 Design for fault tolerance and control
Sync and Async APIs
Need fault tolerance: Deadlines, Cancellations
Control Knobs: Flow control, Service Config, Metadata

gRPC Deadlines
First-class feature in gRPC.
Deadline is an absolute point in time.
Deadline indicates to the server how long the client is willing to wait for an answer.
RPC will fail with DEADLINE_EXCEEDED status code when deadline reached.
Deadline Propagation
The gateway starts the call with withDeadlineAfter(200, MILLISECONDS); every hop downstream sees the same absolute deadline.
Latencies labelled in the diagram: 40 ms, 20 ms, 60 ms, 90 ms, 20 ms.
Now = 1476600000000, Deadline = 1476600000200
Now = 1476600000040, Deadline = 1476600000200
Now = 1476600000150, Deadline = 1476600000200
Now = 1476600000230, Deadline = 1476600000200 (deadline passed: DEADLINE_EXCEEDED is returned at every hop back to the gateway)
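A minimal sketch of why an absolute deadline propagates cleanly: each hop compares its own clock with the same deadline value instead of restarting a timer. This is a plain Python simulation, not the gRPC API; the `Hop` type, the service names, and the per-hop costs (grouped so that the clock reads +40, +150 and +230 ms, as on the slide) are assumptions made for illustration.

```python
from dataclasses import dataclass

DEADLINE_EXCEEDED = "DEADLINE_EXCEEDED"
OK = "OK"


@dataclass
class Hop:
    name: str     # hypothetical service name
    cost_ms: int  # hypothetical time elapsed before this hop checks the deadline


def call_chain(hops, now_ms, deadline_ms):
    """Walk the chain; every hop checks the same absolute deadline."""
    for hop in hops:
        now_ms += hop.cost_ms
        if now_ms >= deadline_ms:
            # The hop gives up; the status travels back up to the caller.
            return DEADLINE_EXCEEDED, hop.name, now_ms
    return OK, None, now_ms


start = 1476600000000
deadline = start + 200  # the equivalent of withDeadlineAfter(200, MILLISECONDS)
hops = [Hop("service-a", 40), Hop("service-b", 110), Hop("service-c", 80)]

status, failed_at, now = call_chain(hops, start, deadline)
print(status, failed_at, now - start)
```

Because the deadline is a point in time rather than a duration, no hop needs to know how much time the hops before it already spent.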
Cancellation?
Deadlines are expected.
What about unpredictable cancellations?
User cancelled request.
Caller is not interested in the result any more.
etc.
Cancellation?
[Diagram: a gateway (GW) fans out to three rows of three services; every service is Busy and every link carries an Active RPC.]

Cancellation Propagation
[Diagram: the same topology after cancellation; the gateway and all nine services are Idle.]

Cancellation
Automatically propagated.
RPC fails with CANCELLED status code.
Cancellation status can be accessed by the receiver.
Server (receiver) always knows if RPC is valid!
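The propagation described above can be sketched as a tree of call contexts in which cancelling a parent cancels every child. This is an illustrative stand-in, not the gRPC context API; the `CallContext` class and the `do_work` helper are hypothetical names.

```python
import threading


class CallContext:
    """Hypothetical stand-in for an RPC context; not the gRPC API."""

    def __init__(self, parent=None):
        self._cancelled = threading.Event()
        self._children = []
        if parent is not None:
            parent._children.append(self)
            if parent.is_cancelled():
                self._cancelled.set()

    def cancel(self):
        # Cancel this call and every call made on its behalf.
        self._cancelled.set()
        for child in self._children:
            child.cancel()

    def is_cancelled(self):
        return self._cancelled.is_set()


def do_work(ctx, steps):
    """A receiver that checks whether the RPC is still valid before each step."""
    for done in range(steps):
        if ctx.is_cancelled():
            return "CANCELLED", done
    return "OK", steps


gateway = CallContext()
backend = CallContext(parent=gateway)
leaf = CallContext(parent=backend)

before = do_work(leaf, 3)
gateway.cancel()  # e.g. the user abandoned the request
after = do_work(leaf, 3)
print(before, after)
```

The leaf service never talks to the gateway directly, yet it stops working as soon as the gateway's call is cancelled.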
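BiDi Streaming - Slow Client
[Diagram: a slow client sends a Request to a fast server, which streams Responses back. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.]

BiDi Streaming - Slow Server
[Diagram: a fast client and a slow server exchange a Request and a Response, then the client streams Requests. Status codes shown: CANCELLED, UNAVAILABLE, RESOURCE_EXHAUSTED.]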

Flow-Control
Flow-control helps to balance computing power and network capacity between client and server.
gRPC supports both client- and server-side flow control.
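One way to picture flow control is a credit window: the sender may only have a fixed number of unconsumed messages outstanding, and the receiver returns a credit each time it consumes one. The simulation below is a sketch under that assumption; it does not model the actual HTTP/2 window-update mechanics, and `run_stream` and its parameters are hypothetical names.

```python
from collections import deque


def run_stream(total_messages, window, consume_every):
    """Simulate a fast sender and a slow receiver sharing a credit window.

    The sender emits at most one message per tick, and only while it holds
    credits. The receiver consumes one message every `consume_every` ticks
    and hands a credit back. Returns the peak buffer size and the tick count.
    """
    credits = window
    buffered = deque()
    sent = received = tick = 0
    max_buffered = 0
    while received < total_messages:
        tick += 1
        if sent < total_messages and credits > 0:
            buffered.append(sent)
            sent += 1
            credits -= 1
        max_buffered = max(max_buffered, len(buffered))
        if tick % consume_every == 0 and buffered:
            buffered.popleft()
            received += 1
            credits += 1
    return max_buffered, tick


peak, ticks = run_stream(total_messages=10, window=3, consume_every=4)
print(peak, ticks)
```

However fast the sender is, the receiver never has to buffer more than the window size, which is what prevents the RESOURCE_EXHAUSTED outcome in the slow-peer scenarios.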
Photo taken by Andrey Borisenko.

Service Config
Policies where the server tells clients what they should do
Can specify deadlines, lb policy, payload size per method of a service
Loved by SREs, they have more control
Discovery via DNS
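As an illustration of per-method policy, the sketch below parses a small service config and looks up the settings that apply to one method. The field names (`loadBalancingConfig`, `methodConfig`, `timeout`, `maxRequestMessageBytes`) follow the published gRPC service config JSON as far as I know, but treat the exact schema as an assumption and check it against the current documentation. The `GetForecast` method and the `method_config` helper are hypothetical.

```python
import json

# Example config a service owner might publish; values are illustrative only.
SERVICE_CONFIG_JSON = """
{
  "loadBalancingConfig": [{"round_robin": {}}],
  "methodConfig": [
    {
      "name": [{"service": "Weather", "method": "GetCurrent"}],
      "timeout": "0.2s",
      "maxRequestMessageBytes": 4096
    },
    {
      "name": [{"service": "Weather"}],
      "timeout": "1s"
    }
  ]
}
"""


def method_config(config, service, method):
    """Return the most specific entry: an exact method match wins over a service-wide one."""
    service_wide = None
    for entry in config.get("methodConfig", []):
        for name in entry["name"]:
            if name.get("service") != service:
                continue
            if name.get("method") == method:
                return entry
            if "method" not in name:
                service_wide = entry
    return service_wide


config = json.loads(SERVICE_CONFIG_JSON)
current = method_config(config, "Weather", "GetCurrent")
forecast = method_config(config, "Weather", "GetForecast")  # hypothetical method
print(current["timeout"], forecast["timeout"])
```

The point of the design is that the service owner, not each client team, decides the defaults, so a policy change does not require redeploying clients.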
Metadata helps in exchange of useful information
Metadata Exchange - Common cross-cutting concerns like authentication or tracing rely on the exchange of data that is not part of the declared interface of a service. Deployments rely on their ability to evolve these features at a different rate to the individual APIs exposed by services.
4 Don't fly blind: Stats
What is the mean latency time per RPC?
How many RPCs per hour for a service?
Errors in last minute/hour?
How many bytes sent? How many connections to my server?
Data collection by arbitrary metadata is useful
Any service's resource usage and performance stats in real time, broken down by (almost) any arbitrary metadata
1. Service X can monitor CPU usage in their jobs broken down by the name of the invoked RPC and the mdb user who sent it.
2. Social can monitor the RPC latency of shared bigtable jobs when responding to their requests, broken down by whether the request originated from a user on web/Android/iOS.
3. Gmail can collect usage on servers, broken down by POP/IMAP/web/Android/iOS. The stats layer propagates Gmail's metadata down to every service, even if the request was made by an intermediary job that Gmail doesn't own.
The stats layer exports data to varz and streamz, and provides stats to many monitoring systems and dashboards
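A minimal sketch of "stats broken down by arbitrary metadata": every RPC sample carries free-form tags, and any tag combination can be used as the grouping key afterwards. This is not the Stubby or gRPC stats API; the `RpcStats` class, the tag names, and the latency values are made up for illustration.

```python
from collections import defaultdict


class RpcStats:
    """Hypothetical in-memory stats collector keyed by arbitrary tags."""

    def __init__(self):
        self._samples = []  # list of (latency_ms, tags) pairs

    def record(self, latency_ms, **tags):
        self._samples.append((latency_ms, tags))

    def mean_latency_by(self, *keys):
        """Group samples by the given tag keys and return the mean latency per group."""
        buckets = defaultdict(list)
        for latency_ms, tags in self._samples:
            group = tuple(tags.get(key) for key in keys)
            buckets[group].append(latency_ms)
        return {group: sum(values) / len(values) for group, values in buckets.items()}


stats = RpcStats()
stats.record(12, method="GetCurrent", origin="web")
stats.record(20, method="GetCurrent", origin="android")
stats.record(28, method="GetCurrent", origin="android")

by_origin = stats.mean_latency_by("origin")
by_method = stats.mean_latency_by("method")
print(by_origin, by_method)
```

The same samples answer both "latency per client platform" and "latency per method" without either question having been planned for when the data was recorded.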
5 Diagnosing problems: Tracing
1 in 10K requests takes very long. It's an ad query :-) I need to find out why.
Take a sample and store it in a database; help identify requests in the sample which took a similar amount of time.
I didn't get a response from the service. What happened? Which link in the service dependency graph got stuck? Stitch a trace and figure it out.
Where is the time going for a trace? Hotspot analysis.
What are all the dependencies of a service?
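Hotspot analysis on a stitched trace can be sketched as computing each span's self time, that is, its duration minus the time spent in its child calls. The span records, service names, and durations below are invented, and the calculation assumes child calls run one after another rather than in parallel.

```python
from collections import defaultdict

# Hypothetical spans from one sampled request, already stitched by parent id.
spans = [
    {"id": "a", "parent": None, "service": "gateway", "ms": 230},
    {"id": "b", "parent": "a", "service": "ads", "ms": 190},
    {"id": "c", "parent": "b", "service": "bigtable", "ms": 150},
    {"id": "d", "parent": "a", "service": "auth", "ms": 10},
]


def self_times(spans):
    """Return time spent in each service itself, excluding its (sequential) children."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["ms"]
    return {span["service"]: span["ms"] - child_total[span["id"]] for span in spans}


times = self_times(spans)
hotspot = max(times, key=times.get)
dependencies = sorted(span["service"] for span in spans if span["parent"] is not None)
print(times, hotspot, dependencies)
```

The same stitched data also answers the dependency question: every service that appears as a child span is a dependency of the request's entry point.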
6 Load Balancing is important!
Iteration 1: Stubby Balancer

Iteration 2: Client side load balancing
Iteration 3: Hybrid
Iteration 4: gRPC-lb
Next gen of load balancing
Current client support is intentionally dumb (simplicity).
Pick first available - avoids connection establishment latency
Round-robin-over-list - lists, not sets, so repeated entries can represent weights
For anything more advanced, move the burden to an external "LB Controller", a regular gRPC server, and rely on a client-side implementation of the so-called gRPC LB policy.
gRPC LB flow:
1) Control RPC from the client to the LB Controller
2) The LB Controller returns an address-list
3) The client does RR over the addresses of backends in the address-list
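The two client-side policies are simple enough to sketch in a few lines. The version below is illustrative only and is not the gRPC implementation; the addresses and the reachability check are made up. It shows why a list, unlike a set, can carry weights: an address that appears twice receives twice the traffic.

```python
import itertools


def pick_first(addresses, is_reachable):
    """Return the first address that can be connected to, or None."""
    for address in addresses:
        if is_reachable(address):
            return address
    return None


def round_robin(address_list):
    """Cycle over the list as given; duplicate entries act as weights."""
    return itertools.cycle(address_list)


# Hypothetical address-list as an LB Controller might return it.
address_list = ["10.0.0.1:443", "10.0.0.1:443", "10.0.0.2:443"]

picker = round_robin(address_list)
picks = [next(picker) for _ in range(6)]
first = pick_first(address_list, lambda address: address.startswith("10.0.0.2"))
print(picks.count("10.0.0.1:443"), picks.count("10.0.0.2:443"), first)
```

Keeping the client this simple means that smarter policies can be rolled out by changing only the controller that produces the list.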
In summary, what did we learn
Contracts should be strict
Common language helps
Common understanding for deadlines, cancellations, flow control
Common stats/tracing framework is essential for monitoring, debugging
Common framework allows uniform policy application for control and lb
Single point of integration for logging, monitoring, tracing, service discovery and load balancing makes lives much easier!
INTRODUCING
gRPC
gRPC core
gRPC Java
gRPC Go
Open source on GitHub for C, C++, Java, Node.js, Python, Ruby, Go, C#, PHP, Objective-C
Where is the project today?
1.0 with stable APIs
Well documented with an active community
Reliable with continuous running tests on GCE
Deployable in your environment
Measured with an open performance dashboard
Well adopted inside and outside Google
More lessons
1. Cross language & Cross platform matters !

2. Performance and Standards matter: HTTP/2
3. Pluggability matters: Interceptors, Name Resolvers,
Auth plugins
4. Usability matters !
gRPC Principles & Requirements
Coverage & Simplicity
The stack should be available on every popular development platform and easy for someone to build for their platform of choice. It should be viable on CPU & memory limited devices.
http://www.grpc.io/blog/principles

gRPC Speaks Your Language
Service definitions and client libraries: Java, Go, C/C++, C#, Node.js, PHP, Ruby, Python, Objective-C
Platforms supported: MacOS, Linux, Windows, Android, iOS

Interoperability
[Diagram: services written in GoLang, Java, Python and C++, each exposing a gRPC Service and calling the others through gRPC Stubs.]

HTTP/2 in One Slide
Single TCP connection.
No head-of-line blocking.
Binary framing layer.
Request > Stream.
Header compression.
Protocol stack: Application (HTTP/2) with Binary Framing, Session (TLS) [optional], Transport (TCP), Network (IP)
HTTP/1.x request:
    POST /upload HTTP/1.1
    Host: www.javaday.org.ua
    Content-Type: application/json
    Content-Length: 27
    {msg: Welcome to 2016!}
HTTP/2 equivalent: a HEADERS frame followed by a DATA frame

Binary Framing
HTTP/2 breaks down the HTTP protocol communication into an exchange of binary-encoded frames, which are then mapped to messages that belong to a stream, and all of which are multiplexed within a single TCP connection.
[Diagram: one TCP connection carrying Stream 1, Stream 2, ... Stream N.]
Stream 1 request: HEADERS frame (:method: GET, :path: /kyiv, :version: HTTP/2, :scheme: https)
Stream 1 response: HEADERS frame (:status: 200, :version: HTTP/2, :server: nginx/1.10.1, ...) followed by a DATA frame (<payload>)
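To make the framing concrete, the sketch below encodes and decodes the 9-byte HTTP/2 frame header (24-bit payload length, 8-bit type, 8-bit flags, one reserved bit plus a 31-bit stream identifier) and shows frames from two streams interleaved on one byte stream. The payloads are placeholder bytes, not real HPACK-compressed headers, and the helper names are hypothetical.

```python
import struct

FRAME_TYPES = {0x0: "DATA", 0x1: "HEADERS"}
END_STREAM = 0x1
END_HEADERS = 0x4


def encode_frame(frame_type, flags, stream_id, payload):
    """Prefix the payload with a 9-byte HTTP/2 frame header."""
    length = len(payload)
    header = struct.pack(
        ">BHBBI",
        length >> 16, length & 0xFFFF,  # 24-bit length split as 8 + 16 bits
        frame_type,
        flags,
        stream_id & 0x7FFFFFFF,         # top bit is reserved
    )
    return header + payload


def decode_frames(data):
    """Split a byte stream back into (type name, stream id, payload) tuples."""
    frames = []
    offset = 0
    while offset < len(data):
        high, low, frame_type, flags, stream = struct.unpack_from(">BHBBI", data, offset)
        length = (high << 16) | low
        start = offset + 9
        payload = data[start:start + length]
        frames.append((FRAME_TYPES.get(frame_type, hex(frame_type)), stream & 0x7FFFFFFF, payload))
        offset = start + length
    return frames


# Two requests share one connection; their frames are interleaved on the wire.
wire = (
    encode_frame(0x1, END_HEADERS, 1, b"hdrs-1")
    + encode_frame(0x1, END_HEADERS, 3, b"hdrs-3")
    + encode_frame(0x0, END_STREAM, 1, b"payload")
)
frames = decode_frames(wire)
print(frames)
```

The stream identifier in every header is what lets the receiver reassemble each message independently, so one slow exchange does not hold up the others at the HTTP layer.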

HTTP/1.x vs HTTP/2
http://http2.golang.org/gophertiles
http://www.http2demo.io/

gRPC Service Definitions
Unary: Unary RPCs where the client sends a single request to the server and gets a single response back, just like a normal function call.
Server streaming: The client sends a request to the server and gets a stream to read a sequence of messages back. The client reads from the returned stream until there are no more messages.
Client streaming: The client sends a sequence of messages to the server using a provided stream. Once the client has finished writing the messages, it waits for the server to read them and return its response.
BiDi streaming: Both sides send a sequence of messages using a read-write stream. The two streams operate independently. The order of messages in each stream is preserved.
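The four call shapes can be mimicked with plain Python functions and generators, where an iterator stands in for a stream. This is only an analogy to show the shapes; it uses no gRPC library, and the handler names and behaviours are invented.

```python
def unary(request):
    """One request in, one response out."""
    return request.upper()


def server_streaming(request):
    """One request in, a stream of responses out."""
    for character in request:
        yield character


def client_streaming(requests):
    """A stream of requests in, one response out once the client is done."""
    return sum(requests)


def bidi_streaming(requests):
    """A stream in and a stream out; order within each stream is preserved."""
    for request in requests:
        yield request * 2


unary_result = unary("kyiv")
server_result = list(server_streaming("abc"))
client_result = client_streaming(iter([1, 2, 3]))
bidi_result = list(bidi_streaming(iter([1, 2, 3])))
print(unary_result, server_result, client_result, bidi_result)
```

In a real bidirectional stream the two directions are independent, so the server is free to respond before, during, or after reading; the generator above simply shows the one-response-per-request case.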

BiDi Streaming Use-Cases
Messaging applications.
Games / multiplayer tournaments.
Moving objects.
Sport results.
Stock market quotes.
Smart home devices.
You name it!
Performance
Open Performance Benchmark and Dashboard

Benchmarks run in GCE VMs per Pull Request for regression testing.
gRPC Users can run these in their environments.
Good Performance across languages:
Java Throughput: 500 K RPCs/Sec and 1.3 M Streaming messages/Sec on 32 core VMs
Java Latency: ~320 us for unary ping-pong (netperf 120us)
C++ Throughput: ~1.3 M RPCs/Sec and 3 M Streaming Messages/Sec on 32 core VMs.
More lessons

3. Pluggability matters: Interceptors, Auth
Pluggable
Large distributed systems need security, health-checking, load-balancing and failover, monitoring, tracing, logging, and so on. Implementations should provide extension points to allow for plugging in these features and, where useful, default implementations.

Interceptors
[Diagram: a Request passes from the Client through client interceptors, then through server interceptors to the Server; the Response travels back along the same path.]
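An interceptor chain can be sketched as functions that wrap the next handler and may inspect metadata, short-circuit the call, or observe the response. This is a generic illustration rather than the interceptor API of any gRPC language binding; the function names, the metadata key, and the token value are assumptions.

```python
import functools

calls = []  # simple in-memory log used by the logging interceptor


def chain(handler, interceptors):
    """Wrap the handler so that the first interceptor in the list runs outermost."""
    wrapped = handler
    for interceptor in reversed(interceptors):
        wrapped = functools.partial(interceptor, next_call=wrapped)
    return wrapped


def logging_interceptor(request, metadata, next_call):
    calls.append(("start", request))
    response = next_call(request, metadata)
    calls.append(("end", response))
    return response


def auth_interceptor(request, metadata, next_call):
    # Hypothetical check: a fixed bearer token carried in request metadata.
    if metadata.get("authorization") != "Bearer demo-token":
        return "UNAUTHENTICATED"
    return next_call(request, metadata)


def handler(request, metadata):
    return f"hello {request}"


server = chain(handler, [logging_interceptor, auth_interceptor])
ok = server("kyiv", {"authorization": "Bearer demo-token"})
denied = server("kyiv", {})
print(ok, denied, calls)
```

Because authentication and logging live in the chain, the handler itself stays free of cross-cutting concerns and either feature can be swapped without touching service code.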

Pluggability
Auth & Security - TLS [Mutual], Plugin auth mechanism (e.g. OAuth)
Proxies
Basic: nghttp2, haproxy, traefik
Advanced: Envoy, linkerd, Google LB, Nginx (in progress)
Service Discovery
etcd, Zookeeper, Eureka,
Monitor & Trace
Zipkin, Prometheus, Statsd, Google, DIY
Get Started
Coming soon !
1. Server reflection
2. Health Checking
3. Automatic retries
4. Streaming compression
5. Mechanism to do caching
6. Binary Logging
a. Debugging, auditing though costly
7. Unit Testing support
a. Automated mock testing
b. Don't need to bring up all dependent services just to test
8. Web support
Some early adopters
Microservices: in data centres
Client Server communication/Internal APIs
Streaming telemetry from network devices
Mobile Apps
Thank you!
Twitter: @grpcio
Site: grpc.io
Group: grpc-io@googlegroups.com
Repo: github.com/grpc
github.com/grpc/grpc-java
github.com/grpc/grpc-go
Q&A
Why gRPC?
Multi-language: 9 languages
Open: Open source and growing community
Strict Service contracts: Define and enforce contracts, backward compatible
Performant: 1m+ QPS - unary, 3m+ streaming (dashboard)
Pluggable design: Auth, Transport, IDL, LB
Efficiency on wire: 2-3X gains
Streaming APIs: Large payloads, speech, logs
Standard compliant: HTTP/2
Easy to use: Single line installation

The Fallacies of Distributed Computing
The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous
https://blogs.oracle.com/jag/resource/Fallacies.html

How is gRPC Used?
Direct RPCs: Microservices (On Prem, GCP, Other Cloud)
RPCs to access APIs: Google APIs, Your APIs
Mobile/Web RPCs: Your Mobile/Web Apps
What are the benefits?
For Developers, Operators, Architects/Managers:
Ease of use
Uniform Monitoring
Defined Contracts
Performance
Debugging/Tracing
Single uniform framework for control
Versioning
Cross platform/language
Visibility
Programming model
Layered
Key facets of the stack must be able to evolve independently. A revision to the wire-format should not disrupt application layer bindings.

Layered Architecture
Stub (code generated): Service API, for standard applications
Channel API (support code): initialization, interceptors, and advanced applications
Transport API

Layered Architecture
RPC client-side app: Future Stub, Blocking Stub, Stub; ClientCall; Channel with pluggable load balancing and service discovery (NameResolver, LoadBalancer); Transport #1, Transport #2, ... Transport #N
RPC server-side apps: Service Definition (extends generated definition); ServerCall handler; ServerCall; Transport
Wire protocol: HTTP/2

Takeaways
HTTP/2 is a high performance, production-ready, multiplexed bidirectional protocol.
gRPC (http://grpc.io):
HTTP/2 transport based, open source, general purpose, standards-based, feature-rich RPC framework.
Bidirectional streaming over one single TCP connection.
Netty transport provides asynchronous and non-blocking I/O.
Deadline and cancellation propagation.
Client- and server-side flow-control.
Layered, pluggable and extensible.
Supports 10 programming languages.
Built-in testing support.
Production-ready (current version is 1.0.1) and growing ecosystem.

Growing Ecosystem
gRPC Gateway
https://github.com/grpc-ecosystem/grpc-gateway
Migration.
Testing.
Swagger / OpenAPI tooling.
Photo taken by Andrey Borisenko.

Metadata and Auth
Protocol Structure
Request <Call Spec> <Header Metadata> <Messages>*
Response <Header Metadata> <Messages>* <Trailing Metadata> <Status>
Generic mechanism for attaching metadata to requests and responses
Commonly used to attach bearer tokens to requests for Auth
OAuth2 access tokens
JWT, e.g. OpenID Connect ID Tokens
Session state for specific Auth mechanisms is encapsulated in an Auth-credentials object
