
The Evolution of Testing Methodology at AWS:
From Status Quo To Formal Methods With TLA+

Tim Rath
Principal Engineer
AWS Database Services
Amazon.com

1
AWS Landscape

2
Services comprise large fleets
of servers decomposed into
smaller services

3
Many of which experience
sustained exponential growth

4
S3 experienced exponential growth for 6
years to reach 1 trillion objects stored;
less than a year later it reached 2 trillion
objects [1]

5
DynamoDB processes millions of
transactions per second in a single
AWS region around the clock [2]

6
Systems and data are managed
through subtle concurrent and
distributed algorithms

7
“Must Haves” of Every Service

Security
Durability
Availability
Scalability
8
General Test Strategy

• Developers
– Unit Tests
– Integration Tests

• QA (release testing)
– Functional Tests
– Performance Tests
– Stress Tests
– Failure Tests

9
Test Adequacy Criteria
• The literature expresses test adequacy as a form of measurable code
coverage [3]

• Status Quo Criteria:
– Statement coverage
• Most common adequacy criterion employed today
• Tools readily available to measure and report it
• Extremely weak criterion
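
To illustrate why statement coverage is a weak criterion, here is a minimal sketch. The function `safe_ratio` and both helper functions are hypothetical and not from the talk. A single test executes every statement, yet a failing path remains untested.

```python
def safe_ratio(numerator, denominator, clamp):
    """Hypothetical function with a latent bug on one execution path."""
    divisor = denominator
    if clamp:
        # Only this branch protects against a zero denominator.
        divisor = max(denominator, 1)
    return numerator / divisor


def statement_coverage_suite():
    """One test that executes every statement in safe_ratio."""
    assert safe_ratio(10, 0, clamp=True) == 10
    return True


def uncovered_path_fails():
    """The path that skips the branch was never exercised, and it crashes."""
    try:
        safe_ratio(10, 0, clamp=False)
    except ZeroDivisionError:
        return True
    return False
```

A coverage tool would report full statement coverage for `statement_coverage_suite`, even though `uncovered_path_fails` shows that the same input with `clamp=False` raises an exception.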

10
Test Adequacy Criteria
• The literature expresses test adequacy as a form of measurable code
coverage [3]

• Perfect Criteria:
– Cover every execution path across every possible state
for the system
• States may be infinite
• Test space grows exponentially with path length
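
As a rough illustration of the exponential growth, assume (purely for the sake of the example) that every step of an execution offers the same fixed number of choices. The number of distinct paths is then the number of choices raised to the path length. The function names below are hypothetical.

```python
from itertools import product


def count_paths(choices_per_step, path_length):
    """Count paths by brute-force enumeration of every sequence of choices."""
    return sum(1 for _ in product(range(choices_per_step), repeat=path_length))


def closed_form(choices_per_step, path_length):
    """The same count without enumeration: choices ** length."""
    return choices_per_step ** path_length
```

With only two choices per step, a path of length 10 already yields 1024 paths, and the count doubles with each additional step, which is why exhaustive path testing of an implementation quickly becomes impractical.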
11
Test Adequacy Criteria

• Real world practice further defines test adequacy criteria through an ad-hoc process:
– Brain-storm test scenarios
– Brain-storm stress test workloads
– Brain-storm failure scenarios

12
Better testing of distributed algorithms

• We look for strategies that:
– Help to understand and protect algorithm invariants
– Help expand test coverage
– Allow thorough testing as early as possible in the
development process

13
Development starts with specification

Strategy that helps to understand algorithm invariants

14
Testing Starts With Development
• Write tests and test support structure while working
on the implementation
– The spirit of “Test Driven Development” without subscribing
to the specific formula
• It is unclear how useful adhering to strict TDD concepts
really is [4]
• The emphasis it puts on test, and test thoroughness, is
where the value is [5]

Strategy that helps expand test coverage

15
Assert system invariants
• Asserts enforce the specification in code
• Strong assert statements come from clear
understanding of the specification

Strategy that helps to protect algorithm invariants
Strategy that helps expand test coverage

16
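
A minimal sketch of asserting invariants in code follows. The `ReplicaLog` class and its invariants are hypothetical and are not taken from any AWS system; they only show how a specification statement ("the commit index never moves backwards and never exceeds the log length") can become assert statements.

```python
class ReplicaLog:
    """Hypothetical replica log used only to illustrate invariant asserts."""

    def __init__(self):
        self.entries = []
        self.commit_index = 0  # number of entries known to be committed

    def _check_invariants(self):
        # Specification: the commit index always lies within the log.
        assert 0 <= self.commit_index <= len(self.entries), \
            "commit index out of range"

    def append(self, entry):
        self.entries.append(entry)
        self._check_invariants()

    def commit(self, upto):
        # Specification: the commit index never moves backwards
        # and never runs ahead of the entries actually stored.
        assert self.commit_index <= upto <= len(self.entries), \
            "illegal commit index"
        self.commit_index = upto
        self._check_invariants()


def violates(action):
    """Return True if calling action() trips an assertion."""
    try:
        action()
    except AssertionError:
        return True
    return False
```

Any test that drives the object, including randomized or stress tests, then checks the specification on every operation for free.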
Generative Testing

Formalization around randomized testing with
invariant or “property” checking

17
Generative Testing
[Flowchart]
1) Test Case Generator
2) Test Execution
3) Validation Of Properties Against Result (input: Expected Properties Of Result)
4) Properties Violated?
– No: Loop
– Yes: Report Failure Case
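
The loop in the flowchart above can be sketched with nothing but the standard library. Everything here is an illustrative assumption: the system under test (`buggy_dedupe`), the generator, and the properties are hypothetical, and a real project would more likely use a QuickCheck-style library.

```python
import random


def generate_case(rng):
    """Test case generator: a random list of small integers."""
    return [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]


def buggy_dedupe(items):
    """Hypothetical system under test. Bug: removes only adjacent duplicates."""
    out = []
    for x in items:
        if not out or out[-1] != x:
            out.append(x)
    return out


def correct_dedupe(items):
    """Reference behavior: drop duplicates, keep first-seen order."""
    return list(dict.fromkeys(items))


def properties_hold(case, result):
    """Expected properties: no duplicates remain, and no element is lost."""
    return len(set(result)) == len(result) and set(result) == set(case)


def run_generative_test(system, iterations=1000, seed=0):
    """Generate, execute, validate, loop; return the first failing case."""
    rng = random.Random(seed)
    for _ in range(iterations):
        case = generate_case(rng)
        result = system(case)
        if not properties_hold(case, result):
            return case  # report failure case
    return None  # no property violation found
```

Calling `run_generative_test(buggy_dedupe)` returns an input that violates the properties, while `run_generative_test(correct_dedupe)` returns `None`.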

18
Anecdotes From QuickCheck Paper [6]
• Made them think harder about properties;
document the specification
• Needed to think about the input domain to
exercise less probable paths

Strategy that helps to understand and protect algorithm invariants
Strategy that helps expand test coverage

19
In Process Clusters
Process

1) Insert arbitrary code into the communication channel
2) Run multiple nodes in the same process
3) Ability to pipeline the inserted bits as part of unit test code
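
A minimal sketch of an in-process cluster, under stated assumptions: the `Channel`, `Node`, `build_cluster`, and `drop_messages_to` names are hypothetical and do not describe the actual AWS test harness. Several nodes run in one process and talk over an in-memory channel, and a test can insert arbitrary code (a hook) into that channel.

```python
from collections import deque


class Channel:
    """In-memory communication channel shared by all nodes in the process."""

    def __init__(self):
        self.queue = deque()
        self.hooks = []  # each hook maps a message to a message, or None to drop it

    def send(self, dest, payload):
        message = (dest, payload)
        for hook in self.hooks:
            message = hook(message)
            if message is None:
                return  # the test code chose to drop this message
        self.queue.append(message)

    def deliver_all(self, nodes):
        """Deliver queued messages one at a time, under test control."""
        while self.queue:
            dest, payload = self.queue.popleft()
            nodes[dest].receive(payload)


class Node:
    """Hypothetical node that replicates every local write to its peers."""

    def __init__(self, name, channel, peers):
        self.name = name
        self.channel = channel
        self.peers = peers
        self.values = []

    def write(self, value):
        self.values.append(value)
        for peer in self.peers:
            self.channel.send(peer, value)

    def receive(self, payload):
        self.values.append(payload)


def build_cluster(names):
    """Create all nodes in the current process, wired to one channel."""
    channel = Channel()
    nodes = {n: Node(n, channel, [p for p in names if p != n]) for n in names}
    return channel, nodes


def drop_messages_to(target):
    """Hook that simulates a partition by dropping traffic to one node."""
    def hook(message):
        dest, _ = message
        return None if dest == target else message
    return hook
```

A unit test can then build a three-node cluster, append `drop_messages_to("c")` to `channel.hooks`, perform a write on node `a`, and check that node `b` received it while node `c` did not.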

20
In Process Clusters
• Helps write better integration tests
– Allows easy construction of intricate test scenarios
– Integration testing of distributed components in a
unit test environment
• Helps write better stress tests
– Direct control at the communication layer
– Much faster, lower-overhead test cycles

Strategy that helps expand test coverage
Strategy that allows thorough testing as early as possible in the development process

21
Informal Proofs

• Requires deep thinking which promotes
better understanding of the algorithms
• Hard to get right; can still lead to a false
sense of security

Strategy that helps to understand algorithm invariants

22
Formal Methods
• Precise specification of algorithms
• Tools to validate correctness
• We surveyed some of the systems and languages for
writing formal specifications, and ended up finding
what we were looking for in TLA+ [7]
• TLA+ model checking explores all possible executions over all possible
system states of an algorithm design, within the bounds of the model
23
TLA+
[Figure: diagram with repeated “read” and “write” labels]
24
TLA+
• How the model works:
– Init == <set of initial system states>
– Next == <set of possible next actions>
• That’s it!

Strategy that helps to understand algorithm invariants
Strategy that helps expand test coverage
Strategy that allows thorough testing as early as possible in the development process

25
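
The Init/Next structure described above can be mimicked in a few lines of Python. This is not TLA+ and not the TLC model checker, only a sketch of the idea: start from the initial states, repeatedly apply every possible next action, and check an invariant on every reachable state. The example system (two processes that each read a shared counter and then write back the value plus one) is a hypothetical illustration.

```python
from collections import deque


def init_states():
    """Init: counter is 0 and both processes are about to read."""
    return {(0, ("read", 0), ("read", 0))}


def next_states(state):
    """Next: any single process takes its next step (read, then write)."""
    counter, procs = state[0], list(state[1:])
    for i, (pc, local) in enumerate(procs):
        if pc == "read":
            new_counter, new_proc = counter, ("write", counter)
        elif pc == "write":
            new_counter, new_proc = local + 1, ("done", local)
        else:
            continue  # this process has finished
        updated = procs.copy()
        updated[i] = new_proc
        yield (new_counter, *updated)


def invariant(state):
    """Once every process is done, the counter must equal the process count."""
    counter, procs = state[0], state[1:]
    all_done = all(pc == "done" for pc, _ in procs)
    return (not all_done) or counter == len(procs)


def check(init, next_fn, inv):
    """Breadth-first search of all reachable states; return a violating trace."""
    parent = {s: None for s in init}
    frontier = deque(parent)
    while frontier:
        state = frontier.popleft()
        if not inv(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for succ in next_fn(state):
            if succ not in parent:
                parent[succ] = state
                frontier.append(succ)
    return None
```

Running `check(init_states(), next_states, invariant)` returns a shortest trace in which both processes read 0 before either writes, so the final counter is 1 rather than 2. An exhaustive search finds this interleaving without anyone having to brainstorm it as a test scenario.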
Real World Examples
• DynamoDB
– Replication protocols
– Membership handling
– Quorum Configuration Changes
• Other AWS projects [8]
– Low level distributed network protocol
– Internal distributed lock manager
– S3, EC2, EBS system management algorithms
26
Thank You!

we’re hiring
rath@amazon.com

27
[1] Barr, J. Amazon S3-The First Trillion Objects. Amazon Web Services Blog. June 2012;
https://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-objects-11-million-requests-second/
[2] Hamilton, J. Challenges in Designing at Scale: Formal Methods in Building Robust
Distributed Systems. Perspectives Blog. July 2014;
http://perspectives.mvdirona.com/2014/07/03/ChallengesInDesigningAtScaleFormalMethodsInBuildingRobustDistributedSystems.aspx

[3] Zhu, H., et al. Software Unit Test Coverage and Adequacy. ACM Computing Surveys,
Vol. 29, No. 4, December 1997;
http://www.cs.toronto.edu/~chechik/courses07/csc410/p366-zhu.pdf
[4] Bulajic, A., Sambasivam, S., Stojic, R. Overview of the Test Driven Development
Research Projects and Experiments. Proceedings of Informing Science & IT Education
Conference (InSITE), 2012;
http://proceedings.informingscience.org/InSITE2012/InSITE12p165-187Bulajic0052.pdf
[5] Dalke, A. Problems with TDD. Dalke Scientific (dalkescientific.com), December 2009;
http://www.dalkescientific.com/writings/diary/archive/2009/12/29/problems_with_tdd.html
28
[6] Claessen, K., Hughes, J. QuickCheck: A Lightweight Tool for Random Testing of
Haskell Programs. ACM SIGPLAN Notices Volume 35 Issue 9, Sept. 2000;
http://www.eecs.northwestern.edu/~robby/courses/395-495-2009-fall/quick.pdf
[7] Newcombe, C., et al. Use of Formal Methods at Amazon Web Services. (pending ACM
publication), November 2013;
http://research.microsoft.com/en-us/um/people/lamport/tla/amazon.html
[8] Newcombe, C. Why Amazon Chose TLA+. Lecture Notes in Computer Science Volume
8477, June 2014; http://link.springer.com/chapter/10.1007%2F978-3-662-43652-3_3

29
