
Testing data streaming applications
Lars Albertsson, independent consultant
Øyvind Løkling, Schibsted Media Group

www.mapflat.com
Who’s talking?
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (very large machines)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant

Why stream processing?

Business
● Increasing number of data-driven features
● 90+% fed by batch processing
○ Simpler, better tooling
○ 1+ hour data reaction time
● Stream processing for
○ 100 ms - 1 hour reaction time
○ Decoupled, asynchronous microservices

[Diagram: data sources (user behaviour, user content, professional content, system diagnostics, ads / partners) feeding business intelligence, exploration, experiments, curated content, data-based features, recommendations, and ads pushing]
The organic past

● Many paths
● Synchronous
● Link failure -> chain failure
● Heterogeneous
● Difficult to recover from transformation bugs

[Diagram: apps and services wired together ad hoc: aggregated logs, queues, HTTP, polling, hourly DB dumps over NFS and scp, ETL into a data warehouse]
The unified log

● Publish data in streams
● Replicated, sharded append-only log
● Pub / sub with history
○ Kafka, Google Pub/Sub, AWS Kinesis
● Tap to data lake for batch processing

[Diagram: apps (ads, search, feed) publish to streams in a unified log, tapped to a data lake]
Stream processing

● Decoupled producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
● Recovers from link failures
● Replay on job bug fix

[Diagram: layered jobs consume and produce streams, with taps to a data lake and business intelligence]
Stream processing building blocks
● Aggregate
○ Calculate time windows
○ Aggregate state (in memory / local database / shared database)
● Filter
○ Slim down stream
○ Privacy, security concerns
● Join
○ Enrich by joining with datasets, e.g. geo IP lookup, demographics
○ Join streams within time windows, e.g. click-through rate
● Transform
○ Bring data into same “shape”, schema
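Stripped of any framework, the four building blocks above can be sketched on a plain in-memory sequence (illustrative Scala; the field names and the geo IP map are invented for the example, and a real job would use Spark Streaming or Kafka Streams operators instead of collection methods):

```scala
// Illustrative sketch of the four building blocks on an in-memory event
// stream; names and data are invented for the example.
case class Event(userId: String, ip: String, timestamp: Long, action: String)

val geoIp: Map[String, String] = Map("10.0.0.1" -> "SE") // stand-in for a geo IP dataset

val events = Seq(
  Event("u1", "10.0.0.1", 500L, "CLICK"),
  Event("u1", "10.0.0.1", 1500L, "CLICK"),
  Event("u2", "8.8.8.8", 1600L, "internal_ping")
)

val transformed = events.map(e => e.copy(action = e.action.toLowerCase)) // Transform: normalise shape
val filtered = transformed.filter(_.action != "internal_ping")           // Filter: slim down stream
val enriched = filtered.map(e => (e, geoIp.getOrElse(e.ip, "unknown")))  // Join: enrich with dataset
val counts = enriched.groupBy { case (e, _) => e.timestamp / 1000 }      // Aggregate: 1 s tumbling windows
  .map { case (window, es) => window -> es.size }
```

With the sample events above, the internal ping is filtered out and each remaining click lands in its own one-second window.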

Stream processing technologies
● Spark Streaming
○ Ideal if you are already using Spark, same model
○ Bridges gap between data science / data engineers, batch and stream
● Kafka Streams
○ Library - new, positions itself as a lightweight alternative
○ Tightly coupled to Kafka
● Others
○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda

Egress

● Update database table, e.g. for polling dashboard
● Create service index table n+1. Notify service to switch.
● Post to external web service
● Push stream to client

[Diagram: stream jobs writing to a service that backs an app]
Test concepts

[Diagram: a test harness wraps the system under test (SUT). Test input flows through the SUT to a test oracle, across seams. The harness includes test fixtures such as 3rd party components (e.g. a DB), IDEs, build tools, and a test framework (e.g. JUnit, Scalatest).]
Potential test scopes

● Unit
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client

Choose stable interfaces.
Each scope has a cost.

[Diagram: app -> service -> chain of stream jobs, with candidate scope boundaries]
Stream application properties
● Output = function(input, code)
○ Perfect for testing!
○ Avoid: non-deterministic processing, reading the wall clock
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
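One way to keep output a pure function of input is to inject the clock instead of reading wall-clock time inside the job. A minimal sketch (the names Clock, FixedClock, and EventWindower are illustrative, not from the talk):

```scala
// Inject a clock so jobs never read wall-clock time directly.
trait Clock { def now(): Long }

object SystemClock extends Clock { def now(): Long = System.currentTimeMillis() }

// A fixed clock makes time-dependent logic deterministic in tests.
final class FixedClock(t: Long) extends Clock { def now(): Long = t }

// Assigns events to tumbling windows and flags late events, using the
// injected clock rather than the wall clock.
class EventWindower(clock: Clock, windowMillis: Long) {
  def windowOf(eventTime: Long): Long = eventTime / windowMillis * windowMillis
  def isLate(eventTime: Long, allowedLagMillis: Long): Boolean =
    clock.now() - eventTime > allowedLagMillis
}
```

Production wires in SystemClock; tests wire in FixedClock, so the same input always yields the same output.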

Recommended scopes

● Single job
● Multiple jobs
● Pipeline, including service

[Diagram: app -> service -> chain of stream jobs, with the recommended scope boundaries marked]
Scopes to avoid

● Unit
○ Few stable interfaces
○ Not necessary
○ Avoid mocks, DI rituals
● Full system, including client
○ Client automation fragile

“Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg
Stream application, example harness

[Diagram: Scalatest drives Spark Streaming jobs. Test input is pushed to a Kafka topic running in Docker; the jobs write to a DB, which the test oracle polls. IDE / Gradle provide IDE, CI, and debug integration.]
Test lifecycle

1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture

For absence tests, send dummy sync messages at the end.

[Diagram: lifecycle steps mapped onto the harness: Docker fixtures, Scalatest, Spark jobs, IDE / Gradle]

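Step 6, polling with a timeout, can be sketched as a small helper (illustrative; the talk does not prescribe an implementation, and the name pollUntil is invented here):

```scala
// Minimal sketch of step 6: poll until a predicate holds or a deadline passes.
import scala.annotation.tailrec

def pollUntil(timeoutMillis: Long, intervalMillis: Long = 1L)(done: () => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMillis

  @tailrec
  def loop(): Boolean =
    if (done()) true                                       // oracle satisfied
    else if (System.currentTimeMillis() >= deadline) false // timed out
    else { Thread.sleep(intervalMillis); loop() }

  loop()
}
```

In the harness, the predicate would query the database and compare against the test oracle; a false return fails the test with a timeout.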
Input generation
● Input & output are denormalised & wide
● Fields are frequently changed
○ Additions are compatible
○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
○ Unmaintainable
● Input generation routines
○ Robust to changes, reusable
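An input generation routine can be as simple as a case class with defaults, so each test sets only the fields it cares about (a minimal sketch; the event type and field names are invented for the example):

```scala
// Minimal sketch of an input generation routine: a constructor with sensible
// defaults, so tests only override the fields relevant to them.
case class ClickEvent(
  userId: String = "user-1",
  url: String = "http://example.com/",
  timestamp: Long = 0L,
  country: String = "SE"
)

// Robust to change: adding a new field with a default does not break
// existing call sites, and the routine is reusable across tests.
def clickEvents(n: Int, country: String = "SE"): Seq[ClickEvent] =
  (1 to n).map(i => ClickEvent(userId = s"user-$i", timestamp = i * 1000L, country = country))
```

Contrast with static JSON files, where every field addition forces an edit of every fixture file.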

Test oracles
● Compare with expected output
● Check fields relevant for test
○ Robust to field changes
○ Reusable for new, similar types
● Tip: Use lenses
○ JSON: JsonPath (Java), Play JSON (Scala)
○ Case classes: Monocle
● Express invariants for each data type
○ Reuse for production data quality monitoring
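A field-focused oracle can be sketched as a projection onto the fields under test, plus a per-type invariant (illustrative Scala; a real codebase might use Monocle lenses, and the names here are invented):

```scala
// Oracle that checks only the fields relevant to the test, so it survives
// additions of new fields to the data type.
case class EnrichedClick(userId: String, url: String, country: String, enrichedAt: Long)

// Projection onto the fields under test (a lens in the loosest sense).
def relevantFields(e: EnrichedClick): (String, String) = (e.userId, e.country)

def assertMatches(actual: Seq[EnrichedClick], expected: Seq[(String, String)]): Unit =
  assert(actual.map(relevantFields).sorted == expected.sorted,
    s"mismatch: ${actual.map(relevantFields)} vs $expected")

// Invariant expressed per data type; reusable for production data
// quality monitoring.
def invariant(e: EnrichedClick): Boolean =
  e.userId.nonEmpty && e.country.length == 2
```

The oracle deliberately ignores enrichedAt, a field a naive exact comparison would trip over.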

Data pipeline = yet another program
Don’t veer from best practices

● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...

In data engineering, slipping on best practices is part of the culture... :-(

● Mix in solid backend engineers
● Document “golden path”
Testing with cloud services
● PaaS components do not work locally
○ Cloud providers should provide fake implementations
○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
○ Distribute access tokens, etc
○ Pay $ or $$$

Top anti-patterns
1. Test as afterthought or in production
Data processing applications are well suited for testing!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework - vendor lock-in
7. Using wall clock time
8. Embedded fixture components

Thank you. Questions?
Credits:

Øyvind Løkling, Schibsted Media Group

● Content inspiration
Confluent, LinkedIn, Google, Netflix, Apache Samza
● Images
Tracey Saxby, Integration and Application Network, University of Maryland
Center for Environmental Science (ian.umces.edu/imagelibrary/).

Bonus slides

Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy

Obtaining quality metrics

● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
○ Reuse test oracle invariants in production

[Diagram: Hadoop / Spark counters and a quality assessment job push metrics to a DB]
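"Odd code path => bump counter" can be sketched with a plain counter registry standing in for Spark/Hadoop counters (illustrative; the names Counters and parseAge are invented, and a real Spark job would use accumulators):

```scala
// Plain counter registry standing in for Spark/Hadoop counters.
import scala.collection.mutable

object Counters {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)
  def bump(name: String): Unit = counts(name) += 1
  def get(name: String): Long = counts(name)
}

// Odd code path: count the bad record instead of crashing the job,
// then monitor the counter as a quality metric.
def parseAge(raw: String): Option[Int] =
  raw.toIntOption match {
    case Some(a) if a >= 0 => Some(a)
    case _ =>
      Counters.bump("invalid_age")
      None
  }
```

The counter values are what the quality assessment pipeline would later push to the DB for monitoring.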
Quality testing in the process

● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for publishing dataset
● Push aggregates to DB
○ Standard ops: monitor, alert

[Diagram: a code change (∆) is validated in CI; dataset aggregates are compared with history in a DB]
