
Testing data streaming applications
Lars Albertsson, independent consultant
Øyvind Løkling, Schibsted Media Group

www.mapflat.com
Who’s talking?
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (very large machines)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant

Why stream processing?

Business
● Increasing number of data-driven features
● 90+% fed by batch processing
○ Simpler, better tooling
○ 1+ hour data reaction time
● Stream processing for
○ 100 ms - 1 hour reaction time
○ Decoupled, asynchronous microservices

[Diagram: data sources (user behaviour, user content, professional content, system diagnostics, ads / partners) feeding business intelligence, exploration, experiments, curated content, data-based features, recommendations, and ads pushing]
The organic past

● Many paths
● Synchronous
● Link failure -> chain failure
● Heterogeneous
● Difficult to recover from transformation bugs

[Diagram: apps and services wired together ad hoc: aggregated logs, queues, HTTP, polling, hourly DB dumps over NFS and scp, ETL into a data warehouse]
The unified log

● Publish data in streams
● Replicated, sharded append-only log
● Pub / sub with history
○ Kafka, Google Pub/Sub, AWS Kinesis
● Tap to data lake for batch processing

[Diagram: apps (ads, search, feed) publish to streams in a unified log, tapped to a data lake]
Stream processing

● Decoupled producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
● Recovers from link failures
● Replay on job bug fix

[Diagram: layered jobs consume and produce streams, with taps to a data lake and business intelligence]
Stream processing building blocks
● Aggregate
○ Calculate time windows
○ Aggregate state (in memory / local database / shared database)
● Filter
○ Slim down stream
○ Privacy, security concerns
● Join
○ Enrich by joining with datasets, e.g. geo IP lookup, demographics
○ Join streams within time windows, e.g. click-through rate
● Transform
○ Bring data into same “shape”, schema
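Stripped of any framework, the four building blocks above can be sketched on a plain in-memory sequence (illustrative Scala; the field names and the geo IP map are invented for the example, and a real job would use Spark Streaming or Kafka Streams operators instead of collection methods):

```scala
// Illustrative sketch of the four building blocks on an in-memory event
// stream; names and data are invented for the example.
case class Event(userId: String, ip: String, timestamp: Long, action: String)

val geoIp: Map[String, String] = Map("10.0.0.1" -> "SE") // stand-in for a geo IP dataset

val events = Seq(
  Event("u1", "10.0.0.1", 500L, "CLICK"),
  Event("u1", "10.0.0.1", 1500L, "CLICK"),
  Event("u2", "8.8.8.8", 1600L, "internal_ping")
)

val transformed = events.map(e => e.copy(action = e.action.toLowerCase)) // Transform: normalise shape
val filtered = transformed.filter(_.action != "internal_ping")           // Filter: slim down stream
val enriched = filtered.map(e => (e, geoIp.getOrElse(e.ip, "unknown")))  // Join: enrich with dataset
val counts = enriched.groupBy { case (e, _) => e.timestamp / 1000 }      // Aggregate: 1 s tumbling windows
  .map { case (window, es) => window -> es.size }
```

With the sample events above, the internal ping is filtered out and each remaining click lands in its own one-second window.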

Stream processing technologies
● Spark Streaming
○ Ideal if you are already using Spark, same model
○ Bridges gap between data science / data engineers, batch and stream
● Kafka Streams
○ Library - new, positions itself as a lightweight alternative
○ Tightly coupled to Kafka
● Others
○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda

Egress

● Update database table, e.g. for polling dashboard
● Create service index table n+1. Notify service to switch.
● Post to external web service
● Push stream to client

[Diagram: stream jobs writing to a service that backs an app]
Test concepts

[Diagram: a test harness wraps the system under test (SUT). Test input flows through the SUT to a test oracle, across seams. The harness includes test fixtures such as 3rd party components (e.g. a DB), IDEs, build tools, and a test framework (e.g. JUnit, Scalatest).]
Potential test scopes

● Unit
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client

Choose stable interfaces.
Each scope has a cost.

[Diagram: app -> service -> chain of stream jobs, with candidate scope boundaries]
Stream application properties
● Output = function(input, code)
○ Perfect for testing!
○ Avoid: non-deterministic processing, reading the wall clock
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
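One way to keep output a pure function of input is to inject the clock instead of reading wall-clock time inside the job. A minimal sketch (the names Clock, FixedClock, and EventWindower are illustrative, not from the talk):

```scala
// Inject a clock so jobs never read wall-clock time directly.
trait Clock { def now(): Long }

object SystemClock extends Clock { def now(): Long = System.currentTimeMillis() }

// A fixed clock makes time-dependent logic deterministic in tests.
final class FixedClock(t: Long) extends Clock { def now(): Long = t }

// Assigns events to tumbling windows and flags late events, using the
// injected clock rather than the wall clock.
class EventWindower(clock: Clock, windowMillis: Long) {
  def windowOf(eventTime: Long): Long = eventTime / windowMillis * windowMillis
  def isLate(eventTime: Long, allowedLagMillis: Long): Boolean =
    clock.now() - eventTime > allowedLagMillis
}
```

Production wires in SystemClock; tests wire in FixedClock, so the same input always yields the same output.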

Recommended scopes

● Single job
● Multiple jobs
● Pipeline, including service

[Diagram: app -> service -> chain of stream jobs, with the recommended scope boundaries marked]
Scopes to avoid

● Unit
○ Few stable interfaces
○ Not necessary
○ Avoid mocks, DI rituals
● Full system, including client
○ Client automation fragile

“Focus on functional system tests, complement with smaller where you cannot get coverage.” - Henrik Kniberg
Stream application, example harness

[Diagram: Scalatest drives Spark Streaming jobs. Test input is pushed to a Kafka topic running in Docker; the jobs write to a DB, which the test oracle polls. IDE / Gradle provide IDE, CI, and debug integration.]
Test lifecycle

1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture

For absence tests, send dummy sync messages at the end.

[Diagram: lifecycle steps mapped onto the harness: Docker fixtures, Scalatest, Spark jobs, IDE / Gradle]

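Step 6, polling with a timeout, can be sketched as a small helper (illustrative; the talk does not prescribe an implementation, and the name pollUntil is invented here):

```scala
// Minimal sketch of step 6: poll until a predicate holds or a deadline passes.
import scala.annotation.tailrec

def pollUntil(timeoutMillis: Long, intervalMillis: Long = 1L)(done: () => Boolean): Boolean = {
  val deadline = System.currentTimeMillis() + timeoutMillis

  @tailrec
  def loop(): Boolean =
    if (done()) true                                       // oracle satisfied
    else if (System.currentTimeMillis() >= deadline) false // timed out
    else { Thread.sleep(intervalMillis); loop() }

  loop()
}
```

In the harness, the predicate would query the database and compare against the test oracle; a false return fails the test with a timeout.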
Input generation
● Input & output are denormalised & wide
● Fields are frequently changed
○ Additions are compatible
○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
○ Unmaintainable
● Input generation routines
○ Robust to changes, reusable
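An input generation routine can be as simple as a case class with defaults, so each test sets only the fields it cares about (a minimal sketch; the event type and field names are invented for the example):

```scala
// Minimal sketch of an input generation routine: a constructor with sensible
// defaults, so tests only override the fields relevant to them.
case class ClickEvent(
  userId: String = "user-1",
  url: String = "http://example.com/",
  timestamp: Long = 0L,
  country: String = "SE"
)

// Robust to change: adding a new field with a default does not break
// existing call sites, and the routine is reusable across tests.
def clickEvents(n: Int, country: String = "SE"): Seq[ClickEvent] =
  (1 to n).map(i => ClickEvent(userId = s"user-$i", timestamp = i * 1000L, country = country))
```

Contrast with static JSON files, where every field addition forces an edit of every fixture file.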

Test oracles
● Compare with expected output
● Check fields relevant for test
○ Robust to field changes
○ Reusable for new, similar types
● Tip: Use lenses
○ JSON: JsonPath (Java), Play JSON (Scala)
○ Case classes: Monocle
● Express invariants for each data type
○ Reuse for production data quality monitoring
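A field-focused oracle can be sketched as a projection onto the fields under test, plus a per-type invariant (illustrative Scala; a real codebase might use Monocle lenses, and the names here are invented):

```scala
// Oracle that checks only the fields relevant to the test, so it survives
// additions of new fields to the data type.
case class EnrichedClick(userId: String, url: String, country: String, enrichedAt: Long)

// Projection onto the fields under test (a lens in the loosest sense).
def relevantFields(e: EnrichedClick): (String, String) = (e.userId, e.country)

def assertMatches(actual: Seq[EnrichedClick], expected: Seq[(String, String)]): Unit =
  assert(actual.map(relevantFields).sorted == expected.sorted,
    s"mismatch: ${actual.map(relevantFields)} vs $expected")

// Invariant expressed per data type; reusable for production data
// quality monitoring.
def invariant(e: EnrichedClick): Boolean =
  e.userId.nonEmpty && e.country.length == 2
```

The oracle deliberately ignores enrichedAt, a field a naive exact comparison would trip over.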

Data pipeline = yet another program
Don’t veer from best practices

● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...

In data engineering, slipping on best practices is part of the culture... :-(

● Mix in solid backend engineers
● Document “golden path”
Testing with cloud services
● PaaS components do not work locally
○ Cloud providers should provide fake implementations
○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
○ Distribute access tokens, etc
○ Pay $ or $$$

Top anti-patterns
1. Test as afterthought or in production
Data processing applications are well suited for testing!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework - vendor lock-in
7. Using wall clock time
8. Embedded fixture components

Thank you. Questions?
Credits:

Øyvind Løkling, Schibsted Media Group

● Content inspiration
Confluent, LinkedIn, Google, Netflix, Apache Samza
● Images
Tracey Saxby, Integration and Application Network, University of Maryland
Center for Environmental Science (ian.umces.edu/imagelibrary/).

Bonus slides

Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy

Obtaining quality metrics

● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
● Dedicated quality assessment pipelines
○ Reuse test oracle invariants in production

[Diagram: Hadoop / Spark counters and a quality assessment job push metrics to a DB]
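"Odd code path => bump counter" can be sketched with a plain counter registry standing in for Spark/Hadoop counters (illustrative; the names Counters and parseAge are invented, and a real Spark job would use accumulators):

```scala
// Plain counter registry standing in for Spark/Hadoop counters.
import scala.collection.mutable

object Counters {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)
  def bump(name: String): Unit = counts(name) += 1
  def get(name: String): Long = counts(name)
}

// Odd code path: count the bad record instead of crashing the job,
// then monitor the counter as a quality metric.
def parseAge(raw: String): Option[Int] =
  raw.toIntOption match {
    case Some(a) if a >= 0 => Some(a)
    case _ =>
      Counters.bump("invalid_age")
      None
  }
```

The counter values are what the quality assessment pipeline would later push to the DB for monitoring.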
Quality testing in the process

● Binary self-contained
○ Validate in CI
● Relative vs history
○ E.g. large drops
○ Precondition for publishing dataset
● Push aggregates to DB
○ Standard ops: monitor, alert

[Diagram: a code change (∆) is validated in CI; dataset aggregates are compared with history in a DB]
