applications
Lars Albertsson, independent consultant
Øyvind Løkling, Schibsted Media Group
www.mapflat.com
Who’s talking?
● Swedish Institute of Computer Science (distributed system test+debug tools)
● Sun Microsystems (very large machines)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling)
● Schibsted Media Group (data processing & modelling)
● Mapflat - independent data engineering consultant
Why stream processing?
● Increasing number of
(Diagram labels: Business intelligence, Curated content, Exploration, microservices)
The organic past
● Many paths
● Synchronous
● Link failure -> chain failure
● Heterogeneous
● Difficult to recover from transformation bugs
(Diagram labels: App, Service, Aggregate logs, Queue, HTTP, DB, Poll, NFS, Hourly dump, ETL, scp, Data warehouse)
The unified log
● Publish data in streams
● Replicated, sharded append-only log
● Pub / sub with history
○ Kafka, Google Pub/Sub, AWS Kinesis
(Diagram labels: App, Ads, Search, Feed, Unified log, Stream)
● Decoupled producers/consumers
○ In source/deployment
○ In space
○ In time
● Publish results to log
(Diagram labels: App, Stream, Job, Data lake)
Stream processing building blocks
● Aggregate
○ Calculate time windows
○ Aggregate state (in memory / local database / shared database)
● Filter
○ Slim down stream
○ Privacy, security concerns
● Join
○ Enrich by joining with datasets, e.g. geo IP lookup, demographics
○ Join streams within time windows, e.g. click-through rate
● Transform
○ Bring data into same “shape”, schema
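The four building blocks above can be sketched on plain Python lists, without any streaming framework. This is only an illustration: the event fields (`type`, `ts`, `ip`), the `GEO_BY_IP` table, and all function names are assumptions made for this example, and a real job would use the windowing and state facilities of its stream processing library.

```python
from collections import Counter

# Hypothetical lookup table standing in for a geo IP dataset.
GEO_BY_IP = {"10.0.0.1": "SE", "10.0.0.2": "NO"}


def keep(event):
    """Filter: slim down the stream to the events this job needs."""
    return event.get("type") == "click"


def enrich(event):
    """Join: enrich the event by looking up the country for its IP."""
    return {**event, "country": GEO_BY_IP.get(event["ip"], "unknown")}


def transform(event):
    """Transform: bring the event into one shape and drop the raw IP."""
    return {"ts": event["ts"], "country": event["country"]}


def aggregate(events, window_seconds=60):
    """Aggregate: count events per (tumbling window start, country)."""
    counts = Counter()
    for event in events:
        window_start = event["ts"] - event["ts"] % window_seconds
        counts[(window_start, event["country"])] += 1
    return dict(counts)


def run(events):
    """Chain the building blocks: filter -> join -> transform -> aggregate."""
    return aggregate(transform(enrich(e)) for e in events if keep(e))
```

Calling `run` on a list of click and view events returns a dictionary keyed by window start and country, with the view events filtered out.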
Stream processing technologies
● Spark Streaming
○ Ideal if you are already using Spark, same model
○ Bridges the gaps between data scientists and data engineers, and between batch and stream
● Kafka Streams
○ Library - new, positions itself as a lightweight alternative
○ Tightly coupled to Kafka
● Others
○ Storm, Heron, Flink, Samza, Google Dataflow, AWS Lambda
Egress
● Update database table, e.g. for polling dashboard
(Diagram labels: App, Service)
Test concepts
(Diagram labels: Seam, Test harness, Stream, Job)
● Unit
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
Stream application properties
● Output = function(input, code)
○ Perfect for testing!
○ Avoid: non-deterministic processing, reading the wall clock
● Pipeline and job endpoints are stable
○ Correspond to business value
● Internal abstractions are volatile
○ Reslicing in different dimensions is common
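As a minimal sketch of "output = function(input, code)", the hypothetical job below derives sessions from the timestamps carried in the events, never from the wall clock, so the same input always yields the same output. The field names (`user`, `ts`) and the 30-minute gap are assumptions made for this example.

```python
def sessionize(events, gap_seconds=1800):
    """Assign a per-user session index using event timestamps only.

    A new session starts when the gap since the user's previous event
    exceeds gap_seconds. No wall clock is read, so results are repeatable.
    """
    last_seen = {}
    session_no = {}
    out = []
    # Sort by (user, timestamp) so the result does not depend on arrival order.
    for event in sorted(events, key=lambda e: (e["user"], e["ts"])):
        user, ts = event["user"], event["ts"]
        if user not in last_seen or ts - last_seen[user] > gap_seconds:
            session_no[user] = session_no.get(user, -1) + 1
        last_seen[user] = ts
        out.append({**event, "session": session_no[user]})
    return out
```

Because the function depends only on its arguments, a test can feed it the same events in any order and compare the results directly.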
Recommended scopes
● Single job
● Multiple jobs
(Diagram labels: App, Service, Stream, Job)
Scopes to avoid
● Unit
○ Few stable interfaces
○ Not necessary
○ Avoid mocks, DI rituals
(Diagram labels: App, Service, Stream, Job)
Stream application, example harness
(Diagram labels: Test input, Kafka topic, DB, Polling test oracle, Docker, IDE / Gradle)
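A polling test oracle of the kind named in the harness could look roughly like the sketch below. The helper `poll_until` and its parameters are hypothetical; in a real harness, `fetch` would query the database that the job under test writes to.

```python
import time


def poll_until(fetch, predicate, timeout_s=10.0, interval_s=0.1):
    """Call fetch() repeatedly until predicate(result) holds.

    Returns the first result that satisfies the predicate, or raises
    AssertionError once timeout_s has elapsed without success.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch()
        if predicate(result):
            return result
        if time.monotonic() >= deadline:
            raise AssertionError(
                f"condition not met within {timeout_s}s, last result: {result!r}"
            )
        time.sleep(interval_s)
```

The timeout applies to the test harness only. The job under test should still take its notion of time from the events themselves.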
Test lifecycle
(Diagram: numbered lifecycle steps, starting with 1. Docker)
Input generation
● Input & output are denormalised & wide
● Fields are frequently changed
○ Additions are compatible
○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
○ Unmaintainable
● Input generation routines
○ Robust to changes, reusable
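An input generation routine might be sketched as a builder that fills every field of a wide record with a valid default and lets each test override only the fields it cares about. The record type, its fields, and the name `make_page_view` are invented for this example.

```python
import itertools

# Deterministic counter so generated events are unique but repeatable.
_ids = itertools.count(1)


def make_page_view(**overrides):
    """Build a valid, wide page-view event with sensible defaults.

    Tests pass only the fields relevant to them. Adding a new field to the
    data type means updating this routine once, not every test.
    """
    n = next(_ids)
    event = {
        "event_id": f"evt-{n}",
        "user_id": "user-1",
        "ts": 1_000_000 + n,
        "url": "https://example.com/",
        "country": "SE",
        "device": "mobile",
    }
    # Reject unknown fields so renamed or removed fields fail loudly.
    unknown = set(overrides) - set(event)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    event.update(overrides)
    return event
```

A test that only cares about geography would then write `make_page_view(country="NO")` and ignore the rest of the record.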
Test oracles
● Compare with expected output
● Check fields relevant for test
○ Robust to field changes
○ Reusable for new, similar types
● Tip: Use lenses
○ JSON: JsonPath (Java), Play JSON (Scala)
○ Case classes: Monocle
● Express invariants for each data type
○ Reuse for production data quality monitoring
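The idea of checking only relevant fields can be illustrated with a small lens-like getter over nested dictionaries. This sketch stands in for the JsonPath, Play JSON, or Monocle lenses mentioned above and does not use any of those libraries. The record fields and the invariants are assumptions for illustration.

```python
def get_path(record, path):
    """Minimal lens-like getter: follow a dotted path into nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


def assert_fields(record, expected):
    """Oracle: compare only the listed paths and ignore all other fields."""
    for path, want in expected.items():
        got = get_path(record, path)
        assert got == want, f"{path}: expected {want!r}, got {got!r}"


def check_invariants(record):
    """Invariants for a hypothetical page-view type; returns violations.

    The same function could run against production data for quality
    monitoring.
    """
    problems = []
    if get_path(record, "ts") <= 0:
        problems.append("ts must be positive")
    if len(get_path(record, "geo.country")) != 2:
        problems.append("geo.country must be a two-letter code")
    return problems
```

An oracle written this way keeps passing when unrelated fields are added to the output, because it never compares whole records.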
Data pipeline = yet another program
Don’t veer from best practices
● Regression testing
● Design: Separation of concerns, modularity, etc
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: Global state, hard-coding location, duplication, ...
Testing with cloud services
● PaaS components do not work locally
○ Cloud providers should provide fake implementations
○ Exceptions: Kubernetes, Cloud SQL, Relational Database Service, (S3)
● Integrating a PaaS service as a fixture component is challenging
○ Distribute access tokens, etc
○ Pay $ or $$$
Top anti-patterns
1. Test as afterthought or in production
Data processing applications are well suited to testing!
2. Static test input in version control
3. Exact expected output test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test framework - vendor lock-in
7. Using wall clock time
8. Embedded fixture components
Thank you. Questions?
Credits:
● Content inspiration
Confluent, LinkedIn, Google, Netflix, Apache Samza
● Images
Tracey Saxby, Integration and Application Network, University of Maryland
Center for Environmental Science (ian.umces.edu/imagelibrary/).
Bonus slides
Quality testing variants
● Functional regression
○ Binary, key to productivity
● Golden set
○ Extreme inputs => obvious output
○ No regressions tolerated
● (Saved) production data input
○ Individual regressions ok
○ Weighted sum must not decline
○ Beware of privacy
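One possible reading of the golden set and production data rules above, as a sketch: golden cases tolerate no regression, while the remaining cases may regress individually as long as the weighted sum of scores does not decline. The score dictionaries, the weights, and the function names are hypothetical.

```python
def weighted_score(scores, weights):
    """Weighted sum of per-case quality scores."""
    return sum(weights[case] * score for case, score in scores.items())


def check_quality(baseline, candidate, weights, golden):
    """Compare candidate scores with baseline scores.

    Returns a list of failure messages; an empty list means acceptable.
    Golden cases must not regress at all. Other cases may regress
    individually, but the weighted sum must not decline.
    """
    failures = []
    for case in golden:
        if candidate[case] < baseline[case]:
            failures.append(f"golden case {case} regressed")
    if weighted_score(candidate, weights) < weighted_score(baseline, weights):
        failures.append("weighted sum declined")
    return failures
```

With this shape, a change that trades a small loss on one production sample for a larger gain on another passes, while any loss on a golden case fails.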
Obtaining quality metrics
● Push aggregates to DB
∆?