
Engineering data quality

Øredev, 2019-11-08
Lars Albertsson (@lalleal)
Scling
www.scling.com
Data value requires data quality
“Hey, the CRM pipeline is down! We really need the data.”

“But the data is completely bogus, and we need to work with the provider to fix it.”

“But we use it to feed our analytics, and need data now!”

“…?”
Scope
Data engineering perspective on data quality

● Context - big data environments

● Origins of good or bad data

● Quality assessment

● Quality assurance

Big data - a collaboration paradigm

[Diagram: data democratised through shared stream storage and a data lake]
Data pipelines

[Diagram: data pipelines over the data lake]
More data - decreased friction

[Diagram: stream storage and data lake]
Scling - data-value-as-a-service
● Extract value from your data
● Data platform + custom data pipelines
● Imitate data leaders:
○ Quick idea-to-production
○ Operational efficiency

Our marketing strategy:

● Promiscuously share knowledge
○ On slides devoid of glossy polish :-)
Data platform overview

[Diagram: online services feed the offline data platform, i.e. datasets in a data lake with a cold store; batch processing jobs are composed into pipelines, coordinated by workflow orchestration]
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time

● Correctness
○ The numbers in the reports were calculated correctly

● Completeness
○ The report includes information on all customers, using all information from the whole time period

● Consistency
○ The customer summaries are all based on the same time period
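Each of these dimensions can be probed mechanically. A minimal Python sketch, assuming a hypothetical report structure (the field names and the 06:00 deadline are invented for illustration):

```python
from datetime import datetime, timedelta

def check_report(report, expected_customers):
    # Probe the four quality dimensions of a (hypothetical) report dict.
    checks = {}
    # Timeliness: produced within an hour of the expected 06:00 deadline.
    produced = report["produced_at"]
    deadline = produced.replace(hour=6, minute=0, second=0, microsecond=0)
    checks["timeliness"] = abs(produced - deadline) <= timedelta(hours=1)
    # Correctness: recompute an aggregate and compare.
    checks["correctness"] = report["total_engagement"] == sum(
        row["engagement"] for row in report["rows"])
    # Completeness: all expected customers are present.
    checks["completeness"] = (
        {row["customer"] for row in report["rows"]} >= expected_customers)
    # Consistency: all summaries cover the same time period.
    checks["consistency"] = len({row["period"] for row in report["rows"]}) == 1
    return checks

report = {
    "produced_at": datetime(2019, 11, 8, 6, 10),
    "total_engagement": 30,
    "rows": [
        {"customer": "a", "engagement": 10, "period": "2019-11"},
        {"customer": "b", "engagement": 20, "period": "2019-11"},
    ],
}
assert all(check_report(report, {"a", "b"}).values())
```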

The truth is out there

“I love working with data, because data is true.”
Truth mutated
“We put the new model out for A/B testing, and it looks great!”

“Great. What fraction of test users showed a KPI improvement?”

“100%!”

“Hmm..”

“Wait, it seems ads were disabled for the test group....”
Not the whole truth

“Our steel customers are affected by cracks, causing corrosion. Can you look at our defect reports, and help us predict issues?”

“Sure, hang on.”

“We have found a strong signal: the customer id...”
Something but the truth

“Huh, why do we have a sharp increase in invalid_media_type?”

“I’ll have a look.”

“It seems that we have a new media type “bullshit”...”
Hearsay
“Manufacturing line disruptions are expensive. Can you look at our sensor data, and help us predict?”

“Sure, hang on.”
Hearsay

“Manufacturing line disruptions are expensive. Can you look at our sensor data, and help us predict?”

“Sure, hang on.”

“This looks like an early indicator!”

“Wait, is this interpolated?”
Events vs current state

[Diagram: DB and a later snapshot DB’; which one do events join with?]

● join(event, snapshot) → always time mismatch

● Usually acceptable
○ In one direction
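The mismatch is easy to see in a minimal sketch (the schema and timestamps are invented): every event joins against the same snapshot, so events newer than the snapshot time see stale state.

```python
from datetime import datetime

# A snapshot of the user table, extracted once per day (invented schema).
snapshot_time = datetime(2019, 11, 8, 0, 0)
user_snapshot = {"u1": {"country": "SE"}}  # DB state as of snapshot_time

# Events arrive continuously; suppose u1 moved country after the snapshot.
events = [
    {"user": "u1", "time": datetime(2019, 11, 7, 23, 0), "item": "a"},
    {"user": "u1", "time": datetime(2019, 11, 8, 12, 0), "item": "b"},
]

def join_events_with_snapshot(events, snapshot):
    # Every event joins the same snapshot, regardless of event time.
    return [dict(e, country=snapshot[e["user"]]["country"]) for e in events]

joined = join_events_with_snapshot(events, user_snapshot)

# Events newer than the snapshot join against potentially stale state:
# the unavoidable time mismatch, usually acceptable in this direction only.
stale_joined = [e for e in joined if e["time"] > snapshot_time]
```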

Monitoring timeliness, examples
● Datamon - Spotify internal
● Twitter Ambrose (dead?)
● Airflow

Ensuring timeliness
● First rule of distributed systems: Avoid distributed systems.

● Keep things simple.

● Master workflow orchestration.

Other than that, very large topic...

Design for testability
● Output = function(input, code)

● No dependency on external services (DB, service)

● Avoid non-deterministic factors

Potential test scopes

● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client

Choose stable interfaces. Each scope has a cost.

[Diagram: app → service → stream → job → stream → job → stream → job]
Recommended scopes

● Single job
● Multiple jobs
● Pipeline, including service
Scopes to avoid

● Unit/Component
○ Few stable interfaces
○ Avoid mocks, dependency injection rituals

● Full system, including client
○ Client automation fragile

“Focus on functional system tests, complement with smaller where you cannot get coverage.”
- Henrik Kniberg
Testing single batch job
Standard Scalatest harness; runs well in CI / from the IDE:

1. Generate input
2. Run in local mode
3. Verify output

[Diagram: file://test_input/ → Job → file://test_output/]
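The same three-step harness can be sketched language-neutrally; this Python version uses an invented toy job (upper-case the country field) purely for illustration:

```python
import json
import tempfile
from pathlib import Path

def run_job(input_path, output_path):
    # Stand-in for the batch job under test (invented logic): read JSON
    # lines, upper-case the country field, write JSON lines.
    records = [json.loads(line)
               for line in Path(input_path).read_text().splitlines()]
    output = [dict(r, country=r["country"].upper()) for r in records]
    Path(output_path).write_text("\n".join(json.dumps(r) for r in output))

# 1. Generate input
workdir = Path(tempfile.mkdtemp())
(workdir / "test_input").write_text('{"id": "u1", "country": "se"}')

# 2. Run in local mode (local files instead of the cluster)
run_job(workdir / "test_input", workdir / "test_output")

# 3. Verify output
result = json.loads((workdir / "test_output").read_text())
assert result["country"] == "SE"
```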
Invariants

● Some things are true
○ For every record
○ For every job invocation
● Not necessarily in production
○ Reuse invariant predicates as quality probes

class CleanUserTest extends FlatSpec {
  def validateInvariants(
      input: Seq[User],
      output: Seq[User],
      counters: Map[String, Int]) = {
    output.foreach(recordInvariant)
    // Dataset invariants
    assert(input.size === output.size)
    assert(input.size >= counters("upper-cased"))
  }

  def recordInvariant(u: User) =
    assert(u.country.size === 2)

  def runJob(input: Seq[User]): (Seq[User], Map[String, Int]) = {
    // Same as before
    ...
    validateInvariants(input, output, counters)
    (output, counters)
  }

  // Test case is the same
}
Measuring correctness: counters

● User-defined
● Technical from framework
○ Execution time
○ Memory consumption
○ Data volumes
○ ...

case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)

val orders = read(orderPath)
val users = read(userPath)

val orderNoUserCounter = longAccumulator("order-no-user")

val joined: C[(Order, Option[User])] = orders
  .groupBy(_.userId)
  .leftJoin(users.groupBy(_.id))
  .values

val orderWithUser: C[(Order, User)] = joined
  .flatMap {
    case (order, Some(user)) => Some((order, user))
    case (order, None) =>
      orderNoUserCounter.add(1)
      None
  }

SQL: Nope
Measuring correctness: counters

● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics

[Diagram: Hadoop / Spark counters → metrics DB → standard graphing tools, standard alerting service]
Measuring correctness: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics

● Dedicated quality assessment pipelines

[Diagram: quality assessment job → quality metadataset (tiny) → metrics DB → standard graphing tools, standard alerting service]
www.scling.com
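A dedicated assessment job can be very small. A sketch with invented metric names: it scans a produced dataset and emits a tiny quality metadataset for graphing, alerting, and downstream consumption decisions.

```python
import json

def assess_quality(records):
    # Produce a tiny quality metadataset for a produced dataset.
    total = len(records)
    missing_country = sum(1 for r in records if not r.get("country"))
    return {
        "record_count": total,
        "missing_country_fraction": missing_country / total if total else 1.0,
    }

dataset = [
    {"id": "u1", "country": "SE"},
    {"id": "u2", "country": None},
    {"id": "u3", "country": "DK"},
]
metadataset = assess_quality(dataset)
# Written next to the dataset, e.g. as a single JSON file, for graphing,
# alerting, and conditional consumption to read.
print(json.dumps(metadataset))
```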
Conditional consumption
● Conditional consumption
○ Express in workflow orchestration
○ Read metrics DB, quality dataset
○ Producer can recommend, not decide
● Insufficient quality?
○ Wait for bug fix
○ Use older/newer input dataset

[Diagram: metrics → recommendation → report]
www.scling.com
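Conditional consumption can be sketched as follows (the dataset paths and the valid_fraction metric are hypothetical): the consumer reads quality metadata and either takes the newest acceptable partition, falls back to an older one, or waits.

```python
def choose_dataset(partitions, min_quality=0.99):
    # `partitions` is ordered newest first; each entry carries the quality
    # metrics published by its producer. The producer recommends, but the
    # consumer decides. Returns a path, or None to wait for a bug fix.
    for partition in partitions:
        if partition["metrics"]["valid_fraction"] >= min_quality:
            return partition["path"]
    return None  # insufficient quality everywhere: wait

partitions = [
    {"path": "/lake/orders/2019-11-08", "metrics": {"valid_fraction": 0.90}},
    {"path": "/lake/orders/2019-11-07", "metrics": {"valid_fraction": 0.995}},
]
# Today's partition is below threshold, so fall back to an older one.
chosen = choose_dataset(partitions)
```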
The unknown unknowns
● Measure user behaviour
○ E.g. session length, engagement, funnel

● Revert to old egress dataset if necessary

[Diagram: stream → job → DB; interaction measurements feed a standard alerting service]
Fuzzy products
● Data-driven applications often have fuzzy logic
○ No clear right output
○ Quality == total experience for all users

● Testing for productivity must be binary

● Cause → effect connection must be strong


○ Cause = code change
○ Effect = quality degradation

Converting fuzzy to binary
● Break out binary behaviour

[Diagram: simple input → output sane? invariants?; clear-cut scenario → clear-cut result?]
Golden scenario suite
● Scenarios that must never fail
● May include real-world data

[Diagram: "Stockholm" query against geo data]
Weighted quality sum
● Sum of test case results should not regress
● Individual regressions are acceptable
● Example: Map searches, done from a US IP address with English language browser setting

Input         Output               Verdict (0-1)  Weight  Weighted verdict
Springfield   Springfield, MA      1              2       2
Hartfield     Hartford, CT         0              1       0
Philadelphia  Philadelphia, Egypt  0.2            5       1
Boston        Boston, UK           0.4            4       1.6
Betlehem      Betlehem, Israel     0.8            1       0.8

Sum = 5.4
www.scling.com
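The suite score from the table above is just a weighted sum; a minimal sketch of the regression gate:

```python
def weighted_quality_sum(cases):
    # Each case is (verdict in [0, 1], weight); the suite score is the
    # weighted sum. Individual regressions are acceptable as long as the
    # total does not drop below the recorded baseline.
    return sum(verdict * weight for verdict, weight in cases)

# The map-search cases from the table above.
cases = [
    (1, 2),    # Springfield -> Springfield, MA
    (0, 1),    # Hartfield -> Hartford, CT
    (0.2, 5),  # Philadelphia -> Philadelphia, Egypt
    (0.4, 4),  # Boston -> Boston, UK
    (0.8, 1),  # Betlehem -> Betlehem, Israel
]
score = weighted_quality_sum(cases)  # ~5.4

BASELINE = 5.4  # recorded from the last accepted run
assert score >= BASELINE - 1e-9  # gate: fail the build on regression
```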
Testing with real world / production data
● Data is volatile
○ Separate code change from test data change
○ Take snapshots to use for test
● Beware of privacy issues

[Diagram: code ∆! applied while input data may also change; which ∆ caused the output ∆?]
www.scling.com
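One way to separate code change from test data change is to pin the snapshot by checksum; a sketch with an invented snapshot file:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def snapshot_checksum(path):
    # Fingerprint the snapshot so that test failures can be attributed to
    # code changes rather than silently refreshed test data.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Invented snapshot file; in practice, an anonymised extract of production
# data, mindful of privacy.
snapshot = Path(tempfile.mkdtemp()) / "orders_snapshot.json"
snapshot.write_text(json.dumps([{"item": "a", "userId": "u1"}]))

# Recorded when the snapshot is taken; updated deliberately, in its own
# commit, whenever the test data is refreshed.
PINNED_CHECKSUM = snapshot_checksum(snapshot)

# At test time: refuse to run against unexpected data.
assert snapshot_checksum(snapshot) == PINNED_CHECKSUM
```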
Data completeness
● Static workflow DAGs ensure dataset completeness

● Dataset completeness != data completeness

● Collected events might be delayed

● Event creation to collection delay is unbounded
○ Consider offline phones
Incompleteness recovery
Luigi workflow task (Python):

class OrderShuffle(SparkSubmitTask):
    hour = DateHourParameter()
    delay_hours = IntParameter()

    jar = 'orderpipeline.jar'
    entry_class = 'com.example.shop.OrderJob'

    def requires(self):
        # Note: This delays processing by three hours.
        return [Order(hour=hour) for hour in
                [self.hour + timedelta(hours=h) for h in
                 range(self.delay_hours)]]

    def output(self):
        return HdfsTarget("/prod/red/order/v1/"
                          f"delay={self.delay_hours}/"
                          f"{self.hour:%Y/%m/%d/%H}/")

    def app_options(self):
        return ["--hour", self.hour,
                "--delay-hours", self.delay_hours,
                "--order",
                ",".join([i.path for i in self.input()]),
                "--output", self.output().path]

Spark job (Scala):

val orderLateCounter = longAccumulator("order-event-late")

val hourPaths = conf.order.split(",")
val order = hourPaths
  .map(spark.read.json(_))
  .reduce((a, b) => a.union(b))

val orderThisHour = order
  .map { cl =>
    // Count the events that came after the delay window
    if (cl.eventTime.hour + config.delayHours < config.hour) {
      orderLateCounter.add(1)
    }
    cl
  }
  .filter(cl => cl.eventTime.hour == config.hour)

SQL: Separate job for measuring window leakage.
Fast data, complete data
class OrderShuffleAll(WrapperTask):
    hour = DateHourParameter()

    def requires(self):
        return [OrderShuffle(hour=self.hour, delay_hours=d)
                for d in [0, 4, 12]]

class OrderDashboard(mysql.CopyToTable):
    hour = DateHourParameter()

    def requires(self):
        return OrderShuffle(hour=self.hour, delay_hours=0)

class FinancialReport(SparkSubmitTask):
    date = DateParameter()

    def requires(self):
        return [OrderShuffle(
                    hour=datetime.combine(self.date, time(hour=h)),
                    delay_hours=12)
                for h in range(24)]

[Diagram: the same hour is shuffled with delay 0, 4, and 12 hours]
Things to plan for early

Data quality

Things to plan for early

[Word cloud: data quality, testability, web security, machine learning bias, bidi languages, multi-cloud, cloud native, mobile browsers, i18n, input validation, foobarility, software supply chain, accessibility, scalability, performance, user feedback, UX]

Get the MVP out?
Does anyone care about data quality?
No.

[Diagram: no graph → any graph → valuable graph; no model → any ML model → valuable model. Reactions: “Great!”, “Meh.”]
1999: Does anyone care about code quality?
No.

7 years

Code quality 1999

“Behold our great code!”

“We think some QA and test automation would be great.”

“Nah, boring. We don’t have time. Just put it in production for us.”
Code quality 2019

“We have invented DevOps and continuous delivery. Test automation is key!”

“That sounds familiar...”
Data quality 2019

“Behold our great model!”

“We think some data quality assessment and automation would be great.”

“Nah, why? We don’t have time. Just put it in production for us.”
Data quality 2029

“We have invented MLOps and continuous modelling. Quality feedback automation is key!”

“That sounds familiar...”
Changing culture bottom-up

Repeating the success?

Resources, credits
Presentations, articles on related subjects:

● https://www.scling.com/reading-list
● https://www.scling.com/presentations

Useful tools:

● https://github.com/awslabs/deequ
● https://github.com/great-expectations/great_expectations
● https://github.com/spotify/ratatool

Thank you,

● Irene Gonzálvez, Spotify (https://youtu.be/U63TmQPS9Z8)
● Anders Holst, RISE
Tech has massive impact on society
Supplier? Cloud? Employer? Product?

Make an active choice whether to have an impact!
Laptop sticker
Vintage data visualisations, by Karin Lind.

● Charles Minard: Napoleon’s Russian campaign of 1812. Drawn 1869.

● Matthew F Maury: Wind and Current Chart of the North Atlantic. Drawn 1852.

● Florence Nightingale: Causes of Mortality in the Army of the East. Crimean war, 1854-1856. Drawn 1858.
○ Blue = disease, red = wounds, black = battle + other.
● Harold Craft: Radio Observations of the Pulse Profiles and Dispersion Measures of Twelve Pulsars, 1970
○ Joy Division: Unknown Pleasures, 1979
○ “Joy plot” → “ridge plot”
www.scling.com
