Øredev, 2019-11-08
Lars Albertsson (@lalleal)
Scling
www.scling.com
Data value requires data quality
Hey, the CRM pipeline is down!
We really need the data.
But the data is completely bogus, and we need to work with the provider to fix it.
…?
Scope
Data engineering perspective on data quality
● Quality assessment
● Quality assurance
Big data - a collaboration paradigm
Data lake
Data pipelines
Data lake
More data - decreased friction
Stream storage
Data lake
Scling - data-value-as-a-service
● Extract value from your data
● Data platform + custom data pipelines
● Imitate data leaders:
○ Quick idea-to-production
○ Operational efficiency
(Diagram: stream storage)
Data platform overview
(Diagram labels: dataset, job, pipeline, batch processing, service, workflow orchestration, data lake, cold store; online services vs offline data platform)
Data quality dimensions
● Timeliness
○ E.g. the customer engagement report was produced at the expected time
● Correctness
○ The numbers in the reports were calculated correctly
● Completeness
○ The report includes information on all customers, using all information from the whole time period
● Consistency
○ The customer summaries are all based on the same time period
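The four dimensions above can be probed mechanically. A minimal Python sketch, where the report shape (rows, produced_at, and friends) is invented purely for illustration:

```python
from datetime import datetime

# Hypothetical sketch: probe the four quality dimensions for a daily
# customer engagement report. All names here are illustrative.

def assess_report(rows, produced_at, deadline, expected_customers, period):
    """Return one boolean check per quality dimension."""
    return {
        # Timeliness: the report appeared before its deadline.
        "timely": produced_at <= deadline,
        # Correctness: a spot-check invariant, e.g. no negative engagement.
        "correct": all(r["engagement"] >= 0 for r in rows),
        # Completeness: every known customer is covered.
        "complete": {r["customer"] for r in rows} >= expected_customers,
        # Consistency: all rows summarise the same time period.
        "consistent": all(r["period"] == period for r in rows),
    }

rows = [
    {"customer": "a", "engagement": 3, "period": "2019-11"},
    {"customer": "b", "engagement": 7, "period": "2019-11"},
]
checks = assess_report(
    rows,
    produced_at=datetime(2019, 11, 8, 6, 30),
    deadline=datetime(2019, 11, 8, 7, 0),
    expected_customers={"a", "b"},
    period="2019-11",
)
print(checks)  # every check passes for this toy report
```
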
The truth is out there
Truth mutated
We put the new model out for A/B testing, and it looks great!
Great. What fraction of test users showed a KPI improvement?
100%!
Hmm..
Not the whole truth
Something but the truth
Hearsay
Manufacturing line disruptions are expensive. Can you look at our sensor data, and help us predict?
Hearsay
Manufacturing line disruptions are expensive. Can you look at our sensor data, and help us predict?
This looks like an early indicator!
Events vs current state
(Diagram: DB and DB’, join?)
● Usually acceptable
○ In one direction
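The one-direction point can be made concrete: events fold into current state, but current state cannot be unfolded back into events. A toy sketch, with an invented event shape:

```python
# Sketch: a change log of events can always be reduced to current state,
# but the reverse reconstruction is impossible -- joining events against
# state is therefore acceptable in one direction only.

events = [
    {"user": "a", "field": "country", "value": "SE", "ts": 1},
    {"user": "a", "field": "country", "value": "DK", "ts": 2},
]

def current_state(events):
    """Fold events, in timestamp order, into a per-user state snapshot."""
    state = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        state.setdefault(e["user"], {})[e["field"]] = e["value"]
    return state

print(current_state(events))  # {'a': {'country': 'DK'}}
```

Note that the earlier value "SE" is gone from the snapshot: that lost history is exactly what makes the other join direction unrecoverable.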
Monitoring timeliness, examples
● Datamon - Spotify internal
● Twitter Ambrose (dead?)
● Airflow
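The core of such monitoring can be sketched in a few lines. This is an illustrative stand-in, not the Datamon/Ambrose/Airflow APIs; the path layout reuses the /prod/red/order/v1/ convention from the deck's pipeline example:

```python
from datetime import datetime, timedelta

# Illustrative timeliness monitor: flag dataset partitions that are
# still missing after their SLA has elapsed.

def late_partitions(expected_hours, existing_paths, now, sla):
    """Return the hours whose partition is absent and past its SLA."""
    late = []
    for hour in expected_hours:
        path = hour.strftime("/prod/red/order/v1/%Y/%m/%d/%H/")
        if path not in existing_paths and now - hour > sla:
            late.append(hour)
    return late

now = datetime(2019, 11, 8, 12, 0)
expected = [datetime(2019, 11, 8, h) for h in range(8, 12)]
existing = {
    "/prod/red/order/v1/2019/11/08/08/",
    "/prod/red/order/v1/2019/11/08/09/",
}
# Hour 10 is overdue; hour 11 is also missing but still within its SLA.
print(late_partitions(expected, existing, now, sla=timedelta(hours=1)))
```
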
Ensuring timeliness
● First rule of distributed systems: Avoid distributed systems.
Design for testability
● Output = function(input, code)
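When the job core really is a pure function of its input datasets, a test is just data in, data out: no cluster, no mocks. A minimal sketch, with a record shape invented for illustration:

```python
# "Output = function(input, code)": the job logic is a pure function,
# so tests construct input, run the function, and compare output.

def clean_users(users):
    """Job logic: drop records without an id, upper-case the country."""
    return [
        {**u, "country": u["country"].upper()}
        for u in users
        if u.get("id") is not None
    ]

# Test: a fixed input must map to exactly the expected output.
input_users = [
    {"id": 1, "country": "se"},
    {"id": None, "country": "dk"},  # invalid record, should be dropped
]
assert clean_users(input_users) == [{"id": 1, "country": "SE"}]
```
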
Choose stable interfaces
Each scope has a cost:
● Unit/component
● Single job
● Multiple jobs
● Pipeline, including service
● Full system, including client
(Diagram: app, service, streams, and jobs forming a pipeline)
Recommended scopes
● Single job
● Multiple jobs
(Diagram: app, service, streams, and jobs)
Scopes to avoid
● Unit/Component
○ Few stable interfaces
○ Avoid mocks, dependency injection rituals
● Full system, including client
○ Client automation fragile
- Henrik Kniberg
(Diagram: app, service, streams, and jobs)
Testing single batch job
Runs well in CI / from IDE
(Diagram: f(), p())
Invariants
● Some things are true
○ For every record
○ For every job invocation
● Not necessarily in production
○ Reuse invariant predicates as quality probes

class CleanUserTest extends FlatSpec {
  def validateInvariants(
      input: Seq[User],
      output: Seq[User],
      counters: Map[String, Int]) = {
    // Record invariants
    output.foreach(recordInvariant)
    // Dataset invariants
    assert(input.size === output.size)
    assert(input.size >= counters("upper-cased"))
  }
}
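The "reuse invariant predicates as quality probes" bullet can be sketched in Python as well; the predicates and record shapes below are illustrative:

```python
# Sketch: the same invariant predicates run as assertions in tests and
# as quality metrics in production.

def record_invariant(user):
    # Per-record invariant: every cleaned user has a non-empty id.
    return bool(user.get("id"))

def dataset_invariants(inp, out, counters):
    # Per-invocation invariants: no records lost, counters plausible.
    return (len(inp) == len(out)
            and len(inp) >= counters.get("upper-cased", 0))

def validate(inp, out, counters):
    return (all(record_invariant(u) for u in out)
            and dataset_invariants(inp, out, counters))

inp = [{"id": "a", "country": "se"}, {"id": "b", "country": "DK"}]
out = [{"id": "a", "country": "SE"}, {"id": "b", "country": "DK"}]

# In CI this is an assertion that fails the build; in production the
# same boolean would be emitted as a metric instead of failing the job.
assert validate(inp, out, {"upper-cased": 1})
```
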
Measuring correctness: counters
● User-defined

case class Order(item: ItemId, userId: UserId)
case class User(id: UserId, country: String)
Measuring correctness: counters
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
(Diagram: counters exported to standard graphing tools and a standard alerting service)
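The counter pattern can be shown in plain Python as a stand-in for Spark/Hadoop counters; the cleaning logic and counter names are invented:

```python
from collections import Counter

# Sketch: bump a user-defined counter on each odd code path, then export
# the counts to graphing and alerting.

counters = Counter()

def clean_country(user):
    country = user["country"]
    if country != country.upper():
        counters["upper-cased"] += 1      # odd code path => bump counter
        country = country.upper()
    if country not in {"SE", "DK", "NO"}:
        counters["unknown-country"] += 1  # another odd code path
    return {**user, "country": country}

users = [{"id": 1, "country": "se"}, {"id": 2, "country": "xx"}]
cleaned = [clean_country(u) for u in users]
print(dict(counters))  # {'upper-cased': 2, 'unknown-country': 1}
```

A sudden jump in a counter like `unknown-country` is precisely the kind of signal a standard alerting service can watch.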
Measuring correctness: pipelines
● Processing tool (Spark/Hadoop) counters
○ Odd code path => bump counter
○ System metrics
(Diagram: counters flow to a DB, standard graphing tools, and a standard alerting service; a quality assessment job reads the DB and produces a report)
The unknown unknowns
● Measure user behaviour
○ E.g. session length, engagement, funnel
● Measure interactions
(Diagram: stream → job → DB → standard alerting service)
Fuzzy products
● Data-driven applications often have fuzzy logic
○ No clear right output
○ Quality == total experience for all users
(Diagram: code ∆ → output ∆?)
Converting fuzzy to binary
● Break out binary behaviour
(Diagram: simple input → job; output sane? invariants?)
Golden scenario suite
● Scenarios that must never fail
● May include real-world data
(Diagram: "Stockholm" query against geo data)
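A golden suite is simply a set of scenarios checked on every build. A sketch with a made-up geo lookup and made-up data:

```python
# Sketch of a golden scenario suite: scenarios that must never fail,
# possibly built from snapshots of real-world data.

GEO_DATA = {
    "stockholm": (59.33, 18.07),
    "gothenburg": (57.71, 11.97),
}

def geocode(query):
    """The system under test: a trivial stand-in geo lookup."""
    return GEO_DATA.get(query.strip().lower())

GOLDEN_SCENARIOS = [
    # (query, expected result) -- any regression here blocks release.
    ("Stockholm", (59.33, 18.07)),
    ("  stockholm ", (59.33, 18.07)),
    ("Gothenburg", (57.71, 11.97)),
]

failures = [(q, geocode(q), want)
            for q, want in GOLDEN_SCENARIOS if geocode(q) != want]
assert not failures, f"golden scenarios regressed: {failures}"
```
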
Weighted quality sum
● Sum of test case results should not regress
● Individual regressions are acceptable
● Example: Map searches, done from a US IP address with English language browser setting
Query        Result           Scores
Springfield  Springfield, MA  1 2 2
Hartfield    Hartford, CT     0 1 0
Sum = 5.4
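The mechanism can be sketched directly: score each scenario, weight it, and gate on the total rather than on individual cases. Weights and scores below are invented:

```python
# Sketch of a weighted quality sum: individual scenario regressions are
# acceptable as long as the weighted total does not regress.

def weighted_quality(results, weights):
    """results: {scenario: score}; weights: {scenario: weight}."""
    return sum(weights[name] * score for name, score in results.items())

weights = {"Springfield": 1.0, "Hartfield": 0.8, "Stockholm": 1.5}

baseline = {"Springfield": 2, "Hartfield": 1, "Stockholm": 2}
candidate = {"Springfield": 2, "Hartfield": 0, "Stockholm": 2}  # one case regressed

old = weighted_quality(baseline, weights)   # 1.0*2 + 0.8*1 + 1.5*2 = 5.8
new = weighted_quality(candidate, weights)  # 1.0*2 + 0.8*0 + 1.5*2 = 5.0
print(old, new)  # here the candidate regresses the overall sum, so it fails
```
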
Testing with real world / production data
● Data is volatile
○ Separate code change from test data change
○ Take snapshots to use for test
● Beware of privacy issues
(Diagram: code ∆! input ∆? → output ∆?)
Data completeness
● Static workflow DAGs ensure dataset completeness
Incompleteness recovery

Workflow (Luigi):

class OrderShuffle(SparkSubmitTask):
    hour = DateHourParameter()
    delay_hours = IntParameter()

    jar = 'orderpipeline.jar'
    entry_class = 'com.example.shop.OrderJob'

    def requires(self):
        # Note: This delays processing by delay_hours hours.
        return [Order(hour=hour) for hour in
                [self.hour + timedelta(hours=h) for h in
                 range(self.delay_hours)]]

    def output(self):
        return HdfsTarget("/prod/red/order/v1/"
                          f"delay={self.delay_hours}/"
                          f"{self.hour:%Y/%m/%d/%H}/")

    def app_options(self):
        return ["--hour", self.hour,
                "--delay-hours", self.delay_hours,
                "--order",
                ",".join([i.path for i in self.input()]),
                "--output", self.output().path]


class OrderDashboard(mysql.CopyToTable):
    hour = DateHourParameter()

    def requires(self):
        return OrderShuffle(hour=self.hour, delay_hours=0)


class FinancialReport(SparkSubmitTask):
    date = DateParameter()
    # Delay: 4


Job (Spark/Scala):

val orderLateCounter = longAccumulator("order-event-late")

val hourPaths = conf.order.split(",")
val order = hourPaths
  .map(spark.read.json(_))
  .reduce((a, b) => a.union(b))

val orderThisHour = order
  .map({ cl =>
    // Count the events that came after the delay window
    if (cl.eventTime.hour + config.delayHours < config.hour) {
      orderLateCounter.add(1)
    }
    cl
  })
  .filter(cl => cl.eventTime.hour == config.hour)
Things to plan for early
Data quality
Things to plan for early
(Word cloud, rated from "Great!" to "Meh.": multi-cloud, machine learning bias, bidi languages, web security, testability, data quality, cloud native, mobile browsers, i18n, input validation, foobarility, software supply chain, accessibility, scalability, performance, UX, user feedback)
1999: Does anyone care about code quality?
No.
7 years
Code quality 1999
Nah, boring. We don’t have time. Just put it in production for us.
Code quality 2019
Data quality 2019
Data quality 2029
Changing culture bottom-up
Repeating the success?
Resources, credits
Thank you!
Presentations, articles on related subjects:
● https://github.com/awslabs/deequ
● https://github.com/great-expectations/great_expectations
● https://github.com/spotify/ratatool
Tech has massive impact on society
Supplier? Cloud? Employer? Product?
Make an active choice whether to have an impact!
Laptop sticker
Vintage data visualisations, by Karin Lind.
● Matthew F Maury: Wind and Current Chart of the North Atlantic. Drawn 1852.