
go/badwolf-internship

BadWolf
Internship Project
Abuse @BHZ Weekly • November 13, 2020

Roger Leite Lucena


lucenaroger@

opensource.google/projects/badwolf
github.com/google/badwolf

Counter Abuse Technology Attorney / Client Privileged and Confidential 1


Agenda

1. BadWolf overview

2. Project
a. Optimization (FILTER keyword)
b. Profiling
c. Tracing
d. Usability (query language)

3. Demo!

4. Wrap-up



01

BadWolf overview

What is BW?
Why did we need it?
How is it used by Google?
A glimpse into BQL

BadWolf What is it?

Temporal graph store abstraction loosely modeled as a triple store
(RDF: Resource Description Framework, a W3C "metadata data model")

Triple: subject-predicate-object statements

eg: Charles Darwin was born in Shrewsbury
eg: Shrewsbury has a population (in 2018) of 60,000

But statements change over time!

Graph: Charles Darwin --Born in--> Shrewsbury --Population@[2018]--> 60,000


BadWolf BW extends the predicate semantics:

BW predicate format: "ID"@[TIME]

Immutable predicates
Do not change with time
eg: Charles Darwin was born in Shrewsbury
BW predicate: "born_in"@[]

Temporal predicates
Only true at some point in time
eg: Shrewsbury has a population (in 2018) of 60,000
BW predicate: "population"@[2018]

Graph: Charles Darwin --"born_in"@[]--> Shrewsbury --"population"@[2018]--> 60,000

Name from the BadWolf entity in the "Doctor Who" series (season 1 episode 13)


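The "ID"@[TIME] format above is enough to tell immutable and temporal predicates apart mechanically. A minimal Go sketch, assuming the time anchor is kept as an opaque string (the slides abbreviate it to a year); the `Predicate` type and `parsePredicate` function are hypothetical names for illustration, not BadWolf's API:

```go
package main

import (
	"fmt"
	"strings"
)

// Predicate is a hypothetical, simplified view of a BW predicate.
type Predicate struct {
	ID     string // text between the double quotes
	Anchor string // text between the brackets; empty for immutable predicates
}

// IsImmutable reports whether the predicate has no time anchor.
func (p Predicate) IsImmutable() bool { return p.Anchor == "" }

// parsePredicate splits a string of the form "ID"@[TIME] into its parts.
func parsePredicate(s string) (Predicate, error) {
	s = strings.TrimSpace(s)
	sep := strings.LastIndex(s, `"@[`)
	if !strings.HasPrefix(s, `"`) || sep < 1 || !strings.HasSuffix(s, "]") {
		return Predicate{}, fmt.Errorf("malformed predicate: %q", s)
	}
	return Predicate{ID: s[1:sep], Anchor: s[sep+3 : len(s)-1]}, nil
}

func main() {
	for _, raw := range []string{`"born_in"@[]`, `"population"@[2018]`} {
		p, err := parsePredicate(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s immutable=%t anchor=%q\n", p.ID, p.IsImmutable(), p.Anchor)
	}
}
```

An empty anchor is the only thing that marks a predicate as immutable in this sketch, which mirrors the "born_in"@[] versus "population"@[2018] contrast on the slide.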
BadWolf also provides:

● An efficient query language
BQL (BadWolf Query Language)
- “SQL for graphs”
- Similar to SPARQL

● Flexible storage abstraction
The storage interface is implemented separately
eg: Spanner, RAM memory

● Data interchange model
Storage and linking of arbitrary objects in a directed graph


Why did we need BadWolf?

● Statements that change over time
● Flexibility to model and reason without schema/code changes
● New concepts frequently added or removed
● Ill-defined problems with changing landscapes


How is BadWolf used by Google?

● Used internally to model and query semi-structured graph data
Used across a range of some of the most famous Google products and services (confidential names)

● Used mainly in abuse-fighting contexts
BW simplifies querying and understanding complex relations

● BW will receive more internal clients and workloads


A glimpse into BQL

BW node format: /TYPE<ID>
BW predicate format: "ID"@[TIME_ANCHOR]
BW literal format: "VALUE"^^type:TYPE (int64, float64, bool, text, blob)

Graph: Charles Darwin --Born in--> Shrewsbury --Population@[2018]--> 60,000

● BQL triples for this graph:

/person<Charles Darwin> "born_in"@[] /city<Shrewsbury>
/city<Shrewsbury> "population"@[2018] "60000"^^type:int64


A glimpse into BQL

Triples of the ?family graph:

/u<joe> "parent_of"@[] /u<mary>
/u<joe> "parent_of"@[] /u<peter>
/u<peter> "parent_of"@[] /u<john>
/u<peter> "parent_of"@[] /u<eve>

BQL Query

SELECT ?grandparent, ?grandchild
FROM ?family
WHERE {
  ?grandparent "parent_of"@[] ?x .
  ?x "parent_of"@[] ?grandchild
};

Result:
?grandparent ?grandchild
/u<joe> /u<john>
/u<joe> /u<eve>
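The two WHERE clauses share the ?x binding, so the query reads as a join of the parent_of relation with itself. A small Go sketch of that reading, using plain strings for nodes; it illustrates the semantics only and makes no claim about how the BadWolf planner evaluates the query (the `edge` type and `grandparents` function are hypothetical):

```go
package main

import "fmt"

// edge is one "parent_of" triple: parent -> child.
type edge struct{ parent, child string }

// grandparents joins the edge list with itself on the shared ?x binding:
//   ?grandparent "parent_of"@[] ?x .
//   ?x "parent_of"@[] ?grandchild
func grandparents(edges []edge) [][2]string {
	var rows [][2]string
	for _, first := range edges {
		for _, second := range edges {
			if first.child == second.parent {
				rows = append(rows, [2]string{first.parent, second.child})
			}
		}
	}
	return rows
}

func main() {
	family := []edge{
		{"/u<joe>", "/u<mary>"},
		{"/u<joe>", "/u<peter>"},
		{"/u<peter>", "/u<john>"},
		{"/u<peter>", "/u<eve>"},
	}
	for _, row := range grandparents(family) {
		fmt.Println(row[0], row[1])
	}
}
```

Only /u<peter> appears both as a child and as a parent, so it is the only value ?x can take, which gives the two result rows shown on the slide.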


02

Project

a. Optimization (FILTER keyword)


b. Profiling
c. Tracing
d. Usability (query language)



Project

a. Optimization
Implement a FILTER keyword (from lexer/grammar to planner and storage/driver levels)

b. Profiling
Add support for pprof profiling in BadWolf (memory and CPU profiles activated through the BW CLI)

c. Tracing
Make the BadWolf tracer more easily extendable, improve debuggability, ensure coverage, add verbosity levels

d. Usability
Make BQL more compliant with W3C recommendations, improve reliability, make it more intuitive, support a more complete HAVING clause

Why:
● BW will receive heavier workloads: new internal clients
● Debuggability and help finding bottlenecks after heavy queries
● Easier to use


a.
Optimization
Implement a FILTER keyword

● Why?
Allow the user to customize, at a level closer to the storage/driver, the data they want
to retrieve, improving performance

Similar to SPARQL’s FILTER



a.
Optimization
FILTER keyword example

Common time series scenario, ?test graph:

/u<peter> "bought"@[1901] /gift<model 1>
/u<peter> "bought"@[1902] /gift<model 2>
/u<peter> "bought"@[1903] /gift<model 3>

/u<peter> "bought"@[2000] /gift<model 2000>

How to get only the last triple of the time series?

Without FILTER:

SELECT ?pred, ?obj
FROM ?test
WHERE {
  /u<peter> ?pred AT ?time ?obj
}
ORDER BY ?time DESC
LIMIT "1"^^type:int64;

With FILTER:

SELECT ?pred, ?obj
FROM ?test
WHERE {
  /u<peter> ?pred ?obj .
  FILTER latest(?pred)
};

Result (same for both queries):
?pred ?obj
"bought"@[2000] /gift<model 2000>


a.
Optimization
Currently implemented FILTER functions:
● Latest
Allows only the last element of the time series for the given binding

● IsImmutable
Allows only Immutable predicates for the given binding

● IsTemporal
Allows only Temporal predicates for the given binding
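
The three functions listed above can be pictured as predicates over triples. The Go sketch below is one possible reading, not BadWolf's implementation: it assumes the whole input is a single time series, that time anchors are years as in the slides, and that `latest` drops immutable predicates. The `triple` type and function names are hypothetical.

```go
package main

import "fmt"

// triple is a simplified triple whose predicate may carry a time anchor.
type triple struct {
	subject, predicate, object string
	temporal                   bool // false for immutable predicates such as "born_in"@[]
	year                       int  // time anchor, only meaningful when temporal is true
}

// isImmutable keeps triples whose predicate has no time anchor.
func isImmutable(t triple) bool { return !t.temporal }

// isTemporal keeps triples whose predicate has a time anchor.
func isTemporal(t triple) bool { return t.temporal }

// latest keeps only the temporal triples carrying the most recent anchor.
// Assumption of this sketch: immutable triples are dropped.
func latest(ts []triple) []triple {
	var out []triple
	for _, t := range ts {
		if !t.temporal {
			continue
		}
		switch {
		case len(out) == 0 || t.year > out[0].year:
			out = []triple{t} // strictly newer anchor: restart the result
		case t.year == out[0].year:
			out = append(out, t) // same anchor: keep both
		}
	}
	return out
}

func main() {
	series := []triple{
		{"/u<peter>", "bought", "/gift<model 1>", true, 1901},
		{"/u<peter>", "bought", "/gift<model 2>", true, 1902},
		{"/u<peter>", "bought", "/gift<model 3>", true, 1903},
		{"/u<peter>", "bought", "/gift<model 2000>", true, 2000},
	}
	for _, t := range latest(series) {
		fmt.Printf("%s %q@[%d] %s\n", t.subject, t.predicate, t.year, t.object)
	}
}
```

The performance argument on the slides comes from where this selection happens: applying it close to the storage/driver avoids shipping the whole series to the planner only to sort it and keep one row.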



a.
Optimization
FILTER keyword: example of impact on performance

[Figure: CPU consumption chart]
CPU consumption in blue could be completely avoided with a FILTER (~35 CPUs used during an entire day each time)


b.
Profiling
Integrate pprof profiling into BadWolf

● Why?
Transparency on memory and CPU metrics

Helpful to identify bottlenecks and see what is happening behind the scenes


b.
Profiling
Integrate pprof profiling into BadWolf

● How?
Through the BadWolf CLI

bql> start profiling [-cpurate samples_per_second];

bql> stop profiling;
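
How the CLI commands are wired internally is not shown in these slides. As a rough illustration of what starting and stopping a profile involves in Go, the sketch below uses the standard `runtime/pprof` package; `profileCPU` and `heapProfile` are hypothetical helpers, and `samplesPerSecond` only plays the role of the `-cpurate` flag here.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// profileCPU runs work under the CPU profiler and returns the encoded profile.
func profileCPU(samplesPerSecond int, work func()) ([]byte, error) {
	var buf bytes.Buffer
	// Setting the rate before StartCPUProfile is a common way to get a custom
	// sampling rate: StartCPUProfile then tries to apply its own default, the
	// runtime prints a warning to stderr, and the rate chosen here stays in effect.
	runtime.SetCPUProfileRate(samplesPerSecond)
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), nil
}

// heapProfile returns a snapshot of the heap profile.
func heapProfile() ([]byte, error) {
	var buf bytes.Buffer
	runtime.GC() // refresh allocation statistics before the snapshot
	if err := pprof.WriteHeapProfile(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	cpu, err := profileCPU(1000, func() {
		sum := 0
		for i := 0; i < 50000000; i++ {
			sum += i % 7
		}
		_ = sum
	})
	if err != nil {
		fmt.Println("cpu profile error:", err)
		return
	}
	heap, err := heapProfile()
	if err != nil {
		fmt.Println("heap profile error:", err)
		return
	}
	fmt.Printf("cpu profile: %d bytes, heap profile: %d bytes\n", len(cpu), len(heap))
}
```

Writing the returned bytes to files gives inputs that `go tool pprof` can open, including the graph and flame graph views.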



b.
Profiling
[Figure: CPU profile (Open Source volatile driver), CPUProfileRate: 1000 Hz]
[Figure: CPU profile flame graph (Open Source volatile driver), CPUProfileRate: 1000 Hz]
[Figure: Memory profile (Open Source volatile driver)]


c.
Tracing
Why not use Dapper?
Dapper is an internal tool, while the core of BadWolf is open source


c.
Tracing
Why improve BadWolf’s tracer?
● Improve debuggability

● Trace more key metadata to make it more helpful in the event of production issues

● Have control over its verbosity to customize the output depending on the use case and to avoid overloading the server unnecessarily


c.
Tracing
Example of trace (Intermediate verbosity level)

BQL file is:

CREATE GRAPH ?test;
INSERT DATA INTO ?test {...};
SELECT ?p1, ?o1, ?p2
FROM ?test
WHERE {
  /l<barcelona> ?p1 ?o1 .
  /item/book<000> ?p2 ?o2 .
  FILTER latest(?o1) .
  FILTER latest(?p2)
};
DROP GRAPH ?test;

Trace:

[2020-11-12T07:03:15.633787-03:00] Attempting to read file "../play.bql"
[2020-11-12T07:03:15.63409-03:00] Executing query: CREATE GRAPH ?test;
[2020-11-12T07:03:15.634365-03:00] Plan successfully created
[2020-11-12T07:03:15.635451-03:00] Creating new graph "?test"
[2020-11-12T07:03:15.640759-03:00] Executed plan returned 0 rows
[2020-11-12T07:03:15.640785-03:00] Executing query: INSERT DATA INTO ?test { /u<joe> "parent_of"@[] /u<mary> . /u<joe> "parent_of"@[] /u<peter> . /u<peter> "parent_of"@[] /u<john> . … };
[2020-11-12T07:03:15.6413-03:00] Plan successfully created
[2020-11-12T07:03:15.641313-03:00] Inserting triples to graph "?test"
[2020-11-12T07:03:15.641651-03:00] Executed plan returned 0 rows
[2020-11-12T07:03:15.641661-03:00] Executing query: SELECT ?p1, ?o1, ?p2 FROM ?test WHERE { /l<barcelona> ?p1 ?o1 . /item/book<000> ?p2 ?o2 . FILTER latest(?o1) . FILTER latest(?p2) };
[2020-11-12T07:03:15.641938-03:00] Plan successfully created
[2020-11-12T07:03:15.641946-03:00] Setting global lookup options to <limit=0, lower_anchor=nil, upper_anchor=nil, LatestAnchor=false, FilterOptions={Operation:latest Field:object field Value:}>
[2020-11-12T07:03:15.641971-03:00] Starting to process clauses
[2020-11-12T07:03:15.641972-03:00] Starting to process clause 0: { opt=false /l<barcelona> ?p1@[] ?o1[] }
[2020-11-12T07:03:15.641984-03:00] g.TriplesForSubject(/l<barcelona>, <limit=0, lower_anchor=nil, upper_anchor=nil, LatestAnchor=false, FilterOptions={Operation:latest Field:object field Value:}>), graph: ?test
[2020-11-12T07:03:15.64217-03:00] Received 1 triples from driver, in planner.addTriples
[2020-11-12T07:03:15.64217-03:00] Added 1 rows to table, in planner.addTriples
[2020-11-12T07:03:15.642172-03:00] Finished processing clause 0: { opt=false /l<barcelona> ?p1@[] ?o1[] }, latency: 199.146µs
[2020-11-12T07:03:15.642172-03:00] Starting to process clause 1: { opt=false /item/book<000> ?p2@[] ?o2[] }
[2020-11-12T07:03:15.642182-03:00] g.TriplesForSubject(/item/book<000>, <limit=0, lower_anchor=nil, upper_anchor=nil, LatestAnchor=false, FilterOptions={Operation:latest Field:predicate field Value:}>), graph: ?test
[2020-11-12T07:03:15.642237-03:00] Received 1 triples from driver, in planner.addTriples
[2020-11-12T07:03:15.642238-03:00] Added 1 rows to table, in planner.addTriples
[2020-11-12T07:03:15.642244-03:00] Finished processing clause 1: { opt=false /item/book<000> ?p2@[] ?o2[] }, latency: 65.223µs
[2020-11-12T07:03:15.642245-03:00] Finished processing all clauses, total latency: 272.783µs
[2020-11-12T07:03:15.642253-03:00] Executed plan returned 1 rows
[2020-11-12T07:03:15.642261-03:00] Executing query: DROP GRAPH ?test;
[2020-11-12T07:03:15.642349-03:00] Plan successfully created
[2020-11-12T07:03:15.642356-03:00] Deleting graph "?test"
[2020-11-12T07:03:15.642358-03:00] Executed plan returned 0 rows


c.
Tracing
How to configure verbosity?
Through the BadWolf CLI

bql> start tracing [-v verbosity_level] [trace_file];

bql> stop tracing;



c.
Tracing
How to trace messages in code?
Inspired by the Go open source version of the Google Logging Library (glog)

tracer.V(2).Trace(p.tracer, func() *tracer.Arguments {
	return &tracer.Arguments{
		Msgs: []string{fmt.Sprintf("Starting to process clauses")},
	}
})



d.
Usability
Improvements on the query language (BQL)
● Make it more compliant with W3C recommendations (eg: trailing dot use inside WHERE clause)

● Make it more intuitive for the user

● Make it more reliable, especially in corner cases for binding extractions

● Support a more complete HAVING clause



d.
Usability
Usability example: ID and time binding comparisons inside HAVING clauses

SELECT ?s, ?p, ?p_id, ?time
FROM ?test
WHERE {
  ?s ?p ID ?p_id AT ?time ?o
}
HAVING (?p_id < "in"^^type:text) AND (?time > 2016-02-01T00:00:00-08:00);

Also, optional trailing dot for the last clause inside WHERE (W3C)



03

Demo

Demo - Google Drive link


04

Wrap-up


Wrap-up

Opportunity to work on very different dimensions of BadWolf
1. Query Optimization
2. Profiling
3. Tracing
4. Usability

Touch multiple levels of the implementation of the query language and the BW processing flow
● Lexer
● Grammar
● Semantic / Hooks
● Planner
● Storage / Driver

Participate in technical and non-technical discussions about the future of BadWolf
● Design docs
● One pagers
● GitHub Issues

Learn a lot: Go, SPARQL, project development, design, agile
In spite of not having access to g3docs, google3 and code search


Advancing BadWolf is a lot of work :)

[Figure: GitHub project history (contributions), internship period]

Since we like numbers, during the internship ...

● … 35 Pull Requests
● … 8,800++ additions, 3,584-- deletions
● … 261 commits
● … 14 Issues opened
● … 233+ meetings
● … 0 meals at the “long long table” :(
● … met, worked with and learned from a number of amazing people


Thank you Carol, Thiago, Kloss and the ares-storage@ team!
Thank you very much Xavier, Tati and Jose!



References
● Google Open Source - BadWolf
https://opensource.google/projects/badwolf

● GitHub - BadWolf
https://github.com/google/badwolf

● Google Logging Library (glog)
https://github.com/golang/glog

● “The BadWolf Project - Temporal Graph Store”
Presentation deck, 19 Sep 2017, Xavier Llorà


“Question”(s)?
