
Today I'll talk a little bit about Bighead, which is our end-to-end machine learning platform, and I'll cover the different components. For the agenda, I intend to talk about the background, design goals, an architecture deep dive, and then a little bit about open source plans. I tend to move around a little bit, so I feel kind of bad for the camera man. Okay, you're gonna be okay? Cool.

So I guess overall I just wanted to start: how many people have heard of Airbnb's Bighead? Okay, that's not too bad, a couple of hands. So some of this might be a little bit repetitive, but if you do have questions, I think we have office hours afterwards and I can dive a lot deeper. This is going to be a little bit of a high-level view of exactly what we decided to do with Bighead. So, background information: how many people have stayed in an Airbnb? Okay, it would have made me very sad if there were only a couple of hands. So overall, as you are well aware, Airbnb's product is a kind of global travel community that offers magical end-to-end

trips including where you stay what you

do and the people you meet and machine


learning is imbued within the product

it's been there for quite a long time

And so historically, the teams that have built their own machine learning infrastructure are the search ranking team, smart pricing, and fraud detection. As you can imagine, for as long as Airbnb has existed there has been search, and with search came search ranking. These are the teams that have staffed their own ML infrastructure teams; they are the ones that have invested the effort and the engineers to actually maintain their own infra. But there were

significantly more opportunities to take

advantage of at Airbnb and so some of

the use cases were paid growth: how do we know exactly where to invest our money in terms of advertising, and is there any smarter way we can do it with ML? Classifying and categorizing listings: how do we make sure that we give a better experience to the users by classifying their listings? Experience ranking. Room type categorization, which is a very interesting one: if we have photos of homes, is there any way we can tell which room is the bathroom, the living room, the bedroom? The use case sounds silly when I just say it, but if we do automated ad generation we really don't want to show you five pictures of bathrooms or five pictures of a garage, and you can imagine that really tanks our conversion rate. And so there are so many opportunities for ML, but it's really, really difficult to staff an entire ML infra team for each of

these use cases and so I'm gonna dive

into kind of the two components of how

we viewed ML, and this really led to the inception of the ML infrastructure team. The intrinsic complexity is the complexity of understanding the business domain; these are complexities that are fundamental to machine learning, and you often encounter

these every time you go into a machine

learning problem. And so that's understanding the business domain, selecting the appropriate model, selecting the appropriate features, and fine-tuning and hyperparameter tuning your model. What are the incidental complexities, the complexities that can really be solved and simplified? Some of the incidental complexities are integrating with the Airbnb data warehouse, scaling model training and serving, keeping consistency between prototyping versus production and training versus inference (inference, if people are not familiar, is more like scoring), keeping track of multiple models, versions, and experiments, and supporting the iteration of ML models. These incidental complexities come up often, and the time spent on them is precious time that's not spent on your model. And so

this actually led to a lot of bloat. So ML models would take on average eight to twelve weeks to build, and ML workflows tended to be slow, fragmented, and extremely brittle. And so we've... yeah? Yes, human time, from first thought to... yeah, cool. I don't know what the policy is for questions, but we can take them at the end. Great question though, no worries.

Yeah, and it tended to be extremely brittle. And so you can imagine that as you're deploying models into production, especially when they're going on the Airbnb website or the application,

we want to make sure that these are very

robust and they're resilient to failure

and so the ML infrastructure team was


formed to address these challenges the

vision was that Airbnb routinely ships ML-powered features throughout the product, and the mission is equipping Airbnb with shared technology to build production-ready machine learning applications with no incidental complexity. Obviously the intrinsic complexity will always be there, understanding the business domain as well as building the model; however, we really want to simplify the incidental complexities. Connecting to the data warehouse doesn't need to be reinvented over and over again; we can really simplify that. So

this is kind of the pinwheel effect for ML. This was built by my predecessor, and it's basically: you start with your data management, then you go into your prototyping, you go into your model lifecycle management, and then you go into productionization, and then there's this constant iteration; the more you iterate, the better you can build your ML model. We want to make this as streamlined as possible because, as you can imagine, if we streamline that iteration process there's a lot more potential for development and for new models to be created.


so our design goals these are the four

goals that we had in mind when building our infrastructure. We started about two and a half years ago building out this infra, and the four goals were keeping it seamless, versatile, consistent, and scalable; I'll talk about each one and how it plays a role in each of our components. Defining seamless a little bit further: easy to prototype, easy to productionize, and the same workflow across different frameworks. Making it versatile: it supports all major ML frameworks and meets various requirements around online and offline use, data size, SLAs (service level agreements), GPU training, as well as scheduled and ad-hoc training. Consistency: a consistent environment across the entire stack and consistent data transformation, whether that's your prototyping environment or production, and whether it's an online or offline use case. And keeping it scalable: horizontally scalable and very

elastic as you can imagine Airbnb is a

very seasonal type of business and so we

do see spikes in traffic here and there

and so we want to make sure that we do


have a very elastic infrastructure for

that. So we'll go into the architecture deep dive. Bighead is primarily made up of these seven components. It starts off with Redspot, goes into the Bighead service, and then goes into Deep Thought for real-time inference; then it goes into ML Automator for your batch training as well as your batch inference. Environment management is through the Docker image service, another component we own; execution management is through our Bighead library; and then we have feature data management with Zipline. As you can see, the three services we have underneath, which are the Docker image service, the Bighead library, and Zipline, span across these other services and work in conjunction with them. I'll talk about each one individually: exactly why we built it, why we felt like we needed to build it, what problems it solves, and describe it a little bit further. So diving into Redspot: how many people have at least used Jupyter notebooks before? Awesome, okay, I think that's more than the people who raised their hands for Airbnb here. So Jupyter notebooks, everybody's pretty familiar with them.
We found that Jupyter notebooks are ideal for machine learning because a lot of the machine learning models we create are almost like research papers and projects: it's not just that the end result is important, it's also the development, the ideation of it, the things that you've tested, the things that you've seen. And so Jupyter notebooks really do a good job of persisting all that intermediate state of experiments and trial and error, so you can actually justify why this model is a good one to put in production and the thought process behind it. So what makes an ideal machine learning development environment? It's that interactivity and feedback, making sure that you can execute different cells and that you have visualization; it's access to very, very powerful hardware; and then it's access to the data. So what is Redspot? Redspot is our supercharged Jupyter notebook service. It's a fork of JupyterHub, for those of you who are familiar with JupyterHub. It's integrated with our data warehouse to make sure that individuals don't have to set up the


access themselves it has access to

specialized hardware, so you can actually run on GPUs if you'd like to. It has file sharing between users via AWS EFS; we greatly promote collaboration within Airbnb, and this is a big part of most of our infrastructure: the ML community at Airbnb loves to share their work as well as their ideas, so we make sure that even in Redspot we have built-in functionality so people can share their work with one another. And it's all packaged in that familiar JupyterHub UI. This is a snippet of the Redspot home page: you get to choose your own Docker image, so you get to choose what libraries you have within your Docker container, and then from there you get to choose your actual instance type. If you need something very small, we do support t2.mediums, all within AWS. We also support GPUs; for the really, really expensive instances we usually ask people to check with our team beforehand, just because we are cost conscious and we want to make sure that people aren't spinning up x1 instances on AWS, because that would be very, very expensive for us. But it's nice to have the support where we can have a suite of different types of hardware for different users and different types of use cases.

So doubling back to some of the themes that I've mentioned before. You have consistency: this promotes prototyping in the exact same environment that your model will be used in, through that Docker image; we make sure that you're in the same image that you're going to be deploying into production. Versatility: you get to use customized hardware, whatever you need for your ML use case, and then customized dependencies; the Docker images can support Python 2 and Python 3 (we're slowly deprecating Python 2, just because Python 2 is being deprecated across the industry), but we do allow you to have a suite of different types of libraries within your images. And then seamless: it's integrated nicely with the Docker image service via the Bighead API as well as the UI widgets, and so when you're interacting with the Bighead service you can actually see the exact same visualizations that you would see within your Redspot environment, and you can make sure that you have that consistency across the

board. So the next step is diving into the Docker image service. The Docker image service is the environment customization built into Bighead. So why do we need a Docker image service? ML users have a diverse, heterogeneous set of dependencies; I'm sure everyone who's dabbled in ML here knows that there are many, many different frameworks as well as many different libraries, and the entire industry is very dynamic, changing very frequently in terms of new libraries and new support for things. So users need an easy way to bootstrap their own runtime environment so that they have support for the libraries they need, and it also needs to be consistent with the rest of Airbnb's infrastructure, which is why we've moved towards Docker environments. Dependency customization: we allow people to choose their own customization for their Docker images and create their own images; we've built on top of the Docker API to simplify this for users so that they can create all their own custom images. It also promotes that consistency and versatility, in the fact that users can choose what they want while we make sure that this same Docker image is going to be used within production. This is something we've seen pretty often, where individuals would have different versions of libraries when they're prototyping and when they deploy into production, which can lead to a lot of strange effects in production, and it's something that's very difficult to debug as well.
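To give a rough sense of the kind of thing such a service does under the hood, here is a minimal sketch using the open-source Docker Python SDK. It is an illustration under assumed names (the base image, package pins, and tag are made up), not Bighead's actual Docker image service API.

```python
# Minimal sketch: building a per-user runtime image with the Docker Python SDK.
# Bighead's Docker image service wraps this kind of logic behind its own API;
# the image contents and tag below are hypothetical.
import io
import docker

client = docker.from_env()

# Start from a shared base so the environment stays consistent with the rest of
# the infrastructure, then layer on the user's chosen ML dependencies.
dockerfile = """
FROM python:3.7-slim
RUN pip install --no-cache-dir scikit-learn==0.20.3 xgboost==0.82 pandas==0.24.2
"""

image, build_logs = client.images.build(
    fileobj=io.BytesIO(dockerfile.encode("utf-8")),
    tag="ml-user/custom-env:0.1",  # hypothetical tag
)
for chunk in build_logs:
    print(chunk.get("stream", ""), end="")
```

The same image would then be used in the notebook environment, in training, and in serving, which is exactly the prototype-to-production consistency described above.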

Next we'll talk about the Bighead service. The Bighead service is the model lifecycle management. So why is this needed, why is model lifecycle management needed? Tracking ML model changes is just as important as tracking code changes: it's not just code anymore, it's also the model weights that you have to track, and you have to make sure that you fully encapsulate what you've actually put into this model; we would argue it's a little bit more complex than just tracking code. ML model work needs to be reproducible as well as sustainable: if there is any type of rollback, how do you actually roll back to a previous version of the model, and how do you guarantee that that model is identical to what was trained before? And then how do you compare experiments before you launch models into production? This is also critical: how do you know whether the model that you've built now is better than the previous one? These are all questions that we had in mind, and we went out and built the Bighead service for this exact purpose. The Bighead service does the model versioning; this is the UI

component of it. You can see here you have your model, you have the different trained models, and you also have the versioned git SHAs for exactly what code was used to train that model. It also has timestamps for the date that it was trained, so that you can actually correlate it to exactly the date the features were generated as well, and then you can download the artifacts and actually port them over if you'd like and debug them. We do understand that with going into production there are incidents that always occur, and so we do have links to your Kubernetes cluster so that you can actually debug it if your inference is going wrong in production. And then the qualities again that I mentioned. Consistency: it's a central model management service, and so we keep track of the git SHA to version the code, and we make sure that we have the artifacts, basically all the trained weights, saved within S3, and those are also versioned so that you can roll back. We also want to support, in the future, experimentation to compare different versions of the same model, so that if you trained it on a week's worth of data versus a month, you can actually compare to see if one performs better than the other. It's a single source of truth: it's a nice place where Bighead users can go to see exactly what has been deployed to production, just because models can be fairly complex in terms of dependencies as well as when they were actually last trained. And then seamlessness: the visualizations carry over from Redspot all the way to the Bighead service, to make sure that if you use certain visualizations you have a consistent experience across the board.
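To make the bookkeeping concrete, here is a small, hypothetical sketch of the kind of record a model lifecycle service keeps per trained model; the class and field names are illustrative, not the Bighead service's real schema.

```python
# Hypothetical sketch of model-version bookkeeping: each trained model carries
# its git SHA, training date, and an artifact location, so rollback means
# re-pointing production at an older, fully reproducible version.
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List

@dataclass
class ModelVersion:
    name: str             # e.g. "room_type_classifier" (made-up example)
    version: int
    git_sha: str          # exact code used to train this version
    trained_at: datetime  # lets you line the model up with the feature data used
    artifact_uri: str     # serialized weights, e.g. an s3:// path

registry: Dict[str, List[ModelVersion]] = {}

def register(mv: ModelVersion) -> None:
    registry.setdefault(mv.name, []).append(mv)

def rollback(name: str, to_version: int) -> ModelVersion:
    """Return the exact older version (code SHA plus weights) to redeploy."""
    return next(v for v in registry[name] if v.version == to_version)
```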

The Bighead library. So with ML models, as probably everyone here is very familiar, it's highly heterogeneous: there are a lot of different ML frameworks out there, and Airbnb is full of employees who have experience with different ML frameworks. Some of the frameworks that we've seen are TensorFlow, PyTorch, Keras, MXNet, XGBoost, and scikit-learn; it's just wide across the board. In terms of the heterogeneity of the data as well, the data quality is very, very different across the board, and you have your structured data and your unstructured data. And then in terms of environments, the needs range from GPUs to many, many CPUs to a single CPU, and the dependencies are also very different across the board. So to make it concise: data in production is different from data in training, the offline pipeline is different from your online pipeline, and everyone does everything in a very different way; ML is not standardized yet. And so we've built the Bighead library. It essentially is this framework we've built where you can build wrappers around common ML frameworks, just so that people can use what they're already familiar with. We don't really dictate whether you should use TensorFlow over PyTorch; we say that we have this wrapper so that we can serialize your model no matter what you use,

and now we can make sure that the code deployed in production is the same exact code you're using in the prototyping environment. So basically we have pipelines, and a pipeline is a compute graph for pre-processing, your inference, your training, your evaluation, and visualization. These are composable, reusable, and shareable. It supports popular frameworks like TensorFlow, PyTorch, Keras, MXNet, scikit-learn, and XGBoost. We've built some of the pre-processing steps in C++ just so that we can share some of our best practices, and we've seen a 30x boost in performance compared to the Python implementation. We also have metadata for the trained models persisted within the Bighead library. For consistency, it has a uniform API to make sure that we can actually process data in a very similar way in your prototype as well as your production environments, and it's serializable, which means that we can actually port it over into production and have the online and offline be the same exact pipeline. So this is actually the config. Over here you can see that we've specified the categorical features and then the numeric features, and then we instantiate a pipeline; we say that for the numeric features we want to impute missing values with the mean (NaN-to-mean), and for the categorical features we do a one-hot label encoder, and then we attach an XGBoost classifier at the end and set the hyperparameters. And now we can port this over into offline training or offline inference or even online inference, and it's the same exact pipeline, and we can make sure that we have that consistency of code across all the different environments.
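To make that concrete, here is an equivalent of the pipeline just described, sketched with scikit-learn and xgboost as stand-ins for the Bighead library's wrappers (whose API isn't shown here); the feature names and hyperparameter values are made up.

```python
# Sketch of the described pipeline using scikit-learn/xgboost as a stand-in for
# the Bighead library's wrappers. Column names and hyperparameters are made up.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

numeric_features = ["price", "num_reviews"]
categorical_features = ["room_type", "neighborhood"]

preprocess = ColumnTransformer([
    # numeric features: fill missing values with the column mean (NaN-to-mean)
    ("numeric", SimpleImputer(strategy="mean"), numeric_features),
    # categorical features: one-hot encode
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

# attach an XGBoost classifier at the end and set its hyperparameters
p = Pipeline([
    ("preprocess", preprocess),
    ("model", XGBClassifier(max_depth=6, n_estimators=200, learning_rate=0.1)),
])

# The same fitted object is then reused for offline training, batch inference,
# or online inference:
#   p.fit(train_df, train_labels)
#   predictions = p.predict(serve_df)
```

In the Bighead library the same idea ends with the serialization call mentioned next, so the identical compute graph can be uploaded and served.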

This is just the visualization of the pipeline. This visualization you can have both in that prototyping environment, Redspot, as well as inside the Bighead service, to make sure that you're looking at the same visualizations. And then the nice thing about it is you can serialize it very quickly, just p.serialize(), and this means that you can manually upload the model into the Bighead service if you want, or we can automatically upload it for you when you're ready to deploy into production. So these are some of the visualizations that we have in the Bighead service, and some more visualizations; this is just a simple feature importance. Cool.

Next we'll talk about Deep Thought. Deep Thought is our online inference service. So what's really difficult about making a model serve traffic online? Well, consistency: staying consistent with training is very, very difficult, just because your training environment, where you've trained your model, is going to be fairly different from what you're doing in production, and so it's different data, usually a different pipeline, different code, and different dependencies. It's very difficult for data scientists to launch models without an engineering team, just because it does need to plug into the Airbnb application. It also is very difficult for engineers to rebuild models, just because there's no prior knowledge of how the model was built when porting it over from data science into production. And then it also needs to be very scalable and robust, just because this is going into production where it can affect the critical path; there are resource requirements that vary across the different models, to make sure that, let's say, you get inference in 50 milliseconds so as not to slow down the Airbnb website. So there are a lot of requirements and scalability requirements, and throughput fluctuates quite a bit across time as well; as you can imagine, seasonality is quite a big thing for Airbnb as a B2C business.

So how did Deep Thought solve this? Well, Deep Thought solved the consistency aspect through Docker and the Bighead library, making sure that it's the same exact data source, the pipeline is identical, and the environment is identical to training. We want to make sure that exactly what you did in training and prototyping is identical to what you get in production, so that there's no confusion like "the version of my dependency has changed over time from what was deployed, and it's operating very differently." It's seamless: it integrates with event logging and dashboards, and it integrates with Zipline, which I'll talk about a little bit later, which is our feature management framework. And it's also highly scalable: it's built on Kubernetes, the model pods can scale very easily, and there's resource segregation across models so that we have no noisy neighbor problem where a single model can take down the other models. This is the architecture of how we've built Deep Thought: the client traffic goes through our REST API, it goes into our model manager, which checks with the Bighead service client to see which models are registered, and then it goes through routing, where it routes the traffic to the different model pods that are actually hosting the artifacts. Users get to choose how many pods they want to be on, and we've deployed autoscaling to make sure that if there is a spike in usage we can deploy more pods for that model as well.
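For flavor, here is a hedged sketch of what an online scoring call against a service shaped like this might look like from a product service; the host, route, payload fields, and latency budget are all invented, not Deep Thought's actual contract.

```python
# Hypothetical client-side sketch of an online inference request.
# URL, model name, payload fields, and timeout are made up for illustration.
import requests

DEEP_THOUGHT_URL = "https://deep-thought.example.internal"  # hypothetical host

payload = {
    "features": {
        "room_type": "entire_home",
        "price": 120.0,
        "num_reviews": 42,
    }
}

# The routing layer would look up the registered model via the Bighead service
# client and forward the request to that model's pods.
resp = requests.post(
    f"{DEEP_THOUGHT_URL}/models/room_type_classifier/score",
    json=payload,
    timeout=0.05,  # e.g. a 50 ms budget so the website isn't slowed down
)
resp.raise_for_status()
print(resp.json())
```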

ML Automator. So ML Automator is our offline training and batch inference. Why is this needed? Automating training, inference, and evaluation is necessary because you need to do scheduling, you need to do resource allocation for these jobs, you need to save the results somewhere, you need dashboards and alerts, and you need to do the entire orchestration for this to run smoothly. And so with ML Automator, once again with those themes: it's consistent, because we have that same Docker and Bighead library, the same ones we used for Deep Thought; this is especially nice because users can now do offline inference and online inference with just a simple config change on whether they want to deploy it online or offline. It's seamless, so you can automate tasks via Airflow. How many people use Airflow here? So you can actually automate the tasks via Airflow: you can generate your DAGs for training and inference with the appropriate amount of resources, including whether you want to use distributed Spark for training, and it has tight integration with Zipline for training and scoring data. And it's highly scalable: we use Spark for distributed computing across large data sets. This is just a picture of the Airflow UI and the DAG that's automatically generated through ML Automator.
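As a small illustration of the kind of scheduled workflow being generated, here is a minimal hand-written Airflow DAG with a train step followed by a batch-inference step; the DAG id, schedule, and task callables are hypothetical, since ML Automator generates its DAGs from a model's config rather than by hand.

```python
# Minimal Airflow sketch of a daily train -> batch-score workflow.
# Names and callables are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def train_model(**context):
    # e.g. fit the serialized Bighead pipeline on the latest Zipline features
    pass

def batch_inference(**context):
    # e.g. score the latest partition with the freshly trained model
    pass

with DAG(
    dag_id="room_type_classifier_daily",  # hypothetical model name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    train = PythonOperator(task_id="train", python_callable=train_model)
    score = PythonOperator(task_id="batch_inference", python_callable=batch_inference)
    train >> score
```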

And then lastly we will talk about Zipline. Feature management is incredibly difficult; we've allocated about 50% of our team (we're a team of about 12 now) to Zipline, because feature management is what we found to be one of the hardest things to do for ML. We've also found that a lot of data scientists are spending most of their time on feature engineering; the model management and model lifecycle is another portion of it, but a lot of the hard work is spent on feature engineering. So why is it incredibly difficult to do feature engineering correctly? Inconsistency between your offline and online data sets. What does that mean? It means that the data that you have when you're running things in production might not be the data you're seeing when you're training. And I do have a nice tie-in; maybe I can just show the diagram quickly. So I have this diagram here, which is the essence of why this is so difficult. In most companies, previously, before Zipline was the middle layer, you had your production data stores that would be used for model scoring, and then it would go into a data warehouse and through ETL jobs that aggregated daily, and that's the data you're using to train.

And so it turns out that data can be very, very different from the data that you'll see live in production. So how do you make sure there is no label leakage, right? How do you make sure that you're not training on data that looks like a good proxy for how your model should predict scores, but then when you deploy into production the accuracy drops tremendously because of it? Which is kind of the difficulty with the offline/online data sets. It's also very, very tricky to generate training sets that depend on time correctly, and so one of the most difficult things is getting that point-in-time correctness, especially, as you can imagine, if it goes through multiple ETL jobs that aggregate it daily. You need to make sure that you persist the exact, precise timestamps: this user clicked on this button here at this timestamp, and these are the sequence of events, and now I can do a personalized search result saying, hey, this person is interested in this Airbnb listing. Other problems that make it really, really difficult are training set backfills; it really takes a lot of expertise to do backfills correctly and in a reasonable amount of time, and it usually takes a number of years of practice and painful understanding to realize that backfills are no joke, and making them efficient is very difficult as well.

Inadequate data quality checks or monitoring: we've seen that feature drift does happen over time, your features do change, and it's best practice to keep retraining your models depending on how stale they get; but over time there are models that don't have this hygiene put into place. So how do you know if your features are drifting slowly over time, and how do you know whether to retrain your model or not? And then the other hard thing about it, especially at Airbnb, is unclear feature ownership: who owns what feature, especially because these features are shared across the entire company, and so if the feature breaks, if an upstream pipeline breaks, who's responsible for

fixing it? These are fundamentally some of the problems that made it extremely difficult for us, and so we went out and built Zipline. Zipline maintains that consistency across data for training and scoring; it maintains that consistency between the data you use in development and in production. It also has that point-in-time correctness, to make sure that we don't have any label leakage, and to make sure that if you do rely on an intraday sequence of events, we persist that as well. It's seamless: it integrates with Deep Thought and ML Automator, so that you can make sure you're using the same exact features in prototyping as well as in production. And then it's highly scalable: it leverages Spark for batch and it uses Flink for streaming workloads. The way we've solved it is through this kind of architecture, where we have Zipline in the middle as a middle layer, and Zipline is the one that maintains that consistency between your data stores and your data warehouse. As the data comes into the data warehouse, whether it's through events or through database mutations, we ingest it raw, and we actually persist all of these mutations and events so that we can recreate your data at any point in time. This is amazing, because now if you choose to have a snapshot of your data at the end of the day, we can actually recreate it from the transactions and the mutations, to make sure that all the data that you see there will be the same data that's available to you when you're actually scoring online as well. So this helps with preventing data leakage and label leakage, and makes sure that if you're relying on a sequence of events, we persist those sequences as well. And this is nice because now we have a single store where we can train our models, and we can use that training data for actually scoring, making sure that consistency is in place; it's the same exact data across the board.
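To illustrate the point-in-time correctness idea (not Zipline's implementation, just a small PySpark sketch over assumed tables and column names), the rule is that a training row may only see the latest feature value whose timestamp is at or before that row's label timestamp:

```python
# PySpark sketch of a point-in-time ("as-of") feature join. Assumed tables:
# labels(user_id, label_ts, label) and feature_events(user_id, feature_ts, clicks_7d).
# Illustrative only; this is not Zipline's actual implementation.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
labels = spark.table("labels")
feature_events = spark.table("feature_events")

joined = labels.join(feature_events, on="user_id").where(
    # only feature values observed at or before the label timestamp,
    # which is what prevents leaking information from the future
    F.col("feature_ts") <= F.col("label_ts")
)

# keep the most recent qualifying feature value per training row
w = Window.partitionBy("user_id", "label_ts").orderBy(F.col("feature_ts").desc())
training_set = (
    joined.withColumn("rn", F.row_number().over(w))
          .where(F.col("rn") == 1)
          .drop("rn", "feature_ts")
)
```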

So, the overall summary: we've built this end-to-end platform to build and deploy machine learning models to production that is seamless, versatile, consistent, and scalable. Model lifecycle management has been a huge part of that effort as well; this is one of those things that won't bite you the first time around, because you've just built your model, it bites you on the second and third time, and if you ever need to roll back to a previous model it really, really is painful if you don't have good model lifecycle management. Feature generation and management was also a key point, with Zipline; this is also one of the most painful areas, where feature engineering could be simpler, and we can really scale this out to the company and abstract away a lot of the pain points for users. It turns out that a lot of the features being used across the board at the company are shared, so if we solve it once, we solve it for quite a number of people who would otherwise be rebuilding the same features over and over again. Keeping that consistency between online and offline inference: we do that through ML Automator, the Bighead library, as well as Deep Thought. And then the pipeline library supports the major frameworks; reiterating the fact that Airbnb is full of engineers who come from many different companies, the Google engineers really like using TensorFlow, and the data scientists who come from Facebook really like to use PyTorch, as you can imagine, so we really don't want to enforce that everybody has to use TensorFlow or everybody has to use PyTorch. This is a stance we've taken that's a little different from a lot of other companies: if you look at TFX or a lot of the other frameworks, they've built on top of one single framework, and they're not really agnostic to whichever ML framework you want to go with. Our stance is, if you want to use a different library we're open to it, and we have a flexible Bighead wrapper so you can actually build these wrappers around your library and use it however you want.

And then we have the Docker image customization service; Docker has been a huge saver for us in making sure that we have this consistent environment from prototyping to production. And then multi-tenant training environments as well as multi-tenant serving environments, just making sure that we can scale both in training and in scoring, because some of these cases get very, very massive, and more and more of the deep learning models, with more layers coming into them, require more and more resources to run properly, so we want to make sure that we're highly flexible there. Bighead is built on a lot of open-source technologies: TensorFlow, PyTorch, Keras, MXNet, scikit-learn, XGBoost, Spark, Jupyter, Kubernetes, Docker, Airflow. And so we're kind of following suit and we're looking to open source it as well. Those of you who are already familiar with Bighead probably know that we've been saying this for quite some time, but we have made concrete steps towards it; we had to delay a little bit as everybody gets ready for the IPO and there's a lot of work to be done still, but we are going to open source it. We're right now in the phase of selecting our first couple of private collaborators; essentially we want to dip our toe into the open source space and see exactly how much work it is. So if you're interested, please email me; we're looking for partners, just a couple, to really test the waters and see how much work open source is going to be, and then we'll be moving more towards broad open source where it'll be open to everyone. Cool, any quick questions?

Hi, can you describe a little bit how the teams came together to build this and maintain it? You know, did Airbnb have to restructure teams to make it happen?

Yeah, so the truth is the trust team at one point was maintaining a lot of their own infra for machine learning, and it turns out that there were two other teams doing this as well, and from that point it became pretty clear that building and rebuilding ML infrastructure was not the goal of the company. And so this team came together with a couple of engineers who thought we should really standardize this across the board, and from there it's been growing and growing in terms of the number of use cases that have been onboarded. This is actually the third time we've built ML infra, so we're quite experienced at this: we've had a v1, and even before that we had something called Aerosolve, which I don't know if anyone's familiar with (there's a nod over there), but it was a learning experience across the board. And so, yeah, it's more that we've been bitten enough times that we keep rebuilding it, and it's a lot of work for engineers to maintain ML infra over and over again, so standardizing it made a lot of sense. We chartered a team, got together, and started building this out, but it started not so much top-down, more bottom-up. Yeah.
