Morning everyone, and thank you for joining the Google Cloud Lunch and Learn.

my name is Ingrid comes Ellis and I'm

the Google cloud Sales Director is I

don't mean doc so we on behalf of the

Google Cloud family, I would like to thank all of you for joining us. I hope that you, your family, as well as your colleagues are doing well during this challenging time. Lunch and Learn is part of the Whitney Tech Talk series. Our goal is to serve you better. Previously we covered topics such as business continuity, DevOps, communities. Our goal is to have this session be as interactive as possible. We want to share best practices with our customers. I encourage you to sign in now; the link is available in the live chat box. As soon as you sign in, we'll send you the deck. Please provide us feedback on what you would like us to cover during the next session. We learn with you and from you.

After today's session you will have the opportunity to discuss questions. Please write your questions in the YouTube live chat box. Without waiting, I have the pleasure and the honor to introduce to you today our speaker, Eric Schmidt. Eric is the developer advocacy lead for data analytics and applied data science within Google Cloud. His team focuses on enabling data engineers, data analysts, and data scientists across data pipelines, data warehousing, streaming analytics, and business intelligence workloads. Prior to his advocacy role, Eric was the founding product manager for Cloud Dataflow, bringing forth a unified model for batch and stream processing. Today we'll talk about the feature pillars in BigQuery, with a lens toward how you will use them for various scenarios. So without waiting, Eric, you have the floor.

Thank you, thank you, thank you. Let me flip over to present.

Well, good morning or good afternoon, wherever you may be. Hello, and welcome to an overview of BigQuery as we look at features, use cases, and best practices. As Ingrid said, my name is Eric Schmidt, and I lead the data analytics and data science developer advocacy team within Google Cloud. My team works with customers and practitioners at large, helping them to build data analysis pipelines. You can reach me at cloudy at or on Twitter at not that Eric, where my opinions are my own.

Thank you for joining me today. To start with, I'd like to lay out the overall focus for this talk. This talk is about BigQuery, but more specifically, this talk is about BigQuery primarily through the lens of technical practitioners. I'm talking about the DBAs who manage infrastructure and schema changes and deal with access control, or data engineers who build data transformation pipelines and toil over the ever-expanding processing demands of data, or data analysts driving descriptive analytics, and data scientists seeking to formulate new questions about the past and the future. Now, if you're an IT DM making purchasing decisions, or a CSO wrangling cross-organization security challenges, or maybe a chief decision-maker, you should also find this talk useful, as it will shine light on the value proposition of BigQuery beyond technical fundamentals.

If you're new to Google Cloud, or maybe a seasoned pro, it's still important to understand the product landscape for this talk, as I won't just be addressing BigQuery. I'll also be talking about and expanding into the broader product portfolio inside of the data analytics offerings on Google Cloud. Now, as a side note, for a little bit of fun, here is a printable poster that has every Google Cloud product along with a four-word-or-less description. When I started on Google Cloud back in 2013, I think there were six products, of which BigQuery was one. So grab the high-res poster or wallpaper from the GitHub link below, print it out, and you can impress your fellow cloud and data nerds.

Now, instead of looking at products, I'm going to dive into a concept that we call solutions. Google Cloud Platform has eight primary solutions, smart analytics being one of them. The way I like to talk about this is that it's easy to talk about a single product; it's simple. But in the real world, as practitioners, we live and operate from a solutions perspective. This is typically framed as: I need to solve for X problem, and in order to do so I need A, B, and C products implemented with Y and Z patterns. This framing can get a little messy, because you start mixing and matching different processes, products, and patterns. So we shift away from data analytics as a product grouping and we dive into smart analytics.

Now, a quick warning: I have a few more marketecture slides to make sure we're on the same page, and then we're going to get into SQL and code. Here we're going to lay out what the Google Cloud smart analytics platform looks like. Google Cloud smart analytics enables four solution areas: streaming analytics, data lake modernization, data warehouse modernization, and business intelligence with integrated data science workflows, with BigQuery at the heart of these architectures and solutions.

So, using the solution rubric that I stated in the last slide, we can state a problem like: I need to calculate store sales in real time, and for that I need a streaming ingestion mechanism, a streaming analytics engine, and a scalable, durable cache, stitched together with selected patterns.

So what is BigQuery, and why am I talking about streaming and data lakes? BigQuery is a serverless, fully managed, highly scalable, and cost-effective cloud data warehouse designed for business agility. It enables you to analyze bytes, megabytes, or petabytes of data using standard ANSI SQL at blazing-fast speeds, with zero operational overhead and with enterprise-grade security. At the same time, BigQuery also provides integrated GIS capabilities, it has an in-memory analysis service, and it also has built-in machine learning features.

Oh, by the way, BigQuery turned 10 years old this week. When I started back in 2013, we had just added joins as a feature. So, 10 years of innovation, and if you go online you can Google for lots of happy birthday wishes and some reflection by various product team members, engineers, and others in the ecosystem who have benefited from this amazing product. So happy birthday, BigQuery.

Now that you know what BigQuery is at a high level, let's look at how it's built. What is inside of this implementation? The first key point on this slide to understand is the separation of the storage system that stores information to be queried and the query engine itself.

Now, this separation provides several important benefits. First of all, BigQuery stores data in a proprietary columnar format called Capacitor which, as it evolves over time, can provide more, specifically through data layout optimization. The data is physically stored on Google's distributed file system, which is called Colossus. This ensures durability through something called erasure encoding, where we store redundant chunks of data on multiple physical disks, and moreover this data is replicated throughout multiple data centers. Another key point is that as optimizations occur in the query engine, they can be rolled out to the entire customer base with zero downtime. So we can mix and match how optimizations are added to both the storage system and the query engine.

The second key point to understand here is that these sub-services, storage, compute, et cetera, all run on a serverless resource model, which behind the scenes is fully managed by Google. So if you're coming from an on-premises and/or a monolithic data warehouse and/or data lake implementation, this may be a completely different model for you, a model that changes how you think about how you build and manage such systems. Drilling in deeper, it changes the game because BigQuery replaces this concept of typical hardware setup and deployment. There is nothing to deploy with BigQuery, or manage, from an infrastructure perspective. From there, you can quickly iterate and grow a basic data mart using dataset and table constructs. You can then use and deploy tables as a source of truth to define schemas for a data lake, both within BigQuery or outside of BigQuery's storage system. Your data lake may contain files elsewhere, like on Cloud Storage or Google Drive, or in a transactional system like Cloud Bigtable. BigQuery also uses Cloud Identity and Access Management, or IAM, to grant permissions to specific actions within BigQuery. This replaces traditional concepts like SQL GRANT and REVOKE. Ultimately, you can develop and deploy insights in real time, at scale, with BigQuery.

Now, with all this framing in place, let's dig in to some SQL and code. As I was preparing for this talk over the weekend, I, like the majority of you, was also experiencing some very challenging times, with protests, civil unrest, and dealing with the COVID pandemic. These are very challenging times. At the same time, here in Seattle, where I live, we also experienced some very violent weather over the weekend, which included lightning, to which my seven-year-old asked, while we were eating dinner, if we get more lightning here in Seattle than grandma does, who lives on the East Coast. This is something we talk about a lot, mainly because we don't get a lot of lightning here in Seattle.

So for this demo we're going to do a very simple data analysis, data science experiment, which is: let's formulate a question. Do we get more lightning here in Seattle than where grandma lives on the East Coast? So we started thinking about this. Where can we go and get that information? Can we go to the news? Could we go to the NOAA website? Could we Google it? It turns out this is a fairly challenging question to ask if you want to compare lightning strikes between two cities. Also, once we start querying the data, we have to understand what the boundaries are. What is "here"? If we live in Seattle, where are we defining "here" in terms of those lightning strikes? Are we talking about our neighborhood, across the city, et cetera? And then, once we have the ability to query the data, can we visualize it to see if it makes sense, and does it answer the question? So with that, let's flip over to BigQuery.

What you see here is the BigQuery cloud console, and on the left-hand side, if I scroll down, you start to see different projects that I have access to. You have access to some of these projects as well. For example, you have access to something called the BigQuery public data project, and inside of that there are a bunch of freely, publicly available datasets that you can use to perform analysis. If I scroll down to one, NOAA lightning (there are a lot of datasets in here), you can see that I have tables for lightning strikes from the current year all the way back to 1987, and the schema looks like this: basically day, the number of strikes, and the center point. If I preview it, it's a fairly basic layout.

So I sat down with my son and I started asking questions like, how close do you think a lightning strike would be if it was close to our house? And we agreed upon roughly five miles. Since this is in meters, I just did some conversion, and then from there we used built-in GIS functions inside of BigQuery to calculate the distance between where the lightning struck, according to the record, and the center point in our neighborhood, both for our house and grandma's house. And then from there I'm using something called table wildcards so that I can query across all of these years and do some aggregations. So let's run this query and see what happens.
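The query itself is not shown in the transcript, so the following is only a minimal sketch of what such a query might look like, built as a string in Python. The table pattern and the column names (`number_of_strikes`, `center_point_geom`) are assumptions about the public dataset's schema, and the coordinates are illustrative, not the speaker's. `ST_DISTANCE`, `ST_GEOGPOINT`, and `_TABLE_SUFFIX` are standard BigQuery SQL features.

```python
METERS_PER_MILE = 1609.344  # length of an international mile in meters


def miles_to_meters(miles: float) -> float:
    """Convert miles to meters, the unit ST_DISTANCE returns."""
    return miles * METERS_PER_MILE


def build_lightning_query(lon: float, lat: float, radius_miles: float = 5.0) -> str:
    """Build SQL that sums strikes within a radius of a point, per yearly table."""
    radius_m = miles_to_meters(radius_miles)
    return f"""
SELECT
  _TABLE_SUFFIX AS strike_year,            -- part matched by the wildcard
  SUM(number_of_strikes) AS strikes_nearby
FROM `bigquery-public-data.noaa_lightning.lightning_*`   -- assumed table pattern
WHERE ST_DISTANCE(center_point_geom, ST_GEOGPOINT({lon}, {lat})) <= {radius_m:.2f}
GROUP BY strike_year
ORDER BY strike_year
""".strip()


# Illustrative coordinates only (roughly central Seattle).
seattle_sql = build_lightning_query(lon=-122.33, lat=47.61)
```

Running the same builder with a second pair of coordinates would give the comparison query for the other location; the resulting strings can be pasted into the console or passed to a client library.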

Keep in mind, in the background BigQuery just went and scanned all of the data in storage, and we scanned 4.9 gigs in roughly 5.4 seconds. So this query looks like it's producing reasonable results. I have the distance to our house, which seems very, very far away, and this strike was fairly close to grandma's house on the East Coast.

Great. So I've effectively built this query, but it really doesn't tell a story yet. Does it answer the question of who effectively gets more lightning?

What I'm going to do is flip over to another tool, which is called AI Platform Notebooks. Once I jump out of BigQuery, I can jump into AI Platform Notebooks, and I have a notebook already pre-built that helps me interact live and in real time with that same information inside of BigQuery. We're going to talk more about different access mechanisms later, but suffice it to say, I'm going to import some libraries, I'm going to connect to BigQuery, I'm going to take that same query that I wrote, and we'll go ahead and execute it here. So I go out, rerun the same query, get the results back, and pump them into a data frame so I can see them here.

This is pretty helpful, but let's go ahead and visualize this and bring it to life. In this view I'm going to go ahead and filter to lightning strikes that are close to our house. Over the years, you can basically see we don't get a lot of lightning. Some days it's basically zero. It looks like we had a peak here; there were a lot of strikes back in 1999, but "a lot" being 40. If I go down and look at what happens in Pittsburgh, around grandma's house, you can see the scale is much different. They have multiple storms that range in the peaks of 200, and on average, if I just do an eye test, lightning strikes range between 50 and 100, with a really, really big surge. This was a massive storm back in July of 2018. So there we go: we can definitively say that yes, Pittsburgh gets way more lightning strikes, at least in this location, compared to what's happening in Seattle in our neighborhood.

The beauty here is we were able to do it all with zero infrastructure. I didn't have to ingest any data, I didn't have to move any data. All I had to do was express a query, execute it, and then visualize it. Now, as an aside, my son then started to ask, do we get more rain here in Seattle versus Pittsburgh? And I just looked at him and said, hmm, I'm not sure, but there's another public dataset we could play with. And to that, he was no longer interested.

So let's get back to BigQuery and features. What I'm going to walk through next is how you interact with BigQuery. There are multiple ways to do it. I just showed you one of them, which is through the developer console. This is typically where you would spend a lot of your time initially as you learn the product, mainly because it provides you almost complete access to all of the API operations within BigQuery. It's also easy to access job and query histories, so as I execute queries like I did, I can go back and pull up queries that I've run in the past. I can also save them and share them easily with others. The editor also provides visual hints as you're typing out syntax to make sure that those queries are correct. You can also do some inline exploration with Sheets and Data Studio. So over here I could say "explore data" and push this out to Sheets and start doing very similar analysis inside of Sheets versus doing it inside of an IPython notebook. We'll talk more about this later.

Another way to interact with BigQuery is through the BigQuery SDK that comes with the Google Cloud SDK. This is a command-line SDK, and it is an excellent way to learn the full API. You can imagine doing nothing in the console that I just showed you and doing everything from the command line, so this in essence helps to build a lot of muscle memory about different functions and ways to interact with BigQuery. Now, truth be told, I end up using a combination of all these tools that I'm showing you. Specifically, I like to use the command-line tool as basically an auditing tool. As I may be running things in the console, or I have background jobs running in code, I can also use the command line to basically audit and tail what's happening with the system. One other nice benefit is that inside of Google Cloud we have something called Cloud Shell. If you fire up a Cloud Shell, the Google Cloud SDK is pre-installed, and it is a free compute instance for you to basically do any type of console-based SSH work.

Another way to interact with BigQuery is through client libraries, or you can build your own library on top of the BigQuery REST API, which is what the client libraries are built on top of. In the IPython notebook that I showed you, I'm using the Python client library. I highly suggest that you use the client libraries as pre-built, because they're fully supported and we have a common documentation and sample model across all of these libraries. As another side note, I was using AI Platform Notebooks, which is a feature of Google Cloud Platform, and it has all these libraries pre-installed for you as well. So as I jumped into this notebook, I had to do very, very little bootstrapping in order to get started.

And, we only talked about this quickly, you can easily integrate with BigQuery through Google Sheets as well as Data Studio. The reverse is true too: I could start my journey inside of Sheets or inside of Data Studio, which is a way to build visualizations, or I could go ahead and pipe results out of the console into these environments. Now, taking a step deeper into more classic business intelligence tools, we have a broad spectrum of partners with deep integrations with BigQuery, specifically Tableau and Looker. So net-net, you have choices out there. If you find yourself spending maybe a little bit too much time in the console and you're restricted in what you're trying to do, look around; you have lots of options.

Next, I'm going to turn to talk about ingestion and interacting with data in more detail.

Like the access options that I just showed you, you have a collection of ways to ingest as well as query data from BigQuery. On the left-hand side, you can use tools like Cloud Dataflow or Dataprep to transform and load data into BigQuery. BigQuery can also import data in various formats: CSV, JSON (as long as it's newline delimited), and you can also import Avro, Parquet, and ORC formats. You can use the Data Transfer Service, DTS, to automate ingestion and/or transfer data from other cloud providers. So if you have data on AWS S3, you can use DTS to pull that data over and load it into BigQuery. You can also use DTS to connect to SaaS applications like Salesforce and SAP.

Another way that you can import data is straight through the API. Basically, any place where you can get code up and running, you can insert data into BigQuery tables. So you could be running on a Compute Engine instance, maybe inside a container on Kubernetes, App Engine, et cetera. The one caveat here, though, is that you have to recreate all the data processing foundation: making connections, dealing with error cases, et cetera. So a best practice is to use a high-level primitive, something like Dataproc or Dataflow, that has these types of implementations built in, and I'll show you that in a minute.
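To give a feel for the "foundation" work the speaker says you take on when inserting through the API directly, here is a minimal sketch of one such piece: retrying a failed insert with exponential backoff. It is not from the talk. `insert_rows` is a hypothetical callable standing in for whatever insert call you use; it is assumed to return a list of per-row errors that is empty on success.

```python
import random
import time
from typing import Callable, Dict, List

Row = Dict[str, object]


def insert_with_retry(
    insert_rows: Callable[[List[Row]], List[dict]],  # hypothetical insert call
    rows: List[Row],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Try an insert up to max_attempts times; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            errors = insert_rows(rows)
        except ConnectionError:
            # Treat a dropped connection like any other retryable failure.
            errors = [{"reason": "connection"}]
        if not errors:
            return attempt
        if attempt < max_attempts:
            # Exponential backoff with a little jitter before the next try.
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
    raise RuntimeError(f"insert failed after {max_attempts} attempts")
```

A real ingestion path would also need batching, deduplication, and handling of rows that can never succeed, which is the argument for leaning on a managed service that already does this.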

Now, on the right-hand side, you can access and query data across other BigQuery datasets, just like the one that I showed you with the lightning dataset, where all the storage is managed in another project, our public datasets project. You can also execute federated queries over other databases from BigQuery, like Cloud SQL, which is a managed version of MySQL.

Now, one of the unique aspects of BigQuery is that you can choose to load data in batches, or you can load data in real time via streaming ingestion, and there are a few decision points here on when to use which. If you need your data updated on a daily or weekly basis, batch loading is most likely a solid choice. Now, if you start to reduce your load window, say to under five minutes, and you have lots of tables to ingest, streaming ingestion may be a better choice. You can mix and match these models. For example, you may have one table that is batch loaded, say on a daily or weekly basis, and then another table which is stream loaded from the same source, but where you're only doing 20% sampling of real-time events.

Another way to load or generate data is through a simple SELECT statement. You basically select results from a local dataset or from a remote dataset and then save those results back into your dataset to create new tables. I'll show you a demo of this in a minute.
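For the mixed model mentioned above, where a streamed table carries only a 20% sample of real-time events, one possible way to choose the sample is to hash a stable event identifier, so the same event is always kept or always dropped. This is a sketch under that assumption; the talk does not say how the sampling is done, and `event_id` is a hypothetical field.

```python
import hashlib


def keep_event(event_id: str, sample_percent: int = 20) -> bool:
    """Return True for roughly sample_percent of ids, consistently per id."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash onto a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_percent
```

Because the decision depends only on the id, replays and retries of the same event do not change which events end up in the sampled table.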

So let's look at a streaming workflow. We've talked about streaming and the ability to ingest data in real time, so I'm going to flip over to this console, which is Google Cloud Dataflow, and specifically this is a pipeline that has been running for about a day or so. This pipeline is ingesting transactional data from a fictitious retailing system that I built. It's also ingesting clickstream data from browser activity related to this retailer, and it's also looking at some real-time stock information. It's ingesting this information from Cloud Pub/Sub, it's doing some transformations on those streams, it's doing a little bit of aggregation, where I'm looking at counts within a particular time window, and then it also passes all the data through and writes it into BigQuery in real time. You can see here on the right-hand side we're doing roughly around 227, and maybe we spiked up to around 700, events a second. So as new transactions come in, they're processed in real time, written into BigQuery, and also aggregated in real time.

With this streaming pipeline in place, I can go back over to BigQuery, and we're going to run a query that tails the table where the data is being written from my Cloud Dataflow pipeline. So let's go ahead and run this. I have no cached results on, and what we should see is basically the event time for the events that are flowing through the pipeline, the current timestamp, and then the event that I'm looking at, which in this case is just some basic clickstream data and the page reference that it was associated with. This will probably take around 23, maybe 25, seconds to run. So there you go: we scanned 72 gigs of data, that took 28 seconds, and we're running right about a second behind live. If you look at UTC now on your clocks, you can see that we are indeed streaming live into this table. This is great.
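The "counts within a particular time window" step in the pipeline can be pictured with a small plain-Python sketch. This is not Dataflow or Beam code and not the speaker's pipeline; it only illustrates fixed (tumbling) windows, assuming each event is a pair of an event time in seconds and a page name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_per_window(
    events: Iterable[Tuple[float, str]], window_s: int = 60
) -> Dict[Tuple[int, str], int]:
    """Count events per (window start, page) using fixed windows of window_s seconds."""
    counts: Counter = Counter()
    for event_time, page in events:
        # Every timestamp falls into exactly one window, keyed by its start.
        window_start = int(event_time // window_s) * window_s
        counts[(window_start, page)] += 1
    return dict(counts)
```

A streaming engine does the same grouping continuously and also has to decide when a window is complete, which this batch-style sketch sidesteps.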

So I have effectively ingested and acquired real-time stream information and put it into BigQuery so it's immediately actionable. Now, the next bit I want to show you is how to query external sources. In this case I'm going to query some data that's living inside of Cloud SQL. I showed you that I had real-time clickstream information coming in from Dataflow into BigQuery, but I also have some stock-level information that I need to reach out to that's sitting inside of a MySQL implementation. For that I have something called a federated query, and I'm going to switch my projects over to the project where my federated connection exists. If you see here, I have something called local operations, for which the connection type is Cloud SQL MySQL. Up top I have a query, and there is a FROM clause that says I want to extract some information from an external query, from local operations, looking at stock levels. In this case, this Cloud SQL instance has the stock-level information that's manually reconciled on a daily basis, but at the same time I have all my sales information, my orders information, that I showed you being streamed through Cloud Dataflow, and I want to join these two. So let's go ahead and run this query. I'm going to reach out to execute a query in Cloud SQL as well as run a local query in BigQuery, join these two, and I can look at reconciliation. So at this store, for this day of sale, I sold 85 units, and at the end of the day I had 56 units on hand. This is really powerful, because instead of me having to do some type of ETL job and pull all that information out of Cloud SQL into a local table in BigQuery, I can simply reach out, federate, and do the join on the fly.

The other thing I want to quickly show is query materialization. I have a query that I ran last night. It took almost three hours to run, and it processed 24 terabytes of data. Up top, you can see it's a fairly simple query. I'm basically doing a GROUP BY on some order information, by week, by store, by order hour of day. In essence, I wanted to see what type of waterfall I might have on a per-hour basis inside of a store. So I scanned the entire order lines by geo table, which resulted in processing 24 terabytes of data, and I could use the output of this query as a fact table for some type of data analysis. In this case, all I have to do is say "save results", go to BigQuery table, and I'll call this maybe facts by store week, and if I hit save, I can project all the results from that query back as a table. Now, I could continue to run this query over and over again, say on a weekly or a monthly basis. The point here is I'm basically doing my transformation and loading inline inside of BigQuery instead of having to move this data out and run those transformations.

Now that we have some architectural fundamentals in place, let's look at the resource economy and also understand some performance tips.
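Before moving on, the federated join from the Cloud SQL demo can be sketched as a query string. The query shown on screen is not in the transcript, so every table and column name below is hypothetical; only `EXTERNAL_QUERY`, which takes a connection id and a query to run in the external database, is the actual BigQuery function.

```python
def build_federated_join(connection_id: str, external_sql: str, orders_table: str) -> str:
    """Build SQL joining a BigQuery table to the result of a Cloud SQL query."""
    # Escape backslashes and double quotes so the inner query survives
    # being embedded in a double-quoted SQL string literal.
    escaped = external_sql.replace("\\", "\\\\").replace('"', '\\"')
    return f'''
SELECT o.store_id, o.sale_date, o.units_sold, s.units_on_hand
FROM `{orders_table}` AS o
JOIN EXTERNAL_QUERY("{connection_id}", "{escaped}") AS s
  ON o.store_id = s.store_id AND o.sale_date = s.stock_date
'''.strip()


# Hypothetical names throughout: connection, tables, and columns.
federated_sql = build_federated_join(
    connection_id="my-project.us.local-operations",
    external_sql="SELECT store_id, stock_date, units_on_hand FROM stock_levels",
    orders_table="my-project.retail.daily_store_sales",
)
```

The inner query runs in the external database and only its result set is joined in BigQuery, which is what removes the need for a separate ETL copy.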

BigQuery is very powerful, so it's important to understand the resourcing, how resources are used, and the cost, so you can better serve your needs. Now, if you recall that slide about the query engine, take note of those little compute units inside of the query engine, a.k.a. Dremel. Those little compute units, in BigQuery parlance, are called slots. A BigQuery slot is a unit of computational capacity that's used to execute queries. BigQuery, underneath the covers, automatically calculates how many slots are required by each query, depending on its size and complexity.

A slot, at the end of the day, is just a combination of CPU, memory, and networking resources. It also has a couple of other technologies and sub-services in it. A slot, from an engineering perspective, is approximately half a VM of compute and one gig of memory, although those specifications keep changing over time, because as data centers are upgraded and the underlying hardware is upgraded, the abilities of these slots continue to improve. So under the hood, think of the analytics throughput in BigQuery as really being measured in slots. If you want things to run faster, you apply more slots. If you have more concurrent queries and you don't provide more slots, then you basically have slower throughput. So there are ways to modulate how fast you want to run, or how fast you want to deplete outstanding jobs.

Now that you understand slots, let's talk about the pricing model. There are effectively two pricing models that you can mix and match inside of your organization or projects. One is referred to as on-demand, which is really a consumption-based model. The other one is called flat-rate, which is a capacity-based model. With on-demand pricing, which is the default pricing that you get with BigQuery, you pay for the amount of data processed, at five dollars per terabyte. Your first terabyte per month is free, and by default projects get a 2,000-slot allotment. So you basically get 2,000 slots to execute queries, and there is some burst ability, but only as available; there's no guarantee that you're always going to be able to burst over 2,000 slots. On-demand pricing is really good for spiky workloads, as long as you understand the overall capacity load that you're going to be pushing through those slots. Now, one of the issues is that it's sometimes a little bit more challenging to budget due to variability, because if you're running queries that are processing different amounts of data, one day you could be spending, say, $10, and the next day you could be spending $25, depending on how much data you're pushing through the system. So net-net, it's a little bit more challenging as concurrency grows and the complexity of your queries grows.
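The on-demand arithmetic described above is simple enough to sketch. The figures are the ones quoted in the talk (five dollars per terabyte, first terabyte each month free) and may not match current pricing. Treating a terabyte as 2**40 bytes is an assumption of this sketch.

```python
TB = 2 ** 40  # assumption: one "terabyte" is counted as 2**40 bytes


def query_cost_usd(bytes_processed: float, price_per_tb: float = 5.0) -> float:
    """Cost of a single query's scan, ignoring any free allowance."""
    return bytes_processed / TB * price_per_tb


def monthly_on_demand_cost_usd(
    bytes_processed: float, price_per_tb: float = 5.0, free_tb: float = 1.0
) -> float:
    """Cost of a month's total scanned bytes after the free allowance."""
    billable_tb = max(0.0, bytes_processed / TB - free_tb)
    return billable_tb * price_per_tb
```

For example, a month that scans three terabytes in total would be billed for two of them under these assumptions, which is also why day-to-day spend varies with how much data each query touches.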

It's a solid choice, though, whenever you're ramping up on workflow and/or you have very, very predictable usage patterns.

Now, on the flip side, flat-rate pricing is a fixed-capacity model, where you basically pay for fixed capacity and you pay the same amount regardless of how many queries you submit. With basic flat-rate pricing, you can cancel after 30 days. So you basically commit to a number of slots, with a minimum 500-slot commitment, and then you have 30 days in this case to use those slots. There's no limit to the number of commitments you have, so you could have commitments for a thousand, two thousand, ten thousand slots, depending on the overall load. One of the nice things about flat-rate pricing is it provides very, very stable budgeting.

But what happens if you want to extend the capacity of your flat-rate slots? Recently we introduced this new concept called flat-rate flex slots. You're basically paying for fixed slot capacity, regardless of how many queries you submit; however, you can now buy these slots with a much smaller commitment. So you're basically spending $200 per hour for a 500-slot commitment, but you can cancel these slots after 60 seconds. So you can basically say, I want to commit, I need to do some really, really big workload for, say, three minutes; I commit to those slots, I use them, and then I go. So this provides very, very stable budgeting, but it also provides you the ability to deploy additional resources as you see fit.

So let's go ahead and run through some cost estimation and optimization techniques so you can better understand how these concepts apply to you. I'm going to jump back over to my notebook, and I'm in this notebook called understanding scan costs. In an on-demand model, like I said, you're going to pay based on the amount of data that's processed. If I go ahead and run this query, you'll see that this query would have scanned almost a terabyte of data and would have cost me four dollars.

the way that I did this was I set the

job configuration to dry run is true I

didn't actually execute this query I

only asked the system to tell me how

much data it would scan now I've note

you know if I were to say limit 1 this

is kind of an anti-pattern in bigquery

it wouldn't scan the same amount of data

because in this case it's how much data

that we're scanning not necessarily how

much data were rendering so how would I

make improvements to this overall cost

of the screen well the first thing I can

do is I can add some partitioning so in

this case I'm going to partition by date

so I take that non partition table I'm

going to select from it and I'm going to

apply a partition and create a new table

and let's see what the costs on that

table would look like so now I've gone

down to 81 sets a major major reduction

roughly around say 83 percent or so of

the original cost and you can see that

the tenor might scan the amount of data

with scan because now I'm only looking

for a very specific set of date

partitions another thing that you can do

is you can add clustering so close

we'll provide optimization on pruning

out elements within a where clause so in

this case I'm looking for a particular

page target so again I'll go ahead and

create a table so I have my partitioning

by my date from before but now I'm

also going to cluster by page target and

if I run this query now on top of that

optimized table I actually executed this

query this query cost me 0.001 cents and

it consumed 592 slot millis so now this

is a highly highly optimized query so

I've gone from roughly you know 4.8

dollars down to 0.001 cents on

the exact same query using partitioning

and clustering techniques now there's

much more about this topic specifically

understanding the cardinality of the

data and so the use case that I came up

with is a fairly classic use case I'm

using columns that have lower

cardinality things like dates month

counts as my partition and things that

have slightly higher cardinality like

page or product counts for my clustering

I encourage you to go read much more

about this but this will set you down

the path of optimizing your queries now

the other thing I want to quickly demo

is this concept of reservations so if

you remember before I talked about you

could go out and commit to a number of

slots in order to execute queries

especially especially if you have some

type of resource constraint so right now

I'm going to go ahead and use the

bigquery API to list all of the

reservations and commitments that I have

outstanding so it looks like right now I

have 500 slots currently deployed and

let's say my ops and ETL Department came

over and said hey we have some bursty

workloads that are coming

this afternoon maybe in the next five

minutes maybe in the next week we need a

thousand more slots instead of having to

go through a bunch of deployments etc

you can easily use this API to basically

go create a commitment go create a

reservation and then apply that

reservation to an assignment so let's go

ahead and do this so whenever I run

this code it's going to go out and make

the commitment apply the reservation and

then make those 1,000 slots available so

extremely extremely powerful and after

60 seconds I can go ahead and tear down

those thousand slots one last

piece around pricing etc there's also a

sandbox option where you can sign up

there's no credit card required so this

is kind of the complete opposite of the slot

process that I was just talking about so

if you're just getting started you can

sign up with no credit card required 10

gigs of active storage and you get one

terabyte of processed queries per month
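To make the on-demand numbers from the notebook demo concrete, here is a rough Python sketch of the dry-run cost check. The `google-cloud-bigquery` calls follow the library's dry-run pattern. The price of 5 USD per TiB is an assumption chosen because it matches the figures quoted in the talk, and the helper names are hypothetical.

```python
# Assumption: on-demand price per TiB scanned. 5.0 USD matches the figures
# quoted in the talk; check the current price list before relying on it.
PRICE_PER_TIB_USD = 5.0
BYTES_PER_TIB = 2 ** 40


def estimate_on_demand_cost(bytes_processed, price_per_tib=PRICE_PER_TIB_USD):
    """Convert a byte count (for example from a dry run) into an estimated cost in USD."""
    return bytes_processed / BYTES_PER_TIB * price_per_tib


def percent_reduction(before, after):
    """Percentage saved when a cost drops from `before` to `after`."""
    return (before - after) / before * 100.0


def dry_run_bytes(sql):
    """Ask BigQuery how many bytes a query would scan, without executing it.

    Needs the google-cloud-bigquery package and default credentials, so the
    import is kept local and nothing here runs at import time.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


if __name__ == "__main__":
    # Roughly 0.96 TiB scanned lines up with the 4.8 dollar figure in the demo.
    print(round(estimate_on_demand_cost(0.96 * BYTES_PER_TIB), 2))
    # Going from 4.80 dollars to 81 cents is the "83 percent or so" reduction.
    print(round(percent_reduction(4.80, 0.81), 1))
```

Under a slot-based model the byte count no longer drives the bill, but it is still a useful proxy for how heavy a query is.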

so the next thing and

kind of the last thing I want

to talk about is machine learning now if

you are familiar with machine learning

this is kind of a common model that you

or flow that you would follow whenever

you're building out machine learning

models you typically identify a problem

you have to preprocess your data and

maybe do some splitting then actually

build the model itself using various

techniques whether it's tensorflow or

other SDKs then you train then you

evaluate then you deploy and then you

make predictions it's kind of a classic

flow but one of the things that we've

done in bigquery is we've implemented

this concept of bigquery ml which

effectively lets you identify a problem

skip a lot of the pre-processing

splitting and actual model fabrication

and let you just focus on identification

of the problem training it and then making

predictions

right now there are a collection of

different model types that you can

implement inside of bigquery ml now

so we have linear regression for basic

estimation you can do logistic

regression you can do clustering we have

matrix factorization which works well

for recommendation systems and we

recently announced ARIMA for forecasting

which is in alpha you can also import

tensorflow models back into bigquery

for prediction if you have trained

them someplace else so in this case what

I'm going to do is I'm going to run a

forecast this is the retail site

that I was talking about where I'm

ingesting all of my clickstream

inventory and purchasing data and what I

was asked here is that my boss came in

and said hey I want to get a quick

forecast on our clickstream data now

fortunately I have all that clickstream

data and sales data flowing into my data

warehouse so let's look at how

bigquery ml or BQML can be used to build a

forecasting model over that data now

back in my notebook I start this journey

so basically I'm going to do a quick

inspection on top of my clickstream data

so you can see this here looks you know

very very basic I print out some columns

because what I really want to do is just

focus on how many people are coming and

browsing into the site I'm basically

trying to project out what type of

traffic I am going to see in the future

on my site so I take that query and I do

some reduction because what I really

want to do is I want to look at traffic

by day not necessarily by individual

product so I go ahead and do a group by

and now I can see sales and browsing

activity by day so this is good I've

basically effectively created the target

set that I want to build my model on and

this is in essence the magic if you will

inside of bigquery I can write this

statement which is create or replace

model in this case I'm going to use

ARIMA for forecasting

passing the columns for my sales which

is the target and then the time series

information which is date and I tell it

in essence to go train a model this

takes depending on the size of your data a

couple seconds up to a couple minutes or

hours you don't have to worry about the

scalability because underneath the

covers we do scale out the number of

slots needed in order to train the model
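A minimal sketch of the training statement just described, plus the forecast call the demo moves on to next, written as Python helpers that only build the SQL text. The table, model, and column names are hypothetical, and the source table is assumed to already hold one row per day. The talk names ARIMA as the model type; newer BigQuery ML documentation lists `ARIMA_PLUS`, so treat the exact string as something to check.

```python
def build_training_sql(model, source_table, ts_col="event_date",
                       data_col="sales", model_type="ARIMA"):
    """Build a BigQuery ML CREATE OR REPLACE MODEL statement for a time series.

    All identifiers are illustrative placeholders, not names from the demo.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(\n"
        f"  model_type = '{model_type}',\n"
        f"  time_series_timestamp_col = '{ts_col}',\n"
        f"  time_series_data_col = '{data_col}'\n"
        f") AS\n"
        f"SELECT {ts_col}, {data_col}\n"
        f"FROM `{source_table}`"
    )


def build_forecast_sql(model, horizon=30, confidence_level=0.9):
    """Build an ML.FORECAST query: `horizon` steps ahead with a confidence band."""
    return (
        f"SELECT *\n"
        f"FROM ML.FORECAST(MODEL `{model}`,\n"
        f"  STRUCT({horizon} AS horizon, {confidence_level} AS confidence_level))"
    )


if __name__ == "__main__":
    print(build_training_sql("my_dataset.traffic_forecast",
                             "my_dataset.daily_traffic"))
    print()
    print(build_forecast_sql("my_dataset.traffic_forecast"))
```

Either string could be passed to the BigQuery console or a client library; keeping the SQL in plain strings makes it easy to review before anything is run.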

then from there I can go ahead and call

forecast so in this case I forecast

out thirty days beyond the data that

I've loaded I do a little bit of graph

development and here are the outputs of

that forecast so here's all the

historical information and you can kind

of see starting in January you know

information or levels are a little bit

low then they start to spike we have a

little seasonality we have a little bit

more seasonality here kind of sales drop

off and then we're right into where

the forecast starts so you can kind of see that

the forecast was pretty spot on

we had a confidence interval right now

of 90 percent and you can kind of see

right at the very end it starts to drift

the problem here is I actually don't have

enough historical data about what's

happening we see a big big spike this is

basically as people get into the summer

and they're starting to buy a lot more

shoes so one of the ways to solve this

problem is to either provide some hints

around seasonality because we know

people are buying more shoes or get

access to more data now the beauty here

is I didn't have to export any of this

information it all stayed inside of

bigquery so the flow is a lot more

streamlined and if I had to do more

complex transformations on that input

data I could easily do that and just go

run some additional queries save them

back as tables and use those as my

source of truth for the features that

have been put into this model now the

last piece as we wrap up to kind of bring

this all together is a demo inside of

looker and I need to log back in to my


and I may have typed that incorrectly oh

there we go

let me bring this the correct dashboard

up for you there we go so with my ETL in

place with all my fact tables in place

with my transformations in place I can

end up building out a fairly robust

dashboard in this case I was using

looker because looker provides a very

powerful semantic model to layer on top

of the bigquery schema so it makes it easy

for you to express concepts like

funneling conversion etc in human

readable and repeatable patterns so

bigquery is the source of truth for all

my inputs but looker is the source of

truth for all the semantics and meaning

all right so we're pretty much at time

there are a ton of topics we didn't

cover things like access control more

refined cost controls using information

schema to understand what's happening

with queries that are being executed

monitoring logging there are all

different types of ways to do query

optimization etc you know after 10 years

there's a lot to cover especially

whenever you're constrained to 45 to 50

minutes so with that I encourage you to

go get started

so you can sign up for the free tier if

you just want to hack by yourself if you

have bigquery already deployed and

available inside of your company ask for

a carve-out onto a separate

project for your own testing if you're

already doing a lot of prototyping on

bigquery

check out the new flex slots

it's a great way to

start expanding workloads and throughput

for the data analysis processes that

you're building and the last thing I'll

leave you with is a couple of quick

links if you really want to get up to

speed I encourage you to go read the

book called Google bigquery the

definitive guide it was written by two

Googlers Valliappa Lakshmanan and Jordan

Tigani Lak is the head of solutions

engineering for data analytics and ml

Jordan is the director of product

management for bigquery so this is an

excellent source of truth I also

encourage you to learn more via the data

engineering with Google cloud course

series on Coursera which also helps you

get certified with the Google

professional data engineer certification

and last but not least I encourage you

to go out and follow these folks if you

don't follow them already Felipe Hoffa

is the lead developer advocate for

bigquery he is an amazing source of

information and tribal knowledge for

bigquery I would encourage you to follow

Lak and Jordan because they are

effectively tracking a lot of new

features and architecture information in

real time as well as Tino

Tereshko who is one of the lead

product managers as well on bigquery

follow them and then they will lead you

to more and more valuable information

about bigquery and with that I

thank you for your time and I hope you

stay safe and healthy so let's move over

to some questions on the stream

absolutely thank you Eric we do have a

lot of questions from our customers on

the live stream so thank you for this

outstanding presentation and I just want

to add a word to our customers do not

hesitate to reach out to your Google

account team we are here to help you and

to develop and further any subject you

would like to discuss in the future so

the first question was answered already

but the seventh one is from

Antonio yeah so hi there I

so yeah so I didn't talk about

materialized views in this session they

are a new feature coming to bigquery and

Antonio probably the best thing to do

is if you just want to send me a mail

and talk about it we can dig into kind of

what's happening with the materialized

view itself what's in the view

there are some situations where you may

not see performance improvements so I'd

be more than happy to take a look at

that query and schema that you have

offline Eric if you can leave the

presentation mode please thank you I'm

sorry can you say that again

so yeah the presentation mode so

customers can see you thank you oh

there we go

sure all right so next question

oh so I'm gonna skip down to the

question from Alex who asks do you have to

still pay for scanned terabytes when using

flex slots the answer to that is no so

once you're on a slot model you're

paying only for the slot cost not

for the scan cost the reason

why I showed those in the demo though is to

help you understand how complex those

queries would be because the more data

that you're scanning the more slot

utilization you're more than likely

going to consume so as a best practice I

typically always will go ahead and look

at the the scan volume whenever I'm

writing queries just so I make sure I

understand the potential impact and then

also just looking at ways to gain

efficiencies a lot of times I'll prune

off columns because I don't need them

or I'll look at the table and say oh

maybe I have a better partitioning or

clustering scheme that I can come up

with to reduce resource utilization
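The partitioning and clustering approach from the demo, together with the column pruning mentioned in this answer, can be sketched as two small Python helpers that build the SQL. Every table and column name here is a hypothetical placeholder, and the partition column is assumed to be a DATE column.

```python
def build_optimized_table_sql(new_table, source_table, partition_col, cluster_cols):
    """Build a CREATE TABLE ... AS SELECT that copies a table into a
    partitioned and clustered layout. Assumes `partition_col` is a DATE column."""
    cluster_list = ", ".join(cluster_cols)
    return (
        f"CREATE TABLE `{new_table}`\n"
        f"PARTITION BY {partition_col}\n"
        f"CLUSTER BY {cluster_list}\n"
        f"AS SELECT * FROM `{source_table}`"
    )


def build_pruned_query(table, columns, partition_col, cluster_col):
    """Build a query that selects only the needed columns and filters on the
    partition and cluster columns. @day and @page_target are named query
    parameters to be supplied by the caller at run time."""
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list}\n"
        f"FROM `{table}`\n"
        f"WHERE {partition_col} = @day\n"
        f"  AND {cluster_col} = @page_target"
    )


if __name__ == "__main__":
    print(build_optimized_table_sql("my_dataset.clicks_optimized",
                                    "my_dataset.clicks",
                                    "event_date", ["page_target"]))
    print()
    print(build_pruned_query("my_dataset.clicks_optimized",
                             ["event_date", "page_target"],
                             "event_date", "page_target"))
```

Filtering on the partition column limits which partitions are read, and filtering on the cluster column lets the engine skip blocks inside those partitions; selecting only the needed columns reduces the scan further.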


okay so what is the meaning of slot

millis for a bigquery job I don't know the

number of slots consumed by a query

execution so a really good question

so slot millis is just basically

an aggregate of overall slot time so if

you look at the output in the bigquery

console it'll show you the overall slot

millis which is the total number

of milliseconds

that were processed across all of the

slots that were used and

it will also show you the effective

wall clock time so say the effective

wall clock time for a query was one

minute but you used ten thousand or a

hundred thousand milliseconds of slots

that means effectively you were

processing that the processing was

distributed over you know n number of

slots inside of that time window now

the one thing you can do is you can

go dig into the actual query plan itself

because as the query runs as it moves

from stage to stage one stage

may be say using a smaller number of

slots and then the second stage which is

more resource intensive expands out

to a much wider slot utilization

so the number that you see is basically

just the total slot millis that we used

for that particular query let's see what

else is in there Oh Antonio sorry I kind

of I skipped over your question about

the bigquery storage API yeah I didn't

get to talk about that either here but

in mail we could also dig into that

because I'd be interested to understand

what your use cases are whether you know

you're basically looking to build some

type of SDK or integrate it into some

type of other product and yes

I will post these notebooks up on a

github link

so that you can play with them for sure
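Going back to the slot millis answer above, the arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and the result is only an average over the whole job; as noted in the answer, individual stages may use far fewer or far more slots than this figure.

```python
def average_slots(total_slot_ms, elapsed_ms):
    """Average number of slots busy over a job's wall clock time.

    total_slot_ms: total slot milliseconds reported for the job
    elapsed_ms:    wall clock duration of the job in milliseconds
    """
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return total_slot_ms / elapsed_ms


if __name__ == "__main__":
    # The example from the answer: one minute of wall clock time and
    # a hundred thousand slot milliseconds.
    print(round(average_slots(100_000, 60_000), 2))
```

A ratio near 1 means the work ran mostly serially; a large ratio means it was spread across many slots inside the same time window.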

wonderful so if we do not have more

questions I'm just checking one last time

we can definitely wrap up so thank you

Eric for your presentation it was very

rich in terms of content and I want to

thank the customers for your great

questions so I encourage you to

subscribe to the Google cloud forums where

you will see all the YouTube live

sessions from the past plus the one from

today as soon as you sign in on the

link as well we'll be able to send you

the presentation and I also encourage

you as I mentioned earlier to reach out

to your google cloud account team we are

here to help and serve you so do not

hesitate so thank you everyone for

joining us every Wednesday at

noon for our session of Lunch and Learn

and I wish all of you to stay safe and I

wish you a great week thank you everyone

and talk to you very soon
