
Good morning everyone, and thank you for joining the Google Cloud Lunch and Learn. My name is Ingrid Comes Ellis and I'm a Google Cloud Sales Director. On behalf of the Google Cloud family, I would like to thank all of you for joining us; I hope that you, your family, as well as your colleagues are doing well during this challenging time. Lunch and Learn is part of the Whitney Tech Talk series, and our goal is to serve you better. Previously we covered topics such as business continuity, DevOps, and Kubernetes. Our goal is to make this session as interactive as possible, and we want to share best practices with our customers. I encourage you to sign in now; the link is available in the live chat box. As soon as you sign in, we will send you the deck, and please provide us feedback and the topics you would like us to cover during the next sessions. We learn with you and from you.

After today's session you will have the opportunity to discuss questions, so please write your questions in the YouTube live chat box without waiting until the end. It is my honor to introduce today's speaker, Eric Schmidt. Eric is the developer advocacy lead for data analytics and applied data science within Google Cloud. He focuses on enabling data engineers, data analysts, and data scientists across data pipelines, streaming analytics, and business intelligence workloads. Prior to his advocacy role, Eric was the founding product manager for Cloud Dataflow, bringing forth a unified model for batch and stream processing. Today he'll talk about the feature pillars in BigQuery, with a lens toward how you will use them for various scenarios. So without further delay, Eric, you have the floor.

Thank you, thank you. Let me flip over to present. Well, good morning or good afternoon, wherever you may be. Hello and welcome to an overview of BigQuery, as we look at features, use cases, and best practices. As Ingrid said, my name is Eric Schmidt and I lead the data analytics and data science developer advocacy team within Google Cloud. My team works with customers and practitioners at large, helping them to build data analysis pipelines. You can reach me at cloudy at google.com or on Twitter at notthateric, where my opinions are my own. Thank you for joining me today. To start, I'd like to lay out the overall focus for this talk. This talk is about BigQuery, but more specifically it is about BigQuery primarily through the lens of technical practitioners.

I'm talking about the DBAs who manage infrastructure, schema changes, and access control; the data engineers who build data transformation pipelines and toil over the ever-expanding processing demands of data; the data analysts driving descriptive analytics; and the data scientists seeking to formulate new questions about the past and the future. Now, if you're an IT decision-maker making purchasing decisions, or a CISO wrangling cross-organization security challenges, or maybe a chief decision-maker, you should also find this talk useful, as it will shine a light on the value proposition of BigQuery beyond technical fundamentals. If you're new to Google Cloud, or maybe a seasoned pro, it's still important to understand the product landscape for this talk, because I won't just be addressing BigQuery; I'll also be talking about and expanding into the broader product portfolio inside of the data analytics offerings on Google Cloud.

Now, as a side note, for a little bit of fun, here is a printable poster that has every Google Cloud product along with a four-word-or-less description. When I started on Google Cloud back in 2013, I think there were six products, of which BigQuery was one. So grab the high-res poster or wallpaper from the GitHub link below, print it out, and you can impress your fellow cloud and data nerds. Now, instead of looking at products, I'm going to dive into a concept that we call solutions. Google Cloud Platform has eight primary solutions, smart analytics being one of them. The way I like to talk about this is that it's easy to talk about a single product; it's simple. But in the real world, as practitioners, we live and operate from a solutions perspective. This is typically framed as: I need to solve for X problem, and in order to do so I need A, B, and C products implemented with Y and Z patterns. This framing can get a little messy, because you start mixing and matching different processes, products, and patterns.

So we shift away from data analytics as a product grouping and we dive into smart analytics. Now, a quick warning: I have a few more marketecture slides to make sure we're on the same page, and then we're going to get into SQL and code. Here we lay out what the Google Cloud smart analytics platform looks like. Google Cloud smart analytics enables four solution areas: streaming analytics, data lake modernization, data warehouse modernization, and business intelligence with an integrated data science workflow, with BigQuery at the heart of these architectures and solutions. So, using the solution rubric that I stated on the last slide, we can state a problem like: I need to calculate store sales in real time, and for that I need a stream ingestion mechanism, a streaming analytics engine, and a scalable, durable cache, stitched together with selected patterns.

So what is BigQuery, and why am I talking about streaming and data lakes? BigQuery is a serverless, fully managed, highly scalable, and cost-effective cloud data warehouse designed for business agility. It enables you to analyze megabytes to petabytes of data using standard ANSI SQL at blazing fast speeds, with zero operational overhead and enterprise-grade security. At the same time, BigQuery also provides integrated GIS capabilities, an in-memory analysis service, and built-in machine learning features.

Oh, by the way, BigQuery turned 10 years old this week. When I started back in 2013, we had just added joins as a feature, so that's 10 years of innovation. If you go online, you can Google for lots of happy birthday wishes and some reflection by various product team members, engineers, and others in the ecosystem who have benefited from this amazing product. So happy birthday, BigQuery. Now that you know what BigQuery is at a high level, let's look at how it's built and what is inside of this implementation. The first key point on this slide to understand is the separation of the storage system, which stores the information to be queried, from the query engine itself.

Now, this separation provides several important benefits. First of all, BigQuery stores data in a proprietary columnar format called Capacitor, which, as it has evolved over time, can provide more optimizations, specifically through data layout optimization. The data is physically stored on Google's distributed file system, which is called Colossus. This ensures durability through something called erasure encoding, where we store redundant chunks of data on multiple physical disks; moreover, this data is replicated across multiple data centers. Another key point is that as optimizations occur in the query engine, they can be rolled out to the entire customer base with zero downtime, so we can mix and match how optimizations are added to both the storage system and the query engine. The second key point to understand here is that these subservices, storage and compute, all run on a serverless resource model which, behind the scenes, is fully managed by Google. So if you're coming from an on-premise and/or monolithic data warehouse and/or data lake implementation, this may be a completely different model for you, a model that changes how you think about how you build and manage such systems.


Drilling in deeper, it changes the game because BigQuery replaces the concept of a typical hardware setup and deployment. There is nothing to deploy or manage with BigQuery from an infrastructure perspective. From there, you can quickly iterate and grow a basic data mart using dataset and table constructs. You can then use and deploy tables as a source of truth to define schemas for a data lake, both within BigQuery and outside of BigQuery's storage system; your data lake may contain files elsewhere, like on Cloud Storage or Google Drive, or in a transactional system like Cloud Bigtable. BigQuery also uses Cloud Identity and Access Management, or IAM, to grant permissions to specific actions within BigQuery, which replaces traditional concepts like SQL GRANT and REVOKE. Ultimately, you can develop and deploy insights in real time, at scale, with BigQuery.
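As a rough illustration of that IAM-based model, here is a minimal sketch of granting a user read access to a dataset with the Python client library; the project, dataset, and email values are hypothetical placeholders, not anything from the demo.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")          # hypothetical project ID
dataset = client.get_dataset("my-project.retail_mart")  # hypothetical dataset

# Append a reader entry to the dataset's access list (instead of a SQL GRANT).
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="userByEmail",
        entity_id="analyst@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])  # only the access list is updated
```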

Now, with all this framing in place, let's dig into some SQL and code. As I was preparing for this talk over the weekend, I, like the majority of you, was also experiencing some very challenging times, with protests, civil unrest, and dealing with the COVID pandemic. These are very challenging times. At the same time, here in Seattle where I live, we also experienced some very violent weather over the weekend, which included lightning, to which my seven-year-old asked, while we were eating dinner, whether we get more lightning here in Seattle than grandma does, who lives on the East Coast. This is something we talk about a lot, mainly because we don't get a lot of lightning here in Seattle. So for this demo we're going to do a very simple data analysis and data science experiment. Let's formulate a question: do we get more lightning here in Seattle than grandma, who lives on the East Coast? So we started thinking about this: where can we go and get that information? Can we go to the news? Could we go to the NOAA website? Could we Google it? It turns out this is a fairly challenging question to answer if you want to compare lightning strikes between two cities. Also, once we start querying the data, we have to understand what our boundaries are. If we live in Seattle, how are we defining "here" in terms of those lightning strikes: are we talking about our neighborhood, across the city, etc.? And then, once we have the ability to query the data, can we visualize it to see if it makes sense, and does it answer the question? So with that, let's flip over to BigQuery.

What you see here is the BigQuery cloud console, and on the left-hand side, if I scroll down, you start to see the different projects that I have access to. You have access to some of these projects as well; for example, you have access to something called the bigquery-public-data project, and inside of that there are a bunch of freely, publicly available datasets that you can use to perform analysis. If I scroll down to one called NOAA lightning, there are a lot of tables in here: you can see tables for lightning strikes from the current year all the way back to 1987, and the schema looks like this: basically the day, the number of strikes, and the center point. If I preview it, it's a fairly basic layout. So I sat down with my son and started asking questions like: how close do you think a lightning strike would be if it was close to our house? We agreed upon roughly five miles. Since the data is in meters, I just did some conversion, and then from there we used the built-in GIS functions inside of BigQuery to calculate the distance between where the lightning struck, according to the record, and the center point of our neighborhood, both for our house and grandma's house. Then I'm using a feature called table wildcards so that I can query across all of these years and do some aggregations. So let's run this query and see what happens.
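The exact query from the demo isn't shown, but here is a minimal sketch of what it might look like, assuming the public tables follow a lightning_YYYY naming pattern; the column names and coordinates are illustrative, and ST_DISTANCE returns meters, hence the miles-to-meters conversion.

```python
from google.cloud import bigquery

client = bigquery.Client()

RADIUS_M = 5 * 1609.34  # roughly five miles, converted to meters

sql = f"""
SELECT
  _TABLE_SUFFIX AS year,
  SUM(number_of_strikes) AS strikes_near_us
FROM `bigquery-public-data.noaa_lightning.lightning_*`
WHERE ST_DISTANCE(center_point_geom,
                  ST_GEOGPOINT(-122.33, 47.61)) < {RADIUS_M}  -- illustrative Seattle point
GROUP BY year
ORDER BY year
"""

for row in client.query(sql).result():
    print(row.year, row.strikes_near_us)
```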

Keep in mind that in the background BigQuery just went and scanned all of the data in storage: we scanned 4.9 gigs in roughly 5.4 seconds. This query looks like it's producing reasonable results: I have a distance to our house which seems very, very far away, and this strike was fairly close to grandma's house on the East Coast. Great, so I've effectively built this query, but it really doesn't tell a story yet. Does it answer the question of who effectively gets more lightning? So what I'm going to do is flip over to another tool, which is called AI Platform Notebooks. Once I jump out of BigQuery, I can jump into AI Platform Notebooks, where I have a notebook already pre-built that helps me interact live and in real time with that same information inside of BigQuery. We're going to talk more about different access mechanisms later, but suffice it to say I'm going to import some libraries, connect to BigQuery, take that same query that I wrote, and execute it here. So it goes out, reruns the same query, and I get the results back and pump them into a DataFrame so I can see them here.
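A minimal sketch of such a notebook cell with the BigQuery Python client library, assuming the query text from the previous step is stored in a variable named sql:

```python
from google.cloud import bigquery

client = bigquery.Client()            # picks up the notebook's default credentials and project

# Run the same standard-SQL query and pull the results into a pandas DataFrame.
df = client.query(sql).to_dataframe()

print(df.head())                      # quick sanity check before visualizing
```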

This is pretty helpful, but let's go ahead and visualize it and bring it to life. In this view I'm going to filter down to lightning strikes that are close to our house. Over the years, you can see we don't get a lot of lightning; some days it's basically zero. It looks like we had a peak here, with a lot of strikes back in 1999, but "a lot" being 40. If I go down and look at what happens in Pittsburgh, around grandma's house, you can see the scale is much different: they have multiple storms with peaks around 200, and on average, if I just do an eye test, lightning strikes land between 50 and 100, with one really big surge from a massive storm back in July of 2018. So there we go: we can definitively say that yes, Pittsburgh gets way more lightning strikes, at least in this location, compared to what's happening in Seattle in our neighborhood. The beauty here is that we were able to do all of this with zero infrastructure. I didn't have to ingest any data, I didn't have to move any data; all I had to do was express our query, execute it, and then visualize it. Now, as an aside, my son then started to ask whether we get more rain here in Seattle versus Pittsburgh, and I just looked at him and said, hmm, I'm not sure, but there's another public dataset we could play with, at which point he was no longer interested. So let's get back to BigQuery and features.

What I'm going to walk through next is how you interact with BigQuery. There are multiple ways to do it, and I just showed you one of them, which is through the developer console. This is typically where you would spend a lot of your time initially as you learn the product, mainly because it provides almost complete access to all of the API operations within BigQuery. It's also easy to access job and query histories, so as you execute queries like I did, you can go back and pull up queries you've run in the past, and you can also save them and share them easily with others. The editor also provides visual hints as you're typing out syntax, to make sure those queries are correct. You can also do some inline exploration with Sheets and Data Studio: over here I could say "explore data" and push this out to Sheets and start doing very similar analysis inside of Sheets, versus doing it inside of a Python notebook. We'll talk more about this later.

Another way to interact with BigQuery is through the BigQuery command-line tool, bq, that comes with the Google Cloud SDK. This is an excellent way to learn the full API: you can imagine doing nothing in the console that I just showed you and doing everything from the command line, which in essence helps to build a lot of muscle memory about the different functions and ways to interact with BigQuery. Now, truth be told, I end up using a combination of all of these tools. Specifically, I like to use the command-line tool basically as an auditing tool: as I'm running things in the console, or have background jobs running in code, I can also use the command line to audit and tail what's happening with the system. One other nice benefit is that inside of Google Cloud we have something called Cloud Shell: if you fire up Cloud Shell, the Google Cloud SDK is pre-installed, and it is a free compute instance for you to do any type of console-based SSH work.

Another way to interact with BigQuery is through the client libraries, or you can build your own library on top of the BigQuery REST API, which is what the client libraries are built on. In the Python notebook that I showed you, I'm using the Python client library. I highly suggest that you use the client libraries, as they're pre-built, fully supported, and we have a common documentation and sample model across all of them. As another side note, I was using AI Platform Notebooks, which is a feature of Google Cloud Platform, and it has all of these libraries pre-installed for you as well, so as I jumped into this notebook I had to do very, very little bootstrapping in order to start doing my analysis.

And, though we only touched on this quickly, you can easily integrate BigQuery with Google Sheets as well as Data Studio. The reverse is also true: I could start my journey inside of Sheets, or inside of Data Studio, which is a way to build visualizations, or I could pipe results out of the console into those environments. Now, taking a step back into more classic business intelligence tools, we have a broad spectrum of partners with deep integrations with BigQuery, specifically Tableau and Looker. So, net-net, you have choices out there: if you find yourself spending maybe a little bit too much time in the console and you're restricted in what you're trying to do, look around, you have lots of options.


Next I'm going to talk about ingestion and interacting with data in more detail. Like the access options that I just showed you, you have a collection of ways to ingest as well as query data from BigQuery. On the left-hand side, you can use tools like Cloud Dataflow or Dataprep to transform and load data into BigQuery. BigQuery can also import data in various formats: CSV and JSON, as long as it's newline delimited, and you can also import Avro, Parquet, and ORC formats. You can use the Data Transfer Service, DTS, to automate ingestion and/or transformation of data from other cloud providers; so if you have data on AWS S3, you can use DTS to pull that data over and load it into BigQuery. You can also use DTS to connect to SaaS applications like Salesforce and SAP. Another way to import data is straight through the API: basically, anyplace you can get code up and running, you can insert data into BigQuery tables, so you could be running on a Compute Engine instance, maybe inside a container on Kubernetes, on App Engine, etc. The one caveat there is that you have to recreate all of the data processing foundations, making connections, dealing with error cases, etc., so a good best practice is to use a higher-level primitive, something like Dataproc or Dataflow, that has these types of implementations built in, and I'll show you that in a minute.
and I'll show you that in a minute

Now, on the right-hand side, you can access and query data across other BigQuery datasets, just like the one I showed you with the lightning dataset, where all of the storage is managed in another project, our public datasets project. You can also execute federated queries over other databases from BigQuery, like Cloud SQL, which is a managed MySQL. One of the unique aspects of BigQuery is that you can choose to load data in batches or load data in real time via stream ingestion, and there are a few decision points on when to use what. If you need your data updated on a daily or weekly basis, batch loading is more likely a solid choice. If you start to reduce your load window, say to under 5 minutes, and you have lots of tables to ingest, streaming ingestion may be a better choice. You can mix and match these models: for example, you may have one table that is batch loaded, say on a daily or weekly basis, and another table which is stream loaded from the same source, but where you're only doing 20% sampling of real-time events. Another way to load or generate data is through a simple SELECT statement: you select results from a local dataset or from a remote dataset and then save those results back into your dataset to create new tables. I'll show you a demo of this in a minute.
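A minimal sketch of that select-and-save pattern, as a CREATE TABLE AS SELECT statement run through the Python client, with hypothetical dataset and table names:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Materialize the results of a query as a new table in my own dataset.
sql = """
CREATE OR REPLACE TABLE `my-project.retail.daily_store_sales` AS
SELECT store_id, DATE(order_ts) AS sale_date, SUM(amount) AS total_sales
FROM `my-project.retail.transactions`
GROUP BY store_id, sale_date
"""
client.query(sql).result()   # DDL queries return when the statement completes
```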

All right, let's look at a streaming workflow. We've talked about streaming and the ability to ingest data in real time, so I'm going to flip over to this console, which is Google Cloud Dataflow, and specifically to a pipeline that has been running for about a day or so now. This pipeline is ingesting transactional data from a fictitious retailing system that I built; it's also ingesting clickstream data from browser activity related to this retailer, and it's also looking at some real-time stock information. It's ingesting this information from Cloud Pub/Sub, doing some transformations on those streams, doing a little bit of aggregation where I'm looking at counts within a particular time window, and then passing all of the data back out and writing it into BigQuery in real time. You can see here on the right-hand side we're doing roughly around 227, maybe spiking up to around 700, events a second, so as new transactions come in they're processed in real time, written into BigQuery, and also aggregated in real time.
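The source for that exact pipeline isn't shown, but a minimal Apache Beam (Python) sketch of the shape described, reading from Pub/Sub, windowing and counting, and writing to BigQuery, might look like this; the topic, tables, and field names are hypothetical, and the destination tables are assumed to already exist.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)   # plus project/region/runner flags in practice

with beam.Pipeline(options=options) as p:
    events = (
        p
        | "ReadClicks" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/clickstream")
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
    )

    # Raw events land in BigQuery as they arrive.
    events | "WriteRaw" >> beam.io.WriteToBigQuery(
        "my-project:retail.clickstream",
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table assumed to exist
    )

    # A windowed aggregate: page-view counts per minute.
    (
        events
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyByPage" >> beam.Map(lambda e: (e["page_ref"], 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "ToRow" >> beam.Map(lambda kv: {"page_ref": kv[0], "views": kv[1]})
        | "WriteAgg" >> beam.io.WriteToBigQuery(
            "my-project:retail.clickstream_counts",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```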

So, with this streaming pipeline in place, I can go back over to BigQuery and run a query that tails the table where the data is being written from my Cloud Dataflow pipeline. Let's go ahead and run this; I have cached results turned off, and what we should see is basically the event time for the events flowing through the pipeline, the current timestamp, and then the event that I'm looking at, which in this case is just some basic clickstream data and the page reference it was associated with. This will take probably around 23, maybe 25, seconds to run. There you go: we scanned 72 gigs of data, it took 28 seconds, and we're running right about a second behind live. If you look at UTC now on your clocks, you can see that we are indeed streaming live into this table. This is great: I have effectively ingested and queried real-time stream information and put it into BigQuery, so it's immediately actionable.

Now, the next bit I want to show you is how to query external sources. In this case I'm going to query some data that's living inside of Cloud SQL. I showed you that I had real-time clickstream information coming in from Dataflow into BigQuery, but I also have some stock-level information that I need to reach out to, sitting inside of a MySQL implementation. For that I have something called a federated query, and I'm going to switch over to the project where my federated connection exists. You can see here I have something called local operations, where the connection type is Cloud SQL MySQL. Up top I have a query, and there is a FROM clause that says I want to extract some information via an external query against local operations, looking at stock levels. In this case the Cloud SQL instance has the stock-level information that's manually reconciled on a daily basis, but at the same time I have all of my sales information, my orders information, that I showed you being streamed in through Cloud Dataflow, and I want to join these two.
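A minimal sketch of what such a federated join can look like using the EXTERNAL_QUERY function; the connection ID, table, and column names are hypothetical stand-ins, not the ones from the demo.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT
  sales.store_id,
  sales.sale_date,
  sales.units_sold,
  stock.units_on_hand
FROM `my-project.retail.daily_orders` AS sales
JOIN EXTERNAL_QUERY(
  'my-project.us.local-operations',                       -- Cloud SQL connection ID
  'SELECT store_id, sale_date, units_on_hand FROM stock_levels;'
) AS stock
USING (store_id, sale_date)
"""
for row in client.query(sql).result():
    print(row.store_id, row.sale_date, row.units_sold, row.units_on_hand)
```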

So let's go ahead and run this query. BigQuery is going to reach out and execute a query against Cloud SQL, run a local query in BigQuery, join the two, and then I can look at reconciliation: at this store, for this day of sale, I sold 85 units, and at the end of the day I had 56 units on hand. This is really powerful, because instead of having to do some type of ETL job and pull all of that information out of Cloud SQL into a local table in BigQuery, I can simply reach out, federate, and do the join on the fly.

The other thing I want to quickly show is query materialization. I have a query that I ran last night; it took almost three hours to run and processed 24 terabytes of data. You can see it's a fairly simple query: I'm doing a GROUP BY on some order information, by week, by store, and by order hour of day. In essence I wanted to see what type of waterfall I might have on a per-hour basis inside of a store, so I scanned the entire order-lines-by-geo table, which resulted in processing 24 terabytes of data, and the output of this query is quite simple. I could use this output as a fact table for some type of data analysis, so in this case all I have to do is say "save results" to a BigQuery table, call it something like facts-store-week, and if I hit save I can project all the results from that query back as a table. I could continue to run this query over and over again, say on a weekly or monthly basis. The point here is that I'm doing my transformation and loading inline, inside of BigQuery, instead of having to move this data out and run those transformations.
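The same save-results step can also be scripted by giving the query job a destination table. A minimal sketch with the Python client, using hypothetical table and column names:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(
    destination="my-project.retail.facts_store_week",            # hypothetical fact table
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # refresh it on each run
)

sql = """
SELECT store_id,
       EXTRACT(WEEK FROM order_ts) AS week,
       EXTRACT(HOUR FROM order_ts) AS hour,
       COUNT(*) AS orders
FROM `my-project.retail.order_lines_by_geo`
GROUP BY store_id, week, hour
"""
client.query(sql, job_config=job_config).result()   # results land directly in the table
```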

Now that we have some architectural fundamentals in place, let's look at the resource economy and also cover some performance tips. BigQuery is very powerful, so it's important to understand the resourcing, how resources are used, and the cost, so you can better serve your needs. If you recall the slide about the query engine, take note of those little compute models inside of the query engine, aka Dremel. Those little compute units, in BigQuery parlance, are called slots. A BigQuery slot is a unit of computational capacity that's used to execute queries, and BigQuery, underneath the covers, automatically calculates how many slots are required by each query, depending on its size and complexity. A slot, at the end of the day, is just a combination of CPU, memory, and networking resources, along with a couple of other technologies and subservices. From an engineering perspective, it is approximately half a VM of compute and one gig of memory, though those specifications keep changing over time, because as data centers and the underlying hardware are upgraded, the abilities of these slots continue to improve. So, under the hood, think of analytics throughput in BigQuery as really being measured in slots: if you want things to run faster, you apply more slots, and if you have more concurrent queries and you don't provide more slots, then you basically have slower throughput. There are ways to modulate how fast you want to run, or how fast you want to deplete outstanding jobs.

Now that you understand slots, let's talk about the pricing model. There are effectively two pricing models that you can mix and match inside of your organization and projects: one is referred to as on-demand, which is really a consumption-based model, and the other is called flat rate, which is a capacity-based model. With on-demand pricing, which is the default pricing you get with BigQuery, you pay for the amount of data processed, at five dollars per terabyte; your first terabyte per month is free, and by default projects get a 2,000-slot allotment, so you basically get 2,000 slots to execute queries. There is some burstability, but it is as-available; there's no guarantee you're always going to get burst over 2,000 slots. On-demand pricing is really good for spiky workloads, as long as you understand the overall capacity load you're going to be pushing through those slots. One of the issues, though, is that it's a little more challenging to budget due to variability: if you're running queries that process different amounts of data, one day you could be spending, say, $10 and the next day twenty-five, depending on how much data you're pushing through the system. So, net-net, it's a little more challenging as concurrency grows and the complexity of your queries grows, but it's a solid choice whenever you're ramping up on a workload and/or you have very predictable usage patterns.

Flat-rate pricing, on the flip side, is a fixed-capacity model where you pay for fixed capacity, and you pay the same amount regardless of how many queries you submit. The basic flat-rate pricing lets you cancel after 30 days: you commit to a number of slots, with a minimum of a 500-slot commitment, and you then have 30 days to use those slots. There's no limit to the number of commitments you can have, so you could have commitments for one thousand, two thousand, or ten thousand slots, depending on your overall load. One of the nice things about flat-rate pricing is that it provides very stable budgeting. But what happens if you want to extend the capacity of your flat-rate slots? Recently we introduced a new concept called flat-rate flex slots. You still pay for fixed slot capacity regardless of how many queries you submit; however, you can now buy these slots in a much smaller commitment, paying roughly $200 per hour prorated on the slot commitment, and you can cancel these slots after 60 seconds. So you can basically say: when it comes time to do some really big workload for, say, three minutes, I commit to those slots, I use them, and then I let them go. This provides very stable budgeting, but it also provides you the ability to deploy additional resources as you see fit.

So let's run through some cost estimation and optimization techniques, so you can better understand how these concepts apply to you. I'm going to jump back over to my notebook, into a notebook called "understanding scan costs". In an on-demand model, like I said, you're going to pay based on the amount of data processed. If I run this query, you'll see that it would have scanned almost a terabyte of data and would have cost me four dollars. The way I did this was by setting the job configuration's dry-run flag to true: I didn't actually execute the query, I only asked the system to tell me how much data it would scan.
you know if I were to say limit 1 this

is kind of an anti-pattern in bigquery

it wouldn't scan the same amount of data

because in this case it's how much data

that we're scanning not necessarily how

much data were rendering so how would I

make improvements to this overall cost

of the screen well the first thing I can

do is I can add some partitioning so in

this case I'm going to partition by date

so I take that non partition table I'm

going to select from it and I'm going to

apply a partition and create a new table

and let's see what the costs on that

table would look like so now I've gone

down to 81 sets a major major reduction

roughly around say 83 percent or so of

the original cost and you can see that

the tenor might scan the amount of data


with scan because now I'm only looking

for a very specific set of date

partitions another thing that you can do

is you can add clustering so close

we'll provide optimization on pruning

out elements within a where clause so in

this case I'm looking for a particular

page target so again I'll go ahead and

create a table so I have my partitioning

might be my date from before but now I'm

also going to cluster by page target and

if I run this query now on top of that

optimized table actually executed this

query this query cost me 0.001 cents and

it consumed 592 slot noise so now this

is a highly highly optimized query so

I've gone from roughly you know 4.8

dollars down to point on what cents on

the exact same query using partitioning

and clustering techniques know there's
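A minimal sketch of the kind of DDL this refers to, creating a date-partitioned, clustered copy of a table; the table and column names are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
CREATE OR REPLACE TABLE `my-project.retail.clickstream_opt`
PARTITION BY DATE(event_ts)          -- prunes whole days at query time
CLUSTER BY page_target               -- prunes blocks within each partition
AS
SELECT * FROM `my-project.retail.clickstream`
"""
client.query(sql).result()

# Queries that filter on the partition and cluster columns now scan far less data.
probe = """
SELECT COUNT(*) AS views
FROM `my-project.retail.clickstream_opt`
WHERE DATE(event_ts) BETWEEN '2020-05-01' AND '2020-05-07'
  AND page_target = 'checkout'
"""
print(list(client.query(probe).result()))
```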

Now, there's much more to this topic, specifically around understanding the cardinality of your data. The use case I came up with is a fairly classic one: I'm using columns with lower cardinality, things like dates or months, as my partition, and things with slightly higher cardinality, like page or product, for my clustering. I encourage you to go read much more about this, but it will set you down the path of optimizing your queries.

The other thing I want to quickly demo is the concept of reservations. If you remember, I talked before about how you can commit to a number of slots in order to execute queries, especially if you have some type of resource constraint. Right now I'm going to use the BigQuery Reservation API to list all of the reservations and commitments I have outstanding; it looks like I currently have 500 slots deployed. Let's say my ops and ETL department came over and said, hey, we have some bursty workloads coming this afternoon, maybe in the next five minutes, maybe in the next week, and we need a thousand more slots. Instead of having to go through a bunch of deployments, you can easily use this API to create a commitment, create a reservation, and then apply that reservation to an assignment. So let's go ahead and do this: whenever I run this code, it goes out and makes the commitment, applies the reservation, and then makes those 1,000 slots available. Extremely powerful, and after 60 seconds I can go ahead and tear down those thousand slots.
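This sketch is reconstructed from memory rather than taken from the demo notebook, so treat the client, method, and field names below as assumptions to verify against the BigQuery Reservation API reference; the point is only to illustrate the commitment, reservation, and assignment flow.

```python
from google.cloud import bigquery_reservation_v1 as reservation

client = reservation.ReservationServiceClient()
parent = "projects/my-project/locations/US"          # hypothetical admin project and location

# 1. Buy short-lived capacity: a flex commitment of 1,000 slots.
client.create_capacity_commitment(
    parent=parent,
    capacity_commitment=reservation.CapacityCommitment(
        slot_count=1000,
        plan=reservation.CapacityCommitment.CommitmentPlan.FLEX,
    ),
)

# 2. Carve the purchased slots into a named reservation.
res = client.create_reservation(
    parent=parent,
    reservation_id="etl-burst",
    reservation=reservation.Reservation(slot_capacity=1000),
)

# 3. Assign a project's query jobs to that reservation.
client.create_assignment(
    parent=res.name,
    assignment=reservation.Assignment(
        job_type=reservation.Assignment.JobType.QUERY,
        assignee="projects/my-etl-project",
    ),
)
```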

One last piece around pricing: there's also a sandbox option where you can sign up with no credit card required, which is kind of the complete opposite of the slot process I was just talking about. If you're just getting started, you can sign up with no credit card, get 10 gigs of active storage, and get one terabyte of processed queries per month.

The last thing I want to talk about is machine learning. If you're familiar with machine learning, there's a common flow you follow whenever you're building out machine learning models: you identify a problem, you pre-process your data and maybe do some splitting, then you build the model itself using various techniques, whether it's TensorFlow or other SDKs, then you train, evaluate, deploy, and make predictions. It's a classic flow. But one of the things we've done in BigQuery is implement this concept of BigQuery ML, which effectively lets you identify a problem, skip a lot of the pre-processing, splitting, and model fabrication, and just focus on identifying the problem, training the model, and then making predictions.

Right now there is a collection of different model types that you can implement inside of BigQuery ML. We have linear regression for basic estimation, you can do logistic regression, you can do clustering, we have matrix factorization, which works well for recommendation systems, and we recently announced ARIMA for forecasting, which is in alpha. You can also import TensorFlow models back into BigQuery for prediction if you have trained them someplace else. In this case, what I'm going to do is run a forecast. This is the retailing site I was talking about, where I'm ingesting all of my clickstream, inventory, and purchasing data, and the ask here is that my boss came in and said, hey, I want a quick forecast of our clickstream data. Fortunately I have all of that clickstream and sales data flowing into my data warehouse, so let's look at how BigQuery ML can be used to build a forecasting model over that data.

Now, back in my notebook, I start this journey with a quick inspection of my clickstream data. You can see it looks very basic; I print out some columns, because what I really want to do is focus on how many people are coming and browsing the site. I'm trying to project what type of traffic I'm going to see on my site in the future. So I take that query and do some reduction, because what I really want is to look at traffic by day, not by individual product; I do a GROUP BY, and now I can see sales and browsing activity by day. This is good: I've effectively created the target set that I want to build my model on. And this is, in essence, the magic inside of BigQuery: I can write this statement, CREATE OR REPLACE MODEL, in this case using ARIMA for forecasting, passing the column for my sales, which is the target, and then the time-series information, which is the day, and I tell it, in essence, to go train the model. This takes, depending on the size of your data, a couple of seconds up to a couple of minutes or hours; you don't have to worry about the scalability, because underneath the covers we scale out the number of slots needed and train the model. Then, from there, I can call forecast; in this case I'm going to forecast out thirty days beyond the data that I've loaded.
development and here are the outputs of

that forecast so here's all the

historical information and you can kind

of see starting in January you know

information or levels are a little bit

low then they start the spike we have a

little seasonality we have a little bit

more seasonality here kind of sales drop

off and then we're right into starting

in the start so you can kind of see that

the forecast was was was pretty on spots

we had a confidence interval right now

of 90 percent and you can kind of see

rate at the very end it starts the drift

problem here is actually don't have

enough data historical data about what's

happening we see a big big spike this is

basically as people get in the summer


and they're starting to buy a lot more

shoes so one of the ways to solve this

problem is to either provide some hints

around seasonality because we know

people are buying more shoes or get

access to more data now the beauty here

is I didn't have to export any of this

information it all stayed inside of a

query so the flow is a lot more

streamlined and if I had to do more

complex transformations on that input

data I could easily do that and just go

run some additional queries save them

back as tables and use those as my

source of truth for the features that

have been put into this model now the

Now, the last piece as we wrap up, bringing this all together, is a demo inside of Looker, and I need to log back in to my account. I may have typed that incorrectly; oh, there we go. Let me bring the correct dashboard up for you. There we go. So, with my ETL in place, with all my fact tables in place, with my transformations in place, I can end up building a fairly robust dashboard. In this case I'm using Looker because Looker provides a very powerful semantic model to layer on top of the BigQuery schema, so it makes it easy for you to express concepts like funnels and conversion in human-readable and repeatable patterns. BigQuery is the source of truth for all my inputs, but Looker is the source of truth for all the semantics and meaning.

All right, we're pretty much at time. There are a ton of topics we didn't cover: things like access control, more refined cost controls, using INFORMATION_SCHEMA to understand what's happening with the queries being executed, monitoring, logging, and all the different ways to do query optimization. After 10 years there's a lot to cover, especially when you're constrained to 45 or 50 minutes. So with that, I encourage you to go get started. You can sign up for the free tier if you just want to hack by yourself; if you have BigQuery already deployed and available inside your company, ask to get a carve-out on a separate project for your own testing; and if you're already doing a lot of prototyping on BigQuery, check out the new flex slots model; it's a great way to start expanding workloads and throughput for the data analysis processes you're building.

The last thing I'll leave you with is a couple of quick links. If you really want to get up to speed, I encourage you to read the book called Google BigQuery: The Definitive Guide. It was written by two Googlers, Valliappa Lakshmanan and Jordan Tigani; Lak is the head of solutions engineering for data analytics and ML, and Jordan is the director of product management for BigQuery, so this is an excellent source of truth. I also encourage you to learn more via the Data Engineering with Google Cloud course series on Coursera, which also helps you get certified with the Google Professional Data Engineer certification. And last but not least, I encourage you to go follow these folks if you don't already. Felipe Hoffa is the lead developer advocate for BigQuery; he is an amazing source of information and tribal knowledge for BigQuery. I would also encourage you to follow Lak and Jordan, because they are effectively tracking a lot of new features and architecture information in real time, as well as Tino Tereshko, who is one of the lead product managers on BigQuery. Follow them, and they will lead you to more and more valuable information about BigQuery. With that, I thank you for your time, and I hope you stay safe and healthy. So let's move over to some questions on the stream.

Absolutely, thank you Eric. We do have a lot of questions from our customers on the live stream, so thank you for this outstanding presentation, and I just want to add a word to our customers: do not hesitate to reach out to your Google account team; we are here to help you and to develop further any subject you would like to discuss in the future. The first question was answered already, but the next one is from Antonio.

Yes, so, hi there. I didn't talk about materialized views in this session; they are a new feature coming to BigQuery, and Antonio, probably the best thing to do is to send me a mail so we can dig into what's happening with the materialized view itself and what's in the view. There are some situations where you may not see performance improvements, so I'd be more than happy to take a look at that query and schema with you offline.

Could you please put the presentation back up? I'm sorry, can you say that again? Yes, the presentation mode, so customers can see it. Thank you. Oh, there we go. All right, so, next question.

I'm going to skip down to the question from Alex, which asks: do you still have to pay for scanned terabytes when using flex slots? The answer to that is no. Once you're on a slot model, you're paying only for the slot cost, not for the scan cost. The reason I showed that demo, though, is to help you understand how complex those queries would be, because the more data you're scanning, the more slot utilization you're more than likely going to consume. So, as a best practice, I typically always go ahead and look at the scan volume whenever I'm writing queries, just so I make sure I understand the potential impact, and I also look for ways to gain efficiencies: a lot of times I'll prune off columns because I don't need them, or I'll look at the table and say, maybe I have a better partitioning or clustering scheme that I can come up with to reduce resource utilization.


Okay, so: what is the meaning of slot millis for a BigQuery job, and how do I know the number of slots consumed by a query execution? A really good question. Slot millis is basically an aggregate of overall slot time. If you look at the output in the BigQuery console, it will show you the overall slot millis, which is the total number of milliseconds of processing across all of the slots that were used. The UI will also show you the effective wall-clock time. So say the effective wall-clock time for a query was one minute, but you used ten thousand or a hundred thousand milliseconds of slots; that means the processing was effectively distributed over n number of slots inside of that time window. Now, one thing you can do is dig into the actual query plan itself, because as the query runs and moves from stage to stage, one stage may be using a smaller number of slots, and then the second stage, which is more resource intensive, expands out to much wider slot utilization. So the number that you see is basically just the total slot millis used for that particular query.
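One way to look at those numbers yourself is through the jobs view in INFORMATION_SCHEMA; here is a rough sketch, assuming the region-us qualifier and your own project defaults:

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT job_id, total_slot_ms, total_bytes_processed,
       TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS wall_clock_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY total_slot_ms DESC
LIMIT 10
"""
for row in client.query(sql).result():
    print(row.job_id, row.total_slot_ms, row.wall_clock_ms)
```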

Let's see what else is in there. Oh, Antonio, sorry, I kind of skipped over your question about the BigQuery Storage API. I didn't get to talk about that either here, but over email we could also dig into that, because I'd be interested to understand what your use cases are, whether you're looking to build some type of SDK or integrate it into some other product. And yes, I will post these notebooks on a GitHub link so that you can play with them, for sure.

Wonderful. If we do not have more questions, and looking at the time, we can definitely wrap up. Thank you, Eric, for your presentation, it was very rich in terms of content, and I want to thank our customers for the great questions. I encourage you to subscribe to the Google Cloud forums, where you will see all the YouTube live sessions from the past and, as soon as it is published during the week, the one from today; we will also be able to send you the presentation. I also encourage you, as I mentioned earlier, to reach out to your Google Cloud account team; we are here to help and serve you, so do not hesitate. Thank you everyone for joining us for this edition of Lunch and Learn, held every Wednesday at noon. I wish all of you to stay safe, have a great week, thank you everyone, and talk to you very soon.
