
Google Cloud Big Data and Machine Learning Fundamentals  

I'm passionate about both technology and education.


00:05Before joining Google, I spent 15 years as a university professor teaching in
the fields of data and machine learning.
00:15My name is Yoanna Long.
00:18I'm a Technical Curriculum Developer at Google Cloud who specializes in
machine learning and artificial intelligence.
00:27As a member of the Google Cloud team, my goal is to make education on
Google's cutting edge technology available to as many learners as possible.
00:39It's an exciting time to explore data and AI and get the most from these
transformational technologies.
00:49These technologies happen once in a generation, and they're going to define
the new century, probably more than anything else will.
00:59And I look forward to partnering with you to ensure that you can take
advantage of these technologies and really be the leaders in your field.
01:10I wish you the best on your learning journey, and be on the lookout for my
name as I help to build the learner community.

Big Data and Machine Learning on Google Cloud

Course introduction
Welcome to the Google Cloud Big Data and Machine Learning Fundamentals course! My
name is Marcus, and I’m Katelyn, and we’re from the Google Cloud team. We’ll be leading
you through this course.
00:12It's an exciting time to be exploring big data, artificial intelligence, and machine
learning. Innovation in this field is presenting new opportunities that weren't available
just a few years ago, and by joining us on this course, we hope you'll
00:25be putting yourself in a position to benefit from these technologies. This
course provides an introduction to the tools and technologies Google Cloud offers to
work with large data sets and then integrate that data into
00:36the artificial intelligence and machine learning lifecycle. Data and AI have a
powerful partnership; data is the foundation of every application integrated with
artificial intelligence. Without data, there is nothing for AI to learn from, no pattern to
recognize, and no insight
00:53to glean. Conversely, without artificial intelligence, large amounts of data can
be unmanageable or underutilized. Google has nine products with over one billion
users: Android, Chrome, Gmail, Google Drive, Google Maps, Google Search, the
Google Play Store, YouTube, and Photos.
01:14That’s a lot of data being processed every day! To meet the needs of a
growing user base, Google has developed the infrastructure to ingest, manage, and
serve high quantities of data from these applications.
01:27And artificial intelligence and machine learning have been integrated into
these products to make the user experience of each even more productive. This
includes features like search in Photos, recommendations in YouTube, or Smart
Compose
01:40in Gmail. Google Cloud offerings can be broadly categorized as compute,
storage, big data, and machine learning services for web, mobile, analytics, and
backend solutions. The main focus of this course is on big data and machine
learning.
01:58In the first section, you’ll be introduced to big data and machine learning on
Google Cloud. This includes Google Cloud’s infrastructure and big data and machine
learning products. In the second section of the course, you’ll explore data engineering
for streaming data.
02:12This includes how to build a streaming data pipeline, from ingestion with
Pub/Sub, to processing with Dataflow, and finally, to visualization using Data Studio
and Looker. After that, you’ll explore big data with BigQuery, Google’s popular data
warehouse
02:27tool, and BigQuery ML, the embedded ML functionality used for developing
machine learning models directly in BigQuery. From there, you’ll compare the four
options provided by Google Cloud to build and deploy
02:39a machine learning model. And in the final section of the course, you’ll learn
how to build a machine learning workflow from start to finish using Vertex AI, a
unified platform that brings all the components of
02:51the machine learning ecosystem and workflow together. The course includes
a mix of videos, quizzes, and hands-on labs. Labs are hosted in a clean, cloud
environment for a fixed period of time.
03:02At the end of each section, you’ll also find a resources section with links to
additional reading material. This course was designed for a wide range of learners.
This includes anyone at an organization involved or interested in the data-to-AI
lifecycle,
03:17such as product managers, data analysts, data engineers, data scientists, ML
developers, and ML engineers. While you’ll be learning about services and concepts
that are specific to big data and machine learning in this course, remember that,
because this
03:32is a fundamentals-level course, some content will be geared toward learners
who are entirely new to cloud technologies. And although this course has no
prerequisites, some knowledge of SQL and
03:42basic machine learning concepts will be helpful. You can learn more about
where this course fits into the learning path for your specific role and all the training
courses offered by Google Cloud at cloud.google.com/training.
03:54OK, are you ready to learn more about the exciting world of data and machine
learning? Great, let's get started!
What are they?

1. Big Data and Machine Learning on Google Cloud
2. Data Engineering for Streaming Data
3. Big Data with BigQuery
4. Machine Learning Options on Google Cloud
5. The Machine Learning Workflow with Vertex AI
Reading list: Course Introduction // Big Data and Machine Learning Fundamentals
● The Google Cloud Training website
● Learning path: Data Engineering and Smart Analytics
● Learning path: Machine Learning and AI

Google Cloud courses and training  |  Google Cloud Training

Data engineering and analytics courses  |  Google Cloud Training

Machine learning and AI courses  |  Google Cloud Training

Big Data and Machine Learning on Google Cloud
Welcome to the first section of the Big Data and Machine Learning course. Here you’ll
explore the Google infrastructure through compute and storage, and see how innovation has
enabled big data and machine learning capabilities. After that, you’ll see the history of big
00:16data and ML products, which will help you understand the relevant product
categories. And to put it all together, you’ll examine an example of a customer who
adopted Google Cloud for their big data and machine learning needs.
00:29Finally, you’ll get hands-on practice using big data tools to analyze a public
dataset. Google has been working with data and artificial intelligence since its early
days as a company in 1998.
00:43Ten years later in 2008 the Google Cloud Platform was launched to provide
secure and flexible cloud computing and storage services. You can think of the
Google Cloud infrastructure in terms of three layers.
00:56At the base layer is networking and security, which lays the foundation to
support all of Google’s infrastructure and applications. On the next layer sit compute
and storage. Google Cloud separates, or decouples, as it’s technically called,
compute and storage so
01:13they can scale independently based on need. And on the top layer sit the big
data and machine learning products, which enable you to perform tasks to ingest,
store, process, and deliver business insights, data pipelines,
01:29and ML models. And thanks to Google Cloud, these tasks can be
accomplished without needing to manage and scale the underlying infrastructure. In
the videos that follow, we’ll focus on the middle layer, compute and storage,
01:43and the top layer, big data and machine learning products. Networking and
security fall outside of the focus of this course, but if you’re interested in learning
more you can explore cloud.google.com/training
01:55for more options.

Google has been working with data since 1998, and in 2008 the Google Cloud Platform was launched.

There are three layers of the Google Cloud infrastructure, from bottom to top.

Compute:

Let’s focus our attention on the middle layer of the Google Cloud infrastructure, compute and
storage. We’ll begin with compute. Organizations with growing data needs often require lots
of compute power to run big data jobs.
00:14And as organizations design for the future, the need for compute power only
grows. Google offers a range of computing services. The first is Compute Engine.
Compute Engine is an IaaS offering, or infrastructure as
00:30a service, which provides raw compute, storage, and network capabilities
organized virtually into resources that are similar to physical data centers. It provides
maximum flexibility for those who prefer to manage server instances themselves.
00:46The second is Google Kubernetes Engine, or GKE. GKE runs containerized
applications in a cloud environment, as opposed to on an individual virtual machine,
like Compute Engine. A container represents code packaged up with all its
dependencies.
01:03The third computing service offered by Google is App Engine, a fully managed
PaaS offering, or platform as a service. PaaS offerings bind code to libraries that
provide access to the infrastructure an application needs.
01:19This allows more resources to be focused on application logic. Then there is
Cloud Functions, which executes code in response to events, like when a new file is
uploaded to Cloud Storage.
01:30It’s a completely serverless execution environment, often referred to as
functions as a service. And finally, there is Cloud Run, a fully managed compute
platform that enables you to run request or event-driven stateless workloads without
having to worry about servers.
01:48It abstracts away all infrastructure management so you can focus on writing
code. It automatically scales up and down from zero, so you never have to worry
about scale configuration.
01:59Cloud Run charges you only for the resources you use, so you never pay for
overprovisioned resources. Let’s look at an example of a technology that requires a
lot of compute power.
02:10Google Photos offers a feature called automatic video stabilization. This takes
an unstable video, like one captured while riding on the back of a motorbike, and
stabilizes it to minimize movement.
02:24[YouTube clip] For this feature to work as intended, you need the proper data.
This includes the video itself, which is really a large collection of individual images,
along with time series data on the camera’s position
02:43and orientation from the onboard gyroscope, and motion from the camera
lens. A short video can require over a billion data points to feed the ML model to
create a stabilized
02:54version. As of 2020, roughly 28 billion photos and videos were uploaded to
Google Photos every week, with more than four trillion photos in total stored in the
service. To ensure that this feature works as intended, and accurately, the Google
Photos team needed
03:11to develop, train, and serve a high-performing machine learning model on
millions of videos. That’s a large training dataset! Just as the hardware on a standard
personal computer might not be powerful enough to process a big data job for an
organization, the hardware
03:28on a smartphone is not powerful enough to train sophisticated ML models.
That’s why Google trains production machine learning models on a vast network of
data centers, only to then deploy smaller, trained versions
of the models to the smartphone and personal computer hardware. But where
does all that processing power come from? According to Stanford University’s 2019
AI index report, before 2012, artificial intelligence results tracked closely with Moore’s
Law, with the
04:00required computing power used in the largest AI training runs doubling every
two years. The report states that, since 2012, the required computing power has been
doubling approximately every three and a half months.
04:13This means that hardware manufacturers have run up against limitations, and
CPUs, which are central processing units, and GPUs, which are graphics processing
units, can no longer scale to adequately meet the
04:25rapidly growing demand for ML. To help overcome this challenge, in 2016 Google
introduced the Tensor Processing Unit, or TPU. TPUs are Google’s custom-developed
application-specific integrated circuits (ASICs) used to accelerate machine learning
workloads.
04:43TPUs act as domain-specific hardware, as opposed to general-purpose
hardware with CPUs and GPUs. This allows for higher efficiency by tailoring
architecture to meet the computation needs in a domain, such as the matrix
multiplication in machine learning.
05:01With TPUs, the computing speed increases more than 200 times. This means
that instead of waiting 26 hours for results with a single state-of-the-art GPU, you’ll only
need to wait 7.9 minutes for a full Cloud TPU v2 pod
05:17to deliver the same results. Cloud TPUs have been integrated across Google
products, and this state-of-the-art hardware and supercomputing technology is available
with Google Cloud products and services.

Storage

Now that we’ve explored compute and why it’s needed for big data and ML jobs, let’s now
examine storage. For proper scaling capabilities, compute and storage are decoupled. This is
one of the major differences between cloud and desktop computing.
00:18With cloud computing, processing limitations aren’t attached to storage disks.
Most applications require a database and storage solution of some kind. With
Compute Engine, for example, which was mentioned in the previous video, you can
00:33install and run a database on a virtual machine, just as you would do in a data
center. Alternatively, Google Cloud offers fully managed database and storage
services. These include: Cloud Storage,
00:47Cloud Bigtable, Cloud SQL, Cloud Spanner, Firestore, and BigQuery. The goal of
these products is to reduce the time and effort needed to store data. This means
creating an elastic storage bucket directly in a web interface
01:03or through a command line for example on Cloud Storage. Google Cloud offers
relational and non-relational databases, and worldwide object storage. We’ll explore
those options in more detail soon. Choosing the right option to store and process
data often
01:21depends on the data type that needs to be stored and the business need. Let’s
start with unstructured versus structured data. Unstructured data is information
stored in a non-tabular form such
01:33as documents, images, and audio files. Unstructured data is usually best
suited to Cloud Storage. Cloud Storage has four primary storage classes. The first is
Standard Storage. Standard Storage is considered best for frequently accessed, or
“hot,” data.
01:53It’s also great for data that is stored for only brief periods of time. The second
storage class is Nearline Storage. This is best for storing infrequently accessed data,
like reading or modifying data once
02:07per month or less, on average. Examples include data backups, long-tail
multimedia content, or data archiving. The third storage class is Coldline Storage.
This is also a low-cost option for storing infrequently accessed data.
02:24However, as compared to Nearline Storage, Coldline Storage is meant for
reading or modifying data, at most, once every 90 days. The fourth storage class is
Archive Storage. This is the lowest-cost option, used ideally for data archiving, online
backup, and disaster
02:44recovery. It’s the best choice for data that you plan to access less than once a
year, because it has higher costs for data access and operations and a 365-day
minimum storage duration.
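
As a concrete illustration of these storage classes, here is a minimal sketch using the google-cloud-storage Python client library. The bucket and object names are hypothetical, and it assumes credentials are already configured and the object already exists before its class is changed.

from google.cloud import storage

client = storage.Client()

# Create a bucket whose default class is Nearline (infrequently accessed data).
bucket = client.bucket("my-example-backups")   # hypothetical bucket name
bucket.storage_class = "NEARLINE"
client.create_bucket(bucket, location="US")

# Later, move a single object to Coldline once it is read less than every 90 days.
blob = bucket.blob("2023-backup.tar.gz")       # hypothetical object, assumed to exist
blob.update_storage_class("COLDLINE")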
02:57Alternatively, there is structured data, which represents information stored in
tables, rows, and columns. Structured data comes in two types: transactional
workloads and analytical workloads. Transactional workloads stem from Online
Transaction Processing systems, which are used when fast
03:17data inserts and updates are required to build row-based records. This is
usually to maintain a system snapshot. They require relatively standardized queries
that impact only a few records. Then there are analytical workloads, which stem from
Online Analytical Processing systems,
03:36which are used when entire datasets need to be read. They often require
complex queries, for example, aggregations. Once you’ve determined if the
workloads are transactional or analytical, you’ll need to identify whether the data will
be accessed using SQL or not.
03:53So, if your data is transactional and you need to access it using SQL, then
Cloud SQL and Cloud Spanner are two options. Cloud SQL works best for local to
regional scalability, while Cloud Spanner is best
04:07for scaling a database globally. If the transactional data will be accessed without
SQL, Firestore might be the best option. Firestore is a transactional NoSQL,
document-oriented database. If you have analytical workloads that require SQL
commands, BigQuery is likely the best
04:27option. BigQuery, Google’s data warehouse solution, lets you analyze petabyte-
scale datasets. Alternatively, Cloud Bigtable provides a scalable NoSQL solution for
analytical workloads. It’s best for real-time, high-throughput applications that require
only millisecond
04:46latency.
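
The decision path described above can be summarized in a short sketch. The helper below is purely illustrative (it is not a Google Cloud API); it simply encodes the rules from this video.

def choose_storage_product(structured, workload, needs_sql, global_scale=False):
    """Illustrative only: suggest a product based on the rules described above."""
    if not structured:
        return "Cloud Storage"                       # unstructured data
    if workload == "transactional":
        if needs_sql:
            return "Cloud Spanner" if global_scale else "Cloud SQL"
        return "Firestore"                           # transactional, no SQL
    # analytical workloads
    return "BigQuery" if needs_sql else "Cloud Bigtable"

# Example: structured, analytical data that will be queried with SQL -> "BigQuery"
print(choose_storage_product(structured=True, workload="analytical", needs_sql=True))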
How do you use this architecture?

The history of big data and ML products


The final layer of the Google Cloud infrastructure that is left to explore is big data and
machine learning products. In this video, we’ll examine the evolution of data processing
frameworks through
00:11the lens of product development. Understanding the chronology of products
can help address typical big data and ML challenges. Historically speaking, Google
experienced challenges related to big data quite early– mostly with large datasets,
00:27fast-changing data, and varied data. This was the result of needing to index
the World Wide Web. And as the internet grew, Google needed to invent new data
processing methods.
00:39So, in 2002, Google released the Google File System, or GFS. GFS was
designed to handle data sharing and petabyte storage at scale. It served as the
foundation for Cloud Storage and also what would become the managed storage
00:54functionality in BigQuery. A challenge that Google was facing around this time
was how to index the exploding volume of content on the web. To solve this, in 2004
Google wrote a report that introduced MapReduce.
01:10MapReduce was a new style of data processing designed to manage large-
scale data processing across big clusters of commodity servers. As Google
continued to grow, new challenges arose, specifically with recording and retrieving
01:25millions of streaming user actions with high throughput. The solution was the
release in 2005 of Cloud Bigtable, a high-performance NoSQL database service for
large analytical and operational workloads. With MapReduce available, some
developers were restricted by the need to write code
01:43to manage their infrastructure, which prevented them from focusing on
application logic. As a result, from 2008 to 2010, Google started to move away from
MapReduce as the solution to process and query large datasets.
01:58So, in 2008, Dremel was introduced. Dremel took a new approach to big-data
processing by breaking the data into smaller chunks called shards, and then
compressing them. Dremel then uses a query optimizer to share tasks between the
many shards of data and
02:15the Google data centers, which processed queries and delivered results. The
big innovation was that Dremel autoscaled to meet query demands. Dremel became
the query engine behind BigQuery. Google continued innovating to solve big data and
machine learning challenges.
02:34Some of the technology solutions released include: Colossus, in 2010, which is
a cluster-level file system and successor to the Google File System. BigQuery, in 2010
as well, which is a fully-managed, serverless data warehouse that enables scalable
02:50analysis over petabytes of data. It is a Platform as a Service (PaaS) that
supports querying using ANSI SQL. It also has built-in machine learning capabilities.
BigQuery was announced in May 2010 and made generally available in November
2011.
03:09Spanner, in 2012, which is a globally available and scalable relational
database. Pub/Sub, in 2015, which is a service used for streaming analytics and data
integration pipelines to ingest and distribute data.
03:25And TensorFlow, also in 2015, which is a free and open source software library
for machine learning and artificial intelligence. 2018 brought the release of the
Tensor Processing Unit, or TPU, which you’ll recall from earlier,
03:40and AutoML, as a suite of machine learning products. The list continues up to
Vertex AI, a unified ML platform released in 2021. And it’s thanks to these
technologies that the big data and machine learning product
line is now robust. This includes: Cloud Storage, Dataproc, Cloud Bigtable,
BigQuery, Dataflow, Firestore, Pub/Sub, Looker, Cloud Spanner, AutoML, and Vertex AI,
the unified platform. These products and services are made available
04:15through Google Cloud, and you’ll get hands-on practice with some of them as
part of this course.

Big data and ML product categories

As we explored in the last video, Google offers a range of big data and machine learning
products. So, how do you know which is best for your business needs?
00:09Let’s look closer at the list of products, which can be divided into four general
categories along the data-to-AI workflow: ingestion and process, storage, analytics,
and machine learning. Understanding these product categories can help narrow
down your choice.
00:29The first category is ingestion and process, which includes products that are
used to digest both real-time and batch data. The list includes Pub/Sub, Dataflow,
Dataproc, and Cloud Data Fusion. You’ll explore how Dataflow and Pub/Sub can ingest
streaming data later in this course.
00:52The second product category is data storage, and you’ll recall from earlier that
there are five storage products: Cloud Storage, Cloud SQL, Cloud Spanner, Cloud
Bigtable, and Firestore. Cloud SQL and Cloud Spanner are relational
01:10databases, while Bigtable and Firestore are NoSQL databases. The third
product category is analytics. The major analytics tool is BigQuery. BigQuery is a fully
managed data warehouse that can be used to analyze data through SQL
01:26commands. In addition to BigQuery, you can analyze data and visualize results
using Google Data Studio and Looker. You will explore BigQuery, Looker, and Data
Studio in this course. And the final product category is machine learning, or ML.
01:44ML products include both the ML development platform and the AI solutions.
The primary product of the ML development platform is Vertex AI, which includes the
products and technologies AutoML, Vertex AI Workbench, and
02:01TensorFlow. AI solutions are built on the ML development platform and include
state-of-the-art products to meet both horizontal and vertical market needs. These
include Document AI, Contact Center AI, Retail Product Discovery, and Healthcare
Data Engine.
02:21These products unlock insights that only large amounts of data can provide.
We’ll explore the machine learning options and workflow together with these
products in greater detail later.

Customer example: Gojek

With many big data and machine learning product options available, it can be helpful to see
an example of how an organization has leveraged Google Cloud to meet their goals.
00:10In this video, you’ll learn about a company called Gojek and how they were
able to find success through Google Cloud’s data engineering and machine learning
offerings. This story starts in Jakarta, Indonesia.
00:24Traffic congestion is a fact of life for most Indonesian residents. To minimize
delays, many rely heavily on motorcycles, including motorcycle taxis, known as Ojeks,
to travel to and from work or personal engagements.
00:39Founded in 2010 and headquartered in Jakarta, a company called Gojek
started as a call center for ojek bookings. The organization has leveraged demand for
the service to become one of the few "unicorns"
00:51in Southeast Asia. A “unicorn” is a privately held startup business valued at
over US$1 billion. Since its inception, Gojek has collected data to understand
customer behavior, and in 2015 launched
01:07a mobile application that bundled ride-hailing, food delivery, and grocery
shopping. They hit hypergrowth very quickly. According to the Q2 2021 Gojek fact
sheet the Gojek app has been downloaded over 190
01:23million times, and they have 2 million driver partners and about 900,000
merchant partners. The business has relied heavily on the skills and expertise of its
technology team and on
01:36selecting the right technologies to grow and to expand into new markets.
Gojek chose to run its applications and data in Google Cloud. Gojek’s goal is to
match the right driver with the right request
01:49as quickly as possible. In the early days of the app, a driver would be pinged
every 10 seconds, which meant 6 million pings per minute, which turned out to be
01:59 8 billion pings per day across their driver partners. They generated around five
terabytes of data each day. Leveraging information from this data was vital to
meeting their company goals.
02:12But Gojek faced challenges along the way. Let’s explore two of them to see
how Google Cloud was able to solve them. The first challenge was data latency.
When they wanted to scale their big data platform, they found that most reports were
produced
02:26one day later, so they couldn’t identify problems immediately. To help solve
this, Gojek migrated their data pipelines to Google Cloud. The team started using
Dataflow for streaming data processing and
02:39BigQuery for real-time business insights. Another challenge was quickly
determining which location had too many, or too few, drivers to meet demand. Gojek
was able to use Dataflow to build a streaming event data pipeline.
02:55This let driver locations ping Pub/Sub every 30 seconds, and Dataflow would
process the data. The pipeline would aggregate the supply pings from the drivers
against the booking requests. This would connect to Gojek’s notification system to
alert drivers where they should
03:11go. This process required a system that was able to scale up to handle times
of high throughput and then back down again. Dataflow was able to automatically
manage the number of workers
03:23processing the pipeline to meet demand. The Gojek team was also able to
visualize and identify supply and demand issues. They discovered that the areas with
the highest discrepancy between supply and demand came
03:35from train stations. Often there were far more booking requests than there
were available drivers. Since using Google Cloud’s big data and machine learning
products, the Gojek team has been able
03:48actively monitor requests to ensure that drivers are in the areas with the
highest demand. This brings faster bookings for riders and more work for the drivers.
This brings us to the end of the first section of the Big Data and Machine Learning course.
04:11Before we move forward, let’s review what we’ve covered so far. You began by
exploring the Google Cloud infrastructure through three different layers. At the base
layer is networking and security, which makes up the foundation to support all
04:14of Google’s infrastructure and applications. On the next layer sit compute and
storage. Google Cloud decouples compute and storage so they can scale
independently based on need. And on the top layer sit the big data and machine
learning products.
04:18In the next section, you learned about the history of big data and ML
technologies, and then explored the four major product categories that support the
data to AI workflow:
04:22ingestion and process, storage, analytics, and machine learning. After that, you
saw an example of how Gojek, the Indonesian on-demand multi-service platform and
digital payment technology group, leveraged Google Cloud big data and ML products
to expand
04:26their business. And finally, you got hands-on practice with BigQuery by
analyzing a public dataset.
Lab introduction: Exploring a BigQuery public dataset

Now it’s time for you to take a break from course videos and get hands-on practice with one of
the big data and machine learning products that was introduced earlier–
00:09BigQuery. In the lab that follows this video, you’ll use BigQuery to explore a
public dataset. You’ll practice: Querying a public data set Creating a custom table
Loading data into a table, and
00:25Querying a table. Please note that this exercise involves leaving the current
learning platform and opening Qwiklabs. Qwiklabs offers a free, clean, Google Cloud
environment for a fixed period of time.
00:36You’ll have multiple attempts at each lab, so if you don’t complete it the first
time, or if you want to experiment more with it later on, you can return and start a new
instance.

Lab

Quick quiz. You need a table to hold the dataset.

SELECT
  name, gender,
  SUM(number) AS total
FROM
  `bigquery-public-data.usa_names.usa_1910_2013`
WHERE
  gender = 'F'
GROUP BY
  name, gender
ORDER BY
  total DESC
LIMIT
  10
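
For reference, the same query can be run outside the console, for example with the BigQuery Python client library. This is a minimal sketch and assumes a project with billing enabled and credentials already configured.

from google.cloud import bigquery

client = bigquery.Client()  # uses the active project and credentials

query = """
    SELECT name, gender, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    WHERE gender = 'F'
    GROUP BY name, gender
    ORDER BY total DESC
    LIMIT 10
"""

# client.query() starts the job; result() waits for it to finish and returns rows.
for row in client.query(query).result():
    print(row.name, row.total)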

1.
Cloud Storage, Cloud Bigtable, Cloud SQL, Cloud Spanner, and Firestore represent which
type of services?

Machine learning

Networking

Compute

Database and storage


2.
Which Google hardware innovation tailors architecture to meet the computation needs on a
domain, such as the matrix multiplication in machine learning?

CPUs (central processing units)

GPUs (graphic processing units)

DPUs (data processing units)

TPUs (Tensor Processing Units)


3.
Pub/Sub, Dataflow, Dataproc, and Cloud Data Fusion align to which stage of the data-to-AI
workflow?

Ingestion and process

Storage

Machine learning

Analytics
4.
AutoML, Vertex AI Workbench, and TensorFlow align to which stage of the data-to-AI
workflow?

Ingestion and process

Storage

Machine learning

Analytics
5.
Which data storage class is best for storing data that needs to be accessed less than once a
year, such as online backups and disaster recovery?

Standard storage

Nearline storage

Archive storage

Coldline storage
6.
Compute Engine, Google Kubernetes Engine, App Engine, and Cloud Functions represent
which type of services?

Database and storage

Compute

Machine learning

Networking

Query a public dataset with the Google Cloud console  |  BigQuery

Using the bq command-line tool  |  BigQuery  |  Google Cloud
BigQuery for data warehouse users  |  Cloud Architecture Center  |  Google Cloud

Required reading

Reading list | Google Cloud Skills Boost for partners

Data Engineering for Streaming Data


Introduction

In the previous section of this course, you learned about the different layers of the Google Cloud
infrastructure, including the categories of big data and machine learning products.

00:09In this second section, you’ll explore data engineering for streaming data with the goal
of building a real-time data solution with Google Cloud products and services.

00:17This includes how to: ingest streaming data using Pub/Sub, process the data with
Dataflow, and visualize the results with Google Data Studio and Looker.

00:30In between data processing with Dataflow and visualization with Data Studio or
Looker, the data is normally saved and analyzed in a data warehouse such as BigQuery.

00:39You will learn the details about BigQuery in a later module.

00:44Coming up in this section, you’ll start by examining some of the big data challenges
faced by today’s data engineers when setting up and managing pipelines.

00:52Next, you’ll learn about message-oriented architecture.

00:56This includes ways to capture streaming messages globally, reliably, and at scale so
they can be fed into a pipeline.

01:05From there, you’ll see how to design streaming pipelines with Apache Beam, and then
implement them with Dataflow.

01:11You’ll explore how to visualize data insights on a dashboard with Looker and Data
Studio.

01:18And finally, you’ll get hands-on practice building an end-to-end data pipeline that
handles real-time data ingestion with Pub/Sub, processing with Dataflow, and visualization
with Data Studio.

01:30Before we get too far, let’s take a moment to explain what streaming data is, how it
differs from batch processing, and why it’s important.

01:39Batch processing is when the processing and analysis happens on a set of stored
data.
01:45Examples are payroll and billing systems that have to be processed on either a
weekly or monthly basis.

01:52Streaming data is a flow of data records generated by various data sources.

01:57The processing of streaming data happens as the data flows through a system.

02:01This results in the analysis and reporting of events as they happen.

02:05An example would be fraud detection or intrusion detection.

02:09Streaming data processing means that the data is analyzed in near real-time and that
actions will be taken on the data as quickly as possible.

02:17Modern data processing has progressed from legacy batch processing of data toward
working with real-time data streams.

02:25An example of this is streaming music and movies.

02:28No longer is it necessary to download an entire movie or album to a local device.

02:33Data streams are a key part in the world of big data.

Big data challenges


Building scalable and reliable pipelines is a core responsibility of data engineers. However, in modern
organizations, data engineers and data scientists are facing four major challenges. These are
collectively known as the 4 Vs. They are

00:16variety, volume, velocity, and veracity. First, data could come in from a variety of
different sources and in various formats. Imagine hundreds of thousands of sensors for self-
driving cars on roads around the world.

00:31The data is returned in various formats such as number, image, or even audio. Now
consider point-of-sale data from a thousand different stores. How do we alert our
downstream systems of new transactions in an organized way with

00:47no duplicates? Next, let’s increase the magnitude of the challenge to handle not only
an arbitrary variety of input sources, but a volume of data that varies from gigabytes to
petabytes.

01:00You’ll need to know whether your pipeline code and infrastructure can scale with
those changes or whether it will grind to a halt or even crash. The third challenge concerns
velocity.

01:13Data often needs to be processed in near-real time, as soon as it reaches the system.
You’ll probably also need a way to handle data that arrives late, has bad data in the message,

01:23or needs to be transformed mid-flight before it is streamed into a data warehouse.


And the fourth major challenge is veracity, which refers to the data quality. Because big data
involves a multitude of data dimensions resulting
01:38from different data types and sources, there’s a possibility that gathered data will
come with some inconsistencies and uncertainties. Challenges like these are common
considerations for pipeline developers. By the end of this section, the goal is for you to better
understand the tools available

01:55to help successfully build a streaming data pipeline and avoid these challenges.

Message-oriented architecture
One of the early stages in a data pipeline is data ingestion, which is where large amounts of
streaming data are received. Data, however, may not always come from a single, structured

00:11database. Instead, the data might stream from a thousand, or even a million, different
events that are all happening asynchronously. A common example of this is data from IoT, or
Internet of Things, applications.

00:26These can include sensors on taxis that send out location data every 30 seconds or
temperature sensors around a data center to help optimize heating and cooling. These IoT
devices present new challenges to data ingestion, which can be summarized in

00:42four points: The first is that data can be streamed from many different methods and
devices, many of which might not talk to each other and might be sending bad or delayed
data.

00:54The second is that it can be hard to distribute event messages to the right
subscribers. Event messages are notifications. A method is needed to collect the streaming
messages that come from IoT sensors and broadcast

01:06them to the subscribers as needed. The third is that data can arrive quickly and at high
volumes. Services must be able to support this. And the fourth challenge is ensuring services
are reliable, secure, and perform as expected.

01:22Google Cloud has a tool to handle distributed message-oriented architectures at scale,


and that is Pub/Sub. The name is short for Publisher/Subscriber, or publish messages to
subscribers. Pub/Sub is a distributed messaging service

01:38that can receive messages from a variety of device streams such as gaming events,
IoT devices, and application streams. It ensures at-least-once delivery of received messages
to subscribing applications, with no provisioning required.

01:52Pub/Sub’s APIs are open, the service is global by default, and it offers end-to-end
encryption. Let’s explore the end-to-end architecture using Pub/Sub. Upstream source data
comes in from devices all over the globe and is ingested into Pub/Sub,

02:10which is the first point of contact within the system. Pub/Sub reads, stores, and
broadcasts to any subscribers of this data topic that new messages are available. As a
subscriber of Pub/Sub, Dataflow can ingest and

02:24transform those messages in an elastic streaming pipeline and output the results into
an analytics data warehouse like BigQuery. Finally, you can connect a data visualization tool,
like Looker or Data Studio, to
02:38visualize and monitor the results of a pipeline, or an AI or ML tool such as Vertex AI to
explore the data to uncover business insights or help with predictions.
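
To make the ingestion step concrete, here is a minimal sketch of a device publishing a reading to a Pub/Sub topic with the Python client library. The project ID, topic name, and message fields are hypothetical.

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sensor-readings")  # hypothetical

reading = {"device_id": "sensor-42", "temperature_c": 21.7}
future = publisher.publish(
    topic_path,
    data=json.dumps(reading).encode("utf-8"),  # the payload must be bytes
    location="datacenter-1",                   # optional string attribute
)
print("Published message ID:", future.result())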

02:49A central element of Pub/Sub is the topic. You can think of a topic like a radio
antenna. Whether your radio is playing music or it’s turned off, the antenna itself is always
there.

03:00If music is being broadcast on a frequency that nobody’s listening to, the stream of
music still exists. Similarly, a publisher can send data to a topic that has no subscriber to
receive it.

03:12Or a subscriber can be waiting for data from a topic that isn’t getting data sent to it,
like listening to static from a bad radio frequency. Or you could have a fully operational
pipeline where the publisher is sending data to a topic

03:24that an application is subscribed to. That means there can be zero, one, or more
publishers, and zero, one or more subscribers related to a topic. And they’re completely
decoupled, so they’re free to break without

03:38affecting their counterparts. It’s helpful to describe this using an example. Say you’ve
got a human resources topic. A new employee joins your company, and several applications
across the company

03:51need to be updated. Adding a new employee can be an event that generates a


notification to the other applications that are subscribed to the topic, and they’ll receive the
message about the new employee starting.

04:02Now, let’s assume that there are two different types of employees: a full-time
employee and a contractor. Both sources of employee data could have no knowledge of the
other but still publish their events saying “this employee joined” into the Pub/Sub HR topic.

04:20After Pub/Sub receives the message, downstream applications like the directory
service, facilities system, account provisioning, and badge activation systems can all listen
and process their own next steps independent of one another.
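
As a sketch of what one of those subscribers might look like, the fragment below uses the Pub/Sub Python client to pull messages from a subscription and acknowledge them after processing. The project, subscription name, and processing step are hypothetical.

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "hr-badge-activation")  # hypothetical

def callback(message):
    print("New employee event received:", message.data.decode("utf-8"))
    # ... activate a badge here (hypothetical processing step) ...
    message.ack()  # tell Pub/Sub the message has been handled

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull.result(timeout=30)  # process messages for 30 seconds
    except TimeoutError:
        streaming_pull.cancel()            # stop pulling new messages
        streaming_pull.result()            # block until shutdown completes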

04:33Pub/Sub is a good solution to buffer changes for lightly coupled architectures, like this
one, that have many different publishers and subscribers. Pub/Sub supports many different
inputs and outputs, and you can even publish a Pub/Sub

04:45event from one topic to another. The next task is to get these messages reliably into
our data warehouse, and we’ll need a pipeline that can match Pub/Sub’s scale and

04:55elasticity to do it.
Designing streaming pipelines with Apache Beam
After messages have been captured from the streaming input sources, you need a
way to pipe that data into a data warehouse for analysis. This is where Dataflow comes in.

00:11Dataflow creates a pipeline to process both streaming data and batch data. “Process”
in this case refers to the steps to extract, transform, and load data, or ETL. When building a
data pipeline, data engineers often encounter challenges related to
00:29coding the pipeline design and implementing and serving the pipeline at scale. During
the pipeline design phase, there are a few questions to consider: Will the pipeline code be
compatible with both batch and streaming data, or will it

00:44need to be refactored? Will the software development kit, or SDK, used for the pipeline
code have all the transformations, mid-flight aggregations, and windowing, and be able to
handle late data?

00:58Are there existing templates or solutions that should be referenced? A popular


solution for pipeline design is Apache Beam. It’s an open source, unified programming model
to define and execute data processing

01:12pipelines, including ETL, batch, and stream processing. Apache Beam is unified, which
means it uses a single programming model for both batch and streaming data. It’s portable,
which means it can work on multiple execution environments, like Dataflow

01:30and Apache Spark, among others. And it’s extensible, which means it allows you to
write and share your own connectors and transformation libraries. Apache Beam provides
pipeline templates, so you don’t need to build a pipeline from

01:44scratch. And pipelines can be written in Java, Python, or Go. The Apache Beam software
development kit, or SDK, is a collection of software development tools in one installable
package.

01:58It provides a variety of libraries for transformations and data connectors to sources
and sinks. Apache Beam creates a model representation from your code that is portable
across many runners.

02:09Runners pass off your model for execution on a variety of different possible engines,
with Dataflow being a popular choice.
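
As a rough sketch of what such a pipeline looks like in the Apache Beam Python SDK, the fragment below reads messages from a Pub/Sub subscription, parses them, and appends rows to an existing BigQuery table. The subscription and table names are hypothetical, and windowing, aggregations, and error handling are omitted.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/sensor-readings-sub")
        | "Parse JSON" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:sensors.readings",  # table is assumed to already exist
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )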

Implementing streaming pipelines on Cloud Dataflow


As covered in the previous video, Apache Beam can be used to create data processing pipelines. The next
step is to identify an execution engine to implement those pipelines. When choosing an execution
engine for your pipeline code,

00:15it might be helpful to consider the following questions: How much maintenance
overhead is involved? Is the infrastructure reliable? How is the pipeline scaling handled? How
can the pipeline be monitored?

00:28Is the pipeline locked in to a specific service provider? This brings us to Dataflow.
Dataflow is a fully managed service for executing Apache Beam pipelines within the Google
Cloud

00:40ecosystem. Dataflow handles much of the complexity relating to infrastructure setup


and maintenance and is built on Google’s infrastructure. This allows for reliable auto scaling
to meet data pipeline demands.

00:56Dataflow is serverless and NoOps, which means No Operations. But what does that
mean exactly? A NoOps environment is one that doesn't require management from an
operations team, because maintenance, monitoring, and scaling
01:09are automated. Serverless computing is a cloud computing execution model. This is
when Google Cloud, for example, manages infrastructure tasks on behalf of the users. This
includes tasks like resource provisioning, performance tuning,

01:25and ensuring pipeline reliability. Dataflow means that you can spend more time
analyzing the insights from your datasets and less time provisioning resources to ensure that
your pipeline will successfully complete its next cycles.

01:41It’s designed to be low maintenance. Let’s explore the tasks Dataflow performs when
a job is received. It starts by optimizing a pipeline model's execution graph to remove any
inefficiencies.

01:56Next, it schedules out distributed work to new workers and scales as needed. After
that, it auto-heals any worker faults. From there, it automatically rebalances efforts to most
efficiently use its workers.

02:13And finally, it outputs data to produce a result. BigQuery is one of many options that
data can be outputted to. You’ll get some more practice using BigQuery later in this course.

02:24So by design, you don't need to monitor all of the compute and storage resources that
Dataflow manages, to fit the demand of a streaming data pipeline. Even experienced Java or
Python developers will benefit

02:37from using Dataflow templates, which cover common use cases across Google Cloud
products. The list of templates is continuously growing. They can be broken down into three
categories: streaming templates, batch templates, and

02:52utility templates. Streaming templates are for processing continuous, or real-time,


data. For example: Pub/Sub to BigQuery Pub/Sub to Cloud Storage Datastream to BigQuery
Pub/Sub to MongoDB Batch templates are for processing bulk data, or batch load data.

03:12For example: BigQuery to Cloud Storage, Bigtable to Cloud Storage, Cloud Storage to
BigQuery, and Cloud Spanner to Cloud Storage. Finally, utility templates address activities related
to bulk compression, deletion, and

03:29conversion. For a complete list of templates, please refer to the reading list.
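
To hand a Beam pipeline like the earlier sketch to Dataflow instead of running it locally, you would typically pass Dataflow-specific pipeline options. This is a minimal sketch; the project, region, and bucket names are hypothetical.

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",             # execute on Dataflow rather than locally
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # staging area for temporary files
    streaming=True,
)
# Passing these options to beam.Pipeline(options=options) submits the job to
# Dataflow, which provisions, scales, and monitors the workers for you.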
Visualization with Looker
Telling a good story with data through a dashboard can be critical to the success of a data pipeline
because data that is difficult to interpret or draw insights from might be useless.

00:10After data is in BigQuery, a lot of skill and effort can still be required to uncover
insights. To help create an environment where stakeholders can easily interact with and
visualize data, Google Cloud offers two solutions:
00:24Looker and Google Data Studio. Let’s explore both of them, starting with Looker.
Looker supports BigQuery, as well as more than 60 different SQL
databases. It allows developers to define a semantic modeling layer on top of
databases. It allows developers to define a semantic modeling layer on top of

00:41databases using Looker Modeling Language, or LookML. LookML defines logic and
permissions independent from a specific database or a SQL language, which frees a data
engineer from interacting with individual databases to focus more on business logic across
an organization.

00:59The Looker platform is 100% web-based, which makes it easy to integrate into existing
workflows and share with multiple teams at an organization. There is also a Looker API,
which can be used to embed Looker reports in other applications.

01:14Let’s explore some of Looker’s features, starting with dashboards. Dashboards, like
the Business Pulse dashboard, for example, can visualize data in a way that makes insights
easy to understand. For a sales organization, it shows figures that many might want to see at
the start of

01:31the week, like the number of new users acquired, monthly sales trends, and even the
number of year-to-date orders. Information like this can help align teams, identify customer
frustrations, and maybe
01:44even uncover lost revenue. Based on the metrics that are important to your business,
you can create Looker dashboards that provide straightforward presentations to help you
and your colleagues quickly see

01:56a high-level business status. Looker has multiple data visualization options, including
area charts, line charts, Sankey diagrams, funnels, and liquid fill gauges. To share a
dashboard with your team, you schedule delivery through storage services

02:13like Google Drive, Slack, or Dropbox. Let’s explore another Looker dashboard, this time
one that monitors key metrics related to New York City taxis over a period of time. This
dashboard displays: total revenue,

02:29total number of passengers, and total number of rides. Looker displays this
information through a timeseries to help monitor metrics over time. Looker also lets you plot
data on a map to see ride distribution, busy areas, and peak

02:45hours. The purpose of these features is to help you draw insights to make business
decisions. For more training on Looker, please refer to cloud.google.com/training.

Well done completing the lab on building a data
pipeline for streaming data! Before you move on in the course, let's do a quick recap.

03:33In this section, you explored the streaming data workflow, from ingestion to
visualization. You started by learning about the four common big data challenges, or the 4 Vs:
Volume (data size),

03:36Variety (data format), Velocity (data speed), and Veracity (data accuracy). These
challenges can be especially common when dealing with streaming data. You then learned
how the streaming data workflow provided by Google can help address these

03:40challenges. You started with Pub/Sub, which can be used to ingest a large volume of
IoT data from diverse resources in various formats. After that, you explored Dataflow, a
serverless, NoOps service, to process the data.

03:44‘Process’ here refers to ETL (extract, transform, and load). And finally, you were
introduced to two Google visualization tools, Looker and Data Studio.
Visualization with Data Studio
Another popular data visualization tool offered by Google is Data Studio.

00:04Data Studio is integrated into BigQuery, which makes data visualization possible with
just a few clicks.

00:11This means that leveraging Data Studio doesn’t require support from an administrator
to establish a data connection, which is a requirement with Looker.

00:21Data Studio dashboards are widely used across many Google products and
applications.

00:26For example, Data Studio is integrated into Google Analytics to help visualize, in this
case, a summary of a marketing website.

00:35This dashboard visualizes the total number of visitors through a map, compares
month-over-month trends, and even displays visitor distribution by age.
00:45Another Data Studio integration is the Google Cloud billing dashboard.

00:49You might be familiar with this from your account.

00:52Maybe you’ve already used it to monitor spending, for example.

00:55You’ll soon have hands-on practice with Data Studio, but in preparation for the lab,
let’s explore the three steps needed to create a Data Studio dashboard.

01:04First, choose a template.

01:06You can start with either a pre-built template or a blank report.

01:10Second, link the dashboard to a data source.

01:14This might come from BigQuery, a local file, or a Google application like Google Sheets
or Google Analytics–or a combination of any of these sources.

01:23And third, explore your dashboard!

Lab introduction: Creating a streaming data pipeline for a Real-Time dashboard with
Dataflow
Now it’s time for hands-on practice with some of the tools you learned about in this section of the
course. In the lab that follows this video, you build a streaming data pipeline to monitor sensor

00:10data, and then visualize the dataset through a dashboard. You’ll practice: Creating a
Dataflow job from a pre-existing template and subscribing to a Pub/Sub topic. Streaming and
monitoring a Dataflow pipeline into BigQuery

00:24Analyzing results with SQL, and Visualizing key metrics in Data Studio. Please note
that, though you will use some SQL commands in this lab, the lab doesn’t actually require
strong SQL knowledge.

00:37We’ll explore BigQuery in more detail later in this course. You’ll have multiple attempts
at each lab, so if you don’t complete it the first time, or if you want to experiment more with it
later on, you can

00:49return and start a new instance.


Lab 02

How to start your lab and sign in to the Console


1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you
to select your payment method. On the left is a panel populated with the temporary
credentials that you must use for this lab.

2. Copy the Username, and then click Open Google Console. The lab spins up
resources and then opens another tab that shows the Choose an account page.

Note: Open the tabs in separate windows, side by side.

3. On the Choose an account page, click Use Another Account. The Sign in page
opens.
4. Paste the Username that you copied from the Connection Details panel. Then
copy and paste the Password.

Note: You must use the credentials from the Connection Details panel. Do not use your Google Cloud
Skills Boost credentials. If you have your own Google Cloud account, do not use it for this lab (this
avoids incurring charges).

5. Click through the subsequent pages:

- Accept the terms and conditions.
- Do not add recovery options or two-factor authentication (because this is a
temporary account).
- Do not sign up for free trials.
After a few moments, the Cloud console opens in this tab.

Note: You can view the menu with a list of Google Cloud products and services by clicking the
Navigation menu at the top left.
Activate Google Cloud Shell

Google Cloud Shell is a virtual machine loaded with development tools. It offers a
persistent 5 GB home directory and runs on Google Cloud.
Google Cloud Shell provides command-line access to your Google Cloud resources.

1. In the Cloud console, in the top-right toolbar, click the Open Cloud Shell button.

2. Click Continue.

It takes a few moments to provision and connect to the environment. When you are
connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud
Shell and supports tab completion.

You can list the active account name with this command:

gcloud auth list

Output:

Credentialed accounts:
- <myaccount>@<mydomain>.com (active)

Example output:

Credentialed accounts:
- google1623327_student@qwiklabs.net

You can list the project ID with this command:

gcloud config list project

Output:

[core]
project = <project_ID>

Example output:

[core]
project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation for gcloud is available in the gcloud CLI overview guide.

Task 1. Create a public Pub/Sub topic and create a BigQuery dataset

In this task, you create the taxirides dataset. You have two different options for
creating it: using the Google Cloud Shell or the Google Cloud Console.

Pub/Sub is an asynchronous global messaging service. By decoupling senders and
receivers, it allows secure and highly available communication between
independently written applications. Pub/Sub delivers low-latency, durable messaging.
In Pub/Sub, publisher applications and subscriber applications connect with one
another through the use of a shared string called a topic. A publisher application
creates and sends messages to a topic. Subscriber applications create a
subscription to a topic to receive messages from it.
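
To make the publisher/subscriber relationship concrete, here is a minimal, hedged sketch using the gcloud CLI with hypothetical topic and subscription names (my-topic, my-sub). The lab itself uses a public topic that Google already maintains, so these commands are for experimentation only, not lab steps.

# Create a topic and a pull subscription attached to it (hypothetical names).
gcloud pubsub topics create my-topic
gcloud pubsub subscriptions create my-sub --topic=my-topic

# Publish a test message to the topic...
gcloud pubsub topics publish my-topic --message="hello from the publisher"

# ...and pull it from the subscription.
gcloud pubsub subscriptions pull my-sub --auto-ack --limit=1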

Google maintains a few public Pub/Sub streaming data topics for labs like this one.
We'll use the NYC Taxi & Limousine Commission's open dataset.
BigQuery is a serverless data warehouse. Tables in BigQuery are organized into
datasets. In this lab, the messages published to Pub/Sub will be aggregated and
stored in BigQuery.
Use one of the following options to create a new BigQuery dataset:
Option 1: The command-line tool

1. In Cloud Shell, run the following command to create the taxirides dataset.

bq --location=us-west1 mk taxirides

2. Run this command to create the taxirides.realtime table (an empty schema that
you will stream into later).

bq --location=us-west1 mk \
--time_partitioning_field timestamp \
--schema ride_id:string,point_idx:integer,latitude:float,longitude:float,\
timestamp:timestamp,meter_reading:float,meter_increment:float,ride_status:string,\
passenger_count:integer -t taxirides.realtime

Option 2: The BigQuery console UI

Note: Skip these steps if you created the table using the command line.

1. In the Google Cloud console, in the Navigation menu, click BigQuery.
2. If you see the Welcome dialog, click Done.
3. Click View actions next to your Project ID, and then click Create dataset.
4. For Dataset ID, type taxirides.
5. For Data location, click us-west1 (Oregon), and then click Create dataset.
6. In the Explorer pane, click expand node to reveal the new taxirides dataset.
7. Click View actions next to the taxirides dataset, and then click Open in current tab.
8. Click Create Table.
9. For Table, type realtime
10. For the schema, click Edit as text and paste the following:

ride_id:string,
point_idx:integer,
latitude:float,
longitude:float,
timestamp:timestamp,
meter_reading:float,
meter_increment:float,
ride_status:string,
passenger_count:integer

11. Under Partition and cluster settings, select timestamp.
12. Click Create Table.
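
Whichever option you choose, a quick, optional way to confirm that the dataset and table exist is with the bq tool; this is a hedged sketch and not part of the graded lab steps.

# List the tables in the taxirides dataset.
bq ls taxirides

# Print the schema of the (still empty) realtime table.
bq show --schema --format=prettyjson taxirides.realtime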

Task 2. Create a Cloud Storage bucket

In this task, you create a Cloud Storage bucket to provide working space for the
Dataflow pipeline.

Cloud Storage allows worldwide storage and retrieval of any amount of data at any
time. You can use Cloud Storage for a range of scenarios, including serving website
content, storing data for archival and disaster recovery, or distributing large data
objects to users via direct download.

1. In the Cloud console, in the Navigation menu, click Cloud Storage.
2. Click Create bucket.
3. For Name, copy and paste in the Project ID, and then click Continue.
4. For Location type, click Multi-region if it is not already selected.
5. Click Create.
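
If you prefer the command line, the console steps above correspond roughly to the following gsutil command; this is a hedged sketch (gcloud storage buckets create would also work) that uses the project ID as the bucket name, as the lab suggests, and relies on gsutil's default US multi-region location.

# Create a multi-region bucket named after the project.
gsutil mb gs://$(gcloud config get-value project)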
Task 3. Set up a Dataflow pipeline
In this task, you set up a streaming data pipeline to read sensor data from Pub/Sub,
compute the maximum temperature within a time window, and write it out to
BigQuery.

Dataflow is a serverless way to carry out data analysis.

Restart the connection to the Dataflow API.

1. In the Cloud console, in the top search bar, type Dataflow API and press ENTER.
2. In the search results window, click Dataflow API.
3. Click Manage.
4. Click Disable API.
5. In the Disable dialog, click Disable.
6. Click Enable.

Create a new streaming pipeline:

1. In the Cloud console, in the Navigation menu, click Dataflow.
2. In the top menu bar, click Create job from template.
3. Type streaming-taxi-pipeline as the Dataflow job name.
4. For Regional endpoint, select us-west1 (Oregon).
5. For Dataflow template, select the Pub/Sub Topic to BigQuery template.
6. For Input Pub/Sub topic, click Enter Topic Manually and type the following:

projects/pubsub-public-data/topics/taxirides-realtime

7. Click Save.
8. For BigQuery output table, type <myprojectid>:taxirides.realtime

Note: You must replace myprojectid with your Project ID. Note: there is a colon (:) between the
project and dataset name, and a dot (.) between the dataset and table name.

9. For Temporary location, click Browse.
10. Click view child resources.
11. Click Create new folder and type the name tmp.
12. Click Create, and then click Select.
13. Click Show optional parameters.
14. For Max workers, type 2
15. For Number of workers, type 2
16. For Worker region, select us-west1.
17. Click Run job.

A new streaming job has started! You can now see a visual representation of the
data pipeline.

Note: If the Dataflow job fails the first time, re-create a new job template with a new job name and
run the job again.
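
For reference, launching the same classic template from the command line would look roughly like the sketch below. This is a hedged example, not a lab step: it assumes the template path gs://dataflow-templates-us-west1/latest/PubSub_to_BigQuery and that BUCKET_NAME is the bucket you created in Task 2.

# Launch the Pub/Sub Topic to BigQuery classic template as a streaming job (hedged sketch).
PROJECT_ID=$(gcloud config get-value project)
gcloud dataflow jobs run streaming-taxi-pipeline \
  --region=us-west1 \
  --gcs-location=gs://dataflow-templates-us-west1/latest/PubSub_to_BigQuery \
  --staging-location=gs://BUCKET_NAME/tmp \
  --max-workers=2 \
  --num-workers=2 \
  --parameters=inputTopic=projects/pubsub-public-data/topics/taxirides-realtime,outputTableSpec=${PROJECT_ID}:taxirides.realtime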

Task 4. Analyze the taxi data using BigQuery

In this task, you analyze the data while it is streaming.

1. In the Cloud console, in the Navigation menu, click BigQuery.
2. If the Welcome dialog appears, click Done.
3. In the Query Editor, type the following and click Run:

SELECT * FROM taxirides.realtime LIMIT 10

Note: If no records are returned, wait another minute and re-run the query above
(Dataflow takes 3-5 minutes to set up the stream).
Your output will look similar to the following:

Task 5. Perform aggregations on the stream for reporting

In this task, you calculate aggregations on the stream for reporting.

1. In the Query Editor, clear the current query.
2. Copy and paste the following query, and then click Run.

WITH streaming_data AS (
SELECT
  timestamp,
  TIMESTAMP_TRUNC(timestamp, HOUR, 'UTC') AS hour,
  TIMESTAMP_TRUNC(timestamp, MINUTE, 'UTC') AS minute,
  TIMESTAMP_TRUNC(timestamp, SECOND, 'UTC') AS second,
  ride_id,
  latitude,
  longitude,
  meter_reading,
  ride_status,
  passenger_count
FROM
  taxirides.realtime
WHERE ride_status = 'dropoff'
ORDER BY timestamp DESC
LIMIT 1000
)
# calculate aggregations on stream for reporting:
SELECT
  ROW_NUMBER() OVER() AS dashboard_sort,
  minute,
  COUNT(DISTINCT ride_id) AS total_rides,
  SUM(meter_reading) AS total_revenue,
  SUM(passenger_count) AS total_passengers
FROM streaming_data
GROUP BY minute, timestamp

Note: Make sure the Dataflow job is registering data in BigQuery before proceeding to the next task.
The results show the key metrics per minute for every taxi dropoff.

3. Click Save > Save query.
4. In the Save query dialog, in the Name field, type My saved query.
5. Click Save.

Task 6. Stop the Dataflow job

In this task, you stop the Dataflow job to free up resources for your project.

1. In the Cloud console, in the Navigation menu, click Dataflow.
2. Click the streaming-taxi-pipeline job, or the new job name.
3. Click Stop, and then select Cancel > Stop Job.

Task 7. Create a real-time dashboard

In this task, you create a real-time dashboard to visualize the data.

1. In the Cloud console, in the Navigation menu, click BigQuery.
2. In the Explorer pane, expand your Project ID.
3. Expand Saved queries, and then click My saved query.

Your query is loaded into the query editor.

4. Click Run.
5. In BigQuery, click Explore data > Explore with Data Studio.

Google Data Studio opens.

6. In the Data Studio window, click the bar chart.

The Chart pane appears.

7. Click Chart, and then select Combo chart.

8. In the Setup pane, under Data range dimension, hover over minute (Date) and
click the X to remove it.
9. In the Data pane, click dashboard_sort and drag it to Setup > Data range
dimension > Add dimension.
10. Under Setup > Dimension, click total_revenue and select dashboard_sort.
11. Under Setup > Metric, click dashboard_sort and select total_rides.
12. Under Setup > Metric, click dashboard_sort and select total_passengers.
13. Under Setup > Metric, click Add metric and select total_revenue.
14. Under Setup > Sort, click total_passengers and select dashboard_sort.
15. Under Setup > Sort, click Ascending.
Your chart should look similar to this:

Note: Visualizing data at minute-level granularity is currently not supported in Data Studio as a
timestamp, which is why we created our own dashboard_sort dimension.

16. When you are satisfied with your dashboard, click Save and share to save this
data source.
17. If prompted to complete your account setup, agree to the terms and conditions
and click Continue.
18. If prompted for email preferences, answer no to all and click Continue.
19. If prompted with the Review data access before saving window, click
Acknowledge and save.
20. Click Add to report.
21. Whenever anyone visits your dashboard, it will be up to date with the latest
transactions. You can try it yourself by clicking More options and then
Refresh data.
Task 8. Create a time series dashboard
In this task, you create a time series chart.

1. Click this Google Data Studio link to open Data Studio in a new browser tab.
2. On the Reports page, in the Start with a Template section, click the [+] Blank
Report template.
3. A new, empty report opens with the Add data to report window.
4. From the list of Google Connectors, select the BigQuery tile.
5. Click Custom Query, and then select your ProjectID. It should appear in the
following format: qwiklabs-gcp-xxxxxxx.
6. In Enter Custom Query, paste the following query:

SELECT
  *
FROM
  taxirides.realtime
WHERE
  ride_status = 'dropoff'

7. Click Add > Add to report.

A new, untitled report appears. It might take up to a minute for the screen to
finish refreshing.

Create a time series chart

1. In the Data pane, click Add a field.
2. Click All fields in the left corner.
3. Change the field type for timestamp to Date & Time > Date Hour Minute
(YYYYMMDDhhmm).
4. In the change timestamp dialog, click Continue, and then click Done.
5. In the top menu, click Add a chart.
6. Choose Time series chart.
7. Position the chart in the bottom-left corner, in the blank space.
8. Under Setup > Dimension, click timestamp (Date) and select timestamp.
9. Under Setup > Dimension, click timestamp and select Calendar.
10. For Type, select Date & Time > Date Hour Minute.
11. Click outside the dialog to close it. You do not need to add a name.
12. Under Setup > Metric, click Record Count and select meter_reading.

Your time series chart should look similar to this:


Congratulations!
In this lab, you used Pub/Sub to collect streaming taxi data messages and feed them
through a Dataflow pipeline into BigQuery.


Congratulations on completing the lab on building a data pipeline for streaming data!

00:05Before you move on in the course, let's do a quick recap.

00:09In this section, you explored the streaming data workflow, from ingestion to
visualization.
00:15You started by learning about the four common big data challenges, or the 4Vs:
Volume (data size), Variety (data format), Velocity (data speed), and Veracity (data
accuracy).

00:27These challenges can be especially common when dealing with streaming data.

00:32You then learned how the streaming data workflow provided by Google can help
address these challenges.

00:39You started with Pub/Sub, which can be used to ingest a large volume of IoT data
from diverse sources in various formats.

00:48After that, you explored Dataflow, a serverless, NoOps service, to process the data.

00:54'Process' here refers to ETL (extract, transform, and load).

00:59And finally, you were introduced to two Google visualization tools, Looker and
Data Studio.

Congratulations! You passed this assessment.

1. Which Google Cloud product acts as an execution engine to process and implement data
processing pipelines?

Looker
Dataflow (correct)
Apache Beam
Data Studio

2. Which Google Cloud product is a distributed messaging service that is designed to ingest
messages from multiple device streams such as gaming events, IoT devices, and application
streams?

Pub/Sub (correct)
Apache Beam
Looker
Data Studio

3. Select the correct streaming data workflow.

Visualize the data, process the data, and ingest the streaming data.
Process the data, visualize the data, and ingest the data.
Ingest the streaming data, process the data, and visualize the results. (correct)
Ingest the streaming data, visualize the data, and process the data.

4. Due to several data types and sources, big data often has many data dimensions. This can
introduce data inconsistencies and uncertainties. Which type of challenge might this present to
data engineers?

Volume
Veracity (correct)
Variety
Velocity

5. When you build scalable and reliable pipelines, data often needs to be processed in near-real
time, as soon as it reaches the system. Which type of challenge might this present to data
engineers?

Volume
Veracity
Variety
Velocity (correct)

Links for further study


https://cloud.google.com/pubsub/docs/

https://cloud.google.com/dataflow/docs/

https://cloud.google.com/dataflow/docs/concepts/dataflow-templates

https://developers.google.com/datastudio/

https://datastudio.google.com/gallery

https://connect.looker.com/

https://cloud.google.com/looker/docs

Big Data with BigQuery


In the previous section of this course, you explored Dataflow and Pub/Sub, Google Cloud's solutions
for processing streaming data. Now let's turn your attention to BigQuery. You'll start by
exploring the two main services of BigQuery, storage and analytics, and

00:17then walk through a demo of the BigQuery user interface. After that, you'll see how
BigQuery ML provides a data-to-AI lifecycle all in one place. You'll also learn about the
phases of a BigQuery ML project, as well as the key commands.

00:32Finally, you'll practice using BigQuery ML to build a custom ML model. Let's get
started. BigQuery is a fully managed data warehouse. A data warehouse is a large store,
containing terabytes and petabytes of data gathered from

00:50a wide range of sources within an organization, which is used to guide management
decisions. At this point, it's helpful to consider the main difference between a data warehouse

01:00and a data lake. A data lake is just a pool of raw, unorganized, and unclassified data,
which has no specified purpose. A data warehouse, on the other hand, contains structured and
organized data, which can be

01:13used for advanced querying. Being fully managed means that BigQuery
takes care of the underlying infrastructure, so you can focus on using SQL queries
to answer business questions, without worrying about

01:25deployment, scalability, and security. Let's look at some of BigQuery's key features.
BigQuery provides two services in one: storage and analytics. It's a place to store petabytes
of data.

01:41For reference, 1 petabyte is equivalent to 11,000 movies at 4K quality. BigQuery
is also a place to analyze data, with built-in features like machine learning, geospatial
analysis, and business intelligence, which we'll look at later on.

01:59BigQuery is a fully managed serverless solution, which means that you use SQL
queries to answer your organization's biggest questions in the front end without worrying
about infrastructure in the back end. If you've never written SQL before, don't worry.

02:13This course provides resources and labs to help. BigQuery has a flexible
pay-as-you-go pricing model, where you pay for the number of bytes of data your query
processes and for any permanent table storage.

02:26If you prefer to have a fixed bill every month, you can also subscribe to flat-rate
pricing, where you have a reserved amount of resources for use. Data in BigQuery is
encrypted at rest by default, without any action required from the customer.

02:41By encryption at rest, we mean encryption used to protect data stored on a disk,
including solid-state drives, or backup media. BigQuery has built-in machine learning
features, so you can write ML models directly in BigQuery

02:55using SQL. Also, if you decide to use other professional tools, such as Google
Cloud's Vertex AI, to train your ML models, you can export datasets from BigQuery directly
into Vertex AI for

03:07a seamless integration across the data-to-AI lifecycle. So what does a typical data
warehouse solution architecture look like? The input data can be either real-time or batch
data. If you recall from the last module, when we discussed the four big data challenges,

03:26in modern organizations, data can be in any format (variety),
any size (volume), any speed (velocity), and possibly inaccurate (veracity). If it's streaming
data, which can be either structured or unstructured, high speed, and

03:45large volume, Pub/Sub is needed to digest the data. If it's batch data, it can be
uploaded directly to Cloud Storage. After that, both pipelines lead to Dataflow to process
the data.

03:59That's the place where we ETL -- extract, transform, and load -- the data, if needed.
BigQuery sits in the middle to link data processes using Dataflow and data access through
analytics,

04:11AI, and ML tools. The job of the BigQuery analytics engine at the end of a data
pipeline is to ingest all the processed data after ETL, store it, analyze it, and possibly output it
for further uses such as

04:25data visualization and machine learning. BigQuery outputs usually feed into two
buckets: business intelligence tools and AI/ML tools. If you're a business analyst or data
analyst, you can connect to visualization tools like

04:41Looker, Data Studio, Tableau, or other BI tools. If you prefer to work in spreadsheets,
you can query both small and large BigQuery datasets directly from Google Sheets and even
perform common operations like pivot tables.

04:58Alternatively, if you're a data scientist or machine learning engineer, you can call the
data directly from BigQuery through AutoML or Workbench. These AI/ML tools are part of
Vertex AI, Google's unified ML platform.

05:14BigQuery is like a common staging area for data analytics workloads. When your
data is there, business analysts, BI developers, data scientists, and machine learning
engineers can be granted access to your data for their own insights.

05:28BigQuery provides two services in one. It's a fully managed storage facility for
loading and storing datasets, and also a fast SQL-based analytical engine. The two services
are connected by Google's high-speed internal network.

05:32It's this super-fast network that allows BigQuery to scale both storage and compute
independently, based on demand. Let's look at how BigQuery manages the storage and
metadata for datasets.

05:35BigQuery can ingest datasets from a variety of different sources, including internal
data, which is data saved directly in BigQuery, external data, multi-cloud data, and public
datasets. After the data is stored in BigQuery, it's fully managed by BigQuery and is
automatically

05:39replicated, backed up, and set up to autoscale. BigQuery also offers the option of
querying external data sources, such as data stored in other Google Cloud storage services
like Cloud Storage, or in other Google Cloud database services,

05:42such as Spanner or Cloud SQL, and bypassing BigQuery managed storage. This
means that a raw CSV file in Cloud Storage or a Google Sheet can be used to write a query
without being ingested by BigQuery first.

05:45One thing to note here: inconsistency might result from saving and processing data
separately. To avoid that risk, consider using Dataflow to build a streaming data pipeline into
BigQuery. In addition to internal/native and external data sources, BigQuery can also ingest
data

05:48from: Multi-cloud data, which is data stored in multiple cloud services, such as AWS
or Azure. A public dataset. If you don't have data of your own, you can analyze any of the
datasets available in the

05:52public dataset marketplace. There are three basic patterns for loading data into
BigQuery. The first is a batch load, where source data is loaded into a BigQuery table in a
single

05:55batch operation. This can be a one-time operation, or it can be automated to occur
on a schedule. A batch load operation can create a new table or append data into an existing
table.

05:58The second is streaming, where smaller batches of data are streamed continuously
so that the data is available for querying in near-real time. And the third is generated data,
where SQL statements are used to insert rows into an

06:01existing table or to write the results of a query to a table. Of course, the purpose of
BigQuery is not just to save data; it's to analyze data and help make business decisions.
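
As a hedged illustration of the first and third loading patterns (batch load and generated data), the commands below assume a hypothetical CSV file at gs://my-bucket/rides.csv and reuse the taxirides dataset from the lab; they are examples only, not lab steps.

# Batch load: append a CSV file from Cloud Storage into a BigQuery table (hypothetical bucket/file).
bq load --source_format=CSV --autodetect taxirides.batch_rides gs://my-bucket/rides.csv

# Generated data: use a SQL statement to write query results into a new table.
bq query --use_legacy_sql=false \
  'CREATE TABLE taxirides.dropoffs AS
   SELECT * FROM taxirides.realtime WHERE ride_status = "dropoff"'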

06:04BigQuery is optimized for running analytic queries over large datasets. It can
perform queries on terabytes of data in seconds and on petabytes in minutes. This
performance lets you analyze large datasets efficiently and get insights in near real

06:08time. Let's look at the analytics features that are available in BigQuery. BigQuery
supports: Ad hoc analysis using Standard SQL, the BigQuery SQL dialect. Geospatial
analytics using geography data types and Standard SQL geography functions.

06:13Building machine learning models using BigQuery ML. Building rich, interactive
business intelligence dashboards using BigQuery BI Engine. By default, BigQuery runs
interactive queries, which means that queries are executed

06:16as needed. BigQuery also offers batch queries, where each query is queued on your
behalf and the query starts when idle resources are available, usually within a few minutes.
Next, you'll see a demo in BigQuery.

06:20Note that you might notice a slightly different user interface.
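
To make the geospatial analytics mentioned above concrete, here is a hedged Standard SQL sketch that reuses the lab's taxirides.realtime table; the reference point coordinates are arbitrary and the query is purely illustrative.

# How far (in meters) was each dropoff from an arbitrary reference point in Manhattan?
SELECT
  ride_id,
  ST_DISTANCE(
    ST_GEOGPOINT(longitude, latitude),   # dropoff location as a GEOGRAPHY point
    ST_GEOGPOINT(-73.9857, 40.7484)      # arbitrary reference point (illustrative)
  ) AS meters_from_reference
FROM taxirides.realtime
WHERE ride_status = 'dropoff'
LIMIT 10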
Storage and analytics
In the previous section of this course, you explored Dataflow and Pub/Sub, Google Cloud's solutions
for processing streaming data. Now let's turn your attention to BigQuery. You'll start by
exploring the two main services of BigQuery, storage and analytics, and

00:12then walk through a demo of the BigQuery user interface. After that, you'll see how
BigQuery ML provides a data-to-AI lifecycle all in one place. You'll also learn about the
phases of a BigQuery ML project, as well as the key commands.

00:20Finally, you'll practice using BigQuery ML to build a custom ML model. Let's get
started. BigQuery is a fully managed data warehouse. A data warehouse is a large store,
containing terabytes and petabytes of data gathered from

00:37a wide range of sources within an organization, which is used to guide management
decisions. Being fully managed means that BigQuery takes care of the underlying
infrastructure, so you can focus on using SQL queries to answer business questions, without
worrying about

00:53deployment, scalability, and security. Let's look at some of BigQuery's key features.
BigQuery provides two services in one: storage and analytics. It's a place to store petabytes
of data.

01:08For reference, 1 petabyte is equivalent to 11,000 movies at 4K quality. BigQuery
is also a place to analyze data, with built-in features like machine learning, geospatial
analysis, and business intelligence, which we'll look at later on.

01:22BigQuery is a fully managed serverless solution, which means that you use SQL
queries to answer your organization's biggest questions in the front end without worrying
about infrastructure in the back end. If you've never written SQL before, don't worry.

01:36This course provides resources and labs to help. BigQuery has a flexible
pay-as-you-go pricing model, where you pay for the number of bytes of data your query
processes and for any permanent table storage.
01:47If you prefer to have a fixed bill every month, you can also subscribe to flat-rate
pricing, where you have a reserved amount of resources for use. Data in BigQuery is
encrypted at rest by default, without any action required from the customer.

02:03By encryption at rest, we mean encryption used to protect data stored on a disk,
including solid-state drives, or backup media. BigQuery has built-in machine learning
features, so you can write ML models directly in BigQuery

02:16using SQL. Also, if you decide to use other professional tools, such as Google
Cloud's Vertex AI, to train your ML models, you can export datasets from BigQuery directly
into Vertex AI for

02:25a seamless integration across the data-to-AI lifecycle. So what does a typical data
warehouse solution architecture look like? The input data can be either real-time or batch
data. If you recall from the last module, when we discussed the four big data challenges,

02:37in modern organizations, data can be in any format (variety),
any size (volume), any speed (velocity), and possibly inaccurate (veracity). If it's streaming
data, which can be either structured or unstructured, high speed, and

02:49large volume, Pub/Sub is needed to digest the data. If it's batch data, it can be
uploaded directly to Cloud Storage. After that, both pipelines lead to Dataflow to process
the data.

02:58That's the place where we ETL -- extract, transform, and load -- the data, if needed.
BigQuery sits in the middle to link data processes using Dataflow and data access through
analytics,

03:05AI, and ML tools. The job of the BigQuery analytics engine at the end of a data
pipeline is to ingest all the processed data after ETL, store it, analyze it, and possibly output it
for further uses such as

03:13data visualization and machine learning. BigQuery outputs usually feed into two
buckets: business intelligence tools and AI/ML tools. If you're a business analyst or data
analyst, you can connect to visualization tools like

03:21Looker, Data Studio, Tableau, or other BI tools. If you prefer to work in spreadsheets,
you can query both small and large BigQuery datasets directly from Google Sheets and even
perform common operations like pivot tables.

03:34Alternatively, if you're a data scientist or machine learning engineer, you can call the
data directly from BigQuery through AutoML or Workbench. These AI/ML tools are part of
Vertex AI, Google's unified ML platform.
03:43BigQuery is like a common staging area for data analytics workloads. When your
data is there, business analysts, BI developers, data scientists, and machine learning
engineers can be granted access to your data for their own insights.

BigQuery demo - San Francisco bike share

As any data analyst will tell you, exploring your dataset with SQL is often one of the first steps you
take to uncover those hidden insights. So what is this query actually doing?

00:10Here we're counting the total number of trips taken in a public dataset of bike share
trips in San Francisco. Now let's explore that public dataset in more detail with a quick

00:26demo. Now it's time to dust off your SQL, or structured query language, skills as we
explore some public datasets inside Google BigQuery, so let's dive right in. So in this demo,
how did we get this BigQuery public dataset, which is what

00:42we were looking at before, the San Francisco bike share trips? It comes from this
publicly available dataset, bigquery-public-data. Now, if you don't have that in your Qwiklabs
account or

00:52your own personal account, the way you can get these datasets is by clicking Add
data, Explore public datasets, and choosing one of them. And that brings you back into the
console.

01:02So let me show you some neat things I've picked up inside BigQuery over the years.
So now we have a ton of great public datasets that you can experiment with.

01:10Once you find one, you can do a lot just by knowing the table name. So we're not
going to write any SQL just yet. A quick note on the table name: if it has hyphens,

01:20like the bigquery-public-data project name does, that's when you need those
backticks, and technically you only need them around the project name.

01:28But that's where you see those backticks, that character, come into play. If you don't
have hyphens anywhere, you don't really need them; that's the first tip for you.

01:37Second tip, one of my absolute favorites: if you hold down, at least on this Mac here,
the Command key, or the Windows key, it will highlight all the datasets in your query.
01:47Why is that useful? Because once it's highlighted you can click on it. So if you have
several different datasets, you can just explore them.

01:52Say you inherited a query from someone else. You can quickly get the schema details
and the preview of that particular table. So let's try to get this far without typing a single
keystroke here.

02:06So take a look at the schema. This is San Francisco bike share trips; you have things
like the trip ID, everything we could want: the duration in seconds, when it started,

02:15what the stations are. And that's where your mind starts to wander to some of the
interesting insights you could pull from the geographic data. We have starting latitude and
longitude and ending latitude and longitude.

02:26Maybe some of those cool GIS functions. And let's take a look at how many rows we
have. We don't need to do a COUNT(*) or anything like that; we have almost two million rows
and about 375 MB there.

02:39And as a best practice, you don't want to do a SELECT * with a LIMIT 10 on a table
when just looking at the preview would be enough. So you've seen the schema and the data
types.

02:47Now you can actually go in here, look at the total rows, and see sample data values.
So you can see the stations where riders stopped and started, everything that will get you
familiar with the data.

03:01If you want to query the table, you can click Query Table. It will warn you, hey, this will
replace everything in the query editor. Sure, that's fine, and

03:08it will stub out a SELECT of some columns from our bike share trips. Here's a neat
tip I only recently discovered, and props to the BigQuery team. If I just wanted to see, say, the
total number

03:23of trips that started from each station name, I can just click on the field names,
which is especially handy if you have very long field names, or if you don't want to mistype
things like I do in a live demo; then

03:37you can simply click those field names, and it will add them, and it even adds the
commas for you, which is pretty cool. And then, inside BigQuery, you can format the query
again;

03:47even if you're just experimenting with SQL, it's still fun to try to do as much as you
can without typing anything in the editor. Let's run this query.

03:55It doesn't do much for us because it only returns the station name. It's still at the
granularity of the individual trip. So what's one of the things you can do inside BigQuery?
04:05If you're looking at that trip ID, you can say: I want an aggregate function, like a count
of all the trips, not just a list of the rows. You could do a COUNT(*).

04:16Some people use COUNT(1) or something like that for readability. I'll keep the count
of the trip ID, just so someone else who inherits my code can quickly see what the level of
granularity of this table is.

04:27And let's call that the number of trips. Now, naturally, you'll immediately get an error.
If you've worked with SQL long enough, you know that as soon as you aggregate one field,

04:38all the rest of your fields had better be aggregated too. But if it's late at night, you can
just open the validator. You can see: hey, this references the start station name;

04:49this other field here is aggregated, but this one isn't. So naturally what you want to do
is make sure you add the GROUP BY. And that's how we can see, all right, we're rolling up all
the trips into

05:04a single value. Let's group them by each of the different stations. Now, if you
remember your SQL, what's one of the things you can do to get the highest value first?

05:12You can do an ORDER BY, which sorts the results, and you can order by the alias, as
we see here. Note that you can't actually filter in a WHERE clause by an aliased field,

05:24because it doesn't exist yet when the query engine goes out and executes the
WHERE clause. So keep that in mind; that's where you can use things like temporary tables.

05:33So, order by the highest number of trips first. Which is, let's say, the top 10 stations or
something like that. Let's make sure we format it. And if you really want bonus points, you
can add a comment at the top.

05:43Something like "top 10 stations by volume," and go ahead and run that. And let's look
at the most popular stations. And we have [laughs] San Francisco Caltrain.

05:56I can definitely vouch for that: as soon as you get off the train, you need to get
somewhere in San Francisco, and it has 72,000 trips. Now, if you want to experiment a bit
more,

06:05what you can do is add a filter and just say, hey, I'm only looking at these trips for
2018. There's another field we can go back for because, honestly,

06:15I've already forgotten the field name; I press that button and then I'll add the start
date. Let's take a look at that start date. The start date is a timestamp.

06:30So let's see, we can say the start date is after, what did we say? 2018, so we can use
the end of 2017, 12/31. Of course, after all that you can cast it if you want.

06:57Hopefully that happens automatically, since there are a lot of date functions and
extractions you can do. But I think it was around 20,000 before, or something like that.

07:0870,000, I think, for the Caltrain. Let's see if Caltrain is still number one. Look at that: in
the last year Caltrain was actually dethroned; it's the Ferry Building, a super popular tourist
spot if you haven't been to San Francisco.

07:19And it takes the highest number of trips just for 2018. If you're doing a lot of
aggregation inside BigQuery and then filtering, it's fun for these kinds of insights, but an
easier way to do it

07:30sometimes is just exporting the data, or linking to it directly from a front-end
visualization tool like Data Studio. Then you can actually say: I don't really want to limit the
data here.

07:43I want to throw all of it into a visualization tool, and then you can simply provide a
filter for the users who actually consume the results. If I want to store this common query, I
can do a few different things.

07:54I could save the query as my own query inside my project, just for me or for everyone
who can see it, or I can save it as a view.

08:04But what I like to do, since I generally version control all my code, is actually write a
bit of data definition language, SQL DDL. That puts the CREATE statement inside the actual
code itself.

08:22So if people are wondering, hey, where on earth did you get this "top 2018"? Actually,
it's not just 2018, it's anything after 2017 as well, for the people yelling at their screens like,
hey, that's not just 2018.

08:34So I do a CREATE OR REPLACE TABLE and we can just give it a table name. Let's just
call it... we need a dataset. Do we even have a dataset yet?

08:45I don't think so. We need to have a dataset first before just dumping things in here.
We'll call it bike insights or something like that. A dataset is just a collection.

08:54So now we have an empty dataset; there are no tables in there yet, so we can start
filling one in. We have bike insights, so CREATE OR REPLACE TABLE in that dataset, and

09:08we'll call it top trips of 2018 and beyond. Try to keep your table names a bit more
succinct than mine. And then, once you've created it, you should automatically see it show
up.
09:24Boom, there you have it, and now the query isn't going to be re-run every time. Now
you might be wondering: what happens if the BigQuery public data source gets updated

09:34after this? That's an excellent point. Because your table just dumped all that data in
here, it's going to be static now. So one of the things you can do, if you want to link this to

09:46your dashboard, is, instead of creating a table, simply create a view, which is a logical
view. What that means is that every time... And let's just call this one a view, since it says the
object already exists.

10:03You might notice that the icon changes. And you can click here and look inside. Can
you actually preview the data inside the view?

10:13That preview is gone. Why is that? Because the view is just an empty object; a view is
essentially a logical view, which in this flavor of SQL is just a stored query.

10:25So if you try to query the view, we can clear the editor, and it actually runs the query
we stored a little earlier.
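
The DDL being described can be sketched roughly as follows. This is a hedged reconstruction, not the demo's exact statements: the bike_insights dataset name and table names come from the narration, the public table path and column names (trip_id, start_station_name, start_date) are assumptions about the San Francisco bike share public dataset, and they may differ in your project.

# Materialize the result as a table (a static snapshot of the query output)...
CREATE OR REPLACE TABLE bike_insights.top_trips_2018_and_beyond AS
SELECT
  start_station_name,
  COUNT(trip_id) AS num_trips
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
WHERE start_date > TIMESTAMP '2017-12-31'
GROUP BY start_station_name
ORDER BY num_trips DESC
LIMIT 10;

# ...or save the same query as a logical view, which re-runs the stored query on every read.
CREATE OR REPLACE VIEW bike_insights.top_trips_2018_and_beyond_view AS
SELECT
  start_station_name,
  COUNT(trip_id) AS num_trips
FROM `bigquery-public-data.san_francisco_bikeshare.bikeshare_trips`
WHERE start_date > TIMESTAMP '2017-12-31'
GROUP BY start_station_name
ORDER BY num_trips DESC
LIMIT 10;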

10:36So, as a high-level recap, you saw datasets, and you saw public datasets. You can
explore them at your leisure, along with some of the cool tips and tricks you saw, if you want
to really dig in

10:46and edit that view later; you can scroll down, bring that result back up here, look at
that schema, and keep exploring as you see fit. This is just the tip of the iceberg when it
comes to data analysis.

11:00We just did a simple count of trips by the most popular station names, but there are
also weather datasets. So you could do a fun analysis to see -- we have a bike share dataset
for

11:09New York or San Francisco -- where weather affects riders the most, or how the bike
share program is doing as a whole when it comes to seasonality.

11:18Are there many more, or fewer, riders in some of the winter months? You can check it
all out for yourself, and you can do it all with SQL.

11:26And you don't have to worry about building any infrastructure behind the scenes; all
of that is managed for you inside BigQuery. All right, that's a wrap.

Introduction to BigQuery ML
Although BigQuery started out solely as a data warehouse, over time it has evolved to provide
features that support the data-to-AI lifecycle. In this section of the course, we’ll explore BigQuery’s
capabilities for building machine

00:14learning models and the ML project phases, and walk you through the key ML
commands in SQL. If you’ve worked with ML models before, you know that building and
training them can

00:25be very time-intensive. You must first export data from your datastore into an IDE
(integrated development environment) such as Jupyter Notebook or Google Colab and then
transform the data and perform all your

00:38feature engineering steps before you can feed it into a training model. Then finally, you
need to build the model in TensorFlow, or a similar library, and train it locally on your
computer or on a virtual machine.

00:52To improve the model performance, you also need to go back and forth to get more
data and create new features. This process will need to be repeated, but it’s so time-intensive
that you’ll probably

01:02stop after a few iterations. Also, we just mentioned TensorFlow and feature
engineering; in the past, if you weren’t familiar with these technologies, ML was left to the
data scientists on your

01:14team and was not available to you. Now you can create and execute machine learning
models on your structured datasets in BigQuery in just a few minutes using SQL queries.

01:26There are two steps needed to start: Step 1: Create a model with a SQL statement.
Here we use the bikeshare dataset as an example. Step 2: Write a SQL prediction query and
invoke ml.Predict.

01:41And that’s it! You now have a model and can view the results. Additional steps might
include activities like evaluating the model, but if you know basic SQL, you can now
implement ML; pretty cool!
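
As a hedged illustration of those two steps, the sketch below assumes a hypothetical table mydataset.bikeshare_trips with a numeric duration_minutes column to predict, two feature columns, and a hypothetical mydataset.new_trips table to score; it is not the exact query used in the course.

# Step 1: create (train) a model with a single SQL statement.
CREATE OR REPLACE MODEL `mydataset.trip_duration_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['duration_minutes']) AS
SELECT
  start_station_name,
  trip_hour,
  duration_minutes
FROM `mydataset.bikeshare_trips`;

# Step 2: write a prediction query that invokes ML.PREDICT on the trained model.
SELECT *
FROM ML.PREDICT(
  MODEL `mydataset.trip_duration_model`,
  (SELECT start_station_name, trip_hour FROM `mydataset.new_trips`));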

01:56BigQuery ML was designed to be simple, like building a model in two steps. That
simplicity extends to defining the machine learning hyperparameters, which let you tune the
model to achieve the best training result.

02:10Hyperparameters are the settings applied to a model before the training starts, like the
learning rate. With BigQuery ML, you can either manually control the hyperparameters or
hand it to

02:20BigQuery starting with a default hyperparameter setting and then automatic tuning.
When using a structured dataset in BigQuery ML, you need to choose the appropriate model
type. Choosing which type of ML model depends on your business goal and the datasets.

02:37BigQuery supports supervised models and unsupervised models. Supervised models


are task-driven and identify a goal. Alternatively, unsupervised models are data-driven and
identify a pattern. Within a supervised model, if your goal is to classify data, like whether an
email is
02:55spam, use logistic regression. If your goal is to predict a number, like shoe sales for
the next three months, use linear regression. Within an unsupervised model, if your goal is to
identify patterns or clusters and then

03:11determine the best way to group them, like grouping random pictures of flowers into
categories, you should use cluster analysis. Once you have your problem outlined, it’s time to
decide on the best model.
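
For the unsupervised case just described, a clustering model follows the same pattern; this is a hedged sketch with a hypothetical training table and a cluster count of 4 chosen arbitrarily.

# Unsupervised k-means clustering: no label column, just the features to group on.
CREATE OR REPLACE MODEL `mydataset.customer_segments`
OPTIONS (model_type = 'kmeans', num_clusters = 4) AS
SELECT
  lifetime_pageviews,
  total_visits,
  avg_session_minutes
FROM `mydataset.customer_training_data`;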

03:23Categories include classification and regression models. There are also other model
options to choose from, along with ML ops. Logistic regression is an example of a
classification model, and linear regression is an example

03:37of a regression model. We recommend that you start with these options, and use the
results to benchmark to compare against more complex models such as DNN (deep neural
networks), which may take more time

03:49and computing resources to train and deploy. In addition to providing different types
of machine learning models, BigQuery ML supports features to deploy, monitor, and manage
ML models in production, called ML Ops, which is short

04:03for machine learning operations. Options include: Importing TensorFlow models for
batch prediction Exporting models from BigQuery ML for online prediction And
hyperparameter tuning using Cloud AI Vizier We discuss ML Ops in more detail later in this
course.
Using BigQuery ML to predict customer lifetime value
Now that you’re familiar with the types of ML models available to choose from, high-quality data
must be used to teach the models what they need to learn. The best way to learn the key concepts of
machine learning on structured datasets is

00:13through an example. In this scenario, we’ll predict customer lifetime value with a
model. Lifetime value, or LTV, is a common metric in marketing… …used to estimate how
much revenue or profit you can expect from a customer given their

00:29history and customers with similar patterns. We’ll use a Google Analytics ecommerce
dataset from Google’s own merchandise store that sells branded items like t-shirts and
jackets. The goal is to identify high-value customers and bring them to our store with special
promotions

00:46and incentives. Having explored the available fields, you may find some useful in
determining whether a customer is high value based on their behavior on our website. These
fields include: customer lifetime pageviews,

01:01total visits, average time spent on the site, total revenue brought in, and ecommerce
transactions on the site. Remember that in machine learning, we feed in columns of data and
let the model figure
01:15out the relationship to best predict the label. It may even turn out that some of the
columns weren’t useful at all to the model in predicting the outcome -- we'll see later how to
determine this.

01:26Now that we have some data, we can prepare to feed it into the model. Incidentally, to
keep this example simple, we’re only using seven records, but we’d need tens of thousands
of records to train a model effectively.

01:40Before we feed the data into the model, we first need to define our data and columns
in the language that data scientists and other ML professionals use. Using the Google
Merchandise Store example, a record or row in the dataset is called an

01:55example, an observation, or an instance. A label is a correct answer, and you know it’s
correct because it comes from historical data. This is what you need to train the model on in
order to predict future data.

02:09Depending on what you want to predict, a label can be either a numeric variable, which
requires a linear regression model, or a categorical variable, which requires a logistic
regression model.

02:22For example, if we know that a customer who has made transactions in the past and
spends a lot of time on our website often turns out to have high lifetime revenue, we could
use

02:32revenue as the label and predict the same for newer customers with that same
spending trajectory. This means forecasting a number, so we can use a linear regression as
a starting point to model.

02:45Labels could also be categorical variables like binary values, such as High Value
Customer or not. To predict a categorical variable, if you recall from the previous section, you
need

02:57to use a logistic regression model. Knowing what you’re trying to predict, such as a
class or a number, will greatly influence the type of model you’ll use. But what do we call all
the other data columns in the data table?

03:12Those columns are called features, or at least potential features. Each column of data
is like a cooking ingredient you can use from the kitchen pantry. But keep in mind that using
too many ingredients can ruin a dish!

03:24The process of sifting through data can be time consuming. Understanding the quality
of the data in each column and working with teams to get more features or more history is
often the hardest part of any ML project.

03:38You can even combine or transform feature columns in a process called feature
engineering. If you’ve ever created calculated fields in SQL, you’ve already executed the
basics of feature engineering.

03:49Also, BigQuery ML does much of the hard work for you, like automatically one-hot
encoding categorical values. One-hot encoding is a method of converting categorical data to
numeric data to prepare
04:03it for model training. From there, BigQuery ML automatically splits the dataset into
training data and evaluation data. And finally, there is predicting on future data. Let’s say new
data comes in that you don’t have a label for, so you

04:20don’t know whether it is for a high-value customer. You do, however, have a rich
history of labeled examples for you to train a model on. So if we train a model on the known
historical data, and are happy with the performance,

04:32then we can use it to predict on future datasets!

BigQuery ML project phases


Let’s walk through the key phases of a machine learning project. In phase 1, you extract, transform,
and load data into BigQuery, if it isn’t there already. If you’re already using other Google products,
like YouTube for example, look out for easy

00:15connectors to get that data into BigQuery before you build your own pipeline. You can
enrich your existing data warehouse with other data sources by using SQL joins. In phase 2,
you select and preprocess features.

00:31You can use SQL to create the training dataset for the model to learn from. You’ll
recall that BigQuery ML does some of the preprocessing for you, like one-hot encoding of
your categorical variables.

00:44One-hot encoding converts your categorical data into numeric data that is required by
a training model. In phase 3, you create the model inside BigQuery. This is done by using the
“CREATE MODEL” command.

00:58Give it a name, specify the model type, and pass in a SQL query with your training
dataset. From there, you can run the query. In phase 4, after your model is trained, you can
execute an ML.EVALUATE query to evaluate

01:15the performance of the trained model on your evaluation dataset. It’s here that you
can analyze loss metrics like a Root Mean Squared Error for forecasting models and area-
under-the-curve, accuracy, precision, and recall, for classification

01:31models. We’ll explore these metrics later in this course. In phase 5, the final phase,
when you’re happy with your model performance, you can then use it to make predictions.

01:43To do so, invoke the ml.PREDICT command on your newly trained model to return predictions and the model’s confidence in those predictions. In the results, your label field will have “predicted” added to the field name.

01:58This is your model’s prediction for that label.
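To illustrate phases 3 through 5, the statements below show their general shape. This is only a sketch with hypothetical dataset, model, table, and column names; the lab later in this section runs the real commands against the ecommerce dataset.

-- Phase 3: create and train a model (hypothetical names throughout).
CREATE OR REPLACE MODEL `mydataset.sample_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['label_column']) AS
SELECT label_column, feature_1, feature_2
FROM `mydataset.training_table`;

-- Phase 4: evaluate the trained model on its evaluation data.
SELECT * FROM ML.EVALUATE(MODEL `mydataset.sample_model`);

-- Phase 5: predict on new data that has no label yet.
SELECT * FROM ML.PREDICT(MODEL `mydataset.sample_model`,
  (SELECT feature_1, feature_2 FROM `mydataset.new_data`));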

Now it’s time to get some hands-on practice
building a machine learning model in BigQuery.
02:47In the lab that follows this video, you’ll use ecommerce data from the Google
Merchandise Store website: https://shop.googlemerchandisestore.com/ The site’s visitor
and order data has been loaded into BigQuery, and you’ll build a machine learning model to

02:50predict whether a visitor will return for more purchases later. You’ll get practice:
Loading data into BigQuery from a public dataset. Querying and exploring the ecommerce
dataset. Creating a training and evaluation dataset to be used for batch prediction.

02:55Creating a classification (logistic regression) model in BQML. Evaluating the


performance of your machine learning model. And predicting and ranking the probability that a visitor will make a purchase.
BigQuery ML key commands

Now that you’re familiar with the key phases of an ML project, let’s explore some of the key
commands of BigQuery ML. You’ll remember from an earlier video that you can create a model with
just the CREATE

00:18MODEL command. Models have OPTIONS, which you can specify. The most
important, and the only one required, is the model type. If you want to overwrite an existing
model, use the CREATE OR REPLACE MODEL command.

00:29You can inspect what a model learned with the ML.WEIGHTS command and filtering
on an input column. The output of ML.WEIGHTS is a numerical value, and each feature has a
weight from -1 to 1.

00:42That value indicates how important the feature is for predicting the result, or label. If
the number is closer to 0, the feature isn't important for the prediction. However, if the
number is closer to -1 or 1, then the feature is more important for

00:57predicting the result. To evaluate the model's performance, you can run an
ML.EVALUATE command against a trained model. You get different performance metrics
depending on the model type you chose.

01:12And if you want to make batch predictions, you can use the ML.PREDICT command on
a trained model, and pass through the dataset you want to make the prediction on.

01:23Now let’s explore a consolidated list of BigQuery ML commands for supervised


models. First in BigQuery ML, you need a field in your training dataset titled LABEL, or you
need to specify which field or fields are your labels using the input_label_cols in

01:39your model OPTIONS. Second, your model features are the data columns that are part
of your SELECT statement after your CREATE MODEL statement. After a model is trained,
you can use the ML.FEATURE_INFO command to get statistics

01:55and metrics about that column for additional analysis. Next is the model object itself.
This is an object created in BigQuery that resides in your BigQuery dataset. You train many
different models, which will all be objects stored under your BigQuery

02:12dataset, much like your tables and views. Model objects can display information for
when it was last updated or how many training runs it completed. Creating a new model is as
easy as writing CREATE MODEL, choosing a type, and passing

02:28in a training dataset. Again, if you’re predicting on a numeric field, such as next year's
sales, consider linear regression for forecasting. If it's a discrete class like high, medium, low,
or spam/not-spam, consider using logistic

02:45regression for classification. While the model is running, and even after it’s complete,
you can view training progress with ML.TRAINING_INFO. As mentioned earlier, you can
inspect weights to see what the model learned about the importance
03:01of each feature as it relates to the label you’re predicting. The importance is indicated
by the weight of each feature. You can see how well the model performed against its
evaluation dataset by using ML.EVALUATE.

03:14And lastly, getting predictions is as simple as writing ML.PREDICT and referencing


your model name and prediction dataset.
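As a quick reference, the inspection commands described above all follow the same calling pattern; the model name below is a hypothetical placeholder:

-- Statistics about each input feature of a trained model.
SELECT * FROM ML.FEATURE_INFO(MODEL `mydataset.sample_model`);

-- Training progress, such as iterations and loss.
SELECT * FROM ML.TRAINING_INFO(MODEL `mydataset.sample_model`);

-- The learned weight for each feature.
SELECT * FROM ML.WEIGHTS(MODEL `mydataset.sample_model`);

-- Evaluation metrics for the trained model.
SELECT * FROM ML.EVALUATE(MODEL `mydataset.sample_model`);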
Lab

Predict Visitor Purchases with a Classification Model in BigQuery ML
1 hour 20 minutes · Free

Overview
BigQuery ML (BigQuery machine learning) is a feature in BigQuery where data analysts can create, train, evaluate, and predict with machine learning models with minimal coding.
The Google Analytics sample ecommerce dataset has millions of Google Analytics records for the Google Merchandise Store loaded into BigQuery. In this lab, you will use this data to run some typical queries that businesses would want to know about their customers' purchasing habits.

Objectives

In this lab, you learn to perform the following tasks:

 Use BigQuery to find public datasets
 Query and explore the ecommerce dataset
 Create a training and evaluation dataset to be used for batch prediction
 Create a classification (logistic regression) model in BigQuery ML
 Evaluate the performance of your machine learning model
 Predict and rank the probability that a visitor will make a purchase
Set up your environment

Lab setup

For each lab, you get a new GCP project and a set of resources for a fixed time at no cost.

1. Make sure you signed in to Qwiklabs using an incognito window.

2. Note the lab's access time and make sure you can finish within that time block.

There is no pause feature. You can restart if needed, but you have to start at the beginning.

3. When you're ready, click the button to start the lab.

4. Note your lab credentials. You will use them to sign in to the Cloud Platform Console.

5. Click Open Google Console.

6. Click Use another account and copy/paste the credentials for this lab into the prompts.

If you use other credentials, you'll get errors or incur charges.

7. Accept the terms and skip the recovery resource page.

Do not click End Lab unless you have finished the lab or want to restart it. This clears your work and removes the project.

Open the BigQuery console

1. In the Google Cloud Console, select Navigation menu > BigQuery.

The "Welcome to BigQuery in the Cloud Console" message box opens. This message box provides a link to the quickstart guide and lists UI updates.

2. Click Done.

Access the course dataset

Once BigQuery is open, open the data-to-insights project in a new browser tab to bring this project into your BigQuery projects panel.
The field definitions for the data-to-insights ecommerce dataset are on the [UA] BigQuery Export schema page. Keep the link open in a new tab for reference.

Task 1. Explore ecommerce data

Scenario: Your data analyst team exported the Google Analytics logs for an ecommerce website into BigQuery and created a new table of all the raw ecommerce visitor session data for you to explore. Using this data, you'll try to answer a few questions.
Question: Out of the total visitors who visited our website, what % made a purchase?

1. Click the query EDITOR.
2. Add the following to the New Query field:

#standardSQL
WITH visitors AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_visitors
FROM `data-to-insights.ecommerce.web_analytics`
),
purchasers AS(
SELECT
COUNT(DISTINCT fullVisitorId) AS total_purchasers
FROM `data-to-insights.ecommerce.web_analytics`
WHERE totals.transactions IS NOT NULL
)
SELECT
total_visitors,
total_purchasers,
total_purchasers / total_visitors AS conversion_rate

FROM visitors, purchasers

3. Click Run.
Result: 2.69%

Question: What are the top 5 selling products?

4. Add the following query in the query EDITOR, and then click Run:

SELECT
p.v2ProductName,
p.v2ProductCategory,
SUM(p.productQuantity) AS units_sold,
ROUND(SUM(p.localProductRevenue/1000000),2) AS revenue
FROM `data-to-insights.ecommerce.web_analytics`,
UNNEST(hits) AS h,
UNNEST(h.product) AS p
GROUP BY 1, 2
ORDER BY revenue DESC

LIMIT 5;

The result:

Row | v2ProductName | v2ProductCategory | units_sold | revenue
1 | Nest® Learning Thermostat 3rd Gen-USA - Stainless Steel | Nest-USA | 17651 | 870976.95
2 | Nest® Cam Outdoor Security Camera - USA | Nest-USA | 16930 | 684034.55
3 | Nest® Cam Indoor Security Camera - USA | Nest-USA | 14155 | 548104.47
4 | Nest® Protect Smoke + CO White Wired Alarm-USA | Nest-USA | 6394 | 178937.6
5 | Nest® Protect Smoke + CO White Battery Alarm-USA | Nest-USA | 6340 | 178572.4

Question: How many visitors bought on subsequent visits to the website?

5. Run the following query to find out:

# visitors who bought on a return visit (could have bought on their first visit as well)
WITH all_visitor_stats AS (
SELECT
  fullvisitorid, # 741,721 unique visitors
  IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS will_buy_on_return_visit
FROM `data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid
)
SELECT
  COUNT(DISTINCT fullvisitorid) AS total_visitors,
  will_buy_on_return_visit
FROM all_visitor_stats
GROUP BY will_buy_on_return_visit

The results:

Row | total_visitors | will_buy_on_return_visit
1 | 729848 | 0
2 | 11873 | 1

Analyzing the results, you can see that (11873 / 729848) = 1.6% of total visitors will return and purchase from the website. This includes the subset of visitors who bought on their very first session and then came back and bought again.

What are some of the reasons a typical ecommerce customer will browse but not buy until a later visit? Choose all that could apply.

The customer wants to comparison shop on other sites before making a purchase decision.

The customer is waiting for products to go on sale or for another promotion.

The customer is doing additional research.

Submit
This behavior is very common for luxury goods where significant up-front research and comparison is required by the customer before deciding (think of buying a car), but also true to a lesser extent for the merchandise on this site (t-shirts, accessories, etc.).

In the world of online marketing, identifying and marketing to these future customers based on the characteristics of their first visit will increase conversion rates and reduce the outflow to competitor sites.

Task 2. Select features and create your training dataset

Now you will create a machine learning model in BigQuery to predict whether or not a new user is likely to purchase in the future. Identifying these high-value users can help your marketing team target them with special promotions and ad campaigns to ensure a conversion while they comparison shop between visits to your ecommerce site.

Google Analytics captures a wide variety of dimensions and measures about a user's visit on this ecommerce website. Browse the complete list of fields in the [UA] BigQuery Export schema guide and preview the demo dataset to find useful features that will help a machine learning model understand the relationship between data about a visitor's first time on your website and whether they will return and make a purchase.
Your team decides to test whether these two fields are good inputs for your classification model:

 totals.bounces (whether the visitor left the website immediately)
 totals.timeOnSite (how long the visitor was on our website)

What are the risks of only using the above two fields?

A user's bounces is highly correlated with their time on site (e.g. 0 seconds)

Only using time spent on the site ignores other potentially useful columns (features)

Both of the above

Submit

Machine learning is only as good as the training data that is fed into it. If there isn't enough information for the model to determine and learn the relationship between your input features and your label (in this case, whether the visitor bought in the future), then you will not have an accurate model. While training a model on just these two fields is a start, you will see if they're good enough to produce an accurate model.

 In the query EDITOR, add the following query and click Run:

SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1)
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) >
0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
ORDER BY time_on_site DESC

LIMIT 10;

Results:

Row | bounces | time_on_site | will_buy_on_return_visit
1 | 0 | 15047 | 0
2 | 0 | 12136 | 0
3 | 0 | 11201 | 0
4 | 0 | 10046 | 0
5 | 0 | 9974 | 0
6 | 0 | 9564 | 0
7 | 0 | 9520 | 0
8 | 0 | 9275 | 1
9 | 0 | 9138 | 0
10 | 0 | 8872 | 0

Which fields are the model features? What is the label (correct answer)?

The features are bounces and time_on_site. The label is will_buy_on_return_visit

The features are bounces and will_buy_on_return_visit. The label is time_on_site

The feature is will_buy_on_return_visit. The labels are bounces and time_on_site

Submit

Which fields are known after a visitor's first session? (Check all that apply)

will_buy_on_return_visit

visitId

bounces

time_on_site

Submit

Which field isn't known until later in the future, after their first session?

will_buy_on_return_visit

time_on_site

visitId

bounces

Submit

Discussion: will_buy_on_return_visit is not known after the first visit. Again, you're predicting for a subset of users who returned to your website and purchased. Since you don't know the future at prediction time, you cannot say with certainty whether a new visitor will come back and purchase. The value of building an ML model is to get the probability of a future purchase based on the data gathered about their first session.
Question: Looking at the initial data results, do you think time_on_site and bounces will be a good indicator of whether the user will return and purchase or not?

Answer: It's often too early to tell before training and evaluating the model, but at first glance at the top 10 time_on_site values, only 1 customer returned to buy, which isn't very promising. Let's see how the model does.

Task 3. Create a BigQuery dataset to store models

Next, create a new BigQuery dataset which will also store your ML models.

1. In the left pane, click the name of your project, then click the View actions icon (three dots) and select Create dataset.

2. In the Create dataset dialog:

 For Dataset ID, type ecommerce.
 Leave the other values at their defaults.

3. Click Create dataset. (An equivalent SQL statement is sketched below.)
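If you prefer SQL over the console steps above, the dataset can also be created with a one-line DDL statement; this is a minimal sketch, assuming the default project and location settings are acceptable:

-- Creates the dataset that will store your models, equivalent to the console steps above.
CREATE SCHEMA IF NOT EXISTS ecommerce;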


Task 4. Select a BigQuery ML model type and specify options

Now that you have your initial features selected, you are ready to create your first ML model in BigQuery.

There are two model types to choose from:

Model | Model type | Label data type | Example
Forecasting | linear_reg | Numeric value (typically an integer or floating point) | Forecast sales figures for next year given historical sales data.
Classification | logistic_reg | 0 or 1 for binary classification | Classify an email as spam or not spam given the context.

Note: There are many additional model types used in machine learning (like neural networks and decision trees) that are available through libraries like TensorFlow. At the time of this writing, BigQuery ML supports the two listed above.
Which model type should you choose for buy-or-not-buy?

Recommendation model (like matrix_factorization etc.)

Classification model (like logistic_reg etc.)

Forecasting model (like linear_reg etc.)

Submit

1. Enter the following query to create a model and specify model options:

CREATE OR REPLACE MODEL `ecommerce.classification_model`


OPTIONS
(
model_type='logistic_reg',
labels = ['will_buy_on_return_visit']
)
AS
#standardSQL
SELECT
* EXCEPT(fullVisitorId)
FROM
# features
(SELECT
fullVisitorId,
IFNULL(totals.bounces, 0) AS bounces,
IFNULL(totals.timeOnSite, 0) AS time_on_site
FROM
`data-to-insights.ecommerce.web_analytics`
WHERE
totals.newVisits = 1
AND date BETWEEN '20160801' AND '20170430') # train on first 9
months
JOIN
(SELECT
fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) >
0, 1, 0) AS will_buy_on_return_visit
FROM
`data-to-insights.ecommerce.web_analytics`
GROUP BY fullvisitorid)
USING (fullVisitorId)
;

2. Next, click Run to train your model.

Wait for the model to train (5 - 10 minutes).

Note: You cannot feed all of your available data to the model during training, since you need to save some unseen data points for model evaluation and testing. To accomplish this, add a WHERE clause condition to filter and train on only the first 9 months of session data in your 12-month dataset.

After your model is trained, you will see the message "This statement created a new model named qwiklabs-gcp-xxxxxxxxx:ecommerce.classification_model".

3. Click Go to model.

Look inside the ecommerce dataset and confirm that classification_model now appears.

Next, you will evaluate the performance of the model against new unseen evaluation data.

Task 5. Evaluate classification model performance

Select your performance criteria

For classification problems in ML, you want to minimize the false positive rate (predict that the user will return and purchase and they don't) and maximize the true positive rate (predict that the user will return and purchase and they do).

This relationship is visualized with a ROC (Receiver Operating Characteristic) curve, where you try to maximize the area under the curve, or AUC.
In BigQuery ML, roc_auc is simply a queryable field when evaluating your trained ML model.

 Now that training is complete, you can evaluate how well the model performs by running this query using ML.EVALUATE:

SELECT

roc_auc,

CASE

WHEN roc_auc > .9 THEN 'good'

WHEN roc_auc > .8 THEN 'fair'

WHEN roc_auc > .7 THEN 'not great'


ELSE 'poor' END AS model_quality

FROM

ML.EVALUATE(MODEL ecommerce.classification_model, (

SELECT

* EXCEPT(fullVisitorId)

FROM

# features

(SELECT

fullVisitorId,

IFNULL(totals.bounces, 0) AS bounces,

IFNULL(totals.timeOnSite, 0) AS time_on_site

FROM

`data-to-insights.ecommerce.web_analytics`

WHERE

totals.newVisits = 1

AND date BETWEEN '20170501' AND '20170630') # eval on 2 months

JOIN

(SELECT

fullvisitorid,

IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS


will_buy_on_return_visit

FROM

`data-to-insights.ecommerce.web_analytics`

GROUP BY fullvisitorid)

USING (fullVisitorId)

));


You should see the following result:

Row | roc_auc | model_quality
1 | 0.724588 | not great

After evaluating your model you get a roc_auc of 0.72, which shows that the model doesn't have great predictive power. Since the goal is to get the area under the curve as close to 1.0 as possible, there is room for improvement.

Task 6. Improve model performance with feature engineering

As was hinted at earlier, there are many more features in the dataset that may help the model better understand the relationship between a visitor's first session and the likelihood that they will purchase on a subsequent visit.

Add some new features and create a second machine learning model called classification_model_2:

 How far the visitor got in the checkout process on their first visit
 Where the visitor came from (traffic source: organic search, referring site etc.)
 Device category (mobile, tablet, desktop)
 Geographic information (country)

1. Create this second model by running the query below:

CREATE OR REPLACE MODEL `ecommerce.classification_model_2`

OPTIONS

(model_type='logistic_reg', labels = ['will_buy_on_return_visit']) AS

WITH all_visitor_stats AS (
SELECT

fullvisitorid,

IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS


will_buy_on_return_visit

FROM `data-to-insights.ecommerce.web_analytics`

GROUP BY fullvisitorid
)

# add in new features

SELECT * EXCEPT(unique_session_id) FROM (

SELECT

CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,

# labels

will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,

# behavior on the site

IFNULL(totals.bounces, 0) AS bounces,

IFNULL(totals.timeOnSite, 0) AS time_on_site,

totals.pageviews,

# where the visitor came from

trafficSource.source,

trafficSource.medium,

channelGrouping,

# mobile or desktop

device.deviceCategory,

# geographic

IFNULL(geoNetwork.country, "") AS country

FROM `data-to-insights.ecommerce.web_analytics`,

UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE 1=1
# only predict for new visits

AND totals.newVisits = 1

AND date BETWEEN '20160801' AND '20170430' # train 9 months

GROUP BY

unique_session_id,

will_buy_on_return_visit,

bounces,

time_on_site,

totals.pageviews,

trafficSource.source,

trafficSource.medium,

channelGrouping,

device.deviceCategory,

country

);

Note: You are still training on the same first 9 months of data, even with this new model. It's important to have the same training dataset so you can be certain a better model output is attributable to better input features and not new or different training data.

An important new feature that was added to the training dataset query is the maximum checkout progress each visitor reached in their session, which is recorded in the field hits.eCommerceAction.action_type. If you search for that field in the field definitions you will see the field mapping of 6 = Completed Purchase.
In addition, the web analytics dataset has nested and repeated fields like ARRAYS, which need to be broken apart into separate rows in your dataset. This is accomplished by using the UNNEST() function, which you can see in the above query.
Wait for the new model to finish training (5-10 minutes).

2. Evaluate this new model to see if there is better predictive power by running the query below:
#standardSQL

SELECT

roc_auc,

CASE

WHEN roc_auc > .9 THEN 'good'

WHEN roc_auc > .8 THEN 'fair'

WHEN roc_auc > .7 THEN 'not great'

ELSE 'poor' END AS model_quality

FROM

ML.EVALUATE(MODEL ecommerce.classification_model_2, (

WITH all_visitor_stats AS (

SELECT

fullvisitorid,

IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0)


AS will_buy_on_return_visit

FROM `data-to-insights.ecommerce.web_analytics`

GROUP BY fullvisitorid
)

# add in new features

SELECT * EXCEPT(unique_session_id) FROM (

SELECT

CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,


# labels

will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS
latest_ecommerce_progress,

# behavior on the site

IFNULL(totals.bounces, 0) AS bounces,

IFNULL(totals.timeOnSite, 0) AS time_on_site,

totals.pageviews,

# where the visitor came from

trafficSource.source,

trafficSource.medium,

channelGrouping,

# mobile or desktop

device.deviceCategory,

# geographic

IFNULL(geoNetwork.country, "") AS country

FROM `data-to-insights.ecommerce.web_analytics`,

UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE 1=1

# only predict for new visits

AND totals.newVisits = 1
AND date BETWEEN '20170501' AND '20170630' # eval 2 months

GROUP BY

unique_session_id,

will_buy_on_return_visit,

bounces,

time_on_site,

totals.pageviews,

trafficSource.source,

trafficSource.medium,

channelGrouping,

device.deviceCategory,

country

)
));

Result:

Row | roc_auc | model_quality
1 | 0.910382 | good

With this new model you now get a roc_auc of 0.91, which is significantly better than the first model.
Now that you have a trained model, it is time to make some predictions.

Task 7. Predict which new visitors will come back and purchase

Next you will write a query to predict which new visitors will come back and make a purchase.

 Run the prediction query below, which uses the improved classification model to predict the probability that a first-time visitor to the Google Merchandise Store will make a purchase in a later visit:
SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_2`, (

WITH all_visitor_stats AS (

SELECT

fullvisitorid,

IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS


will_buy_on_return_visit

FROM `data-to-insights.ecommerce.web_analytics`

GROUP BY fullvisitorid
)

SELECT

CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,

# labels
will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,

# behavior on the site

IFNULL(totals.bounces, 0) AS bounces,

IFNULL(totals.timeOnSite, 0) AS time_on_site,

totals.pageviews,

# where the visitor came from

trafficSource.source,

trafficSource.medium,

channelGrouping,

# mobile or desktop

device.deviceCategory,

# geographic

IFNULL(geoNetwork.country, "") AS country

FROM `data-to-insights.ecommerce.web_analytics`,

UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE

# only predict for new visits

totals.newVisits = 1

AND date BETWEEN '20170701' AND '20170801' # test 1 month

GROUP BY

unique_session_id,

will_buy_on_return_visit,

bounces,

time_on_site,

totals.pageviews,

trafficSource.source,

trafficSource.medium,

channelGrouping,
device.deviceCategory,

country
)
)

ORDER BY
predicted_will_buy_on_return_visit DESC;

Predictions are made on the last 1 month (out of 12 months) of the dataset.

Your model will now output its predictions for the July 2017 ecommerce sessions. You can see three newly added fields:

 predicted_will_buy_on_return_visit: whether the model thinks the visitor will buy later (1 = yes)
 predicted_will_buy_on_return_visit_probs.label: the binary classifier for yes / no
 predicted_will_buy_on_return_visit_probs.prob: the confidence the model has in its prediction (1 = 100%)

(One way to query these probability fields directly is sketched below.)
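If you want to work with the prediction probabilities directly, one approach is to unnest the probs array and keep only the probability of the positive class. This is only a sketch: `ecommerce.predictions` is a hypothetical table assumed to hold the output of the ML.PREDICT query above.

-- Sketch: extract the probability that each visitor will buy on a return visit.
-- "ecommerce.predictions" is a hypothetical table holding the saved ML.PREDICT output.
SELECT
  unique_session_id,
  predicted_will_buy_on_return_visit,
  probs.prob AS probability_of_future_purchase
FROM `ecommerce.predictions`,
  UNNEST(predicted_will_buy_on_return_visit_probs) AS probs
WHERE probs.label = 1
ORDER BY probability_of_future_purchase DESC;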

Results
 Of the top 6% of first-time visitors (sorted in decreasing order of predicted probability), more than 6% make a purchase in a later visit.
 These users represent nearly 50% of all first-time visitors who make a purchase in a later visit.
 Overall, only 0.7% of first-time visitors make a purchase in a later visit.
 Targeting the top 6% of first-time visitors increases marketing ROI by 9x compared to targeting them all!
Additional information

roc_auc is just one of the performance metrics available during model evaluation. Also available are accuracy, precision, and recall. Knowing which performance metric to rely on is highly dependent on your overall objective or goal.

Congratulations!

You've created a machine learning model using just SQL.

Challenge

Summary

In the previous two tasks you saw the power of feature engineering at work in improving our models' performance. However, we can still improve our performance by exploring other model types. For classification problems, BigQuery ML also supports the following model types:

 Deep Neural Networks.
 Boosted Decision Trees (XGBoost).
 AutoML Tables models.
 Importing custom TensorFlow models.

Task

Though our linear classification (logistic regression) model performed well after feature engineering, it may be too simple of a model to fully capture the relationship between the features and the label. Using the same dataset and labels as you did in Task 6 to create the model ecommerce.classification_model_2, your challenge is to create an XGBoost classifier.

Note: Hint: Use the following options for Boosted_Tree_Classifier:

1. l2_reg = 0.1

2. num_parallel_tree = 8

3. max_tree_depth = 10

You may need to refer to the documentation linked above to see the exact syntax. The model will take around 7 minutes to train. The solution can be found in the solution section below if you need help writing the query.

Solution:

This is the solution you need to create an XGBoost classifier:

CREATE OR REPLACE MODEL `ecommerce.classification_model_3`

OPTIONS

(model_type='BOOSTED_TREE_CLASSIFIER' , l2_reg = 0.1, num_parallel_tree = 8,


max_tree_depth = 10,

labels = ['will_buy_on_return_visit']) AS

WITH all_visitor_stats AS (

SELECT

fullvisitorid,

IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS


will_buy_on_return_visit

FROM `data-to-insights.ecommerce.web_analytics`

GROUP BY fullvisitorid

)
# add in new features

SELECT * EXCEPT(unique_session_id) FROM (

SELECT

CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,

# labels

will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS
latest_ecommerce_progress,

# behavior on the site

IFNULL(totals.bounces, 0) AS bounces,

IFNULL(totals.timeOnSite, 0) AS time_on_site,

totals.pageviews,

# where the visitor came from

trafficSource.source,

trafficSource.medium,

channelGrouping,

# mobile or desktop

device.deviceCategory,

# geographic

IFNULL(geoNetwork.country, "") AS country

FROM `data-to-insights.ecommerce.web_analytics`,

UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE 1=1

# only predict for new visits

AND totals.newVisits = 1

AND date BETWEEN '20160801' AND '20170430' # train 9 months

GROUP BY

unique_session_id,

will_buy_on_return_visit,
bounces,

time_on_site,

totals.pageviews,

trafficSource.source,

trafficSource.medium,

channelGrouping,

device.deviceCategory,

country

);


Let's now evaluate our model and see how we did:

#standardSQL

SELECT

roc_auc,

CASE

WHEN roc_auc > .9 THEN 'good'

WHEN roc_auc > .8 THEN 'fair'

WHEN roc_auc > .7 THEN 'not great'

ELSE 'poor' END AS model_quality

FROM

ML.EVALUATE(MODEL ecommerce.classification_model_3, (

WITH all_visitor_stats AS (

SELECT

fullvisitorid,
IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS
will_buy_on_return_visit

FROM `data-to-insights.ecommerce.web_analytics`

GROUP BY fullvisitorid
)

# add in new features

SELECT * EXCEPT(unique_session_id) FROM (

SELECT

CONCAT(fullvisitorid, CAST(visitId AS STRING)) AS unique_session_id,

# labels

will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS
latest_ecommerce_progress,

# behavior on the site

IFNULL(totals.bounces, 0) AS bounces,

IFNULL(totals.timeOnSite, 0) AS time_on_site,

totals.pageviews,

# where the visitor came from

trafficSource.source,

trafficSource.medium,

channelGrouping,

# mobile or desktop
device.deviceCategory,

# geographic

IFNULL(geoNetwork.country, "") AS country

FROM `data-to-insights.ecommerce.web_analytics`,

UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE 1=1

# only predict for new visits

AND totals.newVisits = 1

AND date BETWEEN '20170501' AND '20170630' # eval 2 months

GROUP BY

unique_session_id,

will_buy_on_return_visit,

bounces,

time_on_site,

totals.pageviews,

trafficSource.source,

trafficSource.medium,

channelGrouping,

device.deviceCategory,

country
)

));

Our roc_auc has increased by about 0.02 to about 0.94!

Note: Your exact values will differ due to the randomness involved in the training process.

It is a small change in the roc_auc, but note that since a roc_auc of 1 is perfect, it gets harder to improve the metric the closer it gets to 1.

This is a great example of how easy it is in BigQuery ML to try out different model types with different options to see how they perform. We were able to use a much more complex model type by changing just one line of SQL.

One may reasonably ask, "Where did the choices for these options come from?", and the answer is experimentation! When you are trying to find the best model type for your problem, you have to experiment with different sets of options in a process known as hyperparameter tuning.

Let's finish up by generating predictions with our improved model and see how they compare to those we generated before. By using a boosted tree classifier model, you can observe a slight improvement of 0.02 in our ROC AUC compared to the previous model. The query below will predict which new visitors will come back and make a purchase:

SELECT
*
FROM
ml.PREDICT(MODEL `ecommerce.classification_model_3`, (

WITH all_visitor_stats AS (

SELECT

fullvisitorid,

IF(COUNTIF(totals.transactions > 0 AND totals.newVisits IS NULL) > 0, 1, 0) AS


will_buy_on_return_visit

FROM `data-to-insights.ecommerce.web_analytics`

GROUP BY fullvisitorid

)
SELECT

CONCAT(fullvisitorid, '-',CAST(visitId AS STRING)) AS unique_session_id,

# labels

will_buy_on_return_visit,

MAX(CAST(h.eCommerceAction.action_type AS INT64)) AS latest_ecommerce_progress,

# behavior on the site

IFNULL(totals.bounces, 0) AS bounces,

IFNULL(totals.timeOnSite, 0) AS time_on_site,

totals.pageviews,

# where the visitor came from

trafficSource.source,

trafficSource.medium,

channelGrouping,

# mobile or desktop

device.deviceCategory,

# geographic

IFNULL(geoNetwork.country, "") AS country

FROM `data-to-insights.ecommerce.web_analytics`,

UNNEST(hits) AS h

JOIN all_visitor_stats USING(fullvisitorid)

WHERE

# only predict for new visits

totals.newVisits = 1

AND date BETWEEN '20170701' AND '20170801' # test 1 month

GROUP BY

unique_session_id,

will_buy_on_return_visit,

bounces,

time_on_site,

totals.pageviews,
trafficSource.source,

trafficSource.medium,

channelGrouping,

device.deviceCategory,

country
)
)

ORDER BY
predicted_will_buy_on_return_visit DESC;

The output now shows a classification model that can better predict the probability that a first-time visitor to the Google Merchandise Store will make a purchase in a later visit. By comparing the result above with the earlier model shown in Task 7, you can see that the confidence the model has in its predictions is more accurate than with the logistic_regression model type.

Quiz
Your score: 75%. Passing score: 75%.
Congratulations! You passed this assessment.

1.
You want to use machine learning to identify whether an email is spam. Which should you use?
✓ Supervised learning, logistic regression

2.
The data has been loaded into BigQuery, and the features have been selected and preprocessed. What should happen next when you use BigQuery ML to develop a machine learning model?
✓ Create the ML model inside BigQuery.

3.
In a supervised machine learning model, what provides historical data that can be used to predict future data?
Labels
Examples
Features

4.
Which pattern describes source data that is moved into a BigQuery table in a single operation?
Batch load

5.
BigQuery is a fully managed data warehouse. What is meant by "fully managed"?
BigQuery manages the data quality for you.
BigQuery manages the underlying structure for you.

6.
Which two services does BigQuery provide?
✓ Storage and analytics

7.
You want to use machine learning to group random photos into similar groups. Which should you use?
✓ Unsupervised learning, cluster analysis

8.
Which BigQuery feature leverages geography data types and standard SQL geography functions to analyze a dataset?
✓ Geospatial analysis
https://cloud.google.com/bigquery

https://cloud.google.com/bigquery#section-5

https://cloud.google.com/bigquery/docs/loading-
data#choosing_a_data_ingestion_method

https://cloud.google.com/blog/products/g-suite/connecting-bigquery-and-google-
sheets-to-help-with-hefty-data-analysis

https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-
syntax-e2e-journey

https://cloud.google.com/bigquery-ml/docs/tutorials

Machine learning options on Google Cloud


Introduction
In previous sections of this course, you learned about many data engineering tools available from
Google Cloud. Now let’s switch our focus to machine learning. In this section, we’ll explore the
different options Google Cloud offers for building machine

00:13learning models. Additionally, we will explain how a product called Vertex AI can help
solve machine learning challenges. So you might be wondering, “Why should I trust Google
for artificial intelligence and

00:25machine learning?” Google is an AI-first company, and is recognized as a leader


across industries because of its contributions in the fields of artificial intelligence and
machine learning. In 2022 Google was recognized as a leader in the Gartner Magic Quadrant
for Cloud AI

00:42Developer services, and in recent years has also received recognition in numerous
annual industry awards and reports. And at Google we’ve been implementing artificial
intelligence for over ten years into many

00:56of our critical products, systems, and services. For example, have you ever noticed
how Gmail automatically suggests three responses to a received message? This feature is
called Smart Reply, which uses artificial intelligence to predict how

01:12you might respond. Behind this intelligence is AI technology known as natural


language processing, which is just one example of an impressive list of technologies that
Google scientists and engineers are working on.

01:24We’ll explore these in more depth later in the course. The goal of these technologies is
not for exclusive use to only benefit Google customers. The goal is to enable every company
to be an AI company by reducing the challenges of
01:40AI model creation to only the steps that require human judgment or creativity. So for
workers in the travel and hospitality field, this might mean using AI and ML to

01:52improve aircraft scheduling or provide customers with dynamic pricing options. For
retail-sector employees, it might mean using AI and ML to leverage predictive inventory
planning. The potential solutions are endless.

02:07What are the problems in your business that artificial intelligence and machine
learning might help you solve? Take a moment to think about this question before continuing
to the next video.

Options to build ML models

Google Cloud offers four options for building machine learning models. The first option is BigQuery
ML. You’ll remember from an earlier section of this course that BigQuery ML is a tool

00:11for using SQL queries to create and execute machine learning models in BigQuery. If
you already have your data in BigQuery and your problems fit the pre-defined ML models, this
could be your choice.

00:23The second option is to use pre-built APIs, which are application programming
interfaces. This option lets you leverage machine learning models that have already been
built and trained by Google, so you don’t have to build your own machine

00:37learning models if you don’t have enough training data or sufficient machine learning
expertise in-house. The third option is AutoML, which is a no-code solution, so you can build
your own machine learning models on Vertex AI through a point-and-click interface.

00:51And finally, there is custom training, through which you can code your very own
machine learning environment, the training, and the deployment, which gives you flexibility
and provides the control over the ML pipeline.

01:06Let’s compare the four options to help you decide which one to use for building your
ML model. Please note that the technologies change constantly and this is only a brief
guideline.

01:18Data type: BigQuery ML only supports tabular data while the other three support
tabular, image, text, and video. Training data size: Pre-built APIs do not require any training
data, while BigQuery ML and custom training require
01:37a large amount of data. Machine learning and coding expertise: Pre-Built APIs and
AutoML are user friendly with low requirements, while Custom training has the highest
requirement and BigQuery ML requires you to understand SQL.

01:56Flexibility to tune the hyperparameters: At the moment, you can’t tune the hyperparameters with pre-built APIs or AutoML; however, you can experiment with hyperparameters using BigQuery ML and custom training.

02:10Time to train the model: Pre-built APIs require no time to train a model because they
directly use pre-built models from Google. The time to train a model for the other three
options depends on the specific project.

02:23Normally, custom training takes the longest time because it builds the ML model from
scratch, unlike AutoML and BigQuery ML. Selecting the best option will depend on your
business needs and ML expertise.

02:37If your data engineers, data scientists, and data analysts are familiar with SQL and
already have your data in BigQuery, BigQuery ML lets you develop SQL-based models. If your
business users or developers have little ML experience,

02:52using pre-built APIs is likely the best choice. Pre-built APIs address common
perceptual tasks such as vision, video, and natural language. They are ready to use without
any ML expertise or model development effort.

03:06If your developers and data scientists want to build custom models with your own
training data while spending minimal time coding, then AutoML is your choice. AutoML
provides a code-less solution to enable you to focus on business problems instead

03:21of the underlying model architecture and ML provisioning. If your ML engineers and
data scientists want full control of ML workflow, Vertex AI custom training lets you train and
serve custom models with code on Vertex Workbench.

03:36We’ve already explored BigQuery ML, so in the videos that follow, we’ll explore the
other three options in more detail.
Pre-built APIs

Good Machine Learning models require lots of high-quality training data. You should aim for
hundreds of thousands of records to train a custom model. If you don't have that kind of data, pre-
built APIs are a great place to start.

00:15Pre-built APIs are offered as services. In many cases they can act as building blocks
to create the application you want without expense or complexity of creating your own
models.

00:26They save the time and effort of building, curating, and training a new dataset so you
can just jump right ahead to predictions. So, what are some of the pre-built APIs?
00:36Let’s explore a short list. The Speech-to-Text API converts audio to text for data
processing. The Cloud Natural Language API recognizes parts of speech called entities and
sentiment. The Cloud Translation API converts text from one language to another.

00:55The Text-to-Speech API converts text into high quality voice audio. The Vision API
works with and recognizes content in static images. And the Video Intelligence API
recognizes motion and action in video.

01:09And Google has already done a lot of work to train these models using Google
datasets. For example, the Vision API is based on Google’s image datasets, the Speech-to-
Text API is trained on YouTube

01:22captions, and the Translation API is built on Google’s neural machine translation
technology. You’ll recall that how well a model is trained depends on how much data is
available to train it.

01:35As you might expect, Google has a lot of images, text, and ML researchers to train its pre-built models. This means less work for you. Let’s take a minute and try out the Vision API in a browser.

01:48Start by navigating to cloud.google.com/vision in Chrome, and


then Scroll down to try the API by uploading an image. You can actually experiment with each
of the ML APIs in a browser.

02:04When you’re ready to build a production model, you’ll need to pass a JSON object
request to the API and parse what it returns.
Custom training

We’ve explored the options Google Cloud provides to build machine learning models using BigQuery
ML, pre-built APIs, and AutoML. Now let's take a look at the last option, custom training.

00:12If you want to code your machine learning model, you can use this option by building a
custom training solution with Vertex AI Workbench. Workbench is a single development
environment for the entire data science workflow, from

00:25exploring, to training, and then deploying a machine learning model with code. Before
any coding begins, you need to determine what environment you want your ML training code
to use.

00:36There are two options: a pre-built container or a custom container. Imagine that a
container is a kitchen. A pre-built container would represent a fully furnished room with
cabinets and appliances (which represent the dependencies) that includes
00:51all the cookware (which represents the libraries) you need to make a meal. So, if your ML training needs a platform like TensorFlow, PyTorch, Scikit-learn, or XGBoost, and Python code to work with the platform, a pre-built container is probably your best

01:08solution. A custom container, alternatively, is like an empty room with no cabinets,


appliances, or cookware. You define the exact tools that you need to complete the job.

Vertex AI
For years now, Google has invested time and resources into developing big data and AI. Google has developed key technologies and products, from its roots in the development of Scikit-learn back in 2007 to Vertex AI today.
00:15As an AI-first company, Google has applied AI technologies to many of its products
and services, like Gmail, Google Maps, Google Photos, and Google Translate, just to name a
few.

00:27But developing these technologies doesn’t come without challenges, especially when
it involves developing machine learning models and putting them into production. Some
traditional challenges include determining how to handle large quantities of data,
determining

00:42the right machine learning model to train the data, and harnessing the required
amount of computing power. Then there are challenges around getting ML models into
production. Production challenges require scalability, monitoring, and continuous
integration and

00:57continuous delivery or deployment. In fact, according to Gartner, only half of enterprise


ML projects get past the pilot phase. And finally, there are ease-of-use challenges. Many
tools on the market require advanced coding skills, which can take a data scientist’s

01:18focus away from model configuration. And without a unified workflow, data scientists
often have difficulties finding tools. Google’s solution to many of the production and ease-of-
use challenges is Vertex AI, a

01:31unified platform that brings all the components of the machine learning ecosystem
and workflow together. So, what exactly does a unified platform mean? In the case of Vertex
AI, it means having one digital experience to create, deploy,

01:46and manage models over time, and at scale. For example, during the data readiness stage, users can upload data from wherever it’s stored: Cloud Storage, BigQuery, or a local machine.

02:00Then, during the feature readiness stage, users can create features, which are the
processed data that will be put into the model, and then share them with others using the
feature

02:10store. After that, it’s time for Training and Hyperparameter tuning. This means that
when the data is ready, users can experiment with different models and adjust
hyperparameters. And finally, during deployment and model monitoring, users can set up the
pipeline to transform

02:27the model into production by automatically monitoring and performing continuous


improvements. And to refer back to the different options we explored earlier, Vertex AI allows
users to build machine learning models with either AutoML, a code-less solution or Custom
Training,

02:44a code-based solution. AutoML is easy to use and lets data scientists spend more
time turning business problems into ML solutions, while custom training enables data
scientists to have full control over
02:57the development environment and process. Being able to perform such a wide range
of tasks in one unified platform has many benefits. This can be summarized with four Ss: It’s
seamless.

03:12Vertex AI provides a smooth user experience from uploading and preparing data all
the way to model training and production. It’s scalable. The Machine Learning Operations
(MLOps) provided by Vertex AI helps to monitor and manage the

03:27ML production and therefore scale the storage and computing power automatically.
It’s sustainable. All of the artifacts and features created using Vertex AI can be reused and
shared. And it’s speedy.

03:43Vertex AI produces models that have 80% fewer lines of code than competitors.

AI Solutions
Now that you’ve explored the four different options available to create machine learning models
with Google Cloud, let’s take a few minutes to explore Google Cloud’s artificial intelligence solution
portfolio.

00:11It can be visualized with three layers. The bottom layer is the AI foundation, and
includes the Google Cloud infrastructure and data. The middle layer represents the AI
development platform, which includes the four ML options

00:24you just learned about: AutoML and custom training, which are offered through Vertex
AI, and pre-built APIs and BigQuery ML. The top layer represents AI solutions, for which there
are two groups, horizontal solutions

00:39and industry solutions. Horizontal solutions usually apply to any industry that would
like to solve the same problem. Examples include Document AI and CCAI. Document AI uses
computer vision and optical character recognition, along with natural

00:56language processing, to create pretrained models to extract information from


documents. The goal is to increase the speed and accuracy of document processing to help
organizations make better decisions faster, while reducing costs.
01:11Another example of a horizontal solution is Contact Center AI, or CCAI. The goal of
CCAI is to improve customer service in contact centers through the use of artificial
intelligence.

01:24It can help automate simple interactions, assist human agents, unlock caller insights,
and provide information to answer customer questions. And the second group is vertical, or
industry solutions. These represent solutions that are relevant to specific industries.

01:42Examples include: Retail Product Discovery, which gives retailers the ability to provide
Google-quality search and recommendations on their own digital properties, helping to
increase conversions and reduce search abandonment, Google Cloud Healthcare Data
Engine, which generates healthcare insights and analytics

02:00with one end-to-end solution, and Lending DocAI, which aims to transform the home
loan experience for borrowers and lenders by automating mortgage document processing.
You can learn more about Google Cloud’s growing list of AI solutions at
cloud.google.com/solutions/ai.

Summary
We’ve covered a lot of information in this section of the course.

00:02Let’s do a quick recap.

00:04To start, we explored Google’s history as an AI-first company.

00:09From there, we looked at the four options Google Cloud offers to build machine
learning models.

00:15This includes BigQuery ML, pre-built APIs, AutoML, and Custom Training.

00:21Next, we introduced Vertex AI, a tool that combines the functionality of AutoML, which
is codeless, and custom training, which is code-based, to solve production and ease-of-use
problems.
00:32You’ll recall that selecting the best ML option will depend on your business needs and
ML expertise.

00:39If your data engineers, data scientists, and data analysts are familiar with SQL and already have your data in BigQuery, BigQuery ML lets you develop SQL-based models.

00:51If your business users or developers have little ML experience, using pre-built APIs is
likely the best choice.

00:59Pre-built APIs address common perceptual tasks such as vision, video, and natural
language.

01:05They are ready to use without any ML expertise or model development effort.

01:09If your developers and data scientists want to build custom models with your own
training data while spending minimal time coding, then AutoML is your choice.

01:19AutoML provides a code-less solution to enable you to focus on business problems instead of the underlying model architecture and ML provisioning.

01:27If your ML engineers and data scientists want full control of the ML workflow, Vertex
AI custom training lets you train and serve custom models with code on Vertex Workbench.

01:38Using pre-built containers, you can leverage popular ML libraries, such as TensorFlow and PyTorch.

01:44Alternatively, you can build a custom container from scratch.
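
To make the custom training option a little more concrete, here is a minimal, hypothetical sketch of the kind of training script you might package into a pre-built TensorFlow container. The file name and column names are assumptions modeled on the loan-risk lab later in the course, not part of the course material itself:

# train.py - a minimal custom training sketch (assumes a local CSV with the
# lab's age, income, and loan columns and a Default label).
import pandas as pd
import tensorflow as tf

data = pd.read_csv("loan_risk.csv")
features = data[["age", "income", "loan"]]
labels = data["Default"]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(features, labels, epochs=10)
model.save("saved_model")  # the saved model is what the container would export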

01:49And finally, we introduced Google Cloud AI solutions.

01:52The solutions are built on top of the four ML development options to meet both
horizontal and vertical market needs.


Quiz
Your score: 100% Passing score: 75%
Congratulations! You passed this assessment.

1.
You work for a global hotel chain that recently uploaded some guest data into BigQuery. You are experienced in writing SQL and want to leverage machine learning to help predict guest trends for the coming months. Which option is best?

AutoML
Pre-built APIs
BigQuery ML (correct answer)
Custom training

That answer is correct!

2.
Which Google Cloud product lets users create, deploy, and manage machine learning models on one unified platform?

AI Platform
TensorFlow
Document AI
Vertex AI (correct answer)

That answer is correct!

3.
Which code-based solution offered with Vertex AI gives data scientists full control over the development environment and process?

AI Platform
Custom training (correct answer)
AutoML
AI Solutions

That answer is correct!

4.
You work for a video production company and want to use machine learning to categorize event footage, but you don't want to train your own ML model. Which option can help you get started?

AutoML
Pre-built APIs (correct answer)
BigQuery ML
Custom training

That answer is correct!

5.
Your company has a lot of data, and you want to train your own machine learning model to see what insights ML can provide. Due to resource constraints, you need a no-code solution. Which option is best?

AutoML (correct answer)
Pre-built APIs
BigQuery ML
Custom training

https://cloud.google.com/training/machinelearning-ai

https://cloud.google.com/vertex-ai#section-1

https://cloud.google.com/architecture/ml-on-gcp-best-practices

https://www.fast.ai/posts/2018-07-23-auto-ml-3.html

https://cloud.google.com/vertex-ai/docs/training/containers-overview?_ga=2.143882930.-601714452.1627921693

https://cloud.google.com/deep-learning-containers/docs/overview#pre-installed_software

The Machine Learning Workflow with Vertex AI
Introduction
In the previous section of this course, you explored the machine learning options available on Google
Cloud. Now let's switch our focus to the machine learning workflow with Vertex AI – from

00:10data preparation, to model training, and finally, model deployment. Vertex AI, Google’s
AI platform, provides developers and data scientists one unified environment to build custom
ML models. This process is actually not too different from serving food in a restaurant–
starting

00:30with preparing raw ingredients through to serving dishes on the table. Later in this
section, you’ll get hands-on practice building a machine-learning model end-to-end using
AutoML on Vertex AI. But before we get into the details, let’s look at the basic differences
between machine

00:48learning and traditional programming. In traditional programming, simply put, one plus
one equals two (1+1=2). Data, plus rules–otherwise known as algorithms–lead to answers.
And with traditional programming, a computer can only follow the algorithms that a human

01:06has set up. But what if we’re just too lazy to figure out the algorithms? Or what if the
algorithms are too complex to figure out? This is where machine learning comes in.

01:19With machine learning, we feed a machine a large amount of data, along with answers
that we would expect a model to conclude from that data. Then, we show the machine a
learning method by selecting a machine learning model.

01:31From there, we expect the machine to learn from the provided data and examples to
solve the puzzle on its own. So, instead of telling a machine how to do addition, we give it
pairs of numbers and

01:42the answers. For example, 1 and 1 with the answer 2, and 2 and 3 with the answer 5. We then ask it to figure out how to do addition on its own. But how is it possible that a machine can actually learn to solve puzzles?
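
To make that idea concrete, here is a minimal sketch, not part of the course, that teaches a simple scikit-learn model addition purely from example pairs; the pairs beyond the two mentioned in the transcript are made up for illustration:

# Learning addition from examples instead of writing the rule ourselves.
from sklearn.linear_model import LinearRegression

pairs = [[1, 1], [2, 3], [4, 5], [10, 7]]  # the data
sums = [2, 5, 9, 17]                       # the answers we expect

model = LinearRegression()
model.fit(pairs, sums)          # the machine works out the rule on its own
print(model.predict([[6, 8]]))  # prints approximately [14.]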

01:58For machine learning to be successful, you'll need lots of storage, like what’s available
with Cloud Storage, and the ability to make fast calculations, like with cloud computing.
There are many practical examples of this capability.

02:12For example, by feeding Google Photos lots of pictures with tags, we can teach the
software to associate and then automatically attach tags to new pictures. Tags can then be
used for search functionality, or even to automatically create photo albums.

02:28Can you come up with any other examples of how to apply machine learning capabilities? Take a moment to think about it. There are three key stages to this learning process. The first stage is data preparation,

02:44which includes two steps: data uploading and feature engineering. You will be
introduced to feature engineering in the next lesson. A model needs a large amount of data
to learn from.

02:56Data used in machine learning can be either real-time streaming data or batch data,
and it can be either structured, which is numbers and text normally saved in tables, or
unstructured,
03:06which is data that can’t be put into tables, like images and videos. The second stage is
model training. A model needs a tremendous amount of iterative training. This is when
training and evaluation form a cycle to train a model, then evaluate the

03:22model, and then train the model some more. The third and final stage is model
serving. A model needs to actually be used in order to predict results. This is when the
machine learning model is deployed, monitored, and managed.

03:37If you don’t move an ML model into production, it has no use and remains only a
theoretical model. We mentioned at the start that the machine learning workflow on Vertex AI
is not too

03:50different from serving food in a restaurant. So, if you compare these steps to running
a restaurant, data preparation is when you prepare the raw ingredients, model training is
when you experiment with different recipes,

04:02and model serving is when you finalize the menu to serve the meal to lots of hungry
customers. Now it’s important to note that an ML workflow isn’t linear, it’s iterative.

04:14For example, during model training, you may need to return to dig into the raw data
and generate more useful features to feed the model. When monitoring the model during
model serving, you might find data drifting, or the accuracy

04:27of your prediction might suddenly drop. You might need to check the data sources and
adjust the model parameters. Fortunately, these steps can be automated with machine
learning operations, or MLOps.

04:39We’ll go into more detail on this soon. So how does Vertex AI support this workflow?
You’ll recall that Vertex AI provides two options to build machine learning models: AutoML,
which is a codeless solution, and Custom Training, which is a code-based solution.

04:59Vertex AI provides many features to support the ML workflow, all of which are
accessible through either AutoML or Vertex AI workbench. Examples include: Feature Store,
which provides a centralized

05:11repository for organizing, storing, and serving features to feed to training models,
Vizier, which helps you tune hyperparameters in complex machine learning models,
Explainable AI, which helps with things like interpreting training performance, and

05:29Pipelines, which help you monitor the ML production line.


Data preparation
Now let’s look closer at an AutoML workflow.

00:04The first stage of the AutoML workflow is data preparation.

00:08During this stage, you must upload data and then prepare the data for model training
with feature engineering.

00:15When you upload a dataset in the Vertex AI user interface, you’ll need to provide a
meaningful name for the data and then select the data type and objective.

00:25AutoML allows four types of data: image, tabular, text, and video.

00:33To select the correct data type and objective, you should: Start by checking data
requirements.

00:38We’ve included a link to these requirements in the resources section of this course.

00:42Next, you’ll need to add labels to the data if you haven’t already.
00:47A label is a training target.

00:49So, if you want a model to distinguish a cat from a dog, you must first provide sample
images that are tagged—or labeled—either cat or dog.

00:59A label can be manually added, or it can be added by using Google’s paid label service
via the Vertex console.

01:06These human labellers will manually generate accurate labels for you.

01:10The final step is to upload the data.

01:14Data can be uploaded from a local source, BigQuery, or Cloud Storage.

01:18You will practice these steps in the lab.

01:23After your data is uploaded to AutoML, the next step is preparing the data for model
training with feature engineering.

01:31Imagine you’re in the kitchen preparing a meal.

01:33Your data is like your ingredients, such as carrots, onions, and tomatoes.

01:39Before you start cooking, you'll need to peel the carrots, chop the onions, and rinse the
tomatoes.

01:44This is what feature engineering is like: the data must be processed before the model
starts training.

01:49A feature, as we discussed in the BigQuery module, refers to a factor that contributes
to the prediction.

01:57It’s an independent variable in statistics or a column in a table.
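
As a small, hypothetical illustration of this kind of preparation in code (the column names simply echo the loan-risk lab later in this section), you might derive a new feature from two existing columns:

import pandas as pd

# Two raw columns; the values are borrowed from the lab's example inputs.
df = pd.DataFrame({
    "income": [44964.01, 50000.00],
    "loan":   [3944.22, 20000.00],
})

# Feature engineering: add a new column, the loan-to-income ratio.
df["loan_to_income"] = df["loan"] / df["income"]
print(df)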

02:03Preparing features can be both challenging and tedious.

02:06To help, Vertex AI has a function called Feature Store.

02:10Feature Store is a centralized repository to organize, store, and serve machine learning
features.

02:17It aggregates all the different features from different sources and updates them to
make them available from a central repository.

02:24Then, when engineers need to model something, they can use the features available in
the Feature Store dictionary to build a dataset.

02:32Vertex AI automates the feature aggregation to scale the process.

02:38So what are the benefits of Vertex AI Feature Store?

02:41First, features are shareable for training or serving tasks.

02:46Features are managed and served from a central repository, which helps maintain
consistency across your organization.
02:53Second, features are reusable.

02:56This helps save time and reduces duplicative efforts, especially for high-value
features.

03:02Third, features are scalable.

03:05Features automatically scale to provide low-latency serving, so you can focus on developing the logic to create the features without worrying about deployment.

03:15And fourth, features are easy to use.

03:18Feature Store is built on an easy-to-navigate user interface.

Model training

Now that our data is ready (in our cooking analogy, the ingredients), it's time to train the model. This is like experimenting with some recipes.

00:11This stage involves two steps: model training, which would be like cooking the recipe,
and model evaluation, which is when we taste how good the meal is. This process might be
iterative.

00:24Before we get into more details about this stage, let’s pause to clarify two terms:
artificial intelligence and machine learning. Artificial intelligence, or AI, is an umbrella term
that includes anything related to computers

00:38mimicking human intelligence. This ranges from robots performing human actions all the way down to something like spell check in an online word processor. Machine learning is a subset of AI that mainly refers to

00:51supervised and unsupervised learning. You might also hear the term deep learning, or
deep neural networks. It’s a subset of machine learning that adds layers in between input
data and output results

01:04to make a machine learn at more depth. So, what’s the difference between supervised
and unsupervised learning? Supervised learning is task-driven and identifies a goal.
Unsupervised learning, however, is data-driven and identifies a pattern.

01:21An easy way to distinguish between the two is that supervised learning provides each
data point with a label, or an answer, while unsupervised does not. For example, if we were
given sales data from an online retailer, we could use supervised

01:35learning to predict the sales trend for the next couple of months and use unsupervised
learning to group customers together based on common characteristics. There are two
major types of supervised learning: The first is classification, which predicts
01:52a categorical variable, like using an image to tell the difference between a cat and a
dog. The second type is a regression model, which predicts a continuous number, like using
past

02:03sales of an item to predict a future trend. There are three major types of unsupervised
learning: The first is clustering, which groups together data points with similar characteristics
and

02:16assigns them to "clusters," like using customer demographics to determine customer segmentation. The second is association, which identifies underlying relationships, like a correlation between two products to place them closer together in a grocery store for a promotion.

02:34And the third is dimensionality reduction, which reduces the number of dimensions, or
features, in a dataset to improve the efficiency of a model. For example, combining customer
characteristics like age, driving violation history, or car

02:50type, to create an insurance quote. If too many dimensions are included, it can
consume too many compute resources, which might make the model inefficient. Although
Google Cloud provides four machine learning options,

03:04with AutoML and pre-built APIs you don’t need to specify a machine learning model.
Instead, you’ll define your objective, such as text translation or image detection. Then on the
backend, Google will select the best model to meet your business goal.

03:19With the other two options, BigQuery ML and custom training, you’ll need to specify
which model you want to train your data on and assign something called hyperparameters.
You can think of hyperparameters as user-defined knobs in a machine that help guide the machine

03:35learning process. For example, one hyperparameter is the learning rate, which is how fast you want the machine to learn. With AutoML, you don't need to worry about adjusting these hyperparameter knobs because the tuning happens automatically in

03:48the back end. This is largely done by neural architecture search, which finds the best-fit model by comparing its performance against thousands of other models.
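
As a hedged illustration of what turning such a knob yourself looks like in custom training code (the values here are arbitrary), a Keras model might expose the learning rate like this:

import tensorflow as tf

learning_rate = 0.01  # a hyperparameter: chosen by you, not learned from the data

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=learning_rate),
    loss="mse",
)
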
Model evaluation

While we are experimenting with a recipe, we need to keep tasting it constantly to make sure it
meets our expectations.

00:06This is the model evaluation portion of the model training stage.

00:11Vertex AI provides extensive evaluation metrics to help determine a model's performance.

00:16Among the metrics are two sets of measurements.

00:20The first is based on the confusion matrix, for example recall and precision.

00:25The second is based on feature importance, which we’ll explore later in this section of
the course.

00:30A confusion matrix is a specific performance measurement for machine learning classification problems.

00:36It’s a table with combinations of predicted and actual values.

00:41To keep things simple we assume the output includes only two classes.

00:45Let’s explore an example of a confusion matrix.

00:49The first is a true positive combination, which can be interpreted as, “The model
predicted positive, and that’s true.”

00:56The model predicted that this is an image of a cat, and it actually is.

01:01The opposite of that is a true negative combination, which can be interpreted as, “The
model predicted negative, and that’s true.”
01:09The model predicted that a dog is not a cat, and it actually isn’t.

01:15Then there is a false positive combination, otherwise known as a Type 1 Error, which
can be interpreted as, “The model predicted positive, and that’s false.”

01:24The model predicted that a dog is a cat but it actually isn’t.

01:28Finally, there is the false negative combination, otherwise known as a Type 2 Error,
which can be interpreted as, “The model predicted negative, and that’s false.”

01:39The model predicted that a cat is not a cat, but it actually is.

01:43A confusion matrix is the foundation for many other metrics used to evaluate the
performance of a machine learning model.

01:50Let’s take a look at the two popular metrics, recall and precision, that you will
encounter in the lab.

01:58Recall refers to all the positive cases, and looks at how many were predicted correctly.

02:04This means that recall is equal to the true positives, divided by the sum of the true
positive and false negatives.

02:12Precision refers to all the cases predicted as positive, and how many are actually
positive.

02:18This means that precision is equal to the true positives, divided by the sum of the true
positive and false positives.

02:26Imagine you’re fishing with a net.

02:28Using a wide net, you caught both fish and rocks: 80 fish out of 100 total fish in the
lake, plus 80 rocks.

02:36The recall in this case is 80%, which is calculated by the number of fish caught, 80,
divided by the total number of fish in the lake, 100.

02:46The precision is 50%, which is calculated by taking the number of fish caught, 80, and
dividing it by the number of fish and rocks collected, 160.

02:58Let’s say you wanted to improve the precision, so you switched to a smaller net.

03:05This time you caught 20 fish and 0 rocks.

03:08The recall becomes 20% (20 out of 100 fish collected) and the precision becomes
100% (20 out of 20 total fish and rocks collected).
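
As a quick sanity check of those numbers, here is a small sketch that applies the recall and precision formulas above to the two fishing nets:

def recall(true_positives, false_negatives):
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    return true_positives / (true_positives + false_positives)

# Wide net: 80 fish caught, 20 fish missed, 80 rocks also caught.
print(recall(80, 20), precision(80, 80))  # 0.8 and 0.5

# Smaller net: 20 fish caught, 80 fish missed, no rocks caught.
print(recall(20, 80), precision(20, 0))   # 0.2 and 1.0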

03:21Precision and recall are often a trade-off.

03:23Depending on your use case, you may need to optimize for one or the other.

03:29Consider a classification model where Gmail separates emails into two categories:
spam and not-spam.
03:36If the goal is to catch as many potential spam emails as possible, Gmail may want to
prioritize recall.

03:43In contrast, if the goal is to only catch the messages that are definitely spam without
blocking other emails, Gmail may want to prioritize precision.

03:53In Vertex AI, the platform visualizes the precision and the recall curve so they can be
adjusted based on the problem that needs solving.

04:01You’ll get the opportunity to practice adjusting precision and recall in the AutoML lab.

04:07In addition to the confusion matrix and the metrics generated to measure model
effectiveness such as recall and precision, the other useful measurement is feature
importance.

04:17In Vertex AI, feature importance is displayed through a bar chart to illustrate how each
feature contributes to a prediction.

04:25The longer the bar, or the larger the numerical value associated with a feature, the
more important it is.

04:32This information helps decide which features are included in a machine learning
model to predict the goal.

04:37You will observe the feature importance chart in the lab as well.

04:42Feature importance is just one example of Vertex AI's comprehensive machine learning functionality called Explainable AI.

04:49Explainable AI is a set of tools and frameworks to help understand and interpret predictions made by machine learning models.
Model serving

The recipes are ready and now it's time to serve the meal! This represents the final stage of the
machine learning workflow, model serving. Model serving consists of two steps: First, model
deployment, which we can compare

00:14to serving the meal to a hungry customer, and second, model monitoring, which we
can compare to checking with the waitstaff to ensure that the restaurant is operating
efficiently. It’s important to note that model management exists throughout this whole
workflow to manage

00:29the underlying machine learning infrastructure. This lets data scientists focus on what
to do, rather than how to do it. Machine learning operations, or MLOps, play a big role here. MLOps combines machine learning development with operations

00:46and applies similar principles from DevOps, which is short for development and operations, to machine learning models. MLOps aims to solve production challenges related to machine learning. In this case, this refers to building an integrated machine learning system

01:02and operating it in production. These are considered to be some of the biggest pain
points by the ML practitioners’ community, because both data and code are constantly
evolving in machine learning.

01:15Practicing MLOps means advocating for automation and monitoring at each step of
the ML system construction. This means adopting a process to enable continuous
integration, continuous training, and continuous delivery.

01:30What does MLOps have to do with model serving? Well, let’s start with model
deployment, which is the exciting time when a model is implemented. In our restaurant
analogy, it’s when the food is put on the table for the customer

01:42to eat! MLOps provides a set of best practices on the backend to automate this
process. There are three options to deploy a machine learning model. The first option is to
deploy to an endpoint.

01:57This option is best when immediate results with low latency are needed, such as
making instant recommendations based on a user’s browsing habits whenever they’re online.
A model must be deployed to an endpoint before that model can be used to serve real-time

02:11predictions. The second option is to deploy using batch prediction. This option is best
when no immediate response is required, and accumulated data should be processed with a
single request.

02:24For example, sending out new ads every other week based on the user’s recent
purchasing behavior and what’s currently popular on the market. And the final option is to
deploy using offline prediction.

02:37This option is best when the model should be deployed in a specific environment off
the cloud. In the lab, you will practice predicting with an endpoint. Now let’s shift our focus to
model monitoring.
02:50The backbone of MLOps on Vertex AI is a tool called Vertex AI Pipelines. It automates,
monitors, and governs machine learning systems by orchestrating the workflow in a
serverless manner.

03:04Imagine you’re in a production control room, and Vertex AI Pipelines is displaying the
production data onscreen. If something goes wrong, it automatically triggers warnings based
on a predefined threshold.

03:17With Vertex AI Workbench, which is a notebook tool, you can define your own pipeline.
You can do this with prebuilt pipeline components, which means that you primarily need to
specify

03:26how the pipeline is put together using components as building blocks. And it’s with
these final two steps, model deployment and model monitoring, that we complete our
exploration of the machine learning workflow.

03:40The restaurant is open and operating smoothly – Bon appetit!
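
As a rough sketch of what defining your own pipeline with components as building blocks can look like in code (the component, names, and project details below are assumptions, and exact SDK signatures can vary between Kubeflow Pipelines versions), a pipeline compiled and submitted to Vertex AI Pipelines might resemble the following:

from kfp import compiler, dsl

@dsl.component
def check_accuracy(threshold: float) -> bool:
    # Placeholder component: a real one would read evaluation metrics and
    # decide whether the model still meets the threshold.
    return True

@dsl.pipeline(name="monitoring-demo")
def monitoring_pipeline(threshold: float = 0.9):
    check_accuracy(threshold=threshold)

compiler.Compiler().compile(monitoring_pipeline, "monitoring_pipeline.json")

# Submit the compiled pipeline to Vertex AI Pipelines (project and region are placeholders).
from google.cloud import aiplatform
aiplatform.init(project="your-project-id", location="us-central1")
aiplatform.PipelineJob(
    display_name="monitoring-demo",
    template_path="monitoring_pipeline.json",
).run()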


Lab
Vertex AI: Predicting Loan Risk with AutoML
2 hours 30 minutes | Free
Overview
In this lab, you use Vertex AI to train and serve a machine learning model to predict
loan risk with a tabular dataset.

Objectives

You learn how to:

 Upload a dataset to Vertex AI.


 Train a machine learning model with AutoML.
 Evaluate the model performance.
 Deploy the model to an endpoint.
 Get predictions.

Setup
Before you click the Start Lab button
Note: Read these instructions.

Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab,
shows how long Google Cloud resources will be made available to you.

This Qwiklabs hands-on lab lets you do the lab activities yourself in a real cloud
environment, not in a simulation or demo environment. It does so by giving you new,
temporary credentials that you use to sign in and access Google Cloud for the
duration of the lab.

What you need


To complete this lab, you need:

 Access to a standard internet browser (Chrome browser recommended).


 Time to complete the lab.
Note: If you already have your own personal Google Cloud account or project, do not use it for this lab.
Note: If you are using a Pixelbook, open an Incognito window to run this lab.

How to start your lab and sign in to the Console


1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for
you to select your payment method. On the left is a panel populated with the
temporary credentials that you must use for this lab.

2. Copy the username, and then click Open Google Console. The lab spins up
resources, and then opens another tab that shows the Choose an
account page.

Note: Open the tabs in separate windows, side-by-side.

3. On the Choose an account page, click Use Another Account. The Sign in page
opens.
4. Paste the username that you copied from the Connection Details panel. Then
copy and paste the password.

Note: You must use the credentials from the Connection Details panel. Do not use your
Google Cloud Skills Boost credentials. If you have your own Google Cloud account, do not
use it for this lab (avoids incurring charges).

5. Click through the subsequent pages:


 Accept the terms and conditions.
 Do not add recovery options or two-factor authentication (because this is a temporary
account).
 Do not sign up for free trials.
After a few moments, the Cloud console opens in this tab.

Note: You can view the menu with a list of Google Cloud products and services by clicking the Navigation menu at the top-left.

Introduction to Vertex AI
This lab uses Vertex AI, the unified AI platform on Google Cloud, to train and deploy an ML model. Vertex AI offers two options on one platform to build an ML model: a codeless solution with AutoML and a code-based solution with Custom Training using Vertex Workbench. You use AutoML in this lab.
In this lab, you build an ML model to determine whether a particular customer will repay a loan.
Task 1. Prepare the training data
The initial Vertex AI dashboard illustrates the major stages to train and deploy an ML model: prepare the training data, train the model, and get predictions. Later, the
dashboard displays your recent activities, such as the recent datasets, models,
predictions, endpoints, and notebook instances.

Create a dataset

1. In the Google Cloud console, on the Navigation menu, click Vertex AI >


Datasets.
2. Click Create dataset.
3. On the Datasets page, give the dataset a name.
4. For the data type and objective, click Tabular, and then
select Regression/classification.
5. Click Create.

Upload data

There are three options to import data in Vertex AI:

 Upload a local file from your computer.


 Select files from Cloud Storage.
 Select data from BigQuery.
For convenience, the dataset is already uploaded to Cloud Storage.

1. For the data source, select Select CSV files from Cloud Storage.
2. For Import file path, enter:

spls/cbl455/loan_risk.csv

3. Click Continue.
Note: You can also configure this page by clicking Datasets on the left menu and then
selecting the dataset name on the Datasets page.
(Optional) Generate statistics

1. To see the descriptive statistics for each column of your dataset, click Generate statistics. Generating the statistics might take a few minutes, especially the first time.
2. When the statistics are ready, click each column name to display analytical
charts.

Task 2. Train your model


With a dataset uploaded, you're ready to train a model to predict whether a customer
will repay the loan.

 Click Train new model.

Training method

The dataset is called LoanRisk.

1. For Objective, select Classification.

You select classification instead of regression because you are predicting a distinct number (whether a customer will repay a loan: 0 for repay, 1 for default/not repay) instead of a continuous number.

2. Click Continue.
Model details

Specify the name of the model and the target column.

1. Give the model a name, such as LoanRisk.


2. For Target column, select Default.
3. (Optional) Explore Advanced options to determine how to assign the training
vs. testing data and specify the encryption.
4. Click Continue.

Training options

Specify which columns you want to include in the training model. For example,
ClientID might be irrelevant to predict loan risk.

1. Click the minus sign on the ClientID row to exclude it from the training model.
2. (Optional) Explore Advanced options to select different optimization
objectives. For more information about optimization objectives for tabular
AutoML models, refer to the Optimization objectives for tabular AutoML
models guide.
3. Click Continue.

Compute and pricing

1. For Budget, which represents the number of node hours for training, enter 1.
Training your AutoML model for 1 compute hour is typically a good start for
understanding whether there is a relationship between the features and label
you've selected. From there, you can modify your features and train for more
time to improve model performance.
2. Leave early stopping enabled.
3. Click Start training.

Depending on the data size and the training method, the training can take from a few
minutes to a couple of hours. Normally you would receive an email from Google
Cloud when the training job is complete. However, in the Qwiklabs environment, you
will not receive an email.
To avoid waiting for the model training to finish, you use a pre-trained model in Task 5 to get predictions in Task 6. This pre-trained model is the result of training with the same steps described in Tasks 1 and 2.

Task 3. Evaluate the model performance (demonstration only)
Vertex AI provides many metrics to evaluate the model performance. You focus on
three:

 Precision/Recall curve
 Confusion Matrix
 Feature Importance
Note: If you had a model trained, you could navigate to the Models tab in Vertex AI.

1. Navigate to the Models tab.

2. Click on the model you just trained.

3. Browse the Evaluate tab.

However, in this lab, you can skip this step because you use a pre-trained model.
The precision/recall curve

The confidence threshold determines how a ML model counts the positive cases. A
higher threshold increases the precision, but decreases recall. A lower threshold
decreases the precision, but increases recall.

You can manually adjust the threshold to observe its impact on precision and recall
and find the best tradeoff point between the two to meet your business needs.

The confusion matrix

A confusion matrix tells you the percentage of examples from each class in your test
set that your model predicted correctly.
The confusion matrix shows that your initial model is able to predict 100% of the
repay examples and 87% of the default examples in your test set correctly, which is
not too bad.

You can improve these percentages by adding more examples (more data), engineering new features, or changing the training method.

The feature importance

In Vertex AI, feature importance is displayed through a bar chart to illustrate how
each feature contributes to a prediction. The longer the bar, or the larger the
numerical value associated with a feature, the more important it is.
These feature importance values could be used to help you improve your model and
have more confidence in its predictions. You might decide to remove the least
important features next time you train a model or to combine two of the more
significant features into a feature cross to see if this improves model performance.
Feature importance is just one example of Vertex AI’s comprehensive machine
learning functionality called Explainable AI. Explainable AI is a set of tools and
frameworks to help understand and interpret predictions made by machine learning
models.
Task 4. Deploy the model (demonstration only)
Note: You will not deploy the model to an endpoint because the model training can take an
hour. Here you can review the steps you would perform in a production environment.

Now that you have a trained model, the next step is to create an endpoint in Vertex. A
model resource in Vertex can have multiple endpoints associated with it, and you can
split traffic between endpoints.

Create and define an endpoint

1. On your model page, on the Deploy and test tab, click Deploy to endpoint.


2. For Endpoint name, enter a name for your endpoint, such as LoanRisk.
3. Click Continue.

Model settings and monitoring

1. Leave the traffic splitting settings as-is.


2. As the machine type for your model deployment, under Machine type,
select n1-standard-8, 8 vCPUs, 30 GiB memory.
3. Click Continue.
4. In Model monitoring, click Continue.
5. In Model objectives > Training data source, select Vertex AI dataset.
6. Select your dataset from the drop down menu.
7. In Target column, enter Default
8. Leave the remaining settings as-is and click Deploy.

Your endpoint will take a few minutes to deploy. When it is completed, a green check
mark will appear next to the name.

Now you're ready to get predictions on your deployed model.
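
For reference, the same deployment could also be scripted with the Vertex AI Python SDK. This is only a hedged sketch; the project, region, and model ID shown are placeholders rather than values from this lab:

from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# The numeric model ID is hypothetical; copy the real one from the Vertex AI Models page.
model = aiplatform.Model("projects/your-project-id/locations/us-central1/models/1234567890")

endpoint = model.deploy(
    deployed_model_display_name="LoanRisk",
    machine_type="n1-standard-8",
)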


Task 5. SML Bearer Token

Retrieve your Bearer Token

To allow the pipeline to authenticate, and be authorized to call the endpoint to get the
predictions, you will need to provide your Bearer Token.

Note: Follow the instructions below to get your token. If you have issues getting the Bearer
Token, this can be due to cookies in the incognito window. If this is happening to you, try this
step in a non-incognito window.

1. Log in to gsp-auth-kjyo252taq-uc.a.run.app.
2. When logging in, use your student email address and password.
3. Click the Copy button. This will copy a very long token to your clipboard.

Note: This token will only be available for about 60 seconds, so copy it and move on to the next steps.
Note: If you have issues getting the Bearer Token, this can be due to cookies in the incognito window; try in a non-incognito window.
Task 6. Get predictions
In this section, use the Shared Machine Learning (SML) service to work with an
existing trained model.

ENVIRONMENT VARIABLE   VALUE
AUTH_TOKEN             Use the value from the previous section
ENDPOINT               https://sml-api-vertex-kjyo252taq-uc.a.run.app/vertex/predict/tabular_classification
INPUT_DATA_FILE        INPUT-JSON

To use the trained model, you will need to create some environment variables.

1. Open a Cloud Shell window.


2. Replace INSERT_SML_BEARER_TOKEN with the bearer token value from the
previous section:

AUTH_TOKEN="INSERT_SML_BEARER_TOKEN"

3. Download the lab assets:


gsutil cp gs://spls/cbl455/cbl455.tar.gz .

4. Extract the lab assets:


tar -xvf cbl455.tar.gz

5. Create an ENDPOINT environment variable:


ENDPOINT="https://sml-api-vertex-kjyo252taq-uc.a.run.app/vertex/predict/tabular_classification"

6. Create an INPUT_DATA_FILE environment variable:


INPUT_DATA_FILE="INPUT-JSON"

Note: After the lab assets are extracted, take a moment to review the contents.

The INPUT-JSON file is used to provide Vertex AI with the model data required. Alter this file
to generate custom predictions.

The smlproxy application is used to communicate with the backend.

The file INPUT-JSON is composed of the following values:

age      ClientID   income      loan
40.77    997        44964.01    3944.22

7. Test the SML Service by passing the parameters specified in the environment
variables.
8. Perform a request to the SML service:

./smlproxy tabular \
-a $AUTH_TOKEN \
-e $ENDPOINT \
-d $INPUT_DATA_FILE

This query should result in a response similar to this:

SML Tabular HTTP Response:


2022/01/10 15:04:45 {"model_class":"0","model_score":0.9999981}
9. Alter the INPUT-JSON file to test a new scenario:

age      ClientID   income      loan
30.00    998        50000.00    20000.00

10. Test the SML Service by passing the parameters specified in the environment
variables.
11. Edit the file INPUT-JSON and replace the original values.
12. Perform a request to the SML service:

./smlproxy tabular \
-a $AUTH_TOKEN \
-e $ENDPOINT \
-d $INPUT_DATA_FILE

In this case, assuming that the person's income is 50,000, age 30, and loan 20,000,
the model predicts that this person will repay the loan.

This query should result in a response similar to this:

SML Tabular HTTP Response:


2022/01/10 15:04:45 {"model_class":"0","model_score":1.0322887E-5}
The same action can also be performed from the Google Cloud console.

You can now use Vertex AI to:

 Upload a dataset.
 Train a model with AutoML.
 Evaluate the model performance.
 Deploy the trained AutoML model to an endpoint.
 Get predictions.
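
If you later deploy your own endpoint, an alternative to the smlproxy tool used in this lab is to call the endpoint with the Vertex AI Python SDK. This is a hedged sketch with placeholder project, region, and endpoint ID, not values from the lab environment:

from google.cloud import aiplatform

aiplatform.init(project="your-project-id", location="us-central1")

# The endpoint ID is hypothetical; use the ID of the endpoint you deployed.
endpoint = aiplatform.Endpoint("projects/your-project-id/locations/us-central1/endpoints/1234567890")

response = endpoint.predict(instances=[
    {"age": "40.77", "ClientID": "997", "income": "44964.01", "loan": "3944.22"}
])
print(response.predictions)
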
🎉 Congratulations! 🎉

To learn more about different parts of Vertex AI, refer to the Vertex AI documentation.

You can build two types of model with tabular data. The model type is chosen automatically based on the data type of your target column.
 Regression models predict a numeric value. For example, predicting house prices or consumer spending.
 Classification models predict a category from a fixed number of categories. Examples include predicting whether an email is spam or not, or the courses a student might be interested in taking.

Well done on completing the AutoML lab! You’ve now had a chance to use Vertex AI to build a
machine learning model without writing a single line of code. Let’s take a few moments to review the
results of the lab, starting with the confusion matrix.

00:15But before we review them, please pause to consider the matrix results for yourself. The
true positives were 100%. This represents the percentage of people the model predicted
would repay their loan who

00:28actually did pay it back. The true negatives were 87%. This represents the percentage
of people the model predicted would not repay their loan who indeed did not pay it back.

00:44The false negatives were 0%. This represents the percentage of people the model
predicted would not repay their loan, but who actually did pay it back. And finally, the false
positives were 13%.

00:58This represents the percentage of people the model predicted would repay their loan,
but who actually did not pay it back. As a general principle, it’s good to have high true
positives and true negatives, and

01:11low false positives and false negatives. However, how high or low they need to be
really depends on the business goals you’re looking to achieve. There are different ways to
improve the performance of a model, which might include using a more

01:24accurate data source, using a larger dataset, choosing a different type of ML model, or
tuning the hyperparameters. Let’s also review the precision-recall curve from the AutoML lab.
The confidence threshold determines how a machine learning model counts the positive

01:42cases. A higher threshold increases the precision, but decreases recall. A lower
threshold decreases the precision, but increases recall. Moving the threshold to zero
produces the highest recall of 100%, and the lowest precision

01:58of 50%. So, what does that mean? That means the model predicts that 100% of loan
applicants will be able to repay a loan they take out. However, in actuality, only 50% of people
were able to repay the loan.

02:16Using this threshold to identify the default cases in this example can be risky, because
it means that you’re only likely to get half of the loan investment back. Now let’s move to the
other extreme by moving the threshold to 1.

02:30This will produce the highest precision of 100% with the lowest recall of 1%. What
does this mean? It means that of all the people who were predicted to repay the loan, 100%
of them actually did.
02:43However, you rejected 99% of loan applicants by only offering loans to 1% of them.
That’s a pretty big loss of business for your company. These are both extreme examples, but
it’s important that you always try to set an appropriate

02:57threshold for your model.
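
To see that trade-off numerically, here is a small sketch with made-up labels and scores (1 = default, 0 = repay) that sweeps the confidence threshold and prints the resulting precision and recall:

from sklearn.metrics import precision_score, recall_score

labels = [1, 1, 1, 0, 0, 0]              # hypothetical ground truth
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]  # hypothetical model confidence for class 1

for threshold in (0.2, 0.5, 0.8):
    predictions = [1 if score >= threshold else 0 for score in scores]
    print(threshold,
          precision_score(labels, predictions),
          recall_score(labels, predictions))
# Lower thresholds raise recall and lower precision; higher thresholds do the opposite.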

Quiz

1.
Which stage of the machine learning workflow includes feature engineering?

Model training
Model serving
Data preparation (correct answer)

That answer is correct!

2.
A hospital uses Google's machine learning technology to help pre-diagnose cancer by feeding historical patient medical data to the model. The goal is to identify as many potential cases as possible. Which metric should the model focus on?

Recall (correct answer)
Confusion matrix
Feature importance
Precision

That answer is correct!

3.
Which Vertex AI tool automates, monitors, and governs machine learning systems by orchestrating the workflow in a serverless manner?

Vertex AI console
Vertex AI Feature Store
Vertex AI Pipelines (correct answer)
Vertex AI Workbench

That answer is correct!

4.
Select the correct machine learning workflow.

Data preparation, model serving, model training
Data preparation, model training, model serving (correct answer)
Model serving, data preparation, model training
Model training, data preparation, model serving

That answer is correct!

5.
Which stage of the machine learning workflow includes model evaluation?

Model training (correct answer)
Model serving
Data preparation

That answer is correct!

6.
A farm uses Google's machine learning technology to detect defective apples in their crop, such as those that are irregular in size or have scratches. The goal is to identify only the apples that are actually bad so that no good apples are wasted. Which metric should the model focus on?

Recall
Confusion matrix
Feature importance
Precision (correct answer)
