You are on page 1of 29

1

Module 2 In this module we will:


• Walkthrough Data Analyst Tasks, Challenges,
and Introduce Google Cloud Platform
Big Data
Data Tools
Tools Overview • Demo: Analyze 10 Billion Records with
Google BigQuery
• Explore 9 Fundamental BigQuery Features
• Compare GCP Tools for Analysts, Data
Scientists, and Data Engineers

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
2

A data analyst is responsible for analyzing and gleaning


insights from data

Ingest Transform Store Analyze Visualize


Get data in. Prepare, clean, and Create, save, and Derive insights Explore and
transform data. store datasets. from data. present data
insights.

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
3

Challenges in each task prevent data analysts from getting to


scalable insights

Ingest Transform Store Analyze Visualize


Get data in. Prepare, clean, and Create, save, and Derive insights Explore and
transform data. store datasets. from data. present data
insights.

Challenges Challenges Challenges


Data Volume Slow Exploration Storage Cost Slow Queries
Data Variety Slow Processing Hard to Scale Data Volume Dataset Size
Data Velocity Unclear Logic Latency Issues Siloed Data Tool Latency

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
4

Choosing the Right Tools


5

Google Cloud Platform offers scalable big data tools to


overcome data challenges

Ingest Transform Store Analyze Visualize


Get petabytes of Prepare, clean, and Create, save, and Derive insights Explore and
data in from a transform data store datasets from data at scale present interactive
variety of formats. quickly and easily. inexpensively. and without and impactful data
managing servers. insights.

BigQuery BigQuery Cloud Cloud BigQuery BigQuery Third-party Tools


Storage Analysis Dataprep Storage Storage Analysis (Tableau, Looker, Qlik)
(import) (SQL) (preparation) (buckets) (tables) (SQL)

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
6

Module 2 In this module we will:


• Walkthrough Data Analyst Tasks, Challenges,
and Introduce Google Cloud Platform
Big Data
Data Tools
Tools Overview • Demo: Analyze 10 Billion Records with
Google BigQuery
• Explore 9 Fundamental BigQuery Features
• Compare GCP Tools for Analysts, Data
Scientists, and Data Engineers

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
BigQuery Demo using 10 Billion+ rows
#standardSQL

# Demo processing 10 Billion Wikipedia records


SELECT
language,
title,
SUM(views) AS views
FROM 500
`bigquery-samples.wikipedia_benchmark.Wiki10B` GB
WHERE
title LIKE '%Google%'
GROUP BY
language,
title
ORDER BY
views DESC;

?
© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
8

Explore and visualize large datasets with Data Studio

Insight
Spike in Revenue Mid-July associated with
our annual summer sales event.

Take Action
Did sales meet or beat expectations? Do we
have inventory reordering issues?

Insight
High Revenue from Desktop could suggest
poor Mobile experience.

Take Action
Should we do a mobile UI/UX audit?

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
9

Module 2 In this module we will:


• Walkthrough Data Analyst Tasks, Challenges,
and Introduce Google Cloud Platform
Big Data
Data Tools
Tools Overview • Demo: Analyze 10 Billion Records with
Google BigQuery
• Explore 9 Fundamental BigQuery Features
• Compare GCP Tools for Analysts, Data
Scientists, and Data Engineers

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
10

Don’t Manage Infrastructure


11

Focus on Finding Insights


12

Google BigQuery is a petabyte-scale data analytics warehouse

Fully-Managed Data Warehouse:


1 No-Ops, Petabyte-Scale

Reliability: Backed by Google


2 Datacenters

Economical: Pay only for the


3 processing and storage you use

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
13

Google BigQuery is a petabyte-scale data analytics warehouse

Security: Role ACLs, Data Encrypted


4 in Transport and at Rest

Auditable: Every Transaction


5 Logged and Queryable

Scalable: Highly Parallel Processing


6 Model means Fast Queries

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
14

Google BigQuery is a petabyte-scale data analytics warehouse

Flexible: Mashup Data across


7 Multiple Datasets

Easy-to-use: Familiar SQL, No


8 Indexes, Open Standards

Public Datasets: Explore and


9 Practice with Real Datasets (NOAA,
IRS, GitHub, NYC Taxi etc.)

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
15

Three Ways to Interface with BigQuery

Web UI Command-Line REST API


Interface (CLI) Programmatically run queries
using languages like Java and
Build, validate, and run queries
Use Cloud Shell or the Google Python over HTTP
quickly through the Web UI.
Cloud SDK (gcloud) to interact
This will be our primary focus GET
through a terminal
https://www.googleapis.co
for this course.
m/bigquery/v2/projects/pr
bq mk [DATASET_ID]
ojectId/queries/jobId

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
16

Creating and Querying Datasets: BigQuery Terminology

User Query BigQuery Analytics Engine BigQuery Managed Storage

New Query Job Analytics Engine BigQuery Dataset Managed Storage


Multiple Tables
BigQuery BigQuery BigQuery

Table Name
Columnar Layout

User queries
a BigQUery Table Name
Columnar Layout
dataset

Data Ingestion to Cloud


Data can be Offline Data
ingested from Batch Load

raw formats Files to be


Uploaded
Raw Data
Cloud Storage
into GCS →
BigQuery
© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
17

Google BigQuery is actually two services in one

BigQuery BigQuery
Managed Storage Analysis

Fully-managed and Fast massively parallel


scalable data storage SQL Engine based on
that is based on the Google’s own internal
same technology that Dremel query engine
stores Google’s product technology
data (ads, gmail etc.)

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
18

Google innovates data technologies

Flume
MapReduce Dremel Millwheel TensorFlow

GFS Megastore
BigTable Colossus PubSub
Spanner F1

2002 2004 2006 2008 2010 2012 2014 2016

Google Research Publications referenced are available here: http://research.google.com/pubs/papers.html


The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2009
http://research.google.com/pubs/pub35290.html

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
19

GCP opens
Google up that
Cloud Platform innovation
opens for youtoto
up that innovation use
you

DataFlow
DataProc BigQuery DataFlow ML

Cloud Storage DataStore


BigTable Cloud Storage PubSub

2002 2004 2006 2008 2010 2012 2014 2016


Google Research Publications referenced are available here: http://research.google.com/pubs/papers.html
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, 2009
http://research.google.com/pubs/pub35290.html

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
20

Module 2 In this module we will:


• Walkthrough Data Analyst Tasks, Challenges,
and Introduce Google Cloud Platform
Big Data
Data Tools
Tools Overview • Demo: Analyze 10 Billion Records with
Google BigQuery
• Explore 9 Fundamental BigQuery Features
• Compare GCP Tools for Analysts, Data
Scientists, and Data Engineers

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
21

Each data-related role uses a different suite of tools

Roles: Data Analyst Data Scientist Data Engineer

What they do: Derive data insights Analyze data and model Designs, builds, and
from queries and systems using statistics maintains data
visualization. and machine learning. processing systems.

Data analysis Statistical analysis using Computer Engineering


Background:
using SQL SQL, R, Python

GCP Tools Used:

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
22

End-to-end gaming analytics example highlighting GCP tools


Architecture: Gaming > Gaming Analytics

Streaming

Real-Time Events Authentication


Multiple Platforms App Engine

Async Messaging Data Exploration


Cloud Pub/Sub Cloud Datalab

Report & Share


Business Analysis

Data Processing Analytics Engine


Cloud Dataflow BigQuery

Batch Analyze the


Report on
Gaming Logs
Events
Batch Load

Log Data
Insights
Cloud Storage
Ingest Gaming
Log Data
© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
23

Summary: GCPdata
Summary: Review offers youtasks
analyst the and
ability
toolsto:

Reviewed Data Data Analysts Explored the 9 Compared Data


Analyst tasks: will use Cloud Features that Analysts, Data
Ingest, Storage, make BigQuery is Scientists, and
Transform, Store, BigQuery, Cloud a Petabyte-Scale Data Engineers
Analyze, and Data Prep, and Data Analytics
Visualize Data Google Data Warehouse
Studio
© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
24

Lab 1
Exploring your Public Dataset
with Google BigQuery

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
BigQuery hosts 50+ public datasets for SQL practice

Public datasets include


flights, taxi cab logs,
weather recordings, and
many more

Example SQL code is


provided for practice

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
Your course dataset is millions of U.S. charity tax filings

Form 990

U.S. Internal Revenue


Service (IRS) collects taxes
from all individuals and
businesses

To subsidize charitable efforts, the US Internal


Nonprofit charities like hospitals, schools, animal Revenue Service (tax) allows these organizations
shelters and more run programs to avoid paying tax by filing a special form annually

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
Form 990 Preview
Form 990 is the special form
that nonprofit organizations
must file annually to receive
their special tax exemption

Employees
It contains key financial
information and is then
published publicly by Revenue
the IRS for the public
to also inspect
Expenses

Assets

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
IRS BigQuery public dataset has two primary table types

Annual tax exempt filings by


organization by year

Organization details lookup table by


Employer Identification Number (EIN).
The EIN uniquely identifies each charity
much like a phone number or passport
number for individuals

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.
Exploring your Dataset with Google BigQuery

● Locate and Query the IRS_990 BigQuery Public Dataset

● Explore dataset and table metadata using the Google


BigQuery UI

● Enable the Standard SQL dialect for your queries U.S. Internal Revenue
Service (IRS) collects taxes
from all individuals and
● Perform basic stats and counts on data tables using businesses
Standard SQL in the Google BigQuery UI

● Find duplicate records in a data table using SQL

© 2017 Google Inc. All rights reserved. Google and the Google logo are trademarks of Google Inc. All other
company and product names may be trademarks of the respective companies with which they are associated.

You might also like