Using Google Analytics With R
Table of Contents
Introduction 1.1
Why? 1.1.1
About Google Analytics 1.1.2
About R 1.1.3
Author 1.1.4
Prepare environment 1.2
Data sources 1.2.1
Creating Google Analytics account 1.2.2
Getting credentials for Google Analytics API 1.2.3
Installing Google Analytics on website 1.2.4
Installing R Studio 1.2.5
Summary 1.2.6
First steps 1.3
Introduction to R 1.3.1
Connection with Google Analytics 1.3.2
googleAnalyticsR package 1.3.3
Import and export data to CSV 1.3.4
Code repository 1.3.5
Summary 1.3.6
Exploratory data analysis 1.4
Exploratory data analysis 1.4.1
Data visualization 1.5
Data visualization in R 1.5.1
Traffic heatmap 1.5.2
Device comparison 1.5.3
Machine Learning 1.6
Clustering (k-means) 1.6.1
Generating reports 1.7
Introduction to R Markdown 1.7.1
Create report 1.7.2
Additional analysis 1.8
Anomaly detection 1.8.1
Forecasting 1.8.2
Resources 1.9
Blogs 1.9.1
Documentation 1.9.2
Online trainings 1.9.3
Books 1.9.4
Introduction
What Google Analytics is and how to collect web traffic data with it.
What R is and how to analyze data from Google Analytics in R Studio.
How to discover the knowledge hidden in your website's traffic data.
Feel free to share this book and read it online or offline. Thanks to Gitbook.io you can
download it in different formats: printable .pdf and e-book reader formats such as .epub
and .mobi.
This is still a development version. If you want to help develop this book, feel free to contact
the author via:
about.me
michalbrys.com
Why?
I've decided to write this book to show how much value is hidden in data. If you have a website,
you are probably collecting data about web traffic. But do you use this data to make business
decisions?
Nowadays we are swimming in a data lake. Only if you know how to use this data will you stay
on the surface :). The first step is to regularly check the standard reports in your web analytics tool
(e.g. Google Analytics).
But to stay competitive you need something more. Everybody talks about data collection,
but only a few tell you what to do with the data after collecting it. I try to describe this process
and give you some ideas on how to deal with data from Google Analytics using R.
In this book I will share my experience in this field. I hope that it will be useful, interesting,
sometimes funny, and that it will save you time :)
Target audience
I wrote this book for marketers who have worked with Google Analytics, know the basic metrics
included in this tool, and know its web interface. I hope that this material will be helpful in
learning how to extend the features of Google Analytics in daily work and in learning how to use R.
If you are an analyst who knows R perfectly, I hope that you will also find some inspiration in this
book, especially in learning how to connect Google Analytics as an additional data source in
R and what kinds of analysis you can perform on this data.
About Google Analytics
It's also the most popular free web analytics tool on the Internet according to a BuiltWith report
(Feb 2016). It's a complete analytics platform offering solutions for collecting, analyzing and reporting
data. Google Analytics also offers free APIs for exporting data to external systems.
Terms of service
A common question is whether this great tool is really free. To be precise, according to the Google
Analytics Terms of Service:
Service is provided without charge to You for up to 10 million Hits per month per
account.
If you exceed this quota, you should think about Google Analytics 360, formerly the Google
Analytics Premium service. This paid premium version offers a data collection quota that is
many times larger.
What is a hit?
As you read above, your Google Analytics account has a limit of 10,000,000 hits per month. So
what is a hit?
Hit - An interaction that results in data being sent to Analytics. Common hit types
include page tracking hits, event tracking hits, and ecommerce hits.
Each time the tracking code is triggered by a user’s behavior (for example, user loads a
page on a website or a screen in a mobile app), Analytics records that activity. Each
interaction is packaged into a hit and sent to Google’s servers. Examples of hit types
include:
About R
What is R?
R is a programming language and software environment for statistical computing and
graphics supported by the R Foundation for Statistical Computing. The R language is widely
used among statisticians and data miners for developing statistical software and data
analysis. Polls, surveys of data miners, and studies of scholarly literature databases show
that R's popularity has increased substantially in recent years. (Source: Wikipedia)
Free.
Offers a lot of libraries for different statistical computations. Current list of packages
A lot of educational materials (tutorials, MOOCs, blogs) are available for free on the Internet.
Has strong community support.
Runs on different platforms (Windows, Mac, Unix). A version for server installation
is also available.
Fast because of in-memory computation.
Disadvantages?
R is not an out-of-the-box solution with a GUI for every analytical problem. You need to write
some code to get a result. This can sometimes be a barrier for non-technical people
starting out. But I hope that if you are reading this book, it is not a problem for you :)
The advantage of in-memory computation is sometimes a trap. In a standard
installation you can only process a data set that fits into the RAM of your machine. If
you have really big data to process, think about other solutions like Hadoop
(MapReduce) or Apache Spark. If you feel comfortable with R you can run your scripts
on those platforms too (reading from HDFS or using SparkR). That is a more advanced topic for
another book ;)
Author
Michał Bryś
Data scientist
Michał has worked in the internet industry since 2009. He is an expert in web analytics in the e-commerce
context, especially using Google Analytics & Google Tag Manager. He loves mining big data
sets and transforming information into actionable knowledge. He loves creating stories from
numbers. He graduated from AGH University of Science and Technology and the University of
Economics in Cracow. Michał is a member of Google Developers Group Cracow.
about.me
michalbrys.com
Preparing environment
To analyze data you will need to set up:
Data sources
You can find the most popular scenarios website.
It's a community connecting web analysts and NGOs looking for help with digital analytics.
You can also contact your university, family and friends and offer to help with digital
analytics.
support.google.com/analytics/answer/6367342
Traffic source data: information about where website visitors originate. This includes
data about organic traffic, paid search traffic, display traffic, etc.
Content data: information about the behavior of users on the site. This includes the
URLs of pages that visitors look at, how they interact with content, etc.
Transactional data: information about the transactions that occur on the Google
Merchandise Store website.
Creating Google Analytics account
If you don't have any accounts connected with your Google Account you will see this screen:
Account details
To create a Google Analytics account, fill in the form with:
Account Name. (Note: one account may have several tracking IDs, so you can use one
account per organization/company with many websites.)
Website Name, Website URL and Reporting Time Zone. (Note: the correct time
zone is critically important - your data will be split into dates in reports using this
value.)
To complete the registration process, click Get Tracking ID and accept the Google Analytics Terms
of Service.
After this you will see instructions on how to install the Google Analytics Tracking Code on your
website:
Getting credentials for Google Analytics API
Search: Analytics
Select Enable
Create credentials:
Get credentials:
Save the Client ID and Client Secret. You need them to configure the library that pulls data from
Google Analytics into R.
Installing Google Analytics on website
Example code:
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
// UA-XXXXX-Y is a placeholder: replace it with your own tracking ID
ga('create', 'UA-XXXXX-Y', 'auto');
ga('send', 'pageview');
</script>
Install the tracking code on your website, between the <head></head> tags, on every page you want
to track.
To do this you need access to your website's source code, or you can contact your
webmaster.
Alternatively you can install Google Analytics via Google Tag Manager. I personally
recommend that way because it will save you a lot of time in the future :)
Installing R Studio
Go to R Studio and download R Studio Desktop - a graphical interface tool for the R language.
You can use R Studio for free under the AGPL 3 license. A paid version with more functionality is also
available.
install.packages("package name")
install.packages("ggplot2")
After installing a package and before using it, you should load it into the current session:
library("ggplot2")
Summary
In this chapter you learned:
How to create an account, and how to configure and install Google Analytics on your website.
How to download and set up R Studio.
How to get credentials to download data from Google Analytics into R.
First steps
In this chapter you will set up your environment by installing R Studio, creating a Google
Analytics account, and connecting the two tools via the API.
Introduction to R
Try typing some basic instructions in the console (the bottom-left window in R Studio).
Arithmetic operations
> 1+1
[1] 2
> 2*4
[1] 8
Using variables
You can assign a value to a variable using the <- operator (more popular) or the = operator. You can find
some basic examples below.
Numeric variables
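A minimal sketch of working with numeric variables. The variable names and values are chosen purely for illustration and do not come from the book's repository:

```r
# Illustrative names and values only
sessions <- 120                       # assignment with <-
pageviews = 300                       # assignment with = also works
pages_per_session <- pageviews / sessions
print(pages_per_session)              # prints [1] 2.5
```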
Text variables
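A short sketch of a text (character) variable, again with made-up values used only for illustration:

```r
# Illustrative value only
site_name <- "My website"
class(site_name)                                   # "character"
greeting <- paste("Traffic report for", site_name) # join strings with a space
nchar(site_name)                                   # number of characters: 10
```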
Vectors
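A vector holds several values of the same type. The sketch below assumes nothing beyond base R and reuses the session counts from the data frame example in this chapter:

```r
# Combine values into a vector with c()
sessions <- c(101, 80, 70, 50, 30, 60, 20)

sessions[1]               # first element (R indexes from 1): 101
length(sessions)          # number of elements: 7
sum(sessions)             # total: 411
sessions[sessions > 60]   # filter by condition: 101 80 70
```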
Data frames
A more commonly used structure than the one-dimensional vector is the two-dimensional data structure called
data.frame .
We'll also save the data returned from a Google Analytics API query as a data.frame
df <- data.frame(
date = c("20160101","20160101","20160101",
"20160101","20160101","20160101","20160101"),
city = c("London","Warsaw","Krakow",
"New York","Paris","Zurich","Sydney"),
sessions = c(101,80,70,50,30,60,20)
)
> df
> head(df)
> colnames(df)
> df$city
And select only the unique values of a column (we have sessions for only one date: 2016-01-01):
> unique(df$date)
[1] 20160101
Levels: 20160101
Select column 2:
> df[,2]
Select row 1:
> df[1,]
> df[1,1]
[1] 20160101
Levels: 20160101
Connection with Google Analytics
install.packages("googleAuthR")
install.packages("googleAnalyticsR")
library("googleAuthR")
library("googleAnalyticsR")
You will be asked to authorize R to download data from Google Analytics, and your
browser will open the authorization page. Click Agree:
All done. You can now start to send queries via Google Analytics API.
...
ga_id <- 33333333
...
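The query itself is elided above. As a rough sketch only, assuming a recent googleAnalyticsR release in which google_analytics() takes viewId and date_range (older releases used id, start and end instead), a request for daily sessions might look like the following. The view ID is the placeholder from the text, the date range is an assumption based on the 2014 dates in the results, and the calls are guarded so they only run in an interactive session with browser and network access:

```r
library("googleAnalyticsR")

ga_id <- 33333333                             # placeholder view ID from the text
date_range <- c("2014-01-01", "2014-12-31")   # assumed from the 2014 dates in the results

if (interactive()) {
  ga_auth()                                   # opens the browser authorization page
  gadata <- google_analytics(viewId = ga_id,
                             date_range = date_range,
                             metrics = "sessions",
                             dimensions = "date")
}
```

The complete, author-provided version is in the repository file linked at the end of this section.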
Display results
After you successfully run your first query you can check the results fetched from Google
Analytics. Display the first 6 rows of the result:
head(gadata)
date sessions
1 20140101 39
2 20140102 46
3 20140103 47
4 20140104 53
5 20140105 49
6 20140106 15
Congrats! You've downloaded your first data set from your Google Analytics account!
Source code
Complete code for this example in GitHub repository:
https://github.com/michalbrys/R-Google-Analytics/blob/master/1_hello_world.R
googleAnalyticsR package
To pull data from Google Analytics into R we'll use the googleAnalyticsR package by Mark
Edmondson. With this add-on you can use all the features of Google Analytics, including the
latest ones like Google Analytics 360 or the integration with BigQuery.
http://code.markedmondson.me/googleAnalyticsR/
Import and export data to CSV
For example, to import a file named file_to_import.csv from your working
directory and save the data in a data frame df :
df <- read.csv('file_to_import.csv')
If the first line of your file contains data rather than column names, use the header = FALSE option.
Also, if your column separator is something other than a comma , use the sep=';' option to
declare your separator.
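To make the two options concrete, here is a small self-contained sketch. It writes a temporary semicolon-separated file with no header row and reads it back; the file contents and column names are made up for illustration:

```r
# Create a small semicolon-separated file without a header row
csv_path <- tempfile(fileext = ".csv")
writeLines(c("20160101;London;101",
             "20160101;Warsaw;80"), csv_path)

# header = FALSE: the first line is data, not column names
# sep = ";":      columns are separated by semicolons
df <- read.csv(csv_path, header = FALSE, sep = ";",
               col.names = c("date", "city", "sessions"))
str(df)
```

Passing col.names is optional; without it R names the columns V1, V2, V3.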
getwd()
[1] "/Users/michal"
Export data
After conducting an analysis you may want to save the results to a file to use them in other tools. To do
this you need the write.csv function.
If you have data in a data frame called ga.data you can use this code:
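A minimal sketch of such an export. The output file name is hypothetical, the file is written to a temporary directory, and the two rows are taken from the query output shown earlier:

```r
# Stand-in data frame using two rows from the earlier query output
ga.data <- data.frame(date = c("20140101", "20140102"),
                      sessions = c(39, 46))

# Hypothetical file name, written to a temporary directory
out_path <- file.path(tempdir(), "ga_export.csv")

# row.names = FALSE avoids an extra column of row numbers in the file
write.csv(ga.data, file = out_path, row.names = FALSE)
readLines(out_path)
```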
As a result R will export the data to a .csv file. You can open it in any text editor or
spreadsheet (e.g. Microsoft Excel). Another use case is uploading data such as custom dimensions or
campaign cost data to Google Analytics.
Code repository
You can find the R source code for all examples described in this book in my GitHub
repository:
github.com/michalbrys/R-Google-Analytics
Feel free to contribute if you find an issue in the code or want to share your own examples.
Summary
In this chapter you learned:
Exploratory data analysis
Min
What was the minimum number of sessions in 2014?
min(gadata$sessions)
[1] 0
subset(gadata, gadata$sessions == 0)
date sessions
7 20140107 0
8 20140108 0
129 20140509 0
130 20140510 0
131 20140511 0
132 20140512 0
133 20140513 0
134 20140514 0
135 20140515 0
How many days had 0 sessions? Use the nrow() function to count the rows matching this condition.
nrow(subset(gadata, gadata$sessions == 0))
[1] 9
summary(gadata)
Max
When was traffic on your website at its highest? Use the max() function.
> max(gadata$sessions)
[1] 204
> subset(gadata, sessions == 204)
date sessions
59 20140228 204
You can get the same result in a single call by replacing the value with max() . It is shorter but harder
to read:
> subset(gadata, sessions == max(sessions))
date sessions
59 20140228 204
Mean
What is the mean number of sessions per day? To calculate this, use the mean() function.
mean(gadata$sessions)
[1] 27.6
Standard deviation
You can check how much the number of sessions varies from day to day. Use the sd() function.
sd(gadata$sessions)
[1] 22.12984
So the average number of sessions is 27.6 +/- 22.12984. This dataset varies widely,
so in this case it is better not to rely on the average value alone.
Median
If a dataset has a high standard deviation, it's better to calculate the median (the middle value
of the sorted dataset), which is less sensitive to extreme values.
median(gadata$sessions)
[1] 21
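To see why the median is more robust, consider this small sketch with made-up session counts containing one extreme day. The values are illustrative only and are not taken from the Google Analytics data above:

```r
# Made-up daily sessions with one outlier (204)
sessions <- c(10, 12, 15, 21, 25, 30, 204)

mean(sessions)     # pulled upwards by the outlier
median(sessions)   # middle value of the sorted data: 21
```

The single busy day drags the mean well above what a typical day looks like, while the median stays at the middle observation.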
Summary
If you want, you can get all of these statistics with one function: summary .
summary(gadata)
     date              sessions
 Length:365         Min.   :  0.0
 Class :character   1st Qu.: 12.0
 Mode  :character   Median : 21.0
                    Mean   : 27.6
                    3rd Qu.: 40.0
                    Max.   :204.0
As a result you will get basic statistics for numeric variables and description for character
variables.
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/2_eda.R
Data visualization
Data visualization in R
We'll do some exploratory data analysis by visualizing data from Google Analytics in R.
Package ggplot2
According to ggplot2 project site:
ggplot2 is a plotting system for R, based on the grammar of graphics, which tries to
take the good parts of base and lattice graphics and none of the bad parts. It takes care
of many of the fiddly details that make plotting a hassle (like drawing legends) as well
as providing a powerful model of graphics that makes it easy to produce complex multi-
layered graphics.
Using ggplot2
Download data to visualize in chart
As a first step, install (if necessary) and load the package in the current session.
install.packages("ggplot2")
library("ggplot2")
Next, build a query to fetch data about the date and the number of sessions:
head(gadata)
date sessions
1 2016-01-01 199
2 2016-01-02 212
3 2016-01-03 155
4 2016-01-04 210
5 2016-01-05 192
6 2016-01-06 180
Scatter plot
Plot data in time (scatter plot)
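A minimal sketch of such a scatter plot. Instead of querying the API, it builds a stand-in gadata from the six rows shown in the head(gadata) output above, with dates kept as text (an assumption about how the query returns them):

```r
library("ggplot2")

# Stand-in for the query result: the six rows shown in head(gadata)
gadata <- data.frame(
  date = as.character(as.Date("2016-01-01") + 0:5),
  sessions = c(199, 212, 155, 210, 192, 180)
)

# One point per day: date on the x-axis, sessions on the y-axis
p <- ggplot(gadata, aes(x = date, y = sessions)) +
  geom_point()
p
```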
As a result you will get a basic scatter plot of sessions over time:
As you can see, this plot isn't very nice because of the x-axis labels. You can fix this by rotating
them 90 degrees.
Add line:
You can also change point size depending on number of sessions by adding:
size = sessions
color = sessions
Complete code:
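The pieces described in this section (rotated x-axis labels, point size and color mapped to sessions) can be combined as in the sketch below. It uses a stand-in gadata built from the six rows printed earlier rather than a live query, so treat it as an approximation of the complete example in the repository:

```r
library("ggplot2")

# Stand-in for the query result: the six rows shown in head(gadata)
gadata <- data.frame(
  date = as.character(as.Date("2016-01-01") + 0:5),
  sessions = c(199, 212, 155, 210, 192, 180)
)

p <- ggplot(gadata, aes(x = date, y = sessions,
                        size = sessions,      # bigger points for busier days
                        color = sessions)) +  # color scale by sessions
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate x-axis labels
p
```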
Line chart
Plot data in time (line chart) with some styles:
ggplot(gadata,aes(x=date,y=sessions,group=1)) +
geom_line() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# some styles to pivot x-axis labels
And now we can plot data points with added trend line:
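One way to add a trend line is geom_smooth(). The sketch below uses randomly generated stand-in data and a linear fit; both are assumptions, since the smoothing method used for the original chart is not shown here:

```r
library("ggplot2")

# Randomly generated stand-in data: 60 days with a mild upward trend
set.seed(1)
gadata <- data.frame(
  date = as.Date("2016-01-01") + 0:59,
  sessions = round(150 + 0.8 * (0:59) + rnorm(60, sd = 15))
)

p <- ggplot(gadata, aes(x = date, y = sessions)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)  # linear trend with confidence band
p
```

Note that the dates are stored as Date values here, because a trend line needs a continuous x-axis.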
Box plot
As part of exploratory data analysis, you can visualize your traffic on different days of the
week. Is your website traffic seasonal? Which days are the busiest? Let's check by
creating a box plot that illustrates the distribution of the number of sessions for every day of the
week:
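A sketch of such a box plot on randomly generated stand-in data. Here the day of week is derived locally with format(date, "%w"), which numbers days from 0 (Sunday) to 6 (Saturday); a real query could request the day of week from Google Analytics directly instead:

```r
library("ggplot2")

# Randomly generated stand-in data: four full weeks starting on a Sunday
set.seed(1)
gadata <- data.frame(date = as.Date("2016-01-03") + 0:27)
gadata$dayOfWeek <- format(gadata$date, "%w")   # "0" = Sunday ... "6" = Saturday
gadata$sessions <- rpois(28, lambda = 150)

# One box per day of week showing the spread of daily sessions
p <- ggplot(gadata, aes(x = dayOfWeek, y = sessions)) +
  geom_boxplot()
p
```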
0 - Sunday
1 - Monday
2 - Tuesday
3 - Wednesday
4 - Thursday
5 - Friday
6 - Saturday
So in this case, the highest traffic was on Thursday. Fridays are also not bad :)
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/3_data_visualization.R
Traffic heatmap
We will build a more advanced data visualization: a user engagement
heatmap. The darker the color, the higher the user engagement (avgSessionDuration) was at that
time of day. Inspired by Todd Moy.
# traffic heatmap
# based on https://github.com/toddmoy/Google-Analytics-Heatmap/blob/master/traffic_heatmap.R
# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# install.packages("RColorBrewer")
# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
library("RColorBrewer")
# order data
gadata$dayOfWeekName <- factor(gadata$dayOfWeekName, levels = c("Sunday",
"Monday",
"Tuesday",
"Wednesday",
"Thursday",
"Friday",
"Saturday"))
# assign the result, otherwise the sorted data frame is only printed
gadata <- gadata[order(gadata$dayOfWeekName),]
# plot heatmap
heatmap(heatmap_data,
col=colorRampPalette(brewer.pal(9,"Blues"))(100),
revC=TRUE,
scale="none",
Rowv=NA, Colv=NA,
main="avgSessionDuration by Day and Hour",
xlab="Hour")
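The plotting call above expects heatmap_data to be a matrix with one row per day and one column per hour. The step that builds it is not shown here, so the following is only a sketch under assumptions: it uses randomly generated stand-in data, base R's tapply() to pivot it, and a base R palette instead of RColorBrewer:

```r
# Randomly generated stand-in data: one row per day/hour combination
set.seed(1)
days <- c("Sunday", "Monday", "Tuesday", "Wednesday",
          "Thursday", "Friday", "Saturday")
gadata <- expand.grid(dayOfWeekName = days,
                      hour = sprintf("%02d", 0:23),
                      stringsAsFactors = FALSE)
gadata$avgSessionDuration <- runif(nrow(gadata), min = 0, max = 300)
gadata$dayOfWeekName <- factor(gadata$dayOfWeekName, levels = days)

# Pivot to a 7 x 24 matrix: rows = days, columns = hours
heatmap_data <- with(gadata,
                     tapply(avgSessionDuration, list(dayOfWeekName, hour), mean))

heatmap(heatmap_data,
        col = colorRampPalette(c("white", "steelblue"))(100),
        revC = TRUE, scale = "none", Rowv = NA, Colv = NA,
        main = "avgSessionDuration by Day and Hour", xlab = "Hour")
```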
In this case, Wednesday morning is the most engaging time of day for users :)
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/6_heatmap.R
Device comparison
Let's check how engaged users are on different types of device. To do this, we'll plot 2 charts
describing how many sessions were made from each device type and what the
avgSessionDuration (in seconds) is for each device type.
# device comparison
# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
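The query and plotting code for this example are not shown here. As a rough sketch, the two charts could be drawn as below. The numbers are made up purely to make the example runnable, and the use of the deviceCategory dimension is an assumption about how the data would be queried:

```r
library("ggplot2")

# Made-up stand-in values; real ones would come from a Google Analytics query
gadata <- data.frame(
  deviceCategory = c("desktop", "mobile", "tablet"),
  sessions = c(500, 300, 80),
  avgSessionDuration = c(95, 120, 110)   # seconds
)

# Chart 1: number of sessions per device type
p_sessions <- ggplot(gadata, aes(x = deviceCategory, y = sessions)) +
  geom_col()

# Chart 2: average session duration per device type
p_duration <- ggplot(gadata, aes(x = deviceCategory, y = avgSessionDuration)) +
  geom_col()

p_sessions
p_duration
```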
In this case the longest sessions were made from mobile devices.
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/7_device_comparsion.R
Machine Learning
Clustering (k-means)
The power of R is its wide range of packages with advanced, ready-to-use algorithms. In this
example we'll use k-means for custom user segmentation.
Because this example needs a custom installation of Google Analytics tracking (content
grouping, fingerprint), I've prepared a special dataset for this purpose. You can find the complete
code below.
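The clustering step itself can be sketched with base R's kmeans(). The data below is randomly generated as a stand-in for the prepared dataset; only the column names (beginner_pv, intermediate_pv, advanced_pv, fit.cluster) follow the plotting code in this section, and the choice of 3 clusters is an assumption:

```r
# Randomly generated stand-in data: pageviews per content level for 60 users
set.seed(123)
users <- data.frame(
  beginner_pv     = c(rpois(20, 8), rpois(20, 2), rpois(20, 1)),
  intermediate_pv = c(rpois(20, 2), rpois(20, 8), rpois(20, 2)),
  advanced_pv     = c(rpois(20, 1), rpois(20, 2), rpois(20, 8))
)

# k-means with 3 clusters; nstart repeats the random initialization
fit <- kmeans(users, centers = 3, nstart = 10)

# Append the cluster label; the new column is automatically named fit.cluster
clustered_users <- data.frame(users, fit$cluster)

# Average pageviews per cluster
aggregate(. ~ fit.cluster, data = clustered_users, FUN = mean)
```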
#install.packages("plotly")
library(plotly)
plot_ly(clustered_users,
x = clustered_users$beginner_pv,
y = clustered_users$intermediate_pv,
z = clustered_users$advanced_pv,
type = "scatter3d",
mode = "markers",
color=factor(clustered_users$fit.cluster)
)
Results
Result visualized in plotly package:
> clustered_users
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/5_users_segmentation.R
Generating reports
For every analyst, periodic reports can be time-consuming work. We can automate this
process in R and prepare reporting templates. After that you can run these reports with a different
time range and save them to, e.g., a .pdf file. Sounds interesting?
Introduction to R Markdown
You can use Markdown as follows:
R Markdown options
---
title: "Monthly report"
output: pdf_document
---
Chunks of code
```{r}
# R Code
```
If you don't want the chunk's code to be displayed in the output file, use the echo = FALSE option.
```{r, echo=FALSE}
# R Code
```
Basic formatting
Headers
# Header 1
## Header 2
### Header 3
will produce
Header 1
Header 2
Header 3
Lists
* element 1
* element 2
* element 3
will produce
element 1
element 2
element 3
1. element 1
2. element 2
3. element 3
will produce
1. element 1
2. element 2
3. element 3
Formatting
*italic*
**bold**
***bold+italic***
will produce
More resources
Full documentation:
www.rstudio.com/wp-content/uploads/2015/03/rmarkdown-reference.pdf
Cheat sheet:
www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf
Create report
To generate a basic report template, use this code. The report will contain a title and the
sessions-over-time scatter plot from chapter 2 (Data visualization in R).
You will see a window with some basic configuration options. Change these values now, or
do it later directly in the code.
You can select the output format of your report: HTML , PDF or Word .
---
title: "Google Analytics Traffic Report"
author: "Michal Brys"
output: html_document
---
#install.packages("googleAnalyticsR")
#install.packages("ggplot2")
library("googleAnalyticsR")
library("ggplot2")
Result
As a result you'll get a complete HTML file with the report. You can also generate a PDF file.
Source code
Complete code for this example in GitHub repository:
github.com/michalbrys/R-Google-Analytics/blob/master/8_rmarkdown_report.Rmd
Additional analysis
Anomaly detection
Use: https://github.com/twitter/AnomalyDetection
Forecasting
Forecast future web traffic using the Holt-Winters method. Inspired by Richard Fergie.
# install libraries
# install.packages("googleAuthR")
# install.packages("googleAnalyticsR")
# install.packages("ggplot2")
# install.packages("forecast")
# install.packages("reshape2")
# load libraries
library("googleAuthR")
library("googleAnalyticsR")
library("ggplot2")
library("forecast")
library("reshape2")
plot(forecastmodel)
ggplot(forecastdata, aes(x=day)) +
geom_line(aes(y=actual),color="black") +
geom_line(aes(y=forecast),color="blue") +
geom_ribbon(aes(ymin=forecastlower,ymax=forecastupper), alpha=0.4, fill="green") +
xlim(c(0,90)) +
xlab("Day") +
ylab("Sessions")
Result
As a result you'll get a chart with predictions about your web traffic.
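The model-fitting step behind this chart is not shown in this section. As a minimal sketch of the Holt-Winters method, assuming weekly seasonality and using only base R (the stats package) on randomly generated stand-in data, the forecast and its bounds could be computed like this. The column names mirror those used in the ggplot call above; everything else is illustrative:

```r
# Randomly generated stand-in data: 12 weeks of daily sessions
# with an upward trend and a weekly pattern
set.seed(42)
days <- 1:84
sessions <- 100 + 0.5 * days + 20 * sin(2 * pi * days / 7) + rnorm(84, sd = 5)

# Daily series with a 7-day seasonal period
timeseries <- ts(sessions, frequency = 7)

# Fit Holt-Winters (level, trend and seasonal components)
hw_fit <- HoltWinters(timeseries)

# Predict 14 days ahead with prediction intervals (columns: fit, upr, lwr)
pred <- predict(hw_fit, n.ahead = 14, prediction.interval = TRUE)

forecastdata <- data.frame(
  day = 84 + 1:14,
  forecast = as.numeric(pred[, "fit"]),
  forecastlower = as.numeric(pred[, "lwr"]),
  forecastupper = as.numeric(pred[, "upr"])
)
head(forecastdata)
```

The author's version relies on the forecast package; the complete script is linked under Source code.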
Source code
Complete code for this example in GitHub repository:
https://github.com/michalbrys/R-Google-Analytics/blob/master/4_forecasting.R
Resources
Blogs
R Bloggers
Mark Edmondson
Richard Fergie
...
Documentation
R project - official website
googleAnalyticsR - R package
Online trainings
To learn more about R, I recommend this Coursera MOOC:
R Programming by Johns Hopkins University
Books
A list of books where you can find inspiration for further analysis, with links to free online
versions:
Cookbook for R
www.cookbook-r.com
r4ds.had.co.nz
Think Stats
greenteapress.com/thinkstats