
RETHINKING MODERN WEB ANALYTICS

Table of Contents

1. The state of the web analytics landscape in 2021, and what has changed
2. Privacy updates, ad blockers, and the need for 1st-party tracking
3. Building a web analytics stack – packaged vs modular
4. The best-in-class tools for web analytics
5. Redefining web analytics metrics
6. Data modeling for web analytics
7. How Snowplow can power your web analytics
8. How Welcome to the Jungle took ownership of their web data with Snowplow

CHAPTER 1

THE STATE OF THE WEB ANALYTICS LANDSCAPE IN 2021, AND WHAT HAS CHANGED

The web has evolved dramatically over the last 25 years, from humble beginnings to rich, interactive experiences. Web analytics has struggled to keep pace.
Web analytics has evolved a lot over the last 25 or so years. From the humble beginnings of Webtrends and analyzing raw server log files to see which pages were requested from your server most often, web analytics now helps businesses understand what is happening on their web apps and platforms in greater detail than ever before. But how did we get to where we are today? Technology has changed massively in that time, as has the way people use the web, and it's worth looking back to see how far we've come, and to understand why things have evolved the way they have.


In the early 2000s, the web was a much simpler place. Websites were mostly static, with little interactivity. JavaScript and CSS, the languages that give the web its gloss and make it a more engaging experience, were far less sophisticated than they are now. Not only was the technology underpinning the web much more basic than today, the way people used the web was much simpler. Almost nobody had what we would describe today as a 'smartphone', and even if they did, mobile internet access was very restrictive, as were the websites you could visit on it (the iPhone wasn't released until 2007). Most people accessed the web through a single computer – usually one per household.


Web in its infancy
When people did browse the web, there were also simply fewer things they could do online. Ecommerce was still in its infancy, and the range of products you could buy was limited (Amazon.com had only just branched out from selling books). There were some social media platforms (MySpace, and Facebook, est. 2004), but without the proliferation of smartphones their impact on users' lives was small. The idea that people went to the web to pass the time wasn't really commonplace. Users were much more likely to sit down at their computer "to surf the internet" (as if it were an activity or event) than to have instant access to the internet at all times.

apple.com in 2000 vs. apple.com in 2021

Content consumption on the internet was almost exclusively text- and image-based, primarily text. Connection speeds were nowhere near fast enough to allow for reliable on-demand music or video playback.


Not only was the web different from a user's perspective, it was also different for brands and businesses. A number of brands didn't have any online presence at all (certainly not on social media). Even if yours did, your options for online marketing and advertising were limited to static banner ads (paid for by reserving a spot on a website for a finite period of time), email newsletters, and paid search advertising through Google (which had by this time established itself as the number one search engine).

Where the web (analytics) was won

In this more limited web era, web analytics could quite easily cater to this kind of user behaviour. The page view -> visit -> visitor model (later known more commonly as page view -> session -> user) – single devices, single locations, a singular purpose, simple and static websites – meant that understanding what users were doing on your site was relatively straightforward.

In 2005 Google acquired Urchin and rebranded it as Google Analytics. It used a small piece of JavaScript, placed on every page of a website, that could track page views, device types and traffic sources right out of the box. Google also made Google Analytics free to use. This proved to be a very savvy move, allowing Google to make waves in the industry – both for itself and for the industry at large.

Over time GA gained further functionality, including event tracking (for interactions such as form submissions and clicks around the site), ecommerce tracking, and integrations with other Google marketing products such as Google AdWords and Google Webmaster Tools.


GA was well-positioned to help analysts answer the questions that made sense at the time, given that user behaviour and web technology were much simpler than today. Coupled with its free price point, this allowed Google to bring web analytics to the masses.

A web revolution

Over time, however, the way people use the web has evolved significantly. The iPhone was released in 2007, changing the way users browsed both on-the-go and at home, as, to a lesser extent, did the iPad in 2010. Internet connections became exponentially faster and more reliable (both broadband and mobile networks), which enabled on-demand video and music streaming, as well as live streaming.


Web technologies and frameworks (such as React, Angular and Vue) have been developed that enable web applications that were little more than pipe dreams in the early 2000s. You can buy more items online today than you ever could before (cars, stocks/shares/options, groceries, ISAs, etc.), which brings users to the web more frequently. Not only that, we can now manage our finances through online and mobile banking, start relationships with online dating apps and services, track our health with fitness apps, and more. The same is true for businesses (handling finances through Xero or Quickbooks, customer support through Zendesk, virtual events using BrightTalk, etc.). We have more reasons to use the internet and the web than ever before.

And since the variety of things people can do or buy online is now so vast, and most people use the web on multiple devices, the customer journey is more complex than it has ever been. Research Online, Purchase Offline (ROPO) is very common nowadays. A journey that starts on mobile after seeing a dynamically targeted video ad on a social media app during your commute, continues with research, and ends with a purchase in a desktop browser at home after a triggered marketing email – journeys like this, and more complex ones still, are commonplace today.

All of these societal and technological changes mean that the websites analysts are analyzing today look and behave very differently than they did when websites were just static pages with a few buttons and forms.

And yet, the majority of the most popular web analytics tools today still use a data model and frameworks as if we were analyzing simple websites, used by people with only a single device.


A typical customer journey: viewing banner ads, print ads and YouTube ads, watching video tutorials on mobile, reading blogs and reviews, comparing prices, liking on Facebook, downloading apps, and purchasing through a call centre, in-store, online or via mobile – before posting reviews afterwards.

Breaking out of the GA paradigm


Taking Google Analytics as the primary example: GA relies on page views as its key hit type, chopping sets of page views from the same cookie ID (not the same user) into sessions, and then tying those sessions back to that cookie. Most metrics and dimensions in GA are tied to the concept of a session – device type, landing page, channel, conversion rate, bounce rate, and so on. GA therefore requires that you send it page view hits so that it can construct sessions.


There are two key problems with this approach.

Firstly, a session in GA is a complex and untidy concept. It combines an inactivity timeout based on when GA receives hits, campaign timeout windows based on how long you wish a marketing or advertising campaign to be applied, cross-domain tracking issues, referral exclusions, changing acquisition sources, a hard reset at midnight, and more. Changes to any of these configurable options will affect how a session is defined, and therefore all the metrics and dimensions tied to the concept of a session. Since all these metrics can be so drastically changed by small changes in config, it's difficult to have confidence in them.

Secondly, consider a web application that doesn't fit nicely into the page view -> session -> user framework – twitter.com, for instance. A user can visit twitter.com/home, scroll through their timeline, hover over users' avatars to see a profile card, like and retweet individual tweets, and follow or unfollow users, all from this single page, which also auto-refreshes the feed. In the traditional sense, this all happens within a single page view, since the URL has not reloaded. If Twitter were using GA for their web analytics, without extreme customization they would likely see a high proportion of "sessions" consisting of only a few page views, and high bounce rates. The standard data model enforced by most web analytics tools doesn't fit the web of today.


The standardized page view -> session -> user paradigm doesn’t fit a lot of
web experiences in 2021. The BMW car customiser, an online learning
provider like Udemy or a streaming service like Twitch are all web
applications for which the standard web analytics data model makes no
sense anymore.

It's worth noting that there are a number of websites that do still fit this model – most publisher and ecommerce websites, for instance, where users move from product pages to search results, checkout and confirmation pages, or simply read articles. For those businesses, the model still fits well. However, a growing number of businesses do not fit this model, and even publishers and ecommerce businesses are starting to move their web experiences away from what we might call a "traditional" website model.


From websites to digital products

There is a valid argument that the web applications described above are better suited to product analytics than web analytics, and therefore require a different set of tools catering to requirements that traditional web analytics tools don't meet. This is true in a number of cases, although more and more of these "product" type applications are appearing within what could be considered "traditional" websites. The other drawback is that specialist product analytics tools often lack the high-level perspective that web analytics tools are excellent at providing. We are seeing product analytics and web analytics tools come closer together, a trend that will probably continue over the next few years.

Overall, most web analytics tools are poorly equipped to provide the deep level of detail required to understand user behaviour across complex web user journeys. This is not a revelation or an unpopular opinion. The challenge has been picked up by the largest player in the market, in an attempt to help analysts better answer those questions that traditional web analytics tools struggle with.


Google Analytics 4 is Google's latest version, and it completely changes how Google Analytics works, from the interface to the underlying data model. Instead of the traditional page view -> session -> user framework, GA4 shifts to an event -> user data model. This is a big change, and also a change in mindset for web analysts who have used Universal Analytics ("old" GA) for a number of years. GA4 also provides the ability to export data to Google BigQuery for no additional fee – for the first time providing mass access to event-level data in a SQL data warehouse.

This major change shows Google acknowledges the need for a fresh look at web analytics, and the fact that GA4 is built on Firebase – originally a tool for tracking interactions in mobile apps – shows how Google sees the two worlds coming closer together.


Beyond GA4 – the future of web analytics

GA4, however, does not fix everything. Reliable cross-device attribution, off-site measurement and integrating other channels such as CRM – all while respecting your users' privacy – remain a challenge for every online business, and GA4 will not fix these issues, nor will it fix issues with poorly constructed metrics.

“Once we figured out how easy it is to add trackers, we quickly started adding all kinds of events to understand how users browse and search on our site. We even moved all our infrastructure tracking and server-side events to Snowplow.”
Steven O, Head of Analytics at Tripaneer


To summarise, web analytics tools have struggled to keep pace with the changing user behaviours and technological advances of the last 10-15 years. Web analytics solutions need to give businesses and analysts the ability to customise their tracking and data models to truly fit the web applications their customers use, so they can fully understand the behaviour those users are exhibiting. Without the understanding that comes from rich, detailed behavioural data, businesses cannot expect to provide the best user experience across all touchpoints.

In the upcoming chapters in this eBook on web analytics, we will cover some
of the big topics and challenges that need to be addressed in order to go to
the next level and gain the most value from your behavioural web data:

• Privacy, security, ad blockers and more
• Challenges related to relying on a packaged solution
• Building more meaningful web analytics metrics from your behavioural data
• How best to model your behavioural web data

CHAPTER 2

PRIVACY UPDATES, AD BLOCKERS, AND THE NEED FOR 1ST-PARTY TRACKING

How organizations can overcome the challenges of the modern web to deliver complete behavioral data
The proliferation of privacy tools and ad-blocking software has made the job
of web analysts increasingly difficult. This technology obscures many of the
actions and behaviors that constitute the rich behavioral data visitors
generate on your website. Features like private browsing modes and ad
blockers were implemented to help protect users from the most egregious
intrusions into their browsing. Now, many of these privacy measures are
baked into browsers or enabled by default.

If your recommendation engine or marketing attribution model relies on detailed behavioral data, and you rely on a packaged analytics solution for collection, the unfortunate reality is that you're missing data – as much as 20%. But "tracking" has become a dirty word: what looks like well-intentioned data collection to improve site navigation to some may seem like unwelcome surveillance to others. As web analysts, we have to somehow bridge the gap between users (whose privacy deserves respect) and our organizations (which require accurate data).


The Browser Wars

After what seems like a never-ending stream of high-profile data breaches, coupled with increasingly strict privacy laws around the world, web browsers are locking down their user data, placing restrictions on who can access that data and how. All of this is not without reason. Regulators are seizing any opportunity to question organizations over their tracking practices, and users are becoming more educated about how their data is used. The result is that visitors to your website expect the same high-quality experience while retaining full control over what data they do or do not wish to share.

Enhanced browser privacy and built-in ad blocking make collecting meaningful behavioral web data difficult. This is the result of masking information about the user and their browsing history, controlling activity logs, and restricting cookies. Pre-packaged analytics platforms are often blocked by default, a result of the context in which the tracking event occurs (more on this below). While out-of-the-box tools can be helpful for organizations early in their analytics journeys, the increasing focus on user privacy that restricts third-party tracking makes a compelling case for first-party tracking.


Private Browsing

Web browsers are implementing clever features designed to protect users while preventing websites from tracking them. Behind the scenes, browsers are removing tracking parameters from URLs, stripping or spoofing referral IDs, and setting strict limits on how websites can interact with a user's browser storage via cookies.

Cookies

As a refresher, cookies are small pieces of data kept in browser storage to maintain state as a visitor navigates from one page to the next. Cookies make sure your visitors stay logged in and keep their items in the shopping cart as they browse your website. Cookies are often referred to as first or third party, but it's more accurate to describe their context – the circumstances under which the cookie was written to a visitor's browser. From Cookie Status:

“First-party context means that the operation happens between domains within the same site, i.e. domains that share the eTLD+1. Third-party context means that the operation happens cross-site, i.e. between domains that do not share the eTLD+1.”


Here, "eTLD+1" refers to the effective top-level domain plus one label. For example, the eTLD+1 of blog.snowplowanalytics.com is snowplowanalytics.com. Cookies with a first-party context ("first-party" cookies) are set between pages that share an eTLD+1, e.g. navigating from blog.snowplowanalytics.com/post_1 to blog.snowplowanalytics.com/post_2. A third-party context occurs between pages that don't share a domain, like your email service provider's subscription form popping up in an iframe, or a restaurant's menu PDF being served directly from S3 via s3.amazonaws.com.
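As a sketch, the distinction can be expressed in a few lines of Python. This is a hypothetical simplification: it treats the last two labels of a hostname as the eTLD+1, whereas real browsers consult the Public Suffix List to handle multi-part suffixes like .co.uk:

```python
def etld_plus_one(hostname):
    # Naive eTLD+1: the last two labels. Real implementations use the
    # Public Suffix List to handle suffixes like .co.uk correctly.
    return ".".join(hostname.lower().split(".")[-2:])

def cookie_context(page_host, request_host):
    """First-party context if both hosts share an eTLD+1, else third-party."""
    if etld_plus_one(page_host) == etld_plus_one(request_host):
        return "first-party"
    return "third-party"

# Navigating within the same site: first-party context
print(cookie_context("blog.snowplowanalytics.com", "snowplowanalytics.com"))
# A menu PDF served from S3 inside the page: third-party context
print(cookie_context("restaurant.example", "s3.amazonaws.com"))
```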

Despite their innocuous name, cookies wield a remarkable amount of power to inform and alter a user's web browsing experience, so it's no surprise they're at the center of some of the most robust privacy initiatives in modern web development.


The way the (third-party) cookie crumbled

The two most significant browser privacy initiatives (in intent if not in scope) currently impacting web analysts are Apple's Intelligent Tracking Prevention and Mozilla's Enhanced Tracking Protection.

Intelligent Tracking Prevention (ITP)

Apple's ITP was introduced to prevent the intrusive, disruptive practices of ad tech companies in the earlier days of the internet. Ad tech companies responded by moving to first-party cookies set client-side, which happens to be the same mechanism that many packaged analytics tools use to identify site visitors. ITP's restrictions eventually impacted cookies set by analytics providers as well: version 2.1 of ITP introduced a seven-day expiry period for all client-side cookies in the Safari browser, and version 2.2 capped cookies to just one day of storage if the domain URL matched a known tracker.
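The effect on cookie lifetimes can be sketched as a simple capping rule. This is an illustrative simplification in Python, not Apple's actual logic:

```python
def capped_expiry_days(requested_days, itp_version, known_tracker=False):
    # Illustrative cap on a client-side (JavaScript-set) cookie's lifetime
    # under ITP; a rough sketch, not Apple's implementation.
    if itp_version >= 2.2 and known_tracker:
        cap = 1  # ITP 2.2: one day if matched to a known tracker
    elif itp_version >= 2.1:
        cap = 7  # ITP 2.1: seven days for all client-side cookies
    else:
        return requested_days
    return min(requested_days, cap)

# A two-year analytics cookie survives only seven days under ITP 2.1:
print(capped_expiry_days(730, itp_version=2.1))  # 7
# ...and a single day under ITP 2.2 if flagged as a tracker:
print(capped_expiry_days(730, itp_version=2.2, known_tracker=True))  # 1
```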

“These [analytics] solutions use first-party, client-side cookies to collect data on behalf of a business rather than the business (and its domain) collecting the data itself, so these cookies fall under the seven-day expiration mandated by ITP 2.1.”
From “How ITP 2.3 expands on ITP 2.1 and 2.2 and what it means for your web analytics”

This means if someone visits your website to browse your products and
comes back ten days later and makes a purchase, that second visit looks like
a new person to your analytics.


Enhanced Tracking Protection (ETP)
Mozilla introduced Enhanced Tracking Protection into its Firefox browser in
2018 and enabled the privacy-focused suite of features by default in 2019.
Similar to ITP, ETP blocks third-party cookies. As of version 2.0, Firefox
deletes tracking cookies every 24 hours, as opposed to Apple’s generous
seven days. ETP extends a grace period for websites you visit frequently, like
search engines or social media, storing those first-party cookies for 45 days
(or indefinitely, depending on how often you visit the site).

Privacy in Google Chrome/Microsoft Edge

Apple and Mozilla are not alone in developing advanced privacy tools for their browsers. While Google Chrome doesn't currently offer as many options as other browsers, the company announced a new privacy initiative in January 2020 intended to give users greater control over their data, calling it “a path to making third-party cookies obsolete.” Microsoft's Edge browser offers tracking protection by default, with the recommended settings functioning similarly to ITP and ETP, but without an expiration period.

Ad blockers

Even if you don't advertise, ad blockers may be having a significant impact on your web analytics. Ad blockers function like other tracking prevention: as a page loads, its scripts are checked against a list of domains to block. Depending on the implementation and the ad blocker, tracking scripts from Google Analytics or other on-page analytics platforms can be caught by these filters.
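That filtering step can be sketched as follows. The blocklist entries here are hypothetical examples; real ad blockers use far richer rule syntaxes (e.g. EasyList-style filters):

```python
from urllib.parse import urlparse

# Hypothetical blocklist entries; real lists hold tens of thousands of rules
BLOCKLIST = {"google-analytics.com", "doubleclick.net"}

def is_blocked(script_url):
    """Block a script whose host is, or is a subdomain of, a listed domain."""
    host = urlparse(script_url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in BLOCKLIST)

print(is_blocked("https://www.google-analytics.com/analytics.js"))  # True
print(is_blocked("https://cdn.example.com/app.js"))                 # False
```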


The impact on third-party tracking

All of the signs are clear: Google is not alone on the path to making the third-party cookie obsolete. Of the leading web browser versions by global market share as of January 2021, the browsers discussed above account for over 80% of web users.

Browser-based privacy features and the prevalence of ad blockers (over 40% of internet users employ some form of ad blocking) mean that organizations relying on many analytics platforms will have to rethink their data analytics, digital marketing, marketing attribution and personalization strategies. Put another way: companies have found that Safari users spend more on average than Chrome users, so if your analytics solution can't track Safari, you're losing sight of 25% of your most lucrative visitors.


Collect complete behavioral data with first-party tracking

Your analytics will have distortions and gaps if you rely on cookies set in a third-party context, or set by known tracking and analytics services. First-party data collection platforms like Snowplow use server-side set cookies (first-party context), leaving them unaffected by ITP, ETP and most other tracking prevention. In an experiment run by Moz, calculating traffic obscured by ad blockers or browser tracking prevention revealed anywhere from a 5-30% discrepancy in volume.

Uncovering your own missing behavioral data leads to more accurate marketing attribution and can be a serious driver of growth.

Free of enforced expiration dates, tracking with server-side set cookies provides a source of rich, detailed behavioral data that businesses can use to make more informed decisions. Just as important, setting cookies this way preserves user privacy. Server-side set cookies are currently the most reliable way to track anonymous visitors to your website.
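In practice, a "server-side set" cookie simply arrives in an HTTP response header from your own domain, rather than being written by on-page JavaScript. A minimal Python sketch of building such a Set-Cookie header (the cookie name, domain and lifetime are hypothetical):

```python
from http import cookies

# Build a Set-Cookie header as a first-party server response would send it
c = cookies.SimpleCookie()
c["sp_user_id"] = "a1b2c3d4"                     # hypothetical visitor ID
c["sp_user_id"]["domain"] = "example.com"        # your own first-party domain
c["sp_user_id"]["path"] = "/"
c["sp_user_id"]["max-age"] = 60 * 60 * 24 * 365  # one year, set server-side
c["sp_user_id"]["secure"] = True                 # sent over HTTPS only
c["sp_user_id"]["httponly"] = True               # invisible to page scripts

print(c["sp_user_id"].OutputString())
```

Because the browser sees this cookie as set by the site itself, client-side cookie caps such as ITP's seven-day limit do not apply to it.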

“It was especially important for us to calculate conversion rates accurately for our clients, so they could compare us to other job boards. This wasn’t possible before Snowplow.”
Aurelien Rayer, Head of Data at Welcome to the Jungle


First-party, server-side tracking is a win for users and businesses

First-party tracking meets the privacy standards Apple, Mozilla, Google and others set to protect their users, while preserving behavioral data integrity. Because server-side tracking occurs in a first-party context, cookies set this way are unaffected by modern browser privacy measures. Controlling your data pipeline end to end also significantly reduces the likelihood of a third-party breach exposing any of your visitor data.

Organizations that use intentionally designed first-party tracking solutions, as Welcome to the Jungle does with Snowplow, collect high-quality behavioral data to deliver the best experiences for their customers. When you do data collection and analysis right, your users' positive experience should be as rewarding to them as their data is to you, the collector. Because first-party tracking can collect behavioral data without unnecessary personally identifiable information, or information otherwise locked behind ad blockers or ITP/ETP, you can still benefit from rich behavioral data while your visitors maintain their privacy.

CHAPTER 3

BUILDING A WEB ANALYTICS STACK – PACKAGED VS MODULAR

Building a web analytics stack – packaged vs modular

All-in-one analytics solutions are wildly popular, and rightly so. To take Google Analytics as an example, Google's decision to purchase Urchin in 2005 enabled it to enter the market early and bring web analytics to the masses.

According to a traffic usage survey, in 2008 Google Analytics was used by 55.1% of all websites, amounting to an analytics-tool market share of 84.3%. Despite many new players entering the industry, GA has held onto its dominance, with an eye-watering 75% of the market today. It's fair to say that GA continues to be the go-to tool for web analytics, and for many organizations it is a hugely powerful solution that helps them get started quickly.


But despite the popularity of Google Analytics and other packaged tools, there are a number of challenges organizations run into when relying solely on them for their web analytics. From browser privacy challenges to data silos and lack of control, it's worth exploring what these challenges mean for your business on a practical level, and why a move to a more modular stack could be a better approach in the long term.

Watch: Building a strategic data capability

That being said, packaged tools are popular for a reason. It wouldn’t be fair –
or accurate – to say that all companies should ignore packaged analytics
solutions, and for many teams starting out on their data journey, packaged
tools offer distinct advantages.

Packaged analytics tools are ideal for getting started

Early in the data maturity journey, it's often not wise or necessary to build out a complex technology stack. This is where packaged analytics can shine.

They are quick to set up. A huge advantage of packaged tools is that they deliver value quickly, giving you a fast understanding of how users are interacting with your websites and platforms, while you can always build out a wider set of use cases later.


They offer an all-in-one solution. Packaged analytics tools are exactly that
– a package, which means data collection, modeling and visualization are all
included. This eliminates the need to hunt down and purchase multiple
solutions, which is particularly advantageous at an early stage when
resources are limited.

They’re easy to use. While this may not be a major benefit to data teams or
engineers, marketing teams and other internal data consumers can easily
self-serve data from packaged analytics tools like Google Analytics, without
being SQL proficient.

However, for all their advantages and simplicity, packaged solutions have
their drawbacks.


Don’t stick to the packaged tools that you’re used to

It can be tempting to stay with the analytics tools you've grown used to. The risk is that you're not fulfilling the potential of one of your greatest business assets: your behavioral data. Packaged analytics solutions are limited in the following ways:

• They are one-size-fits-all. Packaged analytics tools are designed to be off-the-shelf solutions. They are not customized to your particular needs or business logic. This can have serious implications for use cases such as marketing attribution, when a tool decides what counts as a 'conversion' or 'acquisition' on your behalf. Often the assumptions a packaged tool makes on behalf of its customers are unhelpful or inaccurate.

This can be especially problematic for organizations that do not fit the mould of the typical e-commerce transaction, such as jobs boards or marketplaces with multiple user types.

“With Snowplow, we discovered that we own the data, which isn’t formatted in a way that forces you to a specific use case — it’s free and open so you can do what you want with it. We collect the data, use it to build a BI dashboard and connect it to the product to help our contributors.”
– Timothy Carbone, Data Engineer, Unsplash


• They are black boxes. Let's imagine you've set up your packaged analytics tool and you're beginning to explore data about your web visitors. For some of your web pages, the bounce rate looks pretty high. Why is that?

At this point, you have no control over how 'bounce rate', 'time spent' or other important web metrics are recorded. You don't even know how your data is captured and processed. Where is it hosted? What logic goes into defining certain events?

Since packaged tools are closed off, you cannot look under the hood and discover (let alone change) the way your web data is being manipulated. It's also often difficult (or impossible without paying large fees) to obtain and work with the raw data before it becomes opinionated and modeled. For organizations beginning to recognize data as one of their most important assets, this is a red flag. It means you're handing over control and ownership of your valuable behavioral data to a third party.

“Other solutions were like black boxes, and that is not the direction we wanted to take. We wanted a solution to become a core part of our business.”
Kevin James Parks, Data Engineer, Tourlane


• They are siloed. Many companies are realizing the strategic benefits of building a single customer view – that is, unifying data sets from all your platforms and channels to construct a cohesive understanding of your users.

Moving from disparate data capabilities (different data sets, different reports, no source of truth) to a strategic data capability (shared data sets, business-wide reports, a single source of truth): event data collection and ETL/data integration feed a data warehouse, which in turn powers BI tools, ML projects and real-time applications across marketing, product and data science.

But with packaged analytics tools, it's extremely difficult to unify data in this way, because your data is siloed off and structured completely differently from data captured from, say, social media, CRM and other channels. Without the ability to structure the data the way you'd like, or access to the raw data, your data is stuck in your packaged analytics tool, where its value is limited to only a few use cases – perhaps just reporting and analysis. Which brings us to our next drawback.

• They are limiting. The way companies work with data is constantly evolving. We've seen companies like Spotify use behavioral data to give their listeners unique experiences, such as their weekly recommended playlists. There are now a number of game-changing use cases that can be achieved with behavioral data, from personalized content to product analytics and customer journey mapping – and the list is growing.


“We want to be able to control and own all of our data. Snowplow is open source, which means that we can have confidence in it; we can look at the code and figure out what’s going on or change things.”
– Rahul Jain, Principal Engineering Manager, Business Intelligence Platform, Omio

Organizations relying solely on packaged analytics tools to capture and process their data run the risk of missing out on these opportunities, because their raw data is ‘stuck’ in the solution and is often impossible (or expensive) to get out. And as the competition to attract and retain customers in a post-digital world escalates, missing the potential to leverage game-changing data use cases could be fatal.

• They make it difficult to build assurance in data quality. Organizations cannot build assurance that their data is accurate and complete without taking ownership of their data infrastructure.

This is often overlooked, but by relinquishing control of how their data is captured, processed and modeled, companies also lose control of the integrity of that data. Can our data be actioned effectively by key data consumers? Is it structured in a way that analysts can work with? Is the data complete, or are we losing data to ad blockers and third-party cookie restrictions? These are questions that should be asked at the early stages of data capture, in order to ensure that the whole organization can get maximum value from their data – their most important asset.

Watch: identity resolution in a privacy-conscious world


I want to break free – breaking out into a modular, best-in-class data stack
It’s not easy, but building a data stack to power your web analytics (and beyond) is worth the effort in the long run.

To get there, you will need to consider how to shape your end-to-end data infrastructure, from data capture, to modeling and transformation, to warehousing/storage, visualization and more. It will require investigating a number of different options, and weighing the choice between building, buying or running open-source versions of the best-in-class solutions.

Your data team will likely lead the charge towards building a future-proof data stack. But that doesn’t mean they should build all their own solutions. There is a growing market of cutting-edge technologies for web analytics (and wider use cases) for you to explore.


[Diagram: the data journey – Track → Collect → Store → Model → Report → Act]

We’ll cover more on the best tools for your web analytics in an upcoming chapter of this eBook, but for now, here are some key categories to consider when putting together your stack:

• Data capture and management
Capturing and managing behavioral data from web channels should be one of your first concerns when it comes to building the stack. Explore platforms that offer you complete control over your data and flexibility to decide its structure.

• Data Visualization
To provide your internal data consumers with the best insights, you’ll need a
solution for visualizing and exploring the data. Look out for tools that make it
possible for teams to self-serve data, without creating bottlenecks.

• Data Monitoring
Measuring and improving your data quality is a huge factor in getting the most
from your web data. These tools will help you build assurance in your web data,
so your internal teams can be confident their data is reliable and trustworthy.


• Tag Management
Tag management systems (TMSs) are at the heart of your web analytics and marketing. They are especially important when it comes to setting cookies and capturing key information about your users and visitors (while respecting their privacy). Consider a TMS that allows for server-side tagging and is compatible with your other technologies.

• Testing/Debugging
Testing your web analytics implementation for tracking failures is not the most exciting aspect of your stack, but it’s one of the most important. We recommend integrating tracking into your automated testing suites, so you can ensure new builds don’t ship without properly functioning trackers in place.

• Data Transformation
Transforming, reformatting and modeling your data are all essential to ensuring your internal teams can action the data set that is most relevant to them. A good data transformation tool will enable you to turn raw data into actionable data sets that are understood and trusted by cross-functional teams.
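As a rough sketch of what such a transformation step does (the schema and field names here are illustrative, not any specific tool's output), raw page-view events can be rolled up into one actionable row per session:

```python
from collections import defaultdict

# Raw event rows as a transformation tool might see them (illustrative schema).
raw_events = [
    {"session_id": "s1", "page": "/home"},
    {"session_id": "s1", "page": "/pricing"},
    {"session_id": "s2", "page": "/blog/post"},
]

def sessions_table(events):
    """Roll raw page-view events up into one summary row per session."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["session_id"]].append(event["page"])
    return [
        {"session_id": sid, "page_views": len(pages), "landing_page": pages[0]}
        for sid, pages in grouped.items()
    ]

print(sessions_table(raw_events))
```

In practice this logic would usually live in a SQL model run by a tool like dbt or Dataform; the Python above only illustrates the shape of the input and output.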


Not all about the stack
Technology is important, but ultimately it’s your people and processes that make the difference. Having the best tools available will not, by itself, help you achieve your goals with web analytics. In fact, it’s sometimes best to start simply: build out a data team that fits your key use cases and evolve your tech stack over time.

“A data stack will not move you along the data maturity curve if the team and processes in place aren’t already appropriate.”
– Archit Goyal, Solutions Architecture Lead at Snowplow

If you’re unsure where to start, our internal experts can help you identify your immediate needs and scope out how you can realize your ambitions with behavioral data. It’s worth remembering that your organization’s experience with data is a journey – there is nothing wrong with starting small, and building as you grow.

CHAPTER 4

THE BEST IN
CLASS TOOLS FOR
WEB ANALYTICS

The best in class tools for web analytics
When it comes to a web analytics stack, one size doesn’t fit all. As we mentioned in our last chapter, breaking out of a packaged analytics solution to build a modular stack from best-in-class tools will put you back in control of your data and data infrastructure.

Building a data stack like this opens up opportunities to do more with your data.
But it isn’t easy. It means finding, researching and evaluating a number of
vendors to find the tools that work best for your business.

To make it easier, we’ve compiled a list of key tools to consider when building
out your web analytics stack. It’s not exhaustive, but a combination of these
solutions will put you in a good place for leveraging behavioral data from
web and other sources.


Data warehouse
One of the best ways to start making the most of your web and behavioral data is to load it into a data warehouse. This not only allows analysts to slice and dice the data any way they wish, but also scales as data volumes increase over time. The best data warehouses also have great marquee features such as integrations with other analytical products and services, and extra capabilities such as ML or querying semi-structured data (such as JSON).

(Check out this post from Poplin to see how the major data warehouse
solutions compare.)


Redshift
Redshift is what started the popularity of cloud-hosted data warehouses, launching in 2013. Its ease of use and low cost (compared to popular on-prem solutions available at the time) drove huge adoption of Amazon's data warehouse. It has struggled somewhat in recent years to keep up with the innovation of its competitors, but new RA3 cluster types (which separate storage and compute, previously tightly coupled together) and recent feature announcements such as Redshift ML and the SUPER data type (with fuller JSON support than ever) are making Redshift a more appealing choice again. Tight integration with AWS services (such as S3, Sagemaker and Glue) and reserved pricing for predictable cost forecasting are also big selling points.

BigQuery
Google's cloud data warehouse (long used internally to analyze Google's search index) is now available as a pay-as-you-go web service (DWaaS). With great integrations into the rest of GCP (Google Dataflow, Google Cloud Storage, Google Cloud ML etc.) as well as the Google marketing stack (Google Ads/Search Ads 360, Doubleclick, Ads Data Hub etc.), BigQuery is a great service to act as the center of all your marketing and customer data efforts. It also has good support for nested or repeated JSON records, supports real-time ingestion (through Streaming Inserts) and even supports running ML algorithms with BQML.


Snowflake
Snowflake is a cloud data warehouse with some very powerful and unique features, available on all three of the big cloud platforms. It separates storage and compute (similarly to BigQuery) but allows further control through separate Virtual Warehouses, which can be different sizes and suited to different purposes. Since the data is stored separately from these Virtual Warehouses, Snowflake is probably the most scalable of all commercially available data warehouses on the market, and we see our highest-volume customers generally moving to Snowflake. Snowflake also has excellent support for semi-structured JSON or XML data through its VARIANT data type – meaning Snowflake can also act as a data lake, popularizing the data lakehouse framework.


Data Visualization
For most users, staring at a large and unwieldy table of numbers can be daunting and hard to understand. In order to relay insights and findings to other stakeholders in the business, your web analytics stack needs good visualization capabilities.

Google Data Studio
Google's free data visualization tool. More of a dashboarding tool than a BI tool, Data Studio connects well with services in the Google marketing stack (Google Ads, Google Search Console, Doubleclick/Google Marketing Platform etc.) and has tight integration with Google BigQuery. If you're heavily invested in the Google stack, this is a great starting point for dashboarding your data.

Looker
Google's enterprise BI tool is aimed at companies who want to enable self-serve analytics across their organization. Its proprietary data modeling syntax, LookML, allows analysts to define a metric once and have it used consistently by all end users throughout the business. It's specifically designed for cloud data warehouses and takes advantage of their performance. Currently considered best in class, though it does leave something to be desired in terms of the flexibility of the visualizations it can produce.


Tableau
One of the major players in the BI space for a number of years, Tableau is enterprise-ready and leads the industry in its capabilities for drag-and-drop visualization building. Tableau is the most capable in the space for creating custom visualizations, and since it takes a low-code to no-code approach it's generally very easy for traditional BI analysts to use. Tableau leans heavily on a legacy approach of loading Tableau data extracts onto its own servers to power its dashboards, but is rolling out new features to enable more cloud-native approaches to data visualization.

Power BI
Built on the Microsoft BI stack that has been popular for decades, this Windows-only BI tool is popular with Excel analysts. It also has powerful data modeling capabilities (through Power Query and its data modeling language M), and is flexible enough to work with the popular cloud and on-prem data warehouses. A very affordable price tag also makes this a good choice if you want to start small and scale up.

Holistics
While they haven't been in the BI market for long, Holistics offers a powerful combination of data governance, ELT/transformation and visualization capabilities in a single attractive product. Entirely web-based, this service is built from the ground up for the cloud, and utilizes the performance of cloud data warehouses to ensure speedy dashboards. This is a great tool if you're looking for a modern, all-in-one, cloud-native dashboarding and BI solution.

Further reading: Snowplow and Holistics


Data Monitoring
With the increasingly large volume and diversity of data flowing through your website and into your points of analysis, it's more important than ever to monitor your data quality at every stage. These tools check and alert on your data quality across various points of your data lifecycle.

ObservePoint
A great tool for running automated scans on your website(s) to audit and monitor your tagging setup. By default it will crawl every page and log every tag that fires on that page, but custom user journeys can be added (such as checkout flows, product interactions etc.) and it will alert if at any point tags stop firing or start sending incorrect or unexpected values. An enterprise-level piece of software, with a price tag to match.

Iteratively
Iteratively helps teams catch analytics bugs before they hit production so you
don’t have to worry about bad data downstream. The product consists of two
parts: an intuitive web app where analysts, PMs and marketers can create
and evolve their tracking plan (ditching their spreadsheets), and developer
tooling for engineers to quickly and easily instrument tracking with type
safety and auto-complete. They work hand-in-hand to ensure event tracking
is implemented accurately and that the tracking plan is always enforced.

Great Expectations
Great Expectations is an open-source framework that allows automated tests to be run against the data in your data pipeline – from simple tests, such as checking a column for unique values, to more complex assertions, such as checking whether a value is within 2 standard deviations of the median for the entire column. GE can run all sorts of tests on your data as it is ingested and transformed. We use it at Snowplow in our latest V1 data models for BigQuery, Redshift and Snowflake.
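In plain Python, the flavor of those assertions looks roughly like this – a sketch of the kinds of checks GE automates, not its actual API, with made-up column values:

```python
import statistics

# Illustrative "columns" from a web analytics table.
user_ids = ["u1", "u2", "u3", "u4"]
page_load_ms = [120, 135, 128, 131, 900]  # one suspicious outlier

# Simple check: every value in the column is unique.
assert len(user_ids) == len(set(user_ids)), "duplicate user_id found"

# Distribution check: flag values more than 2 standard deviations
# from the column's median, as described above.
median = statistics.median(page_load_ms)
stdev = statistics.stdev(page_load_ms)
outliers = [v for v in page_load_ms if abs(v - median) > 2 * stdev]
print(outliers)  # the 900ms page load is flagged
```

A framework like GE adds what this sketch leaves out: declarative expectation suites, scheduled runs against the warehouse, and alerting when an assertion fails.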


Tag Management
Deploying tracking to your website is central to your data collection, data quality and data privacy strategies. Tag management systems make it more straightforward to do this at scale, and with the flexibility required to track all customer interactions.

Google Tag Manager
Google's hugely popular free tag management solution (also available as a paid solution, GTM 360) is primarily aimed at marketers and analysts. It has templates for common tag types, and is extensible through custom templates. For many, this is the default choice in the industry.

Tealium
Tealium's enterprise tag management system is aimed at organizations that want more high-end features, such as granular access controls, deployment workflows and a more developer-friendly experience. It also integrates with Tealium's CDP product.

Adobe Launch
Formerly known as DTM, this is the go-to choice if your infrastructure sits in
the Adobe ecosystem – Adobe Analytics, Adobe Target, Adobe Experience
Manager, and so on.


Testing/Debugging
When debugging any web implementation, it's important to be able to see what the browser is doing and what data it is sending, where and when. These Chrome extensions cover the dataLayer and common web analytics solutions, help spot common installation issues, and allow you to check that the data being sent is correct. Debugging should happen both during implementation (before publishing to production) and when investigating any issues.

• AdSwerve Chrome extension
• dataSlayer Chrome extension
• Google Tag Manager Assistant Chrome extension
• Snowplow Inspector by Poplin extension
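Tracking checks also belong in automated test suites, as recommended in the previous chapter. A minimal sketch of the idea – all names here (`Tracker`, `track_page_view`) are illustrative stand-ins, not any vendor's actual API:

```python
# A stub tracker records the events it would send to a collector,
# so a unit test can assert that tracking fired before a build ships.
class Tracker:
    def __init__(self):
        self.sent = []  # events queued for the collector

    def track_page_view(self, url, title):
        self.sent.append({"event": "page_view", "url": url, "title": title})


def test_page_view_is_tracked():
    tracker = Tracker()
    tracker.track_page_view("/pricing", "Pricing")
    # Fail the build if no page view was tracked on the pricing page.
    assert any(
        e["event"] == "page_view" and e["url"] == "/pricing"
        for e in tracker.sent
    )


test_page_view_is_tracked()
```

In a real suite, the stub would replace the production tracker (or requests to the collector endpoint would be intercepted) so the same assertions run against every build.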

Analysis Tools
Beyond visualizing your behavioral data (in dashboards and reports), there are higher-level analyses you may want to run over your data. BI tools and dashboarding solutions struggle to perform statistical analysis such as predictive models, forecasts and dynamic segmentation models. These are a couple of programming languages and packages aimed specifically at data scientists and statisticians to get you started.


Data Transformation
In order to perform any analysis or generate
any reports, your data will need preparing.
Transforming your data in a modern cloud
data warehouse is a great way to do this, as it is
performant, cost effective and can easily scale up with your data volumes.
There are some great tools available to orchestrate this in-warehouse pipeline.

Dataform - BigQuery
Dataform has recently been acquired by Google Cloud, and is now focusing on BigQuery specifically. Built on TypeScript and Node.js, Dataform works almost entirely in the browser (though there is an open-source CLI tool), providing instant compilation, automatic dependency inference, custom JavaScript functions for repeating common tasks, and scheduling to run your ELT pipelines inside BigQuery. It is also likely to get a lot of focus and development from Google Cloud in the coming years.

dbt - Redshift, Snowflake, BigQuery, PostgreSQL
dbt has built a huge community of open-source users, bringing analytics engineering to the masses. dbt is open source and based on Python, and supports all the major cloud data warehouses. Given its popularity and usage across the industry, there are lots of packages for common tasks (including a popular Snowplow package). dbt can be self-hosted with no license fee, and there is also dbt Cloud, which can be used in the browser.

R & RStudio
• googleAnalyticsR package
• ggplot2
• tidyverse
• tidymodels

Python
• numpy
• pandas
• plotly/dash
• scikit-learn


Data management
Snowplow
Snowplow is the leading platform for behavioral data management, including web data. For data teams looking to get more from their behavioral data, Snowplow offers unrivalled control and flexibility over your data set, as well as complete ownership of your raw, unopinionated data.

While this list isn’t exhaustive, we hope it helps to get you started on your
journey to a more complete stack for web analytics. Once in place, your data
stack should evolve with your business, setting you up for success for near-
term goals, as well as for future aspirations. For this reason, although it takes
time, effort and investment to piece together a stack that’s effective for
modern web analytics, the hard work will be worth it in the long run.

CHAPTER 5

REDEFINING
WEB ANALYTICS
METRICS

Redefining web analytics metrics
At the heart of our web analytics are our metrics. These are the ways we measure the performance of our websites or web-based applications, how they are received by our web visitors and whether they are effective, commercially or otherwise.

In this installment of our series on web analytics, we are going to look at some of the commonly used web analytics metrics and why they don’t always serve their purpose when it comes to understanding user behavior, both technically and conceptually. We’ll also look at how businesses can go about creating more meaningful metrics to understand their users better.

Most web analysts have used the same common metrics to measure website
performance for a number of years. Some of these metrics include:
• Conversion rate
• Bounce rate
• Time on page/Session duration


Most of these metrics are provided to users by a packaged tool like Google Analytics. These out-of-the-box metrics are designed to make it easy to understand whether our site is effective at converting users, or whether visitors are finding our pages and content engaging – and they are generally all understood to work as follows:
• Conversion rate - the higher the better
• Bounce rate - the lower the better
• Time on page/Session duration - the higher the better

This way of understanding these metrics makes a lot of assumptions about how we believe users interact with our websites. But how valid are these assumptions? And how are these metrics calculated?

Are they appropriate for measuring the effectiveness of our web marketing and the performance of our websites? This chapter aims to shed light on how tools like Google Analytics serve these metrics, allowing analysts to make a better decision on whether to use these out-of-the-box metrics.


Conversion Rate
Conversion rate is designed to indicate how well a website is performing at pushing a user through a journey towards a desired conversion – like purchasing a product, signing up for a demo or requesting a call from a sales team.

“92% of users visiting a website are not yet ready to buy”.

There is even a specialist field, Conversion Rate Optimization (CRO), which aims to tweak website design and user experience to improve its efficacy. A common approach to conversion rate is to try to turn a funnel into a cylinder: removing blockers and issues that make users likely to drop off and leave the purchasing journey, and optimizing the experience to make it smoother and easier for users.

This is a noble aim, and making the user journey a more enjoyable
experience for the end user is always a worthwhile effort.

However, the metric itself has problems. The first thing to be aware of is Goodhart’s Law, which essentially states that focusing on a single metric like this can have unintended side effects.


According to a recent study, up to 92% of users who visit a website at any given time are not yet ready to buy (depending on the product and the vertical). The vast majority of these users are just browsing to see what’s available at what price, or researching to varying degrees to decide what they wish to purchase and where from. Yet conversion rate focuses the mind only on the 8% who are actually ready to buy.

This seems like a misguided approach, as there are numerous ways a brand can add value to a user who is not ready to buy just yet, which may turn them into a customer later on: creating helpful informational content, providing honest comparisons and buying guides, and nurturing the user journey until they are ready to purchase. At that point, they will be much more likely to return to the site that helped them make up their mind to complete the purchase or conversion – and, as a result, more likely to stay a customer and become an advocate.

There are also more concrete technical problems with conversion rate. The
biggest issue is that most of the time the conversion rate metric is based on
visits or sessions – total number of identified conversions divided by the total
number of web visits.
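With made-up numbers, the two variants of the calculation look like this; note how one buyer's repeat research visits drag the session-based figure down:

```python
# Illustrative sessions: one buyer (u1) researched across four visits
# before converting; u2 never converted.
sessions = [
    {"user": "u1", "converted": False},  # research visit
    {"user": "u1", "converted": False},  # research visit
    {"user": "u1", "converted": False},  # research visit
    {"user": "u1", "converted": True},   # finally purchases
    {"user": "u2", "converted": False},
]

# Session-based: conversions / total sessions.
conversions = sum(1 for s in sessions if s["converted"])
session_cr = conversions / len(sessions)

# User-based: converting users / unique users.
users = {s["user"] for s in sessions}
converting_users = {s["user"] for s in sessions if s["converted"]}
user_cr = len(converting_users) / len(users)

print(f"session-based: {session_cr:.0%}")  # research visits count against the site
print(f"user-based:    {user_cr:.0%}")
```

The same behavior yields 20% on a session basis and 50% on a user basis – a reminder that the denominator matters as much as the conversions themselves.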

[Diagram: conversion funnel – Awareness → Consideration → Conversion]

As explained above, a user may be on a long and complex multi-visit journey, not quite ready to convert right now. A session-based conversion rate counts this visit as a non-converting session, and it therefore counts negatively towards your conversion rate. Yet this very user could convert in the future; the majority of their visits will still be discounted as “bad” sessions, pushing down the conversion rate and suggesting the website is not performing well.

Conversion rates can be configured to be user-based (conversions / unique users), but there are problems with this too. Chief among these is the difficulty of accurately identifying users on the web. Most packaged analytics tools use cookies as their primary identifiers for users visiting the website. Not only does this fail to reflect users who use multiple browsers or devices, but privacy initiatives such as ITP are limiting cookie lifetimes to 7 days (and potentially just 1 day), which plays havoc with user numbers.

There are ways to identify authenticated users (who log in to your site and self-identify) and to do so across devices. But authenticated users are generally a minority, so this isn’t a viable option for most businesses.

That isn’t to say that businesses should not concern themselves with their
conversion rates, but instead to ensure that they look at the conversion rate
metric within the right context, and not be blinded by it. Conversion rates can
be useful for visits with a commercial user intent (visits where the landing
page is a product page, or PPC traffic from branded keywords for instance)
but are less helpful when intent is informational (landing on a content page
from organic search) or when the likelihood to convert is low (potentially
visits from a mobile device when the product is of a very high value).


Bounce Rate
Bounce rate was once described by Avinash Kaushik as the “I came, I puked, I left” metric back in 2007. It is supposed to signify the share of your users who landed on your site, quickly decided it was not what they were looking for, and left instantly.

Under certain definitions, it is possible for a “bounced” session to actually be a very valuable session.

While this is technically true, many analysts and users focus on this metric to
measure how a landing page is performing, even though bounce rate has
been largely criticized by the wider analytics industry.

To understand the issues with bounce rate as a performance measure, we first need to be 100% clear on how it is calculated. In Google Analytics, a “bounced” session is a single-interaction session – generally just the first page view of the session – and bounce rate is the total number of bounced sessions divided by total sessions.
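Using that GA-style definition, the arithmetic is simple (the session data here is made up):

```python
# Each number is the count of interaction hits recorded in one session
# (illustrative data). A session with exactly one hit is a "bounce"
# under the GA-style definition described above.
hits_per_session = [1, 5, 1, 3, 1]

bounced = sum(1 for hits in hits_per_session if hits == 1)
bounce_rate = bounced / len(hits_per_session)
print(f"bounce rate: {bounce_rate:.0%}")  # 3 of 5 sessions bounced
```

Everything that follows hinges on that "single interaction" condition: any session where no second hit is ever recorded counts as a bounce, regardless of what the user actually did on the page.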


The problems start to occur when we take this as our definition of a bounced
session. Under this definition, if there is no other tracking set up to track
interactions on the page, it is possible for a “bounced” session to actually be
a very valuable session.

A good example is a user landing on a piece of informational content from organic search, perhaps from a “how-to” query such as “how to reverse a commit in git” or “how to make enchiladas”.

They land on the content, spend time on the page, scroll down the page to the
end and read the content in full. They may even bookmark the page, or copy
the link and send it to their friends or colleagues depending on the exact type
of content (not all content is inherently sharable). Having read the content,
they’re happy they’ve got what they need, and close the browser tab. This is
likely to have been counted as a bounced session, and thus contributes
towards increasing the site’s overall bounce rate. And since a higher bounce
rate is generally considered to be a bad thing, this is therefore a “bad” visit –
whereas in reality this visit was a good visit, as the content answered the
user's question and gave them the answer they were looking for.


Another example could be a user visiting a retail site to find the location and opening times of a physical store. The user gathers all the information they need quickly, and then closes the window. Again, the page has served its purpose perfectly, but still generates a bounced session: therefore, another “bad” visit.

Analysts have seen this happen, and know that a “high” bounce rate is “bad”. As a result, some use a metric known as “adjusted bounce rate”, where the tracking implementation is tweaked in order to bring bounce rate down – for instance, if the user stays on the page for more than 30 seconds, the session is not treated as a bounce, even if they then leave. This practice of tweaking or “fixing” the metric is generally not a good idea: you are not addressing the underlying cause, but focusing on the metric itself and ignoring the real issue – creating content that better fits the user’s intent, or optimizing the page for a better user experience.
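Mechanically, the “adjusted” tweak looks like this (timings and threshold are illustrative; as noted above, this papers over the metric rather than fixing anything):

```python
# Each tuple: (interaction hits in the session, seconds spent on page).
# Illustrative data only.
sessions = [(1, 12), (1, 95), (4, 200), (1, 8)]

THRESHOLD_S = 30  # the 30-second cutoff from the example above

# Plain bounce rate: any single-hit session is a bounce.
plain = sum(1 for hits, _ in sessions if hits == 1) / len(sessions)

# "Adjusted" bounce rate: single-hit sessions only count as bounces
# if the user left within the threshold.
adjusted = sum(
    1 for hits, secs in sessions if hits == 1 and secs <= THRESHOLD_S
) / len(sessions)

print(f"plain:    {plain:.0%}")
print(f"adjusted: {adjusted:.0%}")
```

The number goes down (75% to 50% here), but nothing about the pages or the user experience has changed – which is exactly the criticism above.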

Bounce rate is a useful metric when used appropriately. A good use of bounce rate would be to look across all similar pages (all the content pages within the /blog/ section of a site, for example) and compare the bounce rate across all of these pages. If the majority of these pages have roughly similar bounce rates, but one or two have a significantly lower bounce rate, it’s worth looking into those pages to understand why. This insight could prove valuable when creating future content.

Conversely, if a few pages have a significantly higher bounce rate, this should be investigated as well. Always make sure you’re making a fair comparison, that you understand the user intent behind those pages and how users got there – and make sure to segment, segment, segment.


Time on Page/Session Duration
For sites that don’t sell directly online – such as publishers, brochure sites or sites with a large content section – it can be hard to understand whether the content is performing well or to accurately gauge its value. One way analysts try to measure the value and performance of content is to gauge the engagement that content is creating. Engagement is notoriously difficult to quantify and measure, so a common proxy for engagement is the time a user stays on a specific page or on the site as a whole (another common approach is to look at the average number of page views per session).

“If there’s one thing a publisher should really understand it’s how much attention their content is receiving. But most media companies really don’t have a handle on it. The most common analytics platforms do such a bad job of measuring it they’re actually worse than useless, so most media companies focus instead on pageviews and reach metrics.”
– Simon Rumble, Digital Analytics Specialist at Australian Broadcasting Corporation (ABC)

However, as you may have guessed, there are both conceptual and technical problems with measuring time on page/site. The first point to consider is this: does a higher time on site really mean that users are more engaged with the content or the site? It is true that if a user enjoys the content on a site, they may spend more time reading the articles, and potentially browsing to other articles or pages and reading them.


The problem is that a user who is not enjoying the content, or is struggling to use the site, could also spend more time on it. What if the user interface is confusing and the user can’t navigate the site easily? This is likely to mean they will spend longer on the site as well. Or what if a user is struggling to read the content because it is too complicated to follow or poorly written? There’s no real way to differentiate between these two very different types of user experience just by examining the time spent on the site.

There are also issues from a technical standpoint. Most tools measure the time spent on a page or site using the difference in timestamps between one page view and the next page view or event. The problem is that this doesn’t account for whether the user was actually at their screen. If you view a page, read for 20 seconds, leave your screen to make a coffee for 10 minutes and return before the session times out, the tool will likely assume you spent those 10 minutes looking at the page, even though you weren’t.

Snowplow handles this by using Page Ping events, where the Javascript
tracker “pings” the page to see if the user is still active. If not, then this time is
not taken into account when calculating how long was spent on the page.
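The resulting engaged-time calculation can be sketched as follows (the ping interval and counts here are illustrative; Snowplow's actual heartbeat interval is configurable):

```python
# Approximate engaged time from activity pings: each ping received
# means the user was still active during that interval.
PING_INTERVAL_S = 10  # illustrative heartbeat interval

def engaged_seconds(pings_received):
    """Engaged time is roughly pings received x ping interval."""
    return pings_received * PING_INTERVAL_S

# A user who reads for ~2 minutes, then leaves the tab open idle for
# 10 minutes, produces ~12 pings - idle time is simply never pinged.
print(engaged_seconds(12))  # ~120 seconds of engagement, not 720
```

Because idle periods generate no pings, the coffee-break scenario above contributes nothing to the engagement figure, unlike a timestamp-difference approach.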

“With page pings from Snowplow we have a very precise way of measuring engagement on our articles. This is something we simply couldn’t do before with Google Analytics. I think this is one of the most interesting metrics we’ll see in terms of media analytics.”
– Aurelien Rayer, Head of Data at Welcome to the Jungle


Another issue specific to Google Analytics is how GA handles exit pages. An
exit page is the last page in a user’s session. Since GA generally calculates the
time on a page as the duration between two page views, this cannot be
calculated for an exit page as there is no subsequent page view – so GA sets
the time spent on that page to 0 seconds, which is clearly not reflective of
what has actually happened. Worse, even though GA does not know how long
a user spends on their session’s exit page, that page view is still included in
the average time on page calculation (total time spent on the site / total page
views). For example, a session with two 60-second page views followed by an
exit page is reported as (60 + 60 + 0) / 3 = 40 seconds average time on page,
even though every measured page view lasted a full 60 seconds. This method
of handling time on page for an exit page distorts both the numerator and the
denominator in the calculation, making the metric fundamentally flawed.


What metric should web analysts focus on?


If the standard metrics that are provided by packaged analytics tools have
issues, what should we use instead?

The first thing to say is that these metrics aren’t always the wrong thing to
use. It’s just important to understand how they are measured and calculated,
so that given your unique case you are able to make a call as to whether
these metrics are appropriate or not.

Given this, are there alternatives to these common metrics that can be used
in their place? Or different applications of these metrics that will make them
more meaningful?

Session- and user-based conversion rates both suffer from the somewhat
flimsy definitions of “sessions” (a collection of hits within a timeout window,
among other factors) and “users” (a unique cookie ID, tied to a
browser/device combination). One way to make conversion rate more
meaningful is to use it in the context of other important events (sometimes
called micro conversions).

This is common when looking at “funnels” on site – what percentage of users
who perform action X then go on to perform action Y. This is more concrete
than relying on artificial concepts like sessions or users, and as long as all
important micro conversions are tracked on your site or in your product,
these kinds of “this-then-that” analyses can be performed straightforwardly.
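As a sketch, a “this-then-that” funnel can be computed directly from event-level data. The page path, the custom event name and the Redshift-style `::FLOAT` cast below are illustrative assumptions, not part of any standard schema:

```sql
-- Of users who viewed a (hypothetical) pricing page, what share
-- went on to start a signup afterwards?
WITH viewed_pricing AS (
    SELECT domain_userid, MIN(derived_tstamp) AS first_viewed
    FROM atomic.events
    WHERE event_name = 'page_view'
      AND page_urlpath = '/pricing'       -- illustrative path
    GROUP BY 1
)

, started_signup AS (
    SELECT domain_userid, MIN(derived_tstamp) AS first_signup
    FROM atomic.events
    WHERE event_name = 'signup_started'   -- illustrative custom event
    GROUP BY 1
)

SELECT COUNT(s.domain_userid)::FLOAT / COUNT(v.domain_userid)
           AS pricing_to_signup_rate
FROM viewed_pricing AS v
LEFT JOIN started_signup AS s
    ON v.domain_userid = s.domain_userid
   AND s.first_signup > v.first_viewed    -- action X strictly before action Y
```

Because both CTEs are grouped to one row per user, the left join never inflates the denominator; users who never reach action Y simply contribute a NULL to the numerator.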


Time on page/site is problematic because it is hard to differentiate between
very different user experiences from a single value. Using an engagement
metric that detects the time actually spent at the screen (via a heartbeat
mechanism or similar) ensures you only measure the time users have
genuinely spent interacting with your features or content.
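With Snowplow’s page pings, for example, engaged time can be approximated from the number of distinct pings. The sketch below assumes the tracker is configured with a 10-second heartbeat; the interval is a configuration choice, not a fixed default:

```sql
-- Approximate engaged seconds per page within each session,
-- assuming one page ping fires for every 10 seconds of activity.
SELECT
    domain_sessionid AS session_id
  , page_urlpath
  , COUNT(DISTINCT event_id) * 10 AS engaged_seconds
FROM atomic.events
WHERE event_name = 'page_ping'
GROUP BY 1, 2
```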

While this helps make the metric somewhat more reliable, the most
important thing to change is your mindset when analyzing the data. Make
sure to take into account factors such as how long the content is, what the
user’s intent was when they landed on the page (whether from a search
engine, a social media post, a referring site or an ad), and any multimedia
content (video, audio, etc.) that might change the user’s behaviour. Once
these factors are considered, you are in a much better position to interpret
what a particular metric might be indicating.

The ultimate aim is to create metrics that are completely custom to your site
or product, and to reserve “standard” metrics for only the most top-level
analyses. This takes a deep understanding of your site, your users and their
journeys. But once you have these higher-value, customised metrics that are
far more meaningful, drawing insights from your data becomes much easier.

CHAPTER 6

DATA MODELING
FOR WEB ANALYTICS

What is data modeling?


At Snowplow we describe data modeling as a sequence of SQL steps run on a
schedule in your cloud data warehouse.

However, definitions do vary, and some organizations run their scheduled
sequence of steps in a different language – often to the detriment of code
readability, but in exchange for benefits such as reduced cloud costs or easier
implementation of specific advanced data transformations. For the vast
majority of organizations, though, the simplicity of a declarative language like
SQL is hard to overstate, and it comes with a host of other benefits such as
accessibility and reduced code run times.

What is the purpose of data modeling?


The purpose of this sequence of SQL steps is to automate away repeatable
tasks. These repeatable tasks are data transformations that are designed to
reduce complexity so that time to value for new and existing users of the data
model is minimised.

Introduction to data modeling: download

Viewing data modeling in this light leads to the realization that data models
are data products, analogous to any other tech asset or software product
that adds significant value to an organization.


Why is data modeling important for


behavioural data?
Behavioural data is messy. Any public-facing application can expect to be
interacted with by a whole host of entities, each with their own agenda. An
off-the-top-of-the-head shortlist might include the following:

- The intended users of the application


- Internal users
- Integrated testing tools
- Spammers
- Bots (set up by competitors or individuals)
- Pentesters

All of these different types of users, and more, will be performing both
expected and unexpected actions on your website or application. Much of
the noise is filtered out because it is never tracked, but even with the best
tracking design and protocols, some of this noise can end up in the final dataset.

It is the job of the data model owner to understand what noise is inherent in the
final dataset and to make decisions around what should be done as a result.


How to approach data modeling


Stating our goal to be “the minimization of time to value for the end user”
naturally leads to the question of “How?”.

This is achieved by standardizing the common operations that are required in
every query against the raw event-level data. These common operations are
as follows:

1. Aggregation
Event-level data can be difficult for the casual user to understand, as the
concept of an event is relatively abstract; aggregating up to more familiar,
higher-order concepts can help to promote understanding of the data. For
example, a user who wants to understand in-session behaviour over time
might be more inclined to query a ‘sessions’ table than an ‘events’ table.

Equally important is to consider that many data warehouses charge based on


the volume of data scanned in a query or the compute resources used. By
querying aggregate data instead of event level data the cost of data access
can be significantly reduced.


2. Filtration
In the simplest case, if upfront validation is not performed on event-level
data before it lands in the data warehouse then poor quality data will need to
be filtered out of the final dataset. An example could be an anomalous
transaction event not being filtered out of a marketing attribution model
resulting in inefficient allocation of marketing spend.

The events table contains a large number of fields – only the fields that are
relevant to a particular data model table should be selected. This helps
improve the signal-to-noise ratio for any downstream analysis.

By default internal users and bots should be removed from any final
analytics dataset. Your SQL data model is the ideal place to define what
constitutes an internal user or bot (often this decision is informed by the
Snowplow IAB enrichment).

The Snowplow tracker comes with built-in semantics to ensure that every
event is sent at least once. This will inevitably result in a small number of
duplicate events in the final dataset, and these duplicates should be filtered
out prior to data consumption.
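A common deduplication pattern, sketched below, keeps exactly one row per event ID. Using `collector_tstamp` as the tie-breaker is an assumption; real models may also compare an event fingerprint to distinguish true duplicates from ID collisions:

```sql
-- Keep exactly one row per event_id; at-least-once delivery
-- means the same event can occasionally land more than once.
SELECT *
FROM (
    SELECT
        *
      , ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY collector_tstamp
        ) AS row_num
    FROM atomic.events
) AS numbered
WHERE row_num = 1
```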


3. Application of business opinion


The Snowplow trackers deliver unopinionated data. It is up to each individual
organization to apply opinion to this data to make it specific to them and
their business model.

There is the classic example of a user visiting a webpage as a result of


marketing activity, they will land on the webpage with UTM parameters in
their querystring – it is trivial to parse out these parameters, but the
collecting organization is the only one with the nuanced understanding of
how to classify these parameters into different marketing channels.
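A sketch of such a classification is below. The `mkt_medium` and `refr_medium` fields are standard Snowplow event properties, but the channel mappings themselves are illustrative assumptions – each organization must define its own:

```sql
-- Illustrative marketing channel classification; the mappings
-- are business opinion that each organization defines for itself.
SELECT
    event_id
  , CASE
        WHEN mkt_medium = 'cpc' THEN 'Paid Search'
        WHEN mkt_medium IN ('social', 'paid_social') THEN 'Paid Social'
        WHEN mkt_medium = 'email' THEN 'Email'
        WHEN mkt_medium IS NULL
             AND refr_medium = 'search' THEN 'Organic Search'
        ELSE 'Direct / Other'
    END AS marketing_channel
FROM atomic.events
WHERE event_name = 'page_view'
```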

At Snowplow we commonly get the question “what happens if my business
opinion changes?” This is expected behavior – the contents of data models
are expected to be transient and to change over time. The raw event-level
data is immutable and unopinionated, whereas the data models can be
deleted and recomputed to reflect an updated understanding of the specific
parameters each organization operates under.


4. Join together multiple datasets


Your behavioral data is incredibly valuable, and this value can be multiplied
by enriching it with other data. If a transaction is recorded on the client side,
it is generally not possible, or advisable, to track the associated margin client
side too. However, if a transaction ID is recorded, it can be used as a join key
to combine your client-side and server-side data. This simple paradigm shift
can drastically increase the value derived from marketing attribution models,
by attributing credit for margin, not just revenue, to marketing campaigns.
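As a minimal sketch, assuming a server-side `backend.orders` table exists with `order_id` and `margin` columns (both assumed names):

```sql
-- Join client-side transaction events to server-side order data
-- on the transaction ID, so margin can be attributed downstream.
SELECT
    ev.domain_userid
  , ev.tr_orderid
  , ord.margin            -- margin only lives server side
FROM atomic.events AS ev
JOIN backend.orders AS ord
    ON ev.tr_orderid = ord.order_id
WHERE ev.event_name = 'transaction'
```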

5. Perform operations incrementally


The data volumes associated with behavioural data are typically large enough
to justify investing the extra effort into ensuring that the previous 4 operations
are performed incrementally. This will ensure that the benefits above are
realised without breaking the bank to create and maintain the data models.
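An incremental run might look like the sketch below, where `derived.etl_state` is an assumed bookkeeping table recording how far the previous run processed:

```sql
-- Only scan events that arrived since the last successful run,
-- rather than reprocessing the entire events table each time.
INSERT INTO derived.page_views
SELECT
    event_id
  , domain_sessionid
  , page_urlpath
  , derived_tstamp
FROM atomic.events
WHERE event_name = 'page_view'
  AND collector_tstamp >
      (SELECT MAX(last_processed_tstamp) FROM derived.etl_state)
```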


Data modeling in action: An example


Internally at Snowplow we are very opinionated about what the “bounce
rate” metric means; as an organization we simply do not believe that it is
correct to attribute the same value to the following 2 sessions:

1. A user visits the homepage and then abruptly leaves


2. A user visits a specific article, scrolls to the bottom of the page
over a 5 minute period and then leaves

Without data modeling, an analyst might have to construct the following
query just to understand, at a basic level, which sessions on a particular day
are truly bounced sessions and which are quality sessions.


WITH sessions AS (
    SELECT DISTINCT
        domain_sessionid AS session_id
      , FIRST_VALUE(page_title) OVER (PARTITION BY domain_sessionid
            ORDER BY derived_tstamp ROWS BETWEEN UNBOUNDED PRECEDING
            AND UNBOUNDED FOLLOWING) AS first_page_visited
      , COUNT(1) OVER (PARTITION BY domain_sessionid)
            AS page_views_in_session
    FROM atomic.events
    WHERE event_name = 'page_view'
      AND DATE(collector_tstamp) >= '2021-03-18'
)

, time_in_session AS (
    SELECT domain_sessionid AS session_id
      , COUNT(DISTINCT event_id) * 10 AS time_in_session
    FROM atomic.events
    WHERE event_name = 'page_ping'
      AND DATE(collector_tstamp) >= '2021-03-18'
    GROUP BY 1
)

, pre_agg AS (
    SELECT s.session_id
      , CASE WHEN s.page_views_in_session = 1
                  AND s.first_page_visited = 'homepage' THEN 'bounce'
             WHEN s.page_views_in_session = 1
                  AND s.first_page_visited = 'how-to-guide-article'
                  AND COALESCE(tis.time_in_session, 0) < 60 THEN 'bounce'
             ELSE 'quality_session'
        END AS bounce
    FROM sessions AS s
    -- LEFT JOIN: a truly bounced session may have no page pings at all
    LEFT JOIN time_in_session AS tis
        ON s.session_id = tis.session_id
)

SELECT bounce
  , COUNT(1)
FROM pre_agg
GROUP BY 1


This is a relatively complex query that has multiple CTEs, each one querying a
different event type and applying different operations to the event level data.
These CTEs then have to be joined together into a final table that contains a
case statement that is required to classify sessions as either quality sessions
or bounced sessions.

There are a multitude of problems with this query, including but not limited to:
- The level of SQL for this basic analysis is too advanced,
non SQL fluent users would not be able to build such a query;
- Introducing multiple steps into a query means there are more
places where mistakes can be made;
- The query is not optimized and contains expensive and slow
to run window functions;
- There is no version control on the case statement, every user
who wants to analyze bounce rate has to have knowledge of
where to find the latest version of the case statement in order
to perform similar analysis self sufficiently;
- The query directly queries the events table meaning it
unnecessarily scans a large amount of data every time it runs.

The end result of all of this is that any reporting or visualisation based on
querying only event-level data is likely to produce a reporting setup that is
very difficult to maintain and very expensive. A better approach is needed.

That better approach is to codify all of this logic in a central, versioned
data model, which might allow for the following query:

SELECT bounce
  , COUNT(1) AS sessions
FROM derived.sessions
WHERE session_start_date = '2021-03-18'
GROUP BY 1

This allows the user to use simple SQL, or even a drag-and-drop tool, to
calculate bounce rate for a specific date or date range with minimal effort.


Data model architecture


The optimization above – a 5-line query doing the work of a 50-line query –
requires the data model architecture to be considered. In the case of web
tracking with the Snowplow Javascript tracker, there is a built-in hierarchy
of entities that is well placed to serve as the basis for the overall data
model architecture, shown below:

users 1 row per user

sessions 1 row per session

page_views 1 row per page view

events 1 row per event

Each box in this diagram represents a table in the data warehouse. The events
table contains our immutable unopinionated event log. Each table is dependent
on the table below it, and data is aggregated and filtered incrementally in the
operations that take place between each step of the model.

This is a preparatory data model that contains core business opinion such as
what marketing parameters constitute what channel, what constitutes a
bounced session, and what constitutes a conversion.

This data model is a good starting point for an organization that tracks web
data only. But for any organization that has customer touchpoints outside of
the web it is extremely valuable to integrate these into the data model and
build a single customer view. An example of this is provided below where a
mobile data model analogous to the web data model has been created and
the results have been unified to build such a single customer view.


single_customer_view
├── web_sessions ← page_views ← events
└── mobile_sessions ← screen_views ← events

Single customer views like this that capture customer touchpoints across a
variety of media are hugely valuable due to the unparalleled insight they can
offer into customer behavior.

For example, any attribution model that is built to combine both web and
mobile touchpoints and capture the whole customer journey will be orders of
magnitude better than an attribution model that is only able to attribute
marketing credit to single device customer journeys.

What value does a data model


add to an organization?
With the examples discussed above it is possible to see how data models can
add value to an organization in a variety of ways. Primarily, data models are
products that enable self-service for a variety of teams throughout the
organization.

If an organization truly wants to become data-driven then effort must be


invested into developing data models that democratize access to clean,
feature rich, opinionated, pre-aggregated data to create a platform that is
readily accessible to:
1. Less technical users for simple self service
2. Analysts as a jumping off point for more advanced analytics (e.g. attribution)
3. Data scientists as an input to machine learning models
4. Data engineers as a component of a real time application


How to get started


Use of a dedicated data modeling tool is highly recommended when getting
started with data modeling. There are a variety of tools available that fit the
bill, such as Airflow, Dataform, Databricks, dbt, or even Snowplow’s open-
sourced, in-house tool SQL-runner. Your choice of tool will primarily depend
on the pre-existing cloud environment and data warehouse that are native to
your organization.

Someone with SQL fluency such as an analytics engineer, a senior analyst, a


data scientist or data engineer can then use this tool to start building out
different data models.


In general it is not advisable to start from scratch as many organizations


(including Snowplow) have faced similar challenges in the past and have
developed open source data models that can be used as a robust starting
point. See the recently released V1 Snowplow data models for Redshift,
BigQuery & Snowflake as example starting points.

By building a data model you are making an upfront investment in a


foundational product to democratize and standardize data and insights
across your organization – one that will prevent the analytics team from
becoming a bottleneck to the organization and that will yield dividends for
all other teams in the organization for years to come.

Investing in your strategic data asset is one of the best things you can do to
build a competitive advantage in today’s competitive landscape. In our next
chapter, we’ll explore how Snowplow can help organizations take advantage
of their behavioral data from web.

CHAPTER 7

HOW SNOWPLOW
CAN HELP YOUR
WEB ANALYTICS

Web analytics has become a more vibrant, fractured and challenging industry
in recent years. From humble beginnings, websites have evolved out of static
web pages into compelling web experiences. They can now host game
changing features such as personalization, dynamic pricing and content
recommendations to make browsing a richer, more rewarding experience. And
the teams behind them: developers, product teams, data teams and engineers
are laser-focused on understanding the user experience at a granular level, in
order to make incremental improvements on a constant basis.

This is far from easy. Building a website to drive competitive advantage


involves a deep understanding of your users and customers. It means diving
into the intricacies of how they explore and interact with your website,
examining how their needs are met (or not) throughout their journey and
identifying where their overall experience can be improved. Underpinning all
of this investigative work is the need for reliable behavioral data – ideally
high-quality data that is complete, accurate and well-structured, so it can be
easily worked with and understood.

And getting this data is another huge challenge. In part, this challenge is a
logistical one. It requires a data team to establish a successful data
management practice that will make the most of the data. It requires a suite
of tools that will take the data on a journey from the point of capture, to
enrichment, modeling, storage, to visualization and reporting. It also requires
a significant investment, not just in terms of cost and effort, but also a unified
internal effort to align data objectives with the wider business and forge a
culture of data excellence across the organization.


Keeping pace with the web industry


The gist is that the challenges involved in modern web analytics have now
outgrown the packaged analytics solutions that got us this far. So much
thought and innovation – driven by consumer demand – has gone into
creating rich digital experiences, and rightly so. But as a consequence, the
data practices in many organizations have been left behind, struggling to
keep up. As the web industry has evolved, so must our processes, our
approach and our tools for web analytics and data management.

One of the things I am observing is that big or small,


hypergrowth or not, almost all of us ignored data
management. And now it’s become a beast. We
solved for raw ingestion, storage, querying, viz and
even ML. But overlooked lineage, quality, ops,
modelling, security and privacy.
– Rahul Jain, Principal Engineering Manager,
Business Intelligence Platform, Omio


In part, this is because our tooling has not evolved at the same pace.
Packaged tools helped us get started with web analytics, and at their best,
they can help us get off the ground at the start of our data journey. But as
businesses grow and our reliance on data increases, the limitations of these
tools prove costly and frustrating.

This is because:
• Packaged analytics tools don’t provide flexibility and control over
how your data is captured or structured.
• Privacy updates such as ITP mean that tracking with third-party
cookies is increasingly unreliable.
• Relying on packaged tools forces you to outsource your data
collection approach to a third party. For example, you don’t get
to decide what counts as a ‘conversion’ or ‘bounce rate’, the tool
decides it for you.
• Packaged tools are ‘black-boxes’ – it isn’t possible to see what
happens to your data under the hood.
• Third-party tools that model your data do not take your unique
business model or logic into account. Data is aggregated according
to a standard approach based around the ‘page view’, ‘session’ and ‘user’.
• Packaged tools don’t provide access to your raw data, limiting your
ability to leverage data beyond basic reporting.

We know that companies winning today are the ones who use behavioral
data to cultivate a strong understanding of their users and their needs. To get
there, modern organizations should look to move from ad-hoc data
functions, siloed off in their marketing, product and BI teams, to a centralized
strategic capability that can empower the whole business.


Building a strategic data capability


As we mentioned in chapter 3, organizations looking to drive more value from
their behavioral data should consider the advantages of breaking free from
packaged analytics solutions.

Breaking out towards a more modular stack, made up of best-in-class tools,
makes it possible to build a strategic data capability that can sit centrally at
the heart of the organization, empowering multiple teams and use cases. With
this approach, your data is no longer in the hands of a third party. Your data,
your data infrastructure and your overall data strategy belong to you and your
organization. It’s this level of control and oversight that opens the door to new
possibilities – bringing data closer to the user experience and the potential to
use behavioral data, not just to generate insights, but to enhance products.

Moving towards building a strategic data capability is as much a cultural shift
– a change in an organization’s mindset – as a technological one. It involves
a transition from perceiving the data team as a cost center or IT function to
seeing it as a strategic resource that can empower every aspect of the company.

While there is too much to be said on this subject to cover it sufficiently here,
the goal of the strategic data capability is to create a centralized, high-quality
data asset that can provide insights, power use cases and inform decisions
for all internal teams.

The first step for companies embarking on this path is to take full control of
their data. Built from the ground up with ownership and flexibility in mind,
Snowplow is a solution that can help data teams make this crucial step on
their data journey.


Why Snowplow belongs in the


modern web analytics stack
Snowplow is the preeminent behavioral data management platform, built to
put data teams back in the driving seat of their web data. With Snowplow,
data teams can capture and manage rich, high-quality web data in a way that
makes it easy for analysts and other data consumers to use and understand.
Snowplow treats behavioral data differently to packaged analytics solutions
because it was designed to handle data as a company’s most important asset.

There are multiple reasons why Snowplow is the solution of choice for
modern web analytics. The following examples are just the beginning.

Total control and flexibility


Snowplow puts you in control of your data. It’s up to you how to collect your
data, with multiple trackers at your disposal for web, mobile, server, IoT and
more. Then you have complete flexibility over how you structure, model and
store your data.

It’s your choice how the data is used – for whatever use case or company goal you
are striving for. Snowplow data is flexible and does not prescribe a particular
approach or assumption on how your data should be utilized. You decide how
the data should be modeled, and ultimately used, to grow your business.

“Thanks to the unlimited, real-time data points


Snowplow lets us gather, we can calculate individual user
footprints, and will soon offer users a more personalized
content space when they come to La Presse sites.”
– Hervé Mensah, Director - Data Science & Integration, La Presse

The best behavioral data set


Snowplow data is made up of events that register user interactions.
Snowplow events automatically capture 130 properties, making the data
uniquely rich. When it comes to web data, Snowplow lets you capture events
with first-party, server-side tracking. This means your data collection isn’t
affected by the restrictions of browser privacy measures or ad blockers, since
you don’t have to rely on third-party cookies.

“With Snowplow we are focused on extracting and


centralizing data from everywhere, ensuring data
quality to be able to stitch everything we need
together to get a complete picture.”
– Kevin James Parks, Data Engineer, Tourlane

Snowplow data arrives clean, well structured and ready to use in your data
warehouse. All data collected by Snowplow is validated by JSON schemas,
set up according to the requirements of your unique tracking plan. The result
is that behavioral data delivered by Snowplow requires little cleaning or
reformatting before your data consumers can put it to work.


Complete ownership of your data


and data infrastructure
Snowplow data never leaves your own cloud environment, giving you total
control over your data and data infrastructure. Your raw data is completely at
your disposal – it’s never concealed or difficult to obtain.

And because Snowplow infrastructure is yours, you can configure your data
pipeline in a way that makes sense for your business, with no vendor lock-in
or preference for certain tools.

With total ownership of your data and freedom over your end-to-end
infrastructure, you can choose how you’d prefer to work with your web data asset.

“The gist is that once you have all the relevant data
for each event, which is possible with Snowplow,
you can do whatever you want with it. Snowplow’s
importance will only continue to grow as we
customize our pipeline.”
– Rahul Jain, Principal Engineering Manager at Omio

Every organization will take a different approach to web data management.


But we believe it boils down to treating your web data as a strategic asset
that can (and should) be owned by you, opening the door to limitless
possibilities and use cases, far beyond basic reporting.

CHAPTER 8

HOW WELCOME
TO THE JUNGLE
TOOK OWNERSHIP
OF THEIR WEB DATA
WITH SNOWPLOW

How Welcome to the Jungle took ownership


of their web data with Snowplow
The web as we know it has changed. As we’ve discussed during the course of
this series, privacy updates, ad blockers and the changing web landscape
have made web analytics a more complex and challenging industry.

At the heart of this challenge is the decline of third-party cookies, rendering


packaged web analytics tools less reliable for getting a complete, accurate
view of your web data. Data mature companies should consider the
advantages of building a more modular stack, made up of best-in-class
solutions. Doing so can enable them to deploy behavioral data to empower
their internal teams, drive their use cases and equip their business to keep up
with the demands of customers today.

A good example of an organization that took these steps is Welcome to the


Jungle, a content hub and hiring platform taking a novel approach to the job
application process. Welcome to the Jungle’s product and business model
revolves around a central web application – combining editorial content with
an employer platform to create a seamless candidate experience.


Welcome to the Jungle at a glance


Welcome to the Jungle is a media company and jobs
board launched in 2015 in Paris that matches working
professionals with their dream employers.

The platform has grown rapidly with almost 2 million


monthly users and more than 3000 company profiles.

Welcome to the Jungle is not a typical hiring platform or jobs marketplace.


After landing on their website, the user encounters a hub of rich media
content around recruitment, job search and interview techniques. Look
closer, and you’ll see company listings or ‘employer profiles’, where
companies can give a glimpse of their size, culture and mission to entice
professionals to apply.


The Welcome to the Jungle website is a rich digital experience – a good


example of how far the web has come from 10 or 15 years ago. But this has
not happened by accident. The website’s structure, content and flow are the
result of carefully tracking user interactions, gradually building a picture of
how people behave across their website. Using these insights, Welcome to
the Jungle has been able to improve the web experience for job seekers and
employers alike.

This systematic, data-informed approach was born out of necessity.


Capturing complete web data was not only vital to internal business
intelligence and product teams at the company, but a key demand of their
clients (employer brands) who required accurate conversion rates in order to
see the impact of their job ads. Without accurate data to demonstrate the
value of their ads, clients would not see the true value of their ad spend.
Reliable web analytics was central to the company’s business model.

Read the full story: Download the Welcome to the Jungle case study


Breaking out of Google Analytics


Soon after their launch in 2015, Welcome to the Jungle rapidly grew to a
point where millions of users were interacting with their website each month.

Aurélien Rayer, the company’s Head of Data, quickly found that the
company’s Google Analytics stack was not able to keep up with the growing
demands of a fast-growth company.

One major problem was lack of ownership. Without owning their raw data, it
was impossible to combine data sets or implement user stitching to build the
full picture of the customer journey across all platforms and channels. But
worse than that – the data didn’t add up. Some users were going missing
altogether, conversion rates didn’t look accurate, and it was clear that the
behavioral data provided by Google Analytics wasn’t telling the whole story.

While this was a challenge internally, it was also a cause of concern for
clients. Employers wanted to know the conversion rates of their ads, and
what they could do to improve them.


Missing pieces in the puzzle


We’ve mentioned previously how relying on packaged tools can be problematic when it comes to delivering complete web data. With Google Analytics, the team at Welcome to the Jungle were experiencing this first-hand. This was more than a frustrating challenge – their business model and relationships with key clients relied upon delivering accurate metrics such as conversion rates, which clients used to monitor the performance of their employer profiles and to inform their decisions around investing in Welcome to the Jungle’s ad solution.

“Since working with Snowplow, we uncovered over 2 million user ids were flagged as bots”
– Aurélien Rayer, Head of Data, Welcome to the Jungle

With complete web data now seen as business-critical, Aurélien decided to move
to Snowplow for more reliable behavioral data capture. Welcome to the Jungle
could now deploy first-party, server-side tracking, finding that this enabled them
to overcome the restrictions of Safari’s ITP and other browser privacy measures.

With Snowplow, Welcome to the Jungle could track users without worrying
that third-party cookie data would be deleted, giving them a much more
complete view of user activity. Snowplow’s first-party tracking also meant
that Aurélien could track web visitors who used ad blockers (as long as they
granted their consent). These users could previously not be tracked with
Google Analytics, since ad blockers often automatically prevent tracking as a
side effect of blocking ads.


Once the Snowplow tracking was in place, Aurélien was able to compare his
new data set with the data sets he was capturing from Google Analytics. The
results were astonishing. Welcome to the Jungle gained visibility into 3%
more unique users each month – which amounted to around 600,000 users
who were previously invisible. In addition, Aurélien was able to examine user
behavior more closely to discover that over 2 million of the company’s user
ids were bots. This meant the platform could waste less time displaying ads
to bot accounts and focus their targeting on real users.

Welcome to the Jungle’s web analytics revolution:


• Over 2 million user ids flagged as bots
• Visibility of 3% more unique users per month
• Captured 5% more page views per month

The transition from third-party tracking with Google Analytics to first-party tracking with Snowplow had a huge impact on Welcome to the Jungle’s business. Now the product team could deliver accurate insights to clients, and take steps to help them improve their conversion rates where required.

But their web analytics transformation didn’t stop there. Welcome to the Jungle also benefited from the flexibility Snowplow offers when it comes to tracking key metrics. Using ‘page pings’, a feature of Snowplow’s core web tracker, Welcome to the Jungle gained a more accurate understanding of user engagement with their media articles than an out-of-the-box tool could offer. With Snowplow, Welcome to the Jungle were able to take ownership of their behavioral web data, and leverage first-party tracking to get a complete picture of their users and how they interact with their web content.


“With page pings from Snowplow we have a very precise way of measuring engagement on our articles. This is something we simply couldn’t do before with Google Analytics. I think this is one of the most interesting metrics we’ll see in terms of media analytics.”
– Aurélien Rayer, Head of Data, Welcome to the Jungle

Aurélien and his team were now equipped not only to capture web data in a way that made sense specifically for their organization, but to own their behavioral data set. This opens up a host of possibilities for the future, such as building a single view of the customer journey or powering content recommendation systems.

snowplowanalytics.com
