CH2 Descriptive Analytics QA PDF

# Descriptive Analytics - Case Study - Quality Alloys

QA used to do direct marketing only. They are now looking at a web presence. The goals of the web presence are to (a) drive sales, (b) make product and contact information available and (c) add legitimacy to the brand. They invested in promotional activities to generate interest in their website. The promotion cost them $25K. Before doing more promotion, QA would like to understand:

- How many people visit the website? How do they come to the website?
- Is the website generating interest, and does this interest yield actual sales?
- Do traditional promotions drive web traffic, which in turn drives sales?
- How can sales / visits be modelled?
- Where and how should QA advertise?

# Learning Objectives

You are an analyst hired by QA to answer the questions above. As an analyst, how will you go about answering them? You may want to restate the questions as shown below and answer them. You will apply Descriptive Analytics to answer business questions in the case study.

# Pre-reading for the class

HBR Article: Web Analytics at Quality Alloys, Inc. https://hbsp.harvard.edu/product/CU44-PDF-ENG

# Case Study Questions

- (a) Describe weekly visits and financials using visualization
- (b) Describe financials and behavior
- (c) Describe the financials and behavior by Period - initial, pre-promo, promo and post-promo
- (d) Describe the relationship b/w revenues and qty sold
- (e) Describe the relationship b/w revenues and visits
- (f) Analyze the qty sold and get its estimates
- (g) Analyze the visits and get its estimates
- (h) Analyze the visits by traffic source, search engine, geographic region, browsers and OS used.
In [1]:
# import pandas library

import pandas as pd
import numpy as np
from scipy.stats import pearsonr
# import visualization library

import matplotlib.pyplot as plt
import seaborn as sn
In [2]:
# Loading Data
# Data is available in quality_alloys
# Sheet "QA" contains weekly data related to visits and financials across periods
# initial, pre-promo, promo and post-promo
# Sheet "demographics" contains visit related data
alloys_df = pd.read_excel('quality_alloys.xlsx', sheet_name="QA")
demographics_df = pd.read_excel('quality_alloys.xlsx', sheet_name="demographics")
In [3]:
# Understanding the data

alloys_df.head(5)
Out[3]:
   Week       Type            Weeks  Visits  Unique_Visits  Pageviews  Pages/Visit  Time_on_Site  Bounce...
0     1  1_Initial  May 25 - May 31    1632           1509       3328         2.04            71
1     2  1_Initial    Jun 1 - Jun 7    1580           1450       3097         1.96            56
2     3  1_Initial   Jun 8 - Jun 14    1441           1306       3202         2.22            79
3     4  1_Initial  Jun 15 - Jun 21    1452           1301       3170         2.18            81
4     5  1_Initial  Jun 22 - Jun 28    1339           1255       2366         1.77            50
In [4]:
# Understanding the data

demographics_df.head(10)
Out[4]:
                    Top_10_Sites  S_Visits  Top_10_Engines  E_Visits      Top_10_Regions  R_Visits
0    googleads.g.doubleclick.net     15626          google     17681       South America     22616
1  pagead2.googlesyndication.com      8044           yahoo      1250    Northern America     17509
2                sedoparking.com      3138          search       592     Central America      6776
3                 globalspec.com       693             msn       424      Western Europe      5214
4   searchportal.information.com       582             aol       309        Eastern Asia      3228
5          freepatentsonline.com       389             ask       268     Northern Europe      2721
6                  thomasnet.com       379            live       145       Southern Asia      2589
7                         mu.com       344            bing       122  South-Eastern Asia      1968
8                mail.google.com       337           voila        63     Southern Europe      1538
9                   psicofxp.com       310        netscape        26      Eastern Europe      1427

In [5]:
# Size of the data
print("Alloys_DF Dimensions")
print(alloys_df.shape)
print("")
print("______________________________________")
print("")
print("Demographics_DF Dimensions")
print(demographics_df.shape)
print("")
print("______________________________________")
print("")
print("Alloys_DF Information")
print("")
print(alloys_df.info())
print("")
print("______________________________________")
print("")
print("Demographics_DF Information")
print("")
print(demographics_df.info())
Alloys_DF Dimensions
(66, 14)
______________________________________
Demographics_DF Dimensions
(10, 12)
______________________________________
Alloys_DF Information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66 entries, 0 to 65
Data columns (total 14 columns):
Week 66 non-null int64
Type 66 non-null object
Weeks 66 non-null object
Visits 66 non-null int64
Unique_Visits 66 non-null int64
Pageviews 66 non-null int64
Pages/Visit 66 non-null float64
Time_on_Site 66 non-null int64
Bounce_Rate_% 66 non-null float64
New_Visits_% 66 non-null float64
Revenue 66 non-null float64
Profit 66 non-null float64
Lbs_Sold 66 non-null float64
Inquiries 66 non-null int64
dtypes: float64(6), int64(6), object(2)
memory usage: 7.3+ KB
None
______________________________________
Demographics_DF Information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 12 columns):
Top_10_Sites 10 non-null object
S_Visits 10 non-null int64
Top_10_Engines 10 non-null object
E_Visits 10 non-null int64
Top_10_Regions 10 non-null object
R_Visits 10 non-null int64
Top_10_Browsers 10 non-null object
B_Visits 10 non-null int64
Top_10_OS 10 non-null object
OS_Visits 10 non-null int64
Traffic_Source 4 non-null object
T_Visits 4 non-null float64
dtypes: float64(1), int64(5), object(6)
memory usage: 1.1+ KB
None
In [6]:
# Describe weekly visits and financials using visualization
# Setting figure size
fig, ax = plt.subplots(figsize=(20, 8))
# We may use bar charts or line charts for this purpose
# Draw on the axes created above instead of overwriting the name fig
sn.barplot(x='Weeks', y='Visits', hue='Type', data=alloys_df, ax=ax)
ax.legend(loc=1)
ax.set_xlabel('Weeks', fontsize=18)
ax.set_ylabel('Visits', fontsize=18)
ax.set_title('Visits over Time', fontsize=20)
ax.set_xticklabels(labels=alloys_df.Weeks, rotation=45, ha='right', fontsize=8);
In [7]:
# Instead of Matplotlib let's use plotly
# Before running plotly, install it from the command prompt with
# pip install plotly==4.8.2, or install it from Anaconda
import plotly.express as px
# Visits over time
fig_1 = px.bar(alloys_df, x='Weeks', y='Visits',
               hover_data=['Pageviews', 'Unique_Visits'], color='Type',
               height=600, title="Visits Over Time")
fig_1.show()
[Figure: bar chart "Visits Over Time"; x-axis Weeks, y-axis Visits, bars colored by Type]
In [8]:
# Describe weekly Unique visits and financials using visualization


fig = sn.barplot(x = 'Weeks', y = 'Unique_Visits', hue = 'Type', data = alloys_df)
fig.legend(loc=1)
plt.ylabel('Unique Visits', fontsize=18)
plt.title('Unique Visits over Time',fontsize=20)
In [9]:
# Unique Visits over time
fig_2 = px.bar(alloys_df, x='Weeks', y='Unique_Visits',
               hover_data=['Pageviews', 'Visits'], color='Type',
               height=600, title="Unique Visits Over Time")
fig_2.show()
[Figure: bar chart "Unique Visits Over Time"; x-axis Weeks, y-axis Unique_Visits, bars colored by Type]
In [10]:
# Describe weekly Revenue and financials using visualization


fig = sn.barplot(x = 'Weeks', y = 'Revenue', hue = 'Type', data = alloys_df)
fig.legend(loc=1)
plt.ylabel('Revenue', fontsize=18)
plt.title('Revenue over Time',fontsize=20)
In [11]:
# Revenues over time
fig_3 = px.bar(alloys_df, x='Weeks', y='Revenue',
               hover_data=['Lbs_Sold', 'Profit'], color='Type',
               height=600, title="Revenues Over Time")
fig_3.show()
[Figure: bar chart "Revenues Over Time"; x-axis Weeks, y-axis Revenue, bars colored by Type]
Inference:

1. Visits and unique visits show a similar pattern.
2. Visits and unique visits increased during the promotion period and then decreased.
3. Visits and unique visits in the post-promo period settled at levels higher than in the pre-promo period.
4. Sales decreased during the promo period.

You may also plot other parameters such as page views, time on site, pages per visit, quantity sold and profit.
In [12]:
# (b) Describe financials and behavior

# Descriptive statistics of financials and other behavioral variables
alloys_df.describe(include=np.number)
Out[12]:
Week Visits Unique_Visits Pageviews Pages/Visit Time_on_Site Bounce_
count 66.000000 66.000000 66.000000 66.000000 66.000000 66.000000 6
mean 33.500000 1051.984848 989.196970 2173.227273 2.258485 74.939394 6
std 19.196354 638.118665 621.022983 831.306827 0.412550 22.531263
min 1.000000 383.000000 366.000000 793.000000 1.420000 28.000000 53
25% 17.250000 596.000000 540.000000 1602.250000 2.025000 59.750000 6
50% 33.500000 842.000000 790.000000 1910.000000 2.235000 75.500000 65
75% 49.750000 1243.750000 1175.000000 2410.500000 2.575000 92.500000 73
max 66.000000 3726.000000 3617.000000 5291.000000 3.180000 120.000000 85
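
The summary above shows a maximum of 3726 weekly visits against a 75th percentile of 1243.75, which hints at a few unusually busy weeks. One way to check this, not used in the original notebook, is Tukey's 1.5 x IQR rule. The sketch below simply plugs in the quartiles printed by `describe()`; the variable names are illustrative.

```python
# Quartiles and maximum of Visits, copied from the describe() output above
q1_visits = 596.0
q3_visits = 1243.75
max_visits = 3726.0

# Tukey's rule (an added technique, not part of the original analysis):
# values above Q3 + 1.5 * IQR are flagged as unusually high
iqr = q3_visits - q1_visits
upper_fence = q3_visits + 1.5 * iqr

print("IQR:", iqr)
print("Upper fence:", upper_fence)
print("Max exceeds fence:", max_visits > upper_fence)
```

If the maximum lies above the fence, the mean of Visits is probably pulled up by a handful of weeks, so the median may be the more representative summary.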

In [13]:
# (c) Describe the financials and behavior by Period - initial, pre-promo, promo
# and post-promo
# Let us compare behavioural and financial values across four periods
values = ['Unique_Visits', 'Visits', 'Time_on_Site', 'Bounce_Rate_%', 'New_Visits_%',
          'Revenue', 'Profit', 'Lbs_Sold', 'Inquiries']
index = ['Type']
# String names are used instead of np.mean, which newer pandas versions warn about
aggfunc = {'Unique_Visits': 'mean',
           'Visits': 'mean',
           'Time_on_Site': 'mean',
           'Bounce_Rate_%': 'mean',
           'New_Visits_%': 'mean',
           'Revenue': 'mean',
           'Profit': 'mean',
           'Lbs_Sold': 'mean',
           'Inquiries': 'mean'
           }
result = pd.pivot_table(alloys_df, values=values, index=index, aggfunc=aggfunc,
                        fill_value=0)
result = result.round(2)
result
# Looking back at business questions

# How many people visit the website? How do they come to the website?
# The first answer is in the Visits column of the summary table; the second is
# not answered by this table (see part (h))
# Is the website generating interest and does this interest yield actual sales?
# Yes, it is generating interest, as seen in the increased number of visits
# during promotion
# But it may not be increasing sales. See Revenue in the promo period: it has
# actually decreased.
Out[13]:
              Bounce_Rate_%  Inquiries  Lbs_Sold  New_Visits_%     Profit    Revenue  Time_on_...
Type
1_Initial             67.28       7.29  18736.73         86.80  200233.41  608250.12  8...
2_Pre-promo           59.41       6.48  18440.77         83.88  159932.03  534313.52  9...
3_Promo               77.28       6.35  17112.92         91.05  131929.90  456398.85  4...
4_Post-Promo          66.33       5.43  14577.79         86.34  111045.82  371728.02  7...
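
To put a number on "sales decreased during the promo period", the period means can be compared as relative changes. The sketch below is an illustration only: it copies the pre-promo and promo means from the table above into plain dictionaries, and the helper `pct_change` is a hypothetical name, not part of the original notebook.

```python
# Period means copied from the pivot table above
pre_promo = {"Revenue": 534313.52, "Profit": 159932.03, "Lbs_Sold": 18440.77}
promo = {"Revenue": 456398.85, "Profit": 131929.90, "Lbs_Sold": 17112.92}


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Change of each metric from the pre-promo period to the promo period
changes = {name: round(pct_change(pre_promo[name], promo[name]), 2)
           for name in pre_promo}

for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```

All three changes come out negative, which is consistent with the observation that the promotion raised traffic without raising sales.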
In [14]:
# (d) Describe the relationship b/w revenues and qty sold

# Use Scatter plot
plt.scatter(alloys_df['Lbs_Sold'], y = alloys_df['Revenue']);
plt.xlabel('Quantity Sold');
plt.ylabel('Revenues');
# Inference - Revenue increases with Qty Sold
In [15]:
corr, p_val = pearsonr(alloys_df['Lbs_Sold'], alloys_df['Revenue'])

print(corr,p_val)
0.8689297128616138 3.2149183975865717e-21
In [16]:
# (e) Describe the relationship b/w revenues and visits

# Use Scatter plot
plt.scatter(alloys_df['Visits'], y = alloys_df['Revenue']);
plt.xlabel('Number of Visits');
plt.ylabel('Revenues');
# Inference - Revenue does not seem to be related with the number of visits
# Revisiting Business Question

# Do traditional promotions drive web traffic, which in turn drives sales?
# They seem to increase web traffic, but that does not in turn drive sales
# In fact, the correlation coefficient is slightly negative (about -0.06) and
# not statistically significant (p-value about 0.64)
In [17]:
corr, p_val = pearsonr(alloys_df['Visits'], alloys_df['Revenue'])

print(corr,p_val)
-0.05939183049878598 0.6357131002032045
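
To make the two correlation results easier to read, the sketch below computes Pearson's r directly from its definition and converts a coefficient into the t statistic with n - 2 degrees of freedom that is commonly used to test for zero correlation. It is a rough illustration: the helper names and the toy data are hypothetical, and `pearsonr` itself should be relied on for the actual p-values.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed directly from its definition."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for H0: no linear correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Toy data (hypothetical, not from the case): a perfectly linear relationship
toy_r = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Coefficients reported in this notebook for the 66 weekly observations
t_visits = t_statistic(-0.05939183049878598, 66)  # Visits vs Revenue
t_lbs = t_statistic(0.8689297128616138, 66)       # Lbs_Sold vs Revenue

print(toy_r, round(t_visits, 3), round(t_lbs, 1))
```

A t statistic close to zero, as for Visits vs Revenue, goes with a large p-value; a t statistic far from zero, as for Lbs_Sold vs Revenue, goes with a very small one.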
In [18]:
# (f) Analyze the qty sold and get its estimates

# We shall use histograms and density plots for this purpose
# Histogram
plt.hist( alloys_df['Lbs_Sold']);
In [19]:
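# Density Plot
# distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws the same histogram plus density curve
sn.histplot(alloys_df['Lbs_Sold'], kde=True, stat='density', color='green');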
In [20]:
# Business question - How can sales / visits be modelled?
# Estimates for Lbs_Sold

print("Mean: ",round(alloys_df.Lbs_Sold.mean(),2))
print("Median: ",round(alloys_df.Lbs_Sold.median(),2))
print("Std_Dev: ",round(alloys_df.Lbs_Sold.std(),2))
print("Skewness: ",round(alloys_df.Lbs_Sold.skew(),2))
print("Kurtosis: ",round(alloys_df.Lbs_Sold.kurtosis(),2))
# Inference : Close to normal distribution
# Interval estimate (central interval of a normal distribution fitted to the data)
print("")
print("Confidence Interval")
from scipy import stats
print(stats.norm.interval(0.50, loc=alloys_df.Lbs_Sold.mean(),
                          scale=alloys_df.Lbs_Sold.std()))
# What does this say

# If Lbs_Sold is roughly normal, there is a 50% chance that the qty sold in a
# week will be in the range of about 13249 to 21436.
# But there is an underlying trend in the data
# It's better to look at the last few weeks than the whole data in this case
print("")
print("Original Values")
print(alloys_df.Lbs_Sold.to_numpy().round(2))
print("")
# Moving Average
print("Estimates using moving average")
print(alloys_df.Lbs_Sold.rolling(window=3).mean().to_numpy().round(2))
Mean: 17342.11
Median: 17215.73
Std_Dev: 6068.91
Skewness: 0.33
Kurtosis: -0.19
Confidence Interval
(13248.690144531636, 21435.52997668048)
Original Values
[16585.18 18906.38 28052.92 19382.31 24274.25 15308.72  8633.06 17216.34
 17308.57 24571.17 14389.77 17230.83 13801.99 26652.7  12402.83 11695.44
 26362.02 15771.65 31968.98 15531.27 19734.21 17192.88 22591.28  8992.42
 19104.31 21454.99 18783.76 14298.02 17215.12 27256.95 11292.47 20147.79
 16453.81 12702.59 26303.49 22198.86 16535.15  7814.05 28041.31 31496.26
 10181.38  9727.18 18323.31 17299.12 13862.04 19006.92 23283.78 17374.42
 18194.93  9176.38 12880.99 15523.62 15406.28 14535.71 10397.18 11054.22
 17854.91  7197.15  3825.75 22820.65 12758.08 22324.76 18565.9  12294.4
 11292.5  23761.61]
Estimates using moving average
[     nan      nan 21181.5  22113.87 23903.16 19655.09 16072.01 13719.37
 14385.99 19698.69 18756.5  18730.59 15140.86 19228.51 17619.17 16916.99
 16820.09 17943.04 24700.88 21090.63 22411.49 17486.12 19839.46 16258.86
 16896.01 16517.24 19781.02 18178.92 16765.63 19590.03 18588.18 19565.74
 15964.69 16434.73 18486.63 20401.64 21679.16 15516.02 17463.5  22450.54
 23239.65 17134.94 12743.96 15116.54 16494.82 16722.69 18717.58 19888.37
 19617.71 14915.24 13417.43 12526.99 14603.63 15155.21 13446.39 11995.7
 13102.1  12035.43  9625.94 11281.18 13134.82 19301.16 17882.91 17728.35
 14050.94 15782.84]
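
The interval printed under "Confidence Interval" can be reproduced without SciPy, which helps show what it means: it is the range between the 25th and 75th percentiles of a normal distribution with the sample mean and standard deviation, not a confidence interval for the mean. The sketch below assumes the rounded mean and standard deviation printed above, so the last decimals may differ slightly from the notebook output.

```python
from statistics import NormalDist

# Mean and standard deviation of Lbs_Sold as printed above (rounded values)
lbs_dist = NormalDist(mu=17342.11, sigma=6068.91)

# Central 50% interval: 25th to 75th percentile of the fitted normal
lower, upper = lbs_dist.inv_cdf(0.25), lbs_dist.inv_cdf(0.75)
print(round(lower, 2), round(upper, 2))

# A wider central 95% interval, for comparison
lower_95, upper_95 = lbs_dist.inv_cdf(0.025), lbs_dist.inv_cdf(0.975)
print(round(lower_95, 2), round(upper_95, 2))
```

Raising the coverage from 50% to 95% widens the range considerably, which is worth keeping in mind before using either interval as a weekly sales estimate.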
In [21]:
(16585.18+18906.38+28052.92)/3
Out[21]:
21181.493333333332
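
The hand calculation above is the first non-missing value of the 3-week moving average. As a minimal sketch of what `rolling(window=3).mean()` does, the function below (a hypothetical helper, not part of the notebook) averages each value with the two before it. It uses the first five printed Lbs_Sold values, which are already rounded, so the last digit can differ slightly from the pandas output.

```python
def moving_average(values, window=3):
    """Trailing moving average; the first window - 1 entries are None."""
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)  # not enough history yet (pandas shows NaN)
        else:
            chunk = values[i + 1 - window:i + 1]
            averages.append(round(sum(chunk) / window, 2))
    return averages


# First five weekly Lbs_Sold values from the output above
lbs_sold = [16585.18, 18906.38, 28052.92, 19382.31, 24274.25]
smoothed = moving_average(lbs_sold)
print(smoothed)
```

A larger window smooths more but reacts more slowly to recent changes, so the window length is a judgment call rather than a fixed rule.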
In [22]:
# (g) Analyze the Visits and get its estimates

# We shall use histograms and density plots for this purpose
# Histogram
plt.hist( alloys_df['Visits']);
In [23]:
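# Density Plot
# distplot is deprecated in recent seaborn releases; histplot with kde=True
# draws the same histogram plus density curve
sn.histplot(alloys_df['Visits'], kde=True, stat='density', color='green');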
In [24]:
# Estimates for Visits

print("Mean: ",round(alloys_df.Visits.mean(),2))
print("Median: ",round(alloys_df.Visits.median(),2))
print("Std_Dev: ",round(alloys_df.Visits.std(),2))
print("Skewness: ",round(alloys_df.Visits.skew(),2))
print("Kurtosis: ",round(alloys_df.Visits.kurtosis(),2))
# Inference : Not normal
print("")
print("Original Values")
print(alloys_df.Visits.to_numpy().round(2))
print("")
# Moving Average
print("Estimates using moving average")
print(alloys_df.Visits.rolling(window=3).mean().to_numpy().round(2))
Mean: 1051.98
Median: 842.0
Std_Dev: 638.12
Skewness: 2.04
Kurtosis: 4.93
Original Values
[1632 1580 1441 1452 1339  892  797  744 1044  906  849  737  734  626
  577  562  563  652  611  561  558  570  551  537  543  558  536  549
  545  591  383  402  547  631  795 1000 1207 2317 2013 2324 3726 2563
 3006 1663 1779 1086 1231 1248 1674 1514 1302 1191  957  963  882  942
  835  802  806  900  860  924  792  781  776  772]
Estimates using moving average
[    nan     nan 1551.   1491.   1410.67 1227.67 1009.33  811.    861.67
  898.    933.    830.67  773.33  699.    645.67  588.33  567.33  592.33
  608.67  608.    576.67  563.    559.67  552.67  543.67  546.    545.67
  547.67  543.33  561.67  506.33  458.67  444.    526.67  657.67  808.67
 1000.67 1508.   1845.67 2218.   2687.67 2871.   3098.33 2410.67 2149.33
 1509.33 1365.33 1188.33 1384.33 1478.67 1496.67 1335.67 1150.   1037.
  934.    929.    886.33  859.67  814.33  836.    855.33  894.67  858.67
  832.33  783.    776.33]
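
The "Not normal" inference for Visits rests on the gap between mean and median and on the large positive skewness. The sketch below recomputes these from the weekly values printed above using only the standard library. The skewness helper is an assumption: it uses the bias-adjusted formula that pandas' `Series.skew()` is understood to apply, so small differences from the notebook output are possible.

```python
import math
from statistics import mean, median

# Weekly Visits, copied from the "Original Values" output above
visits = [
    1632, 1580, 1441, 1452, 1339, 892, 797, 744, 1044, 906, 849, 737, 734, 626,
    577, 562, 563, 652, 611, 561, 558, 570, 551, 537, 543, 558, 536, 549,
    545, 591, 383, 402, 547, 631, 795, 1000, 1207, 2317, 2013, 2324, 3726, 2563,
    3006, 1663, 1779, 1086, 1231, 1248, 1674, 1514, 1302, 1191, 957, 963, 882, 942,
    835, 802, 806, 900, 860, 924, 792, 781, 776, 772,
]


def sample_skewness(values):
    """Bias-adjusted sample skewness (assumed to match pandas' Series.skew())."""
    n = len(values)
    avg = sum(values) / n
    m2 = sum((v - avg) ** 2 for v in values)  # sum of squared deviations
    m3 = sum((v - avg) ** 3 for v in values)  # sum of cubed deviations
    return n * math.sqrt(n - 1) / (n - 2) * m3 / m2 ** 1.5


print("Mean:", round(mean(visits), 2))
print("Median:", median(visits))
print("Skewness:", round(sample_skewness(visits), 2))
```

A mean well above the median together with positive skewness points to a long right tail, here driven by the promotion weeks, which is why a normal-based interval is not computed for Visits.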
In [25]:
# Business Question - Where and how should QA advertise?

# Where do visitors come from ?
# (h) Analyze the visits by traffic source, search engine, geographic region,
# browsers and OS used.
# We can use pie charts for this purpose
# Before running plotly, install it from the command prompt with
# pip install plotly==4.8.2
import plotly.express as px
fig1 = px.pie(demographics_df, values='S_Visits', names='Top_10_Sites',
              title='Visits by Referring Sites')
fig1.show()
[Figure: pie chart "Visits by Referring Sites"; the two largest slices are labeled 52.4% and 27%]
In [26]:
fig2 = px.pie(demographics_df, values='E_Visits', names='Top_10_Engines',
              title='Visits by Search Engine')
fig2.show()
[Figure: pie chart "Visits by Search Engine"]
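
The slice sizes in a pie chart are each category's share of the total. As a small sketch of that calculation, the code below recomputes the search engine shares from the E_Visits column shown in the `demographics_df.head(10)` output; the dictionary is typed in by hand for illustration rather than read from the workbook.

```python
# Search engine visits copied from the demographics table shown earlier
engine_visits = {
    "google": 17681, "yahoo": 1250, "search": 592, "msn": 424, "aol": 309,
    "ask": 268, "live": 145, "bing": 122, "voila": 63, "netscape": 26,
}

total = sum(engine_visits.values())

# Percentage share of each engine, i.e. the slice sizes a pie chart shows
shares = {name: 100 * count / total for name, count in engine_visits.items()}

for name, share in shares.items():
    print(f"{name:<10}{share:6.2f}%")
```

When one category dominates, as a single engine does here, a sorted table or bar chart may be easier to read than a pie with many thin slices.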
In [27]:
fig3 = px.pie(demographics_df, values='R_Visits', names='Top_10_Regions',
              title='Visits by Region')
fig3.show()
[Figure: pie chart "Visits by Region"; the two largest slices are labeled 34.5% and 26.7%]
In [28]:
fig4 = px.pie(demographics_df, values='B_Visits', names='Top_10_Browsers',
              title='Visits by Browser')
fig4.show()
[Figure: pie chart "Visits by Browser"]
In [29]:
fig5 = px.pie(demographics_df, values='OS_Visits', names='Top_10_OS',
              title='Visits by OS')
fig5.show()
[Figure: pie chart "Visits by OS"]
In [30]:
fig6 = px.pie(demographics_df, values='T_Visits', names='Traffic_Source',
              title='Visits by Traffic Source')
fig6.show()
[Figure: pie chart "Visits by Traffic Source"]
In [ ]:
