You are on page 1of 12

21/02/2023, 03:05 final_project

Evaluating StockX Shoes


Demand of some shoes data to evaluate the Market
First we need to install some libraries.
1. Panda - to import data
pip install pandas
pip install pyreadstat
pip install matplotlib
pip install scipy
pip install numpy
pip install openpyxl
pip install seaborn
pip install plotly
pip install us
pip install re

Then we import those libraries


In [2]: import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
from plotly.offline import plot
from datetime import datetime
import us
import re
import numpy
from mpl_toolkits.mplot3d import Axes3D
import warnings
import statsmodels.api as sm
warnings.filterwarnings('ignore', category=DeprecationWarning)

Importing the data


In [3]: # Load the excel file
df = pd.read_excel('dataset/StockX-Data-Contest-2019-main.xlsx')

# Let's have a look at the first 5 rows of the dataset:


df.head()

file:///Users/PRASI/Downloads/final_project (1).html 1/12


21/02/2023, 03:05 final_project

Out[3]: Order Brand Sneaker Name Sale Retail Release Shoe Buyer
Date Price Price Date Size Region
2017- Yeezy Adidas-Yeezy-
0 09-01 Boost-350-Low-V2- 1097.0 220 2016-09-
24 11.0 California
Beluga
2017- Yeezy Adidas-Yeezy-
1 09-01 Boost-350-V2- 685.0 220 2016-11-
23 11.0 California
Core-Black-Copper
2017- Yeezy Adidas-Yeezy-
2 09-01 Boost-350-V2- 690.0 220 2016-11-
23 11.0 California
Core-Black-Green
2017- Yeezy Adidas-Yeezy-
3 09-01 Boost-350-V2- 1075.0 220 2016-11-
23 11.5 Kentucky
Core-Black-Red
Adidas-Yeezy-
2017- Yeezy
4 09-01 Boost-350-V2- 828.0 220 2017-02-11 11.0 Rhode
Core-Black-Red- Island
2017

About the Dataset:


The dataset provided is a sample of all Off-White x Nike and Yeezy 350 sales from
between 9/1/2017 and the present in the United States. There are a total of 99,956
sales in the dataset, with 27,794 Off-White sales and 72,162 Yeezy sales. The dataset
includes 8 variables: Order Date, Brand, Sneaker Name, Sale Price ( ), RetailP rice(

), Release Date, Shoe Size, and Buyer State. Each row represents an individual sale
on the StockX platform, and the dataset only includes sales within the United States.
The Order Date column represents the date the order was placed, while the Brand
column specifies whether the sale was for an Off-White x Nike or Yeezy 350 shoe.
The Sneaker Name column provides information on the specific model of the shoe
sold. The Sale Price column indicates the amount of money that the buyer paid for
the shoe, while the Retail Price column specifies the manufacturer's suggested retail
price for the shoe. The Release Date column indicates when the shoe was first
released. The Shoe Size column provides information on the size of the shoe sold.
The Buyer State column specifies the state in which the buyer resides.
Overall, this dataset provides valuable insights into the demand and pricing of two
popular shoe brands, and could be used to identify trends and patterns in consumer
behavior.
In [4]: # Convert Order_Date to datetime format
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Release Date'] = pd.to_datetime(df['Release Date'])

# Extract week from Order_Date


df['Week'] = df['Order Date'].dt.isocalendar().week

# Aggregate the data by Sneaker_name and Week to get the total sales of e
agg_df = df.groupby(['Sneaker Name', 'Week']).agg({'Sale Price': 'sum'}).

file:///Users/PRASI/Downloads/final_project (1).html 2/12


21/02/2023, 03:05 final_project

# Print the aggregated data


agg_df.head()

Out[4]: Sneaker Name Week Sale Price


0 Adidas-Yeezy-Boost-350-Low-Moonrock 1 3860.0
1 Adidas-Yeezy-Boost-350-Low-Moonrock 2 3050.0
2 Adidas-Yeezy-Boost-350-Low-Moonrock 3 3250.0
3 Adidas-Yeezy-Boost-350-Low-Moonrock 4 3140.0
4 Adidas-Yeezy-Boost-350-Low-Moonrock 5 1600.0
Here we have aggregated the data by Sneaker_name and Week to get the total sales
of each sneaker model for each week.
In [5]: sales_by_sneaker_week = agg_df.groupby(['Sneaker Name', 'Week']).sum()['S

# Reshape the data to create a pivot table


pivot_table = sales_by_sneaker_week.unstack(level=0)

# Let's create a subset of the above table to plot the sales of 10 random
column_names = pivot_table.columns[1:].tolist()
selected_columns = pd.Series(column_names).sample(n=5, random_state=42)
pivot_table = pd.concat([pivot_table.iloc[:,0], pivot_table[selected_colu

pivot_table.head()

Out[5]: Adidas- Adidas- Nike- Air-Jordan-


Yeezy-
Yeezy- Boost-350- Zoom- adidas- Nike-Zoom- 1-Retro-
Boost- Yeezy-
Fly-Off- Boost-350- Fly-Off- High-Off-
350-Low- V2-Semi-
Frozen- White- V2-Static
White-
Black-Silver
White-
University-
Moonrock Yellow Pink Blue
Week
1 3860.0 86891.0000 13064.0 214690.0000 21195.7100 33891.0000
2 3050.0 71768.0000 15474.0 155469.0000 16246.0000 38320.0000
3 3250.0 70202.0000 12388.0 154020.0000 14588.0000 30994.0000
4 3140.0 66063.2696 10431.0 99448.6392 11761.0000 29757.0000
5 1600.0 62744.0000 10557.0 76537.0651 12569.0376 31333.0483

In [6]: # Plot a line chart


pivot_table.plot(kind='line', figsize=(10, 6))

# Set the chart title and axis labels


plt.title('Sales by Sneaker Name and Week')
plt.xlabel('Week')
plt.ylabel('Sales')
plt.show()

file:///Users/PRASI/Downloads/final_project (1).html 3/12


21/02/2023, 03:05 final_project

In [7]: # Let's try to create a heatmap to visualize the sales of the 10 randomly
# Pivot data to create a matrix with "Sneaker Name" as rows, "Week" as co
pivot_df = agg_df.pivot_table(index="Sneaker Name", columns="Week", value

# Create heatmap using seaborn


sns.heatmap(pivot_df, cmap="coolwarm")

# Show the plot


plt.show()

Observing
States. the sales of sneaker models by US
In [8]: # Group the data by Buyer_region and calculate the total sales for each r
region_sales = df.groupby('Buyer Region')['Sale Price'].sum().reset_index

# Create a dictionary mapping state names to their two-letter codes


state_dict = {state.name: state.abbr for state in us.states.STATES}

file:///Users/PRASI/Downloads/final_project (1).html 4/12


21/02/2023, 03:05 final_project

# Convert the 'Buyer_region' column to state codes


region_sales['state_code'] = region_sales['Buyer Region'].apply(lambda x:

# Create a choropleth map using the plotly express library


fig = px.choropleth(locationmode='USA-states', locations=region_sales['st
# plot(fig) # To plot the map in a new window
fig.show(renderer="notebook")

color
9M

8M

7M

6M

5M

4M

3M

2M

1M

Pre-Processing the original dataset for Regression


In [9]: df = pd.read_excel('dataset/StockX-Data-Contest-2019-main.xlsx')
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Release Date'] = pd.to_datetime(df['Release Date'])
start_date = pd.to_datetime('2017-09-01')
df['Month'] = df['Order Date'].apply(lambda x: (x.year - start_date.year)

In [10]: # calculate the number of days since the release date


processed_df = df.copy()
substring_show_names = ['Adidas-Yeezy-Boost-350-Low', 'Adidas-Yeezy-Boost
# Aggregate test_df by substring_show_names and average the Sale Price, s
processed_df['Sneaker Name'] = processed_df['Sneaker Name'].str.extract('

processed_df['Days Since Release'] = (processed_df['Order Date'] - proces


processed_df.drop('Release Date', axis=1, inplace=True)
processed_df['State Code'] = df['Buyer Region'].apply(lambda x: state_dic

file:///Users/PRASI/Downloads/final_project (1).html 5/12


21/02/2023, 03:05 final_project

processed_df.drop('Buyer Region', axis=1, inplace=True)


processed_df.to_csv('dataset/stockx_data.csv', index=False)
processed_df.drop('Shoe Size', axis=1, inplace=True)
processed_df.drop('Retail Price', axis=1, inplace=True)
processed_df.drop('Brand', axis=1, inplace=True)
processed_df.drop('Order Date', axis=1, inplace=True)

# processed_df.head()
final_df = processed_df.copy()
final_df['Number of Sales'] = 1
final_df = final_df.groupby(['Month', 'Sneaker Name']).agg({'Sale Price':
final_df.head()

Out[10]:
Month Sneaker Name Sale Price Days Since Number of
Release Sales
0 0 Adidas-Yeezy-Boost-350-Low 1095.068182 457.818182 44
1 0 Adidas-Yeezy-Boost-350-V2 633.195329 198.233546 471
2 0 Air-Jordan-1-Retro-High-Off-
White-Chicago 1964.707317 9.219512 41
3 0 Nike-Air-Max-90 872.323529 8.470588 34
4 0 Nike-Air-Presto 1220.595238 8.523810 42

In [11]: test_df = df.copy()


test_df['Number of Sales'] = 1

substring_show_names = ['Adidas-Yeezy-Boost-350-Low', 'Adidas-Yeezy-Boost


# Aggregate test_df by substring_show_names and average the Sale Price, s
test_df['Sneaker Name'] = test_df['Sneaker Name'].str.extract('({})'.form

test_df = test_df.groupby(['Sneaker Name']).agg({'Sale Price': 'mean', 'N

# test_df.head()

# Sort by number of sales


test_df = test_df.sort_values(by='Sneaker Name', ascending=True)
# Create a subset of the data with "Sneaker Name" containing the word "Ni
# test_df = test_df[test_df['Sneaker Name'].str.contains('Nike')]
# Print the top 15 rows
test_df.head(60)

file:///Users/PRASI/Downloads/final_project (1).html 6/12


21/02/2023, 03:05 final_project

Out[11]: Sneaker Name Sale Price Number of Sales


0 Adidas-Yeezy-Boost-350-Low 915.546695 953
1 Adidas-Yeezy-Boost-350-V2 352.598103 71209
2 Air-Jordan-1-Retro-High 1026.032711 5703
3 Nike-Air-Force-1-Low 507.668735 2486
4 Nike-Air-Max-90 595.522176 1998
5 Nike-Air-Max-97 723.300866 1392
6 Nike-Air-Presto 758.068462 4363
7 Nike-Air-Vapormax 646.502682 3429
8 Nike-Blazer-Mid 602.877967 3622
9 Nike-React-Hyperdunk-2017-Flyknit 494.946281 484
10 Nike-Zoom-Fly 325.560268 4317

In [12]: sneakers = ['Air-Jordan-1-Retro-High-Off-White-Chicago']


df_analysis = final_df.copy()
df_analysis = df_analysis[df_analysis['Sneaker Name'].isin(sneakers)]

# Convert every NaN value to 0 in the dataset


df_analysis = df_analysis.fillna(0)

df_analysis.head(30)

file:///Users/PRASI/Downloads/final_project (1).html 7/12


21/02/2023, 03:05 final_project

Out[12]:
Month Sneaker Name Sale Price Days Since Number of
Release Sales
2 0 Air-Jordan-1-Retro-High- 1964.707317 9.219512 41
Off-White-Chicago
9 1 Air-Jordan-1-Retro-High- 1873.900000 34.000000 20
Off-White-Chicago
17 2 Air-Jordan-1-Retro-High- 1296.276042 71.317708 192
Off-White-Chicago
28 3 Air-Jordan-1-Retro-High- 1293.887850 94.383178 107
Off-White-Chicago
39 4 Air-Jordan-1-Retro-High- 1545.723077 128.400000 65
Off-White-Chicago
50 5 Air-Jordan-1-Retro-High- 1676.348837 158.767442 43
Off-White-Chicago
62 6 Air-Jordan-1-Retro-High- 1877.300000 186.725000 40
Off-White-Chicago
74 7 Air-Jordan-1-Retro-High- 2118.551724 218.448276 29
Off-White-Chicago
86 8 Air-Jordan-1-Retro-High- 2344.057143 250.600000 35
Off-White-Chicago
99 9 Air-Jordan-1-Retro-High- 2172.407407 279.962963 27
Off-White-Chicago
112 10 Air-Jordan-1-Retro-High- 2166.571429 308.500000 14
Off-White-Chicago
125 11 Air-Jordan-1-Retro-High- 2471.884615 342.807692 26
Off-White-Chicago
138 12 Air-Jordan-1-Retro-High- 2324.928571 370.500000 14
Off-White-Chicago
151 13 Air-Jordan-1-Retro-High- 2379.625000 404.125000 16
Off-White-Chicago
164 14 Air-Jordan-1-Retro-High- 2397.391304 432.826087 23
Off-White-Chicago
177 15 Air-Jordan-1-Retro-High- 2488.047619 463.666667 21
Off-White-Chicago
190 16 Air-Jordan-1-Retro-High- 2504.708333 491.166667 24
Off-White-Chicago
203 17 Air-Jordan-1-Retro-High- 2684.944444 515.944444 18
Off-White-Chicago

Correlation of
Since Release Number of Sales, Sale Price and Days
Let us pick one of the shows Air-Jordan-1-Retro-High-Off-White-Chicago
and see how the number of sales, sale price and days since release are correlated.
In [13]: fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
# Set the angle and elevation of the plot

file:///Users/PRASI/Downloads/final_project (1).html 8/12


21/02/2023, 03:05 final_project

ax.view_init(elev=30, azim=120)

x = df_analysis['Days Since Release']


y = df_analysis['Sale Price']
z = df_analysis['Number of Sales']

x = numpy.ravel(x, order = 'C')


y = numpy.ravel(y, order = 'C')
z = numpy.ravel(z, order = 'C')
sn_name = numpy.ravel(df_analysis['Sneaker Name'], order = 'C')

ax.scatter(x, y, z, s=50, alpha=0.6)

for i, txt in enumerate(range(len(x))):


ax.text(x[i], y[i], z[i], '%s' % (int(y[i])), size=10, zorder=1, colo

ax.set_xlabel('Days Since Release')


ax.set_ylabel('Sale Price')
ax.set_zlabel('Number of Sales')

ax.dist = 13

plt.show()

file:///Users/PRASI/Downloads/final_project (1).html 9/12


21/02/2023, 03:05 final_project

Linear Regression
In [14]: # create the independent variable matrix X and dependent variable vector
X = df_analysis[['Sale Price', 'Days Since Release']]
X = sm.add_constant(X) # add a constant term to the independent variable
y = df_analysis['Number of Sales']

# fit the multiple linear regression model


model = sm.OLS(y, X)
result = model.fit()

# print the model summary


print(result.summary())

file:///Users/PRASI/Downloads/final_project (1).html 10/12


21/02/2023, 03:05 final_project

OLS Regression Results


========================================================================
======
Dep. Variable: Number of Sales R-squared:
0.650
Model: OLS Adj. R-squared:
0.603
Method: Least Squares F-statistic:
13.92
Date: Tue, 21 Feb 2023 Prob (F-statistic): 0.
000382
Time: 03:00:22 Log-Likelihood: -
83.554
No. Observations: 18 AIC:
173.1
Df Residuals: 15 BIC:
175.8
Df Model: 2
Covariance Type: nonrobust
========================================================================
==============
coef std err t P>|t| [0.0
25 0.975]
------------------------------------------------------------------------
--------------
const 258.8132 46.913 5.517 0.000 158.8
21 358.805
Sale Price -0.1186 0.030 -3.959 0.001 -0.1
82 -0.055
Days Since Release 0.1164 0.078 1.500 0.154 -0.0
49 0.282
========================================================================
======
Omnibus: 14.347 Durbin-Watson:
2.207
Prob(Omnibus): 0.001 Jarque-Bera (JB):
12.715
Skew: 1.501 Prob(JB):
0.00173
Kurtosis: 5.819 Cond. No. 1.
55e+04
========================================================================
======

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is c
orrectly specified.
[2] The condition number is large, 1.55e+04. This might indicate that th
ere are
strong multicollinearity or other numerical problems.
/Users/PRASI/Documents/My_Documents/MBA-SataProject/venv/lib/python3.9/s
ite-packages/scipy/stats/_stats_py.py:1736: UserWarning:

kurtosistest only valid for n>=20 ... continuing anyway, n=18

In [15]: # create a meshgrid of the independent variables


x = np.linspace(min(df_analysis['Sale Price']), max(df_analysis['Sale Pri
y = np.linspace(min(df_analysis['Days Since Release']), max(df_analysis['
X, Y = np.meshgrid(x, y)

file:///Users/PRASI/Downloads/final_project (1).html 11/12


21/02/2023, 03:05 final_project

# predict the dependent variable using the OLS model


Z = result.params[0] + result.params[1]*X + result.params[2]*Y

# create a 3D plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='coolwarm')
ax.set_xlabel('Sale Price')
ax.set_ylabel('Days Since Release')
ax.set_zlabel('Number of Sales')
ax.dist = 11
plt.show()

So, it seems that the sale price is the highest when the product is released, and it
gradually reduces.
In [ ]:

file:///Users/PRASI/Downloads/final_project (1).html 12/12

You might also like