Analysis

21/02/2023, 03:05 final_project
Evaluating StockX Shoes

Demand of some shoes data to evaluate the Market
First we need to install some libraries.
1. Panda - to import data
pip install pandas
pip install pyreadstat
pip install matplotlib
pip install scipy
pip install numpy
pip install openpyxl
pip install seaborn
pip install plotly
pip install us
pip install re
Then we import those libraries

In [2]: import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
from plotly.offline import plot
from datetime import datetime
import us
import re
import numpy
from mpl_toolkits.mplot3d import Axes3D
import warnings
import statsmodels.api as sm
warnings.filterwarnings('ignore', category=DeprecationWarning)
Importing the data

In [3]: # Load the excel file
df = pd.read_excel('dataset/StockX-Data-Contest-2019-main.xlsx')
# Let's have a look at the first 5 rows of the dataset:

df.head()
file:///Users/PRASI/Downloads/final_project (1).html 1/12

21/02/2023, 03:05 final_project
Out[3]: Order Brand Sneaker Name Sale Retail Release Shoe Buyer
Date Price Price Date Size Region
2017- Yeezy Adidas-Yeezy-
0 09-01 Boost-350-Low-V2- 1097.0 220 2016-09-
24 11.0 California
Beluga
1 09-01 Boost-350-V2- 685.0 220 2016-11-
23 11.0 California
Core-Black-Copper
2 09-01 Boost-350-V2- 690.0 220 2016-11-
23 11.0 California
Core-Black-Green
3 09-01 Boost-350-V2- 1075.0 220 2016-11-
23 11.5 Kentucky
Core-Black-Red
Adidas-Yeezy-
2017- Yeezy
4 09-01 Boost-350-V2- 828.0 220 2017-02-11 11.0 Rhode
Core-Black-Red- Island
2017
About the Dataset:

The dataset provided is a sample of all Off-White x Nike and Yeezy 350 sales from
between 9/1/2017 and the present in the United States. There are a total of 99,956
sales in the dataset, with 27,794 Off-White sales and 72,162 Yeezy sales. The dataset
includes 8 variables: Order Date, Brand, Sneaker Name, Sale Price ( ), RetailP rice(
), Release Date, Shoe Size, and Buyer State. Each row represents an individual sale
on the StockX platform, and the dataset only includes sales within the United States.
The Order Date column represents the date the order was placed, while the Brand
column specifies whether the sale was for an Off-White x Nike or Yeezy 350 shoe.
The Sneaker Name column provides information on the specific model of the shoe
sold. The Sale Price column indicates the amount of money that the buyer paid for
the shoe, while the Retail Price column specifies the manufacturer's suggested retail
price for the shoe. The Release Date column indicates when the shoe was first
released. The Shoe Size column provides information on the size of the shoe sold.
The Buyer State column specifies the state in which the buyer resides.
Overall, this dataset provides valuable insights into the demand and pricing of two
popular shoe brands, and could be used to identify trends and patterns in consumer
behavior.
In [4]: # Convert Order_Date to datetime format
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Release Date'] = pd.to_datetime(df['Release Date'])
# Extract week from Order_Date

df['Week'] = df['Order Date'].dt.isocalendar().week
# Aggregate the data by Sneaker_name and Week to get the total sales of e
agg_df = df.groupby(['Sneaker Name', 'Week']).agg({'Sale Price': 'sum'}).

21/02/2023, 03:05 final_project
# Print the aggregated data

agg_df.head()
Out[4]: Sneaker Name Week Sale Price

0 Adidas-Yeezy-Boost-350-Low-Moonrock 1 3860.0
Here we have aggregated the data by Sneaker_name and Week to get the total sales
of each sneaker model for each week.
In [5]: sales_by_sneaker_week = agg_df.groupby(['Sneaker Name', 'Week']).sum()['S
# Reshape the data to create a pivot table

pivot_table = sales_by_sneaker_week.unstack(level=0)
# Let's create a subset of the above table to plot the sales of 10 random
column_names = pivot_table.columns[1:].tolist()
selected_columns = pd.Series(column_names).sample(n=5, random_state=42)
pivot_table = pd.concat([pivot_table.iloc[:,0], pivot_table[selected_colu
pivot_table.head()
Out[5]: Adidas- Adidas- Nike- Air-Jordan-

Yeezy-
Yeezy- Boost-350- Zoom- adidas- Nike-Zoom- 1-Retro-
Boost- Yeezy-
Fly-Off- Boost-350- Fly-Off- High-Off-
350-Low- V2-Semi-
Frozen- White- V2-Static
White-
Black-Silver
White-
University-
Moonrock Yellow Pink Blue
Week
1 3860.0 86891.0000 13064.0 214690.0000 21195.7100 33891.0000
2 3050.0 71768.0000 15474.0 155469.0000 16246.0000 38320.0000
3 3250.0 70202.0000 12388.0 154020.0000 14588.0000 30994.0000
4 3140.0 66063.2696 10431.0 99448.6392 11761.0000 29757.0000
5 1600.0 62744.0000 10557.0 76537.0651 12569.0376 31333.0483
In [6]: # Plot a line chart

pivot_table.plot(kind='line', figsize=(10, 6))
# Set the chart title and axis labels

plt.title('Sales by Sneaker Name and Week')
plt.xlabel('Week')
plt.ylabel('Sales')
plt.show()

21/02/2023, 03:05 final_project
In [7]: # Let's try to create a heatmap to visualize the sales of the 10 randomly
# Pivot data to create a matrix with "Sneaker Name" as rows, "Week" as co
pivot_df = agg_df.pivot_table(index="Sneaker Name", columns="Week", value
# Create heatmap using seaborn

sns.heatmap(pivot_df, cmap="coolwarm")
# Show the plot

plt.show()
Observing
States. the sales of sneaker models by US
In [8]: # Group the data by Buyer_region and calculate the total sales for each r
region_sales = df.groupby('Buyer Region')['Sale Price'].sum().reset_index
# Create a dictionary mapping state names to their two-letter codes

state_dict = {state.name: state.abbr for state in us.states.STATES}

21/02/2023, 03:05 final_project
# Convert the 'Buyer_region' column to state codes

region_sales['state_code'] = region_sales['Buyer Region'].apply(lambda x:
# Create a choropleth map using the plotly express library

fig = px.choropleth(locationmode='USA-states', locations=region_sales['st
# plot(fig) # To plot the map in a new window
fig.show(renderer="notebook")
color
9M
8M
7M
6M
5M
4M
3M
2M
1M
Pre-Processing the original dataset for Regression

In [9]: df = pd.read_excel('dataset/StockX-Data-Contest-2019-main.xlsx')
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Release Date'] = pd.to_datetime(df['Release Date'])
start_date = pd.to_datetime('2017-09-01')
df['Month'] = df['Order Date'].apply(lambda x: (x.year - start_date.year)
In [10]: # calculate the number of days since the release date

processed_df = df.copy()
substring_show_names = ['Adidas-Yeezy-Boost-350-Low', 'Adidas-Yeezy-Boost
# Aggregate test_df by substring_show_names and average the Sale Price, s
processed_df['Sneaker Name'] = processed_df['Sneaker Name'].str.extract('
processed_df['Days Since Release'] = (processed_df['Order Date'] - proces

processed_df.drop('Release Date', axis=1, inplace=True)
processed_df['State Code'] = df['Buyer Region'].apply(lambda x: state_dic

21/02/2023, 03:05 final_project
processed_df.drop('Buyer Region', axis=1, inplace=True)

processed_df.to_csv('dataset/stockx_data.csv', index=False)
processed_df.drop('Shoe Size', axis=1, inplace=True)
processed_df.drop('Retail Price', axis=1, inplace=True)
processed_df.drop('Brand', axis=1, inplace=True)
processed_df.drop('Order Date', axis=1, inplace=True)
# processed_df.head()
final_df = processed_df.copy()
final_df['Number of Sales'] = 1
final_df = final_df.groupby(['Month', 'Sneaker Name']).agg({'Sale Price':
final_df.head()
Out[10]:
Month Sneaker Name Sale Price Days Since Number of
Release Sales
0 0 Adidas-Yeezy-Boost-350-Low 1095.068182 457.818182 44
1 0 Adidas-Yeezy-Boost-350-V2 633.195329 198.233546 471
2 0 Air-Jordan-1-Retro-High-Off-
White-Chicago 1964.707317 9.219512 41
3 0 Nike-Air-Max-90 872.323529 8.470588 34
4 0 Nike-Air-Presto 1220.595238 8.523810 42
In [11]: test_df = df.copy()

test_df['Number of Sales'] = 1
substring_show_names = ['Adidas-Yeezy-Boost-350-Low', 'Adidas-Yeezy-Boost

# Aggregate test_df by substring_show_names and average the Sale Price, s
test_df['Sneaker Name'] = test_df['Sneaker Name'].str.extract('({})'.form
test_df = test_df.groupby(['Sneaker Name']).agg({'Sale Price': 'mean', 'N
# test_df.head()
# Sort by number of sales

test_df = test_df.sort_values(by='Sneaker Name', ascending=True)
# Create a subset of the data with "Sneaker Name" containing the word "Ni
# test_df = test_df[test_df['Sneaker Name'].str.contains('Nike')]
# Print the top 15 rows
test_df.head(60)

21/02/2023, 03:05 final_project
Out[11]: Sneaker Name Sale Price Number of Sales

0 Adidas-Yeezy-Boost-350-Low 915.546695 953
1 Adidas-Yeezy-Boost-350-V2 352.598103 71209
2 Air-Jordan-1-Retro-High 1026.032711 5703
3 Nike-Air-Force-1-Low 507.668735 2486
4 Nike-Air-Max-90 595.522176 1998
5 Nike-Air-Max-97 723.300866 1392
6 Nike-Air-Presto 758.068462 4363
7 Nike-Air-Vapormax 646.502682 3429
8 Nike-Blazer-Mid 602.877967 3622
9 Nike-React-Hyperdunk-2017-Flyknit 494.946281 484
10 Nike-Zoom-Fly 325.560268 4317
In [12]: sneakers = ['Air-Jordan-1-Retro-High-Off-White-Chicago']

df_analysis = final_df.copy()
df_analysis = df_analysis[df_analysis['Sneaker Name'].isin(sneakers)]
# Convert every NaN value to 0 in the dataset

df_analysis = df_analysis.fillna(0)
df_analysis.head(30)

21/02/2023, 03:05 final_project
Out[12]:
Month Sneaker Name Sale Price Days Since Number of
Release Sales
2 0 Air-Jordan-1-Retro-High- 1964.707317 9.219512 41
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Off-White-Chicago
Correlation of
Since Release Number of Sales, Sale Price and Days
Let us pick one of the shows Air-Jordan-1-Retro-High-Off-White-Chicago
and see how the number of sales, sale price and days since release are correlated.
In [13]: fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
# Set the angle and elevation of the plot

21/02/2023, 03:05 final_project
ax.view_init(elev=30, azim=120)
x = df_analysis['Days Since Release']

y = df_analysis['Sale Price']
z = df_analysis['Number of Sales']
x = numpy.ravel(x, order = 'C')

y = numpy.ravel(y, order = 'C')
z = numpy.ravel(z, order = 'C')
sn_name = numpy.ravel(df_analysis['Sneaker Name'], order = 'C')
ax.scatter(x, y, z, s=50, alpha=0.6)
for i, txt in enumerate(range(len(x))):

ax.text(x[i], y[i], z[i], '%s' % (int(y[i])), size=10, zorder=1, colo
ax.set_xlabel('Days Since Release')

ax.set_ylabel('Sale Price')
ax.set_zlabel('Number of Sales')
ax.dist = 13
plt.show()

21/02/2023, 03:05 final_project
Linear Regression
In [14]: # create the independent variable matrix X and dependent variable vector
X = df_analysis[['Sale Price', 'Days Since Release']]
X = sm.add_constant(X) # add a constant term to the independent variable
y = df_analysis['Number of Sales']
# fit the multiple linear regression model

model = sm.OLS(y, X)
result = model.fit()
# print the model summary

print(result.summary())

21/02/2023, 03:05 final_project
OLS Regression Results

========================================================================
======
Dep. Variable: Number of Sales R-squared:
0.650
Model: OLS Adj. R-squared:
0.603
Method: Least Squares F-statistic:
13.92
Date: Tue, 21 Feb 2023 Prob (F-statistic): 0.
000382
Time: 03:00:22 Log-Likelihood: -
83.554
No. Observations: 18 AIC:
173.1
Df Residuals: 15 BIC:
175.8
Df Model: 2
Covariance Type: nonrobust
========================================================================
==============
coef std err t P>|t| [0.0
25 0.975]
------------------------------------------------------------------------
--------------
const 258.8132 46.913 5.517 0.000 158.8
21 358.805
Sale Price -0.1186 0.030 -3.959 0.001 -0.1
82 -0.055
Days Since Release 0.1164 0.078 1.500 0.154 -0.0
49 0.282
========================================================================
======
Omnibus: 14.347 Durbin-Watson:
2.207
Prob(Omnibus): 0.001 Jarque-Bera (JB):
12.715
Skew: 1.501 Prob(JB):
0.00173
Kurtosis: 5.819 Cond. No. 1.
55e+04
========================================================================
======
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is c
orrectly specified.
[2] The condition number is large, 1.55e+04. This might indicate that th
ere are
strong multicollinearity or other numerical problems.
/Users/PRASI/Documents/My_Documents/MBA-SataProject/venv/lib/python3.9/s
ite-packages/scipy/stats/_stats_py.py:1736: UserWarning:
kurtosistest only valid for n>=20 ... continuing anyway, n=18
In [15]: # create a meshgrid of the independent variables

x = np.linspace(min(df_analysis['Sale Price']), max(df_analysis['Sale Pri
y = np.linspace(min(df_analysis['Days Since Release']), max(df_analysis['
X, Y = np.meshgrid(x, y)

21/02/2023, 03:05 final_project
# predict the dependent variable using the OLS model

Z = result.params[0] + result.params[1]*X + result.params[2]*Y
# create a 3D plot
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(X, Y, Z, cmap='coolwarm')
ax.set_xlabel('Sale Price')
ax.set_ylabel('Days Since Release')
ax.set_zlabel('Number of Sales')
ax.dist = 11
plt.show()
So, it seems that the sale price is the highest when the product is released, and it
gradually reduces.
In [ ]:

Analysis

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Analysis

Uploaded by

Copyright:

Available Formats

21/02/2023, 03:05 final_project

Evaluating StockX Shoes

Then we import those libraries

Importing the data

# Let's have a look at the first 5 rows of the dataset:

file:///Users/PRASI/Downloads/final_project (1).html 1/12

About the Dataset:

# Extract week from Order_Date

file:///Users/PRASI/Downloads/final_project (1).html 2/12

# Print the aggregated data

Out[4]: Sneaker Name Week Sale Price

# Reshape the data to create a pivot table

Out[5]: Adidas- Adidas- Nike- Air-Jordan-

In [6]: # Plot a line chart

# Set the chart title and axis labels

file:///Users/PRASI/Downloads/final_project (1).html 3/12

# Create heatmap using seaborn

# Show the plot

# Create a dictionary mapping state names to their two-letter codes

file:///Users/PRASI/Downloads/final_project (1).html 4/12

# Convert the 'Buyer_region' column to state codes

# Create a choropleth map using the plotly express library

Pre-Processing the original dataset for Regression

In [10]: # calculate the number of days since the release date

processed_df['Days Since Release'] = (processed_df['Order Date'] - proces

file:///Users/PRASI/Downloads/final_project (1).html 5/12

processed_df.drop('Buyer Region', axis=1, inplace=True)

In [11]: test_df = df.copy()

substring_show_names = ['Adidas-Yeezy-Boost-350-Low', 'Adidas-Yeezy-Boost

test_df = test_df.groupby(['Sneaker Name']).agg({'Sale Price': 'mean', 'N

# Sort by number of sales

file:///Users/PRASI/Downloads/final_project (1).html 6/12

Out[11]: Sneaker Name Sale Price Number of Sales

In [12]: sneakers = ['Air-Jordan-1-Retro-High-Off-White-Chicago']

# Convert every NaN value to 0 in the dataset

file:///Users/PRASI/Downloads/final_project (1).html 7/12

file:///Users/PRASI/Downloads/final_project (1).html 8/12

x = df_analysis['Days Since Release']

x = numpy.ravel(x, order = 'C')

ax.scatter(x, y, z, s=50, alpha=0.6)

for i, txt in enumerate(range(len(x))):

ax.set_xlabel('Days Since Release')

file:///Users/PRASI/Downloads/final_project (1).html 9/12

# fit the multiple linear regression model

# print the model summary

file:///Users/PRASI/Downloads/final_project (1).html 10/12

OLS Regression Results

kurtosistest only valid for n>=20 ... continuing anyway, n=18

In [15]: # create a meshgrid of the independent variables

file:///Users/PRASI/Downloads/final_project (1).html 11/12

# predict the dependent variable using the OLS model

file:///Users/PRASI/Downloads/final_project (1).html 12/12

You might also like