Mounted at /content/drive
/content/drive/MyDrive/Interview-AI Engineer-VNPay/dataset
[3]: !ls
[47]: df
1 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 12
2 2010-12-01 08:26:00 2.75 17850.0 United Kingdom 12
3 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 12
4 2010-12-01 08:26:00 3.39 17850.0 United Kingdom 12
… … … … … …
541904 2011-12-09 12:50:00 0.85 12680.0 France 12
541905 2011-12-09 12:50:00 2.10 12680.0 France 12
541906 2011-12-09 12:50:00 4.15 12680.0 France 12
541907 2011-12-09 12:50:00 4.15 12680.0 France 12
541908 2011-12-09 12:50:00 4.95 12680.0 France 12
Year TotalSpending
0 2010 15.30
1 2010 20.34
2 2010 22.00
3 2010 20.34
4 2010 20.34
… … …
541904 2011 10.20
541905 2011 12.60
541906 2011 16.60
541907 2011 16.60
541908 2011 14.85
[9]: print(df.dtypes)
print(df.isnull().sum())
InvoiceNo object
StockCode object
Description object
Quantity int64
InvoiceDate object
UnitPrice float64
CustomerID float64
Country object
dtype: object
InvoiceNo 0
StockCode 0
Description 1454
Quantity 0
InvoiceDate 0
UnitPrice 0
CustomerID 135080
Country 0
dtype: int64
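The null counts above show 135080 rows with no CustomerID. Any later step that groups by CustomerID leaves those rows out, because pandas drops missing group keys by default. A minimal sketch of making that filtering explicit, on a toy frame that reuses the notebook's column names; the name `df_customers` and the toy values are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the retail data (values are illustrative only)
toy = pd.DataFrame({
    "CustomerID": [17850.0, np.nan, 12680.0, np.nan],
    "Quantity": [6, 1, 12, 2],
    "UnitPrice": [3.39, 2.75, 0.85, 4.15],
})

# Keep only rows that can be attributed to a customer
df_customers = toy.dropna(subset=["CustomerID"]).copy()

# IDs are whole numbers stored as floats because of the NaNs; cast once they are gone
df_customers["CustomerID"] = df_customers["CustomerID"].astype(int)

print(len(toy), "rows ->", len(df_customers), "rows with a CustomerID")
```

Whether to drop these rows or keep them for country-level and time-level totals depends on the question being asked; only the customer-level analyses need a CustomerID.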
[10]: print(df.describe())
wrongly sold (22719) barcode 170
wrongly sold as sets -600
wrongly sold sets -975
Name: Quantity, Length: 4223, dtype: int64
[16]: # Revenue is the sum of per-line Quantity * UnitPrice within each country
total_revenue_per_country = (df['Quantity'] * df['UnitPrice']).groupby(df['Country']).sum()
dtype: float64
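Revenue per country is the sum of each line's Quantity times UnitPrice. Multiplying the summed prices by the summed quantities gives a different number, because it pairs every price with every quantity in the group. A small worked example on made-up rows (the toy values and the name `line_total` are assumptions):

```python
import pandas as pd

# Made-up order lines, for illustration only
toy = pd.DataFrame({
    "Country": ["France", "France", "Germany"],
    "Quantity": [10, 1, 2],
    "UnitPrice": [1.0, 5.0, 3.0],
})

# Revenue: per-line Quantity * UnitPrice, then summed within each country
line_total = toy["Quantity"] * toy["UnitPrice"]
revenue = line_total.groupby(toy["Country"]).sum()

# Product of the two column sums, which mixes prices and quantities across lines
grouped = toy.groupby("Country")
product_of_sums = grouped["UnitPrice"].sum() * grouped["Quantity"].sum()

print(revenue)          # France: 10*1.0 + 1*5.0
print(product_of_sums)  # France: (1.0 + 5.0) * (10 + 1)
```

The two agree only for groups with a single line, as Germany does here.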
plt.figure(figsize=(12, 8))
plt.bar(total_revenue_per_country.index, total_revenue_per_country, color='skyblue')
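[17]: # Parse timestamps, then aggregate by calendar month number (1-12)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['Month'] = df['InvoiceDate'].dt.month
# Sum of unit prices per month; this is not weighted by Quantity
monthly_sales = df.groupby('Month')['UnitPrice'].sum()
print("Monthly Sales Trend:")
print(monthly_sales)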
• By this measure, December has the most sales activity
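The rows displayed earlier run from December 2010 to December 2011, so grouping on `dt.month` alone puts both Decembers in bucket 12. A minimal sketch of grouping by year-month period instead, on toy rows; the `LineTotal` column and the toy values are assumptions, not part of the notebook:

```python
import pandas as pd

# Toy invoices spanning two Decembers (values are illustrative only)
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26:00", "2011-01-15 10:00:00",
        "2011-12-09 12:50:00", "2011-12-09 12:50:00",
    ]),
    "Quantity": [6, 2, 12, 4],
    "UnitPrice": [3.39, 5.00, 0.85, 4.15],
})
toy["LineTotal"] = toy["Quantity"] * toy["UnitPrice"]

# Month number only: December 2010 and December 2011 are added together
by_month_number = toy.groupby(toy["InvoiceDate"].dt.month)["LineTotal"].sum()

# Year-month period: each calendar month of each year is its own bucket
by_year_month = toy.groupby(toy["InvoiceDate"].dt.to_period("M"))["LineTotal"].sum()

print(by_month_number)
print(by_year_month)
```

With period buckets, a partial month at the end of the data is also easier to spot before comparing months against each other.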
[23]: df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
df['Year'] = df['InvoiceDate'].dt.year
# Sum of unit prices per year; this is not weighted by Quantity
yearly_sales = df.groupby('Year')['UnitPrice'].sum()
print("Yearly Sales Trend:")
print(yearly_sales)
Let's find some VIP customers
[60]: df['TotalSpending'] = df['Quantity'] * df['UnitPrice']
total_spending_per_customer = df.groupby('CustomerID')['TotalSpending'].sum()
top_10_customers = total_spending_per_customer.nlargest(10)
[25]: num_customers = df['CustomerID'].nunique()
print("Number of unique customers:", num_customers)
sales_per_customer = df.groupby('CustomerID').apply(lambda x: (x['UnitPrice'] * x['Quantity']).sum())
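The per-customer total computed with `apply` can also be written as a row-wise product followed by a single grouped sum, which avoids calling a Python lambda once per customer. A minimal sketch on toy rows showing that both forms agree (the toy values and the name `via_apply` are assumptions):

```python
import pandas as pd

# Toy order lines (values are illustrative only)
toy = pd.DataFrame({
    "CustomerID": [17850.0, 17850.0, 12680.0],
    "Quantity": [6, 6, 12],
    "UnitPrice": [3.39, 2.75, 0.85],
})

# Row-wise product first, then one grouped sum
line_total = toy["UnitPrice"] * toy["Quantity"]
sales_per_customer = line_total.groupby(toy["CustomerID"]).sum()

# Same totals via apply on the selected columns
via_apply = toy.groupby("CustomerID")[["UnitPrice", "Quantity"]].apply(
    lambda g: (g["UnitPrice"] * g["Quantity"]).sum()
)

print(sales_per_customer)
```

On a frame with hundreds of thousands of rows the vectorized form is usually noticeably faster, though the result is the same.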
plt.figure(figsize=(10, 6))
plt.hist(df_filtered['UnitPrice'], bins=50, color='skyblue', edgecolor='black')
plt.title('Unit Price Distribution (Without Outliers - IQR Method)')
plt.xlabel('Unit Price')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
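The cell that builds `df_filtered` is not shown. A minimal sketch of the IQR method named in the plot title, assuming the conventional fences at 1.5 times the interquartile range; the multiplier actually used in the notebook is unknown, and the toy prices are illustrative:

```python
import pandas as pd

# Toy prices with one extreme value (illustrative only)
toy = pd.DataFrame(
    {"UnitPrice": [0.85, 1.25, 2.10, 2.75, 3.39, 4.15, 4.95, 38970.0]}
)

# Quartiles and interquartile range of the price column
q1 = toy["UnitPrice"].quantile(0.25)
q3 = toy["UnitPrice"].quantile(0.75)
iqr = q3 - q1

# Conventional Tukey fences; 1.5 is an assumption, not taken from the notebook
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep rows whose price falls inside the fences (inclusive on both ends)
df_filtered = toy[toy["UnitPrice"].between(lower, upper)]

print(len(toy), "rows ->", len(df_filtered), "rows inside the fences")
```

The fences depend only on the quartiles, so a single very large price does not move them, which is why the method suits a heavily skewed column like this one.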
[30]: df['UnitPrice'].max()
[30]: 38970.0
Customer Segmentation
[58]: # Note: this sums UnitPrice per customer; it is not weighted by Quantity
total_spending = df.groupby('CustomerID')['UnitPrice'].sum()
purchase_frequency = df.groupby('CustomerID')['InvoiceNo'].nunique()
average_spending_per_purchase = total_spending / purchase_frequency
customer_metrics = pd.DataFrame({
'TotalSpending': total_spending,
'PurchaseFrequency': purchase_frequency,
'AvgSpendingPerPurchase': average_spending_per_purchase
})
customer_metrics['Cluster'] = kmeans.labels_
plt.figure(figsize=(10, 8))
sns.scatterplot(data=customer_metrics, x='PurchaseFrequency', y='TotalSpending', hue='Cluster', palette='viridis')
plt.title('Customer Segmentation')
plt.xlabel('Purchase Frequency')
plt.ylabel('Total Spending')
plt.show()
segment_analysis = customer_metrics.groupby('Cluster').agg({
'TotalSpending': 'mean',
'PurchaseFrequency': 'mean',
'AvgSpendingPerPurchase': 'mean'
})
print(segment_analysis)
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to 'auto' in
1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
         TotalSpending  PurchaseFrequency  AvgSpendingPerPurchase
Cluster
0          1487.143013          27.589958               86.416492
1         26184.300000         126.833333              726.355225
2           207.224451           3.594280               62.041187
3         40278.900000           5.000000             8055.780000
• Cluster 0: Regular Spenders
• Cluster 1: High-Value Spenders
• Cluster 2: Occasional Buyers
• Cluster 3: VIP Customers
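The cell that fits `kmeans` is not shown. A minimal sketch of how labels like these might be produced, assuming four clusters (matching the four listed) and standardized features; the scaling step, `random_state`, and the toy data are assumptions, and the notebook may have clustered the raw values. Passing `n_init` explicitly also avoids the FutureWarning printed above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy customer metrics with the notebook's column names (values are made up)
rng = np.random.default_rng(0)
customer_metrics = pd.DataFrame({
    "TotalSpending": rng.gamma(2.0, 500.0, size=200),
    "PurchaseFrequency": rng.integers(1, 50, size=200),
})
customer_metrics["AvgSpendingPerPurchase"] = (
    customer_metrics["TotalSpending"] / customer_metrics["PurchaseFrequency"]
)

# Standardize so no single feature dominates the Euclidean distance
# (an assumption; not confirmed by the notebook)
features = StandardScaler().fit_transform(customer_metrics)

# n_clusters=4 matches the four clusters reported; n_init is set explicitly
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
customer_metrics["Cluster"] = kmeans.fit_predict(features)

print(customer_metrics.groupby("Cluster").mean())
```

Cluster numbers are arbitrary labels, so names such as "VIP Customers" have to be assigned after inspecting the per-cluster means, and they can change between runs unless `random_state` is fixed.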
Customer Churn Analysis
[62]: churn_period = 180  # days without a purchase before a customer counts as churned
last_purchase_date = df.groupby('CustomerID')['InvoiceDate'].max()
current_date = df['InvoiceDate'].max()
# One flag per customer: last_purchase_date is indexed by CustomerID, not by row
churned = (current_date - last_purchase_date).dt.days > churn_period
churn_rate = churned.mean() * 100
print("Churn Rate:", churn_rate, "%")
Churn Rate: 19.7163769441903 %
About 1 in 5 customers made no purchase in the last 180 days (roughly 6 months) of the data.
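Churn here is a per-customer quantity: `last_purchase_date` is indexed by CustomerID, so comparing it with the latest date in the data yields one boolean per customer, and the mean of those booleans is the churn rate. A small worked example with hypothetical customers and dates:

```python
import pandas as pd

# Hypothetical purchases for five customers (dates are illustrative only)
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "InvoiceDate": pd.to_datetime([
        "2010-12-01", "2011-12-01",  # customer 1: bought again recently
        "2011-01-10",                # customer 2: last seen about 11 months before the end
        "2011-11-20", "2011-10-05", "2011-12-09",
    ]),
})

churn_period = 180  # days
current_date = toy["InvoiceDate"].max()
last_purchase_date = toy.groupby("CustomerID")["InvoiceDate"].max()

# One flag per customer, indexed by CustomerID
churned = (current_date - last_purchase_date).dt.days > churn_period
churn_rate = churned.mean() * 100

print(churned)
print("Churn Rate:", churn_rate, "%")
```

Only customer 2 has been inactive for more than 180 days, so one of five customers is flagged. Keeping the flag in its own Series, rather than assigning it to a column of the row-level frame, keeps the denominator as customers rather than invoice lines.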