
END TRIMESTER EXAMINATION

PROGRAMMING WITH PYTHON

A Report submitted in partial fulfillment of the requirements for the degree of


Master of Business Administration

BY,
SHARA GEORGE VAIDIAN – 2327848

Under the guidance of

PROF. SASEEKALA M

MBA PROGRAMME
SCHOOL OF BUSINESS AND MANAGEMENT
CHRIST(DEEMED TO BE UNIVERSITY), BANG

APRIL 2024

S.No  Title

1.    INTRODUCTION
2.    OBJECTIVES
3.    DATA DICTIONARY
4.    SUMMARIZING DATA USING DESCRIPTIVE STATISTICS
5.    QUESTIONS:
      QUESTION 1
      QUESTION 2
      QUESTION 3
      QUESTION 4
      QUESTION 5
6.    INFERENCES
7.    RECOMMENDATIONS
8.    CONCLUSION
9.    PLAGIARISM REPORT

WHOLESALE CUSTOMER DATASET:

INTRODUCTION:
A wholesale distributor headquartered in Portugal has gathered data on the annual expenditure of its retail
customers across six product categories: Fresh produce, Milk, Grocery items, Frozen foods,
Detergents/Paper goods, and Delicatessen items. The dataset covers the annual spending of 440 retailers
operating in three regions, coded 1, 2 and 3 (Lisbon, Oporto, and Other). The data is also segmented by two
sales channels, coded 1 and 2 (Hotel and Retail).

To undertake a thorough analysis, the Wholesale Customer dataset has been imported into a Jupyter
Notebook environment. The objective is to delve into the spending behavior across different product
categories, regions, and sales channels. By scrutinizing this dataset, the distributor aims to identify strategic
insights and formulate solutions to pertinent business challenges.

This dataset has 440 rows and 8 columns.

OBJECTIVES:
• Analyze spending patterns across regions and sales channels.
• Identify regions with highest and lowest spending.
• Compare spending between Hotel and Retail channels.
• Determine most and least popular product categories.
• Investigate outliers and address data quality issues.
• Provide recommendations for resource allocation and marketing.

DATA DICTIONARY:

The dataset comprises 8 variables, all of integer type, and none of the columns contains null values.
Channel takes the values 1 and 2, which stand for Hotel and Retail. Region takes the values 1, 2 and 3,
which stand for Lisbon, Oporto and Other.

The isnull() function is used to check for null values; isnull().sum() gives the number of null values in
each column.
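
The null check described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the small DataFrame uses made-up values in place of the real 440-row file, and the variable names `null_counts` and `all_integer` are assumptions.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset (values are made up).
df = pd.DataFrame({
    "Channel": [1, 2, 1, 2],
    "Region": [3, 3, 1, 2],
    "Fresh": [1000, 2000, 3000, 4000],
    "Milk": [500, 1500, 700, 900],
})

# Number of missing values in each column; all zeros means no nulls.
null_counts = df.isnull().sum()
print(null_counts)

# Check that every column is stored with an integer dtype.
all_integer = all(pd.api.types.is_integer_dtype(t) for t in df.dtypes)
print(all_integer)
```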

SUMMARIZING DATA USING DESCRIPTIVE STATISTICS:

The describe() function provides a summary table of statistical information about the variables in the
dataset. This includes the count of observations, the mean, the standard deviation, the minimum and
maximum values, and the 25th, 50th (median) and 75th percentiles.
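
As a small illustration of what describe() returns, the sketch below runs it on two made-up columns. The values are hypothetical and only serve to show the rows of the summary table.

```python
import pandas as pd

# Hypothetical values; the real dataset has 440 rows and six product columns.
df = pd.DataFrame({
    "Fresh": [100, 200, 300, 400, 500],
    "Milk": [10, 20, 30, 40, 100],
})

# One row per statistic: count, mean, std, min, 25%, 50%, 75%, max.
summary = df.describe()
print(summary)

# The 50% row is the median of each column.
print(summary.loc["50%"])
```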

Calculation of median:

• The Median of Fresh is 8504.0
• The Median of Milk is 3627.0
• The Median of Grocery is 4755.5
• The Median of Frozen is 1526.0
• The Median of Detergents_Paper is 816.5
• The Median of Delicassen is 965.5
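
A minimal sketch of how such medians might be computed with pandas is shown below. The two columns and their values are hypothetical, so the printed numbers do not match the report's figures.

```python
import pandas as pd

# Hypothetical values for two of the six product columns.
df = pd.DataFrame({
    "Fresh": [100, 300, 200, 400],
    "Milk": [10, 50, 20, 40],
})

product_columns = ["Fresh", "Milk"]

# median() returns one value per selected column.
medians = df[product_columns].median()
for name, value in medians.items():
    print(f"The Median of {name} is {value}")
```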
Calculation of mode:

Every variable except Region and Channel is a continuous spending amount in which values rarely repeat, so
the mode is informative only for Region and Channel.

The most occurring value (mode) of Region is 3 (Other)


The most occurring value (mode) of Channel is 1 (Hotel)
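
The mode of the two coded columns could be obtained along these lines. The rows are made up for illustration. Note that mode() returns a Series because ties are possible, so the sketch takes its first entry.

```python
import pandas as pd

# Hypothetical coded rows: Region in {1, 2, 3}, Channel in {1, 2}.
df = pd.DataFrame({
    "Region": [3, 3, 1, 2, 3],
    "Channel": [1, 1, 2, 1, 2],
})

# mode() can return several values on a tie; [0] picks the first one.
region_mode = df["Region"].mode()[0]
channel_mode = df["Channel"].mode()[0]

print("Mode of Region:", region_mode)
print("Mode of Channel:", channel_mode)
```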

QUESTIONS:

Question 1: Which Region and which Channel spent the most? Which Region and Channel spent the
least?

This code creates a copy of the original dataset and adds a new column called 'Spending'. This new column
calculates the total spending for each retailer by summing up their expenditures across different product
categories.
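
The step described above might look like the following sketch. The rows are hypothetical and only three product columns are included for brevity. The names `data`, `by_region` and `by_channel` are assumptions, not taken from the original notebook.

```python
import pandas as pd

# Hypothetical rows with three of the six product columns.
df = pd.DataFrame({
    "Channel": [1, 2, 1, 2],
    "Region": [1, 2, 3, 3],
    "Fresh": [100, 200, 300, 400],
    "Milk": [10, 20, 30, 40],
    "Grocery": [1, 2, 3, 4],
})
product_columns = ["Fresh", "Milk", "Grocery"]

# Work on a copy so the original frame keeps its original columns.
data = df.copy()
data["Spending"] = data[product_columns].sum(axis=1)

# Total spending per region and per channel.
by_region = data.groupby("Region")["Spending"].sum()
by_channel = data.groupby("Channel")["Spending"].sum()

print("Highest-spending region code:", by_region.idxmax())
print("Lowest-spending region code:", by_region.idxmin())
print(by_channel)
```

The two grouped Series are what a bar plot of spending by Region or by Channel would display.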

The plot suggests that the "Other" region had the highest spending, while the "Oporto" region had the
lowest spending.

The plot shows that spending was higher in the "Hotel" channel compared to the "Retail" channel.

Question 2: Describe the 6 varieties across the channel and region

Across Channel:
There are 2 channels, coded 1 and 2, which stand for Hotel and Retail respectively. For ease of
understanding, the codes 1 and 2 are replaced with the labels Hotel and Retail.

Hotel:

The Hotel channel contains 298 retailers. The standard deviation ranges from 1104 to 13832, which shows
that the variables do not behave alike. The highest minimum spending is on Milk (55), while the minimum
spending on Fresh, Grocery, Detergents_Paper and Delicassen is 3 in each case.

Retail:

The Retail channel contains 142 retailers. The standard deviation varies from 1812 to 12267, so the
variables do not behave alike. The minimum amount spent is highest for Grocery (2743) and lowest for
Delicassen (3).

Across Region:
There are 3 regions, coded 1, 2 and 3, which stand for Lisbon, Oporto and Other respectively. For ease of
understanding, the codes 1, 2 and 3 are replaced with the labels Lisbon, Oporto and Other.

Lisbon:

Lisbon has 77 retailers. The standard deviation varies from 1345 to 11557, so the variables do not behave
alike. The minimum amount spent is highest for Grocery (258) and lowest for Detergents_Paper (5).

Oporto:

Oporto has 47 retailers. The standard deviation ranges from 1050 to 10842, indicating dissimilar
behaviour across variables. The minimum spending is highest for Grocery (1330) and lowest for Fresh (3).

Other:

The Other region has 316 retailers. The standard deviation ranges from 3232 to 13389, suggesting
differing behaviour across variables. The highest minimum spending is observed for Milk (55), while the
minimum spending on Fresh, Grocery, Detergents_Paper and Delicassen is 3 in each case.
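
The per-group figures quoted above (row count, standard deviation and minimum for each variety) can be produced with a grouped aggregation. The sketch below is an assumption about how that might be done: it uses made-up rows, two product columns, and groups by Channel. Grouping by Region works the same way.

```python
import pandas as pd

# Hypothetical rows; Channel 1 = Hotel, Channel 2 = Retail.
df = pd.DataFrame({
    "Channel": [1, 1, 1, 2, 2],
    "Fresh": [3, 500, 1000, 200, 800],
    "Milk": [55, 300, 900, 400, 600],
})
product_columns = ["Fresh", "Milk"]

# Count, minimum and standard deviation of each variety within each channel.
stats = df.groupby("Channel")[product_columns].agg(["count", "min", "std"])
print(stats)

# Which variety has the highest minimum spend within channel 1?
minimums = df.groupby("Channel")[product_columns].min()
highest_minimum = minimums.loc[1].idxmax()
print("Highest minimum in channel 1:", highest_minimum)
```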

Question 3: Data visualization of the spread of the 6 varieties across region and channel.

This code simplifies the data by replacing numeric codes with descriptive labels. In this dataset, it changes 1
to 'Hotel' and 2 to 'Retail' in the 'Channel' column. Similarly, it changes 1 to 'Lisbon', 2 to 'Oporto', and 3 to
'Other' in the 'Region' column. This makes the data easier to understand and work with.
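
A minimal sketch of this relabelling is given below. It uses `Series.map` with a dictionary, which is one possible way to do it and may differ from the original notebook. The rows are hypothetical.

```python
import pandas as pd

# Hypothetical coded rows.
df = pd.DataFrame({
    "Channel": [1, 2, 1],
    "Region": [1, 2, 3],
})

channel_labels = {1: "Hotel", 2: "Retail"}
region_labels = {1: "Lisbon", 2: "Oporto", 3: "Other"}

# map() looks up each code in the dictionary; unmapped codes would become NaN.
labelled = df.copy()
labelled["Channel"] = labelled["Channel"].map(channel_labels)
labelled["Region"] = labelled["Region"].map(region_labels)

print(labelled)
```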

Spread of Fresh:

Spread of Milk:

Spread of Grocery:

Spread of Frozen:

Spread of Detergents Paper:

Spread of Delicassen:

Looking at the above tables, we can see that Milk, Grocery and Detergents_Paper have a higher spend in
the Retail channel than in the Hotel channel in all regions. On the other hand, Fresh and Frozen have a
higher spend in the Hotel channel than in the Retail channel in all regions.
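
One way to build a Region by Channel table for a single variety is a pivot table. The sketch below is an assumption: the rows are made up, and the mean is used as the aggregate because the source does not state which statistic its tables show.

```python
import pandas as pd

# Hypothetical labelled rows for one variety.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Lisbon", "Other", "Other"],
    "Milk": [100, 300, 200, 600],
})

# Rows are regions, columns are channels, cells are mean Milk spend.
milk_spread = df.pivot_table(
    index="Region", columns="Channel", values="Milk", aggfunc="mean"
)
print(milk_spread)

# True only if Retail exceeds Hotel in every region of this toy data.
retail_higher_everywhere = bool((milk_spread["Retail"] > milk_spread["Hotel"]).all())
print(retail_higher_everywhere)
```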

Also, a box plot shows that the spend on Fresh and Grocery is the highest across region and channel, while
the spend on Delicatessen is the lowest across region and channel.

Therefore we can conclude that the 6 varieties do not show similar behaviour across Region and Channel.

Question 4: Which item shows the most and least inconsistent behaviour?

To find the most and least inconsistent items, we calculate the coefficient of variation.
The coefficient of variation (CV) is the standard deviation divided by the mean, so it expresses variability
relative to the average of a dataset. A higher CV means more variation relative to the average, while a lower
CV means less. A high CV in part of the data can indicate inconsistency or instability, possibly due to
errors or outliers.
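
The CV of each variety can be sketched as the column standard deviation divided by the column mean. The values below are hypothetical. pandas uses the sample standard deviation (ddof=1) by default, which is an assumption here; passing ddof=0 gives the population version and slightly smaller values.

```python
import pandas as pd

# Hypothetical values for two varieties.
df = pd.DataFrame({
    "Fresh": [10, 20, 30],
    "Milk": [5, 20, 35],
})

# Coefficient of variation = standard deviation / mean, per column.
cv = df.std() / df.mean()
print(cv)

print("Most inconsistent:", cv.idxmax())
print("Least inconsistent:", cv.idxmin())
```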

Varieties           Coefficient of variation

Fresh               1.0527196084948245
Milk                1.2718508307424503
Grocery             1.193815447749267
Frozen              1.5785355298607762
Detergents_Paper    1.65276588104729
Delicassen          1.8473041039189306

Based on the provided table, it's clear that the Coefficient of Variation is highest for the Delicatessen variety
and lowest for the Fresh variety. Hence, we can infer that Delicatessen exhibits the most inconsistency in
behavior, while Fresh displays the least inconsistency.

Question 5: Checking for Outliers:


To identify outliers in the dataset, box plots are used. An outlier is a data point that differs significantly
from the other data points. The interquartile range (IQR) is the distance between the first quartile
(25th percentile) and the third quartile (75th percentile). By the usual box plot convention, any data point
lying more than 1.5 times the IQR below the first quartile or above the third quartile is considered an
outlier.
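
The points a box plot draws as outliers can also be listed numerically. The sketch below applies the common 1.5 × IQR fence rule to a made-up series. The factor 1.5 is the usual box plot default and is an assumption about the plots used in this report.

```python
import pandas as pd

# Hypothetical spending values with one extreme point.
fresh = pd.Series([10, 12, 14, 16, 18, 100], name="Fresh")

q1 = fresh.quantile(0.25)
q3 = fresh.quantile(0.75)
iqr = q3 - q1

# Fences: anything below the lower or above the upper fence is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = fresh[(fresh < lower_fence) | (fresh > upper_fence)]
print(outliers)
```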

Based on the box plots for all variables shown above, it's evident that outliers exist within the dataset.
Outliers are observed in all the variables Fresh, Milk, Grocery, Frozen, Detergents_Paper, and Delicatessen.

INFERENCES:

• Of all the regions, Other spends the most and Oporto spends the least.
• The Hotel channel spends more than the Retail channel.
• The highest spending is on Fresh, followed by Grocery, Milk, Frozen, Detergents_Paper and
Delicatessen.
• Outliers are present in the dataset.

RECOMMENDATIONS:

• Allocate resources and marketing efforts towards the "Other" region, where spending is highest.
Conduct further research to understand customer preferences and behaviours in this region.
• For "Oporto," consider implementing targeted marketing strategies or promotions to stimulate
spending and increase market share.
• Invest in improving products and services tailored to the Hotel channel to maintain and increase
spending levels. Consider partnerships with hotels to offer specialized products.
• For the Retail channel, consider strategies to improve customer engagement, loyalty programs, or
product diversification to attract more spending.
• Promote Fresh, Grocery, and Milk products to capitalize on their popularity. Expand product variety
and ensure consistent quality.
• For categories with lower spending, evaluate opportunities for targeted marketing campaigns or
product promotions to increase demand.
• Look into the outliers to find out why they're there. Check if they're genuine anomalies or data
errors. Fix any issues with data quality or inconsistencies that might be causing the outliers. Use
reliable statistical methods to handle outliers and ensure the analysis is accurate.

CONCLUSION:

In summary, the analysis of the Wholesale Customer dataset has provided valuable insights into spending
patterns across various product categories, regions, and sales channels. We found that spending is highest
in the "Other" region and lowest in "Oporto," with the Hotel channel showing greater spending compared
to Retail. Fresh, Grocery, and Milk are the top contributors to spending, while Delicatessen displays the
most inconsistency in behavior. Additionally, outliers were identified in the dataset, emphasizing the
importance of data quality and robust statistical methods. To leverage these insights, recommendations
were made to optimize resource allocation, enhance offerings, and address outliers. Overall, this project
highlights the importance of data-driven decision-making in understanding market dynamics and driving
business growth.

PLAGIARISM REPORT:
