
Business Report

Project - SMDM
Group 10
16-March-2020

Contents

1 – Wholesale Customer Data Analysis
1.1 Problem 1.1
1.2 Problem 1.2
1.3 Problem 1.3
1.4 Problem 1.4
1.5 Problem 1.5
2 - Clear Mountain State University (CMSU) Survey
2.1 Problem 2.1
2.2 Problem 2.2
2.3. Problem 2.3
2.4. Problem 2.4
3 – Hypothesis Testing for Quality of Shingles
3.1. Problem 3.1
3.2. Problem 3.2
3.3. Problem 3.3
3.4. Problem 3.4
Summary –
This business report gives a detailed explanation of our group's approach to each problem in
the assignment and provides the relevant information used in solving it.

1 – Wholesale Customer Data Analysis


We imported the ‘Wholesale Customer data’ dataset into Python to analyse the spend on each
store item across regions and channels and to answer each problem. Below is the detailed
approach and answer.

1.1 Problem 1.1

Which region and channel spend most & least?

Solution:

Using the describe function in Python, we first looked at the basic descriptive statistics of the data set.

Using a bar graph of spend by Region and Channel, we were able to identify where spend is highest
and lowest. Below is the bar graph representation -

Looking at the above bar graph, the Hotel channel spends the most and the Retail channel the least. To
confirm this, we added a total column to the data frame and grouped the data by Channel to get the
total spend for each channel.

The Hotel channel has the highest spend at $7,999,569 and the Retail channel has the lowest at
$6,619,931. Below is the output from Python -

Similarly, we grouped the totals by Region. The Other region has the highest spend at
$10,677,599 and the Oporto region has the lowest at $1,555,088. Below is the output from Python -
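
The grouping step described above can be sketched as follows. This is a minimal illustration, not the group's actual notebook: the rows are made up, and the column names are assumed to match the dataset's item categories.

```python
import pandas as pd

# Hypothetical rows for illustration only; these are NOT the report's data.
df = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail"],
    "Region": ["Lisbon", "Other", "Oporto", "Other"],
    "Fresh": [100, 200, 50, 80],
    "Milk": [20, 30, 60, 90],
    "Grocery": [10, 40, 70, 120],
    "Frozen": [15, 25, 5, 10],
    "Detergents_Paper": [5, 10, 30, 40],
    "Delicatessen": [8, 12, 4, 6],
})

items = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# Row-wise total spend across the six item categories.
df["Total"] = df[items].sum(axis=1)

# Total spend per channel and per region.
by_channel = df.groupby("Channel")["Total"].sum()
by_region = df.groupby("Region")["Total"].sum()

# Labels of the highest- and lowest-spending groups.
print(by_channel.idxmax(), by_channel.idxmin())
print(by_region.idxmax(), by_region.idxmin())
```

Using idxmax and idxmin on the grouped totals reads the answer straight from the data instead of from the bar graph.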
1.2 Problem 1.2

Six different varieties of items are considered. Do all varieties show similar behaviour
across Region and Channel?

Solution:

Using pivot tables for each category and checking spend across Region and Channel we get the
following outputs -

Fresh: Milk: Grocery:

Frozen: Detergents_Paper: Delicatessen:

Looking at the above tables, we see that some categories, namely Milk, Grocery and Detergents_Paper,
have higher spend in the Retail channel than in the Hotel channel across all regions. On the other hand,
Fresh and Frozen have higher spend in the Hotel channel than in the Retail channel across all regions.

Also, a box plot shows that spend on Fresh and Grocery is the highest across Region and Channel,
while spend on Delicatessen is the lowest. The output
boxplot is below -
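
A per-category pivot table of the kind described above might look like the sketch below. The data is invented and only two categories are shown; the column names are assumptions.

```python
import pandas as pd

# Hypothetical rows for illustration only; not the report's data.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Lisbon", "Oporto", "Oporto"],
    "Fresh": [300, 100, 250, 90],
    "Milk": [40, 120, 30, 110],
})

# One pivot table per category: regions as rows, channels as columns.
pivots = {
    item: pd.pivot_table(df, values=item, index="Region",
                         columns="Channel", aggfunc="sum")
    for item in ["Fresh", "Milk"]
}

# True where the Hotel channel outspends Retail in every region.
hotel_leads = {item: bool((t["Hotel"] > t["Retail"]).all())
               for item, t in pivots.items()}
print(hotel_leads)
```

Reducing each table to a single True/False makes it easy to see which categories behave alike across regions.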
1.3 Problem 1.3

On the basis of the descriptive measure of variability, which item shows the most inconsistent
behaviour? Which item shows the least inconsistent behaviour?

Solution:

We first checked the standard deviation of each category.

Also, using the describe function, we found that Delicatessen's standard deviation is close to its mean
while the Fresh category's standard deviation is far from its mean, indicating the least and most
inconsistency respectively.

We also checked the skew and var descriptive statistic functions to arrive at the conclusion that Fresh is
the most inconsistent item and Delicatessen is the least inconsistent item.
Var output Skew output
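
The three spread measures mentioned above can be computed side by side as in this sketch. The values are invented for illustration and the two column names are assumptions.

```python
import pandas as pd

# Hypothetical spend values for illustration only; not the report's data.
df = pd.DataFrame({
    "Fresh": [100, 5000, 12000, 30000],
    "Delicatessen": [900, 1000, 1100, 1200],
})

# Spread measures named in the text: standard deviation, variance, skewness.
summary = pd.DataFrame({
    "std": df.std(),
    "var": df.var(),
    "skew": df.skew(),
})
print(summary)

# Columns with the largest and smallest standard deviation.
most_variable = summary["std"].idxmax()
least_variable = summary["std"].idxmin()
print(most_variable, least_variable)
```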

1.4 Problem 1.4

Are there any outliers in the data?

Solution:

To look at the outliers by category, we used box plots and cat plots from the seaborn library. Below
are the results:

Boxplot –

Catplot -
These graphs clearly show that there are outliers in every category.
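
The visual check can be complemented by counting outliers numerically. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule in seaborn and matplotlib box plots, to invented data. It is an illustration, not the report's computation.

```python
import pandas as pd

# Hypothetical spend values for illustration only; not the report's data.
df = pd.DataFrame({
    "Fresh": [10, 12, 11, 13, 12, 11, 100],
    "Milk": [5, 6, 5, 7, 6, 5, 6],
})

# Quartiles and interquartile range per column.
q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Points lying more than 1.5 * IQR outside the quartiles.
is_outlier = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)
```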

1.5 Problem 1.5

On the basis of this report, what are the recommendations?

Solution:

We looked at the wholesale customer data spend across various dimensions and below are some
recommendations –

• Fresh is the most sold variety; maintaining inventory for Fresh is recommended as it is the most
profitable
• Delicatessen is most sold in the Lisbon region in the Hotel channel; it is recommended to
market further in the Hotel channel and strategically stock up in the Lisbon region
• Fresh items sell most in the Hotel channel; it is recommended to stock up accordingly
• Grocery and Milk follow Fresh in terms of totals in the Other region; it is recommended to
market and stock up accordingly
• Delicatessen sells strongly to Hotels in the Other region and sells least in the Oporto region
• Marketing for all items is recommended in the Oporto region, as the overall spend in Oporto
across channels is comparatively low

2 - Clear Mountain State University (CMSU) Survey

2.1 Problem 2.1

For this data, construct the following contingency tables (Keep Gender as row variable)

2.1.1. Gender and Major


2.1.2. Gender and Grad Intention
2.1.3. Gender and Employment
2.1.4. Gender and Computer

Solution –

2.1.1 – After importing the survey.csv file and checking the output of the describe function, we created a
crosstab for Gender and Major with Gender as rows and Major as columns. We added the column totals
using the sum function with axis set to 0.

2.1.2 – Below is the contingency table for Gender and Grad Intention using the crosstab function:

2.1.3 – Below is the contingency table for Gender and Employment using the crosstab function:

2.1.4 – Below is the contingency table for Gender and Computer using the crosstab function:
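
A contingency table with column totals, built the way 2.1.1 describes, could be sketched like this. The rows are invented and the column names Gender and Major are assumptions about the survey file.

```python
import pandas as pd

# Hypothetical survey rows for illustration only; not the report's data.
survey = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Major": ["CIS", "Economics", "CIS", "Management", "Economics"],
})

# Gender as rows, Major as columns.
table = pd.crosstab(survey["Gender"], survey["Major"])

# Append column totals by summing down the rows (axis=0).
table.loc["Total"] = table.sum(axis=0)
print(table)
```

Passing margins=True to pd.crosstab is an alternative that adds both row and column totals in one call.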

2.2 Problem 2.2

Assume that the sample is a representative of the population of CMSU. Based on the data, answer
the following questions:

2.2.1. What is the probability that a randomly selected CMSU student will be male?
What is the probability that a randomly selected CMSU student will be female?

Solution – Using the totals in the contingency table above, the probabilities are:

• Probability of a randomly selected student being male is 0.46774
• Probability of a randomly selected student being female is 0.53226
2.2.2. Find the conditional probability of different majors among the male students in CMSU.
Find the conditional probability of different majors among the female students of CMSU.

• Probability of a male student being an Accounting major is 0.06897
• Probability of a male student being a CIS major is 0.03448
• Probability of a male student being an Economics major is 0.13793
• Probability of a male student being an International major is 0.06897
• Probability of a male student being a Management major is 0.20690
• Probability of a male student being an Other major is 0.13793
• Probability of a male student being a Retail/Marketing major is 0.17241
• Probability of a female student being an Accounting major is 0.09091
• Probability of a female student being a CIS major is 0.09091
• Probability of a female student being an Economics major is 0.21212
• Probability of a female student being an International major is 0.12121
• Probability of a female student being a Management major is 0.12121
• Probability of a female student being an Other major is 0.09091
• Probability of a female student being a Retail/Marketing major is 0.27273

2.2.3. Find the conditional probability of intent to graduate, given that the student is a male.
Find the conditional probability of intent to graduate, given that the student is a female.

• Probability of a male student intending to graduate is 0.58621
• Probability of a female student intending to graduate is 0.33333

2.2.4. Find the conditional probability of employment status for the male students as well as for the
female students.

• Probability of a male student being employed is 0.89655
• Probability of a female student being employed is 0.81818

2.2.5. Find the conditional probability of laptop preference among the male students as well as
among the female students.

• Probability of a male student preferring a laptop is 0.89655
• Probability of a female student preferring a laptop is 0.87879
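
Marginal and conditional probabilities of this kind can be read from a normalised crosstab. The sketch below uses invented rows, and the column names Gender and Computer are assumptions.

```python
import pandas as pd

# Hypothetical survey rows for illustration only; not the report's data.
survey = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Computer": ["Laptop", "Desktop", "Laptop", "Laptop", "Tablet"],
})

# Marginal probability of each gender: share of rows per gender.
p_gender = survey["Gender"].value_counts(normalize=True)

# Conditional probabilities P(Computer | Gender): each row sums to 1.
p_computer_given_gender = pd.crosstab(
    survey["Gender"], survey["Computer"], normalize="index"
)
print(p_gender)
print(p_computer_given_gender)
```

With normalize="index" each cell is divided by its row total, which is exactly the conditioning on Gender used in 2.2.2 to 2.2.5.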

2.3. Problem 2.3

Based on the above probabilities, do you think that the column variable in each case is
independent of Gender?
Justify your comment in each case.

1 - Male and female students have similar probabilities of choosing most majors, except for the
following observations –
• A higher proportion of male students choose the Management major
• A higher proportion of female students choose the Economics and Retail/Marketing majors

2 - Male students are more likely than female students to intend to graduate, so Grad Intention
does not appear to be independent of Gender

3 - Employment status and laptop preference appear to be independent of Gender, as the
probabilities are similar
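
One way to make the comparison concrete is to set each conditional probability against the overall probability, since under independence the two would be equal. The sketch below does this for intent to graduate, using the rounded probabilities from 2.2.1 and 2.2.3. It is illustrative arithmetic only and is not a formal significance test, which the report does not run.

```python
# Probabilities reported in 2.2.1 and 2.2.3 (rounded to five decimals).
p_male = 0.46774
p_female = 0.53226
p_grad_given_male = 0.58621
p_grad_given_female = 0.33333

# Law of total probability: overall P(intends to graduate).
p_grad = p_male * p_grad_given_male + p_female * p_grad_given_female

# Under independence both conditionals would equal p_grad,
# so both gaps would be close to zero.
gap_male = p_grad_given_male - p_grad
gap_female = p_grad_given_female - p_grad
print(round(p_grad, 5), round(gap_male, 5), round(gap_female, 5))
```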

2.4. Problem 2.4

Note that there are three numerical (continuous) variables in the data set, Salary, Spending and
Text Messages. For each of them comment whether they follow a normal distribution.
Write a note summarizing your conclusions.
[Recall that symmetric histogram does not necessarily mean that the underlying distribution is
symmetric]

Performing the Shapiro-Wilk test on the three continuous variables (Salary, Spending and Text Messages)
gives a p-value below 0.05 for each, so at the 5% level of significance we reject normality for all three:
none of these variables follows a normal distribution.

Below are the histogram graphs for each variable

Salary – p-value – 0.02

Spending – p-value – 1.6854661225806922e-05

Text Messages – p-value – 4.324040673964191e-06
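
A normality check of this kind can be sketched with scipy.stats.shapiro. The sample below is synthetic and deliberately skewed; it is not the survey data.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample for illustration only; not the survey data.
rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.0, size=200)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution.
w_stat, p_value = stats.shapiro(sample)

alpha = 0.05
reject_normality = p_value < alpha
print(w_stat, p_value, reject_normality)
```

A p-value below alpha leads to rejecting normality, which is the decision rule applied to Salary, Spending and Text Messages.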

3 – Hypothesis Testing for Quality of Shingles

Manufacturers of ABC asphalt shingles claim that the mean moisture content of shingles cannot be
greater than 0.35 pound per 100 square feet. For this test, two samples of A and B shingles are
provided, containing 36 and 31 measurements respectively.

• The variable under consideration is the moisture content in pounds per 100
square feet
• The file (A & B shingles.csv) is imported into Python using the pd.read_csv function
• Next we used the describe() function to check the statistics of the variable under
consideration

3.1. Problem 3.1

For the A shingles, form the null and alternative hypothesis to test whether the population mean
moisture content is less than 0.35 pound per 100 square feet.

Solution:
• Hypothesis Formulation:

H0: µ >= 0.35 pound per 100 square feet

H1: µ < 0.35 pound per 100 square feet

3.2. Problem 3.2

For the B shingles, form the null and alternative hypothesis to test whether the population mean
moisture content is less than 0.35 pound per 100 square feet.

Solution:

• Hypothesis Formulation:

H0: µ >= 0.35 pound per 100 square feet

H1: µ < 0.35 pound per 100 square feet
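
The report states these hypotheses without showing a test. As a sketch of how a lower-tailed hypothesis of this form could be tested, the example below runs a one-sample t-test by hand on invented readings. Choosing a t-test is an assumption here, and it presumes approximately normal data.

```python
import numpy as np
from scipy import stats

# Hypothetical moisture readings for illustration only; not the report's data.
sample = np.array([0.30, 0.32, 0.28, 0.35, 0.31, 0.29, 0.33, 0.30])
mu0 = 0.35  # hypothesised limit from the problem statement

n = sample.size
mean = sample.mean()
std_err = sample.std(ddof=1) / np.sqrt(n)

# Test statistic for H0: mu >= mu0 against H1: mu < mu0.
t_stat = (mean - mu0) / std_err

# Lower-tail p-value from the t distribution with n - 1 degrees of freedom.
p_value = stats.t.cdf(t_stat, df=n - 1)

reject_h0 = p_value < 0.05
print(t_stat, p_value, reject_h0)
```

Because H1 is one-sided, only the lower tail of the t distribution contributes to the p-value.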

3.3. Problem 3.3

Do you think that the population means for shingles A and B are equal? Form the hypothesis and
conduct the test of the hypothesis. What assumption do you need to check before the test for
equality of means is performed?

Solution:

• To check whether the population means for shingles A and B are equal, a two-sample
test is needed, since there are two population means under
consideration.
• The assumptions for the two-sample t-test are –
I. Populations are normally distributed;
II. The variances of the two populations are equal; and
III. The test is run at a 95% confidence level, i.e. level of significance α = 5%
• If the above assumptions are satisfied, a two-sample
independent t-test is appropriate.
• We therefore first check whether the data is normally distributed using the Shapiro-Wilk test
• The null and alternative hypotheses for the Shapiro-Wilk test –
o H0: Population is normally distributed
o H1: Population is not normally distributed
• For shingles A and B in the data frame, we used the stats.shapiro function in Python to
perform the test. The output for each gives two values – the test statistic (W) and the
p-value. The outputs are below:
A Shingles-

B Shingles-

• Since the p-value for both A and B shingles is less than α (0.05), we reject the null hypothesis
(H0) of normality; in other words, the data fails the first assumption.
• Since the normality assumption of the parametric test is not met, we use a non-parametric test for two
independent samples – the Mann-Whitney U test
• Hypothesis formulation:
The null and alternative hypotheses for the Mann-Whitney U test –
H0: µ1 = µ2 i.e. the population means for shingles A and B are equal or µ1 - µ2 = 0
H1: µ1 ≠ µ2 i.e. the population means for shingles A and B are not equal or µ1 - µ2 ≠ 0
• The stats.mannwhitneyu function in Python returns two values – the U statistic and the p-value. The
output is below:

• Interpretation: Since the p-value from the Mann-Whitney U test is greater than α (0.05), we
fail to reject the null hypothesis (H0); there is not enough evidence to conclude that the
population means for shingles A and B differ.
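
The non-parametric comparison above can be sketched with scipy.stats.mannwhitneyu. The two arrays are invented readings, not the A and B shingles samples, and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical moisture readings for illustration only; not the report's data.
shingles_a = np.array([0.30, 0.32, 0.28, 0.35, 0.31, 0.29, 0.33, 0.30])
shingles_b = np.array([0.27, 0.29, 0.31, 0.26, 0.28, 0.30, 0.25])

# Two-sided Mann-Whitney U test; returns the U statistic and the p-value.
u_stat, p_value = stats.mannwhitneyu(
    shingles_a, shingles_b, alternative="two-sided"
)

alpha = 0.05
reject_h0 = p_value < alpha
print(u_stat, p_value, reject_h0)
```

Setting alternative="two-sided" matches the "equal versus not equal" form of the hypotheses stated above.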

3.4. Problem 3.4

What assumption about the population distribution is needed in order to conduct the hypothesis
tests above?

I. The population data need not be normally distributed;
II. The variances of the two populations need not be equal;
III. Observations in both populations are independent, i.e. there is no relationship between the two
groups
IV. Confidence level of 95%, i.e. level of significance α = 5%
