You are on page 1of 19

Business Report Project – SMDM

By- Divjyot Shah Singh

Table of Contents

1 – Wholesale Customer Data Analysis............................................................................................................2


Problem 1.1..............................................................................................................................................2
Problem 1.2..............................................................................................................................................4
Problem 1.3..............................................................................................................................................6
Problem 1.4..............................................................................................................................................7
Problem 1.5..............................................................................................................................................8

2 - Clear Mountain State University (CMSU) Survey.....................................................................................10


Problem 2.1................................................................................................................................................10
Problem 2.2................................................................................................................................................11
Problem 2.3...................................................................................................................................................12
Problem 2.4...................................................................................................................................................13
Problem 2.5...................................................................................................................................................14
Problem 2.6...................................................................................................................................................15
Problem 2.7...................................................................................................................................................16
Problem 2.8...................................................................................................................................................17

3 – Hypothesis Testing for Quality of Shingles.............................................................................................18


Problem 3.1...................................................................................................................................................18
Problem 3.2...................................................................................................................................................19

1
Wholesale Customers Analysis
Problem Statement 1:

A wholesale distributor operating in different regions of Portugal has information on annual spending of several
items in their stores across different regions and channels. The data consists of 440 large retailers’ annual spending
on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and across different sales channel
(Hotel, Retail).

Solution – This business report provides detailed explanation of approach to each problem given in the assignment
and provides relative information with regards to solving the problem.

We imported the ‘Wholesale Customer data’ dataset in python to analyze the spend under each store items across
regions and channel to find solutions to each problem. Below is the detailed approach and answer.

Problem1.1.1. Use methods of descriptive statistics to summarize data.

Solution1.1.1. Using describe function in python we first looked at the basic descriptive statistics of the data set.

From the above two describe function, we can infer the following:

 Channel has two unique values, with "Hotel" as most frequent with 298 out of 440 transactions. i.e. 67.7
percentage of spending comes from "Hotel" channel.
 Retail has three unique values, with "Other" as most frequent with 316 out of 440 transactions. i.e. 71.8
percentage of spending comes from "Other" region.

2
Problem1.1.2.&1.1.3. Which region and channel spend most & least?

Solution1.1.2.&1.1.3. To calculate the region and channel with most and least spent we have created a summation of
all the Products into a new column ‘Spending’.

Now, using group by function for ‘Region’ and ‘Spending’ and similarly for ‘Channel’ and ‘Spending’ we were able to
indentify region with maximum spend and minimum spend. Below is the output:

From the above output we can infer:

 Other region spend amount is $10,677,599 with the highest spend amount and Oporto region spend amount
is $1,555,088 and has least spend amount by Region.
 Hotel channel spend amount is $7,999,569 with the highest spend amount and, Retail spend amount
$6,619,931 has least spend amount based on Channel.

3
Problem1.2. There are 6 different varieties of items are considered. Do all varieties show similar
behavior across Region and Channel? Provide a detailed justification for your answer.

Solution1.2. Using bar plot for each category and checking spend across Region and Channel we get the following
outputs:

1) Item – Fresh

Based on the plot, Fresh item has more spend in the Hotel channel and Other region

2) Item-Milk

Based on the plot, Milk item has more spend in the Retail channel and Other region

3) Item-Grocery
4
Based on the plot, Grocery item has more spend in the Retail channel and Oporto region

4) Item-Frozen

Based on the plot, Frozen item has more spend in the Hotel channel and Oporto region

5) Item-Detergents Paper
5
Based on the plot, Detergents Paper item has more spend in the Retail channel and Oporto region

6) Item- Delicatessen

Based on the plot, Delicatessen item has more spend in the Retail channel and Other region

Summary –

Looking at the above tables, we see that some categories like Milk, Grocery and Detergents-Paper have higher spend
in the Retail channel versus Hotel, across all regions. On the other hand, Fresh and Frozen have higher consumption in
the Hotel channel versus Retail, across all regions. Also, the spend for Fresh and groceries is the maximum across
region and channel while for Delicatessen it is the least across region and channel.

6
Problem1.3. On the basis of the descriptive measure of variability, which item shows the most
inconsistent behavior? Which items shows the least inconsistent behavior?

Solution1.3. To measure inconsistency behavior we have used techniques like Standard deviation, Coefficient of
Variation and Variance

Based on Standard deviation:

Fresh item have highest Standard deviation (12647.32). Therefore, it is most inconsistent.

Delicatessen item have smallest Standard deviation (2820.10). Therefore, it is least inconsistent.

Below is the output from Python:

Using Coefficient of Variation:

Fresh item has lowest coefficient of Variation (1.05). Therefore, it is least inconsistent.

Delicatessen item have highest coefficient of Variation (1.84). Therefore, it is most inconsistent.

Below is the output from Python –

Using Variance:
7
Fresh item has Variance (1.59). Therefore, it is least inconsistent.

Delicatessen item has Variance (7.95). Therefore, it is most inconsistent.

Below is the output from Python –

Problem1.4. Are there any outliers in the data? Back up your answer with a suitable plot/technique with the
help of detailed comments.

Solution1.4. To find out outliers we plotted box plot and the output gives the details that in all the data there are
outliers. Based on the output we can infer that Item-Fresh has wider distribution and therefore the data is more
scattered. On the other hand, Delicatessen narrow distribution and therefore the data is less scattered.

The black point is the outliers in box plot graph.

8
Problem1.5. On the basis of your analysis, what are your recommendations for the business? How can your
analysis help the business to solve its problem? Answer from the business perspective

Solution1.5.

As per the analysis, we find out that there are inconsistencies in spending of different items (by calculating Standard
deviation and Coefficient of Variation), which should be minimized. The spending of Hotel and Retail channel are
different which should be more or less equal. Also, the spent should be equal for different regions. Recommendation
to business is to focus on items other than “Fresh” and “Grocery”.

9
CMSU Survey Data Analysis
Problem Statement 2:

The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the
undergraduate students that attend CMSU. CMSU creates and distributes a survey of 14 questions and receives
responses from 62 undergraduates (stored in the Survey data set).

Problem2.1. For this data, construct the following contingency tables (Keep Gender as row variable)

Problem2.1.1. Gender and Major

Solution2.1.1. To construct the contingency table we have used cross tab function in Python

Below is the output from Python:

Problem2.1.2. Gender and Grad Intention

Solution2.1.1.

Below is the output from Python:

Problem2.1.3. Gender and Employment

10
Solution2.1.3.

Below is the output from Python:

Problem2.1.4 Gender and Computer

Solution2.1.4

Below is the output from Python

Problem2.2. Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:

Problem2.2.1. What is the probability that a randomly selected CMSU student will be male?

Solution2.2.1. To calculate probability that a randomly selected CMSU student will be male we need to find out
total male students out of whole student from the given data.

We used value count function for column Gender to identify total male and female in the dataset.

Below is the output from Python:

Probability that a randomly selected candidate will be male: 0.46774193548387094

Problem2.2.2. What is the probability that a randomly selected CMSU student will be female?

Solution2.2.2 For this we need to find out total female students out of whole student from the given data.

11
Probability that a randomly selected candidate will be female: 0.532258064516129

Problem2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer
the following question:

Problem2.3.1. Find the conditional probability of different majors among the male students in CMSU.

Solution2.3.1. Using contingency tables of Gender and Majors we got the total numbers of males and number of males
opting for different majors

Below is the output from Python –

And from this output we can easily say that most of the males’ students prefer Management as Majors and CIS is
the least preferred one

Problem2.3.2. Find the conditional probability of different majors among the female students in CMSU.

Solution2.3.2. Using contingency tables of Gender and Majors we got the total numbers of males and number of
females opting for different majors

Below is the output from Python –

12
And from this output we can easily say that most of the females’ students prefer Retailing/Marketing as Majors.

Problem2.4. Assume that the sample is a representative of the population of CMSU. Based on the data, answer
the following question:

Problem2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.

Solution2.4.1. Using contingency tables of Gender and Grad Intention we got the total numbers of males and
number of males intends to be graduate

P (Grad Intention ∩ Male) = P (Grad Intention| Male) x P (male) = 0.27419354838709675

Problem2.4.2. Find the probability that a randomly selected student is a female and does NOT have a laptop

Solution2.4.2. Using contingency tables of Gender and Computer we got the total numbers of females and number
of females who does not have a laptop

P (not having laptop ∩ Female) = P (not having laptop | Female) x P (Female) = 0.06451612903225806

Problem2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer
the following question:

13
Problem 2.5.1 Find the probability that a randomly chosen student is either a male or has full- time
employment?

Solution2.5.1.
Using contingency tables of Gender and Employment we got the total numbers of males and number of males who are
full time employed

Probability of randomly chosen student is a male or has a full-time employment 0.5161290322580645

Problem2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in
international business or management.

Solution2.5.2. Using contingency tables of Gender and Major we got the total numbers of females and number of
females majoring in international business or management.

Probability that given a female student is randomly chosen, she is majoring in international business or
management is: 0.45546372819100095

Problem2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The
Undecided students are not considered now and the table is a 2x2 table. Do you think the graduate intention
and being female are independent events?

Solution2.6. To construct a 2x2 table we used drop function to remove column ‘Undecided’

Below is the output from Python –

For 2 events to be independent, following condition is to be satisfied

P(A ∩ B) = P(A) * P(B)

14
So, P (Grad Intention|Yes ∩ Female) = P(Winner) * P(Female)
P(Female) = 0.5
P(Grad Intention|Yes)= 0.7
P(Grad Intention|Yes) * P(Female) = 0.35
P(Grad Intention|Yes ∩ Female) = 0.275

This is not independent events as probability multiplication of both events is not equal to combined event, so having
Graduate intention and being female candidate are not independent events

Problem2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending,
and Text Messages. Answer the following questions based on the data

Problem2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?

Solution2.7.1. Using contingency tables of Gender and GPA we got the total numbers of students and number of
students GPA less than 3

We have total of 17 students out of total of 62 students with GPA less than 3. Therefore, Probability of GPA less
than 3 is 0.27419354838709675

Problem 2.7.2 Find the conditional probability that a randomly selected male earns 50 or more. Find the
conditional probability that a randomly selected female earns 50 or more.

Solution2.7.2:

15
Using contingency tables of Gender and Salary we got the total numbers of Male and Female and number of male and
female earning 50 or more

Among MALE candidates:


14 candidates among total 29 male candidates have salary of 50 or more. Therefore, Probability of having salary 50
or more: 0.4827586206896552

Among FEMALE candidates:


18 candidates among total 33 female candidates have salary of 50 or more. Therefore, Probability of having salary
50 or more: 0.5454545454545454

Problem2.8.1 Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending,
and Text Messages. For each of them comment whether they follow a normal distribution. Write a note
summarizing your conclusions.

Solution2.8.1. Used dist plot to know the normal distribution of these four numerical (continuous) variables in the data

16
set – GPA, Salary, Spending and Text Messages

And to confirm whether these four data sets are following normal distribution or not, we have done the Shapiro–
Wilk test and the output from Python we got –

GPA: Shapiro Result (statistic=0.9685361981391907, pvalue=0.11204058676958084)


Salary: Shapiro Result (statistic=0.9565856456756592, pvalue=0.028000956401228905)
Spending: Shapiro Result (statistic=0.8777452111244202, pvalue=1.6854661225806922e-05)
Text Messages: Shapiro Result (statistic=0.8594191074371338, pvalue=4.324040673964191e-06)

Problem2.8.2. Write a note summarizing your conclusions

Solution2.8.2. Using dist plot we could clearly indentify that ‘Spending’ and ‘Text Messages’ are not following the
normal distribution

Further to test normality of data set we performed Shapiro Wilk test. We confirm that out of the given four data
sets only ‘GPA’ is following normal distribution as p value is greater than 0.05.

Asphalt Shingles Data Analysis


Problem Statement 3 -
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount of moisture

17
the shingles contain when they are packaged. Customers may feel that they have purchased a product lacking in
quality if they find moisture and wet shingles inside the packaging. In some cases, excessive moisture can cause the
granules attached to the shingles for texture and coloring purposes to fall off the shingles resulting in appearance
problems. To monitor the amount of moisture present, the company conducts moisture tests. A shingle is weighed
and then dried. The shingle is then reweighed, and based on the amount of moisture taken out of the product; the
pounds of moisture per 100 square feet are calculated. The company would like to show that the mean moisture
content is less than 0.35 pound per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles and 31 for B
shingles.

Problem3.1. Do you think there is evidence that mean moisture contents in both types of shingles are within the
permissible limits? State your conclusions clearly showing all steps.

Solution3.1.

Step 1: Define null and alternative hypotheses

Null hypothesis states that mean moisture content is less than 0.35 per hundred square feet
Alternative hypothesis states that mean moisture content is not less than 0.35 per hundred square feet

 Ho: μ =< 0.35
 Ha: μ > 0.35

Step 2: Decide the significance level

Here we select α = 0.05

Step 3: Identify the test statistic


Since we do not know the population standard deviation and n > 30. So we use the t distribution and the tSTAT test
statistics

Step 4: Calculate the p - value and test statistic

Output from Python Jupyter (Shingles A):

One sample t test


t statistic: -1.4735046253382782 p value: 0.07477633144907513

Step 5: Decide to reject or accept null hypothesis

Since p value > 0.05, we fail to reject Ho. There is not enough evidence to conclude that the mean moisture
content for Sample A shingles is less than 0.35 pounds per 100 square feet. p-value = 0.0748. If the population
mean moisture content is in fact no less than 0.35 pounds per 100 square feet, the probability of observing a
sample of 36 shingles that will result in sample mean moisture content of 0.3167 pounds per 100 square feet or
less is 0.0748.

18
Output from Python Jupyter (Shingles B)

One sample t test


t statistic: -3.1003313069986995 p value: 0.0020904774003191826

Since p value < 0.05, reject Ho. There is enough evidence to conclude that the mean moisture content for Sample B
shingles is not less than 0.35 pounds per 100 square feet. p-value = 0.0021.If the population mean moisture content is
in fact no less than 0.35pounds per 100 square feet, the probability of observing a sample of 31 shingles that will result
in sample mean moisture content of 0.2735 pounds per 100 square feet or less is .0021.

Problem3.2. Do you think that the population means for shingles A and B are equal? Form the hypothesis and
conduct the test of the hypothesis. What assumption do you need to check before the test for equality of means is
performed?

Solution3.2:

Ho: μ(A)= μ(B)


Ha: μ(A)!=μ(B)
α = 0.05

Output from Python Jupyter:

t_statistic = 1.29 and pvalue= 0.202

As the pvalue > 0.05, we fail to reject Ho; and we can say that population mean for shingles A and B are equal

Test Assumptions:

When running a two-sample t-test, the basic assumptions are that the distributions of the two populations are normal,
and that the variances of the two distributions are the same. If those assumptions are not likely to be met, another
testing procedure could be use.

19

You might also like