
BUSINESS ANALYTICS REPORT

Submitted to:
Concerned Faculty
At
Great Learning
The University of Texas at Austin

Submitted by:
Rachit Mittal
PGPDSBA online July E 2020
Post Graduate Program in Data Science and Business Analytics
Problem Statement 1

A wholesale distributor operating in different regions of Portugal has information on the annual
spending on several items in its stores across different regions and channels. The data
consists of 440 large retailers’ annual spending on 6 different varieties of products in 3
different regions (Lisbon, Oporto, Other) and across 2 sales channels (Hotel, Retail).

Exploratory Data Analysis: -

Dataset shows that there are 9 variables.


1. Channel and Region both are categorical columns
2. Fresh, Milk, Grocery, Frozen, Detergents Paper and Delicatessen are of integer data type.

Descriptive Data Analysis: -

1. The Channel column has two unique values, of which ‘Hotel’ is the most frequent (298).
The Region column has three unique values, of which ‘Other’ is the most frequent (316).
2. The mean and standard deviation of each item are as follows: -

No.  Item               Mean          Standard Deviation

1    Fresh              12000.2977    12647.3289
2    Milk                5796.2659     7380.3772
3    Grocery             7951.2773     9503.1628
4    Frozen              3071.9318     4854.6733
5    Detergents Paper    2881.4932     4767.8544
6    Delicatessen        1524.8704     2820.1059

Checking for Null-values: -

The above data shows that there are no null-values in the data.

Question 1.1: - Which Region and which Channel seems to spend more? Which Region and
which Channel seems to spend less?

Answer 1.1: -

1. The Hotel channel spends more (7,999,569) than the Retail channel (6,619,931).
2. Hotels therefore spend 20.8% more than Retail.

3. Spending in the Other region is far higher than in the two remaining regions combined:
10,677,599, compared with annual spending of 2,386,813 in Lisbon and 1,555,088 in Oporto.
4. The Lisbon region spends 53.5% more than Oporto. Oporto spends the least, at
1,555,088.

5. Across combinations of Channel and Region, spending is highest for Hotel in the
Other region, at 5,742,077.
6. Hotel in the Oporto region appears to spend the least, at 719,150.
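
The percentage comparisons above can be reproduced from the reported totals. The following is a minimal sketch: the helper name `pct_more` and the dictionary names are illustrative assumptions, and only the totals come from this report.

```python
# Totals as reported in Answer 1.1.
channel_spend = {"Hotel": 7_999_569, "Retail": 6_619_931}
region_spend = {"Other": 10_677_599, "Lisbon": 2_386_813, "Oporto": 1_555_088}


def pct_more(a: float, b: float) -> float:
    """Return by how many percent a exceeds b."""
    return (a - b) / b * 100


hotel_vs_retail = pct_more(channel_spend["Hotel"], channel_spend["Retail"])
lisbon_vs_oporto = pct_more(region_spend["Lisbon"], region_spend["Oporto"])

print(f"Hotel vs Retail: {hotel_vs_retail:.1f}% more")
print(f"Lisbon vs Oporto: {lisbon_vs_oporto:.1f}% more")

# Sanity check: both breakdowns should add up to the same grand total.
print(sum(channel_spend.values()) == sum(region_spend.values()))
```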

Question 1.2: - There are 6 different varieties of items considered. Do all varieties show
similar behavior across Region and Channel?

Answer 1.2: -

1. The graph shows that the amount spent on ‘Fresh’ items is higher through the
Hotel channel than through the Retail channel. Also, in every region, spending on
Fresh items is higher in the Hotel channel than in the Retail channel.
2. ‘Grocery’ items are the major contributor to annual expense in the Retail channel
in every region. ‘Frozen’ items and ‘Delicatessen’ items are the smallest contributors to
annual expense in the Retail channel.

3. In the Hotel channel, the average annual spending on ‘Fresh’ items is higher in the Lisbon
region than in the Oporto region; the opposite trend is seen in the Retail channel.

4. The average annual spending on ‘Milk’ items is higher in the Retail channel than in the
Hotel channel. Lisbon appears to spend more than Oporto on Milk items
through both channels.

5. The average amount spent on ‘Grocery’ items is higher in the Retail channel than in
the Hotel channel. Also, the Lisbon region spends more on Grocery items through the Retail
channel than through the Hotel channel, where Oporto’s annual spending is the highest across the
country.

6. The average annual spending on ‘Frozen’ items is higher in the Hotel channel than in the Retail
channel. The Oporto region is the major contributor for Hotel whereas the average spending is
highest in Retail channel.

7. The average annual spending on ‘Detergents Paper’ is much higher in the Retail channel
than in the Hotel channel. The Oporto region spends the most via Retail, whereas
Hotels dominate the spending in the Lisbon region.

8. The average annual spending on ‘Delicatessen’ items is lower in the Hotel channel than in the
Retail channel. The Lisbon region, on average, spends the most via Retail.

Question 1.3: - On the basis of descriptive measure of variability, which item shows the
most inconsistent behavior? Which items show the least inconsistent behaviour?

Answer 1.3: -

Items               Standard Deviation

Fresh               12647.33
Milk                 7380.38
Grocery              9503.16
Frozen               4854.67
Detergents Paper     4767.85
Delicatessen         2820.11

The standard deviation shows that Fresh items, with the highest standard deviation
(12,647.33), show the most inconsistent behavior, whereas Delicatessen items, with the lowest standard
deviation (2,820.11), show the least inconsistent behavior.

The histograms of the items show the same pattern: ‘Fresh’ and ‘Grocery’ items are
the most widely spread and have the highest standard deviations, whereas
‘Delicatessen’ items, being less variable, have the lowest standard deviation.
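
Standard deviation depends on the scale of each item, so a scale-free measure can serve as a cross-check. The sketch below computes the coefficient of variation (standard deviation divided by mean) from the means and standard deviations reported in the descriptive analysis. Using this measure is an assumption of this example rather than part of the analysis above, and it may rank the items differently from the raw standard deviation.

```python
# (mean, standard deviation) per item, as reported in the descriptive analysis.
stats = {
    "Fresh": (12000.2977, 12647.3289),
    "Milk": (5796.2659, 7380.3772),
    "Grocery": (7951.2773, 9503.1628),
    "Frozen": (3071.9318, 4854.6733),
    "Detergents Paper": (2881.4932, 4767.8544),
    "Delicatessen": (1524.8704, 2820.1059),
}

# Coefficient of variation = standard deviation / mean (unitless).
cv = {item: sd / mean for item, (mean, sd) in stats.items()}

# List the items from most to least variable relative to their mean.
for item, value in sorted(cv.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{item:<17} CV = {value:.3f}")
```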

Question 1.4: - Are there any outliers in the data?

Answer 1.4: -

The box plots show that every item in the data has outliers.
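
Box plots usually flag points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of that rule follows. The function name and the sample values are hypothetical, not taken from the wholesale data, and the quartile method is an assumption (plotting libraries may compute quartiles slightly differently).

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical spending figures, for illustration only.
sample = [1, 2, 3, 4, 5, 100]
print(iqr_outliers(sample))
```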

Recommendations

1. The annual spending on Grocery items appears to rise with the number of Retailers
in the region. So, the Retail channel should spend most on Grocery items. The spending
should be done carefully, as Grocery items are also very inconsistent.
2. With the number of Retail customers similar in the Lisbon and Oporto regions, the top 3 items
that Retailers should focus on are Grocery, Milk and Detergents Paper, keeping in mind the
variability in Grocery and Milk items.
3. The Oporto region should focus on developing its Hotel channel, with Fresh and Frozen items
as its topmost priority.
4. The annual spending through both channels in all the regions should be managed
carefully, especially in the case of Fresh items. The Fresh items have the highest standard
deviation and are the most inconsistent. So, the spending on this item should be done
carefully.
Problem Statement 2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a survey
of 14 questions and receives responses from 62 undergraduates.

Exploratory Data Analysis: -

Data has 14 variables in it


1. There are 6 categorical variables: Gender, Class, Major, Grad Intention, Employment
and Computer.
2. There are 5 integer data type variables: Age, Social Networking, Satisfaction,
Spending and Text Messages.
3. There are 2 float data type variables: GPA and Salary.

Question 2.1: - For this data, construct the following contingency tables

2.1.1. Gender and Major
2.1.2. Gender and Grad Intention
2.1.3. Gender and Employment
2.1.4. Gender and Computer

Answer 2.1.1: -

Answer 2.1.2: -

Answer 2.1.3: -

Answer 2.1.4: -

Question 2.2: - Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:

2.2.1. What is the probability that a randomly selected CMSU student will be male?

Answer 2.2.1: -

Total number of students = 62


Number of male students = 29

Probability that a randomly selected CMSU student will be male = 29/62

P(Male) = 0.468

2.2.2. What is the probability that a randomly selected CMSU student will be female?

Answer 2.2.2: -

Total number of students = 62


Number of Female students = 33

Probability that a randomly selected CMSU student will be female = 33/62

P(Female) = 0.532

Question 2.3: - Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:

2.3.1 Find the conditional probability of different majors among the male students in
CMSU

Answer 2.3.1: -

Total number of students = 62
Number of males = 29

1. Number of males prefer Acc = 4

Probability Acc ∩ Male = 4/62


Probability Male = 29/62
Probability of a male selecting Acc = (Probability Acc ∩ Male)/ (Probability Male)

Probability of a male selecting Acc = 4/29

P (Accounting| Male) = 0.138

2. Number of males prefer CIS = 1

Probability CIS ∩ Male = 1/62


Probability Male = 29/62
Probability of a male selecting CIS = (Probability CIS ∩ Male)/ (Probability Male)

Probability of a male selecting CIS = 1/29

P (CIS| Male) = 0.034

3. Number of males prefer Economics/Finance = 4

Probability Economics Finance ∩ Male = 4/62


Probability Male = 29/62
Probability of a male selecting Economics/Finance =
(Probability Economics/Finance ∩ Male)/ (Probability
Male)

Probability of a male selecting Economics Finance = 4/29

P (Economics/Finance| Male) = 0.138

4. Number of males prefer International Business = 2

Probability International Business ∩ Male = 2/62


Probability Male = 29/62
Probability of a male selecting International Business =
(Probability International Business ∩ Male)/ (Probability
Male)

Probability of a male selecting International Business = 2/29

P (International Business| Male) = 0.069

5. Number of males prefer Management = 6

Probability Management ∩ Male = 6/62


Probability Male = 29/62

Probability of a male selecting Management =


(Probability Management ∩ Male)/ (Probability
Male)

Probability of a male selecting Management = 6/29

P (Management| Male) = 0.207

6. Number of males prefer Retailing/Marketing = 5

Probability Retailing/Marketing ∩ Male = 5/62


Probability Male = 29/62

Probability of a male selecting Retailing Marketing =


(Probability Retailing Marketing ∩ Male)/ (Probability
Male)

Probability of a male selecting Retailing/Marketing = 5/29

P (Retailing/Marketing| Male) = 0.172

7. Number of males prefer Other = 4

Probability Other ∩ Male = 4/62


Probability Male = 29/62

Probability of a male selecting Other = (Probability Other ∩ Male)/ (Probability Male)

Probability of a male selecting Other = 4/29

P (Other| Male) = 0.138

8. Number of males Undecided = 3

Probability Undecided ∩ Male = 3/62


Probability Male = 29/62

Probability of a male Undecided = (Probability Undecided ∩ Male)/ (Probability Male)

Probability of a male Undecided = 3/29

P (Undecided| Male) = 0.103

2.3.2 Find the conditional probability of different majors among the female students in
CMSU

Answer 2.3.2: -

Total number of students = 62


Number of females = 33

1. Number of females prefer Acc = 3

Probability Acc ∩ Female = 3/62


Probability Female = 33/62
Probability of a female selecting Acc = (Probability Acc ∩ Female)/ (Probability Female)

Probability of a female selecting Acc = 3/33

P (Accounting| Female) = 0.091

2. Number of females prefer CIS = 3

Probability CIS ∩ Female = 3/62


Probability Female = 33/62
Probability of a female selecting CIS = (Probability CIS ∩ Female)/ (Probability Female)

Probability of a female selecting CIS = 3/33

P (CIS| Female) = 0.091

3. Number of females prefer Economics/ Finance = 7

Probability Economics Finance ∩ Female = 7/62


Probability Female = 33/62
Probability of a female selecting Economics/Finance =
(Probability Economics/Finance ∩ Female)/ (Probability Female)

Probability of a female selecting Economics Finance = 7/33

P (Economics/Finance| Female) = 0.212

4. Number of females prefer International Business = 4

Probability International Business ∩ Female = 4/62


Probability Female = 33/62
Probability of a female selecting International Business =
(Probability International Business ∩ Female)/ (Probability
Female)

Probability of a female selecting International Business = 4/33

P (International Business| Female) = 0.121

5. Number of females prefer Management = 4

Probability Management ∩ Female = 4/62


Probability Female = 33/62

Probability of a female selecting Management =


(Probability Management ∩ Female)/ (Probability
Female)

Probability of a female selecting Management = 4/33

P (Management| Female) = 0.121

6. Number of females prefer Retailing/Marketing = 9

Probability Retailing/Marketing ∩ Female = 9/62


Probability Female = 33/62

Probability of a female selecting Retailing Marketing =


(Probability Retailing/Marketing ∩ Female)/ (Probability
Female)

Probability of a female selecting Retailing Marketing = 9/33

P (Retailing/Marketing| Female) = 0.273

7. Number of females prefer Other = 3

Probability Other ∩ Female = 3/62


Probability Female = 33/62

Probability of a female selecting Other =
(Probability Other ∩ Female)/ (Probability Female)

Probability of a female selecting Other = 3/33

P (Other| Female) = 0.091

8. Number of females Undecided = 0

Probability Undecided ∩ Female = 0/62


Probability Female = 33/62

Probability of a female Undecided =


(Probability Undecided ∩ Female)/ (Probability
Female)

Probability of a female Undecided = 0/33

P (Undecided| Female) = 0.0
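
The conditional probabilities in Answers 2.3.1 and 2.3.2 all follow the same pattern, so they can be computed in one pass from the Gender by Major counts. The following is a minimal sketch using only the counts listed above; the variable and function names are illustrative assumptions.

```python
from fractions import Fraction

# Counts of students by major, as listed in Answers 2.3.1 and 2.3.2.
majors = ["Accounting", "CIS", "Economics/Finance", "International Business",
          "Management", "Retailing/Marketing", "Other", "Undecided"]
counts = {
    "Male": [4, 1, 4, 2, 6, 5, 4, 3],
    "Female": [3, 3, 7, 4, 4, 9, 3, 0],
}
total = sum(sum(row) for row in counts.values())  # all surveyed students


def conditional(gender: str) -> dict:
    """P(Major | gender) = P(Major ∩ gender) / P(gender)."""
    p_gender = Fraction(sum(counts[gender]), total)
    return {
        major: Fraction(n, total) / p_gender
        for major, n in zip(majors, counts[gender])
    }


for gender in counts:
    print(gender)
    for major, p in conditional(gender).items():
        print(f"  P({major} | {gender}) = {p} = {float(p):.3f}")
```

Because the total of 62 cancels, each result reduces to the count divided by the number of students of that gender, and the probabilities for each gender sum to 1.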

Question 2.4: - Assume that the sample is a representative of the population of CMSU.
Based on the data, answer the following question:

2.4.1 Find the probability that a randomly chosen student is a male and intends to
graduate.

Answer 2.4.1: -

Total number of students = 62
Number of males = 29
Number of males with Graduation Intent [Yes] = 17

The question asks for a joint probability (male and intends to graduate), so the count is
divided by the total number of students:

P (Male ∩ Graduation Intent [Yes]) = 17/62 = 0.274

For comparison, the conditional probability that a student intends to graduate, given that the
student is male, is:

P (Graduation Intent [Yes]| Male) = (17/62)/ (29/62) = 17/29 = 0.586

2.4.2 Find the probability that a randomly chosen student is a female and does NOT
have a laptop.

Answer 2.4.2: -

Total number of students = 62
Number of females = 33
Number of females not having a laptop = 4

The question asks for a joint probability (female and no laptop), so the count is divided by
the total number of students:

P (Female ∩ No Laptop) = 4/62 = 0.065

For comparison, the conditional probability of not having a laptop, given that the student is
female, is:

P (No Laptop | Female) = (4/62)/ (33/62) = 4/33 = 0.121
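
Question 2.4 is worded as "is a male and intends to graduate", which describes a joint event, so it helps to see the joint and conditional probabilities side by side. A minimal sketch using the counts from Answer 2.4 follows; the variable names are illustrative assumptions.

```python
total_students = 62
males, males_grad_yes = 29, 17
females, females_no_laptop = 33, 4

# Joint probability: the event count is divided by ALL students.
p_male_and_grad = males_grad_yes / total_students
p_female_and_no_laptop = females_no_laptop / total_students

# Conditional probability: the event count is divided by the group size.
p_grad_given_male = males_grad_yes / males
p_no_laptop_given_female = females_no_laptop / females

print(round(p_male_and_grad, 3), round(p_grad_given_male, 3))
print(round(p_female_and_no_laptop, 3), round(p_no_laptop_given_female, 3))
```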

Question 2.5: - Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:

2.5.1 Find the probability that a randomly chosen student is either a male or has full-
time employment?

Answer 2.5.1: -

Total number of students = 62
Number of males = 29
Number of full-time employees = 10
Number of male full-time employees = 7

Probability Male = 29/62


Probability full time employees = 10/62
Probability males ∩ full-time employees = 7/62

P (Male U Full-Time Employment) =
Probability Male + Probability full-time employees - Probability males ∩ full-time
employees

P (Male U Full-Time Employment) = 29/62 + 10/62 - 7/62 = 32/62

P (Male U Full-Time Employment) = 0.516
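
The addition rule used in Answer 2.5.1 can be checked with exact fractions. This is a minimal sketch using the counts given above; the variable names are illustrative assumptions.

```python
from fractions import Fraction

total = 62
p_male = Fraction(29, total)
p_full_time = Fraction(10, total)
p_male_and_full_time = Fraction(7, total)

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_male_or_full_time = p_male + p_full_time - p_male_and_full_time
print(p_male_or_full_time, round(float(p_male_or_full_time), 3))
```

Subtracting the intersection avoids counting the 7 male full-time employees twice.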

2.5.2 Find the conditional probability that given a female student is randomly chosen,
she is majoring in international business or management.

Answer 2.5.2: -

Total number of students = 62


Number of females = 33
Total number of female International Business or Management = 8

Probability (International Business or Management) ∩ Female = 8/62


Probability Female = 33/62

Probability female International Business or Management =


(Probability International Business or Management ∩ Female)/ (Probability female)

Probability female International Business or Management = 8/33

P (International Business or Management| Female) = 0.242

Question 2.6: - Construct a contingency table of Gender and Intent to Graduate at 2 levels
(Yes/No). The Undecided students are not considered now and the table is a 2x2 table. Do
you think the graduate intention and being female are independent events?

Answer 2.6: -

Two events A and B are independent when P(A ∩ B) = P(A)*P(B), where
P(A ∩ B) = P(A|B)*P(B). If the two sides differ, the events are dependent.

Let A = Graduation Intent Yes and B = Female.

Total number of students = 62
Number of females = 33
Number of students with Graduation Intent Yes = 28
Number of females with Graduation Intent Yes = 11

Probability female, P(B) = 33/62
Probability Graduation Intent Yes, P(A) = 28/62

P(A)*P(B) = 924/3844

Probability Graduation Intent Yes | Female, P(A|B) = 11/33

P(A|B)*P(B) = 11/62

P(A)*P(B) = 0.24
P(A|B)*P(B) = 0.177
Since 0.24 is not equal to 0.177, graduate intention and being female are not independent events.
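
The independence check compares P(A ∩ B) with P(A)*P(B). A minimal sketch with the counts used in Answer 2.6 follows; the variable names are illustrative assumptions. With sample data exact equality is rare, so a formal test of association would normally back up this comparison, which is outside this sketch.

```python
from fractions import Fraction

total = 62
p_female = Fraction(33, total)               # P(B)
p_grad_yes = Fraction(28, total)             # P(A)
p_female_and_grad_yes = Fraction(11, total)  # P(A ∩ B) = P(A|B) * P(B)

product = p_female * p_grad_yes              # P(A) * P(B)

# Independence requires P(A ∩ B) == P(A) * P(B).
independent = p_female_and_grad_yes == product
print(float(product), float(p_female_and_grad_yes), independent)
```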

Question 2.7: - Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages. Answer the following questions based on the
data
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is
less-than 3?

Answer 2.7.1: -

Total number of students = 62


Number of students with GPA less than 3 = 17
Probability student’s GPA less than 3 = 17/62

P (GPA < 3.0) = 0.274


2.7.2. Find the conditional probability that a randomly selected male earns 50 or
more. Find the conditional probability that a randomly selected female earns 50 or more.

Answer 2.7.2: -

Total number of students = 62

1. Number of males = 29
Number of males with salary of 50 or more = 14

Probability males ∩ salary of 50 or more = 14/62
Probability Male = 29/62

Probability salary of 50 or more given male =
(Probability males ∩ salary of 50 or more)/ (Probability Male)

Probability salary of 50 or more given male = 14/29

P (Salary >= 50|Male) = 0.483

2. Number of females = 33
Number of females with salary of 50 or more = 18

Probability females ∩ salary of 50 or more = 18/62
Probability Female = 33/62

Probability salary of 50 or more given female =
(Probability females ∩ salary of 50 or more)/ (Probability Female)

Probability salary of 50 or more given female = 18/33

P (Salary >= 50|Female) = 0.545

Question 2.8: - Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages. For each of them comment whether they follow
a normal distribution. Write a note summarizing your conclusions

Answer 2.8: -

1. The GPA box plot is approximately symmetric, with whiskers of similar length, which
suggests a normal distribution, whereas the box plots of Salary, Spending and Text Messages
have whiskers of different lengths and hence do not appear normally distributed.
2. The Shapiro-Wilk test supports this. The null and alternative hypotheses of the
Shapiro-Wilk test are:

2.1. Ho: - The data is normally distributed.
Ha: - The data is not normally distributed.
3. The outputs of the Shapiro-Wilk Test are as follows: -

3.1. Shapiro-Wilk Test of Normality for GPA


Test stat: 0.969
P value: 0.112
Since P value > α (0.05), we fail to reject the Null hypothesis.
Hence, GPA can be treated as Normally distributed.

3.2. Shapiro-Wilk Test of Normality for Salary


Test stat: 0.957
P value: 0.028
Since, P value < α (0.05), we reject the Null hypothesis.

Hence, Salary is NOT Normally distributed.

3.3. Shapiro-Wilk Test of Normality for Spending


Test stat: 0.878
P value: 0.0
Since, P value < α (0.05), we reject the Null hypothesis.
Hence, Spending is NOT Normally distributed.

3.4. Shapiro-Wilk Test of Normality for Text Messages


Test stat: 0.859
P value: 0.0
Since, P value < α (0.05), we reject the Null hypothesis.
Hence, Text Messages is NOT Normally distributed.
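
The same decision rule is applied to all four variables, so it can be written once. The sketch below uses only the test statistics and P values reported above; the names `shapiro_results` and `decide` are illustrative assumptions. Values like these are typically produced by a routine such as `scipy.stats.shapiro`, although the report does not state which tool was used.

```python
ALPHA = 0.05

# Test statistic and P value per variable, as reported above.
shapiro_results = {
    "GPA": (0.969, 0.112),
    "Salary": (0.957, 0.028),
    "Spending": (0.878, 0.0),
    "Text Messages": (0.859, 0.0),
}


def decide(p_value: float, alpha: float = ALPHA) -> str:
    """Apply the decision rule: reject H0 (normality) when p < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


decisions = {name: decide(p) for name, (_, p) in shapiro_results.items()}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```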

Problem Statement 3

An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that they
have purchased a product lacking in quality if they find moisture and wet shingles inside the
packaging. In some cases, excessive moisture can cause the granules attached to the shingles
for texture and coloring purposes to fall off the shingles resulting in appearance problems. To
monitor the amount of moisture present, the company conducts moisture tests. A shingle is
weighed and then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet are calculated. The company
would like to show that the mean moisture content is less than 0.35 pound per 100 square feet.

Question 3.1. Do you think there is evidence that mean moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly showing all steps.

Answer 3.1.: -

Let
μA = Mean moisture content in shingle A
μB = Mean moisture content in shingle B

For Shingle A
Step I
Null hypothesis (H0) states that the mean moisture content in shingles A, μA, is greater than or
equal to 0.35.
Alternative hypothesis (Ha) states that the mean moisture content in shingles A, μA, is less than
0.35.

H0: μA >= 0.35
Ha: μA < 0.35
Step II

Since α is not given, we select α = 0.05.

The sample size (n) for this problem is 36.


Step III
We do not know the population standard deviation and n = 36. So, we use the t
distribution and the tSTAT test statistic.
Step IV

One sample t test


t statistic: -1.473505
P value: 0.07477633
Step V
Level of significance: 0.05
P value > Level of significance.
The P value is 0.074776, which is greater than the 5% level of significance.

So, the statistical decision is to fail to reject the null hypothesis at the 5% level of
significance.

Conclusion
Hence, at the 5% level of significance, there is not sufficient evidence to conclude that the mean
moisture content in A shingles is less than 0.35 pound per 100 square feet.

For Shingle B
Step I
Null hypothesis (H0) states that the mean moisture content in shingles B, μB, is greater than or
equal to 0.35.
Alternative hypothesis (Ha) states that the mean moisture content in shingles B, μB, is less than
0.35.

H0: μB >= 0.35
Ha: μB < 0.35
Step II
Since α is not given, we select α = 0.05.

The sample size (n) for this problem is 31.


Step III
We do not know the population standard deviation and n = 31. So, we use the t
distribution and the tSTAT test statistic.
Step IV

One sample t test
t statistic: -3.10033130
P value: 0.0020904774
Step V
Level of significance: 0.05
P value < Level of significance
P value is 0.00209 and it is less than 5% level of significance

So, the statistical decision is to reject the null hypothesis at 5% level of significance.

Conclusion
Hence, at the 5% level of significance, there is sufficient evidence to conclude that the mean
moisture content in B shingles is less than 0.35 pound per 100 square feet.
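
The one-sample t statistic used for both shingle types is the sample mean minus the hypothesised mean, divided by the standard error. The sketch below shows that computation and the decision rule. The readings are hypothetical (the raw shingle data is not reproduced in this report), the function names are illustrative assumptions, and the P values passed to `decide` are the ones reported above.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """Return the t statistic and degrees of freedom for a test against mu0."""
    n = len(sample)
    t_stat = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
    return t_stat, n - 1


def decide(p_value, alpha=0.05):
    """Reject H0 when the P value is below the level of significance."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical moisture readings, for illustration only.
readings = [0.30, 0.32, 0.34, 0.36, 0.28]
t_stat, df = one_sample_t(readings, mu0=0.35)
print(round(t_stat, 4), df)

# Decisions for the P values reported for shingles A and B.
print("Shingle A:", decide(0.07477633))
print("Shingle B:", decide(0.0020904774))
```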

Question 3.2: - Do you think that the population mean for shingles A and B are equal?
Form the hypothesis and conduct the test of the hypothesis. What assumption do you need
to check before the test for equality of means is performed?

Answer 3.2: -

We test whether the mean moisture content is the same in shingles A and B.

Assumptions

1. We assumed that samples are random and both the populations are normally distributed.
2. We assumed unequal variances of the populations.
Step I
Null hypothesis (H0) states that the mean moisture content in shingles is the same, μA
equals μB.
Alternative hypothesis (Ha) states that the mean moisture content in shingles is different,
μA is not equal to μB.

H0: μA - μB = 0 i.e. μA = μB
Ha: μA - μB ≠ 0 i.e. μA ≠ μB
Step II
Since α is not given, we select α = 0.05.
Step III
We have two samples and we do not know the population standard deviations.

The samples are not large. So, we use the t distribution and the tSTAT test statistic
for a two-sample unpaired test.

Step IV

Independent Sample t-test Assumed unequal variances.


t Stat = 1.28851
P Value = 0.20226
Step V
Level of significance: 0.05
P value > Level of significance.
The P value is 0.20226, which is greater than the 5% level of significance.

So, the statistical decision is to fail to reject the null hypothesis at the 5% level of
significance.

Conclusion
Hence, at the 5% level of significance, there is not sufficient evidence to conclude that the mean
moisture content in A differs from the mean moisture content in B.
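
For two independent samples with unequal variances, the usual test statistic is Welch's t, with degrees of freedom from the Welch-Satterthwaite approximation. A minimal sketch follows. The samples are hypothetical (the raw shingle data is not reproduced in this report) and the function name `welch_t` is an illustrative assumption.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df


# Hypothetical samples, for illustration only.
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 4, 6, 8, 10]
t_stat, df = welch_t(sample_a, sample_b)
print(round(t_stat, 4), round(df, 3))
```

The two-sided P value is then read from the t distribution with `df` degrees of freedom and compared with α.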
