Submitted to:
Concerned Faculty
At
Great Learning
The University of Texas at Austin
Submitted by:
Rachit Mittal
PGPDSBA online July E 2020
Post Graduate Program in Data Science and Business Analytics
Problem Statement 1
Descriptive Data Analysis: -
1. The Channel column has two unique values, of which ‘Hotels’ is the most frequent (298).
The Region column has three unique values, of which ‘Other’ is the most frequent (316).
2. The mean and standard deviation of each item are as follows: -
The above data shows that there are no null values in the data.
Question 1.1: - Which Region and which Channel seems to spend more? Which Region and
which Channel seems to spend less?
Answer 1.1: -
1. Hotels as a Channel spend more (7,999,569) than Retail, which spends 6,619,931.
2. Hotels therefore spend 20.8% more than Retail.
3. Spending in the Other region is far higher than in the two named regions combined:
10,677,599, compared with annual spending of 2,386,813 in Lisbon and 1,555,088 in
Oporto.
4. Spending in the Lisbon region is about 53.5% more than in Oporto. Oporto spends the
least, at 1,555,088.
5. The combination of Channel and Region shows that the maximum spending is by Hotel in
the Other region, at 5,742,077.
6. Hotel in the Oporto region seems to spend the least, at 719,150.
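The percentage comparisons in Answer 1.1 can be reproduced from the totals quoted above. The sketch below is a minimal illustration only; the helper name `pct_more` is an assumption introduced here and is not part of the original analysis.

```python
# Annual spending totals quoted in Answer 1.1.
channel_spend = {"Hotel": 7_999_569, "Retail": 6_619_931}
region_spend = {"Other": 10_677_599, "Lisbon": 2_386_813, "Oporto": 1_555_088}


def pct_more(a, b):
    """Return by what percentage a exceeds b (hypothetical helper)."""
    return (a - b) / b * 100


hotel_vs_retail = pct_more(channel_spend["Hotel"], channel_spend["Retail"])
lisbon_vs_oporto = pct_more(region_spend["Lisbon"], region_spend["Oporto"])

print(f"Hotel vs Retail: {hotel_vs_retail:.1f}% more")
print(f"Lisbon vs Oporto: {lisbon_vs_oporto:.1f}% more")

# Sanity check: the channel and region breakdowns should share one grand total.
totals_match = sum(channel_spend.values()) == sum(region_spend.values())
print("Totals match:", totals_match)
```

The totals check is a cheap way to confirm that the channel-wise and region-wise figures describe the same data.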
Question 1.2: - Six different varieties of items are considered. Do all varieties show
similar behavior across Region and Channel?
Answer 1.2: -
1. The graph shows that the amount spent on ‘Fresh’ items is higher through the Hotel
channel than through the Retail channel. Also, spending on Fresh items in the Hotel
channel exceeds that in the Retail channel in every region.
2. ‘Grocery’ items are the major contributor to annual expense in the Retail channel
in every region. ‘Frozen’ and ‘Delicatessen’ items are the smallest contributors to
annual expense in the Retail channel.
3. The average annual spending on ‘Fresh’ items is higher in the Lisbon region than in the
Oporto region in the Hotel channel, and the opposite trend can be seen in the Retail channel.
4. The average annual spending on ‘Milk’ items is higher in the Retail channel than in the
Hotel channel. Lisbon seems to spend more than the Oporto region on Milk items
through both channels.
5. The average amount spent on ‘Grocery’ items is higher in the Retail channel than in
the Hotel channel. Also, the Lisbon region spends the most on Grocery items via the Retail
channel, whereas in the Hotel channel Oporto’s annual spending is the highest across the
country.
6. The average annual spending on ‘Frozen’ items is higher in the Hotel channel than in the
Retail channel. The Oporto region is the major contributor for Hotel whereas the average
spending is highest in Retail channel.
7. The average annual spending on ‘Detergent Paper’ is much higher in the Retail channel
than in the Hotel channel. The Oporto region spends the most via Retail, whereas
Hotels dominate the spending in the Lisbon region.
8. The average annual spending on ‘Delicatessen’ items is lower in Hotels than in the
Retail channel. The Lisbon region on average spends the most via Retail.
Question 1.3: - On the basis of descriptive measures of variability, which item shows the
most inconsistent behavior? Which item shows the least inconsistent behavior?
Answer 1.3: -
Items               Standard Deviation
Fresh               12647.33
Milk                 7380.38
Grocery              9503.16
Frozen               4854.67
Detergents Paper     4767.85
Delicatessen         2820.11
Fresh items have the highest standard deviation (12,647.33) and therefore show the most
inconsistent behavior, whereas Delicatessen items have the lowest standard deviation
(2,820.11) and show the least inconsistent behavior.
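The ranking can be checked directly from the reported standard deviations. This is a minimal sketch using only the figures in the table above. Standard deviation depends on scale, so a relative measure such as the coefficient of variation could rank items differently; it would need the item means, which are not reproduced here.

```python
# Standard deviations as reported in the table above.
std_dev = {
    "Fresh": 12647.33,
    "Milk": 7380.38,
    "Grocery": 9503.16,
    "Frozen": 4854.67,
    "Detergents Paper": 4767.85,
    "Delicatessen": 2820.11,
}

# Sort item names from most to least variable.
ranked = sorted(std_dev, key=std_dev.get, reverse=True)
most_variable, least_variable = ranked[0], ranked[-1]

print("Most inconsistent:", most_variable)
print("Least inconsistent:", least_variable)
```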
The histograms of the items show the same: ‘Fresh’ and ‘Grocery’ items are the most widely
spread and have the highest standard deviations, whereas ‘Delicatessen’ items, being less
variable, have the lowest standard deviation.
Answer 1.4: -
The box plots show that every item in the data has outliers.
Recommendations
1. The annual spending on Grocery items is directly proportional to the number of Retailers
in the region. So, the Retail channel should spend most on Grocery items. The spending
should be done carefully, as Grocery items are also very inconsistent.
2. With the number of Retail outlets equivalent in the Lisbon and Oporto regions, the top 3
items that Retailers should focus on are Grocery, Milk and Detergent Paper, keeping in mind
the variability in Grocery and Milk items.
3. The Oporto region should focus on developing its Hotel channel, with Fresh and Frozen
items being its topmost priority.
4. The annual spending through both channels by all the regions should be managed
carefully, especially in the case of Fresh items, which have the highest standard
deviation and are the most inconsistent.
Problem Statement 2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a survey
of 14 questions and receives responses from 62 undergraduates.
Question 2.1: - For this data, construct the following contingency tables
2.1.1. Gender and Major
2.1.2. Gender and Grad Intention
2.1.3. Gender and Employment
2.1.4. Gender and Computer
Answer 2.1.1: -
Answer 2.1.2: -
Answer 2.1.3: -
Answer 2.1.4: -
Question 2.2: - Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
Answer 2.2.1: -
P(Male) = 0.468
2.2.2. What is the probability that a randomly selected CMSU student will be female?
Answer 2.2.2: -
P(Female) = 0.532
Question 2.3: - Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.3.1 Find the conditional probability of different majors among the male students in
CMSU
Answer 2.3.1: -
Total number of students = 62
Number of males = 29
5. Number of males who prefer Management = 6
P(Undecided | Male) = 3/29 ≈ 0.103
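The conditional probabilities among male students follow from P(Major | Male) = P(Major ∩ Male) / P(Male). The sketch below uses only the two male counts that are quoted above (Management and Undecided); the remaining majors are not reproduced here, so the dictionary is deliberately incomplete.

```python
from fractions import Fraction

total_students = 62
males = 29
# Only the counts quoted in the text; other majors are not available here.
male_major_counts = {"Management": 6, "Undecided": 3}

p_male = Fraction(males, total_students)  # P(Male)
conditional = {}
for major, count in male_major_counts.items():
    p_joint = Fraction(count, total_students)  # P(Major and Male)
    conditional[major] = p_joint / p_male      # P(Major | Male)

for major, prob in conditional.items():
    print(f"P({major} | Male) = {prob} = {float(prob):.3f}")
```

Using `Fraction` keeps the results exact, so 6/29 and 3/29 appear as ratios before being rounded for display.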
2.3.2 Find the conditional probability of different majors among the female students in
CMSU
Answer 2.3.2: -
P(Economics/Finance | Female) = 0.212
P(Other | Female) = P(Other ∩ Female) / P(Female)
Question 2.4: - Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.4.1 Find the probability that a randomly chosen student is a male and intends to
graduate.
Answer 2.4.1: -
P(Graduate Intention = Yes | Male) = 17/29 ≈ 0.586, so the joint probability asked for is
P(Male ∩ Graduate Intention = Yes) = 17/62 ≈ 0.274
2.4.2 Find the probability that a randomly chosen student is a female and does NOT
have a laptop.
Answer 2.4.2: -
Question 2.5: - Assume that the sample is representative of the population of CMSU. Based
on the data, answer the following question:
2.5.1 Find the probability that a randomly chosen student is either a male or has full-
time employment?
Answer 2.5.1: -
Total number of students = 62
Number of males = 29
Number of full-time employees = 10
Number of males with full-time employment = 7
P(Male or Full-time) = (29 + 10 - 7)/62 = 32/62 ≈ 0.516
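The four counts above are the inputs to the addition rule, P(A or B) = P(A) + P(B) - P(A and B). A minimal sketch of that calculation, using only the counts listed:

```python
from fractions import Fraction

total_students = 62
males = 29
full_time = 10
male_and_full_time = 7

# Addition rule: subtract the overlap so male full-time students are counted once.
p_male_or_full_time = Fraction(males + full_time - male_and_full_time, total_students)

print(p_male_or_full_time, round(float(p_male_or_full_time), 3))
```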
2.5.2 Find the conditional probability that given a female student is randomly chosen,
she is majoring in international business or management.
Answer 2.5.2: -
P(International Business or Management | Female) = 8/33 ≈ 0.242
Question 2.6: - Construct a contingency table of Gender and Intent to Graduate at 2 levels
(Yes/No). The Undecided students are not considered now and the table is a 2x2 table. Do
you think the graduate intention and being female are independent events?
Answer 2.6: -
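Here A and B are the two events in question (graduate intention and being female).
P(A)*P(B) = 924/3844
P(A)*P(B) = 0.24
P(A ∩ B) = P(A|B) *P(B) = 0.177
Two events are independent only if P(A ∩ B) = P(A)*P(B). Since 0.177 is not equal to 0.24,
the events are not independent.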
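The independence criterion compares P(A ∩ B) with P(A)·P(B). The sketch below applies it to the two figures reported in this answer; the tolerance is an assumption introduced here to allow for three-decimal rounding, not a value from the original analysis.

```python
from fractions import Fraction

p_product = Fraction(924, 3844)  # P(A) * P(B), as reported above
p_joint = 0.177                  # P(A|B) * P(B) = P(A and B), as reported above

# Independent events satisfy P(A and B) == P(A) * P(B), up to rounding.
tolerance = 0.005  # assumed allowance for rounding to three decimals
independent = abs(p_joint - float(p_product)) < tolerance

print(round(float(p_product), 2), p_joint, "independent:", independent)
```

With these figures the gap between the two quantities is far larger than any rounding error, so the check reports that the events are not independent.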
Question 2.7: - Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages. Answer the following questions based on the
data
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is
less than 3?
Answer 2.7.1: -
Answer 2.7.2: -
1. Number of males = 29
Number of males ∩ salary greater than 50 = 14
2. Number of females = 33
Number of females ∩ salary greater than 50 = 18
P(Female) = 33/62
Question 2.8: - Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages. For each of them comment whether they follow
a normal distribution. Write a note summarizing your conclusions
Answer 2.8: -
1. The GPA box plot has whiskers of roughly equal length, which suggests a symmetric,
approximately normal distribution, whereas the box plots of Salary, Spending and Text
Messages have whiskers of unequal length and hence appear not to be normally distributed.
2. The Shapiro-Wilk test supports this. The null and alternative hypotheses of the
Shapiro-Wilk test are:
H0: the data are drawn from a normal distribution.
Ha: the data are not drawn from a normal distribution.
Hence, Salary is NOT Normally distributed.
Problem Statement 3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that they
have purchased a product lacking in quality if they find moisture and wet shingles inside the
packaging. In some cases, excessive moisture can cause the granules attached to the shingles
for texture and coloring purposes to fall off the shingles resulting in appearance problems. To
monitor the amount of moisture present, the company conducts moisture tests. A shingle is
weighed and then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet are calculated. The company
would like to show that the mean moisture content is less than 0.35 pound per 100 square feet.
Question 3.1. Do you think there is evidence that mean moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly showing all steps.
Answer 3.1.: -
Let
μA = Mean moisture content in shingle A
μB = Mean moisture content in shingle B
For Shingle A
Step I
Null hypothesis (H0) states that the mean moisture content in shingles A, μA, is greater than
or equal to 0.35.
Alternative hypothesis (Ha) states that the mean moisture content in shingles A, μA, is less
than 0.35.
Since α is not given,
we select α = 0.05.
So, the statistical decision is to fail to reject the null hypothesis at the 5% level of
significance.
Conclusion
Hence, at the 95% confidence level, there is not sufficient evidence to conclude that the mean
moisture content in A shingles is less than 0.35 pound per 100 square feet.
For Shingle B
Step I
Null hypothesis (H0) states that the mean moisture content in shingles B, μB, is greater than
or equal to 0.35.
Alternative hypothesis (Ha) states that the mean moisture content in shingles B, μB, is less
than 0.35.
One sample t test
t statistic: -3.10033130
P value: 0.0020904774
Step V
Level of significance: 0.05
P value < Level of significance
P value is 0.00209 and it is less than 5% level of significance
So, the statistical decision is to reject the null hypothesis at 5% level of significance.
Conclusion
Hence, at the 95% confidence level, there is sufficient evidence to conclude that the mean
moisture content in B shingles is less than 0.35 pound per 100 square feet.
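The one-sample t statistic and the decision rule used above can be sketched with the standard library alone. The function names and the sample readings are assumptions for illustration; the readings are made-up values, not the shingle data, and only the p-value passed to `decide` comes from this report.

```python
import math
import statistics


def one_sample_t_statistic(sample, mu0):
    """t = (sample mean - mu0) / (sample std dev / sqrt(n))."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error


def decide(p_value, alpha=0.05):
    """Compare a p-value with the chosen level of significance."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical moisture readings, for illustration only (not the shingle data).
illustrative_sample = [0.30, 0.32, 0.28, 0.34, 0.31]
print(one_sample_t_statistic(illustrative_sample, mu0=0.35))

# Decision for shingle B, using the p-value reported above.
print(decide(0.0020904774))
```

A negative t statistic points toward a sample mean below 0.35, which is the direction of the alternative hypothesis in this left-tailed test.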
Question 3.2: - Do you think that the population mean for shingles A and B are equal?
Form the hypothesis and conduct the test of the hypothesis. What assumption do you need
to check before the test for equality of means is performed?
Answer 3.2: -
We test whether the mean moisture content is the same in shingles A and B.
Assumptions
1. We assumed that samples are random and both the populations are normally distributed.
2. We assumed unequal variances of the populations.
Step I
Null hypothesis (H0) states that the mean moisture content in shingles is the same, μA
equals μB.
Alternative hypothesis (Ha) states that the mean moisture content in shingles is different,
μA is not equal to μB.
H0: μA - μB = 0 i.e. μA = μB
Ha: μA - μB ≠ 0 i.e. μA ≠ μB
Step II
Since α is not given,
we select α = 0.05.
Step III
We have two samples and we do not know the population standard deviations.
The samples are not large. So, we use the t distribution and the tSTAT test statistic
for a two-sample unpaired test.
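Under the unequal-variance assumption stated above, the unpaired test statistic takes the Welch form, with degrees of freedom from the Welch-Satterthwaite approximation. The sketch below is a standard-library illustration; `welch_t` is a name introduced here, and the readings are hypothetical values, not the shingle data.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for an unpaired test
    that does not assume equal population variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    t_stat = mean_diff / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


# Hypothetical moisture readings, for illustration only (not the shingle data).
shingle_a = [0.31, 0.33, 0.29, 0.35, 0.30]
shingle_b = [0.27, 0.30, 0.26, 0.31, 0.28]
t_stat, df = welch_t(shingle_a, shingle_b)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

The two-sided p-value would then come from the t distribution with `df` degrees of freedom and be compared with α = 0.05.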
Step IV
So, the statistical decision is to fail to reject the null hypothesis at the 5% level of
significance.
Conclusion
Hence, at the 95% confidence level, there is not sufficient evidence to conclude that the mean
moisture content in A differs from the mean moisture content in B.