
SMDM PROJECT

REPORT
7/8/2022

DSBA

Dileep Motukuri

Contents
INTRODUCTION
PROBLEM 1
QUESTION 1.1.1
QUESTION 1.1.2
QUESTION 1.1.3
QUESTION 1.2
QUESTION 1.3
QUESTION 1.4
QUESTION 1.5
PROBLEM 2
QUESTION 2.1
2.1.1. Gender and Major
2.1.2. Gender and Grad Intention
2.1.3. Gender and Employment
2.1.4. Gender and Computer
QUESTION 2.2
2.2.1 What is the probability that a randomly selected CMSU student will be male?
2.2.2 What is the probability that a randomly selected CMSU student will be female?
QUESTION 2.3
2.3.1 Find the conditional probability of different majors among the male students in CMSU.
2.3.2 Find the conditional probability of different majors among the female students of CMSU.
QUESTION 2.4
2.4.1 Find the probability that a randomly chosen student is a male and intends to graduate.
2.4.2 Find the probability that a randomly selected student is a female and does NOT have a laptop.
QUESTION 2.5
2.5.1 Find the probability that a randomly chosen student is either a male or has a full-time employment?
2.5.2 Find the conditional probability that given a female student is randomly chosen, she is majoring in international business or management.
QUESTION 2.6
QUESTION 2.7
2.7.1 If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
2.7.2 Find conditional probability that a randomly selected male earns 50 or more. Find conditional probability that a randomly selected female earns 50 or more.
QUESTION 2.8
PROBLEM 3
QUESTION 3.1
QUESTION 3.2

LIST OF TABLES & FIGURES


Table 1 Sample Dataset
Table 1.1 Description of data
Figure 1.1 Channel & Region wise Spending
Table-1.2 Description of 6 Varieties
Table-1.2.1 Fresh Items
Table-1.2.2 Milk Items
Table-1.2.3 Grocery Items
Table-1.2.4 Frozen Items
Table-1.2.5 Detergents Paper Items
Table-1.2.6 Delicatessen Items
Table-1.3 Description with Measure of variability
Figure 1.4 Outliers Analysis
Table 1.5 Product Analysis
Table 2 Survey Sample Data
Table 2.1.1 Contingency Table Gender vs Major
Table 2.1.2 Contingency Table Gender vs Graduation Intention
Table 2.1.3 Contingency Table Gender vs Employment
Table 2.1.4 Contingency Table Gender vs Computer
Table 2.3 Cross table for Gender vs Major
Table 2.4.1 Table for Gender vs Grad Intention
Table 2.4.2 Table for Gender vs Computer
Table 2.5.1 Table for Gender vs Employment
Table 2.5.2 Table for Gender vs Major
Table 2.6 Table for Gender vs Grad Intention (Yes/No)

INTRODUCTION

This business report applies statistical techniques in Python to the three problems below. It
explains the approach used and presents the insights, inferences, and all code outputs such as
graphs and tables.

Problem 1 – Wholesale Customers Analysis in different regions of Portugal.

Problem 2 – Clear Mountain State University (CMSU) undergraduate student survey data Analysis.

Problem 3 – ABC asphalt shingles Moisture test data analysis & Hypothesis testing.

PROBLEM 1

Wholesale Customers Analysis


A wholesale distributor operating in different regions of Portugal has information on the annual
spending on several items in its stores across different regions and channels. The data consists of
440 large retailers’ annual spending on 6 different varieties of products in 3 different regions
(Lisbon, Oporto, Other) and across 2 sales channels (Hotel, Retail).

Table 1 Sample Dataset

QUESTION 1.1.1

1.1.1 Use methods of descriptive statistics to summarize data.

• The data covers three regions, of which ‘Other’ is the most frequent, with 316 of the 440
retailers.
• The data covers two channels, of which ‘Hotel’ is the most frequent, with 298 of the 440
retailers.
• Based on the means of the 6 varieties in the table below, Fresh, Grocery and Milk have higher
average spending than Frozen, Detergents Paper and Delicatessen.

Table 1.1 Description of data

QUESTION 1.1.2

1.1.2 Which Region and which Channel spent the most?

• Based on the data below and Figure 1.1, the ‘Other’ region and the ‘Hotel’ channel spent the most.

Figure 1.1 Channel & Region wise Spending

QUESTION 1.1.3

1.1.3 Which Region and which Channel spent the least?

• Based on the data above and Figure 1.1, the ‘Oporto’ region and the ‘Retail’ channel spent the least.
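
The comparison behind Questions 1.1.2 and 1.1.3 is a group-and-sum over Region and Channel. The following is a minimal sketch of that idea using only the Python standard library; the rows and the helper name `totals_by` are hypothetical and are not taken from the report's dataset.

```python
from collections import defaultdict

# Hypothetical rows: (region, channel, total annual spending). Not the report's data.
rows = [
    ("Other", "Hotel", 12000),
    ("Other", "Retail", 9000),
    ("Lisbon", "Hotel", 7000),
    ("Oporto", "Retail", 4000),
    ("Other", "Hotel", 15000),
]

def totals_by(rows, index):
    """Sum the spending column, grouped by the field at position `index`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[index]] += row[2]
    return dict(totals)

region_totals = totals_by(rows, 0)
channel_totals = totals_by(rows, 1)

# The group with the largest (smallest) total spent the most (least).
top_region = max(region_totals, key=region_totals.get)
low_region = min(region_totals, key=region_totals.get)
top_channel = max(channel_totals, key=channel_totals.get)
low_channel = min(channel_totals, key=channel_totals.get)

print(region_totals, channel_totals)
print(top_region, low_region, top_channel, low_channel)
```

Note that a ranking by total spending can differ from a ranking by mean spending when the groups have very different sizes, as they do here (316 of 440 retailers are in ‘Other’).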

QUESTION 1.2

1.2 There are 6 different varieties of items that are considered. Describe and comment/explain
all the varieties across Region and Channel? Provide a detailed justification for your
answer.

• The data covers three regions, of which ‘Other’ is the most frequent, with 316 of the 440
retailers.
• The data covers two channels, of which ‘Hotel’ is the most frequent, with 298 of the 440
retailers.

Table-1.2 Description of 6 Varieties

• For Fresh items, the mean spending is 12,000 and the maximum is 112,151. Comparing Fresh items
by channel and region, the ‘Other’ region has the highest frequency and mean in both channels,
while the ‘Lisbon’ region has the lowest frequency and mean in the Retail channel. The maximum
spending comes from the ‘Hotel’ channel in the ‘Other’ region. Overall, the Hotel channel spends
more than Retail on Fresh items.

Table-1.2.1 Fresh Items

• For Milk items, the mean spending is 5,796 and the maximum is 73,498. Comparing Milk items by
channel and region, the ‘Other’ region has the highest frequency in both channels, while the
‘Oporto’ region has the lowest mean in both channels. The maximum spending comes from the
‘Retail’ channel in the ‘Other’ region. Overall, the Retail channel spends more than Hotel on
Milk items.

Table-1.2.2 Milk Items



• For Grocery items, the mean spending is 7,951 and the maximum is 92,780. Comparing Grocery
items by channel and region, the ‘Other’ region has the highest frequency in both channels,
while the ‘Oporto’ region has the lowest mean in the ‘Hotel’ channel. The maximum spending
comes from the ‘Retail’ channel in the ‘Other’ region. Overall, the Retail channel spends more
than Hotel on Grocery items.

Table-1.2.3 Grocery Items

• For Frozen items, the mean spending is 3,072 and the maximum is 60,869. Comparing Frozen items
by channel and region, the ‘Other’ region has the highest frequency in both channels, while the
‘Oporto’ region has the highest mean in the ‘Hotel’ channel. The maximum spending comes from
the ‘Hotel’ channel in the ‘Oporto’ region. Overall, the Hotel channel spends more than Retail
on Frozen items.

Table-1.2.4 Frozen Items

• For Detergents Paper items, the mean spending is 2,881 and the maximum is 40,827. Comparing
Detergents Paper items by channel and region, the ‘Other’ region has the highest frequency in
both channels, while the ‘Oporto’ region has the highest mean in the ‘Retail’ channel. The
maximum spending comes from the ‘Retail’ channel in the ‘Other’ region. Overall, the Retail
channel spends more than Hotel on Detergents Paper items.

Table-1.2.5 Detergents Paper Items

• For Delicatessen items, the mean spending is 1,524 and the maximum is 47,943. Comparing
Delicatessen items by channel and region, the ‘Other’ region has the highest frequency in both
channels, while the ‘Oporto’ region has the lowest mean in both channels. The maximum spending
comes from the ‘Hotel’ channel in the ‘Other’ region. Overall, the Retail channel spends more
than Hotel on Delicatessen items.

Table-1.2.6 Delicatessen Items

QUESTION 1.3

1.3 Based on a descriptive measure of variability, which item shows the most inconsistent
behaviour? Which items show the least inconsistent behaviour?

The coefficient of variation (relative standard deviation) is a measure of dispersion, that is, a
quantity used to gauge the extent of variability in the data. It is computed as the standard
deviation divided by the mean, so it expresses the spread of the data points around the mean
relative to the size of the mean.

• The table below shows that Delicatessen has the most inconsistent behaviour, with a coefficient
of variation of 1.85, and Fresh has the least inconsistent behaviour, with a coefficient of
variation of 1.05.
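
As an illustration of the measure used above, the coefficient of variation can be computed as in the sketch below. It uses only the Python standard library and two hypothetical lists of spending values (not the report's data); the function name is also an assumption made for this example.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical spending figures, chosen only to contrast low and high variability.
steady = [100, 110, 90, 105, 95]
erratic = [10, 300, 20, 150, 20]

cv_steady = coefficient_of_variation(steady)
cv_erratic = coefficient_of_variation(erratic)

# A larger coefficient of variation means less consistent spending.
print(round(cv_steady, 3), round(cv_erratic, 3))
```

Because the measure is scaled by the mean, it allows items with very different average spending to be compared on the same footing.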

Table-1.3 Description with Measure of variability.

QUESTION 1.4

1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with
the help of detailed comments.

• The box plot in the figure below shows that all items have outliers, most notably Fresh and
Grocery.
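
Box plots conventionally flag points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule with the Python standard library; the data and the function name `iqr_outliers` are hypothetical, and the report's own plotting code may compute quartiles slightly differently.

```python
import statistics

def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical spending values with one extreme point.
spending = [3, 5, 6, 7, 8, 9, 10, 12, 60]
outliers = iqr_outliers(spending)
print(outliers)
```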

Figure 1.4 Outliers Analysis

QUESTION 1.5

1.5 Based on your analysis, what are your recommendations for the business? How can your
analysis help the business to solve its problem? Answer from the business perspective

• Based on the analysis, the Hotel channel spends more, so the distributor can concentrate on the
Hotel channel to further improve sales.
• The analysis shows that the ‘Other’ region has the highest spending, so the distributor should
strengthen its distribution channel there and focus more on the ‘Other’ region to improve sales.
• The analysis shows that spending on Fresh, Grocery and Milk items is highest compared with the
other three items, while Delicatessen and Detergents Paper have the lowest spending. The
distributor should focus on sales of Fresh, Grocery and Milk products to improve the business.

Table 1.5 Product Analysis

PROBLEM 2

The Student News Service at Clear Mountain State University (CMSU) has decided to gather data
about the undergraduate students that attend CMSU. CMSU creates and distributes a survey of 14
questions and receives responses from 62 undergraduates (stored in the Survey data set).

Table 2 Survey Sample Data

QUESTION 2.1

2.1 For this data, construct the following contingency tables (Keep Gender as row variable).

2.1.1. Gender and Major

The contingency table for Gender and Major is shown below.

Table 2.1.1 Contingency Table Gender vs Major

2.1.2. Gender and Grad Intention

The contingency table for Gender and Graduation Intention is shown below.

Table 2.1.2 Contingency Table Gender vs Graduation Intention.



2.1.3. Gender and Employment


The contingency table for Gender and Employment is shown below.

Table 2.1.3 Contingency Table Gender vs Employment.

2.1.4. Gender and Computer


The contingency table for Gender and Computer is shown below.

Table 2.1.4 Contingency Table Gender vs Computer.

QUESTION 2.2

Assume that the sample is representative of the population of CMSU. Based on the data,
answer the following question:

2.2.1 What is the probability that a randomly selected CMSU student will be male?

Probability of selecting a male student = (Total no. of male students) / (Total no. of students)
= 29/62 = 0.4677

2.2.2 What is the probability that a randomly selected CMSU student will be female?

Probability of selecting a female student = (Total no. of female students) / (Total no. of students)
= 33/62 = 0.5323

QUESTION 2.3

2.3.1 Find the conditional probability of different majors among the male students in CMSU.

The conditional probability of Accounting major among the male students is 0.138.
The conditional probability of CIS major among the male students is 0.034.
The conditional probability of Economics/Finance major among the male students is 0.138.
The conditional probability of International Business major among the male students is 0.069.
The conditional probability of Management major among the male students is 0.207.
The conditional probability of Other major among the male students is 0.138.
The conditional probability of Retailing/Marketing major among the male students is 0.172.
The conditional probability of Undecided major among the male students is 0.103.

Table 2.3 Cross table for Gender vs Major.

2.3.2 Find the conditional probability of different majors among the female students of
CMSU.

The conditional probability of Accounting major among the female students is 0.091.
The conditional probability of CIS major among the female students is 0.091.
The conditional probability of Economics/Finance major among the female students is 0.212.
The conditional probability of International Business major among the female students is 0.121.
The conditional probability of Management major among the female students is 0.121.
The conditional probability of Other major among the female students is 0.091.
The conditional probability of Retailing/Marketing major among the female students is 0.273.
The conditional probability of Undecided major among the female students is 0.
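
The conditional probabilities in 2.3.1 and 2.3.2 are each a cell count divided by the row total for that gender. The sketch below shows the calculation with the Python standard library. The cell counts are not printed in this report; they are reconstructed here from the reported probabilities and the group sizes (29 males, 33 females), so they should be treated as an assumption.

```python
from fractions import Fraction

majors = ["Accounting", "CIS", "Economics/Finance", "International Business",
          "Management", "Other", "Retailing/Marketing", "Undecided"]

# Assumed counts, reconstructed from the reported probabilities (not read from the data file).
counts = {
    "Male":   [4, 1, 4, 2, 6, 4, 5, 3],
    "Female": [3, 3, 7, 4, 4, 3, 9, 0],
}

def conditional_by_gender(gender):
    """P(major | gender) = count(major and gender) / count(gender)."""
    total = sum(counts[gender])
    return {m: Fraction(c, total) for m, c in zip(majors, counts[gender])}

male = conditional_by_gender("Male")
female = conditional_by_gender("Female")

for m in majors:
    print(f"{m}: male {float(male[m]):.3f}, female {float(female[m]):.3f}")
```

Within each gender the conditional probabilities must add up to 1, which is a quick check on the table.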

QUESTION 2.4

2.4.1 Find the probability that a randomly chosen student is a male and intends to graduate.

The probability that the selected student is male and intends to graduate is (17/29) * (29/62) = 17/62 = 0.2742

Table 2.4.1 Table for Gender vs Grad Intention.



2.4.2 Find the probability that a randomly selected student is a female and does NOT have a
laptop.

The probability that the selected student is female and does not have a laptop is ((33-29)/33) * (33/62) = 4/62 = 0.0645
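
Both answers in Question 2.4 use the multiplication rule, P(A and B) = P(B | A) * P(A). A minimal sketch with exact fractions follows; the variable names are assumptions made for this example, and the counts are the ones quoted in the text.

```python
from fractions import Fraction

total = 62
males, females = 29, 33          # group sizes from Question 2.2
male_grad_yes = 17               # males who intend to graduate
female_no_laptop = 33 - 29       # females without a laptop, as written in the text

# Multiplication rule: P(A and B) = P(B | A) * P(A)
p_male_and_grad = Fraction(male_grad_yes, males) * Fraction(males, total)
p_female_no_laptop = Fraction(female_no_laptop, females) * Fraction(females, total)

print(p_male_and_grad, round(float(p_male_and_grad), 4))
print(p_female_no_laptop, round(float(p_female_no_laptop), 4))
```

The group size cancels in each product, so each joint probability is simply the cell count divided by 62.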

Table 2.4.2 Table for Gender vs Computer.

QUESTION 2.5

2.5.1 Find the probability that a randomly chosen student is either a male or has a full-time
employment?

The probability that a randomly selected student is either male or has full-time employment = (29/62)
+ (10/62) - (7/62) = 0.5161
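
This is the addition rule, P(A or B) = P(A) + P(B) - P(A and B), where the joint term is subtracted so that male students in full-time employment are not counted twice. A short sketch with the counts quoted above (the variable names are assumptions):

```python
from fractions import Fraction

total = 62
p_male = Fraction(29, total)
p_full_time = Fraction(10, total)
p_male_and_full_time = Fraction(7, total)

# Addition rule: subtract the overlap once.
p_male_or_full_time = p_male + p_full_time - p_male_and_full_time

print(p_male_or_full_time, round(float(p_male_or_full_time), 4))
```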

Table 2.5.1 Table for Gender vs Employment.

2.5.2 Find the conditional probability that given a female student is randomly chosen, she is
majoring in international business or management.

The conditional probability that a student majors in international business or management, given
that the student is female = (4/33) + (4/33) = 8/33 = 0.2424

Table 2.5.2 Table for Gender vs Major.

QUESTION 2.6

2.6 Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The
Undecided students are not considered now, and the table is a 2x2 table. Do you think graduate
intention and being female are independent events?

If two events are independent, the product of their individual probabilities must equal the
probability of their joint occurrence:

P(Female ∩ Grad Intention (Yes)) = P(Female) * P(Grad Intention (Yes))

P(Female ∩ Grad Intention (Yes)) = 11/40 = 0.275
P(Female) * P(Grad Intention (Yes)) = (20/40) * (28/40) = 0.350

Because the product of the two probabilities (0.350) is not equal to the joint probability
(0.275), being female and intending to graduate are not independent events.
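
The independence check can be written out as a short sketch using the counts quoted above from the 2x2 table (40 students who answered Yes or No). The variable names are assumptions made for this example.

```python
from fractions import Fraction

n = 40               # students with a Yes/No graduate intention
female = 20
grad_yes = 28
female_and_yes = 11

p_joint = Fraction(female_and_yes, n)
p_product = Fraction(female, n) * Fraction(grad_yes, n)

# Independence requires P(A and B) == P(A) * P(B).
independent = p_joint == p_product

print(float(p_joint), float(p_product), independent)
```

This compares sample proportions exactly; it shows that the two events are not independent in this sample rather than testing the relationship in the wider population.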

Table 2.6 Table for Gender vs Grad Intention (Yes/No).

QUESTION 2.7

2.7 Note that there are four numerical (continuous) variables in the data set, GPA,
Salary, Spending and Text Messages. Answer the following questions based on the data

2.7.1 If a student is chosen randomly, what is the probability that his/her GPA is less than 3?

• The probability that a student has a GPA of less than 3 = 17/62 = 0.274

2.7.2 Find conditional probability that a randomly selected male earns 50 or more. Find
conditional probability that a randomly selected female earns 50 or more.

• Probability that a male student earns 50 or more = ((14/32) * (32/62)) / (29/62) = 14/29 =
0.4828
• Probability that a female student earns 50 or more = ((18/32) * (32/62)) / (33/62) = 18/33 =
0.5455
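
Both results follow from the definition of conditional probability, P(A | B) = P(A and B) / P(B). The sketch below reproduces them with exact fractions, using the counts that appear in the expressions above; the helper name `conditional` is an assumption.

```python
from fractions import Fraction

total = 62
males, females = 29, 33
male_50_plus, female_50_plus = 14, 18   # students earning 50 or more, by gender

def conditional(joint_count, given_count):
    """P(A | B) = P(A and B) / P(B), with both probabilities taken over `total`."""
    return Fraction(joint_count, total) / Fraction(given_count, total)

p_male_50_plus = conditional(male_50_plus, males)
p_female_50_plus = conditional(female_50_plus, females)

print(round(float(p_male_50_plus), 4), round(float(p_female_50_plus), 4))
```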

QUESTION 2.8

2.8.1 Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending and Text Messages. For each of them comment whether they follow a normal
distribution.
2.8.2 Write a note summarizing your conclusions

Based on the data below, GPA and Salary have a skewness below 1 in absolute value, while Spending
and Text Messages both have a skewness of more than 1. On this basis, only GPA can be considered
normally distributed.

Table 2.8.1 Variables Skewness

GPA has a nearly bell-shaped curve. The mean, median and mode are nearly equal, so there is only
slight skewness: the distribution is slightly left-skewed, with a skewness of -0.31. Based on the
graphs below, GPA appears to be normally distributed.

Salary has a roughly bell-shaped curve, but the mean, median and mode are not equal, so the
distribution is skewed: it is right-skewed, with a skewness of 0.53. Based on the graphs below,
Salary is not normally distributed.

Spending has a roughly bell-shaped curve, but the mean, median and mode are not equal, so the
distribution is skewed: it is right-skewed, with a skewness of 1.58. Based on the graphs below,
Spending is not normally distributed.

Text Messages has a roughly bell-shaped curve, but the mean, median and mode are not equal, so the
distribution is skewed: it is right-skewed, with a skewness of 1.29. Based on the graphs below,
Text Messages is not normally distributed.
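
For reference, skewness can be computed from the second and third central moments. The sketch below implements the Fisher-Pearson coefficient with the Python standard library on hypothetical data; it is not the report's code, and library routines that apply a small-sample correction may return slightly different values.

```python
import statistics

def skewness(values):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]

print(skewness(symmetric), round(skewness(right_tailed), 2))
```

A value near 0 indicates a symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.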

PROBLEM 3

An important quality characteristic used by the manufacturers of ABC asphalt shingles is the amount
of moisture the shingles contain when they are packaged. Customers may feel that they have
purchased a product lacking in quality if they find moisture and wet shingles inside the packaging. In
some cases, excessive moisture can cause the granules attached to the shingles for texture and
coloring purposes to fall off the shingles resulting in appearance problems. To monitor the amount of
moisture present, the company conducts moisture tests. A shingle is weighed and then dried. The
shingle is then reweighed, and based on the amount of moisture taken out of the product, the pounds
of moisture per 100 square feet are calculated. The company would like to show that the mean
moisture content is less than 0.35 pounds per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A
shingles and 31 for B shingles.

QUESTION 3.1

3.1 Do you think there is evidence that the mean moisture contents in both types of shingles are
within the permissible limits? State your conclusions clearly showing all steps.

Null and alternative hypotheses

The company would like to show that the mean moisture content is less than 0.35, so that claim is
the alternative hypothesis. This is also the direction of the reported output: both t statistics
are negative and the reported p-values are one-sided (lower-tail) values.

Null hypothesis states that the mean moisture content, 𝜇, is greater than or equal to 0.35. 𝐻0: 𝜇 >= 0.35
Alternative hypothesis states that the mean moisture content, 𝜇, is less than 0.35. 𝐻𝐴: 𝜇 < 0.35

Significance level

Here we select α = 0.05.

Calculate the p-value and test statistic

Hypothesis testing for sample A
One-sample t-test
t statistic: -1.4735046253382782 p value: 0.07477633144907513

Decide whether to reject the null hypothesis

Level of significance: 0.05

The one-sample t-test p-value is 0.07477633144907513, which is greater than the 5% level of
significance. (The complementary upper-tail value, 0.9252236685509249, would apply to the
alternative 𝜇 > 0.35.)

So, the statistical decision is to fail to reject the null hypothesis at the 5% level of significance.

Conclusion – at the 5% significance level, there is not enough evidence to conclude that the mean
moisture content for Sample A shingles is less than 0.35 pounds per 100 square feet.

Calculate the p-value and test statistic

Hypothesis testing for sample B
One-sample t-test
t statistic: -3.1003313069986995 p value: 0.0020904774003191826

Decide whether to reject the null hypothesis

Level of significance: 0.05

The one-sample t-test p-value is 0.0020904774003191826, which is less than the 5% level of
significance.

So, the statistical decision is to reject the null hypothesis at the 5% level of significance.

Conclusion – at the 5% significance level, there is enough evidence to conclude that the mean
moisture content for Sample B shingles is less than 0.35 pounds per 100 square feet.
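
The test statistic reported for each sample is the one-sample t statistic, t = (sample mean - 0.35) / (s / sqrt(n)), with n - 1 degrees of freedom. The sketch below computes it with the Python standard library. The readings are hypothetical and are not the values in the A & B shingles file, and the function name is an assumption.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return (t, degrees of freedom) for a one-sample t-test against mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s = statistics.stdev(sample)              # sample standard deviation (n - 1)
    t = (mean - mu0) / (s / math.sqrt(n))
    return t, n - 1

# Hypothetical moisture readings in pounds per 100 square feet.
readings = [0.30, 0.34, 0.28, 0.36, 0.31, 0.33, 0.29, 0.35]
t_stat, dof = one_sample_t(readings, 0.35)

# A negative t means the sample mean lies below the reference value of 0.35.
print(round(t_stat, 3), dof)
```

The standard library has no t distribution, so the p-value is not computed here. If SciPy is available, `scipy.stats.ttest_1samp(readings, 0.35, alternative="less")` returns the statistic together with the lower-tail p-value.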

QUESTION 3.2

Do you think that the population mean for shingles A and B are equal? Form the hypothesis
and conduct the test of the hypothesis. What assumption do you need to check before the test for
equality of means is performed?

Null and alternative hypotheses

Null hypothesis states that the mean moisture contents are the same, 𝜇𝐴 equals 𝜇𝐵. 𝐻0: 𝜇𝐴 = 𝜇𝐵

Alternative hypothesis states that the mean moisture contents are not the same, 𝜇𝐴 is not equal to
𝜇𝐵. 𝐻𝐴: 𝜇𝐴 ≠ 𝜇𝐵

Decide the significance level

Here we select 𝛼 = 0.05 and the population standard deviation is not known.

Identify the test statistic

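Two samples are provided, and the population standard deviations are not known.

The sample sizes are not equal: we have 36 readings for A and 31 for B.

Neither sample is large (n is close to 30), so we use the t distribution and the 𝑡𝑆𝑇𝐴𝑇 test
statistic for a two-sample unpaired test.

Before the test is performed, the samples should be checked for approximate normality, which the
two-sample t-test assumes, and for equality of variances, which its pooled form also assumes.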

Calculate the p - value and test statistic



t_statistic=1.2896282719661123 and pvalue=0.2017496571835306

Decide whether to reject the null hypothesis

Two-sample t-test p-value = 0.2017496571835306

The p-value is greater than the 5% level of significance, so we do not have enough evidence to
reject the null hypothesis in favour of the alternative hypothesis.

Conclusion - at the 5% significance level, there is no evidence that the mean moisture contents of
shingles A and B differ.
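
For illustration, the pooled (equal-variance) form of the two-sample t statistic can be computed as in the sketch below. The report does not state whether the pooled or the unequal-variance form was used, so the pooled form is an assumption here, as are the function name and the hypothetical readings.

```python
import math
import statistics

def pooled_two_sample_t(a, b):
    """Student's t for two independent samples, assuming equal variances."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    t = (statistics.fmean(a) - statistics.fmean(b)) / se
    return t, na + nb - 2

# Hypothetical moisture readings, not the values in the report's data file.
sample_a = [0.30, 0.34, 0.28, 0.36, 0.32]
sample_b = [0.27, 0.31, 0.25, 0.33, 0.29, 0.29]

t_stat, dof = pooled_two_sample_t(sample_a, sample_b)
print(round(t_stat, 4), dof)
```

With 36 readings for A and 31 for B, the pooled test would have 36 + 31 - 2 = 65 degrees of freedom. Failing to reject the null hypothesis means the data are compatible with equal means; it does not prove that the means are equal.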
