
SMDM PROJECT-2022

STATISTICAL METHODS FOR DECISION MAKING


CONTENTS

1 - Wholesale Customer Data Analysis
1.1 Problem 1.1
1.2 Problem 1.2
1.3 Problem 1.3
1.4 Problem 1.4
1.5 Problem 1.5

2 - Clear Mountain State University (CMSU) Survey
2.1 Problem 2.1
2.2 Problem 2.2
2.3 Problem 2.3
2.4 Problem 2.4
2.5 Problem 2.5
2.6 Problem 2.6
2.7 Problem 2.7
2.8 Problem 2.8

3 - Hypothesis Testing for Quality of Shingles
3.1 Problem 3.1
3.2 Problem 3.2

Problem 1

Wholesale Customers Analysis

Problem Statement:

A wholesale distributor operating in different regions of Portugal has information on the annual
spending on several items in its stores across different regions and channels. The data consist
of 440 large retailers’ annual spending on 6 varieties of products in 3 regions
(Lisbon, Oporto, Other) and across 2 sales channels (Hotel, Retail).

The data describe the amount spent in each region on major food-market items in both the
Hotel and Retail channels.

The first five rows of the dataset are shown below.

The dataset has no null values; this was checked using the isnull() function in Python.

The dataset consists of 440 rows and 9 columns.
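
The null check and the row/column count described above can be sketched as follows. This is a minimal illustration on a hypothetical three-row stand-in for the wholesale data (the values are invented); on the real file the same two calls would report 440 rows and 9 columns.

```python
import pandas as pd

# Hypothetical stand-in for the wholesale data (values invented for illustration).
df = pd.DataFrame({
    "Channel": ["Retail", "Hotel", "Hotel"],
    "Region": ["Other", "Lisbon", "Oporto"],
    "Fresh": [12000, 7000, 6300],
    "Milk": [9600, 9800, 8800],
})

# Count missing values per column; all zeros means there are no nulls.
null_counts = df.isnull().sum()
print(null_counts)

# shape is a (rows, columns) tuple.
print(df.shape)
```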

1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel
spent the most? Which Region and which Channel spent the least?

➢ Descriptive statistics (count, mean, standard deviation, min, max, and the 25th, 50th (median)
and 75th percentiles) can be calculated in Python using the describe() function; the mode is not
part of the describe() output for numeric columns and is obtained separately with mode(). The
output also shows how widely the data are dispersed.

From the above table, we can conclude the following:

1. The dataset consists of 2 categorical variables (Channel and Region) and 7 numerical variables
(buyer/sender, Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen).

2. The mean, median and standard deviation of each variable are given above.

3. From the min and max we can calculate the range of each variable, and from the 25th and 75th
percentiles the interquartile range (IQR), which in turn is used to flag outliers.

➢ To find the highest-spending Region and Channel, we create a new column
‘Total_spending’, which is the sum of all 6 items.

The output below was generated using the groupby() function in Python.

From the Python calculation, we can conclude that:

• “Others” is the region that has spent the most, and “Hotel” is the channel that
has spent the most
• “Oporto” is the region that has spent the least, and “Retail” is the channel that
has spent the least
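
A minimal sketch of the Total_spending and groupby() step, assuming the six item columns are named as in the report. The four rows are invented for illustration, so the totals are not the report's figures.

```python
import pandas as pd

items = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# Invented rows; only the column names follow the report.
df = pd.DataFrame({
    "Channel": ["Retail", "Hotel", "Hotel", "Retail"],
    "Region": ["Other", "Lisbon", "Other", "Oporto"],
    "Fresh": [100, 200, 300, 50],
    "Milk": [10, 20, 30, 5],
    "Grocery": [10, 20, 30, 5],
    "Frozen": [10, 20, 30, 5],
    "Detergents_Paper": [10, 20, 30, 5],
    "Delicatessen": [10, 20, 30, 5],
})

# Row-wise sum of the six item columns.
df["Total_spending"] = df[items].sum(axis=1)

# Total spend per Region and per Channel, largest first.
by_region = df.groupby("Region")["Total_spending"].sum().sort_values(ascending=False)
by_channel = df.groupby("Channel")["Total_spending"].sum().sort_values(ascending=False)
print(by_region)
print(by_channel)
```

The first and last index entries of each sorted result give the highest- and lowest-spending group.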

1.2. There are 6 different varieties of items that are considered. Describe and comment/explain all
the varieties across Region and Channel? Provide a detailed justification for your answer .

➢ From the above graphs, Milk, Grocery and Detergents Paper have their highest spend in the
Retail channel across all regions. Frozen and Fresh spending, on the other hand, is higher in
the Hotel channel across all regions.

• FRESH spending is highest in the Hotel channel and in the OTHERS region
• MILK spending is highest in the Retail channel in the OTHERS region, closely followed by
LISBON
• GROCERY spending is highest in the Retail channel in the LISBON region, where there are
fewer Retail shops than Hotels
• FROZEN spending is highest in the Hotel channel in the OPORTO region
• DETERGENTS PAPER spending is highest in the Retail channel in the OPORTO region
• DELICATESSEN spending is highest in the Retail channel in the LISBON region

Contingency Table

• Lisbon is the capital of Portugal. The Lisbon region shows higher spend on Grocery in the
Retail channel than in the Hotel channel, although the difference in spending is relatively
small. By contrast, there are more Hotels than Retail outlets in Lisbon. The number of Hotels is
high in Lisbon because of the larger number of tourists. Milk and Grocery items appear to be in
higher demand in LISBON.

• Oporto is the second-largest city of Portugal and is often regarded as its commercial capital.
Frozen and Detergents Paper items appear to be in higher demand in OPORTO.

1.3 On the basis of the descriptive measure of variability, which item shows the most inconsistent
behaviour? Which items shows the least inconsistent behaviour?

➢ The consistency of a variable is usually judged by its standard deviation: the higher the
standard deviation, the higher the inconsistency.

➢ From the descriptive statistics table below, Fresh has a high standard deviation, but this
alone does not show that Fresh is inconsistent, because the range of values in Fresh is also
very large (Min = 3, Max = 112151, Range = 112148). To compare consistency across variables on
different scales, we calculate the coefficient of variation (COV):

COV = Standard Deviation (σ) / Mean (μ)

We have used Python to calculate the COV of the variables Fresh, Milk, Grocery, Detergents_Paper and
Delicatessen.

➢ The COV of Delicatessen is the highest, so this item shows the most inconsistent behaviour;
the COV of Fresh is the lowest, so this item shows the most consistent behaviour.

FRESH = most consistent behaviour

DELICATESSEN = least consistent behaviour
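
The coefficient-of-variation comparison can be sketched with the standard library alone. The two lists are invented to show the idea: a series with a larger spread relative to its mean gets a larger COV. The sketch assumes the sample standard deviation (statistics.stdev), which is also what pandas uses by default.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Invented spending figures, for illustration only.
steady = [100, 102, 98, 101, 99]
erratic = [5, 300, 20, 150, 25]

cv_steady = coefficient_of_variation(steady)
cv_erratic = coefficient_of_variation(erratic)
print(f"steady COV  = {cv_steady:.3f}")
print(f"erratic COV = {cv_erratic:.3f}")
```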

1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with the
help of detailed comments.

We can use Box plot to find out the outliers in a dataset

From the above box plot it is concluded that the dataset consists of outliers.

The outliers are present in all the food items/variables


1. Fresh
2. Milk
3. Grocery
4. Frozen
5. Detergents Paper
6. Delicatessen
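
A box plot marks as outliers the points lying more than 1.5 × IQR beyond the quartiles. The same rule can be applied numerically; the sketch below uses an invented series, not the report's data.

```python
import pandas as pd

# Invented values with one very low and one very high point.
s = pd.Series([3, 10, 12, 13, 15, 16, 18, 200])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

# Box-plot whisker limits: 1.5 * IQR beyond each quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())
```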

1.5 On the basis of your analysis, what are your recommendations for the business? How can your
analysis help the business to solve its problem? Answer from the business perspective.

As seen earlier, let’s explore the given dataset.

The dataset covers three regions (Lisbon, Oporto and Others):

• Lisbon is the capital of Portugal, one of its biggest cities, and the busiest because of
tourism
• Oporto is the second-largest city and also the commercial capital of Portugal
• Other cities of Portugal are grouped as ‘Others’

Below is the map of Portugal with its territories.

(Reference: https://ontheworldmap.com/portugal)

Correlation between the different food items

➢ The highest correlations are between Grocery and Detergents Paper, Milk and
Grocery, and Milk and Detergents Paper.

From the above heatmap, we can conclude the following:

➢ Customers who spend more on Grocery items also tend to spend more on Detergents Paper and Milk
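
The numbers behind such a heatmap come from a correlation matrix. A minimal sketch with pandas, on invented values chosen so that Grocery and Detergents_Paper move together while Frozen does not:

```python
import pandas as pd

# Invented spending values, for illustration only.
df = pd.DataFrame({
    "Grocery": [1, 2, 3, 4, 5],
    "Detergents_Paper": [2, 4, 6, 8, 11],
    "Frozen": [5, 1, 4, 2, 3],
})

# Pairwise Pearson correlation between the columns.
corr = df.corr()
print(corr.round(2))
```

A plotting library can then colour this matrix; the plotting step is omitted here.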

The following recommendations to the wholesale distributor are based on the above analysis:

1. There are more Hotels than Retail outlets in Lisbon, Grocery spending there is on the
high side, and Hotels in Lisbon also show higher demand for Fresh items. The distributor
can stock more Grocery items according to the inventory needs of the locations where the
Hotels are, in line with the tourism seasons. The distributor can further increase its
revenue by distributing Grocery items together with Milk and Detergents Paper (based on
the correlations in the heatmap above).

2. Oporto has more Retail outlets than the Lisbon region, despite Lisbon being the
capital. There is more demand for Retail stores in Oporto, and most of the population in
Oporto buys food items from Retail stores. Grocery and Detergents Paper are in the
highest demand in Oporto.

3. Some products show very inconsistent spending. The distributor should focus on those
products to bring their spending into a more consistent range.

4. The distributor should also focus on products other than Fresh and Grocery, since mean
spending on Fresh and Grocery is comparatively higher than on the other products. More
marketing campaigns and discounts for those other products could increase the spending
on them.

PROBLEM 2

The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates (stored in
the Survey data set).

The first five rows of the dataset

2.1. For this data, construct the following contingency tables (Keep Gender as row variable)

The contingency tables are constructed in Python using the crosstab() function.

2.1.1. Gender and Major

2.1.2. Gender and Grad Intention

2.1.3. Gender and Employment

2.1.4. Gender and Computer
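
A minimal sketch of building one such table with pandas crosstab(), keeping Gender as the row variable. The five rows are invented; the column names Gender and Major are assumed to match the survey file.

```python
import pandas as pd

# Invented survey rows, for illustration only.
survey = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female", "Male"],
    "Major": ["Other", "CIS", "Other", "CIS", "Other"],
})

# Rows = Gender, columns = Major; margins=True adds the "All" totals.
table = pd.crosstab(survey["Gender"], survey["Major"], margins=True)
print(table)
```

The same call with a different second column gives the other three tables.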

2.2 Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:

2.2.1. What is the probability that a randomly selected CMSU student will be male?

Probability that a randomly selected CMSU student will be male is 46.8%

2.2.2. What is the probability that a randomly selected CMSU student will be female?

Probability that a randomly selected CMSU student will be Female is 53.2%

2.3. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:

2.3.1. Find the conditional probability of different majors among the male students in
CMSU.
The conditional probabilities of the different majors among the male students in CMSU are as follows.

➢ From the contingency table of Gender and Major, we get the total number of males and the
number of males opting for each major.

Below is the output from Python:

Probability of Males opting for Accounting is 13.79%
Probability of Males opting for CIS is 3.45%
Probability of Males opting for Economics/Finance is 13.79%
Probability of Males opting for International Business is 6.90%
Probability of Males opting for Management is 20.69%
Probability of Males opting for Other is 13.79%
Probability of Males opting for Retailing/Marketing is 17.24%
Probability of Males opting for Undecided is 10.34%

From this output we can say that Management is the most preferred major among male students,
and CIS is the least preferred.

2.3.2 Find the conditional probability of different majors among the female students of
CMSU.
The conditional probabilities of the different majors among the female students in CMSU are as follows.

➢ From the contingency table of Gender and Major, we get the total number of females and the
number of females opting for each major.

Below is the output from Python:

Probability of Females opting for Accounting is 9.09%
Probability of Females opting for CIS is 9.09%
Probability of Females opting for Economics/Finance is 21.21%
Probability of Females opting for International Business is 12.12%
Probability of Females opting for Management is 12.12%
Probability of Females opting for Other is 9.09%
Probability of Females opting for Retailing/Marketing is 27.27%
Probability of Females opting for Undecided is 0.00%

From this output we can say that Retailing/Marketing is the most preferred major among female
students. No female student is Undecided, and among the chosen majors Accounting, CIS and Other
are the least preferred (9.09% each).
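
The conditional probabilities are each major's count divided by the number of students of that gender. The sketch below does this for the male students. The counts are back-computed from the percentages reported above (for example, 13.79% of 29 is 4); they are an inference, not values copied from the report's table.

```python
# Male counts per major, inferred from the reported percentages (assumption).
male_counts = {
    "Accounting": 4,
    "CIS": 1,
    "Economics/Finance": 4,
    "International Business": 2,
    "Management": 6,
    "Other": 4,
    "Retailing/Marketing": 5,
    "Undecided": 3,
}

total_males = sum(male_counts.values())

# P(Major | Male) = count of males in that major / total males.
cond_prob = {major: n / total_males for major, n in male_counts.items()}
for major, p in cond_prob.items():
    print(f"P({major} | Male) = {p:.2%}")

most_preferred = max(cond_prob, key=cond_prob.get)
print("Most preferred:", most_preferred)
```

With pandas, crosstab(..., normalize="index") produces the same row-wise proportions directly from the raw data.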

2.4. Assume that the sample is a representative of the population of CMSU. Based on the
data, answer the following question:

2.4.1. Find the probability That a randomly chosen student is a male and intends to
graduate.

From the contingency table of Gender and Graduate Intention:

Probability that a randomly chosen student is a Male = 29/62

Probability that a Male intends to graduate = 17/29

As this is an AND statement, we use the multiplication rule P(A and B) = P(A) * P(B|A).

Probability that a randomly chosen student is a male and intends to graduate

= Probability that a randomly chosen student is a Male * Probability that a Male intends to
graduate

= (29/62) * (17/29) = 17/62

Hence, from the calculations done in Python, we conclude that:

➢ The probability that a randomly chosen student is a male and intends to graduate is 27.42%
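
The multiplication rule can be checked with exact fractions, using the two figures quoted in this answer (29/62 and 17/29):

```python
from fractions import Fraction

p_male = Fraction(29, 62)             # P(Male)
p_grad_given_male = Fraction(17, 29)  # P(intends to graduate | Male)

# Multiplication rule: P(A and B) = P(A) * P(B | A).
p_male_and_grad = p_male * p_grad_given_male
print(p_male_and_grad, f"= {float(p_male_and_grad):.2%}")
```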

2.4.2 Find the probability that a randomly selected student is a female and does NOT
have a laptop.
From the contingency table constructed in 2.1.4, we can conclude the following:

Probability that a randomly chosen student is a Female = 33/62

Probability that a Female has no laptop = 1 - (29/33) = 4/33

Probability that a randomly selected student is a female and does NOT have a laptop

= Probability that a randomly chosen student is a Female * Probability that a Female has no laptop

= (33/62) * (4/33) = 4/62

Hence, from the calculations done in Python, we conclude that:

➢ The probability that a randomly selected student is a female and does NOT have a laptop is 6.45%

2.5. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:

2.5.1. Find the probability that a randomly chosen student is a male or has full-time
employment?

Below Data is constructed using contingency table in 2.1.3

Probability of a Student being Male = 29/62

Probability of a student having Full Time Employment = 10/62

Probability of a Male student in Full Time Employment = 7/62

Probability that a randomly chosen student is a male or has full-time employment

= Probability of a Student being Male + Probability of a student having Full Time Employment - Probability
of a Male having Full Time Employment

Hence from the calculations done in Python we conclude that:

➢ The probability that a randomly chosen student is a male or has full-time employment is 51.61%
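
The addition rule used here, P(A or B) = P(A) + P(B) - P(A and B), can be verified with the three fractions stated above:

```python
from fractions import Fraction

p_male = Fraction(29, 62)             # P(Male)
p_full_time = Fraction(10, 62)        # P(Full-time employment)
p_male_and_full_time = Fraction(7, 62)  # P(Male and full-time)

# Subtract the overlap so males in full-time employment are not counted twice.
p_male_or_full_time = p_male + p_full_time - p_male_and_full_time
print(p_male_or_full_time, f"= {float(p_male_or_full_time):.2%}")
```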

2.5.2. Find the conditional probability that given a female student is randomly chosen,
she is majoring in international business or management.

The data below are taken from the contingency table in 2.1.1.

Probability of international business given Female = 4/33

Probability of management given Female = 4/33

Since international business and management are mutually exclusive (each student is counted
under only one major), the two probabilities can be added.

Probability of international business or management given Female

= Probability of international business given Female + Probability of management given Female

= 4/33 + 4/33 = 8/33

Hence, from the calculations done in Python, we conclude that:

➢ The conditional probability that given a female student is randomly chosen, she is majoring in
international business or management is 24.24%.

2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No).
The Undecided students are not considered now and the table is a 2x2 table. Do you
think the graduate intention and being female are independent events?

The 2x2 contingency table is constructed using the Python crosstab() function.

Within the 2x2 table (Undecided students excluded, 40 students in total):

P(A) = Probability of a student being female = 20/40 = 0.5
P(B) = Probability of Grad Intention being Yes = 28/40 = 0.7
P(B|A) = Probability of Grad Intention being Yes given Female = 11/20 = 0.55

For independent events, P(A ∩ B) = P(A) * P(B).
In general, P(A ∩ B) = P(B|A) * P(A).
Assuming the events are independent:
P(A) * P(B) = P(B|A) * P(A)
P(B) = P(B|A)
i.e. P(Grad Intent as Yes) = P(Grad Intent as Yes | Female students)
From the contingency table above:
28/40 compared with 11/20
0.7 != 0.55

As the equality does not hold, the events are not independent.

➢ Hence, graduate intention and being female are not independent events.
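
The independence check compares P(Yes) with P(Yes | Female). The sketch below uses the 2x2 counts implied by the figures in this report (11 of 20 females and 17 of 20 males answering Yes); the two "No" cells are derived from those totals and are therefore an assumption.

```python
from fractions import Fraction

# 2x2 counts with Undecided excluded; the "No" cells are derived (assumption).
counts = {
    ("Female", "Yes"): 11, ("Female", "No"): 9,
    ("Male", "Yes"): 17, ("Male", "No"): 3,
}

total = sum(counts.values())
n_yes = counts[("Female", "Yes")] + counts[("Male", "Yes")]
n_female = counts[("Female", "Yes")] + counts[("Female", "No")]

p_yes = Fraction(n_yes, total)
p_yes_given_female = Fraction(counts[("Female", "Yes")], n_female)

# Independence would require P(Yes) == P(Yes | Female).
independent = p_yes == p_yes_given_female
print(float(p_yes), float(p_yes_given_female), independent)
```

This compares sample proportions exactly, as the report does; a formal test such as chi-square would be needed to account for sampling variation.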

2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages.

Answer the following questions based on the data

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less
than 3?

From the calculations done in Python, we can conclude the following:

Total number of students whose GPA is less than 3 = 17

Total number of students = 62

Probability that his/her GPA is less than 3 = 17/62

➢ If a student is chosen randomly, the probability that his/her GPA is less than 3 is 27.4%

2.7.2. Find the conditional probability that a randomly selected male earns 50 or more.
Find the conditional probability that a randomly selected female earns 50 or more.
Using Python, we get the following output:

Number of males who earn 50 or more = 14 (out of 29 males)

Number of females who earn 50 or more = 18 (out of 33 females)

➢ Conditional probability that a randomly selected male earns 50 or more is 14/29 = 48.3%

➢ Conditional probability that a randomly selected female earns 50 or more is 18/33 = 54.5%

2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. For each of them comment whether they follow a normal
distribution.

Using distplot() in Python, we have examined whether the continuous variables in the dataset
follow a normal distribution.

We have also calculated the skewness of the variables.

From the above graphs and skewness calculations, we can conclude the following:

1. ‘GPA’ variable is almost normally distributed and has slight negative skewness i.e. the
data is slightly left skewed.
2. ‘Salary’ is almost normally distributed, and it is slightly right skewed based on the
skewness value
3. ‘Spending’ variable is highly right skewed
4. ‘Text messages’ variable is highly right skewed
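
Skewness can be computed with the pandas skew() method. A small sketch on invented series: a symmetric one gives a skewness of zero, while one with a long right tail gives a clearly positive value.

```python
import pandas as pd

# Invented values, for illustration only.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])

skew_symmetric = symmetric.skew()
skew_right = right_tailed.skew()
print(f"symmetric:    {skew_symmetric:.3f}")
print(f"right-tailed: {skew_right:.3f}")
```

Negative values indicate a left skew and positive values a right skew.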

For normally distributed data, Mean = Median = Mode.

          GPA    Salary    Spending    Text Messages
Mean      3.1    48.5      482         246.2
Median    3.1    40        500         200
Mode      3.1    50        500         300

*Mean, median and mode values are calculated using Python functions

➢ From the above table, the mean, median and mode of ‘GPA’ coincide (3.1), which is
consistent with ‘GPA’ being approximately normally distributed.

2.8.2 Write a note summarizing your conclusions

The dataset contains 62 survey responses from both male and female students. More female
students than male students answered the survey. Management is the most popular major among
male students, and Retailing/Marketing is the most popular among female students. From the
data, a greater number of students are looking for a part-time job.

PROBLEM 3

An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that they
have purchased a product lacking in quality if they find moisture and wet shingles inside the
packaging. In some cases, excessive moisture can cause the granules attached to the shingles
for texture and coloring purposes to fall off the shingles resulting in appearance problems. To
monitor the amount of moisture present, the company conducts moisture tests. A shingle is
weighed and then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet are calculated. The company
would like to show that the mean moisture content is less than 0.35 pounds per 100 square feet.

The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles
and 31 for B shingles.

The first five rows of the Shingles dataset are shown below

3.1 Do you think there is evidence that means moisture contents in both types of shingles are
within the permissible limits? State your conclusions clearly showing all steps.

For the A shingles, the null and alternative hypotheses to test whether the population mean moisture
content is less than 0.35 pounds per 100 square feet are:

H0 : mean moisture content >= 0.35

HA : mean moisture content < 0.35

Level of significance: 0.05

We have a sample and we do not know the population standard deviation.

The sample is not a large sample, so we use the t distribution and the t test statistic.

Since we are testing only sample A, we use the one-sample t-test.

Hence, from the calculations done in Python, we conclude that:

We do not have enough evidence to reject the null hypothesis, since p-value (0.07) > level of significance (0.05)

For the B shingles, the null and alternative hypotheses to test whether the population mean moisture
content is less than 0.35 pounds per 100 square feet are:

H0 : mean moisture content >= 0.35

HA : mean moisture content < 0.35

Level of significance: 0.05

We have a sample and we do not know the population standard deviation.

The sample is not a large sample, so we use the t distribution and the t test statistic.

Since we are testing only sample B, we use the one-sample t-test. Because ttest_1samp in Python
returns a two-sided p-value by default, the p-value is divided by 2 for our one-sided test. This
is valid when the test statistic is negative, i.e. the sample mean lies below 0.35.

Hence, from the calculations done in Python, we conclude that: p = 0.002

We have evidence to reject the null hypothesis, since p-value (0.002) < level of significance (0.05)

Conclusion:

For Sample A - There is not sufficient evidence that the mean moisture content in shingles is less
than 0.35
For Sample B - There is sufficient evidence that the mean moisture content in shingles is less than
0.35
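
A minimal sketch of the one-sided, one-sample t-test described above, using scipy.stats.ttest_1samp and halving the two-sided p-value. The eight readings are invented for illustration and are not the shingles data.

```python
from scipy import stats

# Invented moisture readings (pounds per 100 square feet), illustration only.
sample = [0.30, 0.28, 0.33, 0.31, 0.29, 0.32, 0.27, 0.34]
limit = 0.35
alpha = 0.05

# ttest_1samp returns a two-sided p-value by default.
t_stat, p_two_sided = stats.ttest_1samp(sample, popmean=limit)

# For HA: mean < limit, halve the p-value only when t is negative;
# otherwise the one-sided p-value is 1 - p/2.
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2

reject_h0 = bool(p_one_sided < alpha)
print(f"t = {t_stat:.3f}, one-sided p = {p_one_sided:.5f}, reject H0: {reject_h0}")
```

The explicit sign check guards against halving a p-value when the sample mean is actually above the limit.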

3.2 Do you think that the population mean for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to
check before the test for equality of means is performed?

Assumptions for the Hypothesis Testing:

The two-sample t-test assumes that the two samples are independent and that each population is
approximately normally distributed; the pooled version of the test also assumes equal
population variances.

H0 : Population mean of A shingles = Population mean of B shingles

HA : Population mean of A shingles ≠ Population mean of B shingles

Level of significance: 0.05

We have two samples, A and B, and we do not know the population standard deviations.

The samples are not large, so we use the t distribution and the tSTAT test
statistic.

Since we are comparing two samples, we use the two-sample t-test.
Hence, from the calculations done in Python, we conclude that:
Two-sample t-test p-value = 0.2017496571835306
We do not have enough evidence to reject the null hypothesis, since p-value (0.2) > level of
significance (0.05)

We therefore cannot conclude that the population means of A shingles and B shingles differ.
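
A minimal sketch of a two-sample t-test preceded by a check of the equal-variance assumption. Levene's test is used here as one common choice for that check; the report does not state which check, if any, was run. Both samples are invented for illustration.

```python
from scipy import stats

# Invented moisture readings, for illustration only (not the report's data).
a = [0.31, 0.33, 0.29, 0.35, 0.30, 0.32, 0.34, 0.28]
b = [0.27, 0.30, 0.26, 0.31, 0.29, 0.25, 0.28, 0.32]

# Levene's test: H0 is that the two groups have equal variances.
_, p_levene = stats.levene(a, b)
equal_var = bool(p_levene > 0.05)

# Pooled t-test if the variances look equal, Welch's t-test otherwise.
t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
print(f"Levene p = {p_levene:.3f}, t = {t_stat:.3f}, two-sided p = {p_value:.4f}")
```

The p-value from ttest_ind is two-sided, which matches the "not equal" alternative, so it is compared with 0.05 directly.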

END OF THE REPORT
