
NAME- ASHISH PAVAN KUMAR K

EMAIL- ashishrdg@gmail.com

Mob- +918885765109

Contents

Problem 1

1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel spent the most?
Which Region and which Channel spent the least?

1.2 There are 6 different varieties of items that are considered. Describe and comment/explain all the varieties
across Region and Channel? Provide a detailed justification for your answer.

1.3 On the basis of a descriptive measure of variability, which item shows the most inconsistent behavior? Which
items show the least inconsistent behaviour?

1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with the help of
detailed comments.

1.5 On the basis of your analysis, what are your recommendations for the business? How can your analysis help
the business to solve its problem? Answer from the
business perspective

Problem 2

2.1. For this data, construct the following contingency tables (Keep Gender as row variable)

2.1.1. Gender and Major

2.1.2. Gender and Grad Intention

2.1.3. Gender and Employment

2.1.4. Gender and Computer

2.2. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:

2.2.1. What is the probability that a randomly selected CMSU student will be male?

2.2.2. What is the probability that a randomly selected CMSU student will be female?

2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:

2.3.1. Find the conditional probability of different majors among the male students in CMSU.
2.3.2 Find the conditional probability of different majors among the female students of CMSU.

2.4. Assume that the sample is a representative of the population of CMSU. Based on the data, answer the
following question:

2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.

2.4.2 Find the probability that a randomly selected student is a female and does NOT have a laptop. 

2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time employment?

2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in
international business or management.

2.6.  Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The Undecided students
are not considered now and the table is a 2x2 table. Do you think the graduate intention and being female are
independent events?

2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages.
Answer the following questions based on the data

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?

2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the conditional
probability that a randomly selected female earns 50 or more.

2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages. For each of them comment whether they follow a normal distribution. Write a note summarizing your
conclusions.

Problem 3

3.1 Do you think there is evidence that the mean moisture contents in both types of shingles are within the
permissible limits? State your conclusions clearly showing all steps.

3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis and conduct the
test of the hypothesis. What assumption do you need to check before the test for equality of means is performed?
Please reflect on all that you have learnt while working on this project. This step is critical in cementing all your
concepts and closing the loop.
SOLUTIONS

Problem 1

A wholesale distributor operating in different regions of Portugal has information on annual spending of several
items in their stores across different regions and channels. The data (Wholesale Customer.csv) consists of 440 large
retailers’ annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and
across different sales channel (Hotel, Retail).

1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel spent the
most? Which Region and which Channel spent the least?

Descriptive Statistics of our Data:

                  count          mean           std   min      25%     50%       75%       max
Fresh             440.0  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0


Descriptive Statistics of our Data including Channel & Region:

                  count unique    top freq          mean           std   min      25%     50%       75%       max
Channel             440      2  Hotel  298           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Region              440      3  Other  316           NaN           NaN   NaN      NaN     NaN       NaN       NaN
Fresh             440.0    NaN    NaN  NaN  12000.297727  12647.328865   3.0  3127.75  8504.0  16933.75  112151.0
Milk              440.0    NaN    NaN  NaN   5796.265909   7380.377175  55.0  1533.00  3627.0   7190.25   73498.0
Grocery           440.0    NaN    NaN  NaN   7951.277273   9503.162829   3.0  2153.00  4755.5  10655.75   92780.0
Frozen            440.0    NaN    NaN  NaN   3071.931818   4854.673333  25.0   742.25  1526.0   3554.25   60869.0
Detergents_Paper  440.0    NaN    NaN  NaN   2881.493182   4767.854448   3.0   256.75   816.5   3922.00   40827.0
Delicatessen      440.0    NaN    NaN  NaN   1524.870455   2820.105937   3.0   408.25   965.5   1820.25   47943.0

Region
Lisbon 2386813
Oporto 1555088
Other 10677599
Name: Spending, dtype: int64

Channel
Hotel 7999569
Retail 6619931
Name: Spending, dtype: int64
By Region, the highest total spend is in Other and the lowest is in Oporto.
By Channel, the highest total spend is in Hotel and the lowest is in Retail.

Region Channel
Lisbon Hotel 1538342
Retail 848471
Oporto Hotel 719150
Retail 835938
Other Hotel 5742077
Retail 4935522
Name: Spending, dtype: int64

Across Region/Channel combinations, the highest spend is in Other/Hotel and the lowest spend is in Oporto/Hotel.
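
The totals above can be cross-checked with a short sketch. It uses only the six Region/Channel totals printed in this report (the raw CSV is not reproduced here), and the helper name `totals_by` is a hypothetical one chosen for illustration.

```python
from collections import defaultdict

# Total annual spending per (Region, Channel), copied from the output above.
spend = {
    ("Lisbon", "Hotel"): 1538342,
    ("Lisbon", "Retail"): 848471,
    ("Oporto", "Hotel"): 719150,
    ("Oporto", "Retail"): 835938,
    ("Other", "Hotel"): 5742077,
    ("Other", "Retail"): 4935522,
}

def totals_by(position):
    """Sum spending over one key position: 0 = Region, 1 = Channel."""
    out = defaultdict(int)
    for key, value in spend.items():
        out[key[position]] += value
    return dict(out)

region_totals = totals_by(0)
channel_totals = totals_by(1)

# Highest and lowest spenders at each level of aggregation.
top_region = max(region_totals, key=region_totals.get)
low_region = min(region_totals, key=region_totals.get)
top_channel = max(channel_totals, key=channel_totals.get)
low_channel = min(channel_totals, key=channel_totals.get)
top_pair = max(spend, key=spend.get)
low_pair = min(spend, key=spend.get)

print(region_totals)
print(channel_totals)
print(top_region, low_region, top_channel, low_channel, top_pair, low_pair)
```

The six cell totals add up to the Region and Channel totals reported earlier, so the three outputs are consistent with one another.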

Channel     Fresh      Milk   Grocery   Frozen  Detergents_Paper  Delicatessen
Hotel    13475.56   3451.72   3962.14  3748.25            790.56       1415.96
Retail    8904.32  10716.50  16322.85  1652.61           7269.51       1753.44

In the Hotel channel, average spending is highest on Fresh items and lowest on Detergents_Paper.

In the Retail channel, average spending is highest on Grocery items and lowest on Frozen items.

Region     Fresh     Milk  Grocery   Frozen  Detergents_Paper  Delicatessen
Lisbon  11101.73  5486.42  7403.08  3000.34           2651.12       1354.90
Oporto   9887.68  5088.17  9218.60  4045.36           3687.47       1159.70
Other   12533.47  5977.09  7896.36  2944.59           2817.75       1620.60

In the Lisbon region, average spending is highest on Fresh items and lowest on Delicatessen items.

In the Oporto region, average spending is highest on Fresh items and lowest on Delicatessen items.

In the Other region, average spending is highest on Fresh items and lowest on Delicatessen items.

1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a detailed
justification for your answer.

Bar plots are used to see how each item behaves across Channel and Region. They show that spending
differs by both Channel and Region.
Fresh items are sold more in the Hotel channel, consistent with the channel averages in 1.1 (13475.56 for Hotel against 8904.32 for Retail).
Milk is sold more in the Retail channel, in the Other region.
Groceries are sold more in the Retail channel, in Oporto.
Frozen items are sold more in the Hotel channel, in Oporto.
Detergents_Paper items are sold more in the Retail channel, in Oporto.

1.3 On the basis of the descriptive measure of variability, which item shows the most
inconsistent behavior? Which items shows the least inconsistent behaviour?
Fresh 12647.33
Milk 7380.38
Grocery 9503.16
Frozen 4854.67
Detergents_Paper 4767.85
Delicatessen 2820.11
dtype: float64

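Fresh has the highest standard deviation, so in absolute terms it is the most inconsistent item.

Delicatessen has the smallest standard deviation, so in absolute terms it is the most consistent item.

The standard deviation depends on the scale of each item, and the items have very different means, so the
coefficient of variation (CV = standard deviation / mean) gives a fairer comparison:

CV Fresh = 1.0527

CV Milk = 1.2718

CV Grocery = 1.1938

CV Frozen = 1.5785

CV Detergents_Paper = 1.6527

CV Delicatessen = 1.8473

Fresh has the lowest coefficient of variation, so relative to its mean it is the most consistent item.

Delicatessen has the highest coefficient of variation, so relative to its mean it is the most inconsistent item.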
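
As a hedged check, the coefficients of variation can be recomputed from the means and standard deviations in 1.1. The reported values appear to match the population standard deviation (divisor n), which is also the default of `scipy.stats.variation`; that reading is an assumption, so the sketch below converts the sample standard deviation (divisor n-1) before dividing by the mean. The function name is hypothetical.

```python
import math

N = 440  # number of retailers in the data set

# (mean, sample standard deviation) per item, from the descriptive statistics in 1.1
stats = {
    "Fresh": (12000.297727, 12647.328865),
    "Milk": (5796.265909, 7380.377175),
    "Grocery": (7951.277273, 9503.162829),
    "Frozen": (3071.931818, 4854.673333),
    "Detergents_Paper": (2881.493182, 4767.854448),
    "Delicatessen": (1524.870455, 2820.105937),
}

def coefficient_of_variation(mean, sample_std, n=N):
    # Convert the sample std (divisor n-1) to the population std (divisor n),
    # then express it as a fraction of the mean.
    population_std = sample_std * math.sqrt((n - 1) / n)
    return population_std / mean

cv = {item: coefficient_of_variation(m, s) for item, (m, s) in stats.items()}

# Print items from most consistent (lowest CV) to least consistent (highest CV).
for item, value in sorted(cv.items(), key=lambda kv: kv[1]):
    print(f"{item:<17} {value:.4f}")
```

Using the sample standard deviation directly would give slightly larger values, but the ranking of the items would not change.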

1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique
with the help of detailed comments.

A boxplot of each item is used to look for outliers.

The black points in the boxplot are the outliers.


Yes, there are outliers in every item across the product range (Fresh, Milk,
Grocery, Frozen, Detergents_Paper and Delicatessen).
Outliers are detected here but not necessarily removed; whether to remove them depends on the situation.
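
The boxplot finding can be cross-checked numerically. The sketch below assumes the common 1.5 x IQR whisker rule (the plotting settings used for the figure are not stated in this report) and uses only the minimum, quartiles and maximum from 1.1.

```python
# (min, Q1, Q3, max) per item, from the descriptive statistics in 1.1
summary = {
    "Fresh": (3.0, 3127.75, 16933.75, 112151.0),
    "Milk": (55.0, 1533.0, 7190.25, 73498.0),
    "Grocery": (3.0, 2153.0, 10655.75, 92780.0),
    "Frozen": (25.0, 742.25, 3554.25, 60869.0),
    "Detergents_Paper": (3.0, 256.75, 3922.0, 40827.0),
    "Delicatessen": (3.0, 408.25, 1820.25, 47943.0),
}

def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

flags = {}
for item, (lowest, q1, q3, highest) in summary.items():
    lower, upper = iqr_fences(q1, q3)
    flags[item] = {
        "upper": upper,
        # If the extreme value lies beyond a fence, at least one outlier exists.
        "has_low_outliers": lowest < lower,
        "has_high_outliers": highest > upper,
    }
    print(f"{item:<17} upper fence {upper:>10.2f}  max {highest:>9.1f}")
```

Under this rule every item has at least one value above its upper fence and none below its lower fence, which agrees with the right-skewed spending seen in the summary statistics.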

1.5 On the basis of your analysis, what are your recommendations for the business? How
can your analysis help the business to solve its problem? Answer from the business
perspective.

The analysis shows that spending is inconsistent across the different items (measured by the
coefficient of variation), and this variability should be reduced. Spending also differs between the Hotel
and Retail channels and across regions; the business should aim to bring these closer together. Finally,
the business needs to focus on items other than Fresh and Grocery as well.

Problem 2 
2.1. For this data, construct the following contingency tables (Keep Gender as row
variable)
2.1.1. Gender and Major
Major   Accounting  CIS  Economics/Finance  International Business  Management  Other  Retailing/Marketing  Undecided  All
Gender
Female           3    3                  7                       4           4      3                    9          0   33
Male             4    1                  4                       2           6      4                    5          3   29
All              7    4                 11                       6          10      7                   14          3   62

The sample contains 33 female students and 29 male students. All 33 female students have chosen a
major. Of the male students, 26 have chosen a major and the remaining 3 are undecided.

2.1.2. Gender and Grad Intention

Grad Intention  No  Undecided  Yes  All
Gender
Female           9         13   11   33
Male             3          9   17   29
All             12         22   28   62
From the data, 11 female students and 17 male students intend to graduate. 13 female students and
9 male students are undecided. 9 female students and 3 male students do not intend to graduate.

2.1.3. Gender and Employment

Employment  Full-Time  Part-Time  Unemployed  All
Gender
Female              3         24           6   33
Male                7         19           3   29
All                10         43           9   62

Among the female students, 3 are employed full-time, 24 are employed part-time and 6 are unemployed.

Among the male students, 7 are employed full-time, 19 are employed part-time and 3 are unemployed.

2.1.4. Gender and Computer

Computer  Desktop  Laptop  Tablet  All
Gender
Female          2      29       2   33
Male            3      26       0   29
All             5      55       2   62

From the data, 2 female students have a desktop, 29 have a laptop and 2 have a tablet.

3 male students have a desktop and 26 have a laptop; no male student has a tablet.
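
For readers who want to see how such a table is assembled, here is a minimal standard-library sketch. The survey rows are not reproduced in this report, so the records are rebuilt from the Gender and Employment counts above; `crosstab` is a hypothetical helper, and in pandas `pandas.crosstab(..., margins=True)` is the usual way to get the same layout.

```python
from collections import Counter

# Records rebuilt from the Gender x Employment counts reported above.
counts = {
    ("Female", "Full-Time"): 3,
    ("Female", "Part-Time"): 24,
    ("Female", "Unemployed"): 6,
    ("Male", "Full-Time"): 7,
    ("Male", "Part-Time"): 19,
    ("Male", "Unemployed"): 3,
}
records = [pair for pair, n in counts.items() for _ in range(n)]

def crosstab(rows):
    """Count (row, column) pairs and add 'All' margins for rows and columns."""
    cell = Counter(rows)
    row_labels = sorted({r for r, _ in cell})
    col_labels = sorted({c for _, c in cell})
    table = {r: {c: cell[(r, c)] for c in col_labels} for r in row_labels}
    for r in row_labels:
        table[r]["All"] = sum(table[r].values())  # row total
    table["All"] = {
        c: sum(table[r][c] for r in row_labels) for c in col_labels + ["All"]
    }  # column totals, including the grand total
    return table

table = crosstab(records)
for label, row in table.items():
    print(label, row)
```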

2.2. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
The probability that a randomly selected CMSU student is male: 29/62 = 0.46774193548387094

2.2.2. What is the probability that a randomly selected CMSU student will be female?

The probability that a randomly selected CMSU student is female: 33/62 = 0.532258064516129

2.3. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:
2.3.1. Find the conditional probability of different majors among the male students in
CMSU.
Among male students:
Probability of majoring in Accounting: 0.13793103448275862
Probability of majoring in CIS: 0.034482758620689655
Probability of majoring in Economics/Finance: 0.13793103448275862
Probability of majoring in International Business: 0.06896551724137931
Probability of majoring in Management: 0.20689655172413793
Probability of majoring in Other subjects: 0.13793103448275862
Probability of majoring in Retailing/Marketing: 0.1724137931034483
Probability of being Undecided: 0.10344827586206896

2.3.2 Find the conditional probability of different majors among the female students of
CMSU.

Among female students:

Probability of majoring in Accounting: 0.09090909090909091
Probability of majoring in CIS: 0.09090909090909091
Probability of majoring in Economics/Finance: 0.21212121212121213
Probability of majoring in International Business: 0.12121212121212122
Probability of majoring in Management: 0.12121212121212122
Probability of majoring in Other subjects: 0.09090909090909091
Probability of majoring in Retailing/Marketing: 0.2727272727272727
Probability of being Undecided: 0.0
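
These conditional probabilities are each row of the Gender and Major table divided by its row total. A minimal sketch, using the counts from 2.1.1 (the function name is hypothetical):

```python
majors = [
    "Accounting", "CIS", "Economics/Finance", "International Business",
    "Management", "Other", "Retailing/Marketing", "Undecided",
]

# Counts per major, in the order above, from the Gender and Major table in 2.1.1.
counts = {
    "Female": [3, 3, 7, 4, 4, 3, 9, 0],
    "Male": [4, 1, 4, 2, 6, 4, 5, 3],
}

def conditional_major_probs(gender):
    """P(Major | Gender): each count divided by that gender's row total."""
    total = sum(counts[gender])
    return {major: c / total for major, c in zip(majors, counts[gender])}

male = conditional_major_probs("Male")
female = conditional_major_probs("Female")

for major in majors:
    print(f"{major:<24} male {male[major]:.4f}  female {female[major]:.4f}")
```

Each set of conditional probabilities sums to 1, which is a quick sanity check on the table.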
2.4. Assume that the sample is a representative of the population of CMSU. Based
on the data, answer the following question:
2.4.1. Find the probability that a randomly chosen student is a male and intends to
graduate.

P(Grad Intention ∩ Male) = P(Grad Intention | Male) x P(Male) = (17/29) x (29/62) = 0.27419354838709675

2.4.2 Find the probability that a randomly selected student is a female and does NOT
have a laptop. 

P(No Laptop ∩ Female) = P(No Laptop | Female) x P(Female) = (4/33) x (33/62) = 0.06451612903225806
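
Both answers follow from the multiplication rule P(A ∩ B) = P(A | B) x P(B). The sketch below uses exact fractions and the counts from the tables in 2.1; "no laptop" is taken to mean desktop or tablet, which is an interpretation of the question.

```python
from fractions import Fraction

TOTAL = 62                # students in the sample
males, females = 29, 33   # row totals from the contingency tables

male_grad_yes = 17        # males with Grad Intention = Yes (2.1.2)
female_no_laptop = 2 + 2  # females with a desktop or a tablet (2.1.4)

# Multiplication rule: P(A and B) = P(A | B) * P(B)
p_male_and_grad = Fraction(male_grad_yes, males) * Fraction(males, TOTAL)
p_female_no_laptop = Fraction(female_no_laptop, females) * Fraction(females, TOTAL)

print(p_male_and_grad, float(p_male_and_grad))
print(p_female_no_laptop, float(p_female_no_laptop))
```

The row totals cancel, so each joint probability is simply the cell count divided by 62.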

2.5. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time
employment?

P(Male) = 0.46774193548387094

P(Full-time) = 0.16129032258064516
P(Male ∩ Full-time) = 0.11290322580645161
P(Male ∪ Full-time) = P(Male) + P(Full-time) - P(Male ∩ Full-time) = 0.5161290322580645

2.5.2 Find the conditional probability that given a female student is randomly chosen, she
is majoring in international business or management.

Since International Business and Management are mutually exclusive events, their conditional probabilities add:

P(International Business or Management | Female) = 4/33 + 4/33 = 0.24242424242424243
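
The two results in 2.5 use the addition rule: the general form subtracts the overlap, and for mutually exclusive events the overlap is zero. A small sketch with exact fractions, using the counts from 2.1:

```python
from fractions import Fraction

TOTAL = 62

# 2.5.1: male or full-time employment (general addition rule)
p_male = Fraction(29, TOTAL)
p_full_time = Fraction(10, TOTAL)
p_male_and_full_time = Fraction(7, TOTAL)
p_male_or_full_time = p_male + p_full_time - p_male_and_full_time

# 2.5.2: a student has one major, so the two events cannot overlap
females = 33
p_ib_given_female = Fraction(4, females)
p_mgmt_given_female = Fraction(4, females)
p_ib_or_mgmt_given_female = p_ib_given_female + p_mgmt_given_female

print(float(p_male_or_full_time))
print(float(p_ib_or_mgmt_given_female))
```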

2.6 Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No).
The Undecided students are not considered now and the table is a 2x2 table. Do you
think graduate intention and being female are independent events?
Grad Intention  No  Yes  All
Gender
Female           9   11   20
Male             3   17   20
All             12   28   40

Probability of female, P(F) = 20/40 = 0.5

Probability of graduate intention, P(Yes) = 28/40 = 0.7

P(F ∩ Yes) = 11/40 = 0.275

P(F) x P(Yes) = 0.35

Since P(F ∩ Yes) = 0.275 is not equal to P(F) x P(Yes) = 0.35,

we can conclude that graduate intention and being female are not independent events.
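
The independence check above compares the joint probability with the product of the marginals. A minimal sketch of that comparison on the 2x2 table:

```python
import math

# 2x2 table of Gender and Grad Intention (Undecided students excluded)
table = {
    "Female": {"No": 9, "Yes": 11},
    "Male": {"No": 3, "Yes": 17},
}

n = sum(sum(row.values()) for row in table.values())

p_female = sum(table["Female"].values()) / n
p_yes = sum(row["Yes"] for row in table.values()) / n
p_female_and_yes = table["Female"]["Yes"] / n

# Independent events satisfy P(F and Yes) == P(F) * P(Yes).
independent = math.isclose(p_female_and_yes, p_female * p_yes)

print(p_female, p_yes, p_female_and_yes, p_female * p_yes, independent)
```

This is the definitional check used in the text; it compares sample proportions directly and does not account for sampling variability.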

2.7 Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending and Text Messages. Answer the following questions based on the data
2.7.1 If a student is chosen randomly, what is the probability that his/her GPA is less than
3?
A total of 17 students have a GPA less than 3.

The probability that a randomly chosen student has a GPA less than 3 is 17/62 = 0.2742 (rounded to four decimal places).

2.7.2 Find conditional probability that a randomly selected male earns 50 or more. Find
conditional probability that a randomly selected female earns 50 or more.

The probability that a randomly selected male earns 50 or more is P(Salary >= 50 | Male) = 0.4827586206896552

2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. For each of them comment whether they follow a normal
distribution. Write a note summarizing your conclusions.
From the graph, GPA appears to follow a normal distribution, and there are no outliers.

From the graph, Salary does not follow a normal distribution. The distribution is right-skewed
and there are outliers.

From the graph, Spending does not follow a normal distribution. The distribution is right-skewed
and there are outliers.

From the plots, Text Messages does not follow a normal distribution. The distribution is right-skewed
and there are outliers.

PROBLEM 3
3.1 Do you think there is evidence that the mean moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly showing all
steps.

One-sample t test for shingles A

H0 : μ(A) >= 0.35
Ha : μ(A) < 0.35
α = 0.05

t statistic: -1.4735046253382782 p value: 0.07477633144907513

Since the p-value > 0.05, do not reject H0. There is not enough evidence to conclude that the mean
moisture content for Sample A shingles is less than 0.35 pounds per 100 square feet. p-value =
0.0748. If the population mean moisture content is in fact no less than 0.35 pounds per 100
square feet, the probability of observing a sample of 36 shingles that will result in a sample
mean moisture content of 0.3167 pounds per 100 square feet or less is 0.0748.

One-sample t test for shingles B

H0 : μ(B) >= 0.35
Ha : μ(B) < 0.35
α = 0.05

t statistic: -3.1003313069986995 p value: 0.0020904774003191826

Since the p-value < 0.05, reject H0. There is enough evidence to conclude that the mean moisture
content for Sample B shingles is less than 0.35 pounds per 100 square feet. p-value =
0.0021. If the population mean moisture content is in fact no less than 0.35 pounds per 100
square feet, the probability of observing a sample of 31 shingles that will result in a sample
mean moisture content of 0.2735 pounds per 100 square feet or less is 0.0021.
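
As a hedged cross-check, the two p-values can be reproduced from the reported t statistics alone. The sketch assumes a lower-tailed test with n - 1 degrees of freedom (35 for A, 30 for B) and integrates the Student's t density numerically with Simpson's rule; the function names are hypothetical, and a library routine such as `scipy.stats.t.cdf` would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def lower_tail_p(t_stat, df, steps=2000):
    """P(T <= t_stat), using Simpson's rule on [-|t|, 0] plus symmetry."""
    a = -abs(t_stat)
    h = (0.0 - a) / steps  # steps must be even for Simpson's rule
    total = t_pdf(a, df) + t_pdf(0.0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    central = total * h / 3  # area between -|t| and 0
    tail = 0.5 - central     # area below -|t|
    return tail if t_stat <= 0 else 1 - tail

ALPHA = 0.05
p_a = lower_tail_p(-1.4735046253382782, df=36 - 1)  # shingles A, n = 36
p_b = lower_tail_p(-3.1003313069986995, df=31 - 1)  # shingles B, n = 31

for name, p in (("A", p_a), ("B", p_b)):
    decision = "reject H0" if p < ALPHA else "do not reject H0"
    print(f"Shingles {name}: one-sided p = {p:.4f} -> {decision}")
```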
3.2 Do you think that the population mean for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to
check before the test for equality of means is performed?

H0 : μ(A) = μ(B)

Ha : μ(A) != μ(B)

α = 0.05
t statistic = 1.29 and p-value = 0.202

As the p-value > α, do not reject H0. There is not enough evidence to conclude that the population
means for shingles A and B are different.

Test assumptions: when running a two-sample t-test, the basic assumptions are that the
distributions of the two populations are normal, and that the variances of the two
distributions are the same. If those assumptions are not likely to be met, another testing
procedure could be used.
