EMAIL- ashishrdg@gmail.com
Mob- +918885765109
Contents
Problem 1
1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel spent the most?
Which Region and which Channel spent the least?
1.2 There are 6 different varieties of items that are considered. Describe and comment/explain all the varieties
across Region and Channel? Provide a detailed justification for your answer.
1.3 On the basis of a descriptive measure of variability, which item shows the most inconsistent behavior? Which
items show the least inconsistent behaviour?
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with the help of
detailed comments.
1.5 On the basis of your analysis, what are your recommendations for the business? How can your analysis help
the business to solve its problem? Answer from the business perspective.
Problem 2
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.2. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
2.2.2. What is the probability that a randomly selected CMSU student will be female?
2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:
2.3.1. Find the conditional probability of different majors among the male students in CMSU.
2.3.2 Find the conditional probability of different majors among the female students of CMSU.
2.4. Assume that the sample is a representative of the population of CMSU. Based on the data, answer the
following question:
2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.
2.4.2 Find the probability that a randomly selected student is a female and does NOT have a laptop.
2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer the following
question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time employment?
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in
international business or management.
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The Undecided students
are not considered now and the table is a 2x2 table. Do you think the graduate intention and being female are
independent events?
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages.
Answer the following questions based on the data
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the conditional
probability that a randomly selected female earns 50 or more.
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary, Spending, and Text
Messages. For each of them comment whether they follow a normal distribution. Write a note summarizing your
conclusions.
Problem 3
3.1 Do you think there is evidence that the mean moisture contents in both types of shingles are within the
permissible limits? State your conclusions clearly showing all steps.
3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis and conduct the
test of the hypothesis. What assumption do you need to check before the test for equality of means is performed?
Please reflect on all that you have learnt while working on this project. This step is critical in cementing all your
concepts and closing the loop.
SOLUTIONS
Problem 1
A wholesale distributor operating in different regions of Portugal has information on annual spending of several
items in their stores across different regions and channels. The data (Wholesale Customer.csv) consists of 440 large
retailers’ annual spending on 6 different varieties of products in 3 different regions (Lisbon, Oporto, Other) and
across different sales channels (Hotel, Retail).
1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel spent the
most? Which Region and which Channel spent the least?
                  count         mean          std   min     25%     50%      75%      max
Frozen            440.0  3071.931818  4854.673333  25.0  742.25  1526.0  3554.25  60869.0
Detergents_Paper  440.0  2881.493182  4767.854448   3.0  256.75   816.5  3922.00  40827.0

         count  unique    top  freq  mean  std  min  25%  50%  75%  max
Channel    440       2  Hotel   298   NaN  NaN  NaN  NaN  NaN  NaN  NaN
Region     440       3  Other   316   NaN  NaN  NaN  NaN  NaN  NaN  NaN
Region
Lisbon 2386813
Oporto 1555088
Other 10677599
Name: Spending, dtype: int64
Channel
Hotel 7999569
Retail 6619931
Name: Spending, dtype: int64
The Other region spent the most (10,677,599) and Oporto spent the least (1,555,088).
The Hotel channel spent the most (7,999,569) and the Retail channel spent the least (6,619,931).
Region Channel
Lisbon Hotel 1538342
Retail 848471
Oporto Hotel 719150
Retail 835938
Other Hotel 5742077
Retail 4935522
Name: Spending, dtype: int64
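The Region and Channel totals can be recovered from the Region x Channel breakdown by summing over the other key. A minimal sketch in plain Python, using only the six figures printed above; the helper name `totals_by` is illustrative and is not taken from the original notebook:

```python
# Spending totals copied from the Region x Channel output above.
region_channel = {
    ("Lisbon", "Hotel"): 1538342,
    ("Lisbon", "Retail"): 848471,
    ("Oporto", "Hotel"): 719150,
    ("Oporto", "Retail"): 835938,
    ("Other", "Hotel"): 5742077,
    ("Other", "Retail"): 4935522,
}


def totals_by(index):
    """Sum spending over one key position: 0 = Region, 1 = Channel."""
    out = {}
    for key, value in region_channel.items():
        out[key[index]] = out.get(key[index], 0) + value
    return out


region_totals = totals_by(0)
channel_totals = totals_by(1)

# Highest and lowest spenders by total.
top_region = max(region_totals, key=region_totals.get)
low_region = min(region_totals, key=region_totals.get)
top_channel = max(channel_totals, key=channel_totals.get)
low_channel = min(channel_totals, key=channel_totals.get)

print(region_totals)
print(channel_totals)
print(top_region, low_region, top_channel, low_channel)
```

The sums agree with the separate Region and Channel totals reported earlier, which is a quick consistency check on the breakdown.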
           Fresh      Milk   Grocery   Frozen  Detergents_Paper  Delicatessen
Channel
Retail   8904.32  10716.50  16322.85  1652.61           7269.51       1753.44
In the Hotel channel, average spending is highest on Fresh and lowest on Detergents_Paper.
In the Retail channel, average spending is highest on Grocery and lowest on Frozen.
           Fresh     Milk  Grocery   Frozen  Detergents_Paper  Delicatessen
Region
Lisbon  11101.73  5486.42  7403.08  3000.34           2651.12       1354.90
Oporto   9887.68  5088.17  9218.60  4045.36           3687.47       1159.70
Other   12533.47  5977.09  7896.36  2944.59           2817.75       1620.60
1.2 There are 6 different varieties of items that are considered. Describe and
comment/explain all the varieties across Region and Channel? Provide a detailed
justification for your answer.
Bar plots of each item across Channel and Region show that spending differs by both Channel and Region.
Based on the plot, Fresh items sell more in the Retail channel.
Milk sells more in the Retail channel in the Other region.
Grocery items sell more in the Retail channel in Oporto.
Frozen items sell more in the Hotel channel in Oporto.
Detergents_Paper sells more in the Retail channel in Oporto.
1.3 On the basis of the descriptive measure of variability, which item shows the most
inconsistent behavior? Which items show the least inconsistent behaviour?
Fresh 12647.33
Milk 7380.38
Grocery 9503.16
Frozen 4854.67
Detergents_Paper 4767.85
Delicatessen 2820.11
dtype: float64
Coefficient of variation (CV = standard deviation / mean):
CV Milk             = 1.2718
CV Grocery          = 1.1938
CV Frozen           = 1.5785
CV Detergents_Paper = 1.6527
CV Delicatessen     = 1.8473
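The coefficient of variation is the standard deviation divided by the mean. A minimal sketch for the two items whose mean and standard deviation appear in the 1.1 summary (Frozen and Detergents_Paper). The reported values are reproduced to about three decimals when the population standard deviation (ddof=0) is used instead of the sample standard deviation that `describe()` prints; that choice is an inference from the numbers, not something the report states, and `cv_population` is an illustrative name:

```python
import math


def cv_population(mean, sample_std, n):
    """Coefficient of variation based on the population std (ddof=0).

    The sample std (ddof=1) is rescaled by sqrt((n - 1) / n) first.
    """
    population_std = sample_std * math.sqrt((n - 1) / n)
    return population_std / mean


n = 440  # number of retailers in the data set

# item: (mean, sample std), both copied from the summary in 1.1
summary = {
    "Frozen": (3071.931818, 4854.673333),
    "Detergents_Paper": (2881.493182, 4767.854448),
}

for item, (mean, std) in summary.items():
    cv_sample = std / mean                    # CV with the sample std
    cv_pop = cv_population(mean, std, n)      # CV with the population std
    print(f"{item}: sample CV = {cv_sample:.4f}, population CV = {cv_pop:.4f}")
```

A larger CV means spending on that item is less consistent relative to its average, which is why CV is preferred over the raw standard deviation when the item means differ widely.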
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique
with the help of detailed comments.
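One common technique is Tukey's 1.5 x IQR rule, the rule box plots conventionally use to flag outliers. A minimal sketch applied only to the two items whose quartiles appear in the 1.1 summary (Frozen and Detergents_Paper); the helper name `tukey_fences` and the multiplier k = 1.5 are assumptions, not taken from the original analysis:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; points outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# item: (min, 25%, 75%, max), copied from the summary in 1.1
quartiles = {
    "Frozen": (25.0, 742.25, 3554.25, 60869.0),
    "Detergents_Paper": (3.0, 256.75, 3922.00, 40827.0),
}

results = {}
for item, (min_val, q1, q3, max_val) in quartiles.items():
    lower, upper = tukey_fences(q1, q3)
    results[item] = {
        "fences": (lower, upper),
        "has_low_outliers": min_val < lower,    # is the minimum below the lower fence?
        "has_high_outliers": max_val > upper,   # is the maximum above the upper fence?
    }
    print(item, results[item])
```

For both items the maximum lies far above the upper fence, so each has outliers on the high side, while neither minimum falls below the lower fence.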
1.5 On the basis of your analysis, what are your recommendations for the business? How
can your analysis help the business to solve its problem? Answer from the business
perspective.
The coefficients of variation show that spending is inconsistent across items, and this variability
should be reduced. Spending differs between the Hotel and Retail channels and across regions; the
business should aim to bring spending in the channels and regions closer to parity. It should also
focus on items other than Fresh and Grocery.
Problem 2
2.1. For this data, construct the following contingency tables (Keep Gender as row
variable)
2.1.1. Gender and Major
Major   Accounting  CIS  Economics/Finance  International Business  Management  Other  Retailing/Marketing  Undecided  All
Gender
Female           3    3                  7                       4           4      3                    9          0   33
Male             4    1                  4                       2           6      4                    5          3   29
All              7    4                 11                       6          10      7                   14          3   62
The sample has 33 female and 29 male students. All 33 female students have chosen a major. Among the
male students, 26 have chosen a major and the remaining 3 are undecided.
Grad Intention  No  Undecided  Yes  All
Gender
Female           9         13   11   33
Male             3          9   17   29
All             12         22   28   62
11 female and 17 male students intend to graduate. 13 female and 9 male students are undecided.
9 female and 3 male students do not intend to graduate.
Employment  Full-Time  Part-Time  Unemployed  All
Gender
Female              3         24           6   33
Male                7         19           3   29
All                10         43           9   62
Among female students, 3 are employed full-time, 24 are employed part-time, and 6 are unemployed.
Among male students, 7 are employed full-time, 19 are employed part-time, and 3 are unemployed.
Computer  Desktop  Laptop  Tablet  All
Gender
Female          2      29       2   33
Male            3      26       0   29
All             5      55       2   62
Among female students, 2 have a desktop, 29 have a laptop, and 2 have a tablet. Among male students,
3 have a desktop, 26 have a laptop, and none has a tablet.
2.2. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
The probability that a randomly selected CMSU student is male: 29/62 = 0.4677
2.2.2. What is the probability that a randomly selected CMSU student will be female?
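A marginal probability is a row total divided by the grand total. A minimal sketch using the Gender totals from the contingency tables in 2.1 (33 female, 29 male); exact fractions are used so the two probabilities can be checked to sum to 1, and the variable names are illustrative:

```python
from fractions import Fraction

# Row totals ("All" column) from the contingency tables in 2.1.
gender_counts = {"Female": 33, "Male": 29}
total = sum(gender_counts.values())  # 62 students in the sample

# Marginal probability of each gender as an exact fraction.
p_gender = {g: Fraction(c, total) for g, c in gender_counts.items()}

for gender, p in p_gender.items():
    print(f"P({gender}) = {p} = {float(p):.4f}")
```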
2.3. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:
2.3.1. Find the conditional probability of different majors among the male students in
CMSU.
Among male students (29 in total):
Probability of Accounting: 4/29 = 0.1379
Probability of CIS: 1/29 = 0.0345
Probability of Economics/Finance: 4/29 = 0.1379
Probability of International Business: 2/29 = 0.0690
Probability of Management: 6/29 = 0.2069
Probability of Other: 4/29 = 0.1379
Probability of Retailing/Marketing: 5/29 = 0.1724
Probability of Undecided: 3/29 = 0.1034
2.3.2 Find the conditional probability of different majors among the female students of
CMSU.
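The conditional probability of a major given gender is the cell count divided by that gender's row total in the Gender and Major table from 2.1.1. A minimal sketch that applies the same row normalisation to both genders; the function name `conditional_on_gender` is illustrative:

```python
majors = [
    "Accounting", "CIS", "Economics/Finance", "International Business",
    "Management", "Other", "Retailing/Marketing", "Undecided",
]

# Cell counts copied from the Gender and Major table in 2.1.1.
counts = {
    "Female": [3, 3, 7, 4, 4, 3, 9, 0],
    "Male": [4, 1, 4, 2, 6, 4, 5, 3],
}


def conditional_on_gender(gender):
    """Return P(major | gender) for every major as a dict."""
    row = counts[gender]
    row_total = sum(row)
    return {major: count / row_total for major, count in zip(majors, row)}


for gender in counts:
    print(f"Among {gender.lower()} students:")
    for major, p in conditional_on_gender(gender).items():
        print(f"  P({major} | {gender}) = {p:.4f}")
```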
2.4.2 Find the probability that a randomly selected student is a female and does NOT
have a laptop.
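A joint probability is the count in one cell (or a sum of cells) divided by the grand total of 62. A minimal sketch for the two events in 2.4, reading the counts from the Grad Intention and Computer tables in 2.1; the variable names are illustrative:

```python
from fractions import Fraction

N = 62  # total students in the sample

# Male row, "Yes" column of the Gender x Grad Intention table.
male_and_grad_yes = 17

# Female row of the Gender x Computer table: Desktop + Tablet (not Laptop).
female_and_no_laptop = 2 + 2

p_male_and_grad = Fraction(male_and_grad_yes, N)
p_female_and_no_laptop = Fraction(female_and_no_laptop, N)

print(f"P(Male and intends to graduate) = {float(p_male_and_grad):.4f}")
print(f"P(Female and no laptop) = {float(p_female_and_no_laptop):.4f}")
```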
2.5. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time
employment?
2.5.2 Find the conditional probability that given a female student is randomly chosen, she
is majoring in international business or management.
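The "or" event in 2.5.1 follows the addition rule P(A or B) = P(A) + P(B) - P(A and B), and 2.5.2 is a conditional probability within the Female row. A minimal sketch using the counts from the Employment and Major tables in 2.1; the variable names are illustrative:

```python
from fractions import Fraction

N = 62  # total students in the sample

# 2.5.1: addition rule with counts from the Gender x Employment table.
p_male = Fraction(29, N)                # all male students
p_full_time = Fraction(10, N)           # all full-time students
p_male_and_full_time = Fraction(7, N)   # male students with full-time jobs
p_male_or_full_time = p_male + p_full_time - p_male_and_full_time

# 2.5.2: condition on Female (33 students) in the Gender x Major table.
females = 33
ib_or_management = 4 + 4                # International Business + Management
p_ib_or_mgmt_given_female = Fraction(ib_or_management, females)

print(f"P(Male or Full-Time) = {float(p_male_or_full_time):.4f}")
print(f"P(IB or Management | Female) = {float(p_ib_or_mgmt_given_female):.4f}")
```

Subtracting the overlap avoids counting the 7 male full-time students twice.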
2.6 Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No).
The Undecided students are not considered now and the table is a 2x2 table. Do you
think graduate intention and being female are independent events?
Grad Intention  No  Yes  All
Gender
Female           9   11   20
Male             3   17   20
All             12   28   40
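Two events A and B are independent when P(A and B) = P(A) x P(B). A minimal sketch that compares the two sides for "Female" and "intends to graduate (Yes)" using the 2x2 table above. Exact equality is a strict criterion for sample data; a formal check such as a chi-square test of independence would be the usual complement and is not shown here:

```python
from fractions import Fraction

# 2x2 table of Gender x Grad Intention with Undecided students removed.
table = {
    "Female": {"No": 9, "Yes": 11},
    "Male": {"No": 3, "Yes": 17},
}

N = sum(sum(row.values()) for row in table.values())  # 40 students

p_female = Fraction(sum(table["Female"].values()), N)
p_yes = Fraction(table["Female"]["Yes"] + table["Male"]["Yes"], N)
p_female_and_yes = Fraction(table["Female"]["Yes"], N)

# Independence would require the joint probability to equal the product.
product = p_female * p_yes
independent = p_female_and_yes == product

print(f"P(Female and Yes) = {float(p_female_and_yes):.3f}")
print(f"P(Female) * P(Yes) = {float(product):.3f}")
print("Independent in this sample:", independent)
```

The joint probability and the product differ, so in this sample graduate intention and being female do not behave as independent events.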
2.7 Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending and Text Messages. Answer the following questions based on the data
2.7.1 If a student is chosen randomly, what is the probability that his/her GPA is less than
3?
There are 17 students in total who have a GPA less than 3.
The probability that a randomly chosen student has a GPA less than 3 is: 17/62 = 0.2742
2.7.2 Find conditional probability that a randomly selected male earns 50 or more. Find
conditional probability that a randomly selected female earns 50 or more.
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. For each of them comment whether they follow a normal
distribution. Write a note summarizing your conclusions.
From the graph, GPA follows an approximately normal distribution, and there are no outliers.
From the graph, Salary does not follow a normal distribution. There are outliers, and the
distribution is right skewed.
From the graph, Spending does not follow a normal distribution. There are outliers in the data,
and the distribution is right skewed.
From the plots, Text Messages does not follow a normal distribution. There are outliers in the
data, and the distribution is right skewed.
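A numeric complement to reading skew off a plot is the moment-based skewness coefficient, which is near 0 for a symmetric distribution and positive for a right-skewed one. A minimal sketch in plain Python; the two small data sets below are made up for illustration and are not the CMSU survey values, and `sample_skewness` is an illustrative name:

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5


# Hypothetical values, for illustration only.
symmetric = [2.6, 2.8, 3.0, 3.2, 3.4]           # GPA-like, symmetric about 3.0
right_skewed = [30, 35, 40, 40, 45, 50, 80]     # salary-like, one large value

print(f"symmetric:    {sample_skewness(symmetric):.4f}")
print(f"right skewed: {sample_skewness(right_skewed):.4f}")
```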
Problem 3
3.1 Do you think there is evidence that the mean moisture contents in both types of
shingles are within the permissible limits? State your conclusions clearly showing all
steps.
Since the p-value (0.0021) is less than 0.05, reject H0. There is enough evidence to conclude that
the mean moisture content for Sample B shingles is less than 0.35 pounds per 100 square feet.
If the population mean moisture content were in fact no less than 0.35 pounds per 100 square
feet, the probability of observing a sample of 31 shingles with a sample mean moisture content
of 0.2735 pounds per 100 square feet or less would be 0.0021.
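The test statistic behind a one-sample t-test is t = (sample mean - hypothesised mean) / (s / sqrt(n)), with n - 1 degrees of freedom. A minimal sketch in plain Python; the measurements below are hypothetical and are not the shingle data, and `one_sample_t` is an illustrative name. The one-sided p-value would then come from the t distribution's CDF, for example `scipy.stats.t.cdf(t_stat, dof)` if SciPy is available:

```python
import math
import statistics


def one_sample_t(values, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean = mu0."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)          # sample standard deviation (ddof=1)
    t_stat = (mean - mu0) / (s / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical moisture readings, for illustration only.
hypothetical = [0.20, 0.31, 0.28, 0.35, 0.22, 0.30, 0.26]

t_stat, dof = one_sample_t(hypothetical, 0.35)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

A strongly negative t supports the alternative that the mean is below 0.35.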
3.2 Do you think that the population mean for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to
check before the test for equality of means is performed?
H0 : μ(A) = μ(B)
Ha : μ(A) ≠ μ(B)
α = 0.05
t_statistic = 1.29 and p-value = 0.202
As the p-value > α, do not reject H0; there is not enough evidence to conclude that the population
means for shingles A and B differ.
Test assumptions: when running a two-sample t-test, the basic assumptions are that the
distributions of the two populations are normal and that the variances of the two distributions
are the same. If those assumptions are not likely to be met, another testing procedure could be
used.
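Under the equal-variance assumption, the two-sample t statistic uses a pooled variance. A minimal sketch in plain Python that returns the statistic, the degrees of freedom, and the ratio of sample variances as a rough check on the equal-variance assumption; the inputs are hypothetical and are not the shingle data, and `pooled_t` is an illustrative name:

```python
import math
import statistics


def pooled_t(a, b):
    """Return (t statistic, degrees of freedom, variance ratio) for two samples."""
    na, nb = len(a), len(b)
    var_a = statistics.variance(a)        # sample variance (ddof=1)
    var_b = statistics.variance(b)
    dof = na + nb - 2
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / dof
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    t_stat = (statistics.fmean(a) - statistics.fmean(b)) / std_err
    return t_stat, dof, var_a / var_b


# Hypothetical samples, for illustration only.
sample_a = [1, 2, 3]
sample_b = [2, 3, 4]

t_stat, dof, variance_ratio = pooled_t(sample_a, sample_b)
print(f"t = {t_stat:.4f}, dof = {dof}, variance ratio = {variance_ratio:.2f}")
```

A variance ratio far from 1 suggests the equal-variance assumption is doubtful, in which case Welch's unequal-variance t-test is a common alternative.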