
PROBLEM STATEMENT 1

A wholesale distributor operating in different regions of Portugal has information on
annual spending of several items in their stores across different regions and
channels. The data consists of 440 large retailers’ annual spending on 6 different
varieties of products in 3 different regions (Lisbon, Oporto, Other) and across
different sales channels (Hotel, Retail).

EXPLORATORY DATA ANALYSIS

The data set has 2 channels, i.e. Hotel and Retail, and 6 items bought/supplied via
these channels.

Check for null entries and data type:

From the above image we can clearly see that the data has no null entries.
Channel and Region are of object data type, while all other columns are of integer
data type.
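
The null and data type check described above could be reproduced along these lines. This is only a sketch: the rows are made up and the column names are assumed, since the report does not show its code.

```python
import pandas as pd

# Illustrative stand-in for the wholesale data; column names are assumed.
df = pd.DataFrame({
    "Channel": ["Hotel", "Retail", "Hotel", "Retail"],
    "Region": ["Lisbon", "Oporto", "Other", "Other"],
    "Fresh": [12669, 7057, 6353, 13265],
    "Milk": [9656, 9810, 8808, 1196],
})

null_counts = df.isnull().sum()  # number of missing values in each column
column_types = df.dtypes         # pandas data type of each column

print(null_counts)
print(column_types)
```

A column with no missing values shows a count of 0, and the numeric spending columns appear as an integer type.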
1.1 Use methods of descriptive statistics to summarize data. Which Region
and which Channel seems to spend more? Which Region and which Channel
seems to spend less?

The data has 440 rows, with 2 unique values in the Channel column and 3 unique values
in the Region column. The standard deviation of every item is higher than its mean,
which shows that spending is widely dispersed and suggests that the distributions are
skewed rather than normal. Hotel is the most frequent value in the Channel column and
Other is the most frequent value in the Region column.
To see which channel and region spend more or less I used countplot and the result
can be seen below:

From the above charts we can clearly see that:


 Hotel under channel and other under region are spending more.
 Retail under channel and Oporto under region are spending less.
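
A countplot shows how many retailers fall into each category rather than how much they spend. One way to compare spending directly is to add up the item columns for each retailer and then total them per group. A minimal sketch, assuming made-up rows and assumed column names (only two item columns are shown for brevity):

```python
import pandas as pd

# Made-up rows; the real data has 440 retailers and 6 item columns.
df = pd.DataFrame({
    "Channel": ["Hotel", "Hotel", "Retail", "Retail"],
    "Region": ["Lisbon", "Other", "Oporto", "Other"],
    "Fresh": [100, 300, 50, 150],
    "Grocery": [20, 40, 200, 260],
})

item_columns = ["Fresh", "Grocery"]
df["Total"] = df[item_columns].sum(axis=1)  # total spend per retailer

# Total spend per channel and per region, largest first.
spend_by_channel = df.groupby("Channel")["Total"].sum().sort_values(ascending=False)
spend_by_region = df.groupby("Region")["Total"].sum().sort_values(ascending=False)

print(spend_by_channel)
print(spend_by_region)
```

Replacing `sum()` with `mean()` in the last step would compare average spend per retailer, which is not affected by how many retailers each group contains.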
1.2 There are 6 different varieties of items considered. Do all varieties show
similar behaviour across Region and Channel? Provide justification for your
answer.
To get the detailed understanding of the behaviour across region and channel I used
catplot for the same:

From the above figures we can clearly see that not all the varieties show similar
behaviour across region and channel.
Delicatessen, detergents paper, grocery and milk items are mostly sold via the retail
channel, whereas frozen and fresh items are mostly sold via the hotel channel.

1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behaviour? Which items show the least inconsistent
behaviour?
We used the Coefficient of Variation (CV), the ratio of the standard deviation to the
mean, to check inconsistent behaviour in the data, as the mean values differ
drastically from one another in the given data.

 The coefficient of variation of Delicatessen is 1.8473041039189306
 The coefficient of variation of Detergent Paper is 1.6527657881041729
 The coefficient of variation of Fresh is 1.0527196084948245
 The coefficient of variation of Frozen is 1.5785355298607762
 The coefficient of variation of Grocery is 1.193815447749267
 The coefficient of variation of Milk is 1.2718508307424503

Delicatessen items show the most inconsistent behaviour, with a coefficient of
variation of 1.8473041039189306
Fresh items show the least inconsistent behaviour, with a coefficient of variation
of 1.0527196084948245
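
The CV comparison can be sketched as below. The spending values are made up, and the sketch assumes the population standard deviation; using the sample standard deviation instead would change the values slightly.

```python
from statistics import mean, pstdev

# Made-up spending values for two items, for illustration only.
spending = {
    "Fresh": [12669, 7057, 6353, 13265, 22615],
    "Delicatessen": [1338, 1776, 7844, 1788, 5185],
}

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)

cv = {item: coefficient_of_variation(values) for item, values in spending.items()}
most_inconsistent = max(cv, key=cv.get)   # item with the largest CV
least_inconsistent = min(cv, key=cv.get)  # item with the smallest CV

print(cv, most_inconsistent, least_inconsistent)
```

Because the CV is unit-free, it allows items with very different average spending to be compared on the same scale.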

1.4 Are there any outliers in the data?


A boxplot is a standard way to detect outliers visually, and it was used here to get
a visual description of the data.

Yes, the boxplot shows that there are outliers in the data. Thus we can say that the
data contains extreme values.
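
A boxplot conventionally draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically; the sketch below uses made-up values rather than the wholesale data.

```python
from statistics import quantiles

# Made-up spending values with one extreme observation.
values = [3, 5, 7, 8, 9, 11, 13, 15, 120]

# Quartiles using linear interpolation between data points.
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

lower = q1 - 1.5 * iqr  # points below this are flagged
upper = q3 + 1.5 * iqr  # points above this are flagged

outliers = [v for v in values if v < lower or v > upper]
print(lower, upper, outliers)
```

Counting the flagged points per item would give a numeric complement to the visual check.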
1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective.
On the basis of the analysis, I found that:
 There is a large gap between spending in the hotel and retail channels, which
should be reduced.
 There is also a large gap in spending between regions, which should be narrowed as
far as the customers in each region allow.
 Currently the main focus is on grocery and fresh items; it is recommended to start
focusing on other items as well.
 Spending on the different items is inconsistent, as the coefficient of variation
shows; it is recommended to reduce this inconsistency for better business output.

PROBLEM 1 SUMMARY:
1. Hotel under channel and other under region are spending more.
Retail under channel and Oporto under region are spending less.
2. Different varieties of product do not show similar behavior across region and
channel.
3. Delicatessen items show the most inconsistent behaviour while Fresh items
show the least inconsistent behaviour.
4. There are outliers in the data.
5. Provided recommendations for the business.

PROBLEM STATEMENT 2
The Student News Service at Clear Mountain State University (CMSU) has decided
to gather data about the undergraduate students that attend CMSU. CMSU creates
and distributes a survey of 14 questions and receives responses from 62
undergraduates (stored in the Survey data set).

EXPLORATORY DATA ANALYSIS

The data set has 14 variables out of which 6 columns are of integer data type, 6 are
of object data type and 2 are of float data type.
There are no null entries in the data.

2.1. Construct the following contingency tables (Keep Gender as row variable)

2.1.1. Gender and Major

Most of the students are interested in Retailing/Marketing major.

2.1.2. Gender and Grad Intention

Most of the students said yes for their grad intention.

2.1.3. Gender and Employment


Most of the students hold part-time jobs, and very few remain unemployed while
studying.
2.1.4. Gender and Computer

Most students have laptops rather than desktops or tablets.
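
Contingency tables like the ones above, with Gender as the row variable, can be built with `pandas.crosstab`. A minimal sketch on made-up survey rows with assumed column names:

```python
import pandas as pd

# Made-up survey rows; the real survey has 62 responses.
survey = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "Major": ["Management", "Accounting", "Management",
              "Management", "CIS", "Accounting"],
})

# Rows are Gender, columns are Major; margins=True adds "All" totals.
table = pd.crosstab(survey["Gender"], survey["Major"], margins=True)
print(table)
```

Swapping `"Major"` for another categorical column gives the remaining tables in the same way.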

2.2. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be
male?

P(Male) = Total number of male students / Total number of students

The probability that a randomly selected CMSU student will be male is 0.4677 or
46.77%

2.2.2. What is the probability that a randomly selected CMSU student will be
female?

P(Female) = Total number of female students / Total number of students

The probability that a randomly selected CMSU student will be female is 0.5323 or
53.23%
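
As a quick check of the arithmetic, the reported probabilities correspond to 29 male and 33 female students out of 62. These counts are inferred from the reported figures, not read from the data file.

```python
# Counts inferred from the reported probabilities and the 62 responses.
total_students = 62
male_students = 29
female_students = 33

p_male = male_students / total_students
p_female = female_students / total_students

print(round(p_male, 4), round(p_female, 4))
```

The two probabilities add up to 1 because every student in the sample is recorded as either male or female.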

2.3. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
2.3.1. Find the conditional probability of different majors among the male
students in CMSU.
Conditional probability:

P(A | B) = P(A ∩ B) / P(B)

 The probability of major being Accounting given that the student is male is
0.14 or 14%
 The probability of major being CIS given that the student is male is 0.03 or 3%
 The probability of major being Economics/Finance given that the student is
male is 0.14 or 14%
 The probability of major being International Business given that the student
is male is 0.07 or 7%
 The probability of major being Management given that the student is male is
0.21 or 21%
 The probability of major being Other given that the student is male is 0.14 or
14%
 The probability of major being Retailing/Marketing given that the student is
male is 0.17 or 17%
 The probability of major being Undecided given that the student is male is
0.10 or 10%

2.3.2 Find the conditional probability of different majors among the female
students of CMSU.
Conditional probability:

P(A | B) = P(A ∩ B) / P(B)

 The probability of major being Accounting given that the student is female is
0.09 or 9%
 The probability of major being CIS given that the student is female is 0.09 or
9%
 The probability of major being Economics/Finance given that the student is
female is 0.21 or 21%
 The probability of major being International Business given that the student
is female is 0.12 or 12%
 The probability of major being Management given that the student is female
is 0.12 or 12%
 The probability of major being Other given that the student is female is 0.09
or 9%
 The probability of major being Retailing/Marketing given that the student is
female is 0.27 or 27%
 The probability of major being Undecided given that the student is female is
0.0 or 0%
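
Conditional probabilities of each major given gender are the row-wise proportions of the Gender by Major contingency table. A sketch with made-up rows and assumed column names:

```python
import pandas as pd

# Made-up rows: 4 male and 5 female students.
survey = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 5,
    "Major": ["Accounting", "Management", "Management", "CIS",
              "CIS", "Management", "Accounting", "Accounting", "Accounting"],
})

# normalize="index" divides each row by its row total, giving P(Major | Gender).
conditional = pd.crosstab(survey["Gender"], survey["Major"], normalize="index")
print(conditional.round(2))
```

Each row of the result sums to 1, which is a convenient check on the reported values.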

2.4. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:
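2.4.1. Find the probability that a randomly chosen student is a male and
intends to graduate.

P(male ∩ yes) = Number of male students who intend to graduate / Total number of students

Of the 29 male students, 59% (17 students) intend to graduate; this is the
conditional probability P(yes | male) = 17/29 ≈ 0.59. The joint probability divides
by all 62 students instead, so the probability that a randomly chosen student is a
male and intends to graduate is 17/62 ≈ 0.27 or 27%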

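2.4.2 Find the probability that a randomly selected student is a female and
does NOT have a laptop.

P(female ∩ no laptop) = Number of female students without a laptop / Total number of students

Of the 33 female students, 12% (4 students) do not have a laptop; this is the
conditional probability P(no laptop | female) = 4/33 ≈ 0.12. The joint probability
divides by all 62 students instead, so the probability that a randomly chosen student
is a female and does not have a laptop is 4/62 ≈ 0.06 or 6%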
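
Joint and conditional probabilities use the same numerator and differ only in the denominator. The sketch below illustrates the difference on a hypothetical 2x2 table of counts, not the survey data.

```python
# Hypothetical counts of (gender, intends to graduate) for 20 students.
counts = {
    ("Male", "Yes"): 6, ("Male", "No"): 4,
    ("Female", "Yes"): 3, ("Female", "No"): 7,
}

total = sum(counts.values())
male_total = sum(n for (gender, _), n in counts.items() if gender == "Male")

# Joint probability: divide by everyone in the sample.
p_male_and_yes = counts[("Male", "Yes")] / total

# Conditional probability: divide by the males only.
p_yes_given_male = counts[("Male", "Yes")] / male_total

print(p_male_and_yes, p_yes_given_male)
```

The joint probability can never exceed the conditional one, since the whole sample is at least as large as any subgroup.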

2.5. Assume that the sample is representative of the population of CMSU.
Based on the data, answer the following question:

2.5.1. Find the probability that a randomly chosen student is either a male or
has full-time employment?
P(male ∪ full-time employment) = Number of students who are male or employed full time / Total number of students

The probability that a randomly chosen student is either a male or has full-time
employment is 0.52 or 52%
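
The "either/or" probability can be computed by counting directly or with the addition rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B), which avoids counting male full-time students twice. A sketch on made-up records:

```python
# Made-up (gender, employment) records for 8 students.
students = [
    ("Male", "Full-Time"), ("Male", "Part-Time"), ("Female", "Full-Time"),
    ("Female", "Part-Time"), ("Female", "Unemployed"), ("Male", "Part-Time"),
    ("Female", "Part-Time"), ("Male", "Full-Time"),
]
n = len(students)

p_male = sum(g == "Male" for g, _ in students) / n
p_full_time = sum(e == "Full-Time" for _, e in students) / n
p_both = sum(g == "Male" and e == "Full-Time" for g, e in students) / n

# Addition rule: subtract the overlap so it is counted only once.
p_union = p_male + p_full_time - p_both

# Direct count of students satisfying at least one condition.
p_direct = sum(g == "Male" or e == "Full-Time" for g, e in students) / n

print(p_union, p_direct)
```

Both routes give the same answer, so either can be used to check the other.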

2.5.2. Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management

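P(international business ∪ management | female) = P((international business ∪ management) ∩ female) / P(female)

Since a student has only one major, this equals the sum of the two conditional
probabilities from 2.3.2: 4/33 + 4/33 = 8/33 ≈ 0.2424. The conditional probability
that given a female student is randomly chosen, she is majoring in international
business or management is 24.24%.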

2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels
(Yes/No). The Undecided students are not considered now and the table is a
2x2 table. Do you think the graduate intention and being female are
independent events?

Two events A and B are independent when they satisfy the following condition:

P(A ∩ B) = P(A) × P(B)

For this question, graduate intention and being female are independent if

P(female ∩ yes) = P(female) × P(yes)

and not independent if

P(female ∩ yes) ≠ P(female) × P(yes)

Evaluating both sides from the 2x2 table shows that they are not equal, so I can
conclude that the graduate intention and being female are not independent events.
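
The independence condition can be checked numerically by comparing the joint probability with the product of the marginals. The counts below are hypothetical, chosen only to illustrate the comparison.

```python
import math

# Hypothetical 2x2 counts (undecided students excluded).
female_yes, female_no = 12, 8
male_yes, male_no = 15, 5
total = female_yes + female_no + male_yes + male_no

p_female = (female_yes + female_no) / total
p_yes = (female_yes + male_yes) / total
p_female_and_yes = female_yes / total

product = p_female * p_yes
independent = math.isclose(p_female_and_yes, product)

print(p_female_and_yes, product, independent)
```

In sample data the two sides are rarely exactly equal, so a chi-square test of independence is a common way to judge whether the difference is larger than chance alone would explain.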

2.7. Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages.

2.7.1. If a student is chosen randomly, what is the probability that his/her GPA
is less than 3?
P(GPA < 3) = Number of students with GPA less than 3 / Total number of students

If a student is chosen randomly, the probability that his/her GPA is less than 3 is
0.2742 or 27.42%

2.7.2. Find the conditional probability that a randomly selected male earns 50
or more. Find the conditional probability that a randomly selected female earns
50 or more.
P(earns 50 or more | male) = P(male ∩ earns 50 or more) / P(male)

The conditional probability that a randomly selected male earns 50 or more is 0.48

P(earns 50 or more | female) = P(female ∩ earns 50 or more) / P(female)

The conditional probability that a randomly selected female earns 50 or more is 0.55
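
For a numerical variable, the conditional probability given gender is the share of each gender whose value meets the threshold. A sketch with made-up salaries and assumed column names:

```python
import pandas as pd

# Made-up salaries for 4 male and 5 female students.
survey = pd.DataFrame({
    "Gender": ["Male"] * 4 + ["Female"] * 5,
    "Salary": [45.0, 50.0, 60.0, 40.0, 55.0, 50.0, 42.5, 65.0, 47.5],
})

# Flag students earning 50 or more, then average the flag within each gender.
flagged = survey.assign(EarnsFiftyPlus=survey["Salary"] >= 50)
p_by_gender = flagged.groupby("Gender")["EarnsFiftyPlus"].mean()

print(p_by_gender)
```

The mean of a True/False column is the proportion of True values, which is exactly P(earns 50 or more | gender).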

2.8. Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages. For each of them comment
whether they follow a normal distribution. Write a note summarizing your
conclusions.
I used distplot to examine the distribution of the given data, and the result can be
seen below:
From the above graphs, none of the four variables appears to follow a normal
distribution, as the curves are not symmetrical and bell-shaped.
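
A visual check can be complemented with a numerical one, for example the sample skewness and the Shapiro-Wilk test from SciPy. The sketch below runs on synthetic data, not the survey columns, and these checks are a suggestion rather than something the report used.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples standing in for survey columns.
samples = {
    "roughly normal": rng.normal(loc=3.0, scale=0.4, size=500),
    "right-skewed": rng.exponential(scale=200.0, size=500),
}

results = {}
for name, values in samples.items():
    statistic, p_value = stats.shapiro(values)  # H0: the data are normal
    results[name] = {
        "skewness": float(stats.skew(values)),  # near 0 for symmetric data
        "shapiro_p": float(p_value),
    }

print(results)
```

A small Shapiro-Wilk p-value (for instance below 0.05) is evidence against normality, while a larger one means normality cannot be ruled out.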

PROBLEM 2 SUMMARY:
1. Showed the relation between gender and major, gender and grad intention,
gender and employment, and gender and computer with the help of
contingency tables.
2. 2.1 Probability of CMSU student being male is 46.77%
2.2 Probability of CMSU student being female is 53.23%
3. 3.1 Conditional probability of different majors among male students is:
 Accounting - 0.14
 CIS - 0.03
 Economics/Finance - 0.14
 International Business - 0.07
 Management - 0.21
 Other - 0.14
 Retailing/Marketing - 0.17
 Undecided - 0.10
3.2 Conditional probability of different majors among female students is:
 Accounting - 0.09
 CIS - 0.09
 Economics/Finance - 0.21
 International Business - 0.12
 Management - 0.12
 Other - 0.09
 Retailing/Marketing - 0.27
 Undecided - 0.00
4. 4.1 Probability of a CMSU student being male and intending to graduate is
27% (59% of male students intend to graduate)
4.2 Probability of a CMSU student being female and not having a
laptop is 6% (12% of female students do not have a laptop)
5. 5.1 Probability of a CMSU student being male or has full time
employment is 52%
5.2 Probability of a CMSU female student majoring in international
business or management is 24.24% (the sum of the two values in 3.2, before rounding)
6. Graduate intention and being female are not independent events.
7. 7.1 Probability that CMSU student’s GPA is less than 3 is 27.42%
7.2 Probability that CMSU male student earns 50 or more is 48%
Probability that CMSU female student earns 50 or more is 55%
8. GPA, Salary, Spending, and Text Messages are not normally distributed
as per the distplot.

PROBLEM STATEMENT 3
An important quality characteristic used by the manufacturers of ABC asphalt
shingles is the amount of moisture the shingles contain when they are packaged.
Customers may feel that they have purchased a product lacking in quality if they find
moisture and wet shingles inside the packaging. In some cases, excessive moisture
can cause the granules attached to the shingles for texture and colouring purposes
to fall off the shingles resulting in appearance problems. To monitor the amount of
moisture present, the company conducts moisture tests. A shingle is weighed and
then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet is calculated.
The company would like to show that the mean moisture content is less than 0.35
pound per 100 square feet.

The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square
feet) for A shingles and 31 for B shingles.

EXPLORATORY DATA ANALYSIS

The data set has 2 columns showing A and B shingles measurements. Data type is
float.
3.1 Do you think there is evidence that the mean moisture contents in both types
of shingles are within the permissible limits? State your conclusions clearly
showing all steps.

For A shingles:

Step 1: Null and Alternative hypotheses

H0: μ = 0.35

H1: μ < 0.35

Step 2: The Level of Significance

α =0.05

Step 3: As the population standard deviation is unknown, we will use a one-tailed,
one-sample t-test

Step 4: Calculate the p-value and test statistic

The calculated t statistic and p-value are:

t statistic: -1.4735046253382782
p value: 0.07477633144907513

Step 5: Decide whether to reject or fail to reject the null hypothesis

As the p-value (0.0748) is greater than α = 0.05, we do not have enough evidence to
reject the null hypothesis in favour of the alternative hypothesis.

Thus there is not sufficient evidence to conclude that the mean moisture content of
A shingles is within the permissible limit.
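
The five steps above can be sketched with SciPy's one-sample t-test. The measurements below are illustrative, not the values from the shingles file, and the report does not state which library it used, so this is only one possible way to run the test.

```python
import numpy as np
from scipy import stats

# Illustrative moisture measurements (pounds per 100 square feet).
moisture = np.array([0.44, 0.61, 0.47, 0.30, 0.15, 0.24,
                     0.16, 0.20, 0.20, 0.20, 0.26, 0.14])
limit = 0.35   # hypothesised mean under H0
alpha = 0.05   # level of significance

# alternative="less" tests H1: mean < limit (one-tailed).
t_stat, p_value = stats.ttest_1samp(moisture, popmean=limit, alternative="less")

reject_null = bool(p_value < alpha)
print(t_stat, p_value, reject_null)
```

If `reject_null` is False, the correct reading is that the data do not show the mean to be below the limit, not that the mean exceeds it.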

For B shingles:

Step 1: Null and Alternative hypotheses

H0: μ = 0.35

H1: μ < 0.35

Step 2: The Level of Significance

α =0.05

Step 3: As the population standard deviation is unknown, we will use a one-tailed,
one-sample t-test

Step 4: Calculate the p-value and test statistic

The calculated t statistic and p-value are:

t statistic: -3.1003313069986995
p value: 0.0020904774003191826

Step 5: Decide whether to reject or fail to reject the null hypothesis

As the p-value (0.0021) is less than α = 0.05, we have enough evidence to reject the
null hypothesis in favour of the alternative hypothesis.

Thus, at the 5% level of significance, there is sufficient evidence to conclude that
the mean moisture content of B shingles is within the permissible limit.

Conclusion:
As per the t-tests conducted on the given data, there is sufficient evidence that
the mean moisture content of B shingles is within the permissible limit, but there
is not sufficient evidence of this for A shingles.
3.2 Do you think that the population means for shingles A and B are equal?
Form the hypothesis and conduct the test of the hypothesis. What assumption
do you need to check before the test for equality of means is performed?

Assumptions:

1. Both populations are normally distributed.
2. The two samples are independent.
3. The population variances are assumed to be equal.

Step 1: Null and Alternative hypotheses

H0: μA = μB

H1: μA ≠ μB

Step 2: The Level of Significance

α = 0.05

Step 3: As the population standard deviations are unknown and we have 2 independent
samples, we will use a two-tailed, two-sample t-test

Step 4: Calculate the p - value and test statistic

t statistic: 1.2896282719661123
p value: 0.2017496571835306

Step 5: Decide whether to reject or fail to reject the null hypothesis

As the p-value (0.2017) is greater than α = 0.05, we do not have enough evidence to
reject the null hypothesis in favour of the alternative hypothesis.

Conclusion:

Hence there is not sufficient evidence that the population means for shingles A and
B are different.
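
A two-sample comparison with an explicit check of the equal-variance assumption might look like the sketch below. The samples are synthetic, with invented means and spreads, and Levene's test is one possible check that the report does not say it performed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic samples of the same sizes as the shingles data (36 and 31).
# With real data read from one CSV, missing values in the shorter
# column would need to be dropped first.
shingles_a = rng.normal(loc=0.32, scale=0.13, size=36)
shingles_b = rng.normal(loc=0.27, scale=0.14, size=31)

# Levene's test: H0 is that the two population variances are equal.
levene_stat, levene_p = stats.levene(shingles_a, shingles_b)
equal_var = bool(levene_p >= 0.05)

# Two-tailed test; falls back to Welch's t-test if variances look unequal.
t_stat, p_value = stats.ttest_ind(shingles_a, shingles_b, equal_var=equal_var)

reject_null = bool(p_value < 0.05)
print(equal_var, t_stat, p_value, reject_null)
```

Failing to reject the null hypothesis here means the data are compatible with equal means, not that the means have been shown to be equal.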

PROBLEM 3 SUMMARY:
1. There is sufficient evidence that B shingles are within the permissible limit,
but not sufficient evidence that A shingles are.
2. There is not sufficient evidence that the population means of shingles A and B
are different.
Problem 3 Conclusion:

As per the tests conducted on the given data, the company is advised to focus on
A shingles, as their mean moisture content was not shown to be within the
permissible limit.
