The data set has 2 channels, Hotel and Retail, and 6 items bought or supplied via these
channels.
From the above image we can clearly see that the data has no null entries.
Channel and Region are of object data type, while all other columns are of integer data
type.
1.1 Use methods of descriptive statistics to summarize data. Which Region
and which Channel seems to spend more? Which Region and which Channel
seems to spend less?
The data has 440 rows, with 2 unique values in the Channel column and 3 unique values
in the Region column. The standard deviation of every item is higher than its
mean, which shows that spending is widely dispersed and unlikely to be normally
distributed. Hotel is the most frequent value in the Channel column and Other is the
most frequent value in the Region column.
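Spending per channel and per region can also be compared by summing spend within each group. The sketch below is a minimal illustration in plain Python; the records, the `total_by` helper and the region name Lisbon are hypothetical and are not taken from the actual data set.

```python
from collections import defaultdict

# Hypothetical records for illustration only: (channel, region, total spend).
records = [
    ("Hotel", "Other", 300),
    ("Hotel", "Lisbon", 160),
    ("Retail", "Other", 250),
    ("Retail", "Oporto", 80),
    ("Hotel", "Oporto", 60),
]

def total_by(rows, index):
    """Sum the spend (position 2) grouped by the field at position `index`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[index]] += row[2]
    return dict(totals)

by_channel = total_by(records, 0)
by_region = total_by(records, 1)

# Group with the highest and lowest total spend.
top_channel = max(by_channel, key=by_channel.get)
low_region = min(by_region, key=by_region.get)
```

Summing spend, rather than counting rows, answers the "who spends more" question directly, because a group with many small customers can have a high count but a low total.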
To see which channel and region spend more or less I used countplot and the result
can be seen below:
From the above figures we can clearly see that all the varieties do not show similar
behavior across region and channel.
Delicatessen, detergents paper, grocery and milk items are mostly sold via retail
channel whereas frozen, fresh are mostly sold via hotel channel.
1.3 On the basis of a descriptive measure of variability, which item shows the
most inconsistent behaviour? Which items show the least inconsistent
behaviour?
We used the coefficient of variation (CV) to check for inconsistent behaviour in the
data, as the mean values of the items differ drastically from one another.
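The coefficient of variation is the standard deviation divided by the mean, which makes variability comparable across items with very different average spend. A minimal sketch follows; the spending figures are hypothetical and chosen only to illustrate the calculation.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

# Hypothetical spending figures, for illustration only (not the real data set).
spending = {
    "Fresh": [100, 110, 90, 105, 95],
    "Delicatessen": [10, 200, 30, 5, 155],
}

cv = {item: coefficient_of_variation(v) for item, v in spending.items()}

# The item with the largest CV is the most inconsistent.
most_inconsistent = max(cv, key=cv.get)
least_inconsistent = min(cv, key=cv.get)
```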
Yes, it can be seen from the boxplot that there are outliers in the data. Thus we can
say that the data contains extreme values.
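A boxplot typically flags points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule in plain Python to a hypothetical sample; quartile conventions differ slightly between libraries, so the exact fences may not match a given plotting tool.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical sample with one extreme value, for illustration only.
sample = [10, 12, 13, 14, 15, 16, 18, 100]
outliers = iqr_outliers(sample)
```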
1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective.
On the basis of the analysis, I found that:
There is a large gap between spending in the Hotel and Retail channels, which should be
reduced.
There is also a large gap in spending between regions, which should be brought closer
together based on the customers in each region.
The current focus is mainly on grocery and fresh items; it is recommended to start
focusing on other items as well.
Inconsistency in spending across items can be seen through the coefficient of variation;
it is recommended to reduce this inconsistency for better business output.
PROBLEM 1 SUMMARY:
1. Hotel (channel) and Other (region) spend more.
Retail (channel) and Oporto (region) spend less.
2. Different varieties of product do not show similar behavior across region and
channel.
3. Delicatessen items show the most inconsistent behaviour while Fresh items
show the least inconsistent behaviour.
4. There are outliers in the data.
5. Provided recommendations for the business.
PROBLEM STATEMENT 2
The Student News Service at Clear Mountain State University (CMSU) has decided
to gather data about the undergraduate students that attend CMSU. CMSU creates
and distributes a survey of 14 questions and receives responses from 62
undergraduates (stored in the Survey data set).
The data set has 14 variables out of which 6 columns are of integer data type, 6 are
of object data type and 2 are of float data type.
There are no null entries in the data.
2.1. Construct the following contingency tables (Keep Gender as row variable)
The probability that a randomly selected CMSU student will be male is 0.4677 or
46.77%
2.2.2. What is the probability that a randomly selected CMSU student will be
female?
The probability that a randomly selected CMSU student will be female is 0.5323 or
53.23%
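These marginal probabilities are the gender counts divided by the number of responses. In the sketch below the counts 29 and 33 are inferred from the reported probabilities and the 62 responses; they are an assumption, since the contingency table itself is not reproduced here.

```python
# Counts inferred from the reported probabilities and the 62 responses
# (an assumption; the contingency table is not shown in the text).
counts = {"Male": 29, "Female": 33}
total = sum(counts.values())

# Marginal probability of each gender, rounded to four decimals.
prob = {gender: round(n / total, 4) for gender, n in counts.items()}
```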
The probability of major being Accounting given that the student is male is
0.14 or 14%
The probability of major being CIS given that the student is male is 0.03 or 3%
The probability of major being Economics/Finance given that the student is
male is 0.14 or 14%
The probability of major being International Business given that the student
is male is 0.07 or 7%
The probability of major being Management given that the student is male is
0.21 or 21%
The probability of major being Other given that the student is male is 0.14 or
14%
The probability of major being Retailing/Marketing given that the student is
male is 0.17 or 17%
The probability of major being Undecided given that the student is male is
0.10 or 10%
2.3.2 Find the conditional probability of different majors among the female
students of CMSU.
Conditional probability:
$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$
The probability of major being Accounting given that the student is female is
0.09 or 9%
The probability of major being CIS given that the student is female is 0.09 or
9%
The probability of major being Economics/Finance given that the student is
female is 0.21 or 21%
The probability of major being International Business given that the student
is female is 0.12 or 12%
The probability of major being Management given that the student is female
is 0.12 or 12%
The probability of major being Other given that the student is female is 0.09
or 9%
The probability of major being Retailing/Marketing given that the student is
female is 0.27 or 27%
The probability of major being Undecided given that the student is female is
0.0 or 0%
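The conditional probabilities for both genders follow from dividing each cell of the Gender by Major table by its row total. In the sketch below the cell counts are inferred from the rounded probabilities reported in this section and the group sizes of 29 and 33; they are an assumption, not values copied from the original table.

```python
majors = [
    "Accounting", "CIS", "Economics/Finance", "International Business",
    "Management", "Other", "Retailing/Marketing", "Undecided",
]

# Cell counts inferred from the rounded probabilities (an assumption).
table = {
    "Male": [4, 1, 4, 2, 6, 4, 5, 3],
    "Female": [3, 3, 7, 4, 4, 3, 9, 0],
}

def conditional_given_gender(gender):
    """Return P(major | gender) for every major, rounded to two decimals."""
    row = table[gender]
    row_total = sum(row)
    return {m: round(c / row_total, 2) for m, c in zip(majors, row)}

p_major_given_male = conditional_given_gender("Male")
p_major_given_female = conditional_given_gender("Female")
```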
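The probability that a randomly chosen male student intends to graduate is 0.59 or 59%.
This value is conditional on the student being male: a joint probability of being male
and intending to graduate cannot exceed P(male) = 0.4677, and would be about
0.59 × 0.4677 ≈ 0.28 using the rounded figures.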
2.4.2 Find the probability that a randomly selected student is a female and
does NOT have a laptop.
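$P(\text{no laptop} \mid \text{female}) = \dfrac{\text{number of females without a laptop}}{\text{total number of females}}$
Among female students, the probability of not having a laptop is 0.12 or 12%. Because
the denominator is the number of female students rather than all 62 students, this is a
conditional probability; the joint probability over all students would be about
0.12 × 0.5323 ≈ 0.06.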
2.5.1. Find the probability that a randomly chosen student is either a male or
has full-time employment?
$P(\text{male} \cup \text{full-time employment}) = \dfrac{\text{number of students who are male or in full-time employment}}{\text{total number of students}}$
The probability that a randomly chosen student is either a male or has full time
employment is 0.52 or 52%
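The probability of a union can be obtained either by counting directly or through the addition rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B). The sketch below checks that both routes agree; the survey rows are hypothetical and do not reproduce the CMSU data.

```python
# Hypothetical survey rows for illustration only: (gender, employment).
students = [
    ("Male", "Full-Time"), ("Male", "Part-Time"), ("Male", "Unemployed"),
    ("Female", "Full-Time"), ("Female", "Part-Time"), ("Female", "Part-Time"),
]
n = len(students)

male = sum(g == "Male" for g, _ in students)
full_time = sum(e == "Full-Time" for _, e in students)
both = sum(g == "Male" and e == "Full-Time" for g, e in students)

# Addition rule: subtract the overlap so it is not counted twice.
p_union = (male + full_time - both) / n

# Direct count of students who are male or in full-time employment.
p_direct = sum(g == "Male" or e == "Full-Time" for g, e in students) / n
```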
2.5.2. Find the conditional probability that given a female student is randomly
chosen, she is majoring in international business or management
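$P(\text{international business} \cup \text{management} \mid \text{female}) = \dfrac{P((\text{international business} \cup \text{management}) \cap \text{female})}{P(\text{female})}$
The conditional probability that, given a female student is randomly chosen, she is
majoring in international business or management is approximately 0.24 or 24%. This is
the sum of the two per-major values found in 2.3.2 (0.12 + 0.12), since each student has
a single major.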
Two events A and B are independent if:
$P(A \cap B) = P(A)\,P(B)$
Using the above equations, I can conclude that the graduate intention and being
female are not independent events.
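The independence condition P(A ∩ B) = P(A) P(B) can be checked numerically from counts. The sketch below uses hypothetical counts and a hypothetical helper `is_independent`; with real survey data exact equality is rare, so a formal test such as chi-square would normally be used alongside this comparison.

```python
import math

def is_independent(n_a, n_b, n_ab, n_total, tol=1e-9):
    """Compare P(A and B) with P(A) * P(B) using counts."""
    p_a = n_a / n_total
    p_b = n_b / n_total
    p_ab = n_ab / n_total
    return math.isclose(p_ab, p_a * p_b, abs_tol=tol)

# Hypothetical counts, for illustration only.
independent_case = is_independent(n_a=50, n_b=40, n_ab=20, n_total=100)
dependent_case = is_independent(n_a=50, n_b=40, n_ab=30, n_total=100)
```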
2.7. Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages.
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA
is less than 3?
$P(\text{GPA} < 3) = \dfrac{\text{number of students with GPA below 3}}{\text{total number of students}}$
If a student is chosen randomly, the probability that his/her GPA is less than 3 is
0.2742 or 27.42%
2.7.2. Find the conditional probability that a randomly selected male earns 50
or more. Find the conditional probability that a randomly selected female earns
50 or more.
$P(\text{earns 50 or more} \mid \text{male}) = \dfrac{P(\text{male} \cap \text{earns 50 or more})}{P(\text{male})}$
The conditional probability that a randomly selected male earns 50 or more is 0.48
The conditional probability that a randomly selected female earns 50 or more is 0.55
2.8. Note that there are four numerical (continuous) variables in the data set,
GPA, Salary, Spending, and Text Messages. For each of them comment
whether they follow a normal distribution. Write a note summarizing your
conclusions.
I used distplot to examine the distribution of each variable, and the result can be seen
below:
From the above graphs, the distributions do not appear to be normal, as the curves are
not symmetric and bell-shaped. This is a visual judgement only; equality of mean and
standard deviation is not a requirement for normality.
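A visual check can be complemented with a numeric measure of symmetry such as skewness, which is close to 0 for a symmetric distribution. The sketch below is a minimal illustration on hypothetical samples; it is not a formal normality test and was not part of the original analysis.

```python
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m = mean(values)
    s = pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)

# Hypothetical samples, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 20]

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
```

A clearly positive value indicates a long right tail, a clearly negative value a long left tail.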
PROBLEM 2 SUMMARY:
1. Showed the relation between gender and major, gender and grad intention,
gender and employment, and gender and computer with the help of
contingency tables.
2. 2.1 Probability of CMSU student being male is 46.77%
2.2 Probability of CMSU student being female is 53.23%
3. 3.1 Conditional probability of different majors among male students is:
Accounting - 0.14
CIS - 0.03
Economics/Finance - 0.14
International Business - 0.07
Management - 0.21
Other - 0.14
Retailing/Marketing - 0.17
Undecided - 0.10
3.2 Conditional probability of different majors among female students is:
Accounting - 0.09
CIS - 0.09
Economics/Finance - 0.21
International Business - 0.12
Management - 0.12
Other - 0.09
Retailing/Marketing - 0.27
Undecided - 0.00
4. 4.1 Probability that a male CMSU student intends to graduate is 59%
4.2 Probability that a female CMSU student does not have a laptop is 12%
5. 5.1 Probability of a CMSU student being male or having full-time
employment is 52%
5.2 Probability of a CMSU female student majoring in international
business or management is approximately 24%
6. Graduate intention and being female are not independent events.
7. 7.1 Probability that CMSU student’s GPA is less than 3 is 27.42%
7.2 Probability that CMSU male student earns 50 or more is 48%
Probability that CMSU female student earns 50 or more is 55%
8. GPA, Salary, Spending, and Text Messages are not normally distributed
as per the distplot.
PROBLEM 3
An important quality characteristic used by the manufacturers of ABC asphalt
shingles is the amount of moisture the shingles contain when they are packaged.
Customers may feel that they have purchased a product lacking in quality if they find
moisture and wet shingles inside the packaging. In some cases, excessive moisture
can cause the granules attached to the shingles for texture and colouring purposes
to fall off the shingles resulting in appearance problems. To monitor the amount of
moisture present, the company conducts moisture tests. A shingle is weighed and
then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet is calculated.
The company would like to show that the mean moisture content is less than 0.35
pound per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square
feet) for A shingles and 31 for B shingles.
The data set has 2 columns showing A and B shingles measurements. Data type is
float.
3.1 Do you think there is evidence that means moisture contents in both types
of shingles are within the permissible limits? State your conclusions clearly
showing all steps.
For A shingles:
$H_0: \mu = 0.35$
$H_1: \mu < 0.35$
$\alpha = 0.05$
Step 3: As the population standard deviation is unknown, we will use a one-tailed t-test
t statistic: -1.4735046253382782
p value: 0.07477633144907513
For B shingles:
$H_0: \mu = 0.35$
$H_1: \mu < 0.35$
$\alpha = 0.05$
Step 3: As the population standard deviation is unknown, we will use a one-tailed t-test
t statistic: -3.1003313069986995
p value: 0.0020904774003191826
Conclusion:
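For A shingles the p value (0.0748) is greater than 0.05, so we fail to reject the null
hypothesis: there is not enough evidence that the mean moisture content is less than
0.35 pound per 100 square feet. For B shingles the p value (0.0021) is less than 0.05,
so we reject the null hypothesis: the mean moisture content is less than 0.35 and is
within the permissible limit.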
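The decision for each shingle type comes from comparing the p value with α. The sketch below applies that rule to the p values reported above and shows the one-sample t statistic formula t = (x̄ − μ0) / (s / √n) on a hypothetical set of readings; the helper names and the demo readings are illustrative assumptions, not the original code or data.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """t statistic for H0: mean == mu0, using the sample standard deviation."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / math.sqrt(n))

def reject_null(p_value, alpha=0.05):
    """Reject H0 when the p value is below the significance level."""
    return p_value < alpha

# p values as reported above for the A and B shingles.
reject_a = reject_null(0.07477633144907513)
reject_b = reject_null(0.0020904774003191826)

# Hypothetical moisture readings, only to illustrate the formula.
t_demo = one_sample_t([0.30, 0.32, 0.34, 0.36], 0.35)
```

A negative t statistic means the sample mean lies below 0.35; whether that difference is significant depends on the p value for the left tail.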
3.2 Do you think that the population mean for shingles A and B are equal?
Form the hypothesis and conduct the test of the hypothesis. What assumption
do you need to check before the test for equality of means is performed?
Assumptions:
$H_0: \mu_A = \mu_B$
$H_1: \mu_A \neq \mu_B$
$\alpha = 0.05$
t statistic: 1.2896282719661123
p value: 0.2017496571835306
Conclusion:
The p value (0.2017) is greater than 0.05, so we fail to reject the null hypothesis:
there is no significant evidence that the population means for shingles A and B differ.
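A two-sample t statistic for equality of means can be sketched as follows. This version assumes equal population variances (the pooled form), which is an assumption here because the text does not state which variant was used; the samples are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic and degrees of freedom, assuming equal variances."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, df

# Hypothetical samples, for illustration only.
t_stat, df = pooled_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])

# Degrees of freedom for the sample sizes stated in the problem (36 and 31).
df_shingles = 36 + 31 - 2
```

If the equal-variance assumption is doubtful, the Welch form of the test, which does not pool the variances, is the usual alternative.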
PROBLEM 3 SUMMARY:
1. There is not enough evidence that A shingles are within the permissible limit,
while B shingles are within the limit.
2. There is no significant evidence that the population means of shingles A and B differ.
Problem 3 Conclusion:
As per the tests conducted on the given data, the company is advised to
focus on A shingles, as their moisture content could not be shown to be within the
permissible limit.