1.1 Problem 1.1
1.2 Problem 1.2
1.3 Problem 1.3
1.4 Problem 1.4
1.5 Problem 1.5
2.1 Problem 2.1
2.2 Problem 2.2
2.3 Problem 2.3
2.4 Problem 2.4
2.5 Problem 2.5
2.6 Problem 2.6
2.7 Problem 2.7
2.8 Problem 2.8
3.1 Problem 3.1
3.2 Problem 3.2
PROBLEM 1
Problem Statement:
The data describe the amount spent in each region on the major item categories of the food market, for
both the Hotel and Retail channels.
The dataset has no null values; this was checked using the isnull() function in Python.
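The null check can be illustrated with a minimal sketch. The report presumably applied pandas' isnull() to the full data set; the version below uses only the standard library, and both the rows and the helper name count_missing are hypothetical.

```python
# Hypothetical rows standing in for the spending data; None marks a missing value.
rows = [
    {"Channel": "Hotel", "Region": "Lisbon", "Fresh": 12669, "Milk": 9656},
    {"Channel": "Retail", "Region": "Oporto", "Fresh": 7057, "Milk": None},
    {"Channel": "Hotel", "Region": "Others", "Fresh": 6353, "Milk": 8808},
]


def count_missing(records):
    """Return {column: number of missing (None) entries}."""
    counts = {}
    for record in records:
        for column, value in record.items():
            counts[column] = counts.get(column, 0) + (value is None)
    return counts


missing = count_missing(rows)
print(missing)
```

A data set with no null values would give a count of zero for every column.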
1.1 Use methods of descriptive statistics to summarize data. Which Region and which Channel
spent the most? Which Region and which Channel spent the least?
From the above table, we can conclude the following:
1. The dataset consists of 2 categorical variables (Channel and Region) and 7 numerical variables
(buyer/sender, Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen).
2. The mean, median and standard deviation of each variable are shown above.
3. From the min and max we can calculate the range of each variable, and from the quartiles the
interquartile range, which in turn identifies the outlier values.
➢ To find the highest and lowest spending by Region and Channel, we create a new column
‘Total_spending’, which is the total of all 6 items.
• “Others” is the region which has spent the most and “Hotel” is the channel that
has spent the most
• “Oporto” is the region which has spent the least and “Retail” is the channel that
has spent the least
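The ‘Total_spending’ step can be sketched as follows. This is an illustration only: the three rows are hypothetical, the helper total_by is not from the report, and plain Python is used in place of the pandas group-by the report likely relied on.

```python
from collections import defaultdict

ITEMS = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# Hypothetical rows; the real data set has one row per buyer.
rows = [
    {"Channel": "Hotel", "Region": "Lisbon", "Fresh": 100, "Milk": 20,
     "Grocery": 30, "Frozen": 40, "Detergents_Paper": 5, "Delicatessen": 5},
    {"Channel": "Retail", "Region": "Oporto", "Fresh": 10, "Milk": 50,
     "Grocery": 60, "Frozen": 5, "Detergents_Paper": 20, "Delicatessen": 5},
    {"Channel": "Hotel", "Region": "Others", "Fresh": 300, "Milk": 40,
     "Grocery": 50, "Frozen": 60, "Detergents_Paper": 10, "Delicatessen": 40},
]


def total_by(records, key):
    """Sum the six item columns per row, then accumulate per group."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += sum(record[item] for item in ITEMS)
    return dict(totals)


by_region = total_by(rows, "Region")
by_channel = total_by(rows, "Channel")
top_region = max(by_region, key=by_region.get)
low_region = min(by_region, key=by_region.get)
print(by_region, by_channel, top_region, low_region)
```

The same function answers both questions, by Region and by Channel, because only the grouping key changes.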
1.2. There are 6 different varieties of items that are considered. Describe and comment/explain all
the varieties across Region and Channel? Provide a detailed justification for your answer .
➢ From the above graphs it can be seen that Milk, Grocery and Detergents Paper have their highest
spend in the Retail channel across all regions. On the other hand, Frozen and Fresh spending is
higher in the Hotel channel across all regions.
• FRESH spending is highest in the Hotel channel and in the OTHERS region
• MILK spending is highest in the Retail channel in the OTHERS region, closely followed by
LISBON
• GROCERY spending is highest in the Retail channel in the LISBON region, where there are
fewer Retail shops than Hotels
• FROZEN spending is highest in the Hotel channel in the OPORTO region
• DETERGENTS PAPER spending is highest in the Retail channel in the OPORTO region
• DELICATESSEN spending is highest in the Retail channel in the LISBON region
Contingency Table
• Lisbon is the capital of Portugal. The Lisbon region shows higher Grocery spend in the Retail
channel than in the Hotel channel, although the difference is relatively small. By contrast, there
are more Hotels than Retail outlets in Lisbon, because of the larger number of tourists. Milk and
Grocery items appear to be in higher demand in LISBON.
• Oporto is regarded as the commercial capital and is the second largest city of Portugal. Frozen and
Detergents Paper items appear to be in higher demand in OPORTO.
1.3 On the basis of the descriptive measure of variability, which item shows the most inconsistent
behaviour? Which items shows the least inconsistent behaviour?
We used Python to calculate the coefficient of variation (COV) of the variables Fresh, Milk, Grocery,
Detergents_Paper and Delicatessen.
➢ The COV of Delicatessen is the highest, so this item shows the most inconsistent behaviour. The
COV of Fresh is the lowest, so this item shows the least inconsistent behaviour.
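For reference, the coefficient of variation is the standard deviation divided by the mean. A minimal sketch, assuming the sample standard deviation is used (the report does not say which form it applied) and with hypothetical item names and values:

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical spending figures, not the report's data.
spend = {
    "steady_item": [100, 110, 90, 105, 95],
    "erratic_item": [10, 300, 5, 150, 35],
}

cov = {name: coefficient_of_variation(values) for name, values in spend.items()}
most_inconsistent = max(cov, key=cov.get)
least_inconsistent = min(cov, key=cov.get)
print(cov, most_inconsistent, least_inconsistent)
```

Because the COV is scaled by the mean, it allows items with very different spending levels to be compared on the same footing.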
1.4 Are there any outliers in the data? Back up your answer with a suitable plot/technique with the
help of detailed comments.
From the above box plot we conclude that the dataset contains outliers.
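A box plot conventionally flags points lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule numerically. The sample values are hypothetical, and the quartile method ("inclusive") is an assumption, since plotting libraries may interpolate quartiles slightly differently.

```python
import statistics


def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]


sample = [10, 12, 13, 15, 16, 18, 20, 95]  # hypothetical spending values
outliers = iqr_outliers(sample)
print(outliers)
```

Running the same function on each item column would give a per-item list of outliers to accompany the plot.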
1.5 On the basis of your analysis, what are your recommendations for the business? How can your
analysis help the business to solve its problem? Answer from the business perspective.
The dataset covers three regions (Lisbon, Oporto and Others):
• Lisbon is the capital of Portugal, one of its biggest cities, and the busiest because of
tourism
• Oporto is the second largest city and also the commercial capital of Portugal
• The other cities of Portugal are grouped as ‘Others’
➢ The highest correlations are between Grocery and Detergents Paper, Milk and
Grocery, and Milk and Detergents Paper.
➢ Buyers of Grocery items are therefore more likely to also buy Detergents Paper and Milk.
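The pairwise correlations behind a heatmap of this kind are typically Pearson coefficients. A minimal sketch of that calculation, written out by hand with hypothetical values (the function name pearson and the data are illustrative, not the report's):

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical spending columns.
grocery = [10, 20, 30, 40, 50]
detergents = [1, 2, 3, 4, 5]   # moves exactly in step with grocery
fresh = [5, 1, 4, 2, 3]        # only weakly related to grocery

r_strong = pearson(grocery, detergents)
r_weak = pearson(grocery, fresh)
print(r_strong, r_weak)
```

A coefficient near 1 indicates that two items tend to be bought together in proportion; a coefficient near 0 indicates little linear relationship.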
Based on the above analysis, the following recommendations can be made to the wholesale
distributor:
1. Lisbon has more Hotels than Retail outlets, and Grocery is its highest-spend item; Hotels in
Lisbon also show higher demand for Fresh items. The wholesale distributor can stock more
Grocery items at the locations where the Hotels are concentrated, in line with the tourism
seasons. It can further increase revenue by distributing Grocery items together with Milk
and Detergents Paper (based on the correlation heatmap above).
2. Oporto has more Retail outlets than Lisbon, despite Lisbon being the capital. There is more
demand for Retail stores in Oporto, and most of its population buys food items from Retail
stores. Grocery and Detergents Paper are in the highest demand in Oporto.
4. The wholesale distributor should also focus on products other than Fresh and Grocery,
since mean spending on Fresh and Grocery is comparatively higher than on the other
products. It can run more marketing campaigns for those products and offer discounts to
increase spending on them.
PROBLEM 2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather
data about the undergraduate students that attend CMSU. CMSU creates and distributes a
survey of 14 questions and receives responses from 62 undergraduates (stored in
the Survey data set).
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.1.2. Gender and Grad Intention
2.2 Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
2.2.2. What is the probability that a randomly selected CMSU student will be female?
2.3. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:
2.3.1. Find the conditional probability of different majors among the male students in
CMSU.
The conditional probability of different majors among the male students in CMSU is:
➢ Using the contingency table of Gender and Major, we obtained the total number of males and the
number of males opting for each major.
From this output we can say that most male students prefer Management as their major and that CIS
is the least preferred.
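The calculation described above divides each cell of the Gender and Major contingency table by its row total. A minimal sketch with hypothetical responses (the counts below are not the survey's, and the helper name conditional is illustrative):

```python
from collections import Counter

# Hypothetical (gender, major) responses, not the survey data.
responses = [
    ("Male", "Management"), ("Male", "Management"), ("Male", "CIS"),
    ("Male", "International Business"),
    ("Female", "Retailing/Marketing"), ("Female", "Retailing/Marketing"),
    ("Female", "Management"),
]

table = Counter(responses)                          # contingency cell counts
gender_totals = Counter(g for g, _ in responses)    # row totals


def conditional(major, gender):
    """P(major | gender) = count(gender, major) / count(gender)."""
    return table[(gender, major)] / gender_totals[gender]


majors = sorted({m for _, m in responses})
male_probs = {m: conditional(m, "Male") for m in majors}
print(male_probs)
```

The conditional probabilities within one gender always sum to 1, which is a quick check on the table.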
2.3.2 Find the conditional probability of different majors among the female students of
CMSU.
The conditional probability of different majors among the female students in CMSU is:
Using the contingency table of Gender and Major, we obtained the total number of females and the
number of females opting for each major.
From this output we can say that most female students prefer Retailing/Marketing as their major and
that CIS is the least preferred.
2.4. Assume that the sample is a representative of the population of CMSU. Based on the
data, answer the following question:
2.4.1. Find the probability That a randomly chosen student is a male and intends to
graduate.
= Probability that a randomly chosen student is a Male * Probability that a Male student intends to
graduate
➢ The probability that a randomly chosen student is a male and intends to graduate is 27.42%
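The multiplication rule is P(Male and Graduate) = P(Male) * P(Graduate | Male). The sketch below works it through. The total of 62 students comes from the survey description; the counts of 29 males and 17 males intending to graduate are assumptions inferred from the percentages reported here, not values read from the data file.

```python
total_students = 62      # from the survey description
males = 29               # assumption: inferred from the reported percentages
males_intending = 17     # assumption: chosen to match the reported 27.42%

p_male = males / total_students
p_grad_given_male = males_intending / males

# Multiplication rule: P(A and B) = P(A) * P(B | A)
p_male_and_grad = p_male * p_grad_given_male
print(f"{p_male_and_grad:.2%}")
```

Note that the number of males cancels in the product, so the result is simply the joint count divided by the total number of students.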
2.4.2 Find the probability that a randomly selected student is a female and does NOT
have a laptop.
From the constructed contingency Table 2.1.4, we can conclude the following:
Probability that a randomly selected student is a female and does NOT have a laptop
= Probability that a randomly chosen student is a Female * Probability that a Female student does not
have a laptop
➢ The probability that a randomly selected student is a female and does NOT have a laptop is 6.4%
2.5. Assume that the sample is representative of the population of CMSU. Based on the
data, answer the following question:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time
employment?
= Probability of a student being Male + Probability of a student having full-time employment - Probability
of a student being Male and having full-time employment
➢ The probability that a randomly chosen student is a male or has full-time employment is 51.61%
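The addition rule subtracts the overlap so that male students with full-time employment are not counted twice. In the sketch below, the total of 62 is from the survey description, while the three counts are assumptions chosen to be consistent with the reported 51.61% and are not read from the data file.

```python
def p_union(n_a, n_b, n_both, n_total):
    """Addition rule: P(A or B) = P(A) + P(B) - P(A and B)."""
    return (n_a + n_b - n_both) / n_total


# Assumed counts (illustrative): males, full-time employed, and both.
p_male_or_full_time = p_union(n_a=29, n_b=10, n_both=7, n_total=62)
print(f"{p_male_or_full_time:.2%}")
```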
2.5.2. Find the conditional probability that given a female student is randomly chosen,
she is majoring in international business or management.
➢ The conditional probability that given a female student is randomly chosen, she is majoring in
international business or management is 24.242 %.
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No).
The Undecided students are not considered now and the table is a 2x2 table. Do you
think the graduate intention and being female are independent events?
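One way to examine this question is to compare P(Yes | Female) with P(Yes): the two events are independent only if these probabilities are equal. The sketch below uses hypothetical 2x2 counts, not the survey's table, and the helper name independence_gap is illustrative.

```python
def independence_gap(table):
    """table maps (gender, intent) to a count. Return P(Yes | Female) - P(Yes)."""
    total = sum(table.values())
    female = sum(c for (gender, _), c in table.items() if gender == "Female")
    yes = sum(c for (_, intent), c in table.items() if intent == "Yes")
    p_yes = yes / total
    p_yes_given_female = table[("Female", "Yes")] / female
    return p_yes_given_female - p_yes


# Hypothetical 2x2 tables (Undecided students already excluded).
dependent = {("Female", "Yes"): 10, ("Female", "No"): 10,
             ("Male", "Yes"): 15, ("Male", "No"): 5}
independent = {("Female", "Yes"): 10, ("Female", "No"): 10,
               ("Male", "Yes"): 5, ("Male", "No"): 5}

gap_dependent = independence_gap(dependent)
gap_independent = independence_gap(independent)
print(gap_dependent, gap_independent)
```

In sample data the gap is rarely exactly zero, so a small gap would usually be judged with a formal test such as a chi-square test of independence rather than by inspection alone.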
2.7. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages.
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less
than 3?
From the calculations done in Python, we can conclude the following:
➢ If a student is chosen randomly, the probability that his/her GPA is less than 3 is 27.4%
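For a numerical variable, this probability is the share of observations that meet the condition. A minimal sketch with hypothetical GPA values (not the survey data); the helper name proportion is illustrative.

```python
def proportion(values, predicate):
    """Empirical probability: fraction of values for which predicate is true."""
    values = list(values)
    return sum(1 for v in values if predicate(v)) / len(values)


gpas = [2.5, 2.9, 3.0, 3.2, 3.4, 3.6, 2.8, 3.1]  # hypothetical GPAs
p_below_3 = proportion(gpas, lambda gpa: gpa < 3)
print(p_below_3)
```

Applying the same helper to the salaries of male students only, or of female students only, gives the conditional probabilities asked for in 2.7.2.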
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more.
Find the conditional probability that a randomly selected female earns 50 or more.
Using Python, we get the following output:
2.8. Note that there are four numerical (continuous) variables in the data set, GPA, Salary,
Spending, and Text Messages. For each of them comment whether they follow a normal
distribution.
Using distplot() in Python, we checked whether the continuous variables in the dataset follow a
normal distribution.
We have also calculated the skewness of the variables
From the above graphs and skewness calculations, we can conclude the following
1. ‘GPA’ variable is almost normally distributed and has slight negative skewness i.e. the
data is slightly left skewed.
2. ‘Salary’ is almost normally distributed, and it is slightly right skewed based on the
skewness value
3. ‘Spending’ variable is highly right skewed
4. The ‘Text Messages’ variable is highly right skewed
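Skewness measures the asymmetry of a distribution: values near 0 suggest symmetry, positive values a longer right tail, and negative values a longer left tail. The sketch below uses the uncorrected moment-based formula with hypothetical data. Library routines may apply a small-sample correction, so their output can differ slightly from this version.

```python
import statistics


def skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]            # hypothetical, balanced around the mean
right_tailed = [1, 1, 2, 2, 3, 20]     # hypothetical, one large value on the right

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_tailed)
print(skew_symmetric, skew_right)
```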
*Mean, Median and mode values are calculated using Python functions
➢ From the above table, we can conclude that the continuous variable ‘GPA’
is approximately normally distributed.
We have a dataset of survey responses from 62 students, both male and female. More female students
than male students answered the survey. The majority of the male students are interested in the
Management course, and the majority of the female students are interested in Retailing/Marketing.
From the data, a greater number of students are looking for a part-time job.
PROBLEM 3
An important quality characteristic used by the manufacturers of ABC asphalt shingles is the
amount of moisture the shingles contain when they are packaged. Customers may feel that they
have purchased a product lacking in quality if they find moisture and wet shingles inside the
packaging. In some cases, excessive moisture can cause the granules attached to the shingles
for texture and coloring purposes to fall off the shingles resulting in appearance problems. To
monitor the amount of moisture present, the company conducts moisture tests. A shingle is
weighed and then dried. The shingle is then reweighed, and based on the amount of moisture
taken out of the product, the pounds of moisture per 100 square feet are calculated. The company
would like to show that the mean moisture content is less than 0.35 pounds per 100 square feet.
The file (A & B shingles.csv) includes 36 measurements (in pounds per 100 square feet) for A shingles
and 31 for B shingles.
3.1 Do you think there is evidence that means moisture contents in both types of shingles are
within the permissible limits? State your conclusions clearly showing all steps.
For the A shingles, the null and alternative hypotheses to test whether the population mean moisture
content is less than 0.35 pounds per 100 square feet are given:
We have a sample and do not know the population standard deviation.
The sample is not large, so we use the t distribution and the t test statistic.
We do not have enough evidence to reject the null hypothesis, since the p-value (0.07) > level of
significance (0.05).
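The test described here is a one-sample, left-tailed t-test of H0: mu >= 0.35 against H1: mu < 0.35. The sketch below is a standard-library illustration rather than the report's code: the moisture values are hypothetical, and the helper t_cdf approximates the Student-t distribution function by numerical integration so that no external package is needed.

```python
import math
import statistics


def t_cdf(t, df, steps=2000):
    """Student-t CDF via Simpson's rule on the density (adequate for a sketch)."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps                      # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3                          # integral of the density from 0 to |t|
    return 0.5 + area if t >= 0 else 0.5 - area


def one_sample_t_less(sample, mu0):
    """Test H0: mu >= mu0 against H1: mu < mu0. Return (t, one-sided p-value)."""
    n = len(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    t = (statistics.mean(sample) - mu0) / std_error
    return t, t_cdf(t, n - 1)


# Hypothetical moisture measurements (pounds per 100 square feet).
moisture = [0.30, 0.33, 0.29, 0.36, 0.31, 0.34, 0.28, 0.35]
t_stat, p_value = one_sample_t_less(moisture, 0.35)
print(t_stat, p_value)
```

When the t statistic is negative, this left-tailed p-value equals half of the two-sided p-value, which is why halving a two-sided result gives the one-sided answer in that case.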
For the B shingles, the null and alternative hypotheses to test whether the population mean moisture
content is less than 0.35 pounds per 100 square feet are given:
The sample is not large, so we use the t distribution and the t test statistic.
Since we are testing only sample B, we use the one-sample t-test. Because ttest_1samp in Python
returns a two-sided p-value by default, it is divided by 2, as our test is one-sided.
We have evidence to reject the null hypothesis, since the p-value (0.02) < level of significance (0.05).
Conclusion:
For Sample A: there is not enough evidence that the mean moisture content in the shingles is less
than 0.35.
For Sample B: there is enough evidence that the mean moisture content in the shingles is less than
0.35.
3.2 Do you think that the population mean for shingles A and B are equal? Form the
hypothesis and conduct the test of the hypothesis. What assumption do you need to
check before the test for equality of means is performed?
We have two samples, A and B, and we do not know the population standard deviations.
The samples are not large, so we use the t distribution and the t test
statistic.
Since we are comparing two samples, we use the two-sample t-test.
From the calculations done in Python we conclude that:
Two-sample t-test p-value = 0.2017496571835306
We do not have enough evidence to reject the null hypothesis, since the p-value (0.2) > level of
significance (0.05).
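Here the hypotheses are H0: mu_A = mu_B against H1: mu_A != mu_B. The usual assumptions to check before a two-sample t-test are that the samples are independent, that each is approximately normally distributed, and, for the pooled form, that the two population variances are equal. The sketch below shows the pooled test statistic and a simple variance-ratio check. The data are hypothetical, and the report does not state whether it used the pooled or the unequal-variance form.

```python
import math
import statistics


def pooled_t(a, b):
    """Return (t, degrees of freedom) for the equal-variance two-sample t-test."""
    n_a, n_b = len(a), len(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / std_error
    return t, n_a + n_b - 2


def variance_ratio(a, b):
    """Larger sample variance over the smaller; values near 1 support pooling."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return max(var_a, var_b) / min(var_a, var_b)


# Hypothetical samples, not the shingle measurements.
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 3, 4, 5, 6]

t_stat, dof = pooled_t(sample_a, sample_b)
ratio = variance_ratio(sample_a, sample_b)
print(t_stat, dof, ratio)
```

With the sample sizes given in the problem (36 and 31), the pooled test would have 36 + 31 - 2 = 65 degrees of freedom.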