
Contents

Problem 1
1.1. Use methods of descriptive statistics to summarize data. Which Region and which Channel spent the
most? Which Region and which Channel spent the least?
1.2. There are 6 different varieties of items that are considered. Describe and comment/explain all the varieties
across Region and Channel? Provide a detailed justification for your answer.
1.3. On the basis of the descriptive measure of variability, which item shows the most inconsistent behavior?
Which items show the least inconsistent behavior?
1.4. Are there any outliers in the data? Back up your answer with a suitable plot/technique with the help of
detailed comments.
1.5. On the basis of your analysis, what are your recommendations for the business? How can your analysis
help the business to solve its problem? Answer from the business perspective.
Problem 2
2.1. For this data, construct the following contingency tables (Keep Gender as row variable)
2.1.1. Gender and Major
2.1.2. Gender and Grad Intention
2.1.3. Gender and Employment
2.1.4. Gender and Computer
2.2. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following questions:
2.2.1. What is the probability that a randomly selected CMSU student will be male?
2.2.2. What is the probability that a randomly selected CMSU student will be female?
2.3. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following questions:
2.3.1. Find the conditional probability of different majors among the male students in CMSU
2.3.2. Find the conditional probability of different majors among the female students of CMSU
2.4. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following questions:
2.4.1. Find the probability that a randomly chosen student is a male and intends to graduate.
2.4.2. Find the probability that a randomly selected student is a female and does NOT have a laptop
2.5. Assume that the sample is representative of the population of CMSU. Based on the data, answer the
following questions:
2.5.1. Find the probability that a randomly chosen student is a male or has full-time employment?
2.5.2. Find the conditional probability that given a female student is randomly chosen, she is majoring in
international business or management.
2.6. Construct a contingency table of Gender and Intent to Graduate at 2 levels (Yes/No). The Undecided
students are not considered now and the table is a 2x2 table. Do you think the graduate intention and being
female are independent events?
2.7. Note that there are three numerical (continuous) variables in the data set, Salary, Spending and Text
Messages. Answer the following questions based on the data.
2.7.1. If a student is chosen randomly, what is the probability that his/her GPA is less than 3?
2.7.2. Find the conditional probability that a randomly selected male earns 50 or more. Find the conditional
probability that a randomly selected female earns 50 or more.
2.8. Note that there are three numerical (continuous) variables in the data set, Salary, Spending and Text
Messages. For each of them comment whether they follow a normal distribution. Write a note summarizing
your conclusions. [Recall that symmetric histogram does not necessarily mean that the underlying distribution
is symmetric]
Problem 3
3.1 Do you think there is evidence that mean moisture contents in both types of shingles are within the
permissible limits? State your conclusions clearly showing all steps.
3.2 Do you think that the population mean for shingles A and B are equal? Form the hypothesis and conduct the
test of the hypothesis. What assumption do you need to check before the test for equality of means is
performed?

1.1 Use methods of descriptive statistics to summarize data. Which
Region and which Channel spent the most? Which Region and
which Channel spent the least?

Descriptive statistics help describe and understand the features of a specific data set by giving short
summaries about the sample and measures of the data. The most recognized types of descriptive
statistics are measures of center: the mean, median, and mode, which are used at almost all levels
of math and statistics.

                  count  unique  top    freq  mean     std      min  25%      50%     75%      max
Buyer/Spender     440    NaN     NaN    NaN   220.5    127.161  1    110.75   220.5   330.25   440
Channel           440    2       Hotel  298   NaN      NaN      NaN  NaN      NaN     NaN      NaN
Region            440    3       Other  316   NaN      NaN      NaN  NaN      NaN     NaN      NaN
Fresh             440    NaN     NaN    NaN   12000.3  12647.3  3    3127.75  8504    16933.8  112151
Milk              440    NaN     NaN    NaN   5796.27  7380.38  55   1533     3627    7190.25  73498
Grocery           440    NaN     NaN    NaN   7951.28  9503.16  3    2153     4755.5  10655.8  92780
Frozen            440    NaN     NaN    NaN   3071.93  4854.67  25   742.25   1526    3554.25  60869
Detergents_Paper  440    NaN     NaN    NaN   2881.49  4767.85  3    256.75   816.5   3922     40827
Delicatessen      440    NaN     NaN    NaN   1524.87  2820.11  3    408.25   965.5   1820.25  47943

Table 2. Summary of the data

Fig.3 – Region vs Products

Fig.3 – Channel vs Products

After totaling the annual spending on all products for each region and each channel, we found that:

Other is the Region that spent the most and Oporto is the Region that spent the least.

Hotel is the Channel that spent the most and Retail is the Channel that spent the least.
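
The totals behind this comparison can be sketched in plain Python. This is only an illustration under assumptions: the three records are made-up placeholders (not rows from the data set), `total_by` is a hypothetical helper, and only the column names are taken from the summary table.

```python
from collections import defaultdict

ITEMS = ["Fresh", "Milk", "Grocery", "Frozen", "Detergents_Paper", "Delicatessen"]

# Placeholder records for illustration only, NOT values from the real data.
rows = [
    {"Channel": "Hotel", "Region": "Other", "Fresh": 100, "Milk": 20,
     "Grocery": 30, "Frozen": 10, "Detergents_Paper": 5, "Delicatessen": 5},
    {"Channel": "Retail", "Region": "Other", "Fresh": 40, "Milk": 30,
     "Grocery": 50, "Frozen": 5, "Detergents_Paper": 20, "Delicatessen": 5},
    {"Channel": "Hotel", "Region": "Oporto", "Fresh": 60, "Milk": 10,
     "Grocery": 15, "Frozen": 8, "Detergents_Paper": 4, "Delicatessen": 3},
]


def total_by(records, key):
    """Sum spending over all six items for each distinct value of `key`."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += sum(record[item] for item in ITEMS)
    return dict(totals)


region_totals = total_by(rows, "Region")
channel_totals = total_by(rows, "Channel")

# The group with the largest total "spent the most"; the smallest "spent the least".
print("Region  most:", max(region_totals, key=region_totals.get),
      "least:", min(region_totals, key=region_totals.get))
print("Channel most:", max(channel_totals, key=channel_totals.get),
      "least:", min(channel_totals, key=channel_totals.get))
```

On the real data the same grouping would be applied to all 440 rows.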

1.2 There are 6 different varieties of items that are considered. Describe
and comment/explain all the varieties across Region and Channel?
Provide a detailed justification for your answer.

From the plot above and the values below, Other is the region with high annual spending on products. Within it,
Fresh appears to be the most purchased item, since its annual spending is high, and Delicatessen the least
purchased, since its annual spending is low. The graph shows the same pattern in all three regions: Fresh has
high annual spending and Delicatessen has low annual spending.

By Channel, Hotel has high annual spending on products, and Fresh is the most purchased item, with high
annual spending. Delicatessen is the least purchased variety, with low annual spending. The graph shows
that in both channels Fresh has high annual spending and Delicatessen has low annual spending.

The pivot table below shows that some categories, namely Milk, Grocery and Detergents_Paper, have higher
spending in the Retail channel than in Hotel, across all regions. Fresh and Frozen, on the other hand, have higher
spending in the Hotel channel than in Retail, across all regions.

1.3 On the basis of a descriptive measure of variability, which item
shows the most inconsistent behavior? Which items show the least
inconsistent behavior?
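By standard deviation, Fresh has the highest value, so it is the most inconsistent in absolute terms.

Delicatessen has the smallest standard deviation, so it is the most consistent in absolute terms.

By coefficient of variation (standard deviation divided by the mean), Fresh has the lowest value, so it is the most consistent relative to its mean.

Delicatessen has the highest coefficient of variation, so it is the most inconsistent relative to its mean.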
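
The coefficient of variation ranking can be checked directly from the mean and std columns of the summary table. A minimal sketch, assuming only those published figures (the dictionary layout and variable names are illustrative choices):

```python
# (mean, std) per item, copied from the summary table.
summary = {
    "Fresh": (12000.3, 12647.3),
    "Milk": (5796.27, 7380.38),
    "Grocery": (7951.28, 9503.16),
    "Frozen": (3071.93, 4854.67),
    "Detergents_Paper": (2881.49, 4767.85),
    "Delicatessen": (1524.87, 2820.11),
}

# Coefficient of variation = standard deviation / mean (unitless).
cv = {item: std / mean for item, (mean, std) in summary.items()}

# Print items from most to least consistent.
for item, value in sorted(cv.items(), key=lambda pair: pair[1]):
    print(f"{item:<17} CV = {value:.2f}")

most_consistent = min(cv, key=cv.get)
least_consistent = max(cv, key=cv.get)
```

Because the item means differ by almost a factor of eight, the coefficient of variation compares spread on a common scale, which is why its ranking differs from the one given by the raw standard deviation.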

1.4 Are there any outliers in the data? Back up your answer with a suitable
plot/technique with the help of detailed comments.
Yes, there are outliers in all six items across the product range (Fresh, Milk, Grocery, Frozen, Detergents_Paper and Delicatessen).
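
One common way to back this up is the 1.5 x IQR rule that box plots use for their whiskers. The report does not say which technique was applied, so the rule itself is an assumption here; the figures are the min, 25%, 75% and max columns of the summary table.

```python
# (min, Q1, Q3, max) per item, copied from the summary table.
quartiles = {
    "Fresh": (3, 3127.75, 16933.8, 112151),
    "Milk": (55, 1533, 7190.25, 73498),
    "Grocery": (3, 2153, 10655.8, 92780),
    "Frozen": (25, 742.25, 3554.25, 60869),
    "Detergents_Paper": (3, 256.75, 3922, 40827),
    "Delicatessen": (3, 408.25, 1820.25, 47943),
}


def fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


has_high_outlier = {}
has_low_outlier = {}
for item, (minimum, q1, q3, maximum) in quartiles.items():
    lower, upper = fences(q1, q3)
    # If the extreme value lies beyond a fence, at least one outlier exists.
    has_high_outlier[item] = maximum > upper
    has_low_outlier[item] = minimum < lower
    print(f"{item:<17} upper fence = {upper:10.2f}  max = {maximum}")
```

Under this rule every item has its maximum above the upper fence, while no minimum falls below the lower fence, so the outliers all sit on the high-spending side.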

1.5 On the basis of your analysis, what are your recommendations for the
business? How can your analysis help the business to solve its problem?
Answer from the business perspective.
The analysis shows inconsistencies in spending across the different items (measured by the coefficient
of variation), which should be minimized. Spending in the Hotel and Retail channels differs and should be
brought to roughly the same level. Spending should likewise be evened out across regions. The business
also needs to focus on items other than Fresh and Grocery.

Problem 2
The Student News Service at Clear Mountain State University (CMSU) has decided to gather data about the
undergraduate students that attend CMSU. CMSU creates and distributes a survey of 14 questions and receives
responses from 62 undergraduates.

2.1. For this data, construct the following contingency tables (Keep
Gender as row variable)
2.1.1. Gender and Major
2.1.2. Gender and Grad Intention

2.1.3. Gender and Employment

2.1.4. Gender and Computer

2.2. Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:

2.2.1. What is the probability that a randomly selected CMSU student
will be male?
Female 33
Male 29
Name: Gender, dtype: int64
For this we need the number of male students out of all students in the data.
Male student probability = (Total number of male students) / (Total number of students at CMSU).
prob_male = 29/62 = 0.46774193548387094
So the probability that a randomly selected CMSU student is male is 46.77%.

2.2.2. What is the probability that a randomly selected CMSU student
will be female?
For this we need the number of female students out of all students in the data.
Female student probability = (Total number of female students) / (Total number of students at CMSU).
prob_female = 33/62 = 0.532258064516129

So the probability that a randomly selected CMSU student is female is 53.23%.
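
Both marginal probabilities follow from the gender counts shown above. A small sketch using exact fractions (the variable names are illustrative, not taken from the original notebook):

```python
from fractions import Fraction

# Gender counts as reported in the value counts output.
gender_counts = {"Female": 33, "Male": 29}
total_students = sum(gender_counts.values())

# Marginal probability of each gender = count / total.
prob = {gender: Fraction(count, total_students)
        for gender, count in gender_counts.items()}

for gender, p in prob.items():
    print(f"P({gender}) = {p} = {float(p):.2%}")
```

The two probabilities add up to 1 because every student in the sample falls into exactly one of the two groups.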
2.3. Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:
2.3.1. Find the conditional probability of different majors among the
male students in CMSU.
From the contingency table of Gender and Major we take the total number of males and the number of males
in each major.
Below is the output from Python:
Probability of Males opting for Accounting. is 4/29= 13.79%
Probability of Males opting for CIS. is 1/29= 3.45%
Probability of Males opting for Economics/Finance. is 4/29= 13.79%
Probability of Males opting for International Business. is 2/29= 6.90%
Probability of Males opting for Management. is 6/29= 20.69%
Probability of Males opting for Other. is 4/29= 13.79%
Probability of Males opting for Retailing/Marketing. is 5/29= 17.24%
Probability of Males opting for Undecided. is 3/29= 10.34%
From this output, Management is the most popular major among male students (20.69%) and CIS is
the least preferred (3.45%).

2.3.2 Find the conditional probability of different majors among the
female students of CMSU.
From the contingency table of Gender and Major we take the total number of females and the number of females
in each major.
Below is the output from Python:
Probability of Females opting for Accounting. is 3/33= 9.09%
Probability of Females opting for CIS. is 3/33= 9.09%
Probability of Females opting for Economics/Finance. is 7/33= 21.21%
Probability of Females opting for International Business. is 4/33= 12.12%
Probability of Females opting for Management. is 4/33= 12.12%
Probability of Females opting for Other. is 3/33= 9.09%
Probability of Females opting for Retailing/Marketing. is 9/33= 27.27%
Probability of Females opting for Undecided. is 0/33 =0%
From this output, Retailing/Marketing is the most popular major among female students (27.27%), and
no female student is undecided about her major.
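
The conditional probabilities for both genders can be reproduced from the counts in the outputs above. A minimal sketch, assuming those counts; `conditional` is a hypothetical helper name:

```python
# Counts per major, taken from the numerators in the Python output above.
majors = {
    "Male": {
        "Accounting": 4, "CIS": 1, "Economics/Finance": 4,
        "International Business": 2, "Management": 6, "Other": 4,
        "Retailing/Marketing": 5, "Undecided": 3,
    },
    "Female": {
        "Accounting": 3, "CIS": 3, "Economics/Finance": 7,
        "International Business": 4, "Management": 4, "Other": 3,
        "Retailing/Marketing": 9, "Undecided": 0,
    },
}


def conditional(counts):
    """P(major | gender): each count divided by that gender's row total."""
    row_total = sum(counts.values())
    return {major: count / row_total for major, count in counts.items()}


cond = {gender: conditional(counts) for gender, counts in majors.items()}

for gender, probs in cond.items():
    for major, p in probs.items():
        print(f"P({major} | {gender}) = {p:.2%}")

# Most likely major within each gender.
top_major = {gender: max(probs, key=probs.get) for gender, probs in cond.items()}
```

The denominator is the row total for the gender (29 or 33), not the 62 students overall, which is what makes these conditional rather than joint probabilities.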

2.4. Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:
2.4.1. Find the probability that a randomly chosen student is a male
and intends to graduate.
The number of "Male and Intends to Graduate" divided by the total = 17/62 = 0.27419354838709675
From the contingency table of Gender vs Grad Intention, the probability that a randomly selected CMSU student
is male and intends to graduate is 27.42%.

2.4.2 Find the probability that a randomly selected student is a female
and does NOT have a laptop.
The number of "Female and Does not have a laptop" divided by the total = 4/62 = 0.06451612903225806
From the contingency table of Gender vs Computer, the probability that a randomly selected CMSU student
is female and does not have a laptop is 6.45%.
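
Both answers in 2.4 are joint probabilities: a single cell of the contingency table divided by the grand total. A short sketch using the counts quoted above (the variable names are illustrative):

```python
total_students = 62

# Cell counts quoted from the contingency tables.
male_and_intends_to_graduate = 17
female_and_no_laptop = 4

# Joint probability = cell count / grand total.
p_male_grad = male_and_intends_to_graduate / total_students
p_female_no_laptop = female_and_no_laptop / total_students

print(f"P(Male and intends to graduate) = {p_male_grad:.2%}")
print(f"P(Female and no laptop)         = {p_female_no_laptop:.2%}")
```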

2.5. Assume that the sample is representative of the population of
CMSU. Based on the data, answer the following question:
2.5.1. Find the probability that a randomly chosen student is a male or
has full-time employment?
(number of males + number of students in full-time employment − number of males in full-time employment) / Total
number of students = (29+10-7)/62 = 0.5161290322580645
From the contingency table of Gender vs Employment, the probability that a randomly chosen CMSU student is
male or has full-time employment is 51.61%.
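
The calculation above is the addition rule, P(A or B) = P(A) + P(B) - P(A and B), applied to counts. A minimal sketch with the same figures; `prob_union` is a hypothetical helper name:

```python
def prob_union(n_a, n_b, n_both, n_total):
    """Addition rule on counts: (|A| + |B| - |A and B|) / total."""
    return (n_a + n_b - n_both) / n_total


# 29 males, 10 full-time employed students, 7 males employed full time, 62 students.
p_male_or_full_time = prob_union(n_a=29, n_b=10, n_both=7, n_total=62)
print(f"P(Male or full-time) = {p_male_or_full_time:.2%}")
```

Subtracting the 7 full-time employed males avoids counting them twice, once as males and once as full-time employees.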

2.5.2. Find the conditional probability that given a female student is
randomly chosen, she is majoring in international business or
management.
Let IB denote the event that a student is majoring in International Business or Management, and F the event
that the student is female. The required conditional probability is P(IB|F).
The total number of possible outcomes is 33, as there are 33 female students in the sample. Of those 33 females, 8
are majoring in International Business or Management, and thus
P(IB|F) = 8/33 = 0.24242424242424243
From the contingency table of Gender vs Major, the conditional probability that a randomly chosen female
student is majoring in International Business or Management is 24.24%.
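
The count of 8 comes from the female row of the major counts in 2.3.2: 4 in International Business plus 4 in Management. Written out as a worked equation, assuming each student is counted under exactly one major so that the two groups do not overlap:

```latex
P(\mathrm{IB} \mid F)
  = \frac{n(\mathrm{IB} \cap F)}{n(F)}
  = \frac{4 + 4}{33}
  = \frac{8}{33}
  \approx 0.2424
```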

CHECKING THE OUTLIERS IN THE DATA

BEFORE TREATING OUTLIERS AFTER TREATING OUTLIERS


1.2 Encode the data (having string values) for Modelling. Split the data into train
and test (70:30). Apply Linear regression using scikit-learn. Perform checks for
significant variables using an appropriate method from statsmodels. Create multiple
models and check the performance of Predictions on Train and Test sets using
R-squared, RMSE & Adjusted R-squared. Compare these models and select the best one
with appropriate reasoning.

ENCODING THE STRING VALUES


GET DUMMIES


Dummies have been encoded.


A linear regression model cannot take categorical (string) values directly, so we have encoded
the categorical values as integers.
Train/Test split and Linear Regression model:
