
Applied Statistical Analysis

MSBA 310
Assignment I

Submission Date:
22 – Oct – 2022

Import the csv file into R and answer the following questions:

a. Compute the mean, quartiles, min, max, and standard deviation of net sales.
Interpret

summary(Net.Sales)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  13.23   39.60   59.70   77.60  100.90  287.59

sd(Net.Sales)

[1] 55.66494

Based on the output above, Pelican Stores' net sales per customer range from $13.23 to
$287.59. 25% of its customers spend up to $39.60, half spend up to $59.70 (the median), and
only 25% spend more than $100.90. On average, customers spend $77.60 (the mean).

To study the dispersion of the given data, the standard deviation is compared with the mean:

sd(Net.Sales)/mean(Net.Sales)

[1] 0.7173271

The ratio (the coefficient of variation) is 71.7%. This shows that the data is highly
dispersed; hence, the median ($59.70) is more representative of the data than the mean ($77.60).

10/11 for not mentioning the std impact on mean+- std
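The grader's remark about mean ± standard deviation can be made concrete with the summary values reported above. The following R sketch uses only those printed (rounded) figures, not the raw data, and the variable names are illustrative assumptions.

```r
# Reported summary values for net sales (rounded, copied from the output above)
m <- 77.60      # mean
s <- 55.66494   # standard deviation

# Coefficient of variation: standard deviation relative to the mean
cv <- s / m

# One-standard-deviation band around the mean
band <- c(lower = m - s, upper = m + s)

print(round(cv, 3))
print(round(band, 2))
```

The band runs from roughly $21.94 to $133.26, which is wide relative to a mean of $77.60 and is another way of seeing the high dispersion.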

b. Use an appropriate chart (justify its use) and compare net sales among regular and
promotional customers. What can this chart tell you? Explain

The chart used is the side-by-side boxplot. It is appropriate for comparing a quantitative
variable (net sales) across the categories of a qualitative variable (customer type). The
boxplot groups the net sales data by customer type and shows, for each category, the median,
the first and third quartiles, the minimum and maximum values, and any outliers.

net_sales_by_customer_type=boxplot(Net.Sales~Type.of.Customer,col=c("blue","red"),
main="Net Sales by Customer Type",ylab = "Net Sales",medcol="white")

By comparing the two boxplots, we realize that:

1- The median of the promotional customers is higher than that of the regular customers. This
shows that a typical promotional customer spends more than a typical regular customer: the
upper 50% of promotional customers all spend more than the median regular customer.

2- The range of the net sales is wider for the promotional customers. This suggests that the
amounts spent by the promotional customers are more dispersed than those spent by the
regular customers.

3- The promotional customers’ data has more outliers. This also shows that it has more variation
and dispersion.
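The comparison of medians and outliers can also be made numerically. The sketch below uses a small hypothetical data set (not the Pelican Stores data) and relies on boxplot.stats(), which flags points lying more than 1.5 × IQR beyond the hinges, the same rule boxplot() uses when drawing outliers.

```r
# Hypothetical net sales for two customer types (NOT the assignment data)
sales <- c(20, 35, 40, 45, 60,  30, 50, 70, 90, 110, 400)
type  <- c(rep("Regular", 5), rep("Promotional", 6))

# Median per group
medians <- tapply(sales, type, median)

# Points beyond 1.5 * IQR from the hinges, per group
outliers <- tapply(sales, type, function(x) boxplot.stats(x)$out)

print(medians)
print(outliers)
```

In this toy example only the value 400 in the promotional group is flagged as an outlier.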

12/12

c. Produce Quantile-Quantile plot for the customer age. Comment on the shape of the
distribution.

qqnorm(Age, pch=16, col="blue", main="Normal Q-Q Plot for Customer Age")

qqline(Age,col="red")

This plot shows that the data tends to be along the qqline of the normal distribution (theoretical
quantiles). Hence, we could approximate the data to the normal distribution, though there are
some deviations, especially at the two extremes of the data. This could be because of some
outlier values.

6/10 it is not normal
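A Q-Q plot is a visual check, and a formal test such as Shapiro-Wilk is one common complement to it. The sketch below runs shapiro.test() on simulated, deliberately right-skewed values; the data and the name age_sim are assumptions for illustration, not the assignment's Age column.

```r
set.seed(1)

# Hypothetical right-skewed "ages" (NOT the assignment data)
age_sim <- 20 + rexp(100, rate = 1 / 15)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
sw <- shapiro.test(age_sim)

# A small p-value is evidence against normality
print(sw$p.value)
```

Applied to the real Age variable, a p-value below 0.05 would support the grader's reading that the distribution is not normal.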

d. Compute a 95% confidence interval for the mean net sales generated by regular
customers. Interpret

regular_sales=subset(Net.Sales,Type.of.Customer=="Regular")

length(regular_sales)

Output 1:

[1] 30

Since the sample size is at least 30, the sampling distribution of the mean can be approximated
by the normal distribution. Because the population variance is unknown, we use the t-based
interval.

t.test(regular_sales)

Output 2:

Interpretation: We are 95% confident that the mean net sales per regular customer lies
between $48.897 and $75.086.
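The interval reported by t.test() is the sample mean plus or minus a t critical value times the standard error. A minimal sketch of that calculation follows, using a hypothetical sample of 30 values (the vector x is simulated and is not the assignment data).

```r
set.seed(42)

# Hypothetical sample of 30 net sales values (NOT the assignment data)
x <- round(rgamma(30, shape = 2, scale = 30), 2)

n     <- length(x)
se    <- sd(x) / sqrt(n)          # standard error of the mean
tcrit <- qt(0.975, df = n - 1)    # two-sided 95% critical value

# Manual interval: mean -/+ tcrit * se
ci_manual <- mean(x) + c(-1, 1) * tcrit * se

# Same interval as computed by t.test()
ci_ttest <- as.numeric(t.test(x)$conf.int)

print(ci_manual)
print(ci_ttest)
```

The two printed intervals agree, which shows what t.test() does when no other arguments are supplied.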

7/7

e. Compute a 95% confidence interval for the mean net sales generated by
promotional customers. Interpret

promotional_sales=subset(Net.Sales,Type.of.Customer=="Promotional")

length(promotional_sales)

Output 1:

[1] 70

Since the sample size is above 30, the sampling distribution of the mean can be approximated
by the normal distribution. Because the population variance is unknown, we use the t-based
interval.

t.test(promotional_sales)

Output 2:

Interpretation: We are 95% confident that the mean net sales per promotional customer lies
between $69.635 and $98.945.

7/7

f. Compare the findings of parts (d) and (e). What conclusion can you provide?
Elaborate

As a first step, by comparing the 95% confidence intervals, we notice that the estimated mean
net sales of promotional customers is higher than that of the regular customers. The two
intervals do overlap, however (between $69.635 and $75.086), so they are not conclusive on
their own.

In order to further test this observation (hypothesis), we perform a two-sample t-test.

Hypothesis Test:

H0: μ_promotional ≤ μ_regular

H1: μ_promotional > μ_regular

At a first step, the F test is used to compare the variances:

var.test(Net.Sales~Type.of.Customer,data=shoppers)

Output:

The p-value is < 0.05. Hence, we reject the null hypothesis of the F test (that the true
variances are equal) and treat the variances as unequal in the t-test that follows.

t.test(Net.Sales~relevel(factor(Type.of.Customer), ref="Promotional"),
data=shoppers,alternative="greater")

Output:

The p-value = 0.012 < 0.05. Therefore, we reject H0. We have enough evidence, at the 5%
significance level, to say that the mean net sales of promotional customers is higher than
that of regular customers.

6/8 he must say that they are not higher since the Confidence intervals overlap
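The overlap the grader refers to can be checked directly from the bounds reported in parts (d) and (e). The sketch below uses only those reported numbers; the variable names are illustrative.

```r
# Reported 95% confidence intervals for mean net sales
ci_regular     <- c(48.897, 75.086)
ci_promotional <- c(69.635, 98.945)

# Two intervals overlap when the larger lower bound does not exceed
# the smaller upper bound
overlap <- max(ci_regular[1], ci_promotional[1]) <=
           min(ci_regular[2], ci_promotional[2])

print(overlap)
```

Here overlap is TRUE because 69.635 is below 75.086. Overlapping individual intervals are a conservative signal and do not by themselves settle whether the difference in means is significant, which is why a two-sample test remains informative.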

g. What proportion of promotional customers should Pelican Stores expect in general?
Use a 95% confidence level and interpret

prop.test(length(promotional_sales),length(Net.Sales))

Output:

The 95% confidence interval is: [0.599,0.785]

This means that we are 95% confident that the percentage of promotional customers of Pelican
Stores lies between 59.9% and 78.5%.
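The same interval can be reproduced from the counts alone, since the sample sizes reported above are 70 promotional customers out of 100 in total. A minimal sketch, with x and n as illustrative names:

```r
# Counts taken from the reported sample sizes (70 promotional, 30 regular)
x <- 70    # promotional customers
n <- 100   # all customers

# Default prop.test(): Wilson score interval with continuity correction
ci <- as.numeric(prop.test(x, n)$conf.int)

print(round(ci, 3))
```

By default prop.test() applies a continuity correction, so the interval is slightly wider than the uncorrected one obtained with correct = FALSE.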

8/8

h. Is it reasonable to believe that the percentage of promotional customers exceeds 80%?
Justify

H0: p ≤ 0.8

H1: p > 0.8

n × p = 100 × 0.7 = 70 > 5

n × (1 − p) = 100 × 0.3 = 30 > 5

Both counts are large enough for the normal approximation, so we do not apply the continuity
correction in the code (correct = F).

prop.test(length(promotional_sales),length(Net.Sales),p=0.8, alternative="greater", correct = F)

The resulting p-value = 0.994 > 0.05. Hence, we do not have enough evidence to reject H0, and
it is not reasonable to believe that the proportion exceeds 0.8.
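Without the continuity correction, this one-sided test is the usual z-test for a proportion. The sketch below computes it by hand from the reported counts (70 of 100) and compares it with prop.test(); the variable names are illustrative.

```r
x  <- 70     # promotional customers (from the reported sample sizes)
n  <- 100    # all customers
p0 <- 0.8    # hypothesized proportion under H0

p_hat <- x / n

# z statistic uses the standard error under H0
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Upper-tail p-value for H1: p > p0
p_manual <- pnorm(z, lower.tail = FALSE)

# Same test via prop.test() without continuity correction
p_r <- prop.test(x, n, p = p0, alternative = "greater",
                 correct = FALSE)$p.value

print(z)
print(p_manual)
print(p_r)
```

The z statistic is -2.5: the sample proportion lies below 0.8, so the upper-tail p-value is close to 1.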

5/6 for the redundancy

i. Use an appropriate chart and visualize the relationship between age and net sales. Do you
think there is an association between them? Justify

Since both age and net sales are quantitative variables, we use a scatterplot to visualize the
relationship between them.

plot(Net.Sales~Age, pch=19, col="blue", main="Scatterplot for Age and Net Sales")

abline(lm(Net.Sales~Age), col = "red")

Output:

There does not seem to be an association (correlation) between these two variables, since the
fitted regression line is nearly horizontal, which shows that net sales do not change
systematically with age.
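The visual impression from the scatterplot can be backed by a correlation coefficient and its test. The sketch below uses simulated, independent values (hypothetical, not the assignment data) and also shows how the slope of the fitted line relates to the correlation.

```r
set.seed(7)

# Hypothetical ages and net sales generated independently (NOT the assignment data)
age       <- sample(20:70, 100, replace = TRUE)
net_sales <- round(rgamma(100, shape = 2, scale = 40), 2)

# Pearson correlation with a test of H0: correlation = 0
ct <- cor.test(age, net_sales)
r  <- unname(ct$estimate)

# Slope of the least-squares line drawn by abline(lm(...))
slope <- coef(lm(net_sales ~ age))[["age"]]

print(r)
print(ct$p.value)
print(slope)
```

The slope equals r × sd(net_sales) / sd(age), so a correlation near zero corresponds to a nearly flat line.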

12/12

j. We want to test whether the method of payment is related to the type of customer. What is
the most relevant test to use? Justify

Method of payment: qualitative

Type of customer: qualitative

Since both variables are qualitative, we use the chi-squared independence test.

7/7

k. Can you conclude that the method of payment and customer type are related? Conduct the
test at a 5% significance level and specify the testing steps

tbl = table(Type.of.Customer,Method.of.Payment)

chisq.test(tbl)

Output:

p-value = 0.000355 < 0.05. Hence, at the 5% significance level, we reject the null
hypothesis (which states that the method of payment is independent of the type of customer).
We have enough evidence to say that the method of payment is related to the type of customer.

8/12 must mention explicitly what h0 and h1 are
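The testing steps can be laid out explicitly: H0 states that method of payment and customer type are independent, and H1 states that they are related. The sketch below walks through the statistic by hand on a hypothetical 2 × 3 table (the counts and the payment labels are invented for illustration) and compares it with chisq.test().

```r
# Hypothetical contingency table (NOT the assignment data)
tbl <- matrix(c(10, 20, 30,
                25, 10,  5),
              nrow = 2, byrow = TRUE,
              dimnames = list(Type    = c("Promotional", "Regular"),
                              Payment = c("Card", "Cash", "Check")))

# Step 1: expected counts under H0 (independence)
expected <- outer(rowSums(tbl), colSums(tbl)) / sum(tbl)

# Step 2: chi-squared statistic and degrees of freedom
stat <- sum((tbl - expected)^2 / expected)
df   <- (nrow(tbl) - 1) * (ncol(tbl) - 1)

# Step 3: p-value from the upper tail of the chi-squared distribution
p_manual <- pchisq(stat, df, lower.tail = FALSE)

# Same test via chisq.test()
res <- chisq.test(tbl)

print(stat)
print(p_manual)
print(res$p.value)
```

The decision rule is to reject H0 when the p-value is below 0.05.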
