
Data Analysis Report on the Location of

MineBlox’s New Mall

Submitted by:

Gaurano, Llana Kirsten T.

Submitted to:

MineBlox

December 05, 2021


Table of Contents
Chapter 1: Introduction

Chapter 2: Results and Discussion

Age_Group
Income_Level
Educational Attainment
Best Location Based on Average Expenditure per Age Group
Best Location Based on Average Expenditure per Income Level
Relationships
Income and Expenditure
Age and Expenditure
Age and Income
Best Variable for Prediction of Sales
Location with Best Sales Based on Regression Model
Significant Differences Between Mean Expenditures
Age_Groups
Education
Significant Differences Between Mean Income Considering Both Age Group and Education
Age_Group
Education
Interaction Effect Between Age Group and Education
Visualizations

Chapter 3: Conclusion

Chapter 1: Introduction
MineBlox is a US-based development company that is planning to put up a new mall in
an attempt to expand its business. The company has listed 3 suitable locations, each with
unique population demographics, and is now trying to decide which location is the most
appropriate for the new building so that it can generate the best sales. Descriptions of the
locations are as follows:
● Location 1 - A newly developed area that is mostly composed of new college graduates.
Thus, rent and cost of apartments are high.
● Location 2 - Near an original suburban area that is mostly composed of middle-aged
people with children. It has been a prime location for stores, and citizens here live
comfortably.
● Location 3 - Beside a retirement home, in an area that is mostly composed of senior
citizens and retirees. Nature is well preserved and the area has a more relaxed atmosphere.

The goal of this report is to answer MineBlox’s questions and to uncover the best
location. Other questions to be answered are as follows:
● What is the average income of each age group and the implications of its standard
deviation?
● What is the average income of each income level and the implications of its standard
deviation?
● Based on the average expenditure of each age group, which location is the best?
● Based on the average expenditure of each income level, which location is the best?
● What is the direction of the relationship between income and expenditure and how
strong is it?
● What is the direction of the relationship between age and expenditure and how strong is
it?
● What is the direction of the relationship between age and income and how strong is it?
● Which variable would best predict the sales of the locations?
● Utilizing average income per income level, which location would be predicted to have the
most sales using the regression model?
● Between the mean expenditures of the age groups, are there any significant differences?
If so, which groups specifically?
● Between the mean expenditures of the educational attainments, are there any significant
differences? If so, which groups specifically?
● Considering age group and education at the same time, are there any significant
differences between the mean incomes of the age groups? If so, which age groups’
incomes specifically?
● Considering age group and education at the same time, are there any significant
differences between the mean incomes of the educational attainments? If so, which
educational attainments’ incomes specifically?
● Does an interaction effect exist between education and age group with regard to
income?

The data set was provided by MineBlox and consists of the results of a previous survey from
another mall. A total of 2,240 entries were recorded, each with a unique ID. The following
information was collected from each participant:
● Year_Birth - year of birth
● Education - highest educational attainment
● Marital_Status - marital status
● Income - income (USD)
● Kidhome - number of household members younger than 13 years old
● Teenhome - number of household members aged 13 to 19 years old
● Dt_Customer - date customer signed up with store
● Recency - days since last visit
● MntWines - amount spent on wine products (USD)
● MntFruits - amount spent on fruit products (USD)
● MntMeatProducts - amount spent on meat products (USD)
● MntFishProducts - amount spent on fish products (USD)
● MntSweetProducts - amount spent on sweet products (USD)
● MntGoldProds - amount spent on gold products (USD)
● NumDealsPurchases - number of times a deal promotion was taken
● NumWebPurchases - number of times a web promotion was taken
● NumCatalogPurchases - number of times a catalog promotion was taken
● NumStorePurchases - number of times a store promotion was taken
● NumWebVisitsMonth - number of visits to the store website since signing up to the store

The original data was duplicated and cleaned in Google Sheets to make the analysis easier and the data better organized. First, entries with blank values were removed because they might affect the results. Second, reported incomes of less than 24,000 USD or more than 423,000 USD were removed. According to salaryexplorer (2021), the lowest average yearly income of a person working in the US is 24,000 USD, while the highest is 423,000 USD. The given data contained extreme values at both ends. For example, one entry had an income of 1,703 USD, which is far lower than the others, and another had an income of 666,666 USD, which is far higher. Thus, I used the statistics provided by salaryexplorer (2021) as cutoffs in order to properly represent the respondents in each income level, to keep the data from being too spread out, and to reduce the number of entries. Next, respondents who were 100 years old or older were removed because their data may be outdated and their ages are much higher than the rest, which could affect the results. Then, respondents who listed Absurd, Alone, or YOLO in Marital_Status were removed because these categories had too few entries to be represented accurately. Lastly, the answer “2n cycle” in Education was changed to “Master” because it is another name for the same level, which makes the data uniform.
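The cleaning rules above can be sketched in Python as a simple row filter. This is only an illustration of the stated rules, not the Google Sheets procedure that was actually used. The function name `clean_row`, the dictionary layout, the treatment of `None` and empty strings as blanks, and the three sample rows are all assumptions made for the example.

```python
# Hypothetical sketch of the cleaning rules described above.
MIN_INCOME = 24_000    # lower income cutoff in USD
MAX_INCOME = 423_000   # upper income cutoff in USD
DROPPED_MARITAL = {"Absurd", "Alone", "YOLO"}
REFERENCE_YEAR = 2021


def clean_row(row):
    """Return a cleaned copy of `row`, or None if the row should be removed."""
    # Rule 1: remove rows with blank entries (assumed to be None or "").
    if any(value is None or value == "" for value in row.values()):
        return None
    # Rule 2: keep only incomes inside the cutoffs.
    if not MIN_INCOME <= row["Income"] <= MAX_INCOME:
        return None
    # Rule 3: remove respondents who are 100 years old or older.
    if REFERENCE_YEAR - row["Year_Birth"] >= 100:
        return None
    # Rule 4: remove the rare marital statuses.
    if row["Marital_Status"] in DROPPED_MARITAL:
        return None
    # Rule 5: relabel "2n cycle" as "Master".
    cleaned = dict(row)
    if cleaned["Education"] == "2n cycle":
        cleaned["Education"] = "Master"
    return cleaned


# Made-up sample rows for illustration only (not taken from the data set).
rows = [
    {"Year_Birth": 1980, "Education": "2n cycle", "Marital_Status": "Married", "Income": 50_000},
    {"Year_Birth": 1975, "Education": "PhD", "Marital_Status": "Single", "Income": 1_703},
    {"Year_Birth": 1990, "Education": "Graduation", "Marital_Status": "YOLO", "Income": 60_000},
]
cleaned_rows = [c for c in (clean_row(r) for r in rows) if c is not None]
```

Only the first sample row survives the filter, with its Education relabeled to "Master".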

For the data processing, I inserted a new column called Age as of 2021, which was obtained by subtracting the values in Year_Birth from 2021. The purpose of this column is to make it easy to track and group the respondents’ ages. Then I created another column called Age_Group to group their ages into three categories. Age group 1 is for people aged 25 to 40, group 2 is for people aged 41 to 70, and group 3 is for people aged 71 and above. This column was created because one of the objectives is based on these groups. Another column called Income_Level was also created to group the income of each respondent. Low is for respondents with incomes less than or equal to 25,000 USD, Middle is for incomes between 25,001 USD and 75,000 USD, and High is for incomes greater than or equal to 75,001 USD. This column was also created because it is the basis for one of the objectives. Lastly, the Total_Expenditure column was inserted by getting the sum of MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, and MntGoldProds, which will also be a basis for one of the questions. After these steps were performed, 1,985 unique respondents remained.
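The derived columns described above can be expressed as small Python functions. This is a sketch of the stated rules only. The function names are hypothetical, and returning `None` for ages below 25 is an assumption, since the report does not say how such ages would be handled.

```python
REFERENCE_YEAR = 2021

SPENDING_COLUMNS = (
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
)


def age_as_of_2021(year_birth):
    """Age as of 2021: the birth year subtracted from 2021."""
    return REFERENCE_YEAR - year_birth


def age_group(age):
    """Group 1: 25 to 40, group 2: 41 to 70, group 3: 71 and above."""
    if age >= 71:
        return 3
    if age >= 41:
        return 2
    if age >= 25:
        return 1
    return None  # assumption: ages below 25 fall outside the defined groups


def income_level(income):
    """Low: up to 25,000 USD, Middle: 25,001 to 75,000 USD, High: 75,001 USD and up."""
    if income <= 25_000:
        return "Low"
    if income <= 75_000:
        return "Middle"
    return "High"


def total_expenditure(row):
    """Sum of the six spending columns of one respondent."""
    return sum(row[column] for column in SPENDING_COLUMNS)
```

For example, a respondent born in 1980 is 41 as of 2021 and falls in age group 2.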

The data were analyzed by obtaining the average income and average total expenditure of each age group, income level, and educational attainment. The standard deviation of income and expenditure for each age group, income level, and educational attainment was also calculated. In addition, various charts were created to observe the shopping habits and trends of the respondents. For the locations, the total revenue of each location was obtained by multiplying the average expenditures by their designated population percentages and adding up the products. The location with the highest total revenue was then chosen as the most suitable location for MineBlox’s new mall because it would be the most beneficial.

Next, the correlation coefficient (r) and coefficient of determination (R²) were determined to find out how strongly the variables are related and how accurate predictions based on them would be. The closer the r value is to 1, the stronger the relationship between the two quantities. Likewise, the closer the R² value is to 1, the more accurate the predictions, while values far from 1 indicate the opposite. Various scatterplots were created to observe the direction of the relationships and the variables’ influence on each other. Moreover, a regression model was made based on the best predicting variable in order to obtain a formula for estimating the sales of each location. The x value in the formula was substituted with the average income of each income level, and the location with the greatest total estimated sales was chosen.

The last analyses performed were the ANOVA tests and Tukey’s Pairwise test, which were used to determine the significant differences in the mean expenditures and incomes of the respondents’ age groups and education. The one-way ANOVA test was used to determine whether there was a significant difference in the mean expenditure of the age groups and of the education levels separately. First, Levene’s test was performed to check whether the groups had equal variances. If the p-value of this test was less than 0.05, the variances of the groups are not equal, and the p-value of the Welch F test was used to determine whether there are significant differences. However, if Levene’s test showed a p-value greater than 0.05, the original ANOVA p-value was used instead. The PAST 4.08 software was used to obtain the p-values. The p-values were then compared to the significance level of 0.05. If a p-value was less than 0.05, it was concluded that there were differences among the groups. However, this does not establish which groups specifically differ. Thus, Tukey’s Pairwise test was conducted to identify which pairs of groups had p-values less than 0.05.

For the significant differences in mean income considering both age group and education, a two-way ANOVA test was used because in this case there are 3 null hypotheses. The PAST 4.08 software was again used to obtain the p-values for age group, education, and their interaction. The first independent variable was age group, the second independent variable was education, and the dependent variable was income. Each p-value was compared to the usual significance level of 0.05, and if it is less than that, we establish that there are differences among the groups. The same goes for the interaction: if its p-value is less than 0.05, then there is an interaction between the two variables, while a p-value greater than 0.05 means that no significant interaction was found. Because this still does not tell us which categories of education and age group differ, Tukey’s Pairwise test was again used to identify which groups had p-values less than 0.05.
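The decision rule described above (Levene’s test first, then either the Welch F test or the ordinary one-way ANOVA) can be summarized in a few lines of Python. This is a minimal sketch of the rule only. It does not compute any test statistics, and the function names are hypothetical.

```python
ALPHA = 0.05  # significance level used throughout the report


def choose_omnibus_test(levene_p, alpha=ALPHA):
    """Pick which p-value to read, based on Levene's test for equal variances."""
    # A small Levene p-value means the variances are not equal,
    # so the Welch F test is read instead of the ordinary ANOVA.
    return "Welch F test" if levene_p < alpha else "one-way ANOVA"


def is_significant(p_value, alpha=ALPHA):
    """True when the p-value indicates a difference among the groups."""
    return p_value < alpha
```

A significant result from either test only says that at least two groups differ, which is why Tukey’s Pairwise test follows.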

Chapter 2: Results and Discussion


The results of the data analysis and processing are presented in this section. The Age_Group is divided into 3 groups called 1, 2, and 3. Income level is divided into Low, Middle, and High. The average income and average expenditure of each group are presented, as well as their standard deviations. Table 1 shows the summary of the results for Age_Group, Table 2 for Income_Level, and Table 3 for educational attainment. Note that the values in the tables below are rounded off to the nearest tenth.

Age_Group
Table 1. Descriptive Statistics of Each Age_Group
Age_Group   Average Income (USD)   Standard Deviation for Income   Average Expenditure (USD)   Standard Deviation for Expenditure
1           54169.4                22666.8                         717.7                       689.1
2           55512.5                18377.7                         642.8                       584.0
3           62135.5                18790.5                         835.1                       623.4

As the table shows, the average income is 54169.4 USD for age group 1, 55512.5 USD for age group 2, and 62135.5 USD for age group 3. The standard deviation of income is 22666.8 for group 1, 18377.7 for group 2, and 18790.5 for group 3. Age group 1 has the largest standard deviation for income, which means that the incomes of people aged 25 to 40 are more spread out and differ more from each other than those of the other groups. The average total expenditure is 717.7 USD for group 1, 642.8 USD for group 2, and 835.1 USD for group 3. Lastly, the standard deviation of expenditure is 689.1 for group 1, 584.0 for group 2, and 623.4 for group 3.

Income_Level
Table 2. Descriptive Statistics of Each Income Level
Income_Level   Average Income (USD)   Standard Deviation for Income   Average Expenditure (USD)   Standard Deviation for Expenditure
Low            24492.4                261.9                           81.0                        104.9
Middle         49939.8                14114.7                         505.5                       495.4
High           83916.3                12077.3                         1431.1                      461.6

For the descriptive statistics of the income levels, the average income is 24492.4 USD for Low, 49939.8 USD for Middle, and 83916.3 USD for High. The Middle income level has the highest standard deviation of income, with a value of 14114.7. This means that the incomes of the people in the Middle level are more varied and differ more from each other. This may be because the Middle category covers more diverse types of people and makes up the majority of our dataset, with more entries than the other categories. The average expenditure is 81.0 USD for Low, 505.5 USD for Middle, and 1431.1 USD for High. The standard deviation of expenditure is 104.9 for Low, 495.4 for Middle, and 461.6 for High.

Educational Attainment
Table 3. Descriptive Statistics of Each Educational Attainment
Educational Attainment   Average Income (USD)   Standard Deviation for Income   Average Expenditure (USD)   Standard Deviation for Expenditure
Basic                    26604.6                2530.6                          115.7                       197.4
Graduation               55994.5                18955.9                         683.9                       600.1
Master                   54858.9                18655.1                         627.5                       605.6
PhD                      57458.3                19612.4                         696.5                       616.4

For the descriptive statistics of educational attainment, the average income is 26604.6 USD for Basic, 55994.5 USD for Graduation, 54858.9 USD for Master, and 57458.3 USD for PhD. PhD has the largest standard deviation for both income and expenditure, with values of 19612.4 and 616.4 respectively. This suggests that the incomes and expenditures of people with a PhD are spread out over a large range of values. The average total expenditure is 115.7 USD for Basic, 683.9 USD for Graduation, 627.5 USD for Master, and 696.5 USD for PhD.

Best Location Based on Average Expenditure per Age Group
To find the best location for our mall based on the average expenditure per age group, we first have to calculate the total revenue of the three locations. To do that, we multiply the average expenditure of each age group by the population percentage of that age group in each of the 3 locations. %age represents the percentage of the population in each age group, while Revenue (USD) is the product of Average Expenditure (USD) and %age. Total Revenue is the sum of Revenue (USD) over the age groups. Table 4 presents the findings for location 1, Table 5 for location 2, and Table 6 for location 3. All the values below are rounded off to the nearest tenth.

Table 4. Location 1 Total Revenue (Age group)


Location 1

Age_Group Average Expenditure (USD) %age Revenue (USD)

1 717.7 65 466.5

2 642.8 30 192.8

3 835.1 5 41.8

Total Revenue 701.1 USD

This table presents the results of multiplying the average expenditure of each age group by its population percentage in location 1. We get a total revenue of 701.1 USD.

Table 5. Location 2 Total Revenue (Age Group)


Location 2

Age_Group Average Expenditure (USD) %age Revenue (USD)

1 717.7 35 251.2

2 642.8 55 353.5

3 835.1 10 83.5

Total Revenue 688.2 USD

This table displays our results after doing the same steps as in table 4. We obtain a total
revenue of 688.2 USD.

Table 6. Location 3 Total Revenue (Age Group)
Location 3

Age_Group Average Expenditure (USD) %age Revenue (USD)

1 717.7 5 35.9

2 642.8 35 225.0

3 835.1 60 501.1

Total Revenue 762.0 USD

This table shows the results for location 3 after performing the same steps as in table 4.
We get a total revenue of 762.0 USD.

Of the three locations compared, location 3 had the best results, with the highest total revenue. In table 6, age group 3 made the biggest contribution to the sales, followed by age group 2 and then age group 1. This means that location 3 would be the best choice in this scenario and the most advantageous for MineBlox.
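The calculation behind tables 4 to 6 can be reproduced with a short Python sketch. The figures are taken from the tables above. The names used are hypothetical, and rounding each group’s revenue to the nearest tenth before adding is an assumption that matches the totals shown in the tables.

```python
# Average expenditure per age group in USD (Table 1).
AVERAGE_EXPENDITURE = {1: 717.7, 2: 642.8, 3: 835.1}

# Percentage of each age group per location (Tables 4 to 6).
POPULATION_SHARE = {
    "Location 1": {1: 65, 2: 30, 3: 5},
    "Location 2": {1: 35, 2: 55, 3: 10},
    "Location 3": {1: 5, 2: 35, 3: 60},
}


def total_revenue(shares):
    """Sum of average expenditure times population percentage over the age groups."""
    # Assumption: each revenue is rounded to the nearest tenth before summing.
    revenues = [
        round(AVERAGE_EXPENDITURE[group] * percent / 100, 1)
        for group, percent in shares.items()
    ]
    return round(sum(revenues), 1)


totals = {name: total_revenue(shares) for name, shares in POPULATION_SHARE.items()}
best_location = max(totals, key=totals.get)
```

Here `totals` holds the Total Revenue of each location and `best_location` is the one with the largest value.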

Best Location Based on Average Expenditure per Income Level


The process used for the age groups can also be applied here. The difference is that instead of the average expenditure of each age group, we now use the average expenditure of each income level. Remember that we have three categories: Low, Middle, and High. The population percentages also change because we are now looking at the income levels of the 3 locations. %age represents the percentage of the population in each income level. Revenue (USD) is the product of Average Expenditure (USD) and %age, while Total Revenue is the sum of Revenue (USD). Table 7 presents the results for location 1, Table 8 for location 2, and Table 9 for location 3.

Table 7. Location 1 Total Revenue (Income Level)


Location 1

Income Level Average Expenditure (USD) %age Revenue (USD)

Low 81.0 10 8.1

Middle 505.5 20 101.1

High 1431.1 70 1001.8

Total Revenue 1,111 USD

After we multiply the average expenditure of each income level by the population percentage,
we get the results presented in table 7. The total revenue for location 1 is 1,111 USD.

Table 8. Location 2 Total Revenue (Income Level)


Location 2

Income Level Average Expenditure (USD) %age Revenue (USD)

Low 81.0 20 16.2

Middle 505.5 60 303.3

High 1431.1 20 286.2

Total Revenue 605.7 USD

Table 8 shows the results for location 2 after performing the same steps as in table 7. The total
revenue is 605.7 USD.

Table 9. Location 3 Total Revenue (Income Level)


Location 3

Income Level Average Expenditure (USD) %age Revenue (USD)

Low 81.0 35 28.4

Middle 505.5 40 202.2

High 1431.1 25 357.8

Total Revenue 588.4 USD

The same process as in table 7 was performed for this table. The total revenue for location 3
is 588.4 USD.

After calculating the total revenue of each location, we can compare the three tables. Location 1 had the highest total revenue when income level is taken into account. Looking back at table 7, the high-income class made a large contribution that exceeded 1,000 USD. This is because it had both the highest average expenditure and the largest population share in location 1. The middle class comes next, and the low class had the smallest contribution. In this scenario, location 1 would be the best choice for the company.
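The income-level totals can be checked the same way. One detail is worth showing: the Low revenue of location 3 is 81.0 × 35% = 28.35, which sits exactly on a rounding boundary, so this sketch uses Python’s `decimal` module with half-up rounding to reproduce the 28.4 in table 9. The half-up rule and the names below are assumptions made for the example, not a description of how the spreadsheet was set up.

```python
from decimal import Decimal, ROUND_HALF_UP

# Average expenditure per income level in USD (Table 2).
AVERAGE_EXPENDITURE = {
    "Low": Decimal("81.0"),
    "Middle": Decimal("505.5"),
    "High": Decimal("1431.1"),
}

# Percentage of each income level per location (Tables 7 to 9).
POPULATION_SHARE = {
    "Location 1": {"Low": 10, "Middle": 20, "High": 70},
    "Location 2": {"Low": 20, "Middle": 60, "High": 20},
    "Location 3": {"Low": 35, "Middle": 40, "High": 25},
}

TENTH = Decimal("0.1")


def total_revenue(shares):
    """Add up the per-level revenues, each rounded half up to the nearest tenth."""
    total = Decimal("0")
    for level, percent in shares.items():
        revenue = AVERAGE_EXPENDITURE[level] * percent / 100
        total += revenue.quantize(TENTH, rounding=ROUND_HALF_UP)
    return total


totals = {name: total_revenue(shares) for name, shares in POPULATION_SHARE.items()}
best_location = max(totals, key=totals.get)
```

Using exact decimal arithmetic avoids the small binary floating-point errors that could otherwise tip a value like 28.35 in either direction.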

Relationships
This section presents the strength and direction of the relationships between income and expenditure, age and expenditure, and age and income. Table 10 shows the formulas used in Google Sheets to obtain the correlation coefficient (r) and coefficient of determination (R²). Table 11 presents the interpretation of the values used to analyze the strength of each relationship. Note that the calculated r and R² values below are reported unrounded because they indicate how strong the relationship between the variables is and how accurate its predictions are. Table 12 presents the r and R² values for income and expenditure, table 13 for age and expenditure, and table 14 for age and income.

Table 10. Formula for Correlation Coefficient (r) and Coefficient of Determination (R²)
(Independent variable & Dependent variable)   Correlation coefficient (r) Formula   Coefficient of determination (R²) Formula
(x & y)                                       =CORREL(data_y, data_x)               =RSQ(data_y, data_x)

Table 11. Interpretation of Correlation Coefficient (Dancey and Reidy, 2007)


Correlation Coefficient Strength of Relationship

1 Perfect

0.7 - 0.9 Strong

0.4 - 0.6 Moderate

0.1 - 0.3 Weak

0 None
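For readers who want to see what the two spreadsheet formulas compute, here is a minimal Python sketch of the Pearson correlation coefficient and its square, together with a helper that applies the interpretation table. The function names are hypothetical. The cutoffs in `strength` (treating each band in the table as starting at its lower bound) are an assumption, since the table leaves gaps between bands.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def r_squared(xs, ys):
    """Coefficient of determination of a simple linear fit: the square of r."""
    return pearson_r(xs, ys) ** 2


def strength(r):
    """Label the strength of a relationship using the bands of the table above."""
    magnitude = abs(r)
    if magnitude >= 1:
        return "Perfect"
    if magnitude >= 0.7:
        return "Strong"
    if magnitude >= 0.4:
        return "Moderate"
    if magnitude >= 0.1:
        return "Weak"
    return "None"
```

With these cutoffs, an r value of about 0.79 is labeled Strong, about 0.12 is labeled Weak, and about 0.07 is labeled None.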

Income and Expenditure

Figure 1. Scatterplot of Expenditure and Income

As you can see in figure 1, the direction of the relationship between income and expenditure is upward. As the income increases, or goes to the right, the total expenditure also goes up. This means that income and total expenditure are positively related: when one increases or decreases, the other tends to do the same.
Table 12. r and R² values of Income & Expenditure
(Independent variable & Dependent variable)   Correlation coefficient (r)   Coefficient of determination (R²)
Income & Expenditure                          0.7911607287                  0.6259352987

In this table, the r value of income and total expenditure is 0.7911607287. Based on table 11, this value means that the two variables have a strong relationship. The R² value is 0.6259352987, which means that income accounts for about 63% of the variation in total expenditure, so predictions based on income are reasonably accurate.

Age and Expenditure

Figure 2. Scatterplot of Expenditure and Age

Looking at figure 2, the direction of the relationship between total expenditure and age is slightly upward, but the trend is very faint, unlike the obvious one in figure 1. This means that the ages of the respondents have little bearing on their total expenditures. Thus, the two variables are at most very weakly related.

Table 13. r and R² values of Age & Expenditure
(Independent variable & Dependent variable)   Correlation coefficient (r)   Coefficient of determination (R²)
(Age & Expenditure)                           0.06504304087                 0.004230597166

Observing table 13, the r value of age and expenditure is 0.06504304087, which is far from 1. Based on table 11, this correlation coefficient implies that there is practically no relationship between the two. Their R² value of 0.004230597166 is likewise very far from 1, so predictions of expenditure based on age would not be accurate.

Age and Income

Figure 3. Scatterplot of Income & Age

As with figure 2, figure 3 displays an extremely faint upward trend. This means that age and income are only slightly related to each other.

Table 14. r and R² values of Age & Income
(Independent variable & Dependent variable)   Correlation coefficient (r)   Coefficient of determination (R²)
(Age & Income)                                0.1214112989                  0.01474070349

As shown in table 14, the r value between age and income is 0.1214112989, which is still not close to 1, meaning their relationship is weak. The R² value is 0.01474070349, also very far from 1, so predictions based on these variables would also not be accurate.

Best Variable for Prediction of Sales
If we look at table 12, we can observe that the independent variable income has the biggest R² value, 0.6259352987, which is the closest to 1. This denotes that income gives the most accurate prediction of expenditure. Thus, income is the best variable for predicting the sales of each location. In the regression model below, the trendline is clearly noticeable, indicating that the relationship between income and total expenditure is strong and positive. Out of all the scatterplots that we created, the most noticeable trend is the one between income and total expenditure.

Figure 4. Regression Model

Location with Best Sales Based on Regression Model


To find the location with the best sales using the regression model, we need to find the estimated sales of each income level in the 3 locations. To do that, we first use the formula from our regression model, which in this case is y = 0.025x - 726. We substitute x with the average income per income level. We then multiply the result by the designated population percentage of that income level. The product is the estimated sales of that income level. Total Estimated Sales is the sum of all estimated sales. Table 15 displays the results for location 1, table 16 for location 2, and table 17 for location 3. Note that the values below are rounded off to the first decimal place.

Table 15. Location 1 Total Estimated Sales
Location 1
Income Level   y = 0.025x - 726 (x = average income per income level)   %age   Estimated Sales (USD)
Low            0                                                        10     0
Middle         522.5                                                    20     104.5
High           1371.9                                                   70     960.3
Total Estimated Sales   1064.8

After substituting x in the formula with the average income per income level, we get the values
in the second column. Then, we multiply it by the percentage population. We then get the total
estimated sales of location 1 which is 1064.8 USD.

Table 16. Location 2 Total Estimated Sales
Location 2
Income Level   y = 0.025x - 726 (x = average income per income level)   %age   Estimated Sales (USD)
Low            0                                                        20     0
Middle         522.5                                                    60     313.5
High           1371.9                                                   20     274.4
Total Estimated Sales   587.9

The same steps as table 15 were conducted in this table. The total estimated sales we get for
location 2 is 587.9 USD.

Table 17. Location 3 Total Estimated Sales
Location 3
Income Level   y = 0.025x - 726 (x = average income per income level)   %age   Estimated Sales (USD)
Low            0                                                        35     0
Middle         522.5                                                    40     209.0
High           1371.9                                                   25     343.0
Total Estimated Sales   552.0

Table 17 shows the results for location 3 and the total estimated sales we obtained is 552.0
USD.

Comparing tables 15-17, location 1 has the highest total estimated sales with a value of 1064.8 USD, followed by location 2 and location 3 with values of 587.9 USD and 552.0 USD respectively. If we choose the location based on the regression model, the best place would be location 1 because it has the highest estimated sales, which would be beneficial for MineBlox’s new mall.
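The estimated sales in tables 15 to 17 can be reproduced with the sketch below. The slope, intercept, average incomes, and percentages come from the report. The names are hypothetical. Note that for the Low level the formula gives a negative value (about -113.7), so the sketch assumes that negative predictions are treated as 0, which is consistent with the 0 listed in the tables but is not stated in the text.

```python
SLOPE = 0.025
INTERCEPT = -726

# Average income per income level in USD (Table 2).
AVERAGE_INCOME = {"Low": 24492.4, "Middle": 49939.8, "High": 83916.3}

# Percentage of each income level per location (Tables 15 to 17).
POPULATION_SHARE = {
    "Location 1": {"Low": 10, "Middle": 20, "High": 70},
    "Location 2": {"Low": 20, "Middle": 60, "High": 20},
    "Location 3": {"Low": 35, "Middle": 40, "High": 25},
}


def predicted_expenditure(income):
    """Regression estimate y = 0.025x - 726, rounded to one decimal place."""
    estimate = round(SLOPE * income + INTERCEPT, 1)
    # Assumption: a negative estimate is treated as 0.
    return max(0.0, estimate)


def total_estimated_sales(shares):
    """Sum of predicted expenditure times population percentage over the levels."""
    sales = [
        round(predicted_expenditure(AVERAGE_INCOME[level]) * percent / 100, 1)
        for level, percent in shares.items()
    ]
    return round(sum(sales), 1)


estimates = {name: total_estimated_sales(s) for name, s in POPULATION_SHARE.items()}
best_location = max(estimates, key=estimates.get)
```

The ranking produced by `estimates` matches the comparison above, with location 1 first.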

Significant Differences Between Mean Expenditures


This section shows the significant differences in expenditure among the respondents’ age groups and education levels. To analyze the significant differences between the groups, a one-way ANOVA test and Tukey’s pairwise test were performed. The tables below display the results. Note that the values are rounded off to at least 3 significant figures.

Age_Groups

Table 18. Levene’s Test for expenditure based on age group


Test for equal means
Levene's test for homogeneity of variance, from means p (same): 8.03E-08

Levene’s test was conducted to check whether the variances are equal. In this case, the p-value of the test is 8.03E-08, which is less than 0.05, meaning the variances of the age groups are not equal. We therefore use the p-value of the Welch F test, shown in the table below, to determine whether the groups have significant differences.

Table 19. Welch F test for expenditure based on age group
Welch F test in the case of unequal variances: F=6.88 df=284 p=0.00121

Table 19 shows that the p-value of the expenditures for the age groups is 0.00121, which is less than our significance level of 0.05. This implies that at least two groups have significantly different means. To determine which groups specifically, we use Tukey’s pairwise test, which is shown in table 20.

Table 20. Tukey’s Pairwise test for expenditure based on age group
              Age_Group 1   Age_Group 2   Age_Group 3
Age_Group 1   -             0.122         0.151
Age_Group 2   2.78          -             0.00131
Age_Group 3   2.63          4.97          -

The p-value located in the third row and fourth column of the table is 0.00131, which is less than 0.05. This means that the means of Age_Group 2 and Age_Group 3 are different. To put it another way, the respondents that belong to Age_Group 2 have a significantly different expenditure than respondents belonging to Age_Group 3.

Education

Table 21. Levene’s Test for expenditure based on education


Test for equal means
Levene's test for homogeneity of variance, from means p (same): 1.30E-07

The p-value of Levene’s test, in this case, is 1.30E-07, which is less than 0.05, denoting that the variances of the education groups are not equal. The Welch F test, shown below, is therefore used to determine whether there are significant differences.

Table 22. Welch F test for expenditure based on education


Welch F test in the case of unequal variances: F=49.1 df=107 p=4.89E-20

As displayed in table 22, the p-value of the expenditure for education is 4.89E-20. This value is less than 0.05, meaning that there are significant differences in the mean expenditure of at least two education levels. Tukey’s Pairwise test, shown in the table below, indicates which groups differ.

Table 23. Tukey’s Pairwise test for expenditure based on education
             Basic   Graduation   Master    PhD
Basic        -       0.000183     0.00115   0.000151
Graduation   5.90    -            0.318     0.983
Master       5.27    2.42         -         0.286
PhD          5.97    0.524        2.51      -

The three p-values located in the second row, in columns 3 to 5, are all less than 0.05, meaning that these pairs of groups have different means. Basic and Graduation have a p-value of 0.000183, Basic and Master have 0.00115, and Basic and PhD have 0.000151. This indicates that Basic has a significantly different expenditure compared to all the other educational attainments, which are Graduation, Master, and PhD.
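Reading a Tukey table amounts to filtering the pairwise p-values against the significance level. The sketch below does this for the education groups. It assumes that the entries above the diagonal of the table are the p-values (the values the text compares to 0.05) and that the entries below the diagonal are test statistics, which are not needed here. The variable names are hypothetical.

```python
ALPHA = 0.05

# Pairwise p-values read from the upper triangle of the Tukey table above.
TUKEY_P = {
    ("Basic", "Graduation"): 0.000183,
    ("Basic", "Master"): 0.00115,
    ("Basic", "PhD"): 0.000151,
    ("Graduation", "Master"): 0.318,
    ("Graduation", "PhD"): 0.983,
    ("Master", "PhD"): 0.286,
}

# Keep only the pairs whose means differ significantly.
significant_pairs = sorted(pair for pair, p in TUKEY_P.items() if p < ALPHA)
```

Every pair that passes the filter involves Basic, which is the pattern described in the paragraph above.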

Significant Differences Between Mean Income Considering Both Age Group and Education
This section presents the results for the three null hypotheses. In this case, the first independent variable is age group, the second independent variable is education, and the dependent variable is income. To investigate the effect of two independent variables on a dependent variable, a two-way ANOVA test was used. To know which groups specifically differ from each other, Tukey’s pairwise test was performed. Note that the values are rounded off to at least 3 significant figures.

Age_Group

Table 24. Two-Way ANOVA for age group and education focusing on age group
Sum of sqrs df Mean square F p (same)
Age Group: 6.21E+09 2.00 3.11E+09 8.73 0.000168

The table shows that the p-value of the first independent variable, age group, is 0.000168, which is clearly less than 0.05. This means that there is a significant difference in incomes due to age group. Tukey’s Pairwise test was used to see which groups specifically differ, and the results are shown below.

Table 25. Tukey’s Pairwise test for the first independent variable (age group)
              Age_Group 1   Age_Group 2   Age_Group 3
Age_Group 1   -             0.388         8.09E-05
Age_Group 2   0.388         -             4.21E-07
Age_Group 3   8.09E-05      4.21E-07      -

In table 25, the p-values that are less than 0.05 are those involving Age_Group 3. Age_Group 1 and Age_Group 3, with a p-value of 8.09E-05, have a significant difference in their incomes. Age_Group 2 and Age_Group 3 also differ significantly because their p-value is 4.21E-07.

Education

Table 26. Two-Way ANOVA for age group and education focusing on education
             Sum of sqrs   df     Mean square   F      p (same)
Education:   1.88E+10      3.00   6.27E+09      17.6   2.80E-11

The p-value for education, the second independent variable, is 2.80E-11 as shown in Table 26,
which is less than 0.05. This implies that there are differences in incomes due to education.
Tukey’s pairwise test was conducted to determine which groups exactly differ. Results are shown
below.

Table 27. Tukey’s Pairwise test for the second independent variable (education)
             Graduation   Master     Basic      PhD
Graduation   -            0.669      4.12E-11   0.4804
Master       0.669        -          4.12E-11   0.1411
Basic        4.12E-11     4.12E-11   -          4.12E-11
PhD          0.4804       0.1411     4.12E-11   -

As seen in Table 27, the p-values that are less than 0.05 all involve the basic group. Basic and
graduation have a p-value of 4.12E-11, indicating that these two groups differ significantly.
Master and basic have the same p-value of 4.12E-11, so there is also a significant difference
between the two. Lastly, basic and PhD also have a p-value of 4.12E-11, meaning that respondents
who attained a basic education have a significantly different income than those with a PhD. In
short, respondents who have a basic education have a significantly different mean income
compared to those at every higher education level, while the higher levels do not differ
significantly from one another.

Interaction Effect Between Age Group and Education
This section examines whether there is an interaction effect between age group and education
with regard to income. The same two-way ANOVA test was used to analyze this. A table of the
results is shown below.

Table 28. Two-way ANOVA test for age group and education focusing on interaction
               Sum of sqrs   df     Mean square   F      p (same)
Age Group:     6.21E+09      2.00   3.11E+09      8.73   0.000168
Education:     1.88E+10      3.00   6.27E+09      17.6   2.80E-11
Interaction:   2.74E+09      6.00   4.56E+08      1.28   0.262

As seen in Table 28, the p-value of the interaction between the two independent variables, age
group and education, is 0.262. This value is greater than 0.05, so there is no significant
interaction between education and age group.

Visualizations

Figure 5. Histogram of Income

As seen in Figure 5, the most common income among the respondents is around 30,000 - 40,000 USD,
with a count of 359 respondents. This suggests that the respondents are concentrated in the
middle class.

Figure 6. Histogram of Total Expenditures

As observed in Figure 6, the most common total expenditure was 100 USD or less. Out of 1985
respondents, 505 fell in this range, which means that 25.4% of the respondents had mall
expenditures of 100 USD or less.

Figure 7. Distribution of Customers Based on Age group

As Figure 7 shows, the majority of the respondents belong to Age Group 2, who are aged 41 to
70. Of the respondents, 1555 or 78.3% belong to group 2, 298 or 15.0% are in group 1, and 132 or
6.6% are in group 3.

Figure 8. Distribution of Customers Based on Income Level

In Figure 8, we can see that most of the customers are from the Middle class, as they make up
1607 or 81.0% of the respondents. 356 or 17.9% are from the High class and 22 or 1.1% are from
the Low class.

Figure 9. Distribution of Customers Based on Education

The chart shows that 1000 or 50.4% of the customers are at the graduation level, which means
that about half of the respondents are graduates. 503 or 25.3% have a master’s degree, 462 or
23.3% have PhDs, and 20 or 1.0% have basic education.
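
The percentages quoted for Figures 7 to 9 can be recomputed from the counts. This is a small sketch under the assumption that every respondent is counted exactly once per chart; the dictionary layout and function name are illustrative.

```python
TOTAL_RESPONDENTS = 1985  # total number of respondents stated in this chapter

# Counts copied from the descriptions of Figures 7, 8, and 9.
counts = {
    "age_group": {"Age Group 1": 298, "Age Group 2": 1555, "Age Group 3": 132},
    "income_level": {"Low": 22, "Middle": 1607, "High": 356},
    "education": {"Graduation": 1000, "Master": 503, "PhD": 462, "Basic": 20},
}


def shares(group_counts, total=TOTAL_RESPONDENTS):
    """Convert counts to percentages, checking that they add up to the total."""
    if sum(group_counts.values()) != total:
        raise ValueError("counts do not add up to the number of respondents")
    return {name: 100 * count / total for name, count in group_counts.items()}


for variable, group_counts in counts.items():
    for name, pct in shares(group_counts).items():
        print(f"{variable} / {name}: {pct:.1f}%")
```

Each set of counts sums to 1985, so the three charts describe the same set of respondents.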

Chapter 3: Conclusion
This report was created to present the data analysis and processing procedures and to answer
the questions stated in Chapter 1. Based on the results of the analyses, the average income of
Age_Group 1 is 54169.4 USD, that of Age_Group 2 is 55512.5 USD, and that of Age_Group 3 is
62135.5 USD. Age_Group 1 was also found to have the largest standard deviation for income, with
a value of 22666.8, implying that incomes within that group are the most spread out. We also
found the average income of each income level: Low has an average of 24492.4 USD, Middle
49939.8 USD, and High 83916.3 USD. For the standard deviation of total expenditure in each
income level, Middle had the largest with 14114.7, compared with 261.9 for Low and 12077.3 for
High. This means that values in the Middle-income level are the most spread out from their
mean.

In addition, if the location is chosen according to the average expenditure of each age group,
Location 3 would be the most suitable choice, because it had the highest total revenue, with a
value of 762 USD. The biggest contributor among the 3 age groups was Age_Group 3, which had the
largest average expenditure of 835.1 USD as well as the biggest population share of 60%. Its
revenue was 501.1 USD, followed by group 2 with 225 USD and group 1 with 35.9 USD. With this in
mind, the most advantageous place to build the new mall on the basis of age group is Location 3,
the location beside a retirement home that is mostly comprised of senior citizens and retirees.

Results also show that if income level is used as the basis for picking the best place,
Location 1 would be the most fitting. Location 1 displayed the highest total revenue, with a
value of 1,111 USD. Customers in the High-income level made the largest contribution, with the
highest average expenditure of 1431.1 USD and the largest population share of 70%. Their
revenue was 1001.8 USD, followed by the Middle-income level with 101.1 USD and Low in last
place with 8.1 USD. This means that choosing Location 1 would be the most strategic action for
the company on this basis. This location is a newly developed area with a majority population
of new college graduates.
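
The revenue figures in the two paragraphs above follow a simple weighting: each group's revenue is its average expenditure multiplied by its share of the location's population, and a location's total is the sum over groups. The sketch below checks the numbers quoted in this chapter; the function name is an illustrative assumption, and only the group-level revenues stated above are used for the totals.

```python
def weighted_revenue(avg_expenditure, population_share):
    """Group revenue = average expenditure x share of the location's population."""
    return avg_expenditure * population_share


# Location 3 by age group: Age_Group 3 spends 835.1 USD on average, 60% share.
loc3_top_group = weighted_revenue(835.1, 0.60)
# Group revenues as quoted in the text: Age_Group 3, group 2, group 1.
loc3_total = sum([501.1, 225.0, 35.9])

# Location 1 by income level: High spends 1431.1 USD on average, 70% share.
loc1_top_group = weighted_revenue(1431.1, 0.70)
# Group revenues as quoted in the text: High, Middle, Low.
loc1_total = sum([1001.8, 101.1, 8.1])

print(f"Location 3: top group {loc3_top_group:.1f} USD, total {loc3_total:.1f} USD")
print(f"Location 1: top group {loc1_top_group:.1f} USD, total {loc1_total:.1f} USD")
```

The recomputed totals agree with the reported 762 USD for Location 3 and 1,111 USD for Location 1.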

In short, if the company uses age group as the basis for choosing the building site, Location 3
would be the advisable choice. However, if the basis is the customers’ income level, then the
most appropriate would be Location 1. Each is the location with the highest total revenue under
its respective basis. Between the two, income level is the better basis for choosing the
location: it presents a higher total revenue than age group (1,111 USD versus 762 USD), and the
correlation analysis shows that income is more strongly related to expenditure than age is.

For the relationship between income and expenditure, it was established that this pair had the
strongest relationship of the three examined. The relationship had a clearly noticeable upward
direction, indicating that the two are directly related. Additionally, their r value was
0.7911607287, which, according to the interpretation of the correlation coefficient by Dancey
and Reidy (2007), indicates a strong relationship. For age and expenditure, the direction of the
relationship was very faint. The trend was almost a flat, only slightly slanted line, implying
that the relationship between the two is practically non-existent. Their r value was
0.06504304087, which indicates that there was no relationship. Much the same goes for the
relationship between the respondents’ age and income. A slight, barely noticeable upward
direction was found, indicating that the relationship is faint. Their r value was 0.1214112989,
implying that the relationship is weak, meaning that age and income do not have much influence
on each other.
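
The verbal labels attached to the three r values can be expressed as a small classifier. The cutoffs below (0.1, 0.4, 0.7) are an assumption based on the bands commonly attributed to Dancey and Reidy; this report does not list the exact thresholds, so treat them as illustrative rather than as the scale actually applied.

```python
def correlation_strength(r):
    """Label the strength of a Pearson r using ASSUMED cutoffs (0.1, 0.4, 0.7)."""
    magnitude = abs(r)
    if magnitude > 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if magnitude < 0.1:
        return "none"
    if magnitude < 0.4:
        return "weak"
    if magnitude < 0.7:
        return "moderate"
    if magnitude < 1.0:
        return "strong"
    return "perfect"


# r values reported in this chapter.
reported_r = {
    "income vs expenditure": 0.7911607287,
    "age vs expenditure": 0.06504304087,
    "age vs income": 0.1214112989,
}

for pair, r in reported_r.items():
    print(f"{pair}: r = {r:.3f} -> {correlation_strength(r)}")
```

Under these assumed cutoffs the labels match the ones used in the text: strong, none, and weak respectively.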

The best variable for predicting the sales of each location was income. Thus, the average
income of each income level was used in the regression formula to determine the estimated sales
of each location. This showed that Location 1 had the highest estimated sales with 1064.8 USD,
followed by Location 2 with 587.9 USD and Location 3 with 552.0 USD. In short, Location 1 would
be the best choice based on the regression model because it presents the highest estimated
sales.

It was also discovered that there was a significant difference between the mean expenditures
of the age groups, because the Welch F test for unequal variances gave a p-value of 0.00121.
Tukey’s pairwise test showed that the difference lies specifically between Age_Group 2 and
Age_Group 3, whose p-value was 0.00131, which is less than 0.05. In other words, respondents in
Age_Group 2 have a different expenditure than people in Age_Group 3. The same goes for the
education of the respondents. Their p-value in the Welch F test was 4.89E-20, meaning that the
mean expenditures differ. Using Tukey’s pairwise test, it was established that respondents with
basic education had a significantly different expenditure compared to all of the other
categories, which are graduation, master, and PhD. Both variables were established to have
unequal variances because of their p-values in Levene’s test: 8.03E-08 for age groups and
1.30E-07 for education, which are both less than 0.05.

When both age group and education are considered with regard to mean income, both have
significant differences among their groups, because in the two-way ANOVA test the p-value for
age group is 0.000168 and that for education is 2.80E-11, which are both less than 0.05. For
Tukey’s pairwise test on the first independent variable, age group, it was established that
Age_Group 3 has different incomes when compared to Age_Group 1 and Age_Group 2. This is because
the p-value for Age_Group 1 and Age_Group 3 was 8.09E-05, while for Age_Group 2 and Age_Group 3
it was 4.21E-07, which are both less than 0.05. For Tukey’s pairwise test on the second
independent variable, education, it was discovered that basic has a significantly different
income compared to the other three categories. Basic and graduation had a p-value of 4.12E-11,
and the same p-value applies to basic and master as well as basic and PhD, so all three pairs
have a p-value less than 0.05. Additionally, there was no significant interaction effect between
age group and education with regard to income, because the interaction had a p-value of 0.262,
which is greater than 0.05.

Now that all the objectives have been answered, I hope the company can pick the best location
for the new mall. With these results, the company can narrow down its choices and at the same
time predict which option will provide the best sales.

Reference:

SalaryExplorer. (2021). Average Salary in United States 2021 - The Complete Guide.
http://www.salaryexplorer.com/salary-survey.php?loc=229&loctype=1
