GEORGE BROWN COLLEGE

SCHOOL OF BUSINESS

Grocery Store Location


Cluster Segmentation of Census Data

Elham Safari

Deliverable 1: EM Activity Report

Deliverable 2: Case Questions


1- Explain the rationale for removing the cases where the mean household size is less than 0.1
Cases with a mean household size below 0.1 carry no usable information: the values of all attributes
except the geographic coordinates are zero. A household cannot have fewer than one member, so these
records describe areas with no residents. Analyzing them is pointless and only costs processing time, so
they are removed as part of data cleaning and the analysis is applied only to cases with meaningful
information. There is also no business case for them: opening a grocery store where the average
household size is less than 1 makes no sense, since there are no potential customers nearby.
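
The filtering step described above can be sketched in a few lines of Python. This is only an illustration, not the Enterprise Miner workflow itself: the field names MeanHHSize and MedianIncome and the sample values are assumptions made up for the example; only LocX, LocY and the 0.1 threshold come from the case.

```python
# Hypothetical census records. MeanHHSize and MedianIncome are assumed
# field names; the values are invented for illustration only.
records = [
    {"LocX": 1.0, "LocY": 2.0, "MeanHHSize": 2.6, "MedianIncome": 54000.0},
    {"LocX": 3.0, "LocY": 4.0, "MeanHHSize": 0.0, "MedianIncome": 0.0},
    {"LocX": 5.0, "LocY": 6.0, "MeanHHSize": 3.1, "MedianIncome": 61000.0},
]

MIN_HOUSEHOLD_SIZE = 0.1  # threshold stated in the case question


def drop_empty_areas(rows, threshold=MIN_HOUSEHOLD_SIZE):
    """Keep only rows whose mean household size is at least the threshold."""
    return [row for row in rows if row["MeanHHSize"] >= threshold]


cleaned = drop_empty_areas(records)
```

Here the second record, whose attributes are all zero apart from its coordinates, is the only one dropped.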
2- Why should the cluster segmentation exclude the geographic coordinates (LocX and LocY)?
Because the goal is to group geographic areas into subsets based on urbanization, household size and
income. If longitude and latitude were included, the clusters would partly reflect where an area sits on
the map rather than how similar its population is to that of other areas, which is not useful for
choosing grocery store locations.
3- Why was the region's population (RegPop) also excluded from the segmentation?
For the same reason as in the previous question: the number of people in the region is not used to
find suitable grocery store locations, because the segmentation is meant to be driven by the data
on urbanization, household size and income.
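
As a rough sketch of the exclusions discussed in questions 2 and 3, the cluster inputs can be thought of as each record with LocX, LocY and RegPop removed. The remaining field names (MeanHHSize, MedianIncome) and all values are hypothetical and stand in for whatever urbanization, household size and income variables the dataset actually contains.

```python
# Variables named in the case as excluded from the segmentation.
EXCLUDED = {"LocX", "LocY", "RegPop"}


def cluster_inputs(row):
    """Return a copy of the record without the excluded variables."""
    return {name: value for name, value in row.items() if name not in EXCLUDED}


# Hypothetical record; MeanHHSize and MedianIncome are assumed field names.
area = {
    "LocX": 12.5,
    "LocY": 48.1,
    "RegPop": 15200,
    "MeanHHSize": 2.7,
    "MedianIncome": 58000.0,
}

features = cluster_inputs(area)
```

The excluded variables stay in the original record, so each area can still be located on a map once it has been assigned to a cluster.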
4- Given that the cluster variables are on different measurement scales, why would they need
to be standardized? Explain in statistical terms how the internal standardization is
transforming the input cluster variables.
Cluster analysis is based on distances between cases, so variables with large variances have a larger
effect on the resulting clusters than variables with small variances. The variables therefore need to be
standardized before clustering. Standardization expresses each data value as the number of standard
deviations it lies from the mean of its variable: the mean is subtracted and the result is divided by the
standard deviation, so every variable ends up with mean 0 and standard deviation 1. With all variables
on a comparable scale, each one contributes evenly when the algorithm minimizes the mean squared
distance from each data value to its nearest cluster center, and outliers are easier to identify.
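
To make the transformation concrete, here is a minimal sketch of standardizing two variables on very different scales. It assumes the population standard deviation; the internal standardization in the clustering tool may use the sample standard deviation instead, and the variable names and values below are invented for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert values to z-scores: (value - mean) / standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (an assumption)
    if s == 0:
        # A constant variable carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


# Hypothetical inputs on very different measurement scales.
income = [40000.0, 50000.0, 60000.0]
household_size = [2.0, 3.0, 4.0]

z_income = standardize(income)
z_household_size = standardize(household_size)
```

Before standardization, income differences in the tens of thousands would swamp household size differences of one or two in any distance calculation. After it, both variables have mean 0 and standard deviation 1, so neither dominates.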