
Kruunccch

DOMS IIT Madras

Team: The Avengers


IIM Indore

PraveenKumar Alluri
Pratheek Perala
Manish Manohar

Data Pre-processing
The input provided is raw training data containing the workplace population within a 2-mile radius
and within a 5-mile radius of each restaurant. It also gives the percentage of the population in each
age bracket within the 2-mile radius (for example, 0 to 5 years, 5 to 9 years, and so on), and the
same percentages for the 5-mile radius. In total there are about 58 attributes.

The total workplace population corresponds to the 15–64 years age bracket. For example, for
restaurant 1 the percentage of people in the 15–64 years age bracket is about 70% and the total
workplace population is 28643, from which we get the total population within the 2-mile radius as
47971. Based on this total, we can calculate the absolute number of people for each attribute: the
population under 5 years of age within the 2-mile radius is 5407, which is 11.27% of the total
population. In this way all other values under each attribute were calculated for each restaurant.
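
The scaling step described above can be sketched in Python as follows. The function names are illustrative and are not part of the original workflow, which was carried out in Excel. Taken at face value, the two quoted head counts correspond to a 15–64 share of roughly 0.60, so the sketch derives the share from those counts rather than assuming a value.

```python
# Figures quoted in the text for restaurant 1 (2-mile radius).
WORKPLACE_POP = 28643   # people aged 15-64, as stated
TOTAL_POP = 47971       # total population, as stated
UNDER_5_COUNT = 5407    # population under 5 years of age, as stated


def total_population(workplace_pop: float, share_15_64: float) -> float:
    """Scale the 15-64 head count up to the whole population."""
    return workplace_pop / share_15_64


def absolute_count(total_pop: float, percentage: float) -> float:
    """Convert a percentage attribute into a head count."""
    return total_pop * percentage / 100.0


# Share of 15-64 year olds implied by the two quoted head counts.
implied_share = WORKPLACE_POP / TOTAL_POP
# Percentage of under-5s implied by the quoted head counts.
under_5_pct = 100.0 * UNDER_5_COUNT / TOTAL_POP

print(f"implied 15-64 share: {implied_share:.4f}")
print(f"under-5 percentage:  {under_5_pct:.2f}%")
```

The same two helper functions would be applied to every percentage attribute of every restaurant.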

We dropped two attributes, the ratio of blue-collar to white-collar workers within the 2-mile and
5-mile radii, because we could not get an exact estimate of the number of employees in the region.
By definition the workplace population consists of people in the 15–64 age bracket, but we cannot
assume that everyone in this bracket is employed.

The pre-processed training data and the test data were converted into the .arff file format, which
is the format generally preferred by the Weka tool, and loaded into Weka. The pre-processed data is
then ready to be used in Weka for building the Gaussian process regression model.
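
As a rough illustration of what the conversion produces, the sketch below writes all-numeric rows as ARFF text. It is not the conversion route used by the team. The attribute name `Population_0_5_2mileRadius` is a hypothetical name for the derived head count, and the single row reuses the two figures quoted earlier.

```python
from typing import Sequence


def to_arff(relation: str, attributes: Sequence[str],
            rows: Sequence[Sequence[float]]) -> str:
    """Render all-numeric rows as ARFF text (header, then @data section)."""
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {name} numeric" for name in attributes]
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# One illustrative row; the second attribute name is hypothetical.
arff_text = to_arff(
    "restaurants",
    ["Total_population_2mileRadius", "Population_0_5_2mileRadius"],
    [[47971, 5407]],
)
print(arff_text)
```

Nominal attributes, missing values and quoting of special characters are deliberately left out of this sketch.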

Tools Employed - Weka Software (Waikato Environment for Knowledge Analysis)


Weka is a popular suite of machine learning software written in Java; it is free software available
under the GNU General Public License. Weka is a data mining package which includes a wide variety
of methods. Its easy-to-use interface makes it accessible for general use, while its flexibility and
extensibility make it suitable for academic use.

The Weka workbench contains a collection of visualization tools and algorithms for data analysis and
predictive modelling, together with graphical user interfaces for easy access to this functionality.
Weka supports several standard data mining tasks, more specifically, data pre-processing, clustering,
classification, regression, visualization, and feature selection. All of Weka's techniques are
predicated on the assumption that the data is available as a single flat file or relation, where each
data point is described by a fixed number of attributes (normally, numeric or nominal attributes, but
some other attribute types are also supported). Weka provides access to SQL databases using Java
Database Connectivity and can process the result returned by a database query. It is not capable of
multi-relational data mining, but there is separate software for converting a collection of linked
database tables into a single table that is suitable for processing using Weka. Another important
area that is currently not covered by the algorithms included in the Weka distribution is sequence
modelling. Weka's main user interface is the Explorer, but essentially the same functionality can be
accessed through the component-based Knowledge Flow interface and from the command line.
There is also the Experimenter, which allows the systematic comparison of the predictive
performance of Weka's machine learning algorithms on a collection of datasets.

The advantages that we found in Weka are enumerated below:

a. Pre-processing, processing and result interpretation are integrated and thus easier to
understand and implement
b. Weka is written in Java, which our group was more comfortable with
c. Weka has a good GUI, which makes it easier to use
d. In Weka, the data is processed in the “ARFF” format, which makes the comparison of outputs
more standard
e. Weka has a user-friendly interface for setting the different parameters, such as the
classification algorithm and the data partitioning, through dropdown lists

Methods Used
We tried different regression methods, including Linear Regression, Isotonic Regression, Least
Median Squares (LeastMedSq), Pace Regression, Regression by Discretization and Gaussian Processes,
using the tools SPSS and Weka. Comparing the RMSE values from each of these methods, we observed
the lowest RMSE from Gaussian process regression.

Gaussian Processes for Regression


A Gaussian process is a generalization of the Gaussian probability distribution. A probability
distribution describes random variables which are scalars or vectors (for multivariate distributions),
whereas a Gaussian process describes a distribution over functions.

A Gaussian process is specified by a mean function and a covariance function. Here, the mean is a
function of x (which we will often take to be the zero function), and the covariance is a function
K(x, x’) that expresses the expected covariance between the values of the function y at the points
x and x’. The covariance function determines how smoothly the function varies with x. The function
y(x) in any one data modelling problem is assumed to be a single sample from this Gaussian process.
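
In symbols, the definition above can be written as follows. The symbol m for the mean function is introduced here for illustration and does not appear in the original text.

```latex
y(x) \sim \mathcal{GP}\big(m(x),\, K(x, x')\big), \qquad
m(x) = \mathbb{E}[y(x)], \qquad
K(x, x') = \mathbb{E}\big[(y(x) - m(x))\,(y(x') - m(x'))\big]
```

For any finite set of inputs x_1, ..., x_n, the vector (y(x_1), ..., y(x_n)) then follows a multivariate Gaussian distribution with mean vector (m(x_1), ..., m(x_n)) and covariance matrix with entries K(x_i, x_j).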

Implementations of Gaussian processes are available in various software packages and in most
programming languages, e.g. Weka (Java), Matlab, Python, C and C++.

Data processing
Using SPSS and Microsoft Excel
We used Microsoft Excel to pre-process the training data. The same pre-processed Excel sheet was
then used in SPSS to model the situation using linear regression. The formula used is given below:

$$\hat{Y} = C + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n$$

where $\hat{Y}$ is the predicted value and $C$ is the constant.


SPSS gave us the beta value for each attribute used in the equation. Using these beta values, we
predicted the total number of customers for the test data set. This calculation was done in the
Excel sheet for the test data set as well as the training data set.
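
The prediction step amounts to evaluating the linear formula once per data point. A minimal sketch is given below; the constant, the beta values and the attribute values are hypothetical numbers chosen only for illustration, not the coefficients SPSS produced.

```python
from typing import Sequence


def predict_linear(constant: float, betas: Sequence[float],
                   x: Sequence[float]) -> float:
    """Return C + sum(beta_i * x_i) for one data point."""
    if len(betas) != len(x):
        raise ValueError("need exactly one beta per attribute")
    return constant + sum(b * xi for b, xi in zip(betas, x))


# Hypothetical coefficients and attribute values, for illustration only.
C = 10.0
BETAS = [2.0, -0.5, 0.25]
X = [3.0, 4.0, 8.0]

print(predict_linear(C, BETAS, X))
```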


We were unable to calculate the RMSE values for the test data set as the actual values were not
provided. But we could calculate the same for the training data set using the predicted values and
the actual values provided. The RMSE formula that we used is as given below:-

$$\mathrm{RMSE} = \sqrt{\frac{e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2}{n}}$$

where $e_i = y_i - y_i'$, $y_i$ is the actual value and $y_i'$ is the predicted value for data point $i$.

The RMSE value that we got from the linear regression analysis on the training data set was 286.7574.
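
The RMSE calculation can be sketched as a short Python function. The actual and predicted values below are made-up numbers used only to exercise the function; they are not taken from the training data.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between actual and predicted values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]

print(rmse(actual, predicted))
```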

Using Weka
We used Microsoft Excel to pre-process the training data. Since the input for Weka has to be in the
“.arff” format, we converted the pre-processed Excel sheet into that format. The new “.arff” file
was used as the input to Weka to model the situation. The process and settings used in Weka are
listed below:

Process used: Gaussian

Kernel used: Radial Basis Function (RBF) as covariance function


The formula for calculating the kernel value is as follows:

$$K(x, y) = \exp\left(-\frac{\sum_i (x_i - y_i)^2}{2\sigma^2}\right)$$

$$\gamma = \frac{1}{2\sigma^2}$$

If σ is very small (γ is very large), even nearby patterns are treated as dissimilar, so over-fitting
will happen: the predicted values will be accurate for the training data set but inaccurate for the
test data set (unseen data). If σ is very large (γ is very small), all patterns are treated as very
similar and under-fitting will arise: the predicted values will not be accurate even for the training
data set, and there is no guarantee that the predicted values for the test data set (unseen data)
will be accurate.
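
The effect of σ on the kernel value can be checked numerically. The sketch below is plain Python written for illustration, not Weka's kernel code. It evaluates the RBF kernel for one fixed pair of points under a small, a moderate and a large σ, using gamma = 1 / (2 * sigma^2).

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gamma_from_sigma(sigma: float) -> float:
    """Convert the kernel width sigma into gamma = 1 / (2 * sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)


a = [0.0, 0.0]
b = [1.0, 1.0]  # squared distance from a is 2

for sigma in (0.1, 1.0, 10.0):
    gamma = gamma_from_sigma(sigma)
    print(f"sigma={sigma}: gamma={gamma:.4g}, K={rbf_kernel(a, b, gamma):.4g}")
```

With a small σ the kernel value for these two points is practically zero (they look unrelated), while with a large σ it is close to one (they look almost identical).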

The cache size was left at its default value of 250007 and gamma at its default value of 1. We tried
several other gamma values, but the most appropriate value was close to the default, at which point
over-fitting and under-fitting were both at a minimum.

Debug: false

Noise level: 1.0

Filter type: normalize training data
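
To show how gamma and the noise level enter the prediction, here is a minimal pure-Python sketch of zero-mean Gaussian process regression with an RBF kernel. It is not Weka's implementation. Treating the noise level as a value added to the diagonal of the kernel matrix is an assumption made here, and the three training points are made-up numbers.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def rbf(x: Vector, y: Vector, gamma: float) -> float:
    """RBF kernel exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve matrix @ z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def gp_fit(xs: Sequence[Vector], ys: Sequence[float],
           gamma: float, noise_var: float) -> List[float]:
    """Return alpha = (K + noise_var * I)^-1 y (assumed noise model)."""
    n = len(xs)
    k = [[rbf(xs[i], xs[j], gamma) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(k, list(ys))


def gp_predict(x_new: Vector, xs: Sequence[Vector],
               alpha: Sequence[float], gamma: float) -> float:
    """Posterior mean at x_new: sum_i alpha_i * K(x_new, x_i)."""
    return sum(a * rbf(x_new, x, gamma) for a, x in zip(alpha, xs))


# Made-up one-dimensional training data, for illustration only.
XS = [[0.0], [1.0], [2.0]]
YS = [0.0, 1.0, 4.0]
GAMMA, NOISE_VAR = 1.0, 1.0  # same numeric values as the settings above

alpha = gp_fit(XS, YS, GAMMA, NOISE_VAR)
print(gp_predict([2.0], XS, alpha, GAMMA))
```

A larger noise value pulls the predictions towards the zero mean, while a value close to zero makes the model reproduce the training targets almost exactly.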

Datasets sometimes have missing values, which need to be handled, and attributes measured on very
different scales, which need to be rescaled before modelling. Rescaling can be done with Weka's
Normalize filter, which maps each numeric attribute to the range 0 to 1 using its minimum and
maximum, or with the Standardize filter, which rescales each attribute using its mean and standard
deviation.
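
Two common rescaling schemes, min-max normalization and mean/standard-deviation standardization, can be sketched as follows. This is plain Python for illustration, not Weka's filter code, and the sample values are made up.

```python
import statistics
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values: Sequence[float]) -> List[float]:
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - mean) / std for v in values]


# Made-up attribute values, for illustration only.
sample = [10.0, 20.0, 30.0]
print(min_max_normalize(sample))
print(standardize(sample))
```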

For the training data set, Weka reported the predicted value, the actual value and the error for
each data point. The RMSE was calculated as 97.4533.

For the test data set, Weka calculated the predicted values for each of the 70 data points.

The RMSE value that we got from the Weka software for the test data set was 158.583937

Choice of processing software


As the calculations with both tools show, the RMSE from Weka is only 158.58, whereas from SPSS it
was 286.7574, so the Weka model has a much lower RMSE than the linear regression model of SPSS. We
therefore concluded that the Weka model is the better of the two. Unlike the linear regression of
SPSS, which gives a separate beta coefficient for each attribute, the Weka model does not show any
such coefficients. To understand the importance of the various attributes for business decision
making, we therefore used the attribute evaluator functionality of Weka to rank the attributes.

Important drivers for the target variable


Tool used to rank the attributes: Weka has a function called “Select Attributes”, which ranks the
attributes once an Attribute Evaluator and a Search Method have been chosen.

Figure: Snapshot taken from Weka tool denoting the Attribute Evaluator & Search Method

For the given training set, we used “ClassifierSubsetEval” as the attribute evaluator and Gaussian
Process as the learning scheme for the evaluation (the same Gaussian Process scheme was used to
generate the Total values for the test set).

Attribute Evaluator: ClassifierSubsetEval

Learning Scheme: Gaussian Process

Search Method: Greedy Stepwise
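
The idea behind combining a subset evaluator with a greedy stepwise search can be sketched as a forward selection loop that records the order in which attributes are added. This is an illustrative outline, not Weka's code. The `score` callback stands in for the subset evaluator (in practice it could be, for example, the negative RMSE of a Gaussian process model trained on the subset), and the toy weights below are hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def greedy_forward_ranking(
    attributes: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
) -> List[Tuple[str, float]]:
    """Rank attributes by the order in which a greedy forward search adds them.

    `score` returns the merit of an attribute subset (higher is better).
    """
    selected: List[str] = []
    remaining = list(attributes)
    ranking: List[Tuple[str, float]] = []
    while remaining:
        # Add the attribute that gives the best subset score at this step.
        best = max(remaining, key=lambda a: score(frozenset(selected + [a])))
        merit = score(frozenset(selected + [best]))
        selected.append(best)
        remaining.remove(best)
        ranking.append((best, merit))
    return ranking


# Toy scoring function with hypothetical weights, for illustration only.
TOY_WEIGHTS = {"a": 1.0, "b": 3.0, "c": 2.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(TOY_WEIGHTS[name] for name in subset)


ranking = greedy_forward_ranking(["a", "b", "c"], toy_score)
print(ranking)
```

Because each step re-evaluates every remaining attribute, the number of subset evaluations grows quadratically with the number of attributes.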


Ranking of all the attributes

Attribute Rank
Retail_center_distance_1mileRadius 1
Percentage_household_income_75_100_2mileRadius 2
Percentage_population_55_64_2mileRadius 3
Quick_service_restaurants_1mileRadius 4
Median_age_5mileRadius 5
Room_count_1mileRadius 6
Full_service_restaurants_1mileRadius 7
Total_sqft_gross_lease_area_1mileRadius 8
Median_household_income_2mileRadius 9
Median_age_2mileRadius 10
Total_workplace_population_5mileRadius 11
Median_household_income_5mileRadius 12
Total_workplace_population_2mileRadius_15_64 13
Total_meeting_space_1mileRadius 14
Percentage_population_15_17_2mileRadius 15
Percentage_household_income_500above_2mileRadius 16
Percentage_population_25_34_5mileRadius 17
Percentage_population_45_54_2mileRadius 18
Percentage_household_income_50_75_5mileRadius 19
Percentage_population_10_14_2mileRadius 20
Percentage_household_income_15_25_2mileRadius 21
Percentage_population_45_54_5mileRadius 22
Percentage_household_income_25_35_2mileRadius 23
Percentage_household_income_125_150_5mileRadius 24
Percentage_population_55_64_5mileRadius 25
Total_population_2mileRadius 26
Percentage_population_15_17_5mileRadius 27
Percentage_population_35_44_2mileRadius 28
Percentage_household_income_0_15_5mileRadius 29
Percentage_population_5_9_2mileRadius 30
Total_population_5mileRadius 31
Percentage_household_income_200_500_2mileRadius 32
Percentage_population_0_5_5mileRadius 33
Percentage_household_income_50_75_2mileRadius 34
Percentage_population_21_24_5mileRadius 35
Percentage_household_income_15_25_5mileRadius 36
Percentage_population_25_34_2mileRadius 37
Percentage_household_income_100_125_5mileRadius 38
Percentage_population_35_44_5mileRadius 39
Percentage_population_21_24_2mileRadius 40
Percentage_population_5_9_5mileRadius 41
Percentage_household_income_25_35_5mileRadius 42

Percentage_population_10_14_5mileRadius 43
Percentage_household_income_35_50_2mileRadius 44
Percentage_population_0_5_2mileRadius 45
Percentage_household_income_100_125_2mileRadius 46
Percentage_population_65above_5mileRadius 47
Percentage_household_income_150_200_5mileRadius 48
Percentage_household_income_0_15_2mileRadius 49
Percentage_household_income_35_50_5mileRadius 50
Percentage_household_income_125_150_2mileRadius 51
Percentage_population_18_20_5mileRadius 52
Percentage_household_income_75_100_5mileRadius 53
Percentage_household_income_200_500_5mileRadius 54
Percentage_population_18_20_2mileRadius 55
Percentage_household_income_500above_5mileRadius 56
Percentage_population_65above_2mileRadius 57
Percentage_household_income_150_200_2mileRadius 58

Snapshot of the Attribute ranking from the Weka Tool

Figure: Snapshot from the attribute selector window of Weka tool (comments in red included later)

Suggestions and recommendations for business based on our model


Based on the data available and the parameters from which the total number of customers per week
is calculated, it is more appropriate to make suggestions for NEW ENTRANTS.

(For existing businesses, strong recommendations on improving their business can be made only if
parameters such as increase in population, service level of restaurants, food menu, price, occupancy
ratio, etc. are included.)

From our model, the following suggestions can be made:

1. A new entrant should open the restaurant near retail centers such as malls and shopping
centers, so that the business has a high probability of getting more customers. Existing
businesses can improve their total number of customers under two conditions: they move
closer to a retail center, or a new retail center opens in their vicinity.
2. Households with an income in the 75-100 range within a 2-mile radius are more likely to eat
at restaurants, as per the data provided. So a restaurant should be opened at a location that
has a larger number of households in the 75-100 income range within a 2-mile radius.
3. A larger share of people in the 55-64 age group within a 2-mile radius does not have a
positive impact on the restaurant business; this age group is less likely to eat at
restaurants. So new entrants should try to start their business in a place with fewer people
in this age group.
4. An increase in the number of quick service restaurants within a 1-mile radius will drastically
reduce the number of customers, since increased competition reduces market share.
5. An increase in median age within both the 2-mile and 5-mile radii will have a negative impact
on the number of customers; young customers are more likely to eat at restaurants, so
businesses should concentrate more on young customers.
6. An increase in the number of rooms available within 1 mile will increase the probability of
more customers visiting the restaurant.
7. As per the model, full service restaurants are concentrated mostly in places where there are
higher numbers of customers.
8. An increase in the total lease area available within a 1-mile radius is also positive for the
restaurant business.

Conclusion
Finally, we conclude that a restaurant business is more likely to prosper if it has the following
properties:

a. It is near a retail center
b. There are few quick service restaurants nearby
c. There is a high number of young people in the vicinity
d. There is a high number of people with moderate-to-high incomes
e. There are more rooms available within a 1-mile radius and a large total lease area available
within a 1-mile radius
