- SAP Community
PAL (Predictive Analysis Library) is an optional component of SAP HANA whose main purpose is to enable modelers to perform
predictive analysis over large volumes of data. If this is the first time you've heard of PAL, I recommend reading the official
documentation. You can also take a look at my prior post, where I talk about the Apriori algorithm.
In this post I’m going to focus on how to use the K-Means clustering algorithm included in PAL because it’s one of the most
popular and commonly used algorithms in data mining. But before we jump into the code, let’s talk about how the algorithm
works.
According to Wikipedia, “clustering is the task of grouping a set of objects in a way that objects in the same group (called
cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters)”. In other words,
it means grouping your data into multiple clusters. The most common use case for a clustering algorithm is customer
segmentation, meaning you use a clustering algorithm to divide your customer database into multiple groups (or clusters)
based on how similar the customers are or how similarly they behave, e.g., by age, gender, interests, spending habits and so on.
The K-Means algorithm works in a very simple way (at least for me, since I don’t have to code it in C++ :)). The first step is
to plot all the database objects in a space where each attribute is a dimension. So if we use a two-attribute data set, the
resulting chart is a two-dimensional scatter plot with one point per object.
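To make the "each attribute is a dimension" idea concrete, here is a minimal Python sketch of the standard K-Means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members). This is only an illustration of the general technique, not PAL's implementation: the `kmeans` helper, the sample values and the choice of starting centroids are all assumptions made up for this example.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Plain K-Means on a list of equal-length tuples (illustrative helper)."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the assignments are stable
        centroids = new_centroids
    return labels, centroids

# Made-up sample: six objects described by two percentage attributes,
# so each object is a point in a two-dimensional space.
objects = [(90.0, 85.0), (80.0, 95.0), (85.0, 90.0),
           (20.0, 30.0), (10.0, 25.0), (15.0, 35.0)]

# Assumption: start from two of the objects as initial centroids.
labels, centroids = kmeans(objects, centroids=[objects[0], objects[3]])
print(labels)
print(centroids)
```

Starting centroids are passed in explicitly here to keep the sketch deterministic; real implementations usually offer several initialization strategies, and the result can depend on which one is used.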
The first step is creating a table that will contain information on customers' mobile phone usage habits, with the following
structure:
"DAY_TIME_CALLS" DOUBLE, --> Percentage of Calls made during day time hours (9 a.m. - 6 p.m.)
"WEEK_DAY_CALLS" DOUBLE, --> Percentage of Calls made during week days (Monday thru Friday)