Building an Analytical Platform

I was recently asked to build an analyti-
cal platform for a project. But what is an
analytical platform? The client, a retailer,
described it as a database where it could
store data and as a front end where it
could do statistical work. This work
would range from simple means and
standard deviations through to more
complex predictive analytics that could
be used, for example, to analyze past
performance of a customer to assess the
likelihood that the customer will exhibit a
future behavior. Or it might involve using
models to classify customers into groups
and ultimately to bring the two processes
together into an area known as decision
models. The customer had also come up
with an innovative way to resource the
statistical skills needed. It had offered
work placements to master’s degree
students studying statistics at the local
university and arranged for them to work
with the customer insight team to
describe and develop the advanced
models. All the customer needed was a
platform to work with.
From a systems architecture and
development perspective, we could
describe the requirements in three rela-
tively simple statements:
1. Build a database with a very simple
data model that could be easily
loaded, that was capable of support-
ing high-performance queries, and
that did not consume a massive
amount of disk space. It would also
ideally be capable of being placed in
the cloud.
2. Create a Web-based interface that
would allow users to securely log on,
to write statistical programs that
could use the database as a source of
data, and to output reports and graphics
as well as to populate other
tables (for example, target lists) as a
result of statistical models.
3. Provide a way to automate the
running of the statistical data models,
once developed, so that they can be
run without engaging the statistical
development resources.
Of course, time was of the essence
and costs had to be as low as possible –
but we’ve come to expect that with
every project!
Step 1: The database
Our chosen solution for the database
was an SAP® Sybase® IQ database, a
technology our client was already famil-
iar with. SAP Sybase IQ is a column-store
database. This means that instead of
storing all the data in its rows, as many
other databases do, the data is organized
on disk by the columns. For example, if a
column contains a field for country, it will
have the text of each country (for exam-
ple, “United Kingdom”) stored many
times. In a column-store database the
text is stored only once and given a
unique ID. This is repeated for each
column and therefore the “row” of data
consists of a list of IDs linked to the data
held for each column.
This approach is very efficient for
reporting and analytical databases.
Replacing text strings with an identifier
means that significantly less space is
used. In our example, “United Kingdom”
would occupy 14 bytes, while the ID
might occupy only 1 byte – reducing the
storage for that one value in that one
column by a ratio of 14:1 – and this
compression effect is repeated for all
the data. Furthermore, because there is
less data on the disk, the time taken to
read the data from disk and to process
it for queries is significantly reduced,
which massively speeds up the queries
too. Finally, each column is already
indexed, which again helps the overall
query speed.
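The dictionary-encoding idea described above can be sketched in R itself: R’s factor type stores each distinct string once (the levels) and keeps only a small integer code per row, which is the same encoding a column store applies on disk. The country values below are purely illustrative:

```r
# A column of country names with many repeated values
country <- c("United Kingdom", "France", "United Kingdom",
             "Germany", "France", "United Kingdom")

# A factor stores each distinct string once and replaces each
# row's value with an integer code into the level table
country_encoded <- factor(country)

levels(country_encoded)      # "France" "Germany" "United Kingdom"
as.integer(country_encoded)  # codes per row: 3 1 3 2 1 3
```

Six 14-byte strings shrink to one copy of each distinct value plus six small codes, which is the 14:1-style saving described above.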
An incidental but equally useful
consequence of using a column-store
database such as SAP Sybase IQ is that
there is no advantage in creating a star
schema as a data model. Instead, hold-
ing all the data in one large wide table is
at least as efficient. This is because storing
each column with a key means that
the underlying storage of data is a star
schema. Creating a star schema in a
column-store database rather than a
large single table would mean incurring
unnecessary additional join and process-
ing overhead.
As a result of choosing SAP
Sybase IQ’s column-store database
we are able to have a data model that
consists of a number of simple single
table data sets (one table for each
different type of data to be analyzed)
that is quick to load and to query.
It should be noted that this type of
database solution is less efficient for
online transaction processing (OLTP)
applications because of the cost of doing
small inserts and updates. However, this
is not relevant for this particular
use case.
The solution can be deployed only on
a Linux platform. We use Linux for three
reasons. First, RStudio Server Edition is
not yet available for Microsoft Windows.
Second, precompiled packages for all
elements of the solution on Linux reduce
the install elort. And third, hosted Linux
environments are normally cheaper than
Windows environments due to the cost
of the operating system license. We
chose CentOS because it is a Red Hat
derivative that is free.
One additional advantage of this solu-
tion for some organizations is the ability
to deploy it in the cloud. Since the solu-
tion requires lles to be renotely deliv-
ered, and since all querying is done via a
Web interface, it is possible to use any
colocation or cloud-based hosting
provider. Colocation or cloud deploy-
nent olers a low start·up cost, reduced
systems management overhead, and
access for both data delivery and data
access. The system requires SSH access
for management; FTP, SFTP, or SCP for
lle delivery, and the R8tudio Web service
port open. The RStudio server uses the
server login accounts for security but
can also be tied to existing LDAP
infrastructure.
Step 2: Statistical tools and Web
interface
There are a number of statistical tools
in the market. Most are very expensive,
prohibitively so in this case, and the
associated skills are hard to come by
and expensive. However, since 1993 an
open-source programming language
called R (www.r-project.org) for statisti-
cal computing and graphics has been
under development. It is now widely used
among statisticians for developing statis-
tical software and data analysis, is used
by many universities, and is predicted to
become the most widely used statistical
package by 2015. The R project provides
a command line and graphical interface
as well as a large open-source library of
useful routines (http://cran.r-project.
org) and it is available as packaged soft-
ware for most platforms including Linux.
In addition, a second open-source proj-
ect called RStudio (http://rstudio.org)
provides a single integrated development
environment for R and can be deployed
on a local machine or as a Web-based
service using the server’s security model.
In this case, we implemented the server
edition in order to make the entire envi-
ronment Web based.
So in two simple steps (download and
install R, followed by download and
install RStudio) we can install a full
Web-based statistical environment. Note
that some configuring and prerequisite
packages may be required depending on
your environment, but these are well
documented on the source Web sites
and in general automatically download if
you are using a tool such as yum.
The next step was to get access to the
data held in our SAP Sybase IQ server.
This also proved to be very straightforward.
An SAP Sybase white paper
(www.sybase.com/files/Thankyou_
Pages/R_WP_V2.5-final_1.pdf)
describes the process, which can be
simply stated as:
1. Install the RJDBC package.
2. Set up a JDBC connection.
3. Establish your connection.
4. Query the table.
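In R, those four steps might look like the sketch below. The jar path, server name, port, and credentials are illustrative placeholders, not the client’s actual configuration; the driver class shown is the standard jConnect class shipped with SAP Sybase IQ:

```r
# Load the RJDBC package (install.packages("RJDBC") if needed)
library(RJDBC)

# Set up the JDBC driver; the classPath must point at the
# jConnect jar from your SAP Sybase IQ installation
drv <- JDBC(driverClass = "com.sybase.jdbc4.jdbc.SybDriver",
            classPath   = "/opt/sybase/jConnect-7_0/classes/jconn4.jar")

# Establish the connection (host, port, and credentials are placeholders)
conn <- dbConnect(drv, "jdbc:sybase:Tds:iqserver:2638", "dba", "password")

# Query a table into an ordinary R data frame
sales <- dbGetQuery(conn, "SELECT * FROM sales_history")

dbDisconnect(conn)
```

The result of `dbGetQuery()` is a plain data frame, so everything downstream is ordinary R.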
We now have an R object that contains
data sourced from SAP Sybase IQ that
we can work with. And what is amazing is
that it took me less than half a day to
build the platform from scratch.
At this point data has to be loaded and
the statisticians can get to work.
Obviously this is more time consuming
than the build, and over the days and
weeks the analysts created their models
and produced the results.
For this exercise we used our in-house
extract, transform, and load (ETL) tool to
create a repeatable data extraction and
load process, but it would have been
possible to use any of a wide range of
tools that are available for this process.
Step 3: Automatically running the
statistical models
Eventually a number of models for
analyzing the data had been created and
we were ready to move into a production
environment. We automated the load of
the data into the agreed single-table
structure and wanted to also run the
data models.
Figure: Analytical platform architecture. Any network-connected
computer with a browser accesses RStudio Server Edition (R) on the
CentOS analytical platform server; files are delivered by (S)FTP/SCP
and read by the ETL engine, which writes to the SAP Sybase IQ
database; R reads from and writes to the database over an R/JDBC
connection.
SAP Sybase IQ has the functionality
to create user-defined functions (UDFs).
These C++ programs “talk” to a process
known as Rserve, which in turn executes
the R program and returns the results
to SAP Sybase IQ. This allows R func-
tions to be embedded directly into SAP
Sybase IQ SQL commands. While setting
this up requires a little more program-
ming experience, it does mean that all
processing can be done within SAP
Sybase IQ.
Alternatively, it is possible to run R from
the command line and call the program
that in turn uses the RJDBC connection
to read and write data to the database.
Having a choice of methods is very
helpful as it means that it can be inte-
grated with the ETL environment in the
most appropriate way. If the ETL tool
requires SQL only, then the UDF route
is the most attractive.
However, if the ETL tool supports
host callouts (as ours does) then running
R programs from a command line callout
is quicker than developing the UDF.
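As a sketch of the command-line route, a model can live in a small R script that the ETL tool invokes as a host callout. The script name, table names, model file, and connection details below are all hypothetical placeholders:

```r
#!/usr/bin/env Rscript
# score_customers.R -- hypothetical batch scoring script, invoked
# by the ETL tool's host callout as: Rscript score_customers.R
library(RJDBC)

# Connection details are placeholders; the driver class is the
# standard jConnect class for SAP Sybase IQ
drv  <- JDBC("com.sybase.jdbc4.jdbc.SybDriver",
             "/opt/sybase/jConnect-7_0/classes/jconn4.jar")
conn <- dbConnect(drv, "jdbc:sybase:Tds:iqserver:2638", "dba", "password")

# Read the new data, apply a model previously saved by the
# analysts, and write the resulting target list back
customers <- dbGetQuery(conn, "SELECT * FROM customer_history")
model     <- readRDS("churn_model.rds")
customers$score <- predict(model, newdata = customers)

dbWriteTable(conn, "target_list",
             customers[customers$score > 0.5, c("customer_id", "score")])
dbDisconnect(conn)
```

Because the script is just a file on the server, rerunning the model is a one-line callout and needs no statistician involvement.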
Conclusions
Business intelligence requirements are
changing and business users are moving
more and more from historical reporting
into predictive analytics in an attempt to
get both a better and deeper under-
standing of their data.
Traditionally, building an analytical
platform has required an expensive infra-
structure and a considerable amount of
time for setup and deployment.
By combining the high performance,
low footprint of SAP Sybase IQ with the
open-source R and RStudio statistical
packages, it is possible to quickly deploy
an analytical platform in the cloud for
which there are readily available skills.
This infrastructure can be used both
for rapid prototyping on analytical
models and for running completed
models on new data sets to deliver
greater insight into the data.
About the Author
David Walker has been involved with business
intelligence and data warehousing for over
20 years, first as a user, then with a software
house, and finally with a hardware vendor
before setting up his own consultancy firm
Data Management & Warehousing (http://
datamgmt.com) in 1995.
David and his team have worked around
the world on projects designed to deliver
the maximum benefit to the business by
converting data into information and by
finding innovative ways to help businesses
exploit that information.
David’s project work has given him experi-
ence in a wide variety of industries including
telecommunications, finance, retail, manufacturing,
transportation, and public sector
as well as a broad and deep knowledge of
business intelligence and data warehousing
technologies.
www.sap.com/contactsap
12/08 ©2012 SAP AG. All rights reserved.
SAP, R/3, SAP NetWeaver, Duet, PartnerEdge, ByDesign,
SAP BusinessObjects Explorer, StreamWork, SAP HANA, and
other SAP products and services mentioned herein as well as
their respective logos are trademarks or registered trademarks
of SAP AG in Germany and other countries.
Business Objects and the Business Objects logo, BusinessObjects,
Crystal Reports, Crystal Decisions, Web Intelligence, Xcelsius, and other
Business Objects products and services mentioned herein as well as their
respective logos are trademarks or registered trademarks of Business
Objects Software Ltd. Business Objects is an SAP company.
Sybase and Adaptive Server, iAnywhere, Sybase 365, SQLAnywhere, and
other Sybase products and services mentioned herein as well as their
respective logos are trademarks or registered trademarks of Sybase Inc.
Sybase is an SAP company.
Crossgate, m@gic EDDY, B2B 360°, and B2B 360° Services are registered
trademarks of Crossgate AG in Germany and other countries. Crossgate
is an SAP company.
All other product and service names mentioned are the trademarks of
their respective companies. Data contained in this document serves
informational purposes only. National product specifications may vary.
These materials are subject to change without notice. These materials
are provided by SAP AG and its affiliated companies (“SAP Group”)
for informational purposes only, without representation or warranty of
any kind, and SAP Group shall not be liable for errors or omissions with
respect to the materials. The only warranties for SAP Group products and
services are those that are set forth in the express warranty statements
accompanying such products and services, if any. Nothing herein should
be construed as constituting an additional warranty.