
TDWI NAVIGATOR

Predictive Analytics
By Fern Halper
Contents

FOREWORD
WHAT IS THE TDWI NAVIGATOR?
MARKET OPPORTUNITIES AND OBSTACLES
    Use Cases for Predictive Analytics
    The Value of Predictive Analytics
    Challenges
MARKET FORCES AND STATUS
    Trends
    Adoption
        Adoption of Open Source Tools
        Adoption and Cultural Issues
    Maturity
MARKET LANDSCAPE
    Introduction to the Landscape
    Feature Sets
VENDORS
TDWI ANALYST VIEWPOINT

Inclusion of a vendor, product, or service in TDWI research does not constitute an endorsement by TDWI or its management.
Sponsorship of a publication should not be construed as an endorsement of the organization or validation of its claims.

This report is based on independent research and represents TDWI’s findings; reader experience may differ. The information
contained in this report was obtained from sources believed to be reliable at the time of publication. Features and specifications can
and do change frequently; readers are encouraged to visit vendor websites for updated information. TDWI shall not be liable for any
omissions or errors in the information in this report.

TDWI publications are copyrighted material. Reproduction without express written permission of TDWI is prohibited.
FOREWORD
Predictive analytics is on the verge of widespread adoption. Enterprises are extremely interested
in deploying predictive capabilities. In a recent TDWI survey about data science, about 35% of
respondents said they had already implemented predictive analytics in some way. In a 2017 TDWI
education survey, predictive analytics was the top analytics-related topic respondents wanted to learn
more about.
Predictive analytics is also a dynamic market. Commercial vendors are providing robust, UI-based
solutions for businesses. Numerous upstarts as well as established vendors offer open source solutions
for data scientists. Some vendors provide both. There is also momentum building around utilizing
different techniques for predictive analytics; for instance, machine learning and deep learning are
attracting considerable attention as disparate data volumes continue to grow.
There are a tremendous number of opportunities for organizations in these techniques—and
numerous challenges. Skills obstacles still abound and organizations are struggling to build talent.
There are cultural and technical issues, too.
TDWI’s new Navigator Reports build upon our rich history of research in data management, BI,
and advanced analytics to help our audience plot their routes through a particular market. The
reports provide readers with a greater understanding of what is going on in a specific market as
well as the players in that market. The reports use our primary and secondary research to educate
end users about a particular market and provide key metrics that assess its state. The reports also
include overviews of solution providers so readers know where to look as they begin adding to their
analytics arsenal.
In keeping with TDWI’s vendor neutrality policy, these reports do not rank vendors. However,
the companies profiled in the reports are included specifically because their products meet certain
criteria. These vendors include established leaders in the field as well as up-and-coming companies
that have something unique to offer.
Our inaugural report is on predictive analytics. I trust you will find it useful.

Fern Halper
Vice President, TDWI Research 

WHAT IS THE TDWI NAVIGATOR?
Most organizations now realize that they need data and analytics to be successful, and many organizations we speak with
are interested in growing their data and analytics capabilities. They have questions, though. They want to understand the
opportunities and the challenges others adopting the technology have faced. They want to understand where a technology is
heading and whether it is being adopted by their competitors. They also want to understand what vendors have to offer.
The goal of this TDWI Navigator Report is to help companies traverse a particular market by providing them with information
across three critical categories (see Figure 1):

1. MARKET OPPORTUNITIES & CHALLENGES

This category includes how others are using the technology, the value they are obtaining from it, and the challenges they are facing. This information is useful on two fronts.

First, understanding use cases, value, and opportunities can help an organization understand whether it should adopt a technology. It can also be important to help "sell" the solution to others in the organization.

Second, understanding technology and organizational challenges can help organizations know what to expect and perhaps help get ahead of issues before they arise.

2. MARKET FORCES AND STATUS

This category includes three focus areas: trends, adoption, and maturity. Trends are significant in understanding the forces shaping the market in order to determine what might be important when starting or expanding a technology project. Adoption is important so organizations can learn where others are with the technology and thus become aware of whether they are ahead of or behind their competitors. Because adoption also depends on culture, we examine that element in this report, too. TDWI has been measuring analytics maturity for the past few years via our analytics maturity assessments (see TDWI.org/assessments), and these results are key to learning more about a particular market. It is one thing to adopt a technology, but another to be mature in it.

3. MARKET LANDSCAPE

Of course, it is important to understand the vendors that provide solutions in a particular market. Who are they? What kinds of customers do they target? What are their unique differentiators? What features and functionalities do they offer that are important for success?

FIGURE 1. TDWI Navigator Framework.

MARKET OPPORTUNITIES AND OBSTACLES

Although it has been around for decades, predictive analytics has been receiving significant market attention recently as end users become increasingly interested in the value it can provide. A variety of market forces have joined to make the growing interest in predictive analytics possible: an increase in computing power (CPU and GPU) that makes it faster to perform iterative calculations; a better understanding of the value of the technology; vendors making their tools easier to use; the advent of big data; and the introduction of new algorithms and open source options.

Predictive analytics is important because it changes the nature of analysis from reactive to proactive. BI does a good job slicing and dicing data to help answer questions such as what happened or what is happening. It can also provide dashboards and visualizations for exploration. However, it cannot estimate targets of interest (called outcomes or labels). These outcomes might include: Who will respond to a promotion? Who will drop my service? When will a piece of equipment fail? Is that image cancer? Predictive analytics opens new possibilities to answer business questions.

Predictive analytics includes both statistical and computational science approaches such as those utilized in machine learning. Over time, some of the practices in the two fields have overlapped. For instance, both now include clustering. Some vendors that tout machine learning include regression as part of their machine learning toolkit, although regression has its origins in statistics. Machine learning techniques such as decision trees and neural networks have been used in predictive analytics for years. The important point is that interest has grown as organizations analyze ever-larger amounts of (often disparate) data.

In fact, big data has helped to fuel the excitement about predictive analytics and some of the "newer" machine learning and deep learning hype. The rise of data science and the data scientist is, in many ways, an outgrowth of the need to analyze this big data. Data science is an interdisciplinary field that extracts insights from data. Although traditional predictive modeling was performed primarily by statisticians and other quantitative staff, today data science is often the realm of those who like to code using some of the open source technologies now available (such as R and Python). This has helped to drive a new class of products that use open source foundations and distribute analytics computing tasks over many nodes in a compute cluster. It is a fast-moving landscape.

USE CASES FOR PREDICTIVE ANALYTICS

Companies are using predictive analytics across a range of disparate data types to achieve greater value. Companies are also looking to deploy predictive analytics against their big data. Predictive analytics is being operationalized more frequently as part of a business process. It is being embedded in systems and applications. Some of the top use cases include:

• FRAUD. Many organizations are using predictive analytics to predict fraudulent transactions. This runs the gamut from financial institutions predicting fraudulent credit card activity to utility companies determining fraudulent electricity usage. The IRS also uses predictive analytics to identify tax fraud.

• RISK ANALYSIS. Risk analysis has many flavors. Financial institutions use predictive analytics to determine portfolio and investment risk. Insurance companies use it to predict future claim rates to price insurance. In verticals such as healthcare, analysts are putting predictive analytics to work to determine the risk of patients developing an infection or being readmitted to a hospital. In education, predictive analytics is used to determine the risk of students not passing their classes and dropping out of school. Across multiple industries, HR is employing predictive analytics to predict the risk of employee churn.

• CUSTOMER-RELATED ANALYTICS. Understanding customer behavior is one of the most popular use cases for predictive analytics. This is true across industries. Likewise, sales and marketing departments are often among the first users of predictive analytics. The use cases for customer-focused predictive analytics are wide and varied, including churn and retention analysis, up-selling, next-best offer, customer sentiment, customer loyalty, recommendation engines, and customer journey.

• PREDICTIVE MAINTENANCE. Predictive maintenance, an Internet of Things (IoT) use case, is rapidly becoming
popular in predictive analytics. Here, data from sensors and other devices is used to determine when a part failure might occur. This can be accomplished using predictive analytics and/or rules-based logic. For instance, a manufacturer might use sensor data from trucks to determine whether and when maintenance is needed. The company could use analytics—for example, a moving average of temperature from specific parts or a predictive model that was built using historical data of failed parts—to determine if there is a problem. The predictive model would be embedded into a system and operationalized to generate alerts or take automated action when new data indicates a problem. This kind of application is being used in many industries, including oil and gas, utilities, manufacturing, healthcare devices, and transportation.

• SMART APPLICATIONS. Machine learning is being embedded into many applications, from recommendation engines to dashboards. Predictive analytics is also embedded into applications to support new business models and intelligent applications. Examples include dynamic pricing (eBay), digital advertising, and smart consumer apps such as Pandora and Waze.

THE VALUE OF PREDICTIVE ANALYTICS

Value can come in many forms, such as a benefit or an increase of usefulness. Often, organizations need to sell the value of a technology to their management before securing buy-in.

For predictive analytics, value can include better decision making, a stronger understanding of behavior, increased insights, improved operational efficiencies, decreased risk, reduced costs, or increased revenue. It is not necessarily measured in ROI. For instance, in a 2014 TDWI study, close to 40% of respondents reported not being required to calculate ROI before deploying predictive analytics.

That said, there have been several studies that illustrate how using analytics can measurably improve revenue. For example, MIT researchers Andrew McAfee and Erik Brynjolfsson found that companies that utilize big data and analytics show productivity rates and profitability that are 5% to 6% higher than those of their peers.1 TDWI has also seen that predictive analytics does drive value. For instance, in that same 2014 study, 46% of respondents using predictive analytics measured a top- or bottom-line impact (or both). Twenty-eight percent believed they had become more effective or efficient but had not specifically measured an impact, and 26% simply gained more insight. Not surprisingly, those who have been using the technology longer are more likely to measure impact.

More recently, a 2017 TDWI survey revealed respondents using predictive analytics are more likely to state they have had a higher degree of success with their analytics programs than those who are not using the technology. As Figure 2 illustrates, there is an 8% increase in the high success category for those who use predictive analytics compared to the overall respondent pool. Unsurprisingly, those organizations that identify their skill with the technology as either intermediate or expert report an even higher success (i.e., value) rate, 56% versus 30%.

FIGURE 2. Greater predictive analytics skill leads to greater overall analytics success. Based on 150 respondents. Source: 2017 Teams, Skills, and Budgets Survey.

                                        High Degree   Moderate Degree   Low Degree
Overall                                     30%             58%            12%
Using predictive analytics                  38%             56%             6%
Expertise with predictive analytics         56%             41%             3%

TDWI research has also pointed to some best practices for gaining value with predictive analytics. For instance, we've seen that those organizations that use disparate data sources as part of their predictive analytics activities are more likely to measure value. Those that use a range of advanced analytics techniques (such as text analytics or geospatial analytics) together with their predictive activities are also more likely to measure value. Those enterprises that operationalize their analytics (so they can take action on the results) are also more likely to measure value (see Figure 3).
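The PREDICTIVE MAINTENANCE use case described earlier—comparing a moving average of part temperatures against a level associated with historical failures—is one concrete way operationalized analytics can generate alerts. A minimal sketch in Python; the window size, threshold, and readings are illustrative assumptions, not values from TDWI research:

```python
from collections import deque

def moving_average_monitor(readings, window=5, threshold=85.0):
    """Return (index, average) alerts for each reading whose trailing
    moving average exceeds the threshold (e.g., degrees Celsius)."""
    recent = deque(maxlen=window)
    alerts = []
    for i, temp in enumerate(readings):
        recent.append(temp)
        avg = sum(recent) / len(recent)
        # Only alert once a full window of readings is available
        if len(recent) == window and avg > threshold:
            alerts.append((i, round(avg, 1)))
    return alerts

# Example: temperatures drift upward as a part degrades
temps = [70, 72, 71, 73, 74, 80, 85, 91, 95, 99]
print(moving_average_monitor(temps))  # -> [(9, 90.0)]
```

In production, as the report notes, this logic would be embedded into a system so that alerts or automated actions fire as new sensor data arrives.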

FIGURE 3. Three steps for analytics value.

1. SUPPLEMENT TRADITIONAL DATA: text, geospatial, machine data, streaming
2. SUPPLEMENT PREDICTIVE ANALYTICS: NLP/text analytics, geospatial, optimization
3. OPERATIONALIZE PREDICTIVE MODELS: embed in applications, in-database scoring, part of a business process
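Step 3—operationalizing a model so applications can act on its output—often means reducing a trained model to a small scoring function that can be embedded in an application or translated to SQL for in-database scoring. A minimal sketch; the coefficients and field names are invented stand-ins for a real trained churn model:

```python
import math

# Hypothetical coefficients from an already-trained logistic churn model
COEFFS = {"intercept": -2.0, "months_inactive": 0.8, "support_tickets": 0.5}

def score_churn(months_inactive, support_tickets):
    """Logistic scoring: returns a churn probability in [0, 1]."""
    z = (COEFFS["intercept"]
         + COEFFS["months_inactive"] * months_inactive
         + COEFFS["support_tickets"] * support_tickets)
    return 1.0 / (1.0 + math.exp(-z))

def churn_action(customer):
    """Operationalized decision: act when the score crosses a cutoff."""
    p = score_churn(customer["months_inactive"], customer["support_tickets"])
    return "offer retention discount" if p > 0.5 else "no action"

print(churn_action({"months_inactive": 4, "support_tickets": 2}))
```

The point of the sketch is the shape, not the numbers: once scoring is a plain function of a customer record, it can run inside a business process rather than only in an analyst's tool.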

1 McAfee, A., and E. Brynjolfsson [2012]. "Big Data: The Management Revolution," Harvard Business Review, October. https://hbr.org/2012/10/big-data-the-management-revolution

A GLOSSARY OF PREDICTIVE TERMINOLOGY

• PREDICTIVE ANALYTICS: Statistical, mathematical, and data-mining algorithms and techniques that
can be used on both structured and unstructured data to determine the probability of future
outcomes. Popular predictive analytics techniques include regression, classification, and clustering.
• MACHINE LEARNING: A kind of data analysis where systems learn from data to identify patterns
with minimal human intervention. The computer learns from examples using either supervised or
unsupervised approaches. In supervised learning, the system is given a target (or output or label)
of interest. The system is trained on these outcomes using various attributes (called features). In
unsupervised learning, there are no outcomes specified. Machine learning has been around for
decades; however, what has changed is the volume and diversity of data—as well as the compute
power to find insights in that data faster. Machine learning is often used in predictive analytics.
• DEEP LEARNING: A subset of machine learning that uses algorithms whose goal is to learn functions
that can classify complex patterns, such as images. Some deep learning algorithms, such as
artificial neural networks, have input nodes and a number of hidden layers that act as a kind
of “black box” to model one or more output nodes. Whereas early neural networks could not
recognize complex patterns, some algorithmic advances were made in the last decade or so and
those advancements, together with the availability of vast compute power, have made deep
learning feasible as well as more accurate than some thought possible. This has spurred greater
interest in deep learning for use in audio-, image-, and text-classification problems.
• DATA SCIENCE: An interdisciplinary field that extracts insights from data. Data science uses
predictive analytics techniques as well as other advanced analytics to find these insights. In many
ways, data science is an evolution of advanced analytics for analyzing big data. This has also
resulted in more computer scientists becoming involved in the process.
• CITIZEN DATA SCIENTIST: The next generation of statistical explorers, sometimes from nontraditional
backgrounds, who are variously self-taught, self-starting, self-sufficient, and self-service oriented.
These are often business analysts who may not have formal training in statistics or math.
• OPEN SOURCE: A collaborative development model where source code is made freely available while
the copyright holder retains the rights to study, change, or distribute the code. Open source is
popular because it offers a low-cost community of innovation that appeals to many data scientists
and application developers.
• PYTHON: An interpreted, interactive, object-oriented scripting language available through the
Python Foundation. Many developers favor Python because it is an easy-to-use, general-purpose
programming language with a number of libraries for analytics.
• R: A language and environment for statistical analysis, part of the GNU free software project. It includes data handling and storage facilities, a large set of tools for data analysis, and a programming environment.
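The supervised/unsupervised distinction above can be made concrete with a small sketch. Here a nearest-centroid classifier stands in for supervised learning: labeled examples (features plus a known outcome) train the model, which then predicts the label of new records. The data and labels are invented for illustration; removing the labels would turn the same setup into an unsupervised clustering problem.

```python
def train_centroids(examples):
    """Supervised training: average the feature vectors for each label."""
    sums, counts = {}, {}
    for features, label in examples:
        s = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in s] for label, s in sums.items()}

def predict(centroids, features):
    """Assign the label of the nearest centroid (squared distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(centroids, key=lambda label: dist(centroids[label]))

# Labeled training data: ([feature values], outcome of interest)
training = [([1.0, 1.0], "churn"), ([1.2, 0.8], "churn"),
            ([5.0, 5.0], "stay"), ([4.8, 5.2], "stay")]
model = train_centroids(training)
print(predict(model, [1.1, 0.9]))  # -> churn
```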

CHALLENGES

Although predictive analytics can provide significant value for those organizations that use it, there are also obstacles to overcome (Figure 4).

We consistently see in our research that lack of skills ranks at the top of the list of obstacles for any kind of more advanced analytics adoption. This includes a lack of skilled personnel who understand the techniques and the technology. To address skills issues, organizations often look to hire data scientists as well as to train business analysts. Vendors offer automated model-building tools, but TDWI believes it is still important for users to have a basic understanding of the techniques involved. There are numerous training options available to meet different budgetary restrictions.

• ONLINE. Numerous online options are available for those looking to understand certain techniques. TDWI offers online training, as do other sites such as Coursera and Dataversity. Some are free. Certificate options are available through some of these sites. For those looking for more in-depth training, there are many online university options.

• ONSITE. Some educational organizations will come to a company site to do training (TDWI does this). Some organizations that already have skilled personnel on board may train from within.

• CONFERENCE-BASED TRAINING. There are also options for external training. These are typically week-long conferences that offer full-day or half-day classes. Boot camps are also popular.

• VENDOR TRAINING. Vendors typically offer training on predictive analytics and their toolsets.

Organizations cite a number of technical challenges related to predictive and advanced analytics. One of the top challenges is data integration. Most organizations have their data in multiple data sources. They need a way to bring this data together. Many of the vendors profiled in this Navigator report provide tools to access disparate data sources as well as tools for data preparation as part of their solution to help analysts and data scientists. Some also provide functionality for profiling data to assess its quality.

Another top technical obstacle that organizations face is the lack of an enterprise data architecture—especially when they are dealing with big data. Many organizations look to modernize their data architecture for predictive analytics because the data warehouse isn't always the best place to experiment with data or to build models. Additionally, if an organization is dealing with big data, it may need to move

FIGURE 4. What keeps organizations from adopting advanced and predictive analytics. Based on 352 respondents. Source: 2016 TDWI Best Practices Report: Data Science and Big Data.

Lack of understanding of technologies: 40%
Lack of skilled analytics professionals: 40%
Lack of enterprise data architecture: 38%
Data integration: 34%
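The data integration and profiling work mentioned above often starts with joining records from disparate sources on a shared key and then checking completeness of the merged result. A minimal sketch; the field names and records are invented for illustration:

```python
def join_on_key(left_rows, right_rows, key):
    """Left-join two record sources (lists of dicts) on a shared key."""
    right_by_key = {row[key]: row for row in right_rows}
    merged = []
    for row in left_rows:
        combined = dict(row)
        combined.update(right_by_key.get(row[key], {}))
        merged.append(combined)
    return merged

def profile_completeness(rows, field):
    """Profiling: fraction of rows where a field is present and non-empty."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)

crm = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
billing = [{"id": 1, "monthly_spend": 120.0}]
merged = join_on_key(crm, billing, "id")
print(profile_completeness(merged, "monthly_spend"))  # -> 0.5
```

A completeness score like this is the kind of signal data preparation tools surface before any model building begins.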

to a distributed architecture to distribute the workload and maintain performance. Organizations have adopted Hadoop, appliances, in-memory analytics platforms, and the cloud as part of their evolving architecture. Many of the products on the market today allow the user to add their own analytics into these platforms—to bring the analytics to the data.

Organizations should think about whether these potential challenges will affect them and how they plan to deal with them. TDWI provides a number of best practices reports to help businesses think through some of these challenges.2

MARKET FORCES AND STATUS

TRENDS

Although predictive analytics is not new, several recent trends are important to understand when thinking about implementing the technology. These developments impact what vendors offer and how they offer it. These trends can also influence what an organization chooses to deploy and how it chooses to deploy it.

In some cases, the vendors are driving the trend; in others, organizational needs are the driver. For some, it is both. Some key trends include:

EASY-TO-BUILD MODELS VIA AUTOMATED MODEL BUILDING. Vendors are always trying to make their tools easier to use. This is evident from many of the visual UIs and wizards provided with many commercial products. However, this has gone a step further, driven by the skills gap in many organizations.

Because statisticians and data scientists are expensive and often hard to come by, many organizations are trying to build these skills internally. This has helped to give rise to the "citizen data scientist" who often uses automated model building found in some tools to "easily" build predictive models. Here, the business analyst provides the tool with the outcomes of interest and the attributes that might predict that outcome. The tool builds the model and provides the results. It is then up to the business analyst to interpret the results and understand the model. Some of these tools are not flexible enough for data scientists or statisticians who want to select their own models to build and will want to iterate on the model to make it more accurate. However, certain vendors have provided various user interfaces, often referred to as personas, for different kinds of users.

These easy-to-use tools are offered both on premises and in the cloud, and are an option worth considering when an organization starts to build predictive models. A few caveats include the need for conducting proper training and establishing control processes prior to any model going into production to ensure there are no issues with it.

THE RISE OF OPEN SOURCE. Organizations are becoming increasingly interested in open source technologies such as R, Scala, and Python for predictive analytics because most open source products have an active and innovative user community, which appeals to many data scientists and application developers. Open source tools are often adopted by recent graduates entering the job market who were trained on these tools and want to have a choice in the tools they use on the job.

As such, the market is evolving to better support open source. Some vendors provide commercialized open source solutions as part of their predictive analytics offering—often in a data science workbench environment. Others have opened their proprietary GUI-based products to open source. For instance, many vendors provide users with the capability of calling a model built in an open source environment into their tools; a few let users call commercial models from open source tools. Some vendors even provide nodes on their visual GUI interfaces for open source environments such as R or Python. All that is required is that the users load these open source environments onto their machines.

Open source options are important to consider because of the range of functionality they can provide. That said, when we asked respondents in a recent survey how they will deliver on big data analytics projects, respondents were as likely to use commercial software as they were to use open source tools such as R. In many ways, it will depend on the team. Data scientists tend to like open source programming environments. Others may prefer commercial software with a UI and full-feature functionality.

INCREASING MATURITY OF TOOLS. Predictive analytics tools are generally becoming more mature in terms of the features and functionality they provide as part of the overall analytics

2 See TDWI Best Practices Reports on predictive analytics, big data, and data science online at tdwi.org/bpreports. Also look at our Checklist series on big data maturity at tdwi.org/checklists.

life cycle—from data preparation to model building, model management, and model updating. This maturity is especially evident in traditional products, although a few new vendors are also supporting the complete analytics life cycle. These features include data preparation as well as model management and monitoring.

Data preparation includes combining data sets as well as data cleansing, feature selection (e.g., what attributes to use), and feature transformation (e.g., calculating new features from existing data). Model management and model updates are important because organizations often lose track of their models, which then go stale. This is especially true if there are many models to monitor—and often companies have hundreds or even thousands of models to manage. Vendors are now offering tools and techniques to address such problems and deal with the analytics life cycle including model deployment (addressed below).

MODEL DEPLOYMENT. As more organizations take action on predictive models, they often want to put these models into production by operationalizing or embedding analytics into systems and applications. Nevertheless, model deployment is often one of the top two major stumbling blocks for organizations implementing predictive analytics (the other is posing the right question to model in the first place). TDWI research indicates that, on average, models can take about six to nine months to deploy, in part because of politics or compliance issues. However, deployment is a key consideration for predictive analytics and vendors are working on ways to reduce this time.

Predictive Model Markup Language (PMML) is one deployment standard, introduced in the late 1990s. It is an XML-based interchange format that provides a way to represent models in order to share them with systems and applications. However, there are limitations with the standard. Some vendors are supporting a newer standard called Portable Format for Analytics (PFA). Others are enabling deployment using a readable language. For instance, some vendors deploy models in C++ or C# so they can be deployed into a SQL server as a native function. Some output in Java. Some use a Web services model and support a REST API.

The important point is that vendors are making the move to support various deployment options to enable models to go into production more smoothly. However, it is incumbent on the organization to determine who is going to deploy the models. Some organizations separate the duties—there are builders of models and others who deploy and monitor them. This often falls to IT or DevOps in larger organizations. The key is that someone (or a group) is assigned responsibility.

MACHINE LEARNING AND DEEP LEARNING. Deep learning has exploded in popularity because it provides accurate models and can be used against structured data as well as with audio, image, and text data. For example, deep learning is being used in medicine to identify skin cancers. In financial services, it is pushing the frontiers of fraud detection and being used to detect money laundering. In security, it is used for image recognition and threat detection. It is even being used to analyze audio from engines to identify imminent part failure. This trend is important because it opens up new use cases and opportunities to use predictive analytics against large, disparate data sets. Several open source deep learning frameworks have also exploded on the market, including MXNet, Google's TensorFlow, and Berkeley's Caffe.

DATA SCIENCE NOTEBOOKS. Notebooks take their cue from the lab notebooks we are all familiar with from science class. Notebooks (such as Jupyter's popular offering) allow data scientists to create and share documents that contain live code, equations, visualizations, and explanatory text. Data scientists like them because they can interact with the code. The notebooks also provide a means for better note keeping and collaboration. Some predictive analytics vendors are providing support for notebooks, even as nodes in a visual workflow. That said, other vendors are already building tools that provide alternatives to notebooks.

AUTOMATION. Predictive analytics automation comes in several flavors. There is automation of the model-building process through a straightforward interface as we've described. Another kind of automation is the factory approach to model building. Here the builder can put one model flow together and then reuse that flow for other models. Often the same model can be used against different model segments. Automation can be used for model scheduling. Automation can also occur in the model management process. Some tools have features where a model builder can specify rules to alert the organization that the model is degrading and needs to be retrained. Other tools go further and automatically detect

10 
model degradation based on lift or some other parameter. ADOPTION
Depending on your organization’s use cases, automation may Adoption refers to the acceptance and use of a technology in
be quite important. a business. TDWI has been collecting data about predictive
APPLICATION DEVELOPMENT. Part of the shifting dynamics in analytics adoption since 2007. We have seen an increase in
predictive analytics is moving towards building analytics- adoption from approximately 21% in 2007 to about 40%
driven applications. This is different than a statistician or in 2016 (see Figure 5). This puts the technology in the early
data scientist analyzing a data set using predictive analytics mainstream stage of adoption.
tools and techniques to gain insight. Here, developers
build applications that use predictive analytics as a feature.
Examples include recommendation engines on websites, TDWI has seen an increase in
predictive maintenance applications, customer service routing
(involving speech recognition), or finding the right match
predictive analytics adoption from
in a dating app. Such applications will use APIs (application 21% in 2007 to 40% in 2016.
programming interfaces) that abstract some of the complexity
of the technique so the developers do not necessarily need to Since 2013, we’ve seen consistent excitement about adopting
understand the algorithms. predictive analytics. When asked about plans to adopt the
technology in the next few years, generally about 40%–
This trend can have positive and negative effects. On the
50% of respondents who have not yet adopted predictive
one hand, intelligent applications can provide new business
analytics state that they plan to do so soon. However, given
opportunities. On the other hand, any predictive analytics
that adoption rates have not increased dramatically since
model put into production needs to be accurate and make
2013 (when we first started asking the question), it appears
sense for the situation. That means that the developer needs
that organizations are struggling or just slow to adopt the
to have some understanding of the technology.
technology. This may be because mainstream buyers are
more cautious; many want to see that organizations similar

Predictive analytics use has doubled since 2007.

[Figure 5 is a bar chart of adoption by year from 2007 to 2016, rising from 21% to 40%.]

FIGURE 5. Based on approximately 300 respondents per survey. (Results from 2007 include both full and partial implementations.)
to theirs have had success with a particular technology before they take the plunge. However, they shouldn't wait too long or they will risk forfeiting the value that early adopters have gained.

For any technology, some industries are more likely to be early adopters than others. Not surprisingly, we are seeing traction for predictive analytics in the following industries: computer/networking (>50% of respondents), financial services (>40%), insurance (>40%), telecommunications (>50%), federal government (>30%), and software/Internet companies (>35%). Over the past few years we have seen an increase in the number of healthcare providers adopting the technology as well (>35%). Popular use cases include predicting hospital readmissions, population health and risk analysis, and operations management. Recent advances in machine learning and deep learning are paving the way for medical predictions such as identifying cancers. That said, there is still much room for growth in all of these industries.

ADOPTION OF OPEN SOURCE TOOLS
We are also seeing growing interest in open source technologies such as R and Python. Open source is frequently the technology of choice at universities and the go-to platform for data scientists. R and Python both have predictive analytics/machine learning libraries. Other open source options include Weka, KNIME, and RapidMiner. In a 2016 TDWI survey, 38% of respondents felt that R would be very important for data science in 2017; 17% felt that Python would be a popular option (Figure 6). Many vendors support open source options; most traditional commercial vendors have opened their platforms to these tools.

ADOPTION AND CULTURAL ISSUES
Of course, much of the time it isn't the technology that causes businesses to struggle with adoption—it is organizational culture that can stop an effort dead in its tracks. TDWI has been collecting information about cultural issues surrounding adoption for the past few years. We've seen some minor progress, but there are still obstacles. For instance, Figure 7 illustrates the responses of over 1,000 respondents about cultural issues around analytics. Although 54% believe they can express analytics questions in a way that shows value to executives and 43% believe that analytics is a competitive differentiator, only 30% believe their organization is data-driven. Additionally, only 30% of respondents think their organization is able to deal with analytics failures—yet experimentation and innovation with analytics depend on them.

The companies we speak with say there is no silver bullet. Getting the company on board often requires a combination of having an executive with a vision, being persistent, highlighting accomplishments, and most important, building trust.

The three most important tools for data science.

[Figure 6 is a bar chart: R 38%, Spark 22%, Python 17%.]

FIGURE 6. Based on 338 respondents.
Source: 2016 TDWI Best Practices Report: Data Science and Big Data
Organizational views on analytics still need to mature.

[Figure 7 is a bar chart: business understands the value, 56%; we are data-driven, 30%; we accept early failure, 30%; analytics is a differentiator, 43%.]

FIGURE 7. Based on 1,032 responses. (% agree/completely agree)
Source: TDWI Big Data Maturity Model

MATURITY
Whereas adoption looks at the percent of organizations using a particular product, maturity examines how far along they are with the technology. In terms of self-perception, many organizations adopting predictive analytics believe they are still relatively early in their deployments. For instance, in a 2017 TDWI survey with over 200 respondents, 44% felt they were still in the beginner stage of predictive analytics and 28% felt they were at the intermediate or expert level. The rest had not implemented the technology or didn't know.

TDWI has been building maturity assessments since 2013. Maturity in predictive analytics deals with the complete analytics life cycle and is a combination of organizational and technological maturity. Those organizations mature in predictive analytics typically:

• MAKE USE OF DISPARATE DATA SOURCES. The vast majority of organizations that build predictive models do so with structured data. However, more mature organizations often use disparate data sources such as text, geospatial, or machine data in addition to structured data. They use text analytics techniques to extract entities, concepts, and sentiments from text data and use that as input to predictive models. This kind of data can provide lift to models. We've also seen that organizations that use disparate data sources in analytics are more likely to measure value than those that do not. These data sources can enrich a data set and help provide more insight than structured data alone. Many of the vendors profiled in this Navigator report support disparate data types.

• OPERATIONALIZE PREDICTIVE ANALYTICS. Another sign of maturity is the capability to operationalize or make predictive models part of a business process. This can be done in a number of ways, including in-database scoring or embedding analytics into a system using PMML, APIs, or Web services. The importance of operationalizing models is that it helps to make them actionable. For example, predictive analytics can be embedded into a system to uncover fraudulent claims. In this case, a predictive analytics model is built to predict the probability that a claim is fraudulent using historical data consisting of fraudulent and nonfraudulent claims. That model is embedded into a system that scores new claims for potential fraud. Those claims with a high probability of fraud are routed to a special investigation unit for further analysis. TDWI research indicates that of those organizations with predictive analytics in place, about half (50%) have also operationalized it as part of a business process.

• MANAGE AND MONITOR MODELS. A key to success with predictive analytics is the ability to manage and monitor models. As noted in the Trends section, vendors are
Model sharing with IT/DevOps is not very sophisticated.

[Figure 8 is a bar chart: email, 39%; shared folders, 54%; web server, 34%; model registry, 29%.]

FIGURE 8. Based on 141 respondents.
Source: 2017 TDWI Best Practices Report: Advanced Analytics

working to make their software more complete and mature. This often involves providing a model registry to help keep track of the models. Several vendors profiled in this report provide model registries and model management capabilities. Yet we see only about 30% of organizations that embed models into a business process using a model registry (see Figure 8). Fifty-four percent used shared folders, which is not an effective way to manage models. Although IT or DevOps might have its own plan in place to manage the models, it does not appear that way from these results. In addition to managing models, it is also important to monitor them so they do not degrade. Buyers should look for solutions that can manage and monitor models in production, especially as the number of models increases.

MARKET LANDSCAPE

INTRODUCTION TO THE LANDSCAPE
The predictive analytics solutions landscape is dynamic and evolving. It is important to understand the different approaches different vendors are taking to the market. Below are five categories of vendor offerings. Some vendors fall into multiple categories; however, they are listed in Figure 9 in terms of the category they best fit.

FULL LIFE CYCLE VENDORS. These vendors provide commercial solutions for the complete analytics life cycle using their own software. Many of these are robust solutions. They typically have a visual UI that makes it easy to build models using drag-and-drop. They can support automated model building as well as reuse of these flows, and some can scale to support numerous models. Some of these have an integrated approach to support structured data as well as text data. Many of these vendors have opened up their platforms to support open source. They have refactored their existing algorithms to support a distributed big data model. These platforms typically support multiple personas, including the business analyst, the data scientist, and DevOps, via multiple user interfaces.

BIG DATA/DATA SCIENCE WORKBENCHES. These are typically newer entrants (although full life cycle vendors often offer a workbench environment) that are predominantly open source and predominantly focused on big data. The environment typically supports big data and is script-based to support R, Python, and other open source environments. These platforms usually are targeted to the data scientist who codes and likes to code. They may support notebooks or create alternatives to notebook environments. Some vendors in this segment have a particular focus on machine learning and deep learning.

EASY TO USE/AUTOMATED. Whereas other vendor segments support business analysts and data scientists, these vendors target business users and business analysts. Their goal is to provide easy-to-use tools that support predictive analytics (although many full life cycle vendors also offer this; see Figure 10 for the roles supported). This includes automating the model-building process as well as providing black-box predictive analytics or predictive analytics embedded into the application. Here, the business user might simply specify a target variable of interest and receive the best model.

INTELLIGENT SOLUTION FOCUS. These vendors are more apt to work with customers to build a solution using their own proprietary algorithms. Often, these vendors began as consulting companies or their technology tackled extremely complex problems for specific industries, so it is best to have a company representative work with the client.

API FOCUSED. These vendors offer a services approach to applications that includes predictive analytics. These services are typically APIs for easy consumption by developers and data scientists. The vendor might provide machine learning as a service or image classification as a service. Of course, many of the vendors profiled offer APIs to put models into production. The vendors in this category have a focus on specific services. Details about each of these features are found in the vendor profiles.
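The "specify a target variable and receive the best model" pattern that the easy to use/automated segment provides can be illustrated in miniature. The sketch below is hypothetical (the `auto_model` helper, the candidate models, and the column names are invented, not any vendor's API): it fits a couple of trivial candidate classifiers on a training split and returns whichever scores best on held-out rows.

```python
# Minimal sketch of black-box automated model selection: the caller
# names a target column; the tool tries candidate models and returns
# whichever scores best on held-out rows. All names are hypothetical.

def accuracy(model, rows, target):
    hits = sum(1 for r in rows if model(r) == r[target])
    return hits / len(rows)

def auto_model(rows, target):
    train, test = rows[: len(rows) // 2], rows[len(rows) // 2 :]

    # Candidate 1: always predict the majority class seen in training.
    labels = [r[target] for r in train]
    majority = max(set(labels), key=labels.count)
    candidates = {"majority": lambda r: majority}

    # Candidate 2+: threshold each numeric feature at its training mean.
    for feat in train[0]:
        if feat == target:
            continue
        cut = sum(r[feat] for r in train) / len(train)
        above = [r[target] for r in train if r[feat] > cut]
        below = [r[target] for r in train if r[feat] <= cut]
        if not above or not below:
            continue
        hi = max(set(above), key=above.count)
        lo = max(set(below), key=below.count)
        candidates[f"threshold:{feat}"] = (
            lambda r, f=feat, c=cut, h=hi, l=lo: h if r[f] > c else l
        )

    # Return the candidate with the best holdout accuracy.
    best = max(candidates, key=lambda n: accuracy(candidates[n], test, target))
    return best, candidates[best]

rows = [{"spend": s, "churn": int(s < 50)}
        for s in (10, 80, 30, 60, 20, 90, 40, 70, 0, 45,
                  15, 85, 35, 65, 5, 95, 50, 75, 25, 55)]
name, model = auto_model(rows, "churn")
print(name, accuracy(model, rows, "churn"))
```

Real automated tools search far larger model families, but the contract is the same: the business user supplies data and a target, and the tool handles candidate generation and selection.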

FEATURE SETS
Organizations must consider numerous features when evaluating solutions. The choice of vendors to put on a short list will depend on the business problems the organization is trying to solve, the skill set of the team, and the available budget. Do you need a solution targeted to a data scientist or to a business analyst? Will DevOps be involved?

The roles of those using the product are also very important. It would not make sense to purchase a data science workbench that primarily supports those who code in Python if your average user is a business analyst.

Other considerations will include whether the organization is dealing with big data, whether it intends to put models into production and how it plans to do that, as well as how many models it expects to deal with.

Here are some features that organizations need to consider when embarking on a predictive analytics effort or building out their solutions in the space.

HANDLING OF DATA SOURCES. Data comes in many shapes and sizes and from multiple sources both internal and external to the company. Some of the data will be stored in commercial databases or in an on-premises data warehouse. Other data might be in Hadoop. Some data might be in the public cloud. Buyers should consider what data sources and platforms a vendor supports and how it helps users access that data.

DATA PREPARATION. Data for predictive analytics will potentially need to be merged and then prepared for analysis. This may involve feature engineering, which includes profiling the data for data quality as well as transforming and deriving new variables for analysis. Feature engineering is extremely important in predictive analytics. If your organization is using disparate data such as text data, you may want the solution to be able to extract entities, concepts, or sentiments from text data to use for analysis. You may want to calculate ratios or sums. Your organization may already have a data preparation tool it likes. If not, it will be important to understand what predictive analytics vendors have to offer in terms of data preparation.

EASE OF USE. Does your organization require a visual user interface or is it comfortable using a scripting language for building models? Part of the answer will depend on who is actually using the solution. Business analysts may be more at home in a visual drag-and-drop or automated model-building environment; data scientists might prefer a workbench.

BIG DATA SUPPORT. If your organization has large volumes of data to analyze, it is important that the selected predictive analytics solution meets certain performance criteria. That means support for a distributed architecture. It also means algorithms that can run fast on clusters in this new environment.

MODEL DEPLOYMENT. It is one thing to create a model; it is another thing to put it into production. If your organization plans to put the model into production, it needs to understand how the vendor supports deployment. How is model export handled? How easy is it to embed models into systems or applications?

MODEL MANAGEMENT. As the number of models increases, your enterprise will need to keep track of them. Some vendors offer registries or other tools to help with versioning. Some vendors offer model-monitoring capabilities to track how model performance changes over time. This might include allowing organizations to set rules or alerts if model performance begins to degrade.

BREADTH AND DEPTH OF ALGORITHMS. There are numerous predictive algorithms available in the market. Make sure that the solution has the kinds of algorithms you need. For instance, your organization might be very interested in deep learning or it may have a specific need based on a vertical industry.

COLLABORATION. More often, organizations are building teams to construct and deploy models. People often want to collaborate. Some solutions have collaboration features that can share models. Other organizations want to share the output of the models with business users or provide them with limited interactivity.

ON-PREMISES VERSUS CLOUD DEPLOYMENTS. Some solutions are offered in the cloud; others are offered on premises. Some solutions support both options.

SUPPORTS OPEN SOURCE. As mentioned, open source is becoming popular as a low-cost option for predictive analytics. Many of the vendors profiled in this report support open source options such as R, Python, and Spark. Many also support open source notebook environments such as Jupyter notebooks.

EMBEDDING ANALYTICS. If you're going to embed analytics as part of a business process or in an application, you might also be interested in how you can do that. Does the vendor use PMML? Does it have APIs that easily enable you to connect? Does the vendor provide in-database scoring?

FACTORY APPROACH. Some vendors offer a factory approach to model building. That means that they enable reuse of model workflows to scale up to thousands of models. This can be important if you expect to have many models for different groups or segments.
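The factory approach to model building can be illustrated with a small sketch. This is illustrative only (the flow, segment names, and data are invented, not any vendor's feature): a single model-building flow is defined once and stamped out per customer segment.

```python
# Sketch of the "factory" approach: one model-building flow,
# defined once, is rerun for each segment. The flow, segments,
# and data are hypothetical, not a specific vendor's feature.
from statistics import mean

def build_flow(history):
    """One reusable flow: clean the data, then fit a trivial model
    (here, predict the segment's mean historical spend)."""
    cleaned = [x for x in history if x is not None and x >= 0]
    baseline = mean(cleaned)
    return lambda: baseline  # the fitted "model"

segments = {
    "consumer":   [20, 25, None, 30],      # None = missing value, dropped
    "smb":        [200, 180, 220],
    "enterprise": [1000, -1, 1200],        # -1 = bad reading, dropped
}

# The factory: the same flow applied once per segment.
models = {name: build_flow(rows) for name, rows in segments.items()}
for name, model in sorted(models.items()):
    print(name, model())
```

The point is the shape, not the model: one flow definition yields as many fitted models as there are segments, which is how vendors scale a workflow to thousands of models.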

Vendors
There are many vendors that provide predictive analytics capabilities. This section provides greater detail on the offerings of the vendors shown in Figure 9. It should provide readers starting off in predictive analytics, or those building up predictive capabilities, with a good list of vendors to consider. The chart below describes the user personas each vendor supports in its primary predictive analytics product.

Platinum Sponsor

DATA SCIENTIST BUSINESS ANALYST BUSINESS USER DEV/OPS


Alpine
Alteryx
Amazon
Angoss
Ayasdi
BigML
Cloudera
Cray
Dataiku
DataRobot
FICO
Fractal Analytics
H2O.ai
IBM
Megaputer
Microsoft
OpenText
Oracle
Pentaho
Salesforce
SAP
SAS
Teradata
TIBCO Spotfire
TIBCO Statistica

www.alpinedata.com
Alpine Data provides Chorus, a platform for managing the data science process from ETL to deployment. Geared primarily towards data scientists using big data, the goal of the platform is to help enterprises collaboratively build models using machine learning techniques, and to deploy and govern these models.

Focused on scale, the Chorus Parallel Workflow Engine optimizes analytics workflows based on whether the environment is Hadoop or in-database, taking advantage of MapReduce, Spark, or SQL where appropriate. The Distributed Execution Engine can push the code into the data platform in parallel across a cluster. With Chorus 6.2, Alpine Data has integrated Python (Jupyter) notebooks into the platform to enable interactive Python analysis from within the Chorus ecosystem. Support for PySpark has also been added to interface with RDDs in Python for distributed functionality.

Alpine Data also provides visual editing capabilities for ETL and analytics workflows, enabling business analysts without coding skills to access data and build models.

UPSHOT
Alpine Data has found its groove in big data predictive analytics that runs natively in Hadoop and Spark. Chorus adds collaboration and governance to machine learning projects as well, which are important features. Chorus is targeted to data scientists building scalable, big data models and provides visual workflow capabilities for business analysts as well. It's worth looking into for organizations that have implemented a data warehouse and Hadoop and want to take the next step to performing predictive analytics.
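The engine's choice of MapReduce, Spark, or SQL "where appropriate" is, at heart, a dispatch decision: run the workflow where the data lives. A hypothetical sketch of that pattern follows (the function and backend names are invented, not Chorus internals):

```python
# Hypothetical sketch of the dispatch pattern behind "run the
# workflow where the data lives": pick an execution backend from
# the data platform. All names are invented, not Chorus APIs.

def run_on_spark(step):
    return f"spark({step})"       # stand-in for a Spark job submission

def run_in_database(step):
    return f"sql({step})"         # stand-in for pushed-down SQL

BACKENDS = {"hadoop": run_on_spark, "database": run_in_database}

def execute(workflow, platform):
    runner = BACKENDS[platform]   # same workflow, backend chosen per platform
    return [runner(step) for step in workflow]

print(execute(["join", "score"], "hadoop"))
print(execute(["join", "score"], "database"))
```

The design point is that the workflow definition stays the same while the execution target varies, which is what lets one visual flow run against Hadoop or in-database without being rewritten.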

Product Name | Version | Description
Alpine Data Chorus | 6.2 | Integrated analytics platform that brings machine learning, data, and people together to create operational solutions for business users.

DATA PREPARATION CAPABILITIES: Visual ETL workflow editor with hundreds of functions shipped out of the box; also partners with Trifacta for data preparation.

DATA SOURCES SUPPORTED: Cloudera, Hortonworks, MapR, Pivotal Greenplum and HDB, Oracle, PostgreSQL, Teradata, HPE Vertica, Microsoft SQL Server, SAP HANA, AWS Redshift and EMR, Hive, and Impala.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: Ships with multiple machine learning algorithms in the areas of classification, regression, clustering, and dimensionality reduction; also supports Python and R libraries and provides an extensibility SDK for adding new custom or open source algorithms.

MODEL MANAGEMENT CAPABILITIES: Can schedule batch runs for retraining; includes audit trails and what Alpine calls "flow control" for continuing or stopping flows depending on customized test conditions.

MODEL PRODUCTION/EXPORT CAPABILITIES: Supports deploying models as microservices that expose RESTful scoring APIs onto PaaS infrastructure such as Cloud Foundry, AWS, and Google Cloud platforms. In addition to PMML export for deployment, it supports the emerging Portable Format for Analytics (PFA) standard that can represent a broader set of ETL and scoring functionality than PMML.

AUTOMATION CAPABILITIES: Supports model reuse via templating workflows with different variables.

CLOUD VERSUS ON PREMISES: On premises or in the cloud.

LICENSING: Licensing works on a per-seat basis based on four categories of roles, including analytics developer, analyst, collaborator, and consumer.
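PMML, listed above as one of Chorus's export paths, is an XML vocabulary for representing fitted models. As a rough illustration of why an XML interchange format makes deployment portable, here is a hand-written, heavily simplified PMML-style regression fragment scored with Python's standard library (real PMML documents also carry namespaces, a data dictionary, a mining schema, and more; the coefficients are invented):

```python
# Illustrative only: a hand-written, simplified PMML-style fragment
# for a linear regression model, scored with the stdlib XML parser.
# Real PMML documents are richer; the coefficients are invented.
import xml.etree.ElementTree as ET

PMML_DOC = """
<PMML version="4.3">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="age" coefficient="0.2"/>
      <NumericPredictor name="income" coefficient="0.001"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def score(doc, inputs):
    """Evaluate intercept + sum(coefficient * input) from the fragment."""
    table = ET.fromstring(doc).find(".//RegressionTable")
    total = float(table.get("intercept"))
    for pred in table.findall("NumericPredictor"):
        total += float(pred.get("coefficient")) * inputs[pred.get("name")]
    return total

print(score(PMML_DOC, {"age": 40, "income": 50000}))  # 1.5 + 8 + 50
```

Because the model is just declarative markup, any system with an XML parser can score it, which is the portability argument for PMML (and, with a broader scope, PFA).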

www.alteryx.com
The Alteryx Analytics platform aims to accelerate the selection, preparation, and transformation of data for analysis, as well as to automate the process of designing, testing, and publishing analytics models. A seven-year-old company, Alteryx is probably best known for its self-service data prep and blending technology; its Alteryx Designer product enables users to perform drag-and-drop joins on common data fields or geospatial information without writing SQL, R, or Python, or coding in Java and other languages.

Alteryx is more than just a self-service data-preparation studio, however. It offers over 60 built-in tools for spatial and R-based predictive analytics packaged as icons in its platform that can be used in building workflows. Alteryx also has a partnership with DataRobot that enables joint customers to use its machine learning algorithms as part of a workflow. For those who prefer to code, Alteryx exposes a scripting facility (R Tool) that analysts and data scientists can use to run R code within the Alteryx environment.

Alteryx Server enables business analysts to publish workflows to a private area for others to use and execute at scheduled times to deliver analytics results throughout an enterprise. It also offers the Alteryx Analytics Gallery, a cloud-based resource for organizations to privately share analytics without setting up a server, as well as a public gallery that provides samples of analytics across multiple vertical industries.

UPSHOT
Well regarded for self-service data preparation, Alteryx now combines these capabilities with advanced analytics (such as predictive analytics) in one platform. Alteryx caters to two distinct user classes: technically savvy line-of-business users—business analysts and power users—and data scientists.

Product Name | Version | Description
Alteryx Analytics | 11.0 | Self-service data analytics platform that allows analysts to prep, blend, and analyze data using a repeatable workflow, then deploy and share analytics at scale.

DATA PREPARATION CAPABILITIES: Includes tools for data cleansing, blending, data quality, binning and smoothing, data investigation, data profiling, filter and search, transformations, aggregations, data parsing, CASS certification, and spatial matching.

DATA SOURCES SUPPORTED: Alteryx supports a large number of data sources, including desktop, cloud, database, and Hadoop sources. Connectors include Adobe Analytics, Tableau, Amazon S3, Azure Text Analytics, Google Analytics, Google Sheets, Foursquare, Marketo, Mongo, Salesforce, Sharepoint, and Twitter. Alteryx also has relationships with third-party data providers such as Experian and Dun & Bradstreet, to enable users to enrich their internal data.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: Alteryx supports over 60 techniques for statistical, predictive, time-series, clustering, and prescriptive analytics, with R as the underlying language performing the predictive capabilities, as well as integration with DataRobot for machine learning and automated modeling.

MODEL MANAGEMENT CAPABILITIES: Once a model is built, it can be stored within a workflow in Alteryx Server, which includes version control as well as access controls.

MODEL PRODUCTION/EXPORT CAPABILITIES: A model can be exported as an R model object or in PMML. A model can be packaged as a macro or an analytics application, or sent to a database that can be automated to run to drive downstream processes. Models can be shared via Alteryx Server. Supports in-database scoring for the Oracle, SQL Server, and Teradata RDBMSs.

AUTOMATION CAPABILITIES: Once a workflow and models are created, they can be scheduled to automatically execute.

CLOUD VERSUS ON PREMISES: Alteryx Designer is a desktop-based deployment; Alteryx Server can be deployed on premises or in the cloud via AWS and Azure. Alteryx Analytics Gallery is cloud-based and is managed and delivered entirely in the cloud.

LICENSING: Named user, subscription model.

*Written before the Yhat acquisition

aws.amazon.com/machine-learning
Amazon offers a host of services via Amazon Web
Services to help developers build applications. These
include storage, compute, database, and analytics.
The Amazon Machine Learning service makes tools
developed through the company’s internal R&D
available to developers to build “smart” applications
without in-depth knowledge of machine learning.
It includes three components: data analysis, model
training, and model evaluation. These are all provided
via a wizard interface.

UPSHOT
Amazon Machine Learning provides automated model building for developers assembling smart applications. For organizations that use Amazon cloud services and want to build applications that utilize basic machine learning techniques, this is an option worth considering.

Product Name | Version | Description
Amazon Machine Learning (AML) | | A service that makes it easy for developers of all skill levels to use machine learning technology.

DATA PREPARATION CAPABILITIES: Provides a data report to explore the data and find missing values.

DATA SOURCES SUPPORTED: Supports Amazon Redshift, RDS, S3, or CSV; can create models on up to 100 GB of data.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: Binary classification, multiclass classification, regression.

MODEL MANAGEMENT CAPABILITIES: Monitor prediction usage patterns with Amazon CloudWatch metrics.

MODEL PRODUCTION/EXPORT CAPABILITIES: Output to Amazon S3 or to applications that can read from S3.

AUTOMATION CAPABILITIES: Model metadata can be queried for model reuse.

CLOUD VERSUS ON PREMISES: Amazon cloud only.

LICENSING: Model building based on compute rate of $0.42/hour; batch predictions: $0.10 per 1,000 predictions, rounded up to the next 1,000; real-time predictions: $0.0001 per prediction, rounded up to the nearest penny, plus a reservation charge.
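With per-unit pricing like this, a rough cost estimate is simple arithmetic. A sketch using the rates listed above (the workload figures, 10 training hours and the prediction counts, are made up for illustration, and the real-time capacity reservation charge is excluded):

```python
# Rough cost sketch using the rates listed above. The workload
# figures are hypothetical; the real-time reservation charge is
# excluded, so this understates the actual bill.
import math

COMPUTE_PER_HOUR = 0.42
BATCH_PER_1000 = 0.10        # batch predictions, per 1,000, rounded up
REALTIME_PER_PRED = 0.0001   # real-time predictions, per prediction

def estimate(train_hours, batch_preds, realtime_preds):
    compute = train_hours * COMPUTE_PER_HOUR
    batch = math.ceil(batch_preds / 1000) * BATCH_PER_1000
    realtime = realtime_preds * REALTIME_PER_PRED
    return round(compute + batch + realtime, 2)

# e.g., 10 hours of model building, 250,500 batch predictions
# (billed as 251 thousands), and 100,000 real-time predictions:
print(estimate(10, 250_500, 100_000))  # 4.20 + 25.10 + 10.00 = 39.30
```

Note how the batch tier's round-up to the next 1,000 means 250,500 predictions cost the same as 251,000.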

www.angoss.com
Angoss provides a predictive analytics solution designed for statisticians, data scientists, and business analysts. The company believes that users need freedom of choice in dealing with disparate data sources, predictive model building, and deployment options.

To that end, the Angoss suite of products supports a visual drag-and-drop user interface as well as scripting interfaces for open source tools, such as Python, R, and Spark, and notebook environments. Its big data platform, KnowledgeENTERPRISE, provides that functionality in addition to advanced data mining, in-memory execution on Spark, in-place analysis of data in data lakes, and data access flexibility. Its server and desktop platform, KnowledgeSTUDIO, provides advanced data mining and predictive analytics for all stages of the data mining cycle, including scorecard-building functionality.

UPSHOT
The company has made good use of the infusion of capital it received from its parent, private equity firm Peterson Partners. Its solutions support the complete analytics life cycle, and it has incorporated support for open source, even including access to components such as Jupyter notebooks to embed custom code and other machine learning packages directly within the visual workflow.

The platforms provide direct access from the Angoss workflow to visualization tools such as Tableau and Qlik for reporting and multiple deployment options. It is intended for customers across a multitude of industries and departments (such as credit risk, fraud, marketing, sales, and CRM analytics), including teams dealing with large volumes of data in distributed storage clusters.

24 
Product Name | Version | Description
Angoss KnowledgeENTERPRISE | 10.4 | Data science platform integrated with Apache Spark to provide advanced data mining and predictive analytics on large-scale distributed data structures.
Angoss KnowledgeSTUDIO | 10.4 | Advanced data mining and predictive analytics application for all phases of the model development and deployment cycle including building repeatable workflows.
Angoss KnowledgeSEEKER | 10.4 | Fundamental data mining and predictive analytics software used for data exploration, decision-tree analysis, predictive modeling with decision trees, and strategy development.

DATA PREPARATION CAPABILITIES: Wizards support joining, appending, and aggregating data sets, and removal of duplicate records. Data profiling features help with data quality checks. Supports variable creation via multiple methods. Variable creation and transformation using Python is available in KnowledgeENTERPRISE. Data preparation with SAS and in R and Python available in KnowledgeCORE.

DATA SOURCES SUPPORTED: Import and export data to and from text, Excel, SAS, SPSS, and R files; SAS and WPD data files can be used directly in Angoss workflows (requires KnowledgeCORE license). Import and export data to and from databases via ODBC. Also supports text analytics in KnowledgeREADER. KnowledgeENTERPRISE supports data load from Hadoop HDFS, Hadoop ViewFs, Hadoop Archive, Amazon S3, FTP, Network Shares, and other storage types supported by Spark. Data load from Hive tables, text (CSV), Parquet, ORC, and Avro formats into Spark dataframes.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: Supports numerous classification, regression, and clustering techniques natively. Also supports R, Python, and Spark libraries. Supports complex optimization in KnowledgeOPTIMIZER and segment-level optimization in KnowledgeSTUDIO using Strategy Trees.

MODEL PRODUCTION/EXPORT CAPABILITIES: Direct deployment within the Angoss application; automatic generation of SAS, SQL, SPSS, PMML, and Java code for Angoss models for external deployment in other environments. Can run in-database.

AUTOMATION CAPABILITIES: Graphical UI with workflow design features; workflows can be reused. Models can be scheduled; in KnowledgeENTERPRISE models can be scheduled to run on Spark.

CLOUD VERSUS ON PREMISES: On-premises or cloud deployment; physical or virtual environments.

LICENSING: Desktop licenses node-locked to host machine. Multiple license types for server version.

www.ayasdi.com
Ayasdi’s mission is to design, develop, and deploy intelligent applications. It uses a unique mathematical approach called TDA (Topological Data Analysis), developed at Stanford University, to do this. Whereas other techniques try to fit a line to a set of observations or group them together, TDA looks at the shape of data. This kind of analysis is useful for high-value problems that involve complex and high-dimensional data that is rapidly generated or constantly evolving. It is also useful for signals that are hard to spot in big data. The Ayasdi platform includes a scalable infrastructure built on Hadoop for analyzing large amounts of data.

The company’s strategy is to sell discrete intelligent applications that include explainable output (what it calls justifiable). It has made inroads into financial services, healthcare, and the public sector. Applications include risk and credit monitoring, anti-money laundering, clinical variation management, and population risk prediction.

UPSHOT
Ayasdi provides a math-based approach to solving complex analytics problems and a way to explain and justify the results. Although the applications are ultimately targeted to line-of-business users, developers can work with Ayasdi team members to put intelligent applications into production.

26 
Product Name Version Description

Ayasdi 1.0 Ayasdi’s machine intelligence platform combines scalable computing and big data infrastructure with machine learning, statistical and geometric algorithms, and Topological Data Analysis to enable data scientists/analysts, domain experts, and business people to be more productive.

DATA PREPARATION CAPABILITIES
Ayasdi works with ETL solution vendors; however, the company provides specific data transformation tools that are often required by machine learning algorithms.

DATA SOURCES SUPPORTED HDFS, ODBC databases, CSV, and flat files.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
Segmentation, anomaly detection, prediction, recommendation, feature selection, time series, TDA, model repair, and model justification.

MODEL MANAGEMENT CAPABILITIES
Supports champion/challenger model selection.
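Champion/challenger selection amounts to scoring an incumbent model and a candidate on the same holdout data and promoting whichever wins. A minimal sketch in Python (hypothetical toy models and data, not Ayasdi's API):

```python
def accuracy(model, holdout):
    """Fraction of holdout examples the model labels correctly."""
    hits = sum(1 for features, label in holdout if model(features) == label)
    return hits / len(holdout)

def select_champion(champion, challenger, holdout):
    """Promote the challenger only if it beats the current champion."""
    if accuracy(challenger, holdout) > accuracy(champion, holdout):
        return challenger
    return champion

# Toy single-feature classifiers: predict 1 when the feature exceeds a cutoff.
champion = lambda x: 1 if x > 0.8 else 0    # current production model
challenger = lambda x: 1 if x > 0.5 else 0  # candidate replacement

holdout = [(0.6, 1), (0.7, 1), (0.9, 1), (0.2, 0), (0.4, 0)]
best = select_champion(champion, challenger, holdout)  # challenger wins here
```

In practice the holdout set and the scoring metric (accuracy, AUC, lift) would come from the production scoring pipeline rather than a toy list.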

MODEL PRODUCTION/EXPORT CAPABILITIES
Model export through PMML or REST API (not publicly available).

AUTOMATION CAPABILITIES
Ayasdi’s platform can be used in fully unsupervised, semi-supervised, and fully supervised modes.

• In the fully unsupervised mode, Ayasdi executes a large number of ML algorithms and combines them together using TDA. This allows the system to surface insights and models from data automatically.

• In semi-supervised mode, the system either needs to be given an objective to optimize or the list of algorithms it cycles through can be restricted. In the former setting, the system cycles through a vast number of algorithmic combinations which cleanly delineate the provided objective.

• In the fully supervised mode, the system needs both the algorithm selections as well as the objective that’s being optimized.

Note that in all settings, the actual distributed execution of the algorithms is carried out automatically by Ayasdi’s YARN scheduler.

CLOUD VERSUS ON PREMISES On premises or via private or public cloud infrastructures.

LICENSING Licensed on an annual subscription basis.

www.BigML.com
BigML was founded in 2011 with the mission to make machine learning available to everyone. BigML is offered as a cloud or on-premises service to analysts, developers, and data scientists who are using the platform to build and deploy predictive models.

The software is geared to everyone—even those new to machine learning. Its goal is to provide consumable, exportable, programmable, and scalable machine learning. Business users can make use of the web-based UI to build models. Data scientists and developers can use BigML’s tools that include an API, bindings, and WhizzML, the domain-specific language BigML has developed to automate any machine learning workflow.

The company also offers customized training, which consists of four three-hour sessions capped off with a certification exam.

UPSHOT
With a nice, modern UI and the goal of democratizing machine learning, BigML is a good example of next-generation machine learning-as-a-service, targeted to those who need an easy-to-use machine learning tool or those developing apps.

Product Name Version Description

BigML Machine learning service that offers an easy-to-use interface and a RESTful API.

DATA PREPARATION CAPABILITIES
BigML offers a Lisp-like language called Flatline for feature engineering (https://github.com/bigmlcom/flatline). In addition, imported source files can be manipulated in several ways through the API or web dashboard (e.g., filtering rows, dropping columns).

DATA SOURCES SUPPORTED
BigML can accept many file formats including CSV, HDFS, JSON, Google Cloud, S3, Dropbox, and Azure.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
Offers classification, regression, evaluations, clustering, anomaly detection, association modeling, and topic modeling (for text analysis).

MODEL MANAGEMENT CAPABILITIES
Models can be created, viewed, updated, and deleted through either the web UI or the API.

MODEL PRODUCTION/EXPORT CAPABILITIES
REST APIs are available for developers to deploy models into applications. Models can also be exported to Java, Python, Node.js, and Ruby code as well as Tableau or JSON format.
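Model export of this kind typically produces a standalone scoring function with the tree's splits inlined, so an application can score records with no service dependency. A hypothetical Python sketch of the shape such exported code takes (invented fields and thresholds, not actual BigML output):

```python
def predict_churn(contract_months=None, support_calls=None):
    """Scores one record against a small, hard-coded decision tree.
    Hypothetical model: short contracts with many support calls churn."""
    if contract_months is None or contract_months < 12:
        if support_calls is not None and support_calls > 3:
            return "churn"
        return "stay"
    return "stay"

label = predict_churn(contract_months=6, support_calls=5)
```

Because the exported function is plain code, it can be versioned, unit-tested, and embedded directly in an application.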

AUTOMATION CAPABILITIES
Scripts can be written in WhizzML, a domain-specific scripting language designed specifically for building automatic machine learning workflows for repetitive tasks, or using the API plus bindings in popular languages.

CLOUD VERSUS ON PREMISES Can be deployed on AWS, Azure, or any other cloud, and can be deployed on premises as well.

LICENSING
Cloud version plans range from free to $10,000 a month, based on maximum data set size, maximum number of parallel tasks, and level of support. The on-premises version starts at $45,000/yr. for a single server and goes up to $2.2M for a global installation plan for large enterprises.

www.cloudera.com
Cloudera delivers an enterprise data management and analytics platform called Cloudera Enterprise, which is built on the Apache Hadoop ecosystem. The company recently introduced Data Science Workbench on Cloudera Enterprise, which is based on its 2016 acquisition of data science startup Sense.io. The goal of the workbench is to provide the best data science experience in the Hadoop ecosystem in a secure and governable environment. The workbench provides data preparation, visualizations, and model testing and training on Hadoop, as well as tools to help put models into production in a distributed ecosystem.

Data scientists can use R or Python to create models in the workbench, which provides side-by-side program and interpreter interfaces, negating the need for a notebook. If a data scientist decides to use a notebook, it can be zipped and included in the workbench. Everything done in the interpreter is done in Spark. Additionally, the Spark configuration files are those used in production, which means that the models can be run easily in a large-scale, distributed environment.

UPSHOT
Cloudera continues to build on its expertise in big data to provide a platform geared towards data scientists to help them to experiment and enrich code, all within the Hadoop ecosystem. Worth considering if your organization needs a secure platform for data scientists and is comfortable with open source tools for the analytics life cycle.

Product Name Version Description

Cloudera Data Science Workbench Data Science Workbench allows data scientists to use open source languages—including R, Python, and Scala—and libraries on a secure enterprise platform with native Apache Spark and Apache Hadoop integration.

DATA PREPARATION CAPABILITIES
Apache Spark; any package that runs in R, Python, or Scala; commercial software certified with Cloudera, such as Trifacta or Paxata.

DATA SOURCES SUPPORTED
HDFS, HBase, any data source with Spark connectivity (e.g., relational databases, NoSQL, NewSQL), streaming data sources through Kafka and Spark Streaming.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
Via R, Python, and Scala libraries, including Spark; commercial partners certified with Cloudera.

MODEL MANAGEMENT CAPABILITIES
Version control performed via GitHub; models can be tracked via Cloudera Navigator; includes full lineage to backtrace a model.

MODEL PRODUCTION/EXPORT CAPABILITIES
Uses Docker containers as an API endpoint to package up applications; Java, PMML, C, C++ available through Python and R packages (i.e., RPMML).

AUTOMATION CAPABILITIES Available through commercial partners certified with Cloudera, such as DataRobot.

CLOUD VERSUS ON PREMISES Supports both.

LICENSING Available to licensees of Cloudera Enterprise Data Hub and Cloudera Data Science and Engineering.

www.cray.com
Cray, best known for its supercomputers, ranks analytics as one of its main focus areas along with high-performance computing. The company works with researchers and scientists in government, defense, and earth science—as well as commercial organizations—to tackle large and complex simulation, analytics, and deep learning problems. These include problems in industries such as pharmaceuticals and life sciences, aerospace and automotive manufacturing, oil and gas exploration, financial services, and any business looking to drive innovation on a large scale.

Recently, the company has put a major emphasis on supporting machine learning and deep learning. Its goal is to provide deep learning utilizing dense GPU and CPU platforms, and open source tools and frameworks for seamless customer access to these analytics. This includes support for the end-to-end analytics workflow, including data preparation, feature extraction, data collection, and monitoring.

UPSHOT
Cray has a long history of pushing the limits of computing. Cray is worth looking into for data scientists needing a flexible and scalable high-performance computing platform that supports deep learning for (ultimately) very large-scale predictive analysis and advanced research and development.

Product Name Version Description

Cray CS-Storm For organizations leveraging NVIDIA GPU accelerators in production-use machine learning.

Cray Urika-GX For CPU-based machine learning using Spark-based machine learning (MLlib) and deep learning (BigDL) tools.

Cray XC50 For GPU-based deep learning neural networks (frameworks like TensorFlow, Microsoft Cognitive Toolkit).

Cray Graph Engine For large-scale data discovery and analytics using semantic databases.

DATA PREPARATION CAPABILITIES
Customers can use any open source software (including Hadoop and Spark) or ISV tool that runs on Linux.

DATA SOURCES SUPPORTED
Supports Hadoop and Spark on Cray Urika-GX and Spark on XC50, Hortonworks Data Platform (HDP), Apache Spark, and Cray Graph Engine.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
Supports Spark MLlib or Python; deep learning frameworks such as TensorFlow, MXNet, Caffe2, and Microsoft CNTK may also be used.

MODEL MANAGEMENT CAPABILITIES
Cray has not pre-integrated any tool for model management; customers can use any open source or ISV tool that runs on Linux.
MODEL PRODUCTION/EXPORT CAPABILITIES
N/A

AUTOMATION CAPABILITIES Ad hoc or scheduled job execution with a Workload Manager.

CLOUD VERSUS ON PREMISES Supports both (cloud hosting provided through a relationship with Markley).

LICENSING N/A

www.datarobot.com
DataRobot is a machine learning platform that simplifies and accelerates the process of building, testing, and deploying machine learning models in supervised-learning scenarios. A company goal is to automate model building for machine learning in order to open up the technology to business analysts and other nontechnical users. In addition, the company believes that automated model building also frees data scientists to iterate and deploy machine learning into business processes quickly. To that end, the company has employed Kaggle data scientists and baked their expertise into its software.

An analyst uploads data to DataRobot, which automatically generates a list of “recipes”—i.e., prebuilt workflows—ranked according to their accuracy, efficacy, and likely performance impact. Under the hood, DataRobot runs the user-submitted data set against a curated set of up to 40 preinstantiated recipes developed by its own data scientists. The recipes leverage open source algorithms from R, Python, TensorFlow, and others. The “best models” are then displayed on a leaderboard.

Models are shared and put into production via REST API endpoints that users can embed in their applications. Workflows are persisted in DataRobot’s repository, which permits them to be shared with other people, applications, or services. DataRobot runs on AWS or on premises, which includes private cloud, bare metal Linux, or the Hadoop environment.
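A leaderboard of this kind boils down to evaluating every candidate workflow against the same validation metric and sorting by score. A toy illustration in Python (invented recipes and scores, not DataRobot internals):

```python
def rank_recipes(recipes, evaluate):
    """Score each candidate recipe and return (name, score) pairs, best first."""
    board = [(name, evaluate(recipe)) for name, recipe in recipes.items()]
    return sorted(board, key=lambda entry: entry[1], reverse=True)

# Each toy "recipe" just reports a fixed validation accuracy here; a real
# system would train the recipe and score it on holdout data instead.
recipes = {
    "logistic_regression": {"accuracy": 0.81},
    "random_forest": {"accuracy": 0.88},
    "boosted_trees": {"accuracy": 0.86},
}
leaderboard = rank_recipes(recipes, evaluate=lambda r: r["accuracy"])
```

Ranking on a single shared metric is what makes the leaderboard comparable across otherwise very different model families.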

UPSHOT
DataRobot is geared to business analysts as well as data scientists and provides automated model building in a Kaggle-like framework to produce the best machine learning model.

Product Name Version Description

DataRobot 3.0 An enterprise machine learning platform that automates model building, testing,
and validation.

DATA PREPARATION CAPABILITIES
Data preprocessing (e.g., missing imputation, encoding, and tokenizing of text data) in addition to model building and deployment; no built-in ETL capabilities; DataRobot also partners with Alteryx, which can provide other data preparation capabilities.
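The two preprocessing steps named here are simple to sketch: mean imputation fills missing numeric values, and one-hot encoding turns a categorical column into 0/1 indicator features. A pure-Python illustration (not DataRobot's implementation):

```python
def impute_mean(values):
    """Fill missing (None) numeric entries with the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def one_hot(values):
    """Encode each categorical value as a dict of 0/1 indicator features."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

ages = impute_mean([20, None, 40])          # the None becomes 30.0
plans = one_hot(["basic", "pro", "basic"])  # two indicator columns
```

Production systems compute imputation statistics on training data only and reuse them at scoring time, so train and score stay consistent.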

DATA SOURCES SUPPORTED URL, ODBC, HDFS, local files, data frames in R or Python Pandas.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
More than six supervised ML algorithms, including generalized linear models, decision trees, boosted trees, random forests, neural networks, and support vector machines. Several preprocessing algorithms, including encoding, missing imputation, and text mining capabilities. Open source algorithms including R, Python, and TensorFlow are also supported.
MODEL MANAGEMENT CAPABILITIES
Workflows can be stored within a project and multiple iterations of projects can be stored in a project management view. Each project can be named, tagged, copied, and shared. Alerts not currently available.
MODEL PRODUCTION/EXPORT CAPABILITIES
Model workflows are made available for prediction as REST API endpoints. Workflows can be persisted to the DataRobot repository, shared with other users, and invoked by applications/services.

AUTOMATION CAPABILITIES Automated model selection, training, tuning, and deployment.

CLOUD VERSUS ON PREMISES Available on AWS or on premises.

LICENSING
Annual subscription license is based on the number of users and number of compute engines needed to support concurrent use.

www.dataiku.com
Dataiku provides an end-to-end data science platform that includes data preparation, coding, visualization, modeling, and deployment capabilities. Geared towards data science teams in midsized to large companies, the goal of the platform is two-fold. First, the platform is designed to help data scientists and business analysts collaborate on the same project. Second, it provides APIs to help push models into production. The platform was built with a UX and open source-friendly mindset to connect with Hadoop, Python, and other newer platforms. Typical predictive use cases that the company supports include fraud, churn, marketing and sales analytics, predictive maintenance, and forecasting.

Dataiku built the platform to address a range of skills. For instance, the platform provides interfaces for data scientists to manage data and craft models via a drag-and-drop workflow or a notebook environment, and to deploy them into production. Dataiku also provides an interface for noncoders to build a model by specifying target variables of interest.

UPSHOT
A new entrant that addresses the complete predictive analytics lifecycle, Dataiku’s platform is a good addition to the traditional vendor mix. While targeted to data science teams at midsized and large companies looking for an end-to-end platform, the platform also provides automated model building for those not expert in predictive analytics.

Product Name Version Description

Dataiku DSS 4.04 Collaborative data science software platform for teams of data scientists, data analysts, and
engineers to explore, prototype, build, and deliver their own data products.

DATA PREPARATION CAPABILITIES
Visual interface for data connectivity, cleansing, enriching, blending, and transformation.

DATA SOURCES SUPPORTED Platform provides more than 25 connectors.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
More than 10 supervised ML algorithms including ridge regression and random forest; unsupervised approaches including k-means and spectral clustering. Also supports ML techniques in open source libraries such as R and Python.
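Of the unsupervised approaches listed, k-means is the easiest to show end to end. A bare-bones one-dimensional version of Lloyd's algorithm, on toy data (illustrative only; real work would use the platform or a library such as scikit-learn):

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D data: assign each point to the nearest
    center, then move each center to the mean of its cluster; repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep a center in place if its cluster came up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

# Two obvious groups around 1.0 and 9.0; centers converge onto them.
centers = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centers=[0.0, 10.0])
```

The same assign/re-center loop generalizes to higher dimensions by swapping the absolute difference for a Euclidean distance.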

MODEL MANAGEMENT CAPABILITIES
Dashboard monitoring—users can visualize models and see who created them, who contributed to them, and what changes were made; can also synchronize. Users can also set customizable alerts.

MODEL PRODUCTION/EXPORT CAPABILITIES
Workflows can be bundled together using a REST API for deployment into applications. Also supports in-database or in-cluster scoring—even on models that have been trained in-memory.

AUTOMATION CAPABILITIES
Models can be put into production via the automation node. Users can also retrain the model automatically using fresh data.

CLOUD VERSUS ON PREMISES On premises, but can also be on private cloud.

LICENSING On premises version only; yearly subscription model (price based on server and number of users).

www.fico.com
FICO offers a range of data analysis products aimed at both data scientists and business analysts. The next generation of its analytics tools is Analytics Workbench, which FICO positions as a self-service platform for data analysis. Its underpinnings are provided by the FICO Decision Management Suite.

FICO Analytics Workbench consolidates FICO’s own intellectual property, along with that from its acquisitions of InfoCentricity and KarmaSphere. It leverages open source technologies such as Apache Spark, Python, and Scala (R will be supported in the next release). Users can work in notebooks (based on Apache Zeppelin) that allow them to interact with visualizations and syndicate their work; in addition, notebooks provide access to open source libraries, including scikit-learn and Spark ML.

FICO Analytics Workbench supports visual data exploration and wrangling. Users build data preparation workflows and save them as executable “recipes.”

Analytics Workbench provides a range of data analysis tools, features, and capabilities. Self-service and ease-of-use features include data visualization, binning libraries, and decision trees; many sample notebooks are provided via the FICO community. Analytics Workbench leverages Spark as a parallel compute engine. The Spark environment permits multimodel analysis of relational, graph, and multimedia data, and also has built-in engines for query processing, streaming analytics, and machine learning.

UPSHOT
FICO, perhaps best known for its market-leading decision management products in analytics, has a great deal of experience in predictive analytics. Its Analytics Workbench brings many FICO analytics capabilities together in one platform geared for collaboration among business analysts and data scientists.

Product Name Version Description

FICO Analytics Workbench 1.0 The next generation of FICO’s analytics toolkit, combining existing data science tools with open
source technologies into a single, cloud-ready, machine learning and decision science platform,
powered by Spark.

DATA PREPARATION CAPABILITIES
Visual, self-service facilities for data connectivity, cleansing, enriching, and transformation. Also includes a weight-of-evidence feature to assess signal strength in variables.
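Weight of evidence (WoE) is a standard credit-scoring measure of how strongly a bin of a variable separates good from bad outcomes: WoE = ln((good_i / total_good) / (bad_i / total_bad)) for each bin. A small sketch with made-up counts (not FICO's implementation):

```python
import math

def weight_of_evidence(bins):
    """bins maps bin name -> (good_count, bad_count); returns bin -> WoE."""
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    return {name: math.log((good / total_good) / (bad / total_bad))
            for name, (good, bad) in bins.items()}

# The "high" bin concentrates good outcomes, so it carries strong positive
# evidence; "low" is the mirror image with equally strong negative evidence.
woe = weight_of_evidence({"low": (10, 40), "high": (40, 10)})
```

Bins with WoE near zero carry little signal, which is what makes the measure useful for assessing variable strength before modeling.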

DATA SOURCES SUPPORTED
Point-and-click support for structured sources such as CSV and programmatic access to Hadoop, JSON, XML, and Parquet formats from stores including Hadoop, Hive, and S3.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
Regression, binning libraries, strategy trees, random forests, boosted trees, and neural networks; does not currently support text analytics. Also supports ML and AI techniques via open source libraries in Python and Scala.

MODEL MANAGEMENT CAPABILITIES
Management and monitoring of models is provided in FICO’s Decision Central (a separate product), which provides the ability to track and compare models and lineage information (e.g., authors, contributors, changes, etc.).

MODEL PRODUCTION/EXPORT CAPABILITIES
FICO Model Executors support automated deployment of PMML or SPLM (SAS Programming Language) models as Web services. Also supports FSML, a proprietary XML format that can be deployed into FICO’s decision management applications. Zeppelin notebooks can be packaged for export, too.

AUTOMATION CAPABILITIES
Automated data profiling, auto-binning, and auto-generation of models. Decision Central offers automated workflow for model management; supports tracking, validation, sign-off, and more.

CLOUD VERSUS ON PREMISES Cloud service.

LICENSING SaaS licensing on a per-seat basis.

www.fractalanalytics.com
Fractal Analytics provides client-focused global analytics services for industries including CPG and retail, hospitality, life sciences, healthcare, and financial services. As the company has evolved, it has also built a number of applications and workbenches for advanced analytics including machine learning, deep learning, AI, and NLP. These are now used by Fractal Analytics data scientists as well as by company clients.

The core platform is the Centralized Analytics Platform, which uses a proprietary services architecture to scale a wide range of solutions. The Fractal team develops advanced math and science products, streamlines process automation, and builds reusable components by industry and solution area using the platform. The platform also houses numerous out-of-the-box and proprietary algorithms.

UPSHOT
Best known for its expertise in data science and big data, Fractal Analytics is best suited to large enterprises looking to build customer-focused predictive solutions using an experienced and trusted partner. While clients may ultimately use the Fractal Analytics platform on their own, they typically start by working together with the Fractal Analytics team.

Product Name Version Description

Fractal Centralized Analytics Platform (CAE) CAE is a unified, open-source and PCI-DSS-certified business analytics platform. The core is an advanced workbench built on KNIME, integrated with R, Python, and SparkML libraries.

DATA PREPARATION CAPABILITIES
CAE has over 100 prebuilt modules to automate the processing and validation of the incoming data.

DATA SOURCES SUPPORTED Cloudera, Hortonworks, Hive, Oracle, PostgreSQL, Teradata, SAP, MySQL, MS-SQL, AWS.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
Prebuilt analytics include sentiment analysis using NLP and ML techniques; pricing and promotion modeling; marketing mix modeling; over 35 advanced demand forecasting algorithms; driver analysis using Bayesian belief networks; generalized text classification; assortment analysis and planning; sales propensity modeling using random forests; key-value items analysis; product attribute mapping; context extraction; churn propensity; and personality prediction.

MODEL MANAGEMENT CAPABILITIES
Each client is set up as a separate instance with a model-versioning framework. Automated jobs monitor whether models need retraining, and models can self-learn based on new incoming data sets.

MODEL PRODUCTION/EXPORT CAPABILITIES
CAE uses open source standards and provides the ability to integrate with third-party applications.

AUTOMATION CAPABILITIES
CAE runs automated validation checks on the statistical model outputs using rules to confirm their validity.

CLOUD VERSUS ON PREMISES On premises or in the cloud.

LICENSING
Fractal either provides analytics services using prebuilt solutions in CAE or commissions the entire CAE platform for the client; the client then uses the platform to build and deliver analytics solutions within their organization.

www.h2o.ai
H2O.ai provides an open source AI platform with an emphasis on enterprise transformation. The company’s goal is to provide speed, accuracy, and model interpretability for data scientists. It offers four products. The core platform is H2O—an in-memory distributed machine learning platform that provides numerous algorithms out of the box, refactored to support large-scale distributed environments. H2O was written from scratch in Java in order to integrate with popular open source products like Apache Hadoop and Spark.

H2O.ai also offers Deep Water, which integrates open source deep learning frameworks such as MXNet, TensorFlow, and Caffe. Sparkling Water is an integration with Spark. Steam is used to deploy models into production. These are integrated into the H2O user interface.

The H2O UI is script-based. H2O.ai also offers H2O Flow—a visual interface to help users interpret model results. Models are designed to have a small footprint and can be exported as code or Java data objects (MOJO, POJO) and embedded in various environments, including in streams. The small footprint of the models also makes them suitable for embedding in IoT-related devices.

UPSHOT
For use by data scientists who are comfortable working with open source analytics and need a way to put open source into production at enterprise scale, H2O.ai provides best-of-breed open source technologies in one platform. It also provides tooling to support visualization of models and operationalizing them. Finally, since H2O is open source, it is free unless support is needed.

Product Name Version Description

H2O 3 An in-memory open source distributed machine learning platform with visual intelligence (H2O Flow).

DATA PREPARATION CAPABILITIES
R users can use data.table for joins. Other data preparation features are available through R and Python.

DATA SOURCES SUPPORTED Supports a range of data sources including Hadoop, S3, SQL-based, and NoSQL.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
Offers statistical analysis, ensembles, deep neural networks, clustering, dimensionality reduction, and anomaly detection, as well as feed-forward deep learning. H2O Deep Water provides open source deep learning frameworks such as TensorFlow, MXNet, and Caffe. H2O can also be used inside of R, Python, and Scala.
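Anomaly detection, listed among these techniques, can be illustrated with the simplest statistical approach: flag points that sit unusually far from the mean. A toy sketch (illustrative only; H2O's own detectors, such as autoencoder-based ones, work differently):

```python
import statistics

def anomalies(values, threshold=2.0):
    """Flag values whose distance from the mean exceeds `threshold`
    population standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

# One sensor reading is far outside the normal band around 10.0.
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 25.0]
outliers = anomalies(readings)
```

A low threshold is used here because a single extreme point in a small sample inflates the standard deviation; robust variants use the median and MAD instead for exactly this reason.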

MODEL MANAGEMENT CAPABILITIES
Steam provides cataloging features. No versioning in current release.

MODEL PRODUCTION/EXPORT CAPABILITIES
Via Steam, users can export lightweight models as MOJO or POJO objects.

AUTOMATION CAPABILITIES
Provides AutoML to automate parts of data preparation and model development, including aspects of data cleansing, feature engineering, feature selection, and ensemble generation.

CLOUD VERSUS ON PREMISES Can support both.

LICENSING
All products are open source with Apache V2 licensing. Licensing based on level of support and data volume. Steam AGPL license geared towards production models.

www.ibm.com
IBM has a broad range of capabilities for advanced analytics; its main predictive analytics products are IBM SPSS Modeler and the Data Science Experience (DSX). The company’s predictive analytics strategy is about simplicity without sacrifice. SPSS enables business analysts and data scientists to utilize disparate data types such as structured, unstructured, and streaming data; prepare the data for analysis; build models; and deploy them into operational systems—all using a code-free, visual GUI. It also includes support for R and Python in workflows.

IBM also offers DSX, which provides popular open source tools such as R, Python, Spark, and Scala—as well as various notebooks—together in one place to help data scientists be productive. The big focus is on collaboration. Inside DSX, users can share assets such as data sets between projects to facilitate collaborative development of analytics.

UPSHOT
IBM is a long-time leader in predictive analytics. Built to scale, IBM SPSS Modeler is a comprehensive, full life cycle analytics platform that supports data scientists, business analysts, and DevOps. DSX provides capabilities for coding data scientists.

Product Name Version Description

IBM SPSS Modeler 18 Predictive analytics platform that provides a range of advanced algorithms and techniques.

Collaboration and Deployment Services 8 Enables the deployment and sharing of predictive analytics across the enterprise, including centralized storage and capabilities for management and control.

DATA PREPARATION CAPABILITIES
SPSS Modeler offers features to merge, join, filter, and transform data for modeling. SPSS Statistics can be used to generate and implement statistical transformations. SPSS also offers pure ETL tools that make it easy to access, transform, and load data back into a database (with SQL Pushback) or Hadoop using IBM SPSS Analytic Server.
DATA SOURCES SUPPORTED
SPSS supports flat files, Greenplum databases, BigInsights and MapR (both via SPSS Analytic Server and BigSQL), Cloudera, other IBM products and databases, Oracle, SAP, SQL Server, Teradata, Sybase, Salesforce, XML, and PDF.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
SPSS supports random tree, neural network, KNN, linear and logistic regression, general linear methods, support vector methods, Bayesian networks, decision tree modeling and rule set building using CHAID, QUEST, C5.0, and C&R, Cox regression, anomaly detection, TwoStep clustering, Monte Carlo simulations, fit, and evaluation. SPSS supports open source environments in Modeler and DSX.

MODEL MANAGEMENT CAPABILITIES
SPSS Collaboration & Deployment Services (C&DS) provides a repository with full versioning, audit data, automated model evaluation, automated model refresh, and automated model deployment, as well as full logging and governance support. It also supports notification options should users wish to require human review of new or refreshed models prior to moving models into production.
MODEL PRODUCTION/EXPORT CAPABILITIES
SPSS models are encapsulated in a “golden nugget” in the software and can be added to any Modeler flow. The entire Modeler program to prepare data for scoring (as well as the scoring of the model itself) can be deployed either in batches or real time (via web service calls to REST APIs); SPSS also supports PMML. In-database deployment with DB2, Teradata, and Netezza. Deployments to Oracle and SQL Server are also possible, albeit to a more limited extent.
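PMML, supported here as an interchange format, is an XML standard maintained by the Data Mining Group for moving trained models between tools. As a rough, hand-written illustration (toy field names and coefficients, not SPSS output), a minimal linear regression model looks something like:

```xml
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header description="Toy linear regression: score = 1.5 * income + 10"/>
  <DataDictionary numberOfFields="2">
    <DataField name="income" optype="continuous" dataType="double"/>
    <DataField name="score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression" modelName="toy_model">
    <MiningSchema>
      <MiningField name="income"/>
      <MiningField name="score" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="10.0">
      <NumericPredictor name="income" coefficient="1.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the model is declarative data rather than code, any PMML-aware scoring engine can execute it without the tool that trained it.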

AUTOMATION CAPABILITIES SPSS automatically picks the best model.

CLOUD VERSUS ON PREMISES On premises or in SoftLayer, Azure, AWS, or Google Cloud platforms.

LICENSING SPSS is available in perpetual, term, or monthly options.

http://www.megaputer.com/
Megaputer’s focus is to provide customers with knowledge based on the analysis of structured and unstructured data. The company’s flagship product is PolyAnalyst, an integrated analytics platform that permits joint analysis of this data. The PolyAnalyst environment integrates data preparation, coding, visualization, modeling, and deployment capabilities. It includes strong text analytics functionality such as feature extraction, semantic analysis, entity resolution, sentiment analysis, and classification/categorization, along with more advanced features such as the ability to perform deep linguistic text parsing.

PolyAnalyst is designed for use by data scientists/statisticians and business analysts. To this end, PolyAnalyst exposes an interactive drag-and-drop UI. PolyAnalyst can be deployed on a standalone basis or in the context of Hadoop or Spark, where it can leverage available open source libraries and algorithms such as R and Python. Because the PolyAnalyst UI supports drag-and-drop operations in Hadoop or Spark, users don’t have to write Java, Scala, or Python code.

In addition to PolyAnalyst, Megaputer markets a range of domain-specific analytical solutions, with a particular focus on the insurance, pharmaceutical, and healthcare verticals. The company often works as a partner with its clients, using its own data analysis consultants to help customers build custom analytics solutions on its platform.

UPSHOT: Geared towards data analysts and data scientists, Megaputer provides advanced text analytics capabilities that can be used to turn text data into structured data for predictive analytics. Worth looking at for those who want to combine multiple data types in one platform as well as for those looking for solutions that require this.

Product Name Version Description

Megaputer PolyAnalyst 6 Integrated data analysis platform that supports both structured and unstructured data.

DATA PREPARATION CAPABILITIES: PolyAnalyst provides a layer of data cleansing, manipulation, and transformation as well as text analytics features.

DATA SOURCES SUPPORTED: PolyAnalyst supports JDBC, ODBC, and OLE; Hadoop (HDFS, Hive); XML, JSON, Internet sources, email servers, and social media (e.g., WebHose, BrandWatch); SDL SM2; SPSS; flat files; Excel; and a range of text document formats including email and RSS feeds. The product currently supports 16 languages, including English, Chinese, Japanese, and Arabic.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: For data analysis, supports prediction, affinity, classification, regression, clustering, segmentation, and neural networks. For text mining, supports feature extraction, semantic analysis, entity resolution, classification/categorization, and deep linguistic text parsing. Also supports R, Python, and Scala code/libraries via Spark.

MODEL MANAGEMENT CAPABILITIES: Built-in management with condition-based alert generation.

MODEL PRODUCTION/EXPORT CAPABILITIES: Exports predictive models in C++ (text analytics cannot be exported); integrates with data sources and operational systems via standard interfaces such as JDBC and ODBC. PolyAnalyst includes REST APIs for data scoring; Megaputer recommends customers perform scoring in PolyAnalyst itself.

AUTOMATION CAPABILITIES: Tasks such as publishing a model and text analytics jobs can be instantiated via a scheduler and automated by means of scripts.

CLOUD VERSUS ON PREMISES: Managed cloud and on premises.

LICENSING: PolyAnalyst can be either licensed or leased as a standalone, server- or cluster-based application. Data analysis and result-reporting engines are provided under separate licenses. Custom domain-specific analytical solutions can be licensed as a service or along with the underlying analytical platform.
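Turning text into structured data for predictive analytics, as Megaputer's text analytics does, ultimately means producing fixed-length feature rows from free text. As a generic illustration of that idea (not Megaputer's actual API), a minimal bag-of-words featurizer in Python might look like this:

```python
from collections import Counter

def featurize(documents, vocabulary):
    """Turn raw text documents into structured rows of term counts.

    Each row is a fixed-length feature vector ordered by `vocabulary`,
    ready for use as input to a predictive model.
    """
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows

# Illustrative insurance-claim snippets; real systems would add parsing,
# entity extraction, and sentiment on top of simple term counts.
docs = ["Claim denied after water damage", "Water damage claim approved"]
vocab = ["claim", "water", "damage", "approved"]
features = featurize(docs, vocab)
```

Each document becomes one structured row, which can then be joined with conventional structured attributes for modeling.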

www.microsoft.com
Microsoft believes that data, analytics, and AI can accelerate digital transformation for every organization and help to drive action. The company also believes that intelligent solutions enable differentiation. Its Cortana Intelligence Suite commercializes Microsoft’s IP in big data and machine learning and includes information management, machine learning, and analytics as well as advanced analytics services and pre-built solution templates. The suite consists of thirteen different products which are all part of Azure data services, packaged together to help organizations build and deploy predictive applications and solutions. These can be purchased as one line item or services can be bought separately.

Information management products in the suite include Azure Data Factory to help build pipelines and collect data. Azure Data Catalog is used to manage data sources and Event Hubs provides a staging area to deal with data streams. Microsoft Machine Learning Studio provides automation and ease-of-use features (e.g., wizards, drag-and-drop capabilities) for building, training, scoring, testing, and comparing models. Models can be operationalized as Web services and consumed by applications and services.

Cortana Intelligence Suite is completely cloud-based. Like most cloud offerings, its pricing is based on compute and storage. Microsoft also offers on-premises machine learning through SQL Server.

UPSHOT: After years of building data science IP, Microsoft is commercializing its knowledge and experience in Cortana to address big data, advanced analytics, and AI. A welcome addition to the market, the suite targets data scientists and business analysts looking to build out intelligent applications and solutions.

Product Name Version Description

Microsoft Cortana Intelligence Suite: Microsoft suite of services for machine learning and AI.

DATA PREPARATION CAPABILITIES: Visual, self-service facilities for data connectivity, cleansing, enriching, blending, and transformation via Azure Data Factory.

DATA SOURCES SUPPORTED: 30 connectors including Azure Data Services, Amazon Redshift, Amazon S3, Salesforce, and other third-party cloud services; SAP BW and SAP HANA; commercial and open source RDBMS platforms; NoSQL, Hadoop, OData-compliant data feeds, and file systems. Azure Data Factory can connect to other Azure data sources, including Azure Data Lake Store, Azure SQL Data Warehouse, and Azure DocumentDB, as well as Azure Stream.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: Four categories of ML algorithms: classification, clustering, anomaly detection, and regression. Also supports ML techniques in open source libraries such as R and Python.

MODEL MANAGEMENT CAPABILITIES: Built-in management and monitoring; can visualize and compare models and lineage information (e.g., authors, contributors, changes).

MODEL PRODUCTION/EXPORT CAPABILITIES: Models can be operationalized as Azure Web services and consumed by other Azure services, custom-built cloud and on-premises applications, and other cloud services. Models exploit in-database scoring with supported Azure services.

AUTOMATION CAPABILITIES: N/A

CLOUD VERSUS ON PREMISES: Cloud only.

LICENSING: Can be licensed as a single line item or à la carte.

www.opentext.com
OpenText provides predictive analytics through its Big Data Analytics product, which is offered through its OpenText Analytics Suite. The suite consists of three products: iHub, Big Data Analytics, and InfoFusion.

Big Data Analytics, where the predictive analytics capabilities reside, is geared to business users. It is columnar-based software that provides a visual drag-and-drop interface for knowledge workers to explore big data and perform predictive analytics using pre-defined algorithms against billions of rows of data. Big Data Analytics is tightly integrated with iHub, which helps organizations prepare and cleanse data for analysis as well as share the results in dashboards or reports. iHub also provides APIs to enable analytics to be embedded in any mobile or Web application.

OpenText also provides connections to unstructured data for use in predictive analysis through its InfoFusion product. InfoFusion enables text mining of content data to extract entities and sentiment. That data can then be utilized as part of Big Data Analytics. Tight integration of Big Data Analytics and iHub with InfoFusion is planned for 2017.

UPSHOT: OpenText is targeted at business analysts who want an easy-to-use, drag-and-drop interface to build predictive models. The company is leveraging its heritage as an EIM vendor to provide text analytics capabilities for using unstructured data to make predictions, as well as black-box predictive analytics for business users and analysts. OpenText is well-suited to IT departments that provide predictive analytics to nontechnical users who want to analyze their data and then share their insights.

Product Name Version Description

OpenText Analytics Suite 16: Consists of three products: iHub, Big Data Analytics, and InfoFusion. iHub and Big Data Analytics share a single login.

DATA PREPARATION CAPABILITIES: Big Data Analytics allows users to integrate data from different sources via a drag-and-drop interface. It also provides data preparation features including cleansing, enrichment, and creating fields that aggregate, rename, or calculate expressions.

DATA SOURCES SUPPORTED: Native connectors for popular SQL databases, an ODBC driver, and a remote data provider option for loading data from a Web address. Native Spark and Hadoop support should be available in mid-2017.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: Anomaly detection, association rules, clustering, profiling, segmentation, decision trees, Naive Bayes classification, correlation, linear and logistic regression, summarization, and pattern mining. Support for R and Python is expected in mid-2017.

MODEL MANAGEMENT CAPABILITIES: Big Data Analytics provides an administrative environment for users to save models via a folder-based interface.

MODEL PRODUCTION/EXPORT CAPABILITIES: iHub provides APIs for embedding analytics into applications, including REST V2 via BDA.

AUTOMATION CAPABILITIES: Workflows can be saved and then scheduled.

CLOUD VERSUS ON PREMISES: Both options are available.

LICENSING: Licensing for Big Data Analytics is based on rows of data.

www.oracle.com
Oracle’s predictive and advanced analytics philosophy is to move the algorithms to the data management platform—specifically its database and big data platforms—rather than exporting data to specialized analytical servers. Oracle’s Database Cloud Service includes Oracle Advanced Analytics (OAA), more than 30 in-database implementations of machine learning algorithms that are optimized to run as SQL functions to leverage the Oracle database. OAA also provides tight integration with R so users can “push down” R scripts to equivalent in-database SQL functions. On Exadata Engineered Systems, “smart scan” technology enables the “scoring” of machine learning models on the Exadata storage tier. Similarly, Oracle’s Big Data Cloud Service and Big Data Appliance provide a number of proprietary machine learning algorithms and integration with open source MLlib and R as part of Oracle R Advanced Analytics for Hadoop (ORAAH).

Oracle also offers Oracle Analytics Cloud (OAC). OAC is designed to help business users tell their stories. It provides a suite of tools that enable business analysts to connect to data, prepare it, and analyze it in what Oracle calls a “smart data discovery” environment. OAC integrates with OAA through the common database access to provide visual model output from OAA and ORAAH that is more visually appropriate for business analysts, such as coded probability bands.

UPSHOT: Geared towards data scientists, “citizen data scientists,” and application developers, Oracle Advanced Analytics supports in-database predictive analytics and machine learning deployments for enterprises. ORAAH provides similar machine learning functionality for big data. For those utilizing Oracle Databases or Oracle Big Data platforms (cloud, on premises, or hybrid) who wish to leverage their investment and eliminate data movement in order to perform and deploy predictive analytics, Oracle Advanced Analytics and ORAAH are well worth looking into.

Product Name Version Description

Oracle Database High and Extreme Editions (includes OAA) 12.2: Provides in-database, parallelized implementations of 30+ machine learning algorithms, integration with open source R, and the optional SQL Developer/Oracle Data Miner workflow UI.
Oracle Advanced Analytics Database Option (on-premises) 12.2: Provides in-database, parallelized implementations of 30+ machine learning algorithms, integration with open source R, and the optional SQL Developer/Oracle Data Miner workflow UI.
Oracle Big Data Cloud Service (includes ORAAH) 12.2: Machine learning algorithms running in Hadoop and Spark.
Oracle Analytics Cloud 12.2: Includes a suite of tools for data prep, visualization, discovery, collaboration, and packaged KPIs.

DATA PREPARATION CAPABILITIES: Oracle provides data management and ETL support (SQL, GoldenGate, Oracle Data Preparation Cloud Services, etc.) for ingestion, profiling, and cleansing; transformation; and calculation, including features. Additionally, OAA adds support for auto binning, handling of missing values, aggregation, geospatial data, text tokenization, and thesaurus while leveraging Oracle Database’s other data management and processing features. On Hadoop and Big Data Cloud Services, Oracle provides support for Hive, Spark, and Big Data SQL.

DATA SOURCES SUPPORTED: OAA runs in the Oracle database and ORAAH runs on Hadoop. Supports all data types supported by Oracle, including transactional, geospatial, and graph, with support for text data and text mining inside the database using Oracle Text.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: OAA supports multiple machine learning algorithms in each category for classification, clustering, anomaly detection, regression, attribute importance, feature creation, explicit semantic analysis, association rules, basic statistical functions, and time series, plus CRAN R packages and Spark MLlib. All OAA algorithms additionally leverage other database features to support unstructured data. ORAAH supports regression, classification, clustering, feature creation, and R transparency to Hive. Oracle Analytics Cloud supports regression and clustering, and users’ custom R scripts.

MODEL MANAGEMENT CAPABILITIES: Incorporates the features of a database, including versioning and audit tracking. Provides the Oracle Data Miner GUI and PL/SQL scripts to schedule and run models.

MODEL PRODUCTION/EXPORT CAPABILITIES: Models are built and run natively in the Oracle database or in Hadoop. Includes a SQL Developer extension (Oracle Data Miner) for model building, evaluation, and application development and, in the near future, the Oracle Machine Learning Zeppelin notebook.

AUTOMATION CAPABILITIES: Supports automated scheduling of predictive models. No explicit support for automated model monitoring, but this can be achieved through PL/SQL scripts.

CLOUD VERSUS ON PREMISES: Cloud, on premises, or hybrid.

LICENSING: Cloud and on-premises pricing.
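The in-database philosophy Oracle describes (run the scoring function where the data lives rather than exporting rows to an analytics server) can be illustrated generically with SQLite, which also allows user-defined functions inside the engine. This is an illustrative sketch of the pattern, not Oracle's actual API, and the model coefficients are made up:

```python
import math
import sqlite3

# Toy "model": a logistic regression whose coefficients were trained elsewhere.
INTERCEPT, W_TENURE, W_CHARGES = -2.0, 0.05, 0.01

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, tenure REAL, charges REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, 24, 80.0), (2, 2, 120.0)],
)

def churn_score(tenure, charges):
    """Logistic scoring function applied row by row inside the engine."""
    z = INTERCEPT + W_TENURE * tenure + W_CHARGES * charges
    return 1.0 / (1.0 + math.exp(-z))

# Register the scoring function inside the database so scoring happens
# next to the data -- no rows leave the database.
conn.create_function("churn_score", 2, churn_score)

scores = conn.execute(
    "SELECT id, churn_score(tenure, charges) FROM customers ORDER BY id"
).fetchall()
```

The payoff of this design is that only the (small) scores move, not the (large) underlying rows.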

www.pentaho.com
Pentaho, a Hitachi Group Company, provides a data integration and analytics platform based on open source that combines data access, integration, and analytics. With the recent addition of its machine learning orchestration capability, the platform now supports data and feature engineering, model building, model deployment, and model updating for the data science process. Geared towards data scientists, data engineers, and analysts who want to build and deploy models with complex data using open source technologies, the platform provides multiple paths for building models via R, Python, MLlib, Scala, and Weka.

Machine learning scripts can be embedded directly in a data workflow to enable Pentaho users to leverage existing data preparation and feature engineering efforts in a production environment. Pentaho also provides APIs for embedding visualizations containing models into existing applications.

UPSHOT: One of the first commercial open source analytics vendors, Pentaho is targeted to data scientists who want to use complex, diverse data for predictive analytics at scale, and supports structured and unstructured data. Well-known for its capabilities in data integration and embedding analytics, the platform is also worth considering for embedding predictive analytics into applications.

Product Name Version Description

Pentaho Machine Learning Orchestration (capability set within Pentaho) 7.0: Orchestration software that streamlines the machine learning workflow and helps data scientists, engineers, and analysts collaboratively build and deploy predictive models on big data. Supported machine learning languages and libraries include R, Python, Weka, Scala, and Java.

DATA PREPARATION CAPABILITIES: Includes automated data onboarding, cleansing, and validation in a drag-and-drop environment; also includes data standardization and validation.

DATA SOURCES SUPPORTED: Supports a range of data sources including relational databases, analytic databases, NoSQL databases, Hadoop and other files, business applications, and more.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: Supports R, Python, and Scala, and includes open source Weka as part of its platform. Weka contains tools for data pre-processing, classification, regression, and clustering.

MODEL MANAGEMENT CAPABILITIES: The data scientist can see the model and scripts from within the data integration view, and can modify them in their tool of choice (e.g., RStudio or Jupyter) before re-importing them into Pentaho for incorporation into an updated analytic data delivery process.

MODEL PRODUCTION/EXPORT CAPABILITIES: Pentaho allows data engineers to execute data scientists’ scripts in R, Python, MLlib, Scala, or Weka by embedding them directly in a data workflow, allowing them to take advantage of existing data and feature engineering efforts. Using APIs for integrating BI and visualizations, organizations can also embed models in Pentaho within their existing applications.

AUTOMATION CAPABILITIES: Pentaho Data Integration (PDI) includes version control and tracking to instantiate a new version or roll back to a previous version of a model; also includes scheduling capabilities.

CLOUD VERSUS ON PREMISES: Supports both.

LICENSING: Based on the number of server cores and (if big data) Hadoop nodes; the Data Science Pack comes with an enterprise license.
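Embedding a machine learning script directly in a data workflow, as Pentaho describes, means the same pipeline that prepares features also applies the model. The sketch below shows that pattern generically in Python; the step names and hand-coded "model" are illustrative, not Pentaho's API:

```python
def prepare(row):
    """Feature engineering step: derive the features the model expects."""
    return {"tenure_years": row["tenure_months"] / 12.0,
            "high_spend": 1 if row["monthly_charges"] > 100 else 0}

def score(features):
    """Embedded model step: a hand-coded stand-in for a trained model."""
    return 0.8 if features["high_spend"] and features["tenure_years"] < 1 else 0.2

def workflow(rows):
    """Run each record through preparation and then scoring, as one pipeline."""
    return [score(prepare(row)) for row in rows]

rows = [{"tenure_months": 6, "monthly_charges": 150},
        {"tenure_months": 48, "monthly_charges": 60}]
risk = workflow(rows)
```

Because scoring is just another step in the workflow, the feature engineering used in training and in production is guaranteed to be the same code.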

www.salesforce.com
Salesforce Einstein provides pre-packaged predictive applications that can be part of any Salesforce application, including analytics, sales, service, marketing, community, apps, commerce, or IoT. Einstein learns and makes predictions on data in Salesforce and can provide each customer with unique models built from their data. These insights and predictions are served up as part of Salesforce applications. Salesforce also offers Einstein Data Discovery for more technical users to explore data and build predictive models using automated model building.

Salesforce Einstein is a layer of abstraction on top of the Salesforce data science platform used to develop applications for business users. That data science platform consists of four layers: a multitenant infrastructure, data services, AI platform services, and a development platform. Much of the analytics intellectual property for the core of Einstein Analytics was developed in-house. Salesforce also offers BeyondCore, which provides automated predictive model building for nonstatisticians and data scientists.

UPSHOT: Einstein is a good example of how predictive analytics can be delivered in a powerful intelligent platform to nontechnical users. It is worth considering if your organization is looking for prepackaged applications for CRM targeted to business users, or machine learning APIs for business analysts.

Product Name Version Description

Salesforce Einstein: AI built into the Salesforce platform.

DATA PREPARATION CAPABILITIES: Salesforce Einstein automatically prepares the data to make it ready for machine learning. Beyond its native integration with Salesforce data, it reads the metadata, enriches the dataset with unique Auto-Feature Engineering capabilities, and splits the data into training and validation sets.

DATA SOURCES SUPPORTED: Uses all Salesforce data as well as other data found in the Salesforce cloud.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: Statistical algorithms, machine learning, deep learning.

MODEL MANAGEMENT CAPABILITIES: Einstein’s native model management capabilities include model scheduling, model execution, model monitoring, log analysis, and model versioning.

MODEL PRODUCTION/EXPORT CAPABILITIES: The predictions and insights made by Einstein are written back to the Salesforce platform, which enables them to be operationalized (e.g., included in a workflow), extended, or customized.

AUTOMATION CAPABILITIES: Models are served in an automated way.

CLOUD VERSUS ON PREMISES: Cloud only.

LICENSING: Add-on to Salesforce cloud, $50/month per user.
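An automated training/validation split like the one Einstein performs during data preparation follows a standard pattern: shuffle the records deterministically, then hold out a fraction for validation. A generic Python sketch of that pattern (not Salesforce's implementation):

```python
import random

def train_validation_split(records, validation_fraction=0.2, seed=42):
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = records[:]                    # don't mutate the caller's list
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

records = list(range(10))
train, validation = train_validation_split(records)
```

Holding the seed fixed makes model comparisons fair: every candidate model is evaluated on the same held-out records.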

www.sap.com
The goal of the SAP Predictive Analytics Suite is to provide a collaborative platform to create, deploy, and maintain predictive models. Geared to data scientists, business analysts, and business users, the company is using a three-pronged approach to deliver on its vision.

First, it provides tools to help accelerate the model-building process. It offers two approaches to model building. One is Automated Modeler, which provides a wizard-driven approach for use by business analysts to build models. SAP also offers Expert Analytics as part of the suite, which provides data scientists access to predictive algorithms and tooling. Data scientists can use Expert Analytics to build custom pipelines that exploit HANA’s in-database Automated Predictive Libraries (APL) and Predictive Analysis Library (PAL) along with R (Python is not yet supported).

Second, SAP is automating the model-building process with Predictive Factory to put machine learning and predictive models into production at scale (i.e., building and managing thousands of models).

Third, its strategy is to enable predictive models to be embedded into applications and systems. It is currently embedding predictive capabilities into its own SAP products. It is also pushing machine learning into its Cloud BI product in a guided discovery model to automatically find insights in data.

SAP Predictive Analytics Suite integrates with both SAP HANA and open source Hadoop. Users can access data in HANA or in Hadoop as well as schedule analytical processing in either environment.

UPSHOT: SAP has come a long way over the past few years with its predictive analytics offerings to provide a complete life cycle approach. It also provides predictive analytics production at scale via its Predictive Factory. Geared towards business analysts and data scientists in midsize to large companies (especially those that already have SAP installed), it also provides some support for business users.

Product Name Version Description

SAP Predictive Analytics Suite 3.2: Consists of three components: SAP Predictive Analytics (comprising SAP Automated Analytics, SAP Expert Analytics, and SAP Predictive Factory), SAP Predictive Services, and SAP Predictive Analytics Integrator.

DATA PREPARATION CAPABILITIES: Provided in Data Manager as part of SAP Automated Analytics. This includes data integration as well as data preparation for predictive modeling, including auto-generating attributes such as sums, differences, and ratios. The suite also lets users calculate new variables. Data Manager creates analytical data sets based on time-stamped populations that can be automatically prepared and consumed by modeling and operationalization workflows and which enable users to identify what happened to an entity leading up to an event. Supports access to common RDBMSs.

DATA SOURCES SUPPORTED: SAP HANA, SAP HANA Vora, Oracle, Teradata, SQL Server, Hadoop (Hive, Spark), generic JDBC (Expert Analytics), IBM DB2, IBM PureData System for Analytics, PostgreSQL, Sybase, and Vertica.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: Classification, regression, clustering, association rules, and time series. PAL contains many algorithms. Also includes R in Expert Analytics; users can integrate their own algorithms via nodes. No support yet for Python.

MODEL MANAGEMENT CAPABILITIES: Predictive Factory allows users to schedule the automated application of predictive models on data sets that can then be consumed by business applications and BI systems. Testing of data and performance deviation can also be scheduled and the models automatically retrained. It also automatically monitors model performance using a set of KPIs.

MODEL PRODUCTION/EXPORT CAPABILITIES: Models can be exported to SQL, Java, JavaScript, SAS, C, or C++ code, PMML, and several other languages for use in custom-developed processes or applications. Models can be applied without export using direct in-database application or by embedding models in SAP business applications.

AUTOMATION CAPABILITIES: Includes automated model building and model application, and automatic retraining, scoring, and applying models to new data when models get stale. Furthermore, it enables the automatic creation of analytical data sets based on a population snapshot that is automatically refreshed.

CLOUD VERSUS ON PREMISES: Available in both the cloud and on premises.

LICENSING: Predictive Analytics Suite is the building block for all on-premises implementations. It’s priced by the size of the database to be analyzed or scored, whichever is larger (in units of 64GB). Predictive Analytics Suite includes one license of Predictive Analytics Modeler. Additional Named User licenses of Predictive Analytics Modeler can be added. Predictive Analytics Suite includes: Data Manager, In-Database Scoring, Predictive Factory, Social Link Analysis, and Recommendation.

www.sas.com
SAS offers software for the complete predictive analytics life cycle, from data access and preparation to production and management, within one application. Through its SAS Analytics Suite, the company’s goal is to provide customers with a simple, powerful, and automated platform to help solve any business problem. SAS, known for its comprehensive analytics capabilities, is continuing to develop more methods for machine learning, NLP, and edge computing in order to help organizations analyze multiple data structures. It also enables customers to use multiple methods together. The company is moving its products to one unified interface to support collaboration across multiple personas (data analyst, data scientist, and so on). The company is also opening up its platform with open APIs for SAS, Java, Python, and R.

SAS has a multiprong vision for the future of predictive analytics. It supports the automation of model building through easy-to-use interfaces as well as in a factory model that supports building thousands of models at scale. The software enables users to embed analytics as well—in memory, on Hadoop, in streams, in devices, and in databases. SAS is also bringing analytics to audio and video, as well as utilizing AI to enable human-machine interaction in its products.

UPSHOT: Well-known as a leader in complex analytics, SAS offers comprehensive support for the complete analytics life cycle geared to multiple personas. SAS also continues to innovate with new methods, features such as hyperparameter optimization, automation through rapid predictive modeling and factory miner, third-generation in-memory analytic servers, and solutions for fraud, cybersecurity, and IoT.

Product Name Version Description

SAS Analytics Suite 14.2: Analytics suite with leading-edge algorithms to solve even the most complex problems.

DATA PREPARATION CAPABILITIES: SAS provides data preparation, data integration, and data quality capabilities that support ETL, ELT, profiling, data governance, master data management, data federation, lineage, business rules, and real-time or batch data quality. SAS can password-encrypt files and restrict access to both transfer and storage; also provides the ability to de-identify information and provides row/column security and data masking.

DATA SOURCES SUPPORTED: Supports native data sources such as Hadoop (including Cloudera, Hortonworks, MapR, BigInsights, and Pivotal), Impala, Hive, Hive Server2, HAWQ, Redshift, PostgreSQL, SAP R/3, SAP BW, PI System, DB2, Oracle, Sybase, Teradata, Informix, SQL Server, Netezza, Aster, Greenplum, HPE Vertica, ParAccel, Firebird, flat files, SAP HANA, message queues, streaming, unstructured (such as PDF and DOC), and social media. SAS also provides access to more than 32 data sources via ODBC, OLE DB, and PC Files engines (which include Excel, Paradox, CSV, TXT, and more). Also supports streams.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED: SAS provides a range of multidisciplinary analytics including descriptive statistics, forecasting, machine learning, text mining, and operations research. Examples include regression, LARS/LASSO/elastic networks, ensemble, two-stage, PCA, robust spatial, GLM, graph networks, weighted linear assignment, minimum spanning tree, state space, variable clustering, deep neural, SVM, factorization machines (including tensor factorization), decision tree, gradient boosting, extreme gradient boosting, forest, Bayesian, net-lift, hierarchical clustering, SOM (batch and Nadaraya-Watson), k-means clustering, and fuzzy clustering; includes open APIs for Java, Python, and R.

MODEL MANAGEMENT CAPABILITIES: Includes validation of both SAS and open models before they go into production. Offers Model Monitoring Reports that evaluate predicted and actual target values; reports also include Lift, Gini ROC, Gini Trend, K-S, and Mean Squared Error for prediction models. Users set performance thresholds which trigger notifications for models that may require retraining.

MODEL PRODUCTION/EXPORT CAPABILITIES: Supports in-database scoring and in-stream scoring; also supports model export via PMML, C++, and Java.

AUTOMATION CAPABILITIES: SAS automates the entire analytics life cycle including automated, standardized, and repeatable data prep processes. For model development, SAS automates code creation through Code Tasks and Snippets. SAS also automates the creation of thousands of segmented models and will pick the best-performing model for each segment. For data scientists, SAS can optimize hyperparameters to help them save time and reduce error.

CLOUD VERSUS ON PREMISES: Both are supported.

LICENSING: For on-premises deployments, SAS offers named user, desktop/server-based, and processor-based subscription licensing options. For SaaS/cloud deployments, SAS offers named user, server-based, and concurrent user licensing options.
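Hyperparameter optimization, one of the SAS features noted above, means systematically searching candidate settings and keeping the one that scores best on held-out data. A generic grid-search sketch in Python, with a toy scoring function standing in for training and validating a real model (this is not SAS code):

```python
from itertools import product

def evaluate(learning_rate, depth):
    """Toy validation score standing in for training + scoring a real model."""
    # Pretend the best settings are learning_rate=0.1 and depth=4.
    return 1.0 - abs(learning_rate - 0.1) - 0.05 * abs(depth - 4)

grid = {"learning_rate": [0.01, 0.1, 0.5], "depth": [2, 4, 8]}

best_score, best_params = float("-inf"), None
for lr, d in product(grid["learning_rate"], grid["depth"]):
    score = evaluate(lr, d)
    if score > best_score:
        best_score, best_params = score, {"learning_rate": lr, "depth": d}
```

Production systems typically replace exhaustive grids with smarter search (random or Bayesian), but the select-the-best loop is the same.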

www.teradata.com
One of Teradata’s guiding principles is that it is not simply a provider of pure technology. Its primary focus is building solutions that are technology enabled. This includes patented solutions from its big data consulting arm, Think Big. Additionally, the company will work with best-of-breed and open source solutions as part of its strategy to solve business problems. Its own solutions include Teradata Aster Analytics, Teradata in-database analytics, and Think Big Data Science Lab 3.0.

The Teradata Aster Analytics software leverages massively parallel processing (MPP) platforms to provide a scalable prebuilt analytics library of functions for data preparation, predictive modeling, machine learning, and the operationalization of insights. In addition to these functions, which can be executed through SQL, Aster Analytics can also run R code and libraries natively in its MPP architecture. Teradata also offers an extension for the open source KNIME data mining workbench. This allows data scientists to access Aster Analytics functions via KNIME to explore and prepare data sets, build models, visualize results, and package them as workflows. The Aster Analytics extension for KNIME can run individual functions or complete Aster Analytics workflows.

The company offers over 1,000 data exploration, data preparation, and analytics functions that run directly in the Teradata database. Teradata has also developed Think Big Data Science Lab 3.0, an open source offering that supports data science notebooks such as Jupyter. Think Big Data Science Lab is free software, available via GitHub. It includes prebuilt packages for R, TensorFlow, Spark, and other open source technologies.

UPSHOT: Teradata is moving past its roots as a big data management/data warehouse company to help its customers build the next generation of advanced analytics solutions—solutions that require algorithms at scale. With more than 50% of its staff available as consultants, Teradata will work with data scientists and business users to create and deploy these solutions. On the software side, Teradata continues to support solutions that are targeted towards business analysts with SQL or R programming skills.

62 
PRODUCT NAME AND VERSION
Teradata Aster Analytics 7: Multigenre advanced analytics at scale to help business users uncover and operationalize nonintuitive insights.

DATA PREPARATION CAPABILITIES
Aster Analytics provides a variety of data exploration and preparation functions including statistics, descriptive analytics, IoT transformation, and parsers for unstructured and multistructured data. Teradata Warehouse Miner provides a visual interface for data preparation and exploration functions that runs on data directly in Teradata.

DATA SOURCES SUPPORTED
Teradata offers a data and query federation fabric that facilitates data access across distributed and heterogeneous sources via QueryGrid. Supports third-party RDBMSs such as Oracle; any database can be supported by creating a connector via APIs. QueryGrid also provides an SDK for supporting other databases, NoSQL platforms, cloud apps, and cloud storage services such as Amazon S3.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
More than 10 techniques, including classification, clustering, pattern matching, path analysis, decision trees, distribution matching, naïve Bayes classification, graph analysis, neural networks, and text and sentiment analysis. Aster Analytics also supports open source libraries such as R and Python.

MODEL MANAGEMENT CAPABILITIES
Teradata Warehouse Miner provides some model management capabilities for analytic data management and model version control.

MODEL PRODUCTION/EXPORT CAPABILITIES
Models can be operationalized and packaged for business users via Aster AppCenter or, if applicable, instantiated as stored procedures in the Teradata Database. Models exploit in-database scoring, along with Aster Analytics' MPP processing. Aster Analytics models can be exported as a PMML-like model that can be scored on any Java-supported platform using the Aster Scoring API.

AUTOMATION CAPABILITIES
Aster Analytics is also embedded within Teradata solutions. For example, Customer Interaction Manager leverages Aster's pathing functions within an interface built for marketers.

CLOUD VERSUS ON PREMISES
On premises and in the cloud on Teradata IntelliCloud, AWS, and Azure.

LICENSING
License models differ based on the deployment platform. For cloud deployments, subscription licenses are available on an hourly, monthly, or annual basis. For Hadoop, term licenses per core are offered. For appliance deployments, perpetual and term licenses are available.

63
www.tibco.com
The goal of TIBCO's Insight Platform suite is to enable users to explore data faster and take action sooner. To that end, it supports an insight-to-action paradigm that builds on its expertise in visualization, analytics, and data interconnection. Spotfire also integrates with TIBCO StreamBase, a platform for streaming ingest and analysis, to help put models into production.

Spotfire caters to three distinct classes of user: nontechnical business users; analysts, engineers, and scientists; and data scientists. It supports descriptive stats, similarity, clustering (including K-means), regression, correlations, fitting, and forecasting, among other types of analysis. In addition, Spotfire offers what TIBCO calls "contextual" calculations—prebuilt tools or workflows users can launch just by clicking on them. Supported contextual calculations include clustering and classification. Users can save models and scripts as data functions in the Spotfire Library, a central, shared repository accessible to desktop and Web clients that is used to store Spotfire analyses, data files, custom data functions, information links, shared connections, and data visualization color schemes. Users can export R models to run natively in StreamBase, as well as TIBCO's other streaming analytics products.

UPSHOT
Best known for its visual capabilities in Spotfire, TIBCO offers a number of other products to help with the predictive analytics life cycle. The recent purchase of Statistica will blend TIBCO's integration and visualization capabilities with the strength of Statistica's analytics platform.

64 
PRODUCT NAMES AND VERSIONS
TIBCO Spotfire Analyst 7.9: Secure, governed, enterprise analytics platform with built-in data wrangling that delivers recommendation-driven, visual, geo, predictive, and streaming analytics.
TIBCO Enterprise Runtime for R (TERR) 4.3: A parallel compute engine for executing R, Python, and other code. TERR incorporates IP from TIBCO's S-PLUS, an implementation of S, a statistical programming language.
TIBCO StreamBase 10: Streaming data analytics platform.

DATA PREPARATION CAPABILITIES
Visual interface for data connectivity, cleansing, enriching, blending, and transformation.

DATA SOURCES SUPPORTED
The platform supports connectivity to more than 20 data sources, including all major RDBMS platforms, big data platforms, and cloud services. A bundled data federation layer permits federated access to any JDBC-compliant source. Supports AWS, Azure, Google Cloud, and big data platforms such as Hadoop, Spark, MongoDB, and Cassandra.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
More than five techniques, including descriptive stats, similarity, clustering (including K-means), regression, correlations, fitting, and forecasting. Supports additional techniques via open source projects (R, KNIME, Python, and H2O.ai) as well as proprietary models and libraries (S-PLUS, MATLAB, SAS).

MODEL MANAGEMENT CAPABILITIES
Models and other statistical/analytical scripts are stored as data functions in the Spotfire Library, permitting reuse in multiple applications.

MODEL PRODUCTION/EXPORT CAPABILITIES
Spotfire users can export R models to run natively in TIBCO's streaming analytics products, including StreamBase. Spotfire also supports export to PMML via the R pmml package.

AUTOMATION CAPABILITIES
Automation via TIBCO Spotfire Automation Services, sold separately from Spotfire. Spotfire Automation Services includes Job Builder, a visual tool that bundles a set of predefined tasks for creating jobs, along with an API. Job Builder is accessible via Spotfire Analyst. Jobs can be scheduled to execute on demand, periodically, or according to other criteria.

CLOUD VERSUS ON PREMISES
On premises, cloud, and hybrid deployments. Spotfire is also available via the AWS Marketplace.

LICENSING
Licensing is via an annual subscription.
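The PMML export path in the table above means any downstream consumer can read an exported model with ordinary XML tooling. The sketch below parses and scores a minimal hand-written regression PMML document; the field names, coefficients, and model name are invented for illustration, and a real Spotfire/R export would be considerably richer.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written PMML document of the kind the R pmml package
# emits for a linear regression model. Field names and coefficient
# values are invented for illustration.
PMML_DOC = """<?xml version="1.0"?>
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <DataDictionary numberOfFields="2">
    <DataField name="spend" optype="continuous" dataType="double"/>
    <DataField name="revenue" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="demo" functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="spend" coefficient="0.8"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}

def describe_pmml(doc: str) -> dict:
    """Return the input fields and model name declared in a PMML document."""
    root = ET.fromstring(doc)
    fields = [f.attrib["name"]
              for f in root.findall("pmml:DataDictionary/pmml:DataField", NS)]
    model = root.find("pmml:RegressionModel", NS)
    return {"fields": fields, "model": model.attrib["modelName"]}

def score(doc: str, spend: float) -> float:
    """Apply the regression table by hand: intercept + coefficient * value."""
    root = ET.fromstring(doc)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    pred = table.find("pmml:NumericPredictor", NS)
    return float(table.attrib["intercept"]) + float(pred.attrib["coefficient"]) * spend

print(describe_pmml(PMML_DOC))   # declared fields and model name
print(score(PMML_DOC, 10.0))     # 1.5 + 0.8 * 10 = 9.5
```

This portability is the point of PMML: a model trained in one tool can be scored by any engine that understands the markup, without re-implementing the math by hand as done here.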

65
www.statistica.io
TIBCO Statistica is focused on providing deep functionality for the analytics life cycle in a practical, easy-to-use, and cost-effective platform. Geared towards both business analysts and data scientists, Statistica provides an updated visual drag-and-drop UI for business analysts as well as a workbench environment. The company is now opening up its platform to open source tools including R, Python, and Scala.

Statistica is also embedding its algorithms in big data platforms such as Hadoop and even in sensors and IoT gateways. Its goal is to bring the analytics to the data. To this end, Statistica recently released Statistica Edge, which allows organizations to deploy data prep and analytics workflows to sensors, smart devices, and equipment. Users can also invest in the Statistica Enterprise bundle, which offers rules-based decision support to front-line staff and support for embedding algorithms both in big data platforms and in traditional databases.

UPSHOT
While the company has gone through some organizational changes since it was owned by Dell, it still maintains its focus on providing an extremely solid platform for business analysts and data scientists. The TIBCO Statistica platform will be a good addition to the TIBCO Spotfire platform, where it will complement TIBCO's visual analytics and connectivity strengths with its own strength in the predictive analytics life cycle.

66 
PRODUCT NAME AND VERSION
TIBCO Statistica 13.2: An integrated analytics platform that provides support for the analytics life cycle.

DATA PREPARATION CAPABILITIES
Provides options for merging, aggregating, stacking and unstacking, transformations and smoothing, cleaning/recoding/imputing missing data, identifying duplicate records, and finding and recoding outliers. Numerous specialized ETL functions are available to align customer records from different sources, to align time-stamped or batch data recorded at different time intervals, to aggregate diverse data sources, etc.

DATA SOURCES SUPPORTED
Statistica can directly access data from all standard relational database formats as well as specialized databases (e.g., the OSI PI database for process data); text, Excel, SAS, SPSS, and Hadoop.

PREDICTIVE ANALYTICS TECHNIQUES SUPPORTED
Supports association, classification, decision trees, neural networks, regression, optimization, simulation, and others. Also supports R and Python. Integrates with app marketplaces such as Apertiva and Algorithmia, and with Azure Machine Learning and H2O.ai.

MODEL MANAGEMENT CAPABILITIES
Provides a model repository for versioning and tracking changes; can monitor models via a monitoring and alerting server.

MODEL PRODUCTION/EXPORT CAPABILITIES
Export to PMML, Java, C#, SQL, and SAS, where data and analysis configurations (and others, such as models and rules) are abstracted as separate version-controlled objects. Can also deploy a workflow and make it available as a web service. Can port to visualization tools such as Tableau, Qlik, and Spotfire. Supports in-database analytics.

AUTOMATION CAPABILITIES
Users can create workflows called recipes, which can be reused.

CLOUD VERSUS ON PREMISES
Primarily on premises (the business analyst UI will be browser-based in the cloud in the fall).

LICENSING
N/A

67
TDWI Analyst Viewpoint
As a special TDWI Navigator Report feature, the
Analyst Viewpoint takes a deeper look at the following
sponsoring organizations to lay out in greater detail the
key differentiators and some of the pros and cons of
their offerings.

68 
TDWI Analyst Viewpoint

www.sas.com
SAS has long been a market leader for complex and advanced analytics, which includes predictive analytics. Founded over forty years ago, the company has continued to maintain a solid customer base, a strong road map, and year-over-year revenue growth.

The predictive market is a dynamic one, however, and SAS realizes this. On the one hand, there are commercial products (of which SAS is one) that address the full analytics life cycle. On the other hand, open source has gained a lot of traction in analytics—especially for machine learning and building analytics apps. A new generation of data scientists, who can code and enjoy it, are helping to build this momentum.

Given this, SAS is making its platform more open. In fact, SAS Viya now includes integration with R and Python, and SAS has projects such as SASPy, which enables SAS to be embedded in Python. Opening the platform is an important move for SAS.

TDWI believes some key differentiators for SAS include:

BREADTH AND DEPTH OF FUNCTIONALITY. Although some complain about the price of SAS products, you get what you pay for: it is hard to beat the breadth and depth of the SAS offering. The company provides many algorithms for analytics, refactored to support big data. It can even support analytics in streams. SAS addresses the complete analytics life cycle, from data management through preparation to model building and deployment. Text mining is part of the platform to deal with unstructured text data. It also addresses governance through its model management capabilities, and it provides a platform for decision management. In fact, SAS is often ahead of other vendors in thinking through the practicalities of what is involved in advanced analytics.

AUTOMATION. SAS solutions are strong on the automation front. Although some vendors automate a workflow or provide automated model building, SAS does more across the analytics life cycle. Most important, SAS provides automated model monitoring, detecting and alerting the enterprise when a model is degrading. SAS also provides automated champion/challenger capabilities. Model alerts can be generated at any frequency and can be emailed automatically to those who need to know. As more companies move to actually deploy predictive models rather than simply build them, these kinds of capabilities will be critical.

HISTORY OF INNOVATION. SAS invests heavily in R&D—nearly twice the percentage of annual revenue that many other technology companies do. It is building new techniques for neural networks and deep learning. It can perform advanced analytics in event streams. It is infusing advanced analytics into its key areas of growth and investment, including risk management, visualization, customer intelligence, data management, fraud, and security. SAS is even embedding natural language processing and deep learning inside of its applications to provide more human-like interactions.

Overall, SAS should be on the short list for organizations that have complex analytics problems, need a complete solution, and prefer a graphical interface. SAS's move to include open source algorithms in its platform should also be good news to those who prefer them for model development but need a complete solution.
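The SASPy project mentioned above lets Python code drive a SAS session. The sketch below submits a standard PROC MEANS step through SASPy; it assumes a locally configured SAS deployment (the session call is deferred and guarded because SASPy cannot run without one), and the choice of dataset and variables is just for illustration.

```python
# A sketch of embedding SAS in Python via the open source SASPy package.
# SASPy requires a configured SAS installation or remote SAS server, so
# the session call is guarded; PROC MEANS on sashelp.class is a standard
# SAS step chosen only for illustration.
SAS_PROGRAM = """
proc means data=sashelp.class mean std;
  var height weight;
run;
"""

def run_in_sas(program: str):
    """Submit a SAS program from Python and return its listing output."""
    import saspy                       # deferred: needs a SAS deployment
    sas = saspy.SASsession()           # uses the locally configured profile
    result = sas.submit(program)       # returns {'LOG': ..., 'LST': ...}
    sas.endsas()
    return result["LST"]

if __name__ == "__main__":
    try:
        print(run_in_sas(SAS_PROGRAM))
    except Exception as exc:           # no SAS available in this environment
        print(f"SAS session unavailable: {exc}")
```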

69
TDWI Analyst Viewpoint

www.alteryx.com
Alteryx, founded in 2010, was originally focused on helping line-of-business users access and blend disparate data sources and perform spatial analysis. In 2012 it added predictive analytics capabilities to deliver a complete self-service platform that allows users to access, blend, prepare, and analyze data via a visual interface. With its recent acquisition of Yhat, Alteryx will also provide tools to manage and deploy machine learning and predictive models for data scientists. Yhat tools enable data scientists to embed predictive models in any application capable of making REST API requests. Other Yhat tools provide model automation and tracking over time. These tools are being integrated into the Alteryx platform to provide a unified solution for line-of-business users and data scientists.

The market likes Alteryx. Its stock price has continued to rise since its IPO in March 2017. The company does a good job growing its customer base as well as collaborating with partners. TDWI has identified these additional differentiators:

• DEALING WITH DISPARATE DATA. Aside from connecting to many data sources (including spreadsheets, social media, cloud applications, and data warehouses such as Oracle, Teradata, Hadoop, and Cloudera), Alteryx allows users to enrich data for analysis through third-party packaged data from Experian, D&B, and TomTom, as well as census data from both the U.S. and Canada. Alteryx also provides parsing tools for text data and tools to work with geospatial data. TDWI research indicates that organizations that use disparate data in their analytics are more likely to measure a top- or bottom-line impact; Alteryx is ahead of many vendors in this space.

• REALIZING THERE IS MORE TO SELF-SERVICE THAN THE TOOL ITSELF. Alteryx is geared primarily towards the business analyst. With self-service making analytics and even predictive analytics easier to use, many organizations are jumping on board. Alteryx supports R and is adding Python as nodes in its GUI. It also has a partnership with DataRobot. However, TDWI often finds that although many organizations use self-service tools, they are not necessarily widely penetrated in an organization. Part of the problem is skills development. To overcome this challenge, Alteryx has partnered with Udacity to provide a 12-week nanodegree in predictive analytics.

• BUILDING OFF ITS BASE. Alteryx says it focuses on what it does best, which is leading customers through an analytics journey that starts with data discovery and preparation, and then helping them move up the analytics maturity curve with geospatial, predictive, and prescriptive analytics, ultimately leading to model deployment. In this way, Alteryx can impact business processes, improve decisions, and grow its base of users.

Alteryx is riding the self-service wave through data preparation and into building and deploying predictive analytics. With its move to operationalize analytics with Alteryx Server and its Yhat acquisition, it should continue to provide value to its customers.
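The REST-style model embedding described above amounts to posting a JSON feature record to a scoring endpoint and reading a JSON score back. The sketch below shows the shape of such a call using only the standard library; the URL, model name, field names, and response layout are invented, so the real contract for a given Yhat/Alteryx deployment should come from its own API documentation.

```python
import json
from urllib import request

# Hypothetical Yhat-style scoring endpoint and payload layout; the URL,
# model name, and feature names are invented for illustration.
SCORE_URL = "http://models.example.com/ScienceOps/churn-model/score"

def build_request(features: dict) -> request.Request:
    """Package a feature record as a JSON-over-HTTP scoring request."""
    body = json.dumps({"data": features}).encode("utf-8")
    return request.Request(SCORE_URL, data=body,
                           headers={"Content-Type": "application/json"})

def parse_response(raw: bytes) -> float:
    """Pull the predicted score out of a JSON response body."""
    return float(json.loads(raw)["result"]["score"])

req = build_request({"tenure_months": 18, "monthly_spend": 42.5})
print(req.get_method())  # POST — urllib infers POST when data is present

# A canned response of the assumed shape, standing in for the live call
# (request.urlopen(req) would perform it against a real deployment):
canned = b'{"result": {"score": 0.87}}'
print(parse_response(canned))
```

Because the contract is plain JSON over HTTP, any application that can make a REST call can consume the model, which is what makes this style of deployment attractive for embedding predictions in operational systems.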

70 
TDWI Analyst Viewpoint

www.cloudera.com
Cloudera was founded in 2008 to develop the first enterprise-class implementation of Apache Hadoop. The company provides a family of products and services to support the open source big data ecosystem. With a heavy investment by Intel, the company has built a solid foundation to become the premier Hadoop distribution provider. Cloudera has contributed to an ecosystem around Hadoop, which is important as Hadoop is moving into mainstream adoption in production environments. Now, as Hadoop hype has cooled down, Cloudera is emphasizing its role as an enterprise data platform.

Of course, analytics (and more specifically, advanced analytics) is a key driver for big data, and Cloudera realized early on that it wasn't enough to simply provide a commercial distribution for Hadoop. Cloudera partners with analytics providers such as SAS and Tableau, but is taking analytics further. Recently, the company expanded its message to include customer insights that rely on analytics as well as IoT and cybersecurity. Taking advantage of a new generation of data scientists who are interested in open source analytics tools, Cloudera has introduced its own Cloudera Data Science Workbench to provide a "great data science experience" in the Hadoop ecosystem.

TDWI has identified several key differentiators for Cloudera, including:

• STRENGTH IN BIG DATA. In many ways, data science is an outgrowth of the need to analyze big data, which comes with its own set of preparation and processing issues. Cloudera understands big data, processing it in-memory and in parallel. The company is also bringing other aspects of the analytics life cycle to enterprise big data, such as data wrangling and model testing.

• STRENGTH IN OPEN SOURCE. Cloudera was founded on open source technology. It is leveraging its strength in open source to support data scientists, many of whom are going to use open source analytics tools for data analysis and application building. Although the user interface is specific to Cloudera, users can use multiple open source technologies in the workbench. A key concept in the workbench is context: each team will want its own area for various kinds of projects, whether that be R, Python, or Scala. The company also works with dozens of other partners.

• THINKING THROUGH THE ISSUES. Cloudera has a history of thinking through how to make open source suitable for the enterprise. For example, the company has added tools and functionality to support solid metadata management and security in Hadoop, addressing enterprise concerns. Likewise, it is being thoughtful about how to help organizations test and deploy models into production in a distributed ecosystem.

TDWI research indicates that open source tools for advanced analytics, such as predictive analytics and machine learning, are rapidly gaining steam, especially among those who like to code. Cloudera is in a good position to leverage its strengths in open source and big data to provide a valuable platform for data scientists.

71
TDWI NAVIGATOR PLATINUM SPONSORS

SAS is the leader in analytics. Through innovative analytics,


business intelligence and data management software and
services, SAS helps customers at more than 83,000 sites
make better decisions faster. Since 1976, SAS has been giving
customers around the world THE POWER TO KNOW.

Alteryx, Inc. is a leader in self-service data analytics. Alteryx


provides analysts with the ability to easily discover, prep,
blend, and analyze all data using a repeatable workflow,
then deploy and share analytics at scale for deeper insights
in hours, not weeks. Thousands of companies worldwide rely
on Alteryx daily.

Cloudera delivers the modern platform for machine learning


and advanced analytics built on the latest open source
technologies. The world’s leading organizations trust Cloudera
to help solve their most challenging business problems by
efficiently capturing, storing, processing and analyzing vast
amounts of data. Learn more at cloudera.com.

72 
TDWI Research provides research and advice for
data professionals worldwide. TDWI Research
focuses exclusively on data management and
analytics issues and teams up with industry thought
leaders and practitioners to deliver both broad and
deep understanding of the business and technical
challenges surrounding the deployment and use
of data management and analytics solutions. TDWI
Research offers in-depth research reports, commentary,
inquiry services, and topical conferences as well
as strategic planning services to user and vendor
organizations.

74 
