
Introduction to Classification & Regression Trees (CART)

Posted by Venky Rao on January 13, 2013 at 5:56pm



Decision Trees are commonly used in data mining with the objective of creating a model
that predicts the value of a target (or dependent variable) based on the values of several
input (or independent variables). In today's post, we discuss the CART decision tree
methodology. The CART or Classification & Regression Trees methodology was introduced
in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone as an
umbrella term to refer to the following types of decision trees:

Classification Trees: where the target variable is categorical and the tree is used to
identify the "class" within which the target variable is most likely to fall.

Regression Trees: where the target variable is continuous and the tree is used to
predict its value.

The CART algorithm is structured as a sequence of questions, the answers to which
determine what the next question, if any, should be. The result of these questions is a
tree-like structure where the ends are terminal nodes, at which point there are no more
questions. A simple example of a decision tree is as follows [Source: Wikipedia]:

The main elements of CART (and any decision tree algorithm) are:
1. Rules for splitting data at a node based on the value of one variable;
2. Stopping rules for deciding when a branch is terminal and can be split no more; and
3. Finally, a prediction for the target variable in each terminal node.
In order to understand this better, let us consider the Iris dataset (source: UC-Irvine
Machine Learning Repository http://archive.ics.uci.edu/ml/). The dataset consists of 5
variables and 151 records as shown below:

In this data set, "Class" is the target variable while the other four variables are independent
variables. In other words, the "Class" is dependent on the values of the other four
variables. We will use IBM SPSS Modeler v15 to build our tree. To do this, we attach the
CART node to the data set. Next, we choose our options in building out our tree as follows:

On this screen, we pick the maximum tree depth, which is the maximum number of "levels" we
want in the decision tree. We also choose the option of "pruning" the tree, which is used to
avoid over-fitting. More about pruning in a different blog post.

On this screen, we choose stopping rules, which determine when further splitting of a node
stops or when further splitting is not possible. In addition to maximum tree depth discussed
above, stopping rules typically include reaching a certain minimum number of cases in a
node, reaching a maximum number of nodes in the tree, etc. Conditions under which
further splitting is impossible include when [Source: Handbook of Statistical Analysis and
Data Mining Applications by Nisbet et al]:

Only one case is left in a node;

All cases in the node are duplicates of each other; and

The node is pure (all target values agree).
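For readers who want to experiment outside SPSS Modeler, the same stopping rules appear as hyperparameters in open-source CART implementations. The snippet below is a minimal sketch using scikit-learn's DecisionTreeClassifier; the specific parameter values are illustrative assumptions, not the settings used in this post.

    # Hypothetical illustration: CART stopping rules expressed as
    # scikit-learn hyperparameters (values chosen for illustration only).
    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        criterion="gini",        # CART's default impurity measure for splitting
        max_depth=5,             # maximum tree depth ("levels")
        min_samples_split=20,    # do not split a node with fewer than 20 cases
        min_samples_leaf=10,     # every terminal node must keep at least 10 cases
        ccp_alpha=0.01,          # cost-complexity pruning strength (0 = no pruning)
    )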


Next we run the CART node and examine the results. We first look at Predictor Importance,
which represents the most important variables used in splitting the tree:

From the chart above, we note that the most important predictor (by a long distance) is the
length of the Petal followed by the width of the Petal.

A scatter plot of the data by plotting Petal length by Petal width also reflects the predictor
importance:

This should also be reflected in the decision tree generated by the CART. Let us examine
this next:

As can be seen, the first node is split based on our most important predictor, the length of
the petal. The question posed is "Is the length of the petal greater than 2.45 cm?". If not,
then the class in which the Iris falls is "setosa". If yes, then the class could be either
"versicolor" or "virginica". Since we have completely classified "setosa" in Node 1, that
becomes a terminal node and no additional questions are posed there. However, Node 2
still needs to be broken down to separate "versicolor" from "virginica". Therefore, the next
question needs to be posed, and it is based on our second most important predictor, the
width of the petal.

As expected, in this case, the question relates to the width of the Petal. From the nodes,
we can see that by asking the second question, the decision tree has almost completely
split the data separately into "versicolor" and "virginica". We can continue splitting them
further until there is no overlap between classes in each node; however, for the purposes of
this post, we will stop our decision tree here. We attach an Analysis node to see the overall
accuracy of our predictions:

From the analysis, we can see that the CART algorithm has classified "setosa" and
"virginica" accurately in all cases and accurately classified "versicolor" in 47 of the 50 cases
giving us an overall accuracy of 97.35%.
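The walkthrough above uses IBM SPSS Modeler, but the same experiment can be approximated in a few lines of Python. The sketch below is my own illustration using the 150-record Iris data bundled with scikit-learn, so the exact figures will differ slightly from the Modeler output shown here.

    # Rough reconstruction of the Iris example with scikit-learn's CART
    # implementation; settings and results are illustrative, not the
    # author's SPSS Modeler configuration.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=2, random_state=0)
    clf.fit(iris.data, iris.target)

    # Predictor importance: petal measurements dominate here as well.
    for name, importance in zip(iris.feature_names, clf.feature_importances_):
        print(f"{name}: {importance:.3f}")

    # With these settings the root split separates setosa perfectly
    # (petal length <= 2.45 cm, or the equivalent petal width <= 0.8 cm).
    print(export_text(clf, feature_names=iris.feature_names))
    print("Training accuracy:", clf.score(iris.data, iris.target))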
Some useful features and advantages of CART [adapted from: Handbook of Statistical
Analysis and Data Mining Applications by Nisbet et al]:

CART is nonparametric and therefore does not rely on data belonging to a particular
type of distribution.

CART is not significantly impacted by outliers in the input variables.

You can relax stopping rules to "overgrow" decision trees and then prune back the
tree to the optimal size. This approach minimizes the probability that important
structure in the data set will be overlooked by stopping too soon.

CART incorporates both testing with a test data set and cross-validation to assess
the goodness of fit more accurately.

CART can use the same variables more than once in different parts of the tree. This
capability can uncover complex interdependencies between sets of variables.

CART can be used in conjunction with other prediction methods to select the input
set of variables.
http://tinyurl.com/b73kyoh


BIG DATA storage and old DBMS techniques: a comparison

Posted by Andrei Macsin on January 20, 2016 at 2:00am

Guest blog post by Vishal Sharma


Why is big data considered such a big thing? It was always there; technocrats and technology
people overlooked it, or dare I say failed to see it. In the last few years we have evolved our
understanding of how data can be used by different analytical techniques, like predictive analysis, to
leverage it in different business fields like marketing, sales, product design, etc. However,
that is not what is to be discussed in this blog; what is intended is to discuss big data storage
techniques.

Let's start by looking at some basic data models from the history of DBMS.

1)
There was a flat file system in which all data was stored in a single file, either plain text or a
comma-separated file, with a character set defining each and every piece of data. That was a period
with no data structures, data types, or storage optimization techniques.
After some time, however, DBMS techniques started developing to store data, each with its own
storage approach (you can read more about these on Wikipedia).
2)
In a hierarchical model, data is organized into a tree-like structure, implying a single parent for each
record. A sort field keeps sibling records in a particular order. Hierarchical structures were widely used
in the early mainframe database management systems, such as the Information Management System
(IMS) by IBM, and now describe the structure of XML documents. This structure allows a one-to-many
relationship between two types of data. It is very efficient at describing many relationships in the
real world: recipes, tables of contents, ordering of paragraphs/verses, and any nested
and sorted information.

However, it has its limitations, since the hierarchy is used as the physical order of records in storage. Record
access is done by navigating through the data structure using pointers combined with sequential

accessing. Because of this, the hierarchical structure is inefficient for certain database operations
when a full path (as opposed to upward link and sort field) is not also included for each record.
3)
The network model expands upon the hierarchical structure, allowing many-to-many relationships
in a tree-like structure that allows multiple parents. The network model organizes data using two
fundamental concepts, called records and sets. Records contain fields (which may be organized
hierarchically, as in the programming language COBOL). Sets define one-to-many relationships
between records: one owner, many members. A record may be an owner in any number of sets, and a
member in any number of sets.
A set consists of circular linked lists where one record type, the set owner or parent, appears once in
each circle, and a second record type, the subordinate or child, may appear multiple times in each
circle. In this way a hierarchy may be established between any two record types, e.g., type A is the
owner of B. At the same time another set may be defined where B is the owner of A. Thus all the sets
comprise a general directed graph (ownership defines a direction), or network construct. Access to
records is either sequential (usually in each record type) or by navigation in the circular linked lists.
The network model is able to represent redundancy in data more efficiently than in the hierarchical
model, and there can be more than one path from an ancestor node to a descendant. The operations
of the network model are navigational in style: a program maintains a current position, and navigates
from one record to another by following the relationships in which the record participates. Records can
also be located by supplying key values.
Although it is not an essential feature of the model, network databases generally implement the set
relationships by means of pointers that directly address the location of a record on disk. This gives
excellent retrieval performance, at the expense of operations such as database loading and
reorganization.
4)
The relational model was proposed as a way to make database management systems more independent of any
particular application. It is a mathematical model defined in terms of predicate logic and set theory,
and systems implementing it have been used by mainframe, midrange and microcomputer systems.
Three key terms are used extensively in relational database models: relations, attributes, and domains.
A relation is a table with columns and rows. The named columns of the relation are called attributes,
and the domain is the set of values the attributes are allowed to take. The basic data structure of the
relational model is the table, where information about a particular entity (say, an employee) is
represented in rows (also called tuples) and columns. Thus, the "relation" in "relational database"
refers to the various tables in the database; a relation is a set of tuples. The columns enumerate the
various attributes of the entity (the employee's name, address or phone number, for example), and a
row is an actual instance of the entity (a specific employee) that is represented by the relation. As a
result, each tuple of the employee table represents various attributes of a single employee. All relations
(and, thus, tables) in a relational database have to adhere to some basic rules to qualify as relations.
First, the ordering of columns is immaterial in a table. Second, there can't be identical tuples or rows in
a table. And third, each tuple will contain a single value for each of its attributes.
A relational database contains multiple tables, each similar to the one in the "flat" database model.
One of the strengths of the relational model is that, in principle, any value occurring in two different
records (belonging to the same table or to different tables), implies a relationship among those two

records. Yet, in order to enforce explicit integrity constraints, relationships between records in tables
can also be defined explicitly, by identifying or non-identifying parent-child relationships characterized
by assigning cardinality (1:1, (0)1:M, M:M). Tables can also have a designated single attribute or a set
of attributes that can act as a "key", which can be used to uniquely identify each tuple in the table.
A key that can be used to uniquely identify a row in a table is called a primary key. Keys are commonly
used to join or combine data from two or more tables. For example, an Employee table may contain a
column named Location which contains a value that matches the key of a Location table. Keys are also
critical in the creation of indexes, which facilitate fast retrieval of data from large tables. Any column
can be a key, or multiple columns can be grouped together into a compound key. It is not necessary to
define all the keys in advance; a column can be used as a key even if it was not originally intended to
be one.
A key that has an external, real-world meaning (such as a person's name, a book's ISBN, or a car's
serial number) is sometimes called a "natural" key. If no natural key is suitable (think of the many
people named Brown), an arbitrary or surrogate key can be assigned (such as by giving employees ID
numbers). In practice, most databases have both generated and natural keys, because generated keys
can be used internally to create links between rows that cannot break, while natural keys can be used,
less reliably, for searches and for integration with other databases. (For example, records in two
independently developed databases could be matched up by social security number, except when the
social security numbers are incorrect, missing, or have changed.)
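To make keys and joins concrete, here is a minimal sketch using Python's built-in sqlite3 module; the Employee and Location tables, column names, and rows are invented for illustration.

    # Minimal sketch of primary keys and a join, using Python's built-in
    # sqlite3 module; the tables and rows are made up for illustration.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Location (LocationId INTEGER PRIMARY KEY, City TEXT)")
    con.execute("CREATE TABLE Employee (EmployeeId INTEGER PRIMARY KEY, "
                "Name TEXT, LocationId INTEGER REFERENCES Location(LocationId))")
    con.execute("INSERT INTO Location VALUES (1, 'Pune'), (2, 'London')")
    con.execute("INSERT INTO Employee VALUES (10, 'Asha', 1), (11, 'Brown', 2)")

    # The shared LocationId key lets us combine data from the two tables.
    for row in con.execute(
            "SELECT e.Name, l.City FROM Employee e "
            "JOIN Location l ON e.LocationId = l.LocationId"):
        print(row)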
Then there were many changes to the relational database, like:
1) Object-oriented database model
2) Dimensional model
3) Multivalue model

This was all well and good until the era in which technologists rediscovered data and named it big
data, which the relational database was simply not sufficient to handle. So clustered systems like
Hadoop and tools like Pig, Hive, and MapReduce came into the picture, which make managing
big data somewhat easier. But if you study these systems you will clearly see that they leverage the
properties of the earlier data models. It makes me wonder whether we are going back to basic
database modeling techniques to work around limits of speed and storage capacity.
In any case, big data is more about analytics than about its storage or retrieval, and that is what will
have an impact on our future in terms of data and its utilization.

Big Data, Fast Data, Smart Data

Posted by Alissa Lorentz on April 12, 2013 at 5:18am



Big data needs to be fast and smart. Here's why.

DAUNTING DATA
Every minute, 48 hours of video are uploaded onto YouTube. 204 million e-mail messages
are sent and 600 new websites generated. 600,000 pieces of content are shared on
Facebook, and more than 100,000 tweets are sent. And that does not even begin to scratch
the surface of data generation, which spans to sensors, medical records, corporate
databases, and more.
As we record and generate a growing amount of data every millisecond, we also need to be
able to understand this data just as quickly. From monitoring traffic to tracking epidemic
spreads to trading stocks, time is of the essence. A few seconds delay in understanding
information could cost not only funds, but also lives.
BIG DATA'S NOT A BUBBLE WAITING TO BURST
Though Big Data has recently been deemed an overhyped buzzword, it's not going to go
away any time soon. Information overload is a phenomenon and challenge we face now,
and will inevitably continue to face, perhaps with increased severity, over the next decades.
In fact, large-scale data analytics, predictive modeling, and visualization are increasingly
crucial in order for companies in both high-tech and mainstream fields to survive. Big data
capabilities are a need, not a want today.
Big Data is a broad term that encompasses a variety of angles. There are complex
challenges within Big Data that must be prioritized and addressed such as Fast Data
and Smart Data.
SMART DATA
Smart Data means information that actually makes sense. It is the difference between
seeing a long list of numbers referring to weekly sales vs. identifying the peaks and troughs
in sales volume over time. Algorithms turn meaningless numbers into actionable insights.
Smart data is data from which signals and patterns have been extracted by intelligent
algorithms. Collecting large amounts of statistics and numbers brings little benefit if there is
no layer of added intelligence.
IN-THE-MOMENT DECISIONS
By Fast Data we're talking about as-it-happens information enabling real-time decision-making. A PR firm needs to know how people are talking about their clients' brands in real time in order to mitigate bad messages by nipping them in the bud. A few minutes too late
and viral messages might be uncontainable. A retail company needs to know how their
latest collection is selling as soon as it is released. Public health workers need to
understand disease outbreaks in the moment so they can take action to curb the spread. A
bank needs to stay abreast of geo-political and socio-economic situations to make the best
investment decisions with a global-macro strategy. A logistics company needs to know how
a public disaster or road diversion is affecting transport infrastructure so that they can react
accordingly. The list goes on, but one thing is clear: Fast Data is crucial for modern
enterprises, and businesses are now catching onto the real need for such data capabilities.
GO REAL-TIME OR GO OBSOLETE
Fast data means real-time information, or the ability to gain insights from data as it is
generated. It's literally as things happen. Why is streaming data so hot at the moment?
Because time-to-insight is increasingly critical and often plays a large role in smart,
informed decision making.

In addition to the obvious business edge that a company gains from having exclusive
knowledge to information about the present or even future, streaming data also comes with
an infrastructure advantage.
With big data comes technical aspects to address, one of which is the costly and complex
issue of data storage. But data storage is only required in cases where the data must be
archived historically. More recently, as more and more real-time data is recorded with the
onset of sensors, mobile phones, and social media platforms, on-the-fly streaming analysis
is sufficient, and storing all of that data is unnecessary.
STREAMING VS. STORING & DATA'S EXPIRATION DATE
Historical data is useful for retroactive pattern detection; however, there are many cases in
which in-the-moment data analyses are more useful. Examples include quality control
detection in manufacturing plants, weather monitoring, the spread of epidemics, traffic
control, and more. You need to act based on information coming in by the second. Redirecting
traffic around a new construction project or a large storm requires that you know the current
traffic and weather situation, for example, rendering last week's information useless.
When the kind of data you are interested in does not require archiving, or only selective
archiving, then it does not make sense to accommodate for data storage infrastructure that
would store all the data historically.
Imagine that you wanted to listen for negative tweets about Justin Bieber. You would either
store historical tweets about the pop star, or analyze streaming tweets about him. Recording
the entire history of Twitter just for this purpose would cost tens of thousands of dollars in
server cost, not to mention physical RAM requirements to process the algorithms through
this massive store of information.
It is crucial to know what kind of data you have and what you want to analyze from it in
order to pick a flexible data analytics solution to suit your needs. Sometimes data needs to
be analyzed from the stream, not stored. Do we need such massive cloud infrastructure
when we do not need persistent data? Perhaps we need more non-persistent data
infrastructures that allow for data that does not need to be stored eternally.
Data's Time-To-Live (TTL) can be set so that it expires after a specific length of time, taking
the burden off your data storage capabilities. For example, sales data on your company
from two years ago might be irrelevant to predicting sales for your company today. And that
irrelevant, outdated data should be laid to rest in a timely manner. Just as compulsive hoarding
is unnecessary and often a hindrance to people's lifestyles, so is mindless data storage.
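As a concrete illustration of expiring data, many key-value stores support a TTL directly. The sketch below assumes the redis-py client and a local Redis server; the key name and the 30-day lifetime are arbitrary choices, not a recommendation.

    # Illustrative sketch only: setting a time-to-live on a cached metric so
    # stale data expires on its own. Assumes the redis-py package and a local
    # Redis server; the key name and 30-day TTL are arbitrary example values.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.set("sales:last_week_total", 125000, ex=30 * 24 * 3600)  # expires in 30 days
    print(r.ttl("sales:last_week_total"))  # seconds remaining before expiry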
BEYOND BATCH PROCESSING
Aside from determining data life cycles, it is also important to think about how the data
should be processed. Let's look at the options for data processing, and the type of data
appropriate for each.
Batch processing: Batch processing means that a series of non-interactive jobs are
executed by the computer all at once. When referring to batch processing for data analysis,
this means that you have to manually feed the data to the computer and then issue a series
of commands that the computer then executes all at once. There is no interaction with the
computer while the tasks are being performed. If you have a large amount of data to
analyze, for instance, you can order the tasks in the evening and the computer will analyze
the data overnight, delivering the results to you the following morning. The results of the

data analysis are static and will not change if the original data sets change; that is, unless a
whole new series of commands for analysis is issued to the computer. An example is the
way all credit card bills are processed by the credit card company at the end of each month.
Real-time data analytics: With real-time data analysis, you get updated results every time
you query something. You get answers in near real-time with the most updated data up to
the moment the query was sent out. Similar to batch processing, real-time analytics require
that you send a query command to the computer, but the task is executed much more
quickly, and the data store is automatically updated as new data comes in.
Streaming analytics: Unlike batch and real-time analyses, stream analytics means the
computer automatically updates results about the data analysis as new pieces of data flow
into the system. Every time a new piece of information is added, the signals are updated to
account for this new data. Streaming analytics automatically provides as-it-occurs signals
from incoming data without the need to manually query for anything.
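To make the contrast tangible, here is a toy sketch (entirely illustrative, with made-up numbers) of the same calculation, an average of sales events, computed once in batch over the full data set and once incrementally as each event arrives, streaming-style.

    # Toy illustration of batch vs. streaming computation of an average.
    # The event values are invented; a real system would read from a queue or socket.
    events = [12.0, 7.5, 9.0, 11.2, 8.8]

    # Batch: all data is collected first, then processed in one pass.
    batch_average = sum(events) / len(events)
    print("batch average:", batch_average)

    # Streaming: the result is updated each time a new event flows in.
    count, running_average = 0, 0.0
    for value in events:
        count += 1
        running_average += (value - running_average) / count  # incremental mean update
        print(f"after event {count}: streaming average = {running_average:.2f}")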
REAL-TIME, DISTRIBUTED, FAULT-TOLERANT COMPUTATION
How can we process large amounts of real-time data in a seamless, secure, and reliable
way?
One way to ensure reliability and reduce cost is with distributed computing. Instead of
running algorithms on one machine, we run an algorithm across 30 to 50 machines. This
distributes the processing power required and reduces the stress on each.
Fault-tolerant computing ensures that in a distributed network, should any of the computers
fail, another computer will take over the failed computer's job seamlessly and
automatically. This guarantees that every piece of data is processed and analyzed, and that
no information gets lost even in the case of a network or hardware break down.
IN SHORT
In an age when time to insight is critical across diverse industries, we need to cut time to
insight down from weeks to seconds.
Traditional, analog data-gathering took months. Traffic police or doctors would jot down
information about patients' infections or drunk-driving accidents, and these forms would then
be mailed to a hub that would aggregate all this data. By the time all these details were put
into one document, a month had passed since an outbreak of a new disease or a problem
in driving behavior. Now that digital data is being rapidly aggregated, however, we are given
the opportunity to make sense of this information just as quickly.
This requires analyzing millions of events per second against trained, learning algorithms
that detect signals from large amounts of real, live data, much like rapidly fishing for
needles in a haystack. In fact, it is like finding the needles the moment they are dropped into
the haystack.
How is real-time data analysis useful? Applications range from detecting faulty products in a
manufacturing line to sales forecasting to traffic monitoring, among many others. These next
years will hail a golden age not for any old data, but for fast, smart data. A golden age for
as-it-happens actionable insights.
Original post: http://www.augify.com/big-data-fast-data-smart-data/

Stream Processing: What Is It and Who Needs It?

Posted by William Vorhies on October 21, 2015 at 9:42am


Summary: Stream Processing and In-Stream Analytics are two rapidly
emerging and widely misunderstood data science technologies. In this article
we'll focus on their basic characteristics and some business cases where they
are useful.
There are five relatively new technologies in data science that are getting a lot
of hype and generating a lot of confusion in the process. They are:
1. Stream Processing
2. In-Stream Analytics
3. Real-Time Analytics
4. In-Database Analytics, and
5. In-Memory Analytics

Gartner displays these as only three fast-rising trends, but in the literature today
you will see all five. These are not simple to sort out, but over this article and
probably the next several we'll try to help you understand what they're good for,
how they work, and, importantly, what they won't do.

Let's start with Stream Processing and In-Stream Analytics. The full formal name for this
technology is Event Stream Processing (ESP), so we'll use that shorthand here.
As you can tell from the name, the first requirement is that there is a stream of
data. Almost always this means time series data: events that happen in sequence,
denoted by a specific time, such as a string of sensor readings in IoT
applications, or trigger events (also denoted by time), such as when your
customer's mobile device is detected by your Wi-Fi system, indicating that he's
close by.
ESP is a real-time processing technique. So two things should be immediately
evident: 1) events you want to track should happen frequently and probably
close together in time, and 2) there must be an important business reason for
detecting and responding to the event quickly.
Real Time:
While real time can mean many things in different environments and can be
microseconds to hours or even days in duration, if your time horizon is overnight
or every few days or longer, then you can do just as well with batch processing
and you don't need ESP. For example, suppose you are monitoring the flow of social
media comments about your business but the rate at which they are coming in
is relatively slow, say a few per hour. You may elect to store them and have
your marketing team analyze and respond the next day in batch mode. The
aggregate trending of social media on a daily basis is actually pretty fast, so
general trends in batch mode should be plenty, especially if there are not that
many comments to evaluate. However, if you're an ecommerce giant getting a
fast stream of comments and are concerned that you respond to or address
every negative comment within, say, minutes or hours, then you probably need
ESP.

Trigger Events:
The need to respond quickly to a trigger event may trump frequency, especially
if it falls in the category of very rare events. These might be systems
monitoring patients' vital signs in a hospital, sudden changes in machine
operating characteristics that you previously determined mean that the
equipment may fail soon, or the detection of a fraudulent transaction.
This also highlights that a single piece of data, on its own, is often not
particularly informative. It is often by comparing that data to other data in the
stream, or to mathematical norms like averages or standard deviations, that
signals are detected. So you may also want to define a time window of data
that is held in memory for comparison. That window may be only a few seconds,
but it may also be much longer.
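As an illustration of the time-window idea, the following sketch keeps a short in-memory window of recent sensor readings and flags any new reading that sits more than three standard deviations from the window mean; the window size, threshold, and readings are invented example values, not a production ESP design.

    # Illustrative only: compare each new reading to a rolling in-memory window.
    # Window size, threshold, and readings are arbitrary example values.
    from collections import deque
    from statistics import mean, pstdev

    window = deque(maxlen=60)  # e.g. the last 60 readings held in memory

    def check(reading):
        if len(window) >= 10:  # wait for enough history before judging
            mu, sigma = mean(window), pstdev(window)
            if sigma > 0 and abs(reading - mu) > 3 * sigma:
                print(f"ALERT: {reading} deviates sharply from recent readings")
        window.append(reading)

    for r in [20.1, 20.3, 19.9, 20.0, 20.2, 20.1, 19.8, 20.0, 20.2, 20.1, 35.7]:
        check(r)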
ESP is always in memory.
The single or multiple streams of in-bound data are said to be processed at the
edge of the system and in memory before being persisted in storage. In the
next article we'll talk more about the technology. For now it is sufficient to know
that very dense streams of data numbering millions of events per second can be
processed with latencies of only milliseconds by well-designed ESP
systems. The processing steps within ESP are relatively simple and can be
handled in-memory as they arrive, including distributing them among multiple
processors in shared-nothing MPP systems.
There can actually be a number of steps in ESP processing such as filtering,
splitting into multiple streams, creating notifications, joins with existing data,
and the application of business rules or scoring algorithms, all of which happen
in memory at the edge of the system before the data is passed into storage.
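A stripped-down sketch of those in-memory steps might look like the following; the event fields, the reference table used for the join, and the alerting rule are all invented stand-ins for what a real ESP engine would configure.

    # Invented illustration of typical in-memory ESP steps: filter, join with
    # existing reference data, apply a simple scoring rule, then route.
    reference = {"sensor-7": {"site": "Plant A", "limit": 80.0}}  # lookup table

    def process(event):
        # 1. Filter: drop malformed or irrelevant events early.
        if "sensor_id" not in event or event["sensor_id"] not in reference:
            return None
        # 2. Join/enrich: attach existing data about the sensor.
        enriched = {**event, **reference[event["sensor_id"]]}
        # 3. Business rule / score: flag readings above the configured limit.
        enriched["alert"] = enriched["reading"] > enriched["limit"]
        # 4. Route: notifications go one way, everything else to storage.
        destination = "notification_queue" if enriched["alert"] else "storage"
        return destination, enriched

    stream = [{"sensor_id": "sensor-7", "reading": 85.2},
              {"sensor_id": "sensor-7", "reading": 42.0},
              {"sensor_id": "unknown", "reading": 10.0}]
    for ev in stream:
        print(process(ev))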
Technologies and Platforms:
You can use Apache Storm or Apache Spark as the basis of your system, using
custom code to design the processing steps. You can also use proprietary
systems such as SAS Event Stream Processing that have much easier-to-use
drag-and-drop interfaces and don't require coding. Gartner reviews
ESP vendors, and that's a good place to start.
In-Stream Analytics:
Here's one area where ESP can mislead new users. In-Stream Analytics is a
feature of ESP and cannot exist separately from ESP. ESP can apply business
rules or even sophisticated predictive analytic models, like scoring models, to the
data stream and take action on the data based on those scores or
rules. However, brand new insights derived from analytics do not occur
here. They occur, as always, in separate analytic data stores, some in-memory
but most simply in-database, where data scientists can examine them and run
analytic workloads against them, developing new models, new optimization
routines, and new insights.
If you have a unique business need that requires that you be able to create new
predictive models or refresh existing predictive models using the most current
in-bound streaming data, then providers like SAS have very high performance
in-memory analytic platforms that enable data scientists to make these new
discoveries and updates in minutes or hours even on massive quantities of data.
These can then be fed back into the ESP system in near real time. It's
important to understand, however, that the development of new analytic insights
occurs in analytic data stores and not directly in ESP.
Business Cases:
Let's talk about some specific business cases where ESP is proving useful.

Fraud Detection
Fraud detection is a good place to begin our discussion since it deals with rare
events that are difficult to detect and illustrates some of the limits of ESP.
Even the most sophisticated methods of fraud detection tend to create large
numbers of false positives. The haystack gets much smaller, but not small
enough to take automatic action, say by blocking your customer's credit card
transaction. Typically there is a team of humans evaluating flagged
fraud-likely events and making a judgement call in near real time. There may
also be an additional layer of investigative analysts, for example evaluating
redirect sites that may or may not be the source of watering-hole malware
attacks, which requires both significant time and labor.
However, some but not all rules of fraud detection do rise to the level of near
automated action. For example, an in-bound card-present credit card

transaction that takes place close in time to a second one for the same card but
physically far apart in geography has a high probability of one of the two being
fraudulent.
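That kind of rule is straightforward to express in code. The sketch below is an invented illustration (coordinates, timestamps, and the 900 km/h threshold are arbitrary), flagging two card-present transactions whose implied travel speed is physically implausible.

    # Illustration only: flag "impossible travel" between two card-present
    # transactions; locations, timestamps, and the speed threshold are invented.
    from math import radians, sin, cos, asin, sqrt

    def km_between(lat1, lon1, lat2, lon2):
        # Haversine great-circle distance in kilometres.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))

    def implausible(txn1, txn2, max_kmh=900):
        hours = abs(txn2["time"] - txn1["time"]) / 3600.0
        distance = km_between(txn1["lat"], txn1["lon"], txn2["lat"], txn2["lon"])
        return hours > 0 and distance / hours > max_kmh

    t1 = {"time": 0,    "lat": 40.71, "lon": -74.01}   # New York
    t2 = {"time": 1800, "lat": 51.51, "lon": -0.13}    # London, 30 minutes later
    print(implausible(t1, t2))  # True: no cardholder travels that fast legitimately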
ESP could enhance the analysis of the case by adding rules and scoring based
on historic customer transaction information, profiles and even technical
information coming from the Internet sessions of customers. This allows the
bank to set rules to automatically block the transaction or automatically send a
text message to the subscriber with a query. One bank reported increasing its
fraud intercept rate to 95% with accompanying improvements in revenue,
decreases in the cost of fraud detection, and improved customer trust and
satisfaction.
Another interesting example illustrating streaming analytics is in the potentially
fraudulent authorization of gift cards. One organization set rules comparing the
number of cards authorized to the number sold in that location during the
previous few days, and added a comparison of the volume to the standard
deviation of transactions authorized over a similar period. If these business
rules were detected in ESP to be violated, further authorizations at that site
could be shut down until a human could evaluate the situation.
Financial Markets Trading
There are two applications that gave ESP its earliest start. One is financial
markets trading and the other is the monitoring of capital intensive equipment
where sensor data has long been captured and analyzed.
Automated high-frequency trading systems now account for between 75% and
85% of the volume on all major exchanges. They compete on the accuracy of
their algorithms and also on the time needed to receive, analyze, and act on
new data. Trading advantages are often measured in milliseconds.
This illustrates two features of ESP. First, its ability to ingest multiple high-volume,
high-speed streams of input, such as the stream of transactions from
each major exchange. Second, automated high-frequency traders may have
hundreds of models that all need simultaneous access to the data (sequential
evaluation would be unacceptably slow). ESP has the ability to split the
incoming stream into multiple copy streams each of which can be run
simultaneously against its own scoring model.
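The splitting behaviour can be pictured with a few lines of Python; the two "models" here are trivial invented stand-ins for real scoring algorithms, and the sketch only illustrates the fan-out idea, not how any trading platform is actually built.

    # Toy illustration of splitting one incoming stream into copies that are
    # scored concurrently by independent models; the models are invented stand-ins.
    from concurrent.futures import ThreadPoolExecutor

    def momentum_model(tick):
        return {"model": "momentum", "signal": tick["price"] - tick["prev_price"]}

    def spread_model(tick):
        return {"model": "spread", "signal": tick["ask"] - tick["bid"]}

    models = [momentum_model, spread_model]

    def fan_out(tick):
        # Each model receives its own copy of the event and runs in parallel.
        with ThreadPoolExecutor(max_workers=len(models)) as pool:
            return list(pool.map(lambda m: m(dict(tick)), models))

    tick = {"price": 101.2, "prev_price": 100.9, "bid": 101.1, "ask": 101.3}
    for result in fan_out(tick):
        print(result)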
IoT and Capital Equipment Intensive Industries
Not all modern sensor applications occur in capital equipment intensive
industries. Some are simply paying attention to your thermostat settings,
particular operations of your car, or the number of steps you are taking. But
capital equipment intensive industries like power generation, mining and
extraction, transportation, and heavy manufacturing were among the first to use
networked sensors and collect that data for analysis. Over the last decade
sensors have become smaller, cheaper, more adaptable, and better at
communication. While that data was originally evaluated by operational
historians offline and in batch, ESP now allows real-time detection of fault
conditions. The most frequent examples are in predictive asset maintenance.
Using historical data in analytic data stores, data scientists and engineers
develop models that signal the onset of a condition that may shortly lead to an
unplanned failure or interruption. Applying that model through the ESP stream
allows earliest possible detection and failure prevention.

Separately developed models can also be used for optimization of complex


systems of networked devices or resources. ESP is extensively used in
networking optimization of power grids and even in traffic control systems to
speed your commute home.
Health and Life Sciences
Moving more toward direct benefit to humans, ESP is being rapidly adopted by
major hospitals. Those bedside devices that measure vital signs are now
networks of sensors feeding data via ESP into a central evaluation system. The
central system evaluates the stream of data based on business rules (well-established
medical guidelines, such as a specific blood pressure measure combined with pulse
and respiration rate that requires immediate attention). It also uses more
sophisticated predictive models looking separately at the data with a similar intent:
to send an alert to the right set of doctors and nurses to take action at exactly
the right time.
Marketing Effectiveness
Trigger events aren't just heart attacks and machine malfunctions. They can be
specific customer actions. We all know the role of predictive models in
improving cross sell, upsell, and churn prevention. In the past these models
were used to predict which customers would be most likely to respond based on
their historical behavior. The behavior that we analyzed could cover weeks,
months, or even longer. Then we implemented these models through
campaigns at times of our choosing, not necessarily the time when the customer
was most receptive. ESP changes that by offering to let us fine tune and
implement our models based on very specific customer behavior in near real
time.
In one example a major telecommunication company found that a specific upsell
model could be made much more accurate and effective by tying it to the time
their customers were recharging their prepaid accounts. Once the model was
developed it could be implemented through ESP so that when the prepaid
recharge transaction was detected an SMS promotion could be sent to the
customer while the transaction was still under way.
Separately they were able to greatly enhance the effectiveness of other scoring
models by incorporating cellphone usage patterns. When those patterns are
detected in the ESP stream those triggers are used to generate individualized
offers, with great success.
Retail Optimization
Here are just a few of the ideas for ESP that have been implemented in the retail
world:
Promoting In-Store Shopping Frequency and Cross Sell
To promote in-store shopping, the retailer sends customers personalized,
optimized email promotions with sales and offers based on each customer's
shopping history and local store quantities. ESP monitors in-store routers and
detects when customers (and their mobile devices) enter the store, and looks up
customer details and histories. Existing promotional models evaluate the
customer's history and determine an optimal set of offers to push to the in-store
customer via SMS or email.
In-Store Price Checking: Increasingly, customers are using their mobile
devices to compare competitive prices while in-store. One retailer monitors
in-store Wi-Fi clickstreams to detect when a customer accesses a price comparison
site; retrieves the IP address, device ID and phone number; uses this information
to look up existing customer profiles; and determines if the customer is a
candidate for a promotion. Existing models identify the best offer and send it to
the customer within seconds.
Creating New Sales from Product Returns: When customers come to
stores to return an item, details are instantly retrieved from scanned receipts by
ESP. Existing models analyze the customer's history, recommend a specific
sales staff interaction, and generate and send coupon codes to the customer's
mobile device for alternative replacements currently in stock. In addition, other
promotions that the customer may find interesting are sent.
So our takeaway is this: ESP is real time and also in-memory. In-Stream
Analytics is part of ESP and can score, select, and send preferred actions to a
customer, also in real time. However, the creation of new analytic insights does
not occur within ESP but rather in traditional analytic data stores. These are not
typically real time, but they can have cycles as short as a few hours, whose
results can then be reintroduced into ESP.
ESP is an extremely powerful new technology that lets us get closer to events
and to our customers. If your needs are truly not real time, then ESP may not be
for you. But perhaps some of these examples, particularly in retail and
marketing, may spur you to take your analytics to the next level.
October 20, 2015
Bill Vorhies, President & Chief Data Scientist, Data-Magnum - 2015, all rights
reserved.
About the author: Bill Vorhies is President & Chief Data Scientist at Data-Magnum
and has practiced as a data scientist and commercial predictive modeler since 2001.
Bill is also Editorial Director for Data Science Central. He can be reached at:
Bill@Data-Magnum.com or Bill@DataScienceCentral.com
The original blog can be seen here.
