
openSAP

Data Science in Action - Building a Predictive Churn Model
Week 1 Unit 1

00:00:08 Hello and welcome to the openSAP course "Data Science in Action – Building a Predictive
Churn Model". My name is Stuart Clarke and I'm a consultant with SAP's Analytics and Insight
team.
00:00:20 I specialize in data science and predictive analytics and provide services to SAP's customers
and partners. You might have met me before in the openSAP course "Getting Started with
Data Science" that was released earlier this year.
00:00:36 Let's take a look at what you should expect over the next four weeks. The goal of this course is
to provide you with an introduction to how to conduct a data science project.
00:00:47 During the next four weeks you will be building a predictive churn model for a
telecommunications business so you can predict which of their premium customers are
switching to competitor networks.
00:01:00 Then you will be enhancing the model, using a social network analysis, and you will develop a
segmentation model.
00:01:08 By the end of the course, you should have a clearer understanding of how data science
projects are planned and delivered, and the process to build classification and segmentation
models
00:01:20 using the SAP Predictive Analytics automated modeling tools. Let me share with you some
more details on the weekly course topics:
00:01:32 In Week 1, there is a case study introduction. We're going to look at the CRISP Data Mining
project methodology.
00:01:40 We'll have a look at the introduction to the Telco case study. You'll start to understand the
business requirements and understand the data.
00:01:49 In Week 2, you're going to prepare and encode the data. So you're going to start to prepare
the analytical data set,
00:01:57 and there'll be an introduction to the automated modeling techniques in SAP Predictive
Analytics. You'll also look at initial data analysis and automated data encoding.
00:02:11 In Week 3, you will develop, evaluate, and deploy the models. So you'll start with an initial
churn model, then you'll evaluate the performance of that model.
00:02:23 Once you've built the model, you will deploy the model. In Week 4, you will monitor models
and learn how to improve their performance.
00:02:33 You'll develop a social link analysis and we'll look at segmentation, and you'll build a
segmentation model.
00:02:44 After having successfully completed the first four weeks, you'll have one further week to
prepare for and participate in the final exam to earn a record of achievement.
00:02:56 Throughout the course, your feedback and your ideas are appreciated in the Discussion
Forum. So how do you get the points and successfully complete the course?
00:03:08 Well, there are four graded assignments throughout the first four weeks of the course. Each
assignment is worth 30 points, so you'll get a total of 120 points,
00:03:20 which is half of the total points available in the course. The other half of the available points
come from the final exam.
00:03:29 And just like every openSAP course, you need at least half of the maximum points available –
in this case 120 – to pass the course and receive your record of achievement.
00:03:44 A well-defined project methodology is important for a variety of reasons. For example, so that
there is a clear framework to record our experiences,
00:03:56 to help any project replication, for project planning and management, and to help with new
adopters and people working on the project.
00:04:08 The most popular data science project methodology is the CRoss Industry Standard Process
for Data Mining, known by its acronym CRISP-DM.
00:04:18 It was launched in 1996. The development of the methodology was led by five companies
including SPSS, Teradata,
Daimler AG, and NCR Corporation, with help from around 300 other organizations that
contributed to the process model.
00:04:40 The goal was to create a data-centric project methodology that is non-proprietary, application
and industry-neutral, and focused on business issues as well as technical analysis.
00:04:54 Polls conducted by KDnuggets show that CRISP-DM is still the leading methodology used by
industry data scientists. The CRISP methodology is a hierarchical process model.
00:05:08 At the top level, the process is divided into six different generic phases, ranging from business
understanding to the deployment of the project results.
00:05:19 The next level elaborates each of these different phases and comprises several generic tasks.
At this level, the description is generic enough to cover all data science scenarios.
00:05:33 The third level specializes these tasks for specific situations. For example, the generic task
might be cleaning data,
00:05:41 and the specialized task could be cleaning of numeric or categorical values. And then the
fourth level is the process –
00:05:51 the record of actions, decisions, and results of the actual execution of the data science project.
The six generic phases are represented in this diagram –
00:06:03 We have Business Understanding, then Data Understanding, Data Preparation, Modeling,
Evaluation, and Deployment.
00:06:11 The sequence of the phases is not strict and moving backwards and forwards between the
phases is always required.
00:06:21 The arrows in the process diagram indicate the most important and frequent dependencies
between the phases. And the outer circle in the diagram symbolizes the cyclic nature of any
data science project.
00:06:36 Of course, the process continues after a solution has been deployed. The lessons learned
during the process can trigger new, often more focused business questions
00:06:47 and subsequent data science processes will benefit from the experiences of previous ones.
The tasks are shown in the green boxes,
00:06:59 and the outputs from the tasks are shown in the blue boxes. Here in Phase 1, which is
Business Understanding,
00:07:07 you can see that the phase focuses on understanding the project objectives and requirements
from a business perspective,
00:07:15 then converting this knowledge into a data science problem definition and a preliminary plan
designed to achieve the objectives.
00:07:26 Phase 2 is Data Understanding. This phase starts with an initial data collection
00:07:33 and proceeds with activities in order to get familiar with the data, to identify data quality
problems,
00:07:41 to discover first insights into the data, or to detect interesting subsets to form hypotheses for
hidden information.
00:07:53 Phase 3 is the Data Preparation phase. This covers all the activities to construct the final data
set from the initial raw data.
00:08:03 Data preparation tasks are likely to be performed multiple times and not in any prescribed
order.

00:08:10 Tasks include table, record, and attribute selection, as well as transformation and the cleaning
of data for the chosen algorithms.
00:08:22 Phase 4 is Modeling. In this phase, various modeling techniques are selected and applied
00:08:29 and their parameters are calibrated to optimal values. Some techniques have specific
requirements for the form of data.
00:08:37 Therefore, stepping back to the data preparation phase is often necessary. Then Phase 5 is
Evaluation.
00:08:47 This phase thoroughly evaluates the model and reviews the model construction to be certain it
properly achieves the business objectives.
00:08:56 You must determine if there is some important business issue that has not been sufficiently
considered. At the end of this phase, a decision on the use of these data science results
should be reached.
00:09:11 Then Phase 6 is Deployment. The knowledge gained in the project will need to be organized
and presented in a way that the organization can use.
00:09:21 However, depending on the requirements, the deployment phase can be as simple as
generating a report
00:09:27 or as complex as implementing a repeatable data mining process across the enterprise. Most
models' predictive performance will degrade over time
00:09:41 because the data that you use to apply the model onto will change – the data distributions
might change as customers' characteristics change,
00:09:51 or competitors launch campaigns and the general business environment changes. The models
must be updated when this happens.
00:10:01 A monitoring phase can be added to the CRISP-DM methodology that specifically focuses on
this aspect.
00:10:12 In this unit, you have examined the six generic phases of the CRISP-DM project methodology.
These six phases are business understanding, data understanding, data preparation,

00:10:24 then modeling, evaluation, and deployment. You have also looked briefly at the different tasks
that are required in each phase.
00:10:35 Sometimes data scientists add in an extra phase to monitor the models so they are aware
when a model's performance degrades and needs updating.
00:10:46 You will follow this methodology through the next four weeks of this course.

Week 1 Unit 2

00:00:08 Hello and welcome to the second unit of this course, which is an introduction to the Telco case
study.
00:00:15 This unit explains some of the details of the data science approaches you will take for this type
of modeling, and it introduces you to some terminology.
00:00:25 Please remember, some of this information was presented in the first openSAP data science
course, "Getting Started with Data Science".
00:00:33 However, I will remind you of the important aspects as you follow this course. The next unit will
give more details regarding this specific use case
00:00:43 so you can develop your business understanding. You have been asked by a
telecommunications organization to analyze their Premium Service Plan customers,
00:00:55 determine the characteristics of "churning" customers, and build a "churn prediction model".
00:01:02 The telco's marketing team want to be able to identify which customers have a high probability
to churn, and they want to understand more about the characteristics of these customers

00:01:14 so that they can develop a targeted communication to persuade these customers not to churn.
The customers, or subscribers of a telecommunications organization can "churn",
00:01:30 which means they terminate their contract or stop using the service. The objective of a churn
prediction model is to predict which customers are most likely to churn from the current list of
active customers.
00:01:44 Communication campaigns can then be created to encourage them not to churn. Churn
analysis is a major application area for predictive analysis.
00:01:57 The objective is to classify a group of customers into two groups – churners and non-churners,
by building a predictive model that describes the attributes of those customers who have
churned,
00:02:09 in contrast to those who have not churned. Then the model can be used to predict which
customers are most likely to churn in future.
00:02:20 This will support a customer retention strategy to maximize the retention of customers.
00:02:27 The type of model most often used in churn analysis is referred to as a "classification" model,
as we wish to classify observations into classes
00:02:38 – did the customer churn or not? There are a variety of classification techniques,
00:02:45 including decision trees, neural networks, and regression – for example, "logistic regression".
And an introduction to these different techniques was presented in the first openSAP course.

00:02:58 In this project, you will use a regression approach to create the model. In fact, you'll be using
the automated functionality in SAP Predictive Analytics.
00:03:11 This uses regression techniques to create predictive classification models that are represented
by a polynomial equation.
00:03:20 Every value of each input variable is assigned a regression coefficient – that's b in this
equation that's shown on the slide.
00:03:31 A polynomial equation is very simple. It's represented by a simple formula:
00:03:36 Y = a + b1 * x1 + b2 * x2, and so on. In the equation, Y is the target variable.
00:03:48 High values of the output of the model indicate that customers will churn and low values
indicate that the customers are non-churners.
00:03:57 a is a constant value that is calculated by the regression algorithm. And the b values, b1 up to
bn, are called "regression coefficients".
00:04:09 Again, these are calculated by the regression algorithm. x1 up to xn are the categories of each
of the explanatory variables.
00:04:20 These are sometimes called "input" or "independent" variables. As you can see, in the
equation every value of each input variable (x) is assigned a regression coefficient (b).

00:04:34 This equation is what we call the "model". We can use this equation to help differentiate
between churners and non-churners.
00:04:43 The output from the model, Y, is a probability or a score. It's higher for churners than non-
churners.
00:04:52 When we build or "train" a model, we use historical data, where we know the values of the
explanatory variables and also, importantly, the value of the target variable.
00:05:05 So we know if the customer churned or not. The regression process estimates the values of a
and b for each explanatory variable
00:05:15 so that the model estimates the target values as accurately as possible, by comparing the
predicted output to the known actual value.
00:05:25 When we apply the model, summing a (the constant value) and each b times x, it gives us an
estimated value of the target variable Y, which we call the "score".
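To make the arithmetic concrete, here is a minimal sketch of how such a polynomial equation produces a score for one customer. The constant a, the coefficients b, and the input values below are purely hypothetical; in the course they are estimated for you by the automated regression algorithm in SAP Predictive Analytics.

```python
# Hypothetical coefficients for illustration only; in practice they are
# estimated by the regression algorithm from historical data.
a = -0.42                                  # the constant
b = {"avg_calls_per_month": 0.015,         # regression coefficients b1..bn
     "data_usage_gb": -0.180,
     "tenure_months": -0.022}

# Encoded explanatory values x1..xn for a single customer (also hypothetical).
x = {"avg_calls_per_month": 35,
     "data_usage_gb": 1.2,
     "tenure_months": 14}

# Y = a + b1*x1 + b2*x2 + ... : the model output, or "score".
score = a + sum(b[name] * x[name] for name in b)
print(f"score = {score:.3f}")              # a higher score suggests a likely churner
```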
00:05:39 Again, just to remind you, a high score indicates that the customer has a high chance to churn.
And a low score indicates that the customer has a low chance to churn.
00:05:52 The "explanatory" variables are usually numeric or categorical, and they describe the attributes
of each customer.
00:06:00 In a telco churn model, the explanatory variables represent information about the customer,
such as their demographic information (their age and gender),
00:06:09 usage data, for example for voice calls, SMS, and data, revenue data or recharge – top-up –
data,
handset information, and value-added services (VAS) that the customer has subscribed to.
00:06:26 Also, a data scientist will create a range of "derived" variables, such as the number of calls per
month,
00:06:33 or the total duration of outgoing calls per month, and for incoming calls, international calls, and
for data uploads.
00:06:44 And we'd look at this over the last 3, 6, or 12 months. Then averages will be created.
00:06:51 Some examples are shown on this slide. When you train a classification model,
00:06:59 the "target" variable in the model build data set is usually coded as a binary variable, so it's
either coded as Yes/No or 1/0.
00:07:09 Specifically in a churn model, the target variable is often coded by the data scientist as 1 if the
customer churned, or 0 if they didn't churn.
00:07:23 There are two phases to a predictive modeling process: The first phase is the Model Build – or
Learning – Phase.
00:07:32 Predictive models are built or "trained" on historic data with a known outcome. The input
variables are called "explanatory" or "independent" variables.
00:07:43 For model building, the "target" or "dependent" variable is known. It can be coded, as a 1 or a
0.
00:07:51 And then the model is trained to differentiate between the characteristics of the customers who
are 1s and 0s. The second phase is the Model Apply – or Applying – Phase.
00:08:06 Once the model has been built, it is applied onto new, more recent data which has an
unknown outcome – it's unknown simply because the outcome is happening in the future.
00:08:19 The model calculates the score or probability of the target category occurring – for example,
the probability of a customer churning.
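The two phases can be pictured with a short sketch. This uses scikit-learn's logistic regression purely as a stand-in for the automated algorithm in SAP Predictive Analytics, and the file and column names are assumptions made for the illustration.

```python
# Illustrative sketch of the two modeling phases. LogisticRegression stands in
# for the automated algorithm in SAP Predictive Analytics; file and column
# names (ads_build.csv, ads_apply.csv, A_NUMBER, CHURN) are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Phase 1 - Model Build: historical data where the churn outcome (1/0) is known.
train = pd.read_csv("ads_build.csv")
explanatory = [c for c in train.columns if c not in ("A_NUMBER", "CHURN")]
model = LogisticRegression(max_iter=1000)
model.fit(train[explanatory], train["CHURN"])

# Phase 2 - Model Apply: newer data with the same explanatory variables but an
# unknown outcome; the model returns a churn probability for each line number.
apply_ds = pd.read_csv("ads_apply.csv")
apply_ds["churn_probability"] = model.predict_proba(apply_ds[explanatory])[:, 1]
print(apply_ds[["A_NUMBER", "churn_probability"]].head())
```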
00:08:31 The Analytical Data Set, abbreviated to ADS, is created by merging the required tables and
aggregating usage data into specific time periods for each individual ID.
00:08:46 In this telco example, the unique ID is the line number. However, an important point is that
these are unique and there are no duplicates.
00:08:57 In fact, they represent the "granularity" of the predictive model you will be building. So, for
example, in other scenarios this could be for unique customers, subscribers, accounts,
00:09:10 machines, or even stores. There are often many thousands of unique IDs in an analytical data
set.

00:09:18 In this example, the data scientist defines a reference date, so all of the dynamic variables are
calculated relative to this reference date.
00:09:29 If the reference date is set to 2016-03-31, the customer's age in years, and tenure in months –
which is how long they have been a customer –
are calculated relative to the reference date. Also, a range of usage attributes are derived for the
three months prior to the reference date.
00:09:53 These are shown as the number of voice calls in the three months, from M0 for January 2016,
M1 for February, and M2 for March,
00:10:05 and the total duration of calls in seconds, again for M0, M1, and M2. Similarly, we could create
aggregates for data usage in the three months.
00:10:19 Many more attributes can be derived in this way. As you can imagine, this scenario is a very
simple illustration.
00:10:26 In reality, we will often have hundreds, if not thousands, of these attributes created for each ID.
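Here is a minimal sketch of how such reference-date-relative aggregates could be derived with pandas. It assumes a call-record extract with columns A_NUMBER, CALL_DATE, and DURATION; these names are hypothetical, and the real tables are introduced in Unit 4.

```python
import pandas as pd

cdr = pd.read_csv("cdr.csv", parse_dates=["CALL_DATE"])   # hypothetical extract
reference_date = pd.Timestamp("2016-03-31")
ref_month = reference_date.to_period("M")
start_month = ref_month - 2                  # M0 = oldest month in the 3-month window

# Keep the three calendar months up to and including the reference month.
call_month = cdr["CALL_DATE"].dt.to_period("M")
window = cdr[(call_month >= start_month) & (call_month <= ref_month)].copy()

# 0, 1, 2 -> M0, M1, M2, counted from the start of the history window.
window["month_offset"] = (
    window["CALL_DATE"].dt.year * 12 + window["CALL_DATE"].dt.month
    - (start_month.year * 12 + start_month.month)
)

# Number of calls and total duration per A_NUMBER and month.
agg = (window.groupby(["A_NUMBER", "month_offset"])["DURATION"]
             .agg(n_calls="count", total_duration="sum"))
ads = agg.unstack("month_offset", fill_value=0)
ads.columns = [f"{stat}_M{offset}" for stat, offset in ads.columns]
print(ads.head())
```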

00:10:34 The target is a binary flag that indicates if the line number holder churned. A 1 indicates a
churner, and a 0 indicates a non-churner.
00:10:46 Remember that the target period occurs after the reference date, maybe the following month
or a later month.
00:10:55 In this scenario, the target period is in May, which gives a one-month latency period in April.
00:11:02 So the target period starts one month after the reference date and ends two months after the
reference date.
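As a small illustration of this date arithmetic, assuming a month-end reference date and the period lengths described above:

```python
import pandas as pd

reference_date = pd.Timestamp("2016-03-31")
ref_month = reference_date.to_period("M")

history_months = [ref_month - i for i in range(2, -1, -1)]   # Jan, Feb, Mar 2016
latency_month  = ref_month + 1                               # April 2016 (no data collected)
target_month   = ref_month + 2                               # May 2016 (churn observed)

print("History:", [str(m) for m in history_months])
print("Latency:", latency_month)
print("Target: ", target_month)
```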
00:11:10 We will look at this in more detail later in the course. The usage aggregates are known for the
months before the reference date.
00:11:20 So by changing the reference date, you can recalculate the explanatory variables relative to
the reference date in different timeframes.
00:11:29 Therefore, the data set itself is actually "dynamic", and you can use it to build a model on
historical data,
00:11:36 and then apply the model each month on updated data in a different timeframe. You simply
increase the reference date by +1 month.
00:11:49 To define the analytical data set to build the model, you use a reference date so that the
analytical data set has three months of history with a known target.
00:12:00 Here the reference date is set at the end of March 2016. The explanatory variables are known
for January, February, and March.
00:12:09 And the target is known – we actually know whether the customer churned or not in May 2016.
The automated capability of SAP's Predictive Analytics
00:12:20 uses a regression algorithm to build a classification model. The values in the polynomial
equation are calculated –
00:12:28 there is a value calculated for the constant a and the b values for each of the explanatory
variables. When all of the values are summed,
00:12:37 the regression equation gives an overall score. A high score indicates a churner,
corresponding to IDs where churn = 1,
00:12:46 and a low score indicates a non-churner, corresponding to IDs where churn = 0. Remember
that the number of months in the history does not always have to be three months,
00:12:59 and often we would build models with six or 12 months of data. It all depends on the data
availability and the actual business requirements.
00:13:11 Once you have built the model, to define the analytical data set to apply the model onto, you
use a reference date so that the analytical data set has three months of known history,
00:13:22 but, of course, the target is unknown because it's in the future. If you set the reference date to
the end of June 2016,
00:13:31 the explanatory variables are known for April, May, and June. However, the target is unknown,

00:13:38 and the model will then predict whether the customer will churn or not in August 2016. Usually,
when you apply a model, the reference date is set to the end of the last month with known
data.
00:13:51 So, in this example, we are acting as though today's date will be July 1, or shortly after that.
This will mean that the data for all of the voice calls, data usage, and so on,
00:14:02 are available for April, May, and all of June. The model itself is an equation, and the output
from the model is the "score".
00:14:12 And a higher score indicates that the unique ID is a potential churner. And a low score
indicates that they're potentially not going to churn.
00:14:22 The "score" is simply the output from the model equation. It can have negative values, and it
can have values greater than 1.
00:14:31 So it's not a probability. Scores and probabilities are different.
00:14:36 One of the output options in SAP Predictive Analytics is to output "probabilities" as well as
"scores".
00:14:43 The model scores are mapped into a probability, which varies from 0 to 1.
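The exact calibration that SAP Predictive Analytics uses to map scores into probabilities is not described here, so the following is only an illustrative sketch using a logistic (sigmoid) transform, one common way of squashing an unbounded score into the 0-to-1 range.

```python
import math

def score_to_probability(score: float) -> float:
    """Map an unbounded score into (0, 1); higher scores give higher probabilities."""
    return 1.0 / (1.0 + math.exp(-score))

for s in (-2.0, -0.4, 0.0, 1.5):
    print(f"score {s:+.1f} -> probability {score_to_probability(s):.3f}")
```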
00:14:50 With probabilities, there are no negative values and the maximum value is 1. In your churn
model, a high probability indicates that the unique ID has a high probability to churn,
00:15:02 and, of course, a low probability indicates that the unique ID has a low probability to churn. For
predictive churn modeling, most data sets have the following structure:
00:15:15 There is historical data, in the past compared to a reference date, and this includes the
dynamic data that is computed in relation to the reference date.
00:15:25 There's a latency period that starts after the reference date, and this is a period where no data
is actually collected.
00:15:35 The period gives the business time to collect data, apply the model, create the campaign, and
deploy the users in a call center.
00:15:47 The target, which starts after the reference date plus the latency period, is the period where
the target behavior is observed.
00:15:56 So in this scenario, we are observing whether or not the subscriber churned. Durations for
each of these periods depend on the data availability
00:16:08 and the specific business requirements. Often the history period is 3, 6, or 12 months,
00:16:14 and this depends on data availability or data relevance. For example, data collected more than
six months ago may have become totally irrelevant.
00:16:25 It also depends on the specific business question that is being analyzed. Latency periods are
often one or two months,
00:16:34 depending on how quickly the business can deploy the model output and the specific business
question. Again, for example, if you build a model to predict if a customer is going to churn
next month,
00:16:47 then this wouldn't give the business time to actually create a campaign and convince the
customer not to churn.
00:16:54 The customer has already made their mind up that they are going to switch to another
supplier. However, if you predict that the customer is going to churn in two or three months'
time,
00:17:04 it actually gives the business time to convince the customer to stay. The target period is often
one month in duration,
00:17:13 although this is sometimes extended to two or three months depending on the number of
target events observed in the period.
00:17:23 For example, the number of churners in the target period may be low. And it also depends on
the specific business question.
00:17:31 If the churn rate is low, which means that there may only be a few hundred churners per
month, then the target period could be extended to two or three months

00:17:40 so that there are sufficient churners to make the model statistically robust. So, just to recap on
this unit...
00:17:50 you have been introduced to how you can use classification models to predict whether a
customer will churn or not. For predictive churn modeling, data sets can have a history period,

00:18:01 a latency period, and a target period. The start and end of these periods are defined by a
reference date.
00:18:09 The model is a simple polynomial equation, and the output from the model is a "score".
00:18:15 A high score indicates that the unique ID has a high potential to churn, and a low score
indicates that the unique ID has a low potential to churn.
00:18:24 Scores can have negative values, and can have values greater than 1. They are not
probabilities.
00:18:33 However, one of the outputs of the tool produces "probabilities" as well as "scores". Here, the
model scores are actually mapped into a probability,
00:18:45 which varies from 0 to 1. And with probabilities, there are no negative values and the
maximum value is 1.
00:18:54 Therefore, in a churn model, a high probability indicates that the unique ID has a high
probability to churn, and a low probability indicates that the unique ID has a low probability to
churn.

Week 1 Unit 3

00:00:09 Welcome back to the third unit of this first week. We are going to have a closer look at the
business requirements for the project you are about to start.
00:00:19 The business understanding phase focuses on understanding the project objectives and
requirements from a business perspective, then converting this knowledge into a data science
problem definition
00:00:32 and creating a preliminary plan designed to achieve the objectives. There are 4 tasks in this
phase.
00:00:39 You determine business objectives, assess the situation,
00:00:43 determine the data science goals, and produce the project plan.
00:00:49 The first task is to determine the business objectives. You need to thoroughly understand, from
a business perspective, what the client really wants to accomplish.
00:01:00 Often the organization has many competing objectives and constraints that must be properly
balanced. Your goal is to uncover important factors, at the beginning,
00:01:11 that can influence the outcome of the project. There are a number of important outputs:
00:01:18 The project background records the information that is known about the organization’s
business situation. The business objectives describe the customer’s primary objective, from a
business perspective.
00:01:32 The business success criteria describe the criteria for a successful outcome to the project from
the business point of view. Here's some key information for you about the project you are
going to be working on:
00:01:49 A telecommunications organization has a specific service designed for their Premium Service
Plan customers. They are worried because they are experiencing a high number of “churners”.

00:02:00 These are customers who are terminating the service and switching to a competitor. They
have asked you to analyze these churners, to identify their key characteristics,
00:02:12 and to build a “churn prediction model”. Also, the marketing team want to understand more
about the characteristics of customers
00:02:21 so that they can develop a targeted communication strategy based on each customer’s spend.

00:02:28 They've asked for a segmentation to support future communication strategies. This Premium
Service Plan is for “prepaid” customers only.
00:02:40 This terminology will be defined in more detail later in this phase. However, “prepaid” usually
refers to customers who purchase the service in advance of them using it,
00:02:51 and “top-up” when they need to continue with the service. This is opposed to “postpaid”
customers, who have a contract for 12 or 18 months with the telco.
00:03:02 There is a “bundle” of services provided in this plan: There are 4GB of 4G data, 500 local
minutes of voice calls, and unlimited local texts.
00:03:14 The service in the bundle lasts for 30 days, then it renews automatically. Payment is taken
directly from the customer’s bank account.
00:03:23 If there is insufficient credit available, and the payment is not made, then the customer has 7
more days to top-up.
00:03:30 During this time, the customer will be charged the standard local rates for minutes, texts, and
data until they actually do top-up. If, by the end of the 7th day, they still haven't got enough
credit, they are then classed as churned.
00:03:47 Also, any customer can opt out at any time, and then they are also classed as churned. The
main driver for customers to choose the Premium Service Plan is the 4 GB of data capacity
per month.
00:04:03 In fact, some customers don't use all of their 500 minutes of voice calls per month, although
some do and pay for the extra use.

00:04:12 Also, because there are unlimited texts, SMS and MMS are not important considerations for
this analysis. The telco confirms that as of the end of March 2016, there were 7,445 Premium
Service Plan customers.
00:04:32 There is a relatively high churn rate for the Premium Service Plan customers. Each month,
over 1,000 customers churn,
00:04:40 although the telco also has a very active recruitment plan. The company sets you the following goals:
00:04:48 To develop a predictive churn model that will help identify which customers have the highest
probability to churn. To identify the key contributing factors in the churn model.
00:05:00 To develop a social network using link analysis, and then determine if this information can
improve the model. To productionize the model, so that it can be applied to new data every
month.
00:05:13 And the performance of the model should be monitored and it should be updated automatically
when its accuracy starts to diminish.
00:05:24 You've been asked to develop a segmentation to help gain a deeper understanding of how the
customers are using the service, based on their spend.
00:05:34 One important consideration is that neither too few nor too many segments are created;
otherwise, the marketing team will not be able to create differentiated communication campaigns.
00:05:49 The company agreed the business success factors for the churn model: Model accuracy – the
accuracy of a predictive model is very dependent on the predictive power of the input
variables.
00:06:01 Some models can be very predictive, and that's because the input data discriminates between
the churners and non-churners very well. However, sometimes getting high discrimination is
difficult because the input data
00:06:14 does not differentiate very well between the churners and the non-churners. So in this case, it
was not possible to define a minimum value for the accuracy measure
00:06:26 because no predictive models using these data have ever been developed before.
00:06:32 So, there is no comparative measure available. Then there's model robustness.
00:06:39 The model must be robust when tested on a hold-out sample taken from the model build data
and on data taken from another time frame.
00:06:48 The model must work just as well on new data as it did when you built it, so that the modeling
is robust when you apply it onto new data.
00:06:57 The model will be applied on a monthly basis. The model “apply” process will need to be
scheduled and run automatically,
00:07:06 and when the model’s performance starts to degrade, it should be maintained automatically.
The next task in the CRISP methodology is to assess the situation.
00:07:19 This task involves more detailed fact-finding about all of the resources, constraints,
assumptions, and other factors that should be considered
00:07:28 in determining the data science goal and project plan. The outputs from this task will include
an inventory of resources,
00:07:37 a list of assumptions and constraints, the risks and any contingencies, and a cost benefit
analysis.
00:07:46 Here is the information you have been given so you can assess the situation for this project:
For the Inventory of Resources – You are the only analyst assigned to this project.
00:07:58 You will be given access to a business analyst and a data expert when you require their
support. You will be supplied with the relevant data and information about the data.
00:08:09 You will be using SAP Predictive Analytics automated modeling techniques, because of the
quick development time, high accuracy, and because the models can be operationalized
easily.
00:08:22 There are no known project assumptions or constraints. The data are available and permission
has been granted for you to access it.

00:08:31 There is a risk that data quality could be poor, however, this will be confirmed as one of the
initial steps in the project. The customer is reasonably sure that the data is of good quality.
00:08:45 The customer also confirms that there is no need for a data science glossary of terminology,
as they have a good understanding of data science and predictive modeling.
00:08:55 However, the telco provides their definition of their terminology. The telco has agreed that a
cost/benefit analysis is not required for this particular project.
00:09:09 Every industry has its own terminology. In telco, there are “prepaid” customers (where credit is
purchased in advance of service use)
00:09:18 and “postpaid” customers (where the customer enters into a contract, of usually 12 or 18
months' duration). Prepaid customers top-up when they make a payment so they can continue
to use their service.
00:09:32 Customers can buy bundled services. For example, some bundles include a mix of mobile,
fixed line, broadband services, and television services.
00:09:43 This Premium Service Plan is a “bundle” of services. It includes 4GB of 4G data, 500 local
minutes of voice calls, and unlimited texts.
00:09:57 The next task in the CRISP process is to determine the data science goals. A business goal
states the objectives in business terminology.
00:10:06 While a data science goal states project objectives in technical terms from a data science
perspective. There are 3 planned phases to this project:
00:10:20 Phase 1. This is where you will develop a churn model, using 3 months' history, 1 month's
latency period, and 1 month's target period.
00:10:30 Phase 2. You will build a social network analysis of the call patterns 1 month prior to the
reference date. And then Phase 3. You will develop a supervised cluster model
00:10:43 to understand how high to low spending customers are using the service, to improve the
relevance of marketing and retention campaigns.
00:10:55 The data science success criteria for each of the phases are as follows: For phase 1, the
predictive power of the model will be discussed with the telco once the initial model has been
developed.
00:11:07 The telco fully understands that predictive power can be very dependent on the predictiveness
of the model input variables, and you will be using all of the data that are currently available.

00:11:18 Also, because no predictive models have been developed previously on these data, there is no
comparison available.
00:11:25 Therefore, there is no minimum predictive power threshold that can be set. The model must
have a prediction confidence greater than or equal to 0.95 for the model to be robust.
00:11:39 And then for phase 2, the development of a social network model that could be used to
enhance future predictive models. The telco wants to understand how the social network, and
any analysis based on it,
00:11:57 can enhance the predictive models. Then in phase 3, the profile of the
segments will indicate how high to low spending customers are using the service.
00:12:10 The marketing team can then use the information to develop a communication campaign to
encourage those customers with low spend to increase their spend, and to maintain the high
spenders.
00:12:23 Therefore, the number of segments must be manageable for the marketing team, and the
profiles of the behavior of the customers in each segment must be easy to comprehend and to
act on.
00:12:36 So no more than 10 segments will be created. The penultimate task is to produce a project
plan that sets out all of the steps you will be undertaking to achieve these business goals.
00:12:53 The project plan could be developed based on the CRISP phases. All of the resources and
task durations can be easily listed and attributed.

00:13:03 When multiple models are developed within a single project, there may be dependencies
between the models.
00:13:10 So, for example, many telco churn models will include explanatory variables that are
calculated from other models. For example, the outputs from a range of demographic models
that predict a customer’s age, gender, and nationality,
00:13:26 as well as the output from social network models that represent the links between customers in
a community and each customer's role, can be included in a churn model.
00:13:37 This will be explained in more detail later in this course. These dependencies must be
reflected in the project plan,
00:13:45 where one model development cannot proceed before the other contributing models have
been completed and tested. The final task is an initial assessment of the tools and techniques
you will be using for this project.
00:14:02 You will be using two algorithms: A classification algorithm is ideal for churn modeling,
00:14:07 as it provides a classification of customers into two groups: churners and non-churners.
00:14:13 This algorithm will also identify the important explanatory variables that contribute to the model
output and give us some basic insight into why customers might be churning.
00:14:24 A k-means cluster algorithm will group customers based on their behavior, and this will then
provide useful information for a targeted communication campaign that the customer wants to
run.
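As a rough illustration of the k-means idea, here is a sketch using scikit-learn as a stand-in for the automated clustering in SAP Predictive Analytics; the spend and usage columns and the file name are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

ads = pd.read_csv("spend_ads.csv")                   # hypothetical analytical data set
features = ["SPEND_3M", "DATA_GB_3M", "VOICE_MIN_3M"]

X = StandardScaler().fit_transform(ads[features])    # put attributes on a comparable scale
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42)   # no more than 10 segments
ads["SEGMENT"] = kmeans.fit_predict(X)

# Profile each segment so the marketing team can interpret and act on it.
print(ads.groupby("SEGMENT")[features].mean().round(1))
print(ads["SEGMENT"].value_counts().sort_index())
```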
00:14:39 So just to recap on this unit. You've been introduced to the business understanding phase of
the project.
00:14:46 You have determined the business objectives of the project and business success criteria.
You've also determined the data science goals and data science success criteria.
00:14:58 You've been asked to develop a predictive churn model, a social network analysis, and a
supervised cluster model.
00:15:05 You have assessed the situation and you will be using SAP's Predictive Analytics automated
technology to build the models. Well, that's the end of this unit.
00:15:17 I look forward to seeing you again in the next unit.

Week 1 Unit 4

00:00:08 Hello, and welcome to the final unit of the first week. We are going to get a better
understanding of the data.
00:00:16 The Data Understanding phase starts with an initial data collection and then proceeds with
activities to get familiar with the data and to identify any obvious data quality
problems.
00:00:30 There are four tasks: collect initial data, describe data, explore data, and verify the data
quality. The first task is to acquire or access the data.
00:00:43 If the data is stored in a database, then you will need to get logon information, and ensure you
can connect the modeling software and access the data correctly.
00:00:56 The telco has made the following data sources available to you: an A_NUMBER_FACT table,
a CUSTOMER_ID_LOOKUP table,
a CUSTOMER table, a CDR table, a DATA_USAGE table, and a SPEND_SEGMENTATION table.
00:01:17 The next task is to examine the "surface" properties of the acquired data and report on the
results. You will create a Data Description Report that describes the data, and this will include:

00:01:29 the information about the format of the data; the quantity of the data, for example the number
of records and fields in each table;
00:01:38 the identities of the fields; and any other surface features of the data which you have
discovered. The A_NUMBER_FACT table is a list of the unique line numbers – or
A_NUMBERs – that are associated with each account.
00:01:56 There are no duplications, so each ID is unique, appearing only once in the table. This is the
fact table for the data manipulation.
00:02:06 The other tables you will be using in the analysis can be merged and aggregated to it. The
goal of the analysis is to identify which A_NUMBERs are going to churn.
00:02:19 The A_NUMBER is the "entity" in our analysis – it's the object of interest for the model. There's
only one column of data.
00:02:28 When you run the statistical analysis, it shows there are 7,445 rows of data. All of the
customers have been customers for at least six months,
00:02:39 and were classed as active, which means that they hadn't churned as of the end of March
2016. You will need to identify the "entity" for the analysis.
00:02:53 An entity is the object that is targeted by the planned analytical task. It can be a customer, a
product, an account, or a store, and it's usually identified by a unique identifier.
00:03:07 The entity defines the granularity of the analysis and the models. Defining an entity can
sometimes be quite a challenge.
00:03:15 For example, you have to determine, together with everyone involved in the project, if the
entity for a project is the 'account', or if the 'customer' is the right entity.
00:03:27 And this can sometimes be quite difficult to agree. In this model, the entity is the A_NUMBER.

00:03:36 This is the CUSTOMER_ID_LOOKUP table. It's a table that simply links the CUSTOMER_ID
to the A_NUMBER.
00:03:45 There are only two columns of data. Again, when you run the statistical analysis, it shows
there are 7,445 rows of data.
00:03:54 And this matches the count for the A_NUMBER_FACT table, so the count is consistent. This is
the CUSTOMER table. It contains customer data.
00:04:06 For each CUSTOMER_ID there is information about the customer's gender, their age, their
geographical location, in a field that's called ZIP_CODE,
00:04:16 the distribution channel that references the channel they used to apply for the service, the
handset they are using, which is called DEVICE_BRAND_NAME and also
DEVICE_MODEL_NAME,

00:04:28 and the number of months they have been a customer, which is called their "tenure". There
are eight columns of data.
00:04:35 The statistical analysis shows there are 7,445 rows of data. Again, this is consistent with the
previous tables.
00:04:44 You can visualize the data by producing histograms. These show the distribution for each
variable independently, by creating bins of the data
00:04:54 and then counting how many records fall within the bin boundaries. On the x-axis, you typically
see equal-width bins,
00:05:03 and on the y-axis, you will see the count, or percentage of the count of records, in each bin.
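A quick way to reproduce this binning outside the tool, assuming the CUSTOMER table has been loaded into a pandas DataFrame with an AGE column (the file name is hypothetical):

```python
import pandas as pd

customer = pd.read_csv("customer.csv")
bins = pd.cut(customer["AGE"], bins=10)          # ten equal-width bins (the x-axis)
counts = bins.value_counts(sort=False)           # records per bin (the y-axis)
print(counts)
print((counts / len(customer) * 100).round(1))   # the same counts as percentages
```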
The CDR table contains the call detail records.
00:05:18 There's a field that's called KxIndex, and this is simply a row number that is automatically
added, and it can be ignored in this analysis.
00:05:28 The data shows the calls that are made by each A_NUMBER, who they called, which is called
the B_NUMBER, the TYPE of call, which can be either MMS, VOICE, or SMS,
00:05:40 and the DURATION of the call, and of course, the duration is only for VOICE calls, and this is
shown in seconds,
00:05:48 and the date and time of the call. There are five columns of data.
00:05:53 And there are 466,080 rows of data in this table. The DATA_USAGE table contains the data
usage for each A_NUMBER, from January through to June 2016,
00:06:09 and also the percentage of the usage relative to the total data allowance per month. This data
has been preaggregated for you by the telco.
00:06:19 The telco has also added a flag that indicates if the line number churned in May or June. And
this is the target for the models you will be building.
00:06:29 There are 20 columns of data. Again, there are 7,445 rows of data, which is consistent with
previous tables.
00:06:40 The SPEND_SEGMENTATION table contains the spend for each A_NUMBER over the past
three months. This spend total includes any value added services, roaming charges,
international calls,
00:06:53 and any excess voice calls made over the maximum allocation for a month. This spend data
will be merged into the data set used to build the churn model,
00:07:02 and it will be the target in the spend cluster model. Again, there are 7,445 rows of data, which is
consistent.
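To show how these tables fit together, here is a minimal pandas sketch that merges everything back onto the fact table. It assumes each table has been exported to a CSV file and uses the key columns named in the transcript; the real data preparation in this course is done with the SAP tools, so treat the file names and join details as illustrative.

```python
import pandas as pd

fact     = pd.read_csv("A_NUMBER_FACT.csv")        # one row per unique A_NUMBER
lookup   = pd.read_csv("CUSTOMER_ID_LOOKUP.csv")   # A_NUMBER -> CUSTOMER_ID
customer = pd.read_csv("CUSTOMER.csv")             # demographics per CUSTOMER_ID
cdr      = pd.read_csv("CDR.csv")                  # one row per call
data_use = pd.read_csv("DATA_USAGE.csv")           # monthly usage and churn flags
spend    = pd.read_csv("SPEND_SEGMENTATION.csv")   # three-month spend per A_NUMBER

# The CDR table is at call level, so aggregate it to one row per A_NUMBER first.
voice = cdr[cdr["TYPE"] == "VOICE"]
voice_agg = (voice.groupby("A_NUMBER")
                  .agg(n_voice_calls=("DURATION", "size"),
                       total_duration=("DURATION", "sum"))
                  .reset_index())

# Merge everything back onto the fact table, keeping one row per A_NUMBER.
ads = (fact.merge(lookup, on="A_NUMBER", how="left")
           .merge(customer, on="CUSTOMER_ID", how="left")
           .merge(voice_agg, on="A_NUMBER", how="left")
           .merge(data_use, on="A_NUMBER", how="left")
           .merge(spend, on="A_NUMBER", how="left"))

assert len(ads) == len(fact)    # the joins must not introduce duplicate rows
print(ads.shape)                # expect 7,445 rows
```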
00:07:15 In the next task, you explore the data. So, for example, you might analyze the distribution of
key attributes,
00:07:23 and you check the results of simple aggregations. These analyses may address the data
science project goals.
00:07:31 They may also contribute to or refine the data description and quality reports and feed into the
transformation and other data preparation that might be needed for further analysis.
00:07:45 Summary statistics and data visualizations will enable you to gain insight into the data and give
you an early indication if the data is any good for predictive modeling.
00:07:56 You have to check if it's populated and if the distributions make sense. The simplest way to
gain this insight is to assess each variable one at a time
00:08:06 by producing summary statistics such as the minimum value, maximum value, the mean, and
the standard deviation. You can analyze the minimum and maximum values and identify if
there are any unusual outlying values.
00:08:23 The "mean" of a distribution is its average. This is simply the sum of all of the values for the
variable divided by the number of values you summed.
00:08:33 If the distribution is uniform or normal, then the mean represents the middle of the distribution.
However, data distributions are rarely normal.
00:08:45 The standard deviation measures the spread of the distribution. A larger standard deviation
means that the distribution has a greater range.
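The same kind of summary statistics can be produced in a few lines of pandas, assuming the table has been loaded into a DataFrame (the file name is hypothetical):

```python
import pandas as pd

customer = pd.read_csv("customer.csv")
numeric = customer.select_dtypes("number")

# One row per numeric variable: minimum, maximum, mean, and standard deviation.
summary = numeric.agg(["min", "max", "mean", "std"]).T
print(summary)
```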

00:08:55 The data sets in this openSAP scenario are relatively straightforward. The distribution of the
important attributes can be analyzed in the statistical reports
00:09:05 which are automatically generated for you. You must ensure there are only two categories in
the target variables
00:09:13 and there are no missing values in the target. Again, this can be checked using the statistical
reports.
00:09:20 The pie charts here show the distribution for the churn target variables for May and June. You
will see there are only two categories, which have been coded as 1 and 0,
and there are no missing values. When you use the SAP automated modeling tools,
00:09:37 much of the data exploration information will be provided in the statistical reports after you run
the first initial test model.
00:09:46 You will explore and analyze the data in more detail at that stage. CRISP-DM is a useful
guide, but sometimes there are advantages if you deviate a little.
00:10:01 You can consider building initial models before the data preparation and data understanding
phases have actually been completed.
00:10:09 These won't be the final models, but they can be used to assist with your data understanding
and provide important information that can be used in the data preparation phase.
00:10:19 You will have an early indication of which variables are good predictors. However, you should
remember that when you thoroughly prepare the data,
00:10:27 you might find some variables that were poor predictors actually become better predictors. It
will give you an early indication of the baseline accuracy of a model
00:10:40 and will also indicate leaker variables. The model will automatically produce a wide range of
descriptive statistics,
00:10:48 such as cross tabulations of each explanatory variable with the target, and correlations
between the variables. The final task is to examine the quality of the data.
00:11:03 You should check if the data is complete and that it covers all of the required cases. You
assess if it's correct or if it contains errors, and how common the errors are.
00:11:13 And also check if there are missing values in the data and how they are represented, where
they occur, and how common they are.
00:11:21 You should create a data quality report that lists data problems, including inaccurate or invalid
values, missing values, unexpected distributions, and any outliers.
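A minimal data quality report along these lines could be sketched in pandas as follows, assuming the DATA_USAGE table has been exported to a CSV and that the churn flags are named CHURNIN_MAY and CHURNIN_JUNE (the May name appears in the demo later in this unit; the June name is an assumption).

```python
import pandas as pd

data_usage = pd.read_csv("data_usage.csv")       # hypothetical export of DATA_USAGE

# One row per variable: storage type, missing count, distinct values, min and max.
quality = pd.DataFrame({
    "storage": data_usage.dtypes.astype(str),
    "missing_count": data_usage.isna().sum(),
    "distinct_values": data_usage.nunique(),
    "min": data_usage.min(numeric_only=True),
    "max": data_usage.max(numeric_only=True),
})
print(quality)

# The target variables must be binary (1/0) with no missing values.
for target in ("CHURNIN_MAY", "CHURNIN_JUNE"):
    assert data_usage[target].isna().sum() == 0, f"{target} has missing values"
    assert set(data_usage[target].unique()) <= {0, 1}, f"{target} is not binary"
```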
00:11:36 The statistical analysis in Predictive Analytics provides a list of the variables, the value and
storage, a count of any missing values, and a row count.
00:11:48 The example in the slide shows the analysis for the DATA_USAGE table. It provides the
following information:
00:11:55 It shows the variable name, its Value (whether it's continuous, ordinal, or nominal), its Storage
(and that means it's either integer, number, string, date, or datetime),
00:12:06 and the Missing value count. Here, you can see that there are no missing values – that
means there are no empty cells –
00:12:15 in the DATA_USAGE table as all the missing count values are 0. I'll explain the "value" and
"storage" of variables later in this course.
00:12:28 You will be using the SAP Predictive Analytics automated modelling tools that provide
automated data encoding strategies that deal with missing values and outliers for you
automatically.
00:12:40 Any data quality issues might be pretty obvious when you examine the data at this early stage
using the data statistics you have already seen –
the frequency charts and the continuous variable distribution reports. However, missing
values, outliers, and inconsistencies are also very obvious when the model is actually run
00:13:04 and you can examine the model data statistics and analyze the data in more detail at this
stage. So what I'm going to do now is give you a demo of how to use the SAP Predictive
Analytics tool

00:13:20 to get a better understanding of your data. When you access the training system, you'll notice
that there are two icons for SAP Predictive Analytics.
00:13:33 On the left-hand side, you've got the desktop version, and on the right-hand side, you've got
the client/server version.
00:13:41 For this training, you'll be using the desktop version only. So what I'd like you to do is double-
click on the left icon,
00:13:51 and that will open up the tool, it will take you into the Predictive Analytics software. And in this
demo, we're going to have a look at how you can get a clearer understanding,
00:14:03 a deeper understanding of the data that you'll be using for our project. So I'd like you to go into
Toolkit on the left-hand side,
00:14:11 and you'll see that there is a Data Viewer. So click on Open the Data Viewer...
00:14:18 and when you do this, you'll need to go to Browse, on the right-hand side, and you'll need to
link to the database we'll be using.
00:14:28 You'll be given the credentials to do this. We'll be linking to the SAP HANA DB,
00:14:34 and then you'll link to Catalog and TRIALSCHEMA, and all the data that we'll be using for this
project is included in the TRIALSCHEMA.
00:14:46 You'll see that there is a list of tables that are available in that schema, and let's have a look at
the first table, which is the A_NUMBER_FACT table.
00:14:55 So I highlight it, and I go down to the middle tabs here and view the data. And you'll see that
there is one column, which is the A_NUMBER,
00:15:05 which is the line number for each of the different customers, and you can see the top 100
rows.
00:15:10 You could, if you wanted to, increase the number of rows and then just refresh that. You could
increase it up to 1,000 or 10,000 or whatever.
00:15:18 And then you'll see all the line numbers. The second table that you want to have a look at is
the CUSTOMER_ID_LOOKUP table.
00:15:28 So I'll highlight it and view the data. And here, you'll see that it's a simple link between the
A_NUMBER and the CUSTOMER_ID.
00:15:38 Now that I know what the CUSTOMER_ID is, it means that I can actually link the CUSTOMER
table to the A_NUMBER.
00:15:47 So I've got the CUSTOMER table here, and I go down to View Data, and I can see here for
every CUSTOMER_ID, I have some demographic information.
00:15:57 So I've got their gender and age and some geographical location in the ZIP_CODE column.
And then on the right-hand side, I've got a column that's called TENURE_MTHS.
00:16:09 This is the number of months that a customer has been a customer. And I can run some
statistics on this data, so I click on the Statistics tab,
00:16:20 and I compute statistics over the whole data set. And it shows me the variables in the data set
and their value,
00:16:27 and they can be continuous, nominal, or ordinal. And they can have a storage of integer, or
number, data, date, time, or there can be a string.
00:16:37 And I'll be explaining more details about this information later in this course. This also shows
me the Missing Count,
00:16:46 so if there are any null values, I'll have a Missing Count value here. Obviously, you can see
here that there aren't any missing values.
00:16:54 And it gives me a Row Count, so there are 7,445 rows in this data. I can look at category
frequencies.
00:17:02 So I highlight that tab, and then I can select a variable – let's have a look at AGE, for example.
And it shows me the frequencies, and I can produce pie charts and histograms.
00:17:16 Now, I can export into Excel if I want to. And then, let's have a look at GENDER, for example,
and I can put the pie chart on for GENDER.

00:17:24 So I can see what the distribution is of the males and females within the data set. And then for
continuous variables, there's a tab here,
00:17:34 and for each of the different continuous variables, it gives me the minimum, maximum, and
mean values. And I can use this to find out if there are any unusual outlier values within the
data.
00:17:45 So that's the customer data. And the other data that we can look at is the DATA_USAGE table.

00:17:53 So again, I'll highlight that and I'll view the data. So I've got the A_NUMBER, here, and then
I've got the data usage for January, February, March,
00:18:04 April, and so on – for each of the different months. And then, on the right-hand side, I've got
some target values.
00:18:11 So this is the churn in the two months which are the target values: the churn in May and in
June. Again, I can look at the statistics.
00:18:21 So I'll compute the statistics over the whole data set. It shows me the variables, the Missing
Count, again I've got 7,445 rows.
00:18:31 I can look at the category frequencies. So, importantly here, I need to check that there are no
missing values in my target variable.
00:18:40 So I go down to CHURNIN_MAY, and I can see that I've only got 0s and 1s, which are the
churners and non-churners, and there were no missing values.
00:18:51 Again, I can check that for June – again, there are no missing values. And then for the
continuous variables, which are the actual data usage values,
00:19:00 it gives me the minimum, maximum, and mean values, and of course, the standard
deviation. So you can use this tool to get a much better understanding of the data and conduct
an initial data analysis.
00:19:20 So just to recap on this unit... You have looked at the Data Understanding phase of the project
using CRISP.
00:19:28 You have accessed and examined the data that is available in the SAP HANA database. You've
described the data, you've started to explore and verify the data, and started to check data
quality.
00:19:40 You've used SAP automated analytics to create summary statistics for the tables, and you've
seen how to use the output to check the data frequency charts,
00:19:50 check for missing values, and produce the statistics for continuous variables. On the deck that
you can download,
00:19:59 I've included some interesting Web site references if you want to learn some more about data
distributions, leaker variables, and some other interesting information that might be useful for
you.
00:20:12 So I hope you've enjoyed the first week of this course. Good luck with the questions, and good
luck with the weekly assignment.
