
openSAP

Getting Started with Data Science


Week 3 Unit 1

00:00:12 Hi, welcome to the third week of the openSAP course "Getting Started with Data Science".
This week we will cover the modeling topic,
00:00:23 and we will start in this unit with an overview of the modeling phase. CRISP phase 4 covers
modeling with various modeling techniques which can be selected and applied,
00:00:40 and their parameters are calibrated to optimal values. Some techniques have specific
requirements on the form of data.
00:00:50 Therefore, stepping back to the data preparation phase is often necessary. Task 1 is to select
the modeling technique that is to be used, for example a decision tree, neural network, or other
technique.
00:01:08 If multiple techniques are applied, you will need to perform the following tasks for each
technique separately. Task 2 is to generate a test design.
00:01:22 Before actually building a model, you will need to generate a procedure or mechanism to test
the model's quality and validity. Task 3 is to build the model.
00:01:36 Here, you run the modeling tool on the prepared dataset to create one or more models. Task 4
is to assess the model.
00:01:47 You interpret the models according to your domain knowledge, the data mining success
criteria, and the desired test design. You might contact the business analyst and domain
experts to discuss the data mining results from a business perspective.
00:02:07 In this phase you will only need to consider the models, whereas the evaluation phase, which
happens later in the process,
00:02:16 also takes into account any other results that were produced in the course of the project. As
the first step in modeling, we need to select the actual modeling technique that is to be used.
00:02:32 While you may already have selected a tool in the business understanding phase, this task
refers to the specific modeling technique, for example decision tree building with C4.5 or
neural network generation with back propagation.
00:02:52 If multiple techniques are applied, you should perform this task for each technique separately.
The output includes a document recording the actual modeling technique that is to be used and any modeling assumptions,
00:03:07 possibly regarding the data that is required by the chosen modeling technique. For example, all attributes should have uniform distributions, no missing values are allowed, and the class attribute must be a binary nominal attribute.
00:03:26 Before we actually build a model, we need to identify the mechanism to test the model's quality
and validity. For example, in supervised data mining tasks such as classification, where there
is a target variable,
00:03:42 it is common to use error rates as quality measures for data mining models. Therefore, we
typically randomly separate the dataset into estimation and validation subsets,
00:03:56 build the model on the estimation set, and estimate its quality on the validation set. The
outputs should describe the intended plan for training, testing, and evaluating the models.
00:04:12 A primary component of the plan is to decide how to divide the available dataset into
estimation data, validation data, and test datasets.
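As an illustrative sketch, such a split might look as follows in Python with scikit-learn; the file name and target column below are hypothetical placeholders, not the course dataset.

import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("prepared_dataset.csv")   # hypothetical prepared dataset
X = data.drop(columns=["target"])            # explanatory variables
y = data["target"]                           # target variable

# First hold out a test set (20%), then split the remainder into
# estimation (60% of the total) and validation (20% of the total) sets.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_est, X_val, y_est, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)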
00:04:27 We need to run the modeling tool on the prepared dataset to create one or more models. The
outputs will include information about the parameter settings, the models themselves, and the
model description.

00:04:44 With any modeling tool, there are often a large number of parameters that can be adjusted.
We need to list the parameters and their chosen value, along with the rationale for the choice
of parameter setting.
00:05:02 We need to interpret the models according to our domain knowledge, the data mining success
criteria, and the desired test design. Then we rank the models according to the evaluation
success criteria,
00:05:18 remembering to take into account the business objectives and business success criteria. A
model assessment summarizes the result of this task,
00:05:29 lists the qualities of all of the generated models (for example, in terms of accuracy), and ranks
their quality in relation to each other.
00:05:40 According to the model assessment, we may need to revise the model parameter settings and
tune them for the next run in the build model task.
00:05:50 We then iterate model building and assessment until there is a strong belief that the best
model has been found. Then we need to document all revisions and assessments.
00:06:04 So, that's it for Unit 1 of the third week. In the next unit, we will look at detecting anomalies.
00:06:12 See you there.
Week 3 Unit 2

00:00:13 Welcome back to Week 3, Unit 2, where we are looking at detecting anomalies. An anomaly is
something that deviates from what is standard, normal, or expected.
00:00:28 In statistics, an outlier is an observation that is numerically distant from the rest of the data.
Outliers can occur because of errors and might need to be removed from the dataset or
corrected.
00:00:44 They can occur naturally and therefore must be treated carefully. In some cases, such as for
fraud analysis, the outlier can be the most interesting thing in the dataset.
00:00:58 Some statistics and algorithms can be heavily biased by outliers. For example, the simple
mean, correlation, and linear regression.
00:01:11 In contrast, the trimmed mean and median are not so affected. Outliers can be detected
visually, for example using scatter plots and box plots.
00:01:26 This slide shows a box plot. The median is the value separating the higher half of a data
sample from the lower half.
00:01:36 In simple terms, it may be thought of as the "middle" value of a dataset. The basic advantage
of the median in describing data, compared to the mean (which we often simply call the
"average")
00:01:51 is that it is not skewed so much by extremely large or small values, and so it may give a better
idea of a "typical" value. For example, in understanding statistics like household income, a
mean may be skewed by a small number of extremely high or low values.
00:02:12 Median income, for example, may be a better way to suggest what a "typical" income is.
Outliers can be detected using various algorithms.
00:02:25 The most well known is the Inter-Quartile Range Test, or the Tukey Test, named after its author. It's the calculation behind the construction of box plots.
00:02:39 We calculate the lower and upper quartiles (the 25th and 75th percentiles), which we denote
as Q1 and Q3 respectively. We also calculate the mid-spread.
00:02:55 This is given by Q3 - Q1 and is referred to as the Inter-Quartile Range (IQR). John Tukey
provided a precise definition for two types of outliers:
00:03:12 Outliers are either 3 × IQR or more above the third quartile, or 3 × IQR or more below the first quartile; suspected outliers are slightly more central versions of outliers:
00:03:33 these are either 1.5 × IQR or more above the third quartile, or 1.5 × IQR or more below the first quartile. If either type of outlier is present, the whisker on the appropriate side of the box plot is taken to 1.5 × IQR from the quartile,
00:03:58 rather than to the maximum or minimum value, and individual outlying data points are displayed as unfilled circles for suspected outliers, or filled circles for outliers.
00:04:16 1.5 × IQR from the quartile is often referred to as the "inner fence", while 3 × IQR from the quartile is referred to as the "outer fence".
00:04:29 This is a non-parametric test: it displays variations in samples of a statistical population
without making any assumptions about the underlying statistical distribution.
00:04:43 The calculation is very simple. However, Tukey's test is quite robust
00:04:48 and is certainly better than the more common variance test, where outliers are defined as
values outside two or three standard deviations, because in the variance test the outliers
themselves contribute to the statistic.
00:05:07 Here is a simple worked example. There are 15 values in the data set.
00:05:14 The lower quartile cuts off the bottom 25% of the data; its value is 10. The upper quartile cuts off the top 25%; its value is 12.
00:05:26 The median is 11. The IQR is calculated as 12 - 10 = 2.
00:05:36 If we use an IQR multiplier of 3, to identify outliers, then the upper limit is 12 + (3*2) = 18. The
lower limit is 10 - (3*2) = 4.
00:05:54 There is one value above the upper limit in this data set. This is P6 with a value of 24. There is also one value below the lower limit. This is P11 with a value of 1.
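As an illustrative sketch, the same test can be written in a few lines of Python; the 15 sample values below are a hypothetical reconstruction chosen so that the quartiles match this worked example (Q1 = 10, Q3 = 12, median 11).

import numpy as np

def iqr_outliers(values, fence=3.0):
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - fence * iqr, q3 + fence * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical 15 values with Q1 = 10, Q3 = 12, and a median of 11
sample = [10, 11, 12, 24, 1, 10, 11, 12, 11, 10, 12, 11, 10, 12, 11]
print(iqr_outliers(sample, fence=3.0))   # [24, 1] - the values outside the limits 4 and 18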
00:06:12 So what we are now going to do is a demo in the SAP Predictive Analytics software. So the
first thing I'm going to do is capture the data, and we've got some very simple data here, which
is the IQR example
00:06:36 where I have one column which is periods, and another column which is the values for each
different period. So I'll capture that data.
00:06:51 And just to show you what the data looks like, I'll do a very simple line chart and show you the
value and the period.
00:07:07 And you can see quite obviously here that there are two outliers when you visualize it. Now
what I'm going to do is show you how we can actually predict this.
00:07:19 So I'll go to the Predict room and choose the algorithm that I want to run. And these are the
sets of algorithms that I have available here, on the right-hand side.
00:07:31 And I'm going to use the Inter-Quartile Range algorithm. And then when I set this, I have to
configure the settings.
00:07:41 I set the output mode to "Show Outliers". The feature I'm looking at is "VALUE".
00:07:50 And I'll set the fence coefficient as "3.0". And then once I've done all this, I can complete the
analysis and run it.
00:08:06 And here, the output shows me that this value which is being flagged as a "1" is an outlier, and
this value here which is being flagged as a "1" is another outlier, so this column flags the
outliers for me.
00:08:21 So this is a very simple example of how we can use the Inter-Quartile Range to identify where
the outliers are. The Nearest Neighbor algorithm is another way to detect outliers.
00:08:50 It is based on the concept of a local density, where locality is given by nearest neighbors. We
measure the distance between each point and each neighbor, and the distance is used to
estimate the density.
00:09:08 By comparing the local density of an object to the local density of its neighbors, we can identify
regions of similar density, and points that have a substantially lower density than their
neighbors.
00:09:24 These low density points are considered to be outliers. In this example, we are using the
Euclidean (straight line) distance.
00:09:34 Let's have a look at a worked example. We need to specify the number of neighbors, say here
K = 3.
00:09:44 And also specify the number of outliers to be detected. Here it is N = 2. We calculate all the
inter-object distances.
00:09:56 These are the Euclidean straight line distances shown in the table. Observation 1 has a value
of 16.9.
00:10:07 Observation 2 has a value of 24.0. The distance is 24.0 - 16.9 = 7.1.
00:10:19 The third observation, if you look on the slide, is 32.9. Therefore, the distance here is 32.9 -
16.9 = 16.0.
00:10:32 These distances are calculated for all of the observations. We select the three (K = 3) nearest
and calculate the average of the distances.
00:10:47 For each row in the distance table, we select the smallest distance, the next smallest, and the next smallest after that, so there are three values. Then we take the average of these values.
00:11:01 In row 1, the minimum distance is 0.2, which is the difference between row 1 and row 4. The
next shortest distance is 0.7, which is the difference between row 1 and row 9.
00:11:18 And the next shortest distance is 1.4, which is the difference between row 1 and row 7. Take
these three values and average them, to give 0.77.
00:11:32 Highlight the two rows (because we have N=2) with the largest average. Here, these are for
rows 2 and 5.
00:11:43 An improvement in the algorithm would be to apply the Inter-Quartile Range Test to the
Average Distance column to identify significant outliers within a neighborhood.
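As an illustrative sketch, here is one way to code this average-distance approach in Python; the observation values are made up for demonstration and do not reproduce the unit's exact distance table.

import numpy as np

def knn_outliers(values, k=3, n=2):
    values = np.asarray(values, dtype=float)
    dist = np.abs(values[:, None] - values[None, :])      # pairwise one-dimensional Euclidean distances
    np.fill_diagonal(dist, np.inf)                        # ignore the distance of a point to itself
    avg_knn = np.sort(dist, axis=1)[:, :k].mean(axis=1)   # average of the k nearest distances
    return np.argsort(avg_knn)[-n:][::-1]                 # indices of the n largest averages

# Hypothetical observations; the two isolated values stand out
obs = [16.9, 24.0, 17.1, 18.3, 50.0, 15.5, 19.0, 16.2, 17.6, 17.3]
print(knn_outliers(obs, k=3, n=2))   # [4 1] - the 0-based indices of 50.0 and 24.0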
00:11:57 We will be returning to this test later in the course. So now what I'd like to do is show you
another demo where I'm going to show you the SAP Predictive Analytics tool
00:12:12 and how we can undertake this type of analysis. So if I create a new analysis,
00:12:29 and this time I'm going to require some nearest neighbor data, which is the data that we've just
looked at, with rows and columns, which is exactly the same data that we've just used, and
read this in.
00:12:44 And I move to the Predict room and go to my algorithms, and if I scroll down,
00:12:55 I'll find the "Nearest Neighbour" test. I double-click on "Nearest Neighbour", and then I have to
set its parameters.
00:13:08 Again, I don't want to remove outliers, I just want to show the outliers. The number of outliers we're going to detect is 2,
00:13:18 so these are the same settings that we used previously when we walked through the test
manually. The feature is the "COLUMN".
00:13:28 The neighborhood count is "3", as it was previously. And now that I've done that, I can set
those parameters and run the model.
00:13:42 And here it correctly identifies for me the rows with the outliers "ROW2" and "ROW5". So this
is another very simple example of how we can use algorithms to identify outliers.
00:14:10 There is a wide range of anomaly detection algorithms available apart from the ones we have
already looked at. For example, Cluster Modeling can be used,
00:14:20 Association Analysis that identifies rare occurrences, Principal Component Analysis, Distance-
Based Failure Analysis, and Link Analysis.
00:14:34 We will be looking at some of these during this course. Remember that anomalies can arise
from what is "unusual" and also what is "unexpected".
00:14:46 We will often build a model on the observed data, score some new data, and examine the major variances of actual versus predicted values to get a better understanding of the unusual and unexpected values.
00:15:05 To summarize: Outlier analysis is a key step in predictive analysis, as outliers can significantly
affect a model.
00:15:17 We can perform a visual analysis, and we can use various algorithms.
00:15:24 Outliers need to be investigated and not simply removed from the analysis. With that I would
like to close this unit.
00:15:34 We will return in the next unit and look at association analysis. See you there.
Week 3 Unit 3

00:00:11 Welcome back to Week 3, Unit 3, where we will look at association analysis. We use
association rules in our everyday lives.
00:00:23 Here are some application examples: shopping carts and supermarket shoppers,
00:00:29 analysis of any product purchases not just in shops, but online as well, analysis of telecom
service purchases for voice and data packages.
00:00:41 If there is customer identification, maybe through a loyalty scheme, the purchases can be
analyzed over time and the sequence of product purchases analyzed.
00:00:52 Identification of fraudulent medical insurance claims: consider cases where common rules are broken. Differential analysis: compare results between different stores, between customers in different demographic groups,
00:01:08 between different days of the week or different seasons of the year, for example. The "basket"
can be a household, not only related to an individual.
00:01:21 This type of analysis is used to analyze customer purchase behavior. For example, how many
people bought both Product 1 and Product 2?
00:01:30 People who purchased Product 1 are generally buying which other products? This analysis
also supports: the design of in-store and out-of-store campaign strategies,
00:01:44 for example, developing combo offers based on products sold together; the design of retail
floor layouts where retailers organize and place associated product categories nearby inside a
store;
00:02:03 how a retailer should determine the layout of the catalog of an e-commerce site; how a retailer
controls inventory based on product demand and what products sell together;
00:02:17 also fashion retailers like to confirm the "look" of their fashion. I will show you a demonstration
with an example dataset.
00:02:30 Here, there are 30 transactions and there are only 17 baskets, shown in alternate colors
below, in what we call transaction format, which is sometimes called till-roll format.
00:02:46 First we will walk through a manual analysis and I will explain how the important metrics are
calculated. Then I will show you how we can do this in the SAP Predictive Analytics tool.
00:02:59 I will also be able to show you the transactional data more clearly when I read it into the tool.
Association analysis produces simple rules.
00:03:12 For example: If {Product A} and {Product B} are purchased, then {Product C}. These are simple "If ... Then ..." type rules.
00:03:23 This example rule has two antecedents (Product A and Product B) on the left side of the rule,
and one consequent (Product C) on the right side.
00:03:37 The rule length is 3, as there are three products in this rule. Different rules can have different
lengths and numbers of antecedents.
00:03:50 There are three main statistics used in association analysis. First, we will look at the support.
Support is defined as the number of baskets that support the rule (meaning where the
combination of products exists),
00:04:10 divided by the total number of baskets, expressed as a percentage. So in the example, there
are three baskets that contain Product 5 and Product 6.
00:04:23 There are 17 baskets in total. The rule support is given by 3 / 17 * 100 (because it's a percentage) = 17.65%.
00:04:39 Support is bi-directional. So, for example, "if Product 3 and Product 2 then 4", "if Product 2 and 4 then 3",
00:04:52 and "if Product 4 and 3 then 2" all have the same rule support percentage. All those different combinations of products have the same support.
00:05:03 We say that the support of the rule X then Y is symmetric, meaning: support (X then Y) =
support (Y then X). Secondly we will look at the metric called "confidence".
00:05:23 The confidence measures how often the consequent appears in transactions that contain the
antecedents. It is the conditional probability of those customers who have purchased the
specified antecedents to buy the consequent.
00:05:42 This is an extremely important metric for retailers. If we have a simple rule "If Product A then Product B",
00:05:51 then the confidence is defined as the number of baskets in which Product A and Product B
both exist, divided by the number of baskets with Product A in, expressed as a percentage.
00:06:06 This is not symmetric, however, so if we look conversely: If Product B then Product A, we
calculate confidence as the number of baskets in which Product B and Product A both exist,
00:06:22 divided by the number of baskets with Product B, expressed as a percentage. You will see that
this is not the same.
00:06:32 Therefore, confidence is not bi-directional. The third metric is called "lift", or "improvement".
00:06:43 Both support and confidence are used to test a rule's validity. However, there are times when
both of these measures may be high, and yet still produce a rule that is not useful.
00:06:56 Let's take a simple example with two products, A and B. The combination of Product A and
Product B has a support of, say, 40%:
00:07:08 in other words, 40% of the baskets support this rule. And maybe there is a confidence of 80%, so in 80% of the baskets where customers buy Product A, they also buy Product B.
00:07:25 This at first appears to be an excellent rule; it has both very high confidence and support.
However, what if customers in general buy Product B 95% of the time anyway?
00:07:40 In that case, Product B customers are actually less likely to buy Product A than customers in
general. Therefore, we have a third measure for market basket analysis: this is the improvement, or lift, which is defined as:
00:07:59 the confidence of the combination of items divided by the support of the result. In the simple
example, lift = confidence (A and B) / support (B), which in our example is 80% / 95% =
0.8421.
00:08:24 Now, any rule with a lift of less than 1 does not indicate a real cross-selling opportunity, no
matter how high its support and confidence, because it actually offers less ability to predict a
purchase than random chance.
00:08:43 If some rule had a lift of 1, it would imply that the probability of occurrence of the antecedent
and that of the consequent are independent of each other.
00:08:55 When two events are independent of each other, no rule can be drawn involving those two
events. However, if the lift is greater than 1, that lets us know the degree to which those two
occurrences are dependent on one another,
00:09:13 and makes these rules potentially useful for predicting the consequent in future datasets. The
lift of the rule "X then Y" is symmetric, so lift (X then Y) = lift (Y then X).
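As an illustrative sketch, these three metrics can be computed directly from basket data in Python; the baskets below are hypothetical, not the unit's 17-basket example.

# Hypothetical baskets of products
baskets = [
    {"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A"},
    {"B"}, {"A", "B", "D"}, {"C", "D"}, {"A", "C"},
]

def support(itemset):
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

A, B = {"A"}, {"B"}
print(support(A | B))     # 0.375 - share of baskets containing both A and B
print(confidence(A, B))   # 0.6   - of the baskets with A, 60% also contain B
print(lift(A, B))         # 0.96  - below 1, so no real cross-selling opportunity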
00:09:32 So, how can we use this? What are the possible recommendations for an "X then Y" rule,
00:09:40 when X and Y are two separate products and have high support, high confidence, and high
positive lift values greater than 1? The retailer has a number of choices.
00:09:55 They could put X and Y closer in the store. They could package X with Y.
00:10:03 They could package X and Y with a poorly selling item. They could give a discount on only one of X and Y.
00:10:12 They could increase the price of X and lower the price of Y, or vice versa. They could advertise only one of X and Y, rather than advertising X and Y together.
00:10:26 So for example, if X was a toy and Y a form of sweet, then offering sweets in the form of toy X
could also be a very good option. I will now show you how we can use the Apriori rule in SAP
Predictive Analytics.
00:10:51 So here we have some product association data. It's exactly the same data as I used when we
stepped through the process manually.
00:11:02 And I can show you that data a little more clearly now. So for each one of these RecordIDs, we
can see the TransactionID, and the product that was purchased.
00:11:17 Now what I'd like to do is to use the R-Apriori algorithm. So I'll bring that into the palette, and
the first thing I need to do is configure the settings.
00:11:32 So the output mode is "Rules". We have the data in a transactional format.
00:11:41 The item that we're looking at is the ProductName. And the TransactionID is referenced here
by the TransactionID.
00:11:54 Now what I need to do is to set the support and confidence. These are used as filters to
remove any inconsequential rules.
00:12:05 So here the support I'll put in is 0.01. And also I'll use the same value for the confidence in this
example.
00:12:18 Now all of these parameters are complete, and I will run the analysis.
00:12:30 The execution is successful. And it shows me all of the different rules which have been
produced,
which I can show you here. These top rules are for the products on their own.
00:12:44 And I have a support, confidence, and lift which are calculated. Then we have the
combination of products which are being purchased together.
00:12:52 So in this one here, Product 7 is the antecedent and Product 3 is the consequent. And that's
the support, that's the confidence, and that's the lift.
00:13:04 And all of these rules are produced for all the different product combinations. It also produces
an output which is a tag cloud.
00:13:17 And so in this example, the confidence is shown by the color. The higher the confidence, the
more red the color becomes.
00:13:29 And the lift is shown by the font size. So you can very quickly and easily identify which rules
have the highest confidence and which have the highest lift.
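Outside the SAP tool, a similar analysis could be reproduced with the open-source mlxtend package, as sketched below; the CSV file name is hypothetical, and the column names follow the demo (TransactionID, ProductName).

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("product_association.csv")   # hypothetical till-roll data
# Pivot the transactional format into one row per basket,
# with a True/False flag per product.
basket = pd.crosstab(df["TransactionID"], df["ProductName"]).astype(bool)

frequent = apriori(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.01)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])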
00:13:52 So, let's summarize what we have seen about association rules. Their strengths: the technique produces clear and understandable results.
00:14:03 The calculations are straightforward and therefore easy to understand. The results are
actionable.
This is what we call an undirected or unsupervised type of data mining: we don't have a target. However, there are some weaknesses: it requires exponentially more computations as the problem size grows.
00:14:28 Many of the results can be either trivial or inexplicable. It can ignore rare items.
00:14:37 It does not allow us to directly include any customer data, if this is available, as we are simply analyzing transactional data. Okay, that's it for this unit.
00:14:52 In the next unit, we will cover the topic "Cluster Analysis". See you there.
Week 3 Unit 4

00:00:12 Welcome to Week 3, Unit 4. Today, we will be looking at cluster analysis.


00:00:20 Grouping similar customers and products is a fundamental marketing activity. Cluster analysis
or clustering is the task of grouping a set of objects in such a way that objects in the same group, called a cluster,
00:00:37 are more similar to each other, so they are homogeneous in some sense or another, but are very dissimilar to objects not belonging to that group, so they are heterogeneous.
00:00:51 It is often used in market segmentation because companies cannot connect with all their
customers, so they have to divide markets into groups of consumers, customers, or clients, called segments, with similar needs and wants.
00:01:09 Organizations can then target each of these segments with specific offers that are relevant,
using the communication tone and content that is most likely to appeal to the customers within
the segment.
00:01:24 Many clustering techniques fall into a group of undirected or unsupervised data science tools.
The goal of undirected analysis is to discover structure in the data as a whole.
00:01:39 There is no target variable to be predicted. There is a wide variety of applications of cluster
analysis.
00:01:48 For example, analysis to find groups of similar customers, segmenting the market and
determining target markets,
00:01:57 product and service development, product positioning, selecting test markets, analyzing crime
patterns, and data reduction or refinement when faced with large, complex datasets.
00:02:14 Let me give you a real example: The U.S. Army clusters uniform sizes so they can reduce the
number of different sizes and standardize them.
00:02:27 Cluster analysis is also used for medical research, social services, psychiatry, education,
archaeology, astronomy, and taxonomy, for example.
00:02:42 ABC analysis is a very basic clustering approach. It clusters data based on its contribution to a
total.
00:02:52 For example, to find the top 10% of customers based on their spending, or the top 20% of
products based on their contribution to overall profit.
00:03:06 The data is first sorted in descending numeric order and then partitioned into the first A%, the
second B%, and the final C%. The A cluster may be considered the most important or the gold
segment,
00:03:24 the B cluster the next most important or the silver segment, and finally the C cluster is the least
important or the bronze segment.
00:03:36 The example shown on the slide here has A=25%, B=30%, and C=45%. k-means
clustering is a method of cluster analysis which aims to partition n observations into k clusters

00:04:01 in which each observation belongs to the cluster with the nearest mean. It is one of the best
known predictive analysis algorithms.
00:04:11 The method is essentially as follows. First you have to specify how many clusters you require: this is k.
00:04:21 In this example on the slide, k=3. Then the center of each of the k clusters is initialized.
00:04:32 There are a variety of ways to initialize the center, for example randomly choosing k
observations from the dataset and using these as the initial means.
00:04:45 These are shown in the diagram by the red, green, and blue colored circles in step 1. Then the
algorithm proceeds iteratively, alternating between two steps:
00:04:59 an assignment step, where each observation (these are shown as the gray squares in the diagram) is assigned to the cluster whose mean yields the least within-cluster sum of squares.
00:05:14 Since the sum of squares is the squared Euclidean distance, this assignment process
associates each observation with its "nearest" mean.
00:05:28 Then there is an update step: the centroid of each of the k clusters is designated as the new mean.
00:05:37 In effect, the centroid moves slightly. And then we repeat these two steps, steps 2 and 3, until
convergence has been reached.
00:05:50 The algorithm has converged when the assignments no longer change. The standard
algorithm aims at minimizing the within-cluster sum of squares objective,
00:06:03 which is exactly equivalent to assigning observations by using the smallest Euclidean distance.
Therefore, all of the observations in a cluster should be as homogeneous as possible, and the
clusters will be heterogeneous.
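As an illustrative sketch, the same procedure can be run with scikit-learn's KMeans; the two-variable dataset is randomly generated for demonstration.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three artificial groups of points in two dimensions
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 4, 8)])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the final cluster means (centroids)
print(kmeans.inertia_)           # the within-cluster sum of squares being minimized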
00:06:23 It can be quite difficult to choose k, the number of clusters. Often a range of models will be
developed with different numbers of clusters.
00:06:34 A choice has to be made regarding the number of clusters you choose in the final model. This
choice may be based on:
00:06:43 Business operational constraints: the choice of the number of clusters is always limited by the organization's capacity to use them. Having 20+ clusters does not make business sense in many cases.
00:06:59 Business meaning: the interpretability of results. Clustering is only useful if it can be understood by the business and explained in simple words.
00:07:12 In some cases you may already know the value of k you want to use. One suggestion is to use
the square root of N/2, where N is the number of records.
00:07:25 However, this could be very large. Another popular approach is called the "silhouette", which is
a measure of the quality of the cluster analysis.
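As an illustrative sketch, reusing the X array from the k-means example above, candidate values of k can be compared with the silhouette score, where a higher score indicates a better-separated clustering.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))   # pick the k with the highest score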
00:07:38 One of the first steps in a clustering project is data preparation. For continuous variables, data
preparation is required to rescale them.
00:07:51 This is very important when dealing with variables of different units and scales. k-means uses
the Euclidean distance, so all the variables should have the same scale for a fair comparison
in the model.
00:08:08 Two popular methods are used for rescaling data: normalization, which scales all numeric variables into the range 0 to 1,
00:08:19 and standardization, which transforms them to have zero mean and unit variance. Both of these techniques have their drawbacks.
00:08:30 If you have outliers, normalizing your data will scale the "normal" data to a very small interval. However, when using standardization, your new data isn't bounded, unlike normalization.
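As an illustrative sketch, both rescaling methods are available in scikit-learn; the values below are a hypothetical column containing one outlier, which shows the drawback just described.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[12.0], [15.0], [11.0], [90.0], [14.0]])   # hypothetical column with an outlier

normalized = MinMaxScaler().fit_transform(values)     # squeezed into the range 0 to 1
standardized = StandardScaler().fit_transform(values) # zero mean, unit variance, unbounded
print(normalized.ravel())     # the "normal" values end up in a very small interval near 0
print(standardized.ravel())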
00:08:47 So you need to be very careful about outlier detection before you embark on the data
preparation phase. For categorical variables, one approach is to convert each category into a dichotomous variable with a value of 0 or 1.
00:09:10 These are called disjunctive or dummy variables. The SAP HANA Predictive Analysis
Library (PAL) supports a very comprehensive range of cluster analysis algorithms,
including an approach called agglomerative hierarchical clustering. This is a "bottom-up"
approach.
00:09:37 Each observation starts in its own cluster, and pairs of clusters are merged as you move up
the hierarchy. The first groups are of observations that are closest.
00:09:51 Then the process groups the next closest, then the next closest, until all the observations are
clustered as one. The results of hierarchical clustering are usually presented in a
dendrogram diagram, which is shown on the slide.
00:10:10 This plots the "dissimilarity" of records and clusters. The row of nodes on the far left in the
diagram represents the individual observations,
00:10:21 and the remaining nodes represent the clusters to which the data belongs, with the lines
representing the distance, or dissimilarity. In order to decide which clusters should be
combined, a measure of dissimilarity between sets of observations is required.
00:10:42 In most methods of hierarchical clustering, this is achieved by using an appropriate metric (a measure of distance between pairs of observations), for example the Euclidean distance,
00:10:56 and a linkage criterion that determines the distance between sets of observations. This
artificial neural network was introduced by the Finnish professor Teuvo Kohonen in the 1980s

00:11:13 and is sometimes called a Kohonen map or network. It is trained using unsupervised learning
to produce a low-dimensional (typically two-dimensional)
representation of the input space of the training samples, which we call a map. When the
network is trained, observations that are similar should appear close together on the output
map,
while records that are different will appear far apart. The number of observations captured by each cell in the map will show which units are more heavily populated.
00:11:51 This indicates groupings of the records or segments, which may give a sense of the
appropriate number of clusters in the dataset. Unlike k-means cluster analysis, the value of "k"
is not pre-determined.
00:12:08 The network is created from a two-dimensional lattice of "nodes", each of which is fully
connected to the input layer. This example on the slide is a small Kohonen network of 3 * 3
nodes
00:12:23 connected to the input layer of a two-dimensional vector, for example a two-variable dataset.
Like most artificial neural networks, they operate in two modes: training and mapping.
00:12:39 "Training" builds the map using input examples, while "mapping" automatically classifies a new
input vector. They have been applied in many areas, such as image browsing systems,
medical diagnosis, interpreting seismic activity,
00:12:58 speech recognition, and data compression. There are some weaknesses, however.
00:13:05 It can sometimes be difficult to interpret the results, and it can be sensitive to the algorithm
parameter selections. So now we need to look at segmenting, as opposed to clustering.
00:13:21 Segmenting is the process of putting customers into groups based on similarities, and
clustering is the process of finding similarities in customers so that they can be grouped, and
therefore segmented.
00:13:38 These seem quite similar, but they are not quite the same. So what does this mean?
00:13:45 Customer segmentation is the practice of dividing a customer base into groups of individuals
that are similar in specific ways relevant to marketing, such as age, gender, particular
interests, and spending habits.
00:14:03 To develop a segmentation you have to decide on the borders between the segments. That is
quite simple if you only have to deal with one or two characteristics.
00:14:15 In marketing, we sometimes have hundreds of characteristics of our customers. That makes it
a lot more complex and often people use cluster models to find these borders for them
00:14:27 in the multidimensional space of their customer database. Eventually they will use these
borders for actually segmenting their customer base.
00:14:39 So clustering is finding borders between groups, and segmenting is using borders to form
groups. There are a number of criteria to be considered when you build a segmentation model.

00:14:55 The following key aspects require careful consideration and evaluation: They need to be homogeneous: there must be similarity of members within the segments.
00:15:07 They need to be heterogeneous: there should be a distinct difference between the segments. They need to be stable: segments should be stable over time so that appropriate business or marketing activity can be implemented.
00:15:25 They need to be recognizable: segments must make sense to the business. They need to be meaningful and relevant: segments must be well-defined and actionable.
00:15:37 And they need to be manageable: the number and complexity of segments is very important. Too few segments make the solution irrelevant, and too many segments will be difficult to manage.

00:15:54 Most clustering approaches are undirected, which means there is no target. The goal of
undirected analysis is to discover structure in the data as a whole.
00:16:06 Some of the important issues to consider are: the choice of the distance measure.
00:16:12 Most clustering techniques use the Euclidean distance formula: this is the square root of the sum of the squares of the distances along each attribute axis.
00:16:25 Therefore, continuous variables need to be rescaled so there is a fair comparison. Also,
categorical variables must be transformed before the clustering can take place.
00:16:38 Depending on their transformations, the categorical variables may dominate the clustering
results or they may be completely ignored.
00:16:48 The choice of the right number of clusters is very important. If the number of clusters k in the k-means method is not chosen to match the natural structure of the data,
00:17:01 the results may not be very good. The proper way to alleviate this is to experiment with
different values for k.
00:17:10 In principle, the best k value will exhibit the smallest intra-cluster distances and the largest inter-cluster distances. The cluster interpretation is very important.
00:17:24 Once the clusters are discovered, they have to be interpreted. There are different ways to
utilize clustering results:
00:17:32 Cluster membership can be used as a label for a separate classification problem. Other data
science techniques, for example decision trees, can be used to find descriptions for each
cluster.
00:17:48 Clusters can be visualized using two-dimensional and three-dimensional scatter graphs or
some other visualization technique. Any differences in attribute values among different clusters
can be examined, one attribute at a time.
00:18:06 There can be application issues, of course. Clustering techniques are used when we expect
natural groupings in the data.
00:18:16 Clusters should then represent groups of items (such as products, events, or customers) that have a lot in common. Creating clusters prior to the application of some other data mining technique,
00:18:29 for example classification models, decision trees, or neural networks, might reduce the complexity of the problem by dividing the data space.
00:18:40 These space partitions can be modeled separately, and these two-step procedures can
sometimes exhibit improved results compared to the analysis or modeling without using
clustering.
00:18:54 This is referred to as segmented modeling. I will leave you with this great quote from Berry &
Linoff, in their very good book called "Data Mining Techniques".
00:19:10 In the next unit, we will discuss classification analysis using regression techniques.
Week 3 Unit 5

00:00:13 Hi and welcome back. In the last unit of the third week we will now take a look at classification
using regression techniques.
00:00:22 Regression analysis is a collective name for techniques used for the modeling and analysis of
numerical data consisting of values of a target variable and of one or more explanatory
variables.
00:00:36 The parameters of the regression are estimated so as to give a "best fit" of the data. Most
commonly the best fit is evaluated by using the least squares method, but other criteria can
also be used.
00:00:51 The target variable in the regression equation is modeled as a function of the explanatory
variables, a constant term, and an error term. The error term is treated as a random variable.
00:01:03 It represents unexplained variation in the dependent variable. The target is a continuous
variable.
00:01:11 Classification analysis uses regression techniques to identify the category a new observation
belongs to, on the basis of a training set of data containing observations whose category
membership is known.
00:01:25 In these examples above, the target has two categories: churners and non-churners,
responders and non- responders, or apples and pears.
00:01:35 There are many use cases for classification analysis, covering scenarios where the focus is on
the relationship between a dependent variable or a target variable and one or more
independent or explanatory variables.
00:01:52 Retention analysis, or churn analysis, is a major application area for predictive analysis. The
objective is to try and build a model to describe the attributes of those customers who have
been retained,
00:02:05 in contrast to those who have left or churned, and therefore develop strategies to maximize the
retention of customers. The target or dependent variable is usually a flag, binary, or Boolean
variable, which is either "Yes" or "No", or "1" or "0",
00:02:23 depending on whether the customer churned or not. The explanatory or independent variables are
usually numeric and categorical, describing the attributes of each customer.
00:02:37 The class of models in these cases is referred to as a classification model, as we wish to classify observations (or, in the more general sense, data records) into classes.
00:02:49 The algorithms for this class of models are primarily decision trees, given their ability to
produce rules, but also the major fields of regression and neural networks.
00:03:02 In the terminology of predictive analysis, classification is considered an instance of supervised
learning. This means learning where a training set of correctly identified observations is
available,
00:03:17 which means there is a target variable. The use cases for classification analysis are the largest
group within predictive analysis.
00:03:28 This very simple demonstration is going to show you a bi-variate linear regression where Y = a + b*X.
You will have noticed already that the terminology I'm using is very different across different
disciplines.
00:03:48 In statistics, where classification is often done with logistic regression or a similar procedure,
the properties of observations are termed explanatory variables, independent variables, or
regressors.
00:04:04 These are the x variables in the equation. The categories to be predicted are known as
targets, or sometimes as dependent variables or outcomes.
00:04:14 This is the Y variable in the equation. However, in machine learning, the observations are
often known as instances, the explanatory variables are termed features,
00:04:27 and the possible categories to be predicted are called classes. And now we're going to switch
to a demonstration.
00:04:45 Here we have a very simple dataset. There are X and Y values.
00:04:50 And what I'm going to do is build a very simple regression model. So going into the SAP
Predictive Analytics software, I can see the data on the left-hand side,
00:05:02 and I need to choose the algorithm. So if I scroll down here, I'll see that there are a range of
regression algorithms available.
00:05:12 And I'm going to use a "Linear Regression" algorithm. So I go into the algorithm and set the
properties.
00:05:23 The output mode is the "Trend". The independent column in this case are my X values,
00:05:29 and the dependent column, which is the target, the thing that I'm trying to predict, is the Y
value. So now that's all set up, I can run this model...
00:05:43 and the output is a prediction of the Y values here. So that's how simple linear regression
works.
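As an illustrative sketch, the same kind of bi-variate fit can be produced in Python with scikit-learn; the X and Y values are hypothetical stand-ins for the demo dataset.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7]])       # hypothetical explanatory values
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9])     # hypothetical target values

model = LinearRegression().fit(X, Y)
print(model.intercept_, model.coef_[0])   # the constant a and the slope b in Y = a + b*X
print(model.predict(X))                   # the predicted Y values, as in the demo output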
00:05:53 Now, what I can also do is save this model so that I can use it again later on to apply this
model onto new data. However, one of the things that I would normally do is look at this
statistical algorithm summary.
00:06:12 And this gives me information about the linear model that I've built, giving the R-square factor
and the F value.
00:06:22 This model summary report provides a lot of important information about the regression model.
So, for example, what is this coefficient of determination, or R squared?
00:06:36 This is a very important statistic. It is the proportion of the total variation in Y explained by
fitting the regression.
00:06:45 It is the square of the correlation between the predicted Y scores and actual Y scores, and it
ranges from 0 to 1. An R squared of 0 means that the target variable cannot be predicted from
the explanatory variable.
00:07:01 An R squared of 1 means that the target variable can be predicted without error from the
explanatory variables. An R squared between 0 and 1 indicates the extent to which the target
variable is predictable.
00:07:16 So, to give you an example an R squared of 0.1 means that 10% of the variance in Y is
predictable from X. Again, an R squared of 0.2 means that 20% is predictable, and so on.
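As an illustrative sketch, reusing the hypothetical model, X, and Y from the earlier regression example, R squared can be computed directly:

from sklearn.metrics import r2_score

r_squared = r2_score(Y, model.predict(X))
print(r_squared)   # the proportion of the variation in Y explained by the fitted line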
00:07:34 Also, looking at this output, you will see something that's called the F statistic. An F statistic is
a value you get when you run a regression analysis
to find out if the means between two populations are significantly different. It's very similar to a
T statistic from a T-test, which you might have heard of before.
00:07:56 A T-test will tell you if a single variable is statistically significant and an F test will tell you if a
group of variables are jointly significant.
00:08:07 If you have significant results, it means that your results likely did not happen by chance.
However, if you don't have statistically significant results, your data does not show a relationship,
00:08:21 and we use some strange terminology which says that we can't reject the null hypothesis. Well, what does this mean?
00:08:31 It means that, if the null hypothesis is rejected, it gets replaced with the alternate hypothesis, which is what you think might actually be true about a situation. So, for example, let's say you think that a certain food additive might
be responsible for a series of recent heart attacks,
00:08:50 and you're going to do some data science to try and prove this. The manufacturer thinks the
additive is safe.
00:08:58 So, the null hypothesis refers to the accepted hypothesis. In our example, the additive is on
the market, people are eating it, and it's generally accepted to be safe.
00:09:10 Therefore, the null hypothesis is that the additive is safe. The alternate hypothesis, which is the one you want to replace the null hypothesis with, is that the additive isn't safe.
00:09:24 Rejecting the null hypothesis in this case means that you will have to prove that the additive is
not safe using data science. In our example, the SAP software in this very simple linear
regression has calculated the F value and reports it in the summary report.
00:09:45 The value, as you can see, is 27.36. F values can be taken from statistical tables.
00:09:55 F(0.01) and F(0.05) are the table values of the F-distribution for the 99% and 95% confidence limits.
These table values allow us to test the null hypothesis that there is no slope or linear
relationship between Y and X.
00:10:17 If the F value exceeds F(0.05) or F(0.01), we can reject the null hypothesis in favor of the alternative
hypothesis that there is a relationship at the appropriate confidence levels.
00:10:32 So from tables we can look up the F values. The F value at 99% is 16.26, and the F value at
95% is 6.61.
00:10:46 So comparing our F value to these table values, we can test the model. And we can reject the
null hypothesis and say that we are 99% confident that there is a linear relationship between Y
and X.
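As an illustrative sketch, the table lookup can also be done with scipy; the degrees of freedom (1 and 5) are an assumption corresponding to a bi-variate regression on 7 observations, chosen so that the critical values match those quoted above.

from scipy import stats

f_value = 27.36                            # the F value from the summary report
f_95 = stats.f.ppf(0.95, dfn=1, dfd=5)     # about 6.61
f_99 = stats.f.ppf(0.99, dfn=1, dfd=5)     # about 16.26
print(f_value > f_99)                      # True: reject the null hypothesis at the 99% level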
00:11:06 So, let's go back to the slide. On this slide you can see a summary of the linear regression
output.
00:11:14 We have the output values, a visualization of the output data, and a summary report. And in
the summary report you can see the R squared factor and the F value.
00:11:32 To use the model for predictions, we save the model when it is run, then input new data and
apply the saved model. Applying the model onto new data and predicting the target value is
also called scoring.
00:11:48 So what I'd like to do now is go back into the Predictive Analytics tool and show you how we
do this. So we have a new dataset here this is the "apply" data.
00:12:00 We have the X and the Y value again. So I go back into the "Designer" view,
00:12:08 and what I have to do is select the model that we previously saved. And this is the model here,
which I can drag in.
00:12:19 I need to configure the settings, which I've already done, and then run that model onto the
"apply" data,
and this gives us the output. So we have scored that data and we have predicted values on the
new data.
00:12:47 Multiple linear regression attempts to model the relationship between two or more explanatory
variables and a target variable by fitting a linear equation to the observed data.
00:13:00 Every value of the explanatory variable X is associated with a value of the target variable Y. A
multiple linear regression model is represented as a simple equation:
00:13:14 Y = a + b1 * x1 + b2 * x2 + b3 * x3, and so on, where a is a constant, the b values are weights
that are calculated from the regression,
00:13:32 and the x values are the explanatory variables, and Y is the target.
00:13:39 In this next demonstration, the model target is called "Yield". There are two x explanatory
variables, called "Fertilizer" and "Rainfall".
00:13:52 Okay, let's switch to Predictive Analytics and see an example of multiple linear regression. So
here is the data we're going to use.
00:14:06 There are two x values, the explanatory variables. One's called "Fertilizer" and the other is
called "Rainfall",
00:14:14 and we have a target which is called "Yield". So what I want to do here is use a multiple
regression algorithm.
00:14:23 Again, I scroll down to the regression algorithms and I can choose the "R-Multiple Linear
Regression" algorithm. Now what I need to do, just as before, is configure the settings.
00:14:42 It's a "Trend" output. The independent columns, which are the explanatory variables, are
"Fertilizer" and "Rainfall".
00:14:49 And the "Dependent Column" is "Yield". So I've set those and I now run the model,
00:14:58 and I can see the results. These are the predicted values of "Yield",
00:15:04 and those were the actual values of "Yield". So now you can see how easy it is to run a
multiple linear regression in the Predictive Analytics software.
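As an illustrative sketch, a comparable multiple regression can be fitted with scikit-learn; the Fertilizer, Rainfall, and Yield values below are hypothetical stand-ins for the demo dataset.

import pandas as pd
from sklearn.linear_model import LinearRegression

crops = pd.DataFrame({
    "Fertilizer": [100, 120, 140, 160, 180, 200],   # hypothetical values
    "Rainfall":   [10,  14,  12,  18,  16,  20],
    "Yield":      [40,  48,  50,  62,  64,  72],
})

model = LinearRegression().fit(crops[["Fertilizer", "Rainfall"]], crops["Yield"])
print(model.intercept_, model.coef_)                      # a, b1 (Fertilizer), b2 (Rainfall)
print(model.predict(crops[["Fertilizer", "Rainfall"]]))   # the predicted Yield values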
00:15:26 This shows the multiple linear regression output. You will see a multiple R squared statistic as
well as an F statistic.
00:15:40 Although you have seen this slide before, I want to emphasize the problem of overfitting and
generalization. Model overfitting occurs when a model provides an excellent fit to the data that
it is trained on,
00:15:54 but when the model is applied onto new data, its performance or accuracy is very poor.
Overfitting generally occurs when a model is excessively complex,
00:16:06 having too many parameters relative to the number of observations. The standard approach to
avoiding overfitting is to split the data into an estimation dataset to train or build the model on,

and a validation set to test the model on unseen or hold-out data. Other techniques include cross-validation, where multiple models are run on samples of the data and the models are compared.
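As an illustrative sketch, k-fold cross-validation is readily available in scikit-learn; this reuses the hypothetical crops data from the multiple regression example above.

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

scores = cross_val_score(LinearRegression(), crops[["Fertilizer", "Rainfall"]],
                         crops["Yield"], cv=3, scoring="r2")
print(scores, scores.mean())   # R squared on each held-out fold, and the average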
00:16:39 In decision trees, the results are often pruned to more populated leaf nodes. And here again
you can see the overfitted model at the top left, the underfitted model at the top right,
00:16:56 and then somewhere in between we have to locate our robust model that has low training error
and low test error, where both errors are of a similar magnitude.
00:17:11 So what are the strengths and weaknesses of this approach? It's obviously fairly easy to
understand and it's easy to apply.
00:17:24 However, as with any technique, it can be significantly affected by outliers and can suffer from
overfitting. I have added in some extra supplementary information for you that presents bi-variate regression variations,
00:17:41 polynomial regression, and logistic regression. With this, I'd like to close the third week.
00:17:49 I hope you enjoyed these units and I am happy to get in touch with you in our discussion forum
if you have any content-related questions. Now, I wish you all the best for the Weekly
Assignment and see you next week where we will be continuing the topic of modeling.